
PSA: You don't need fancy stuff to do good work.

I have been reading numerous posts on r/datascience, and several of them seem to orbit the topic of how to use the latest tool or tweak. I understand that it can be easy to get caught up in the whirlwind of tools, frameworks, and cutting-edge technologies. While these developments can undoubtedly improve our work, it is important to remember that data science is not about using the most advanced or expensive tools; it is about extracting valuable insights from data to drive informed decision-making.

Data Collection and Categorization

Before diving into advanced machine learning algorithms or statistical models, we need to start with the basics: collecting and organizing data. Fortunately, both Python and R offer a wealth of libraries that make it easy to collect data from a variety of sources, including web scraping, APIs, and reading from files. Key libraries in Python include [requests](https://requests.readthedocs.io/en/latest/), [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/), and [pandas](https://pandas.pydata.org/), while R has [httr](https://cran.r-project.org/web/packages/httr/index.html), [rvest](https://rvest.tidyverse.org/), and [dplyr](https://dplyr.tidyverse.org/).

These libraries not only make it easy to collect data but also to clean and structure it for analysis. With just a few lines of code, you can filter, sort, and transform data into a format that is ready for exploration and modeling.
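For instance, a minimal sketch of that collect-and-clean loop in Python might look like this (the URL and the CSS selectors are made-up placeholders, just to show the shape of the workflow):

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch a page; the URL and selectors below are placeholders for illustration
resp = requests.get("https://example.com/products", timeout=10)
resp.raise_for_status()

# Parse the HTML and pull one record per item
soup = BeautifulSoup(resp.text, "html.parser")
records = [
    {
        "name": item.select_one(".name").get_text(strip=True),
        "price": item.select_one(".price").get_text(strip=True),
    }
    for item in soup.select(".product")
]

# A few lines of pandas to clean, filter, and sort the result
df = pd.DataFrame(records)
df["price"] = pd.to_numeric(df["price"].str.lstrip("$"), errors="coerce")
df = df.dropna(subset=["price"]).sort_values("price", ascending=False)
print(df.head())
```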

Data Analysis and Visualization

Once your data is collected and organized, the next step is to analyze and visualize it. Both Python and R excel in this area, providing a wide range of libraries and packages for exploratory data analysis and visualization.

Python’s pandas, [NumPy](https://numpy.org/), and [SciPy](https://scipy.org/) libraries offer powerful functionality for data manipulation, while [matplotlib](https://matplotlib.org/), [seaborn](https://seaborn.pydata.org/), and [plotly](https://plotly.com/) provide versatile tools for creating visualizations. Similarly, in R you can use dplyr, the [tidyverse](https://www.tidyverse.org/), and [data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) for data manipulation, and [ggplot2](https://ggplot2.tidyverse.org/), [lattice](https://cran.r-project.org/web/packages/lattice/index.html), and [shiny](https://shiny.rstudio.com/) for visualization. These packages enable you to create insightful visualizations and perform statistical analyses without relying on expensive or proprietary software.
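To make that concrete, here is a small exploratory pass with pandas, NumPy, and seaborn. It loads one of seaborn's example datasets so nothing depends on your own data; the transform and plot choices are arbitrary, just to show how little code a first look takes:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# load_dataset returns a pandas DataFrame; the example data is fetched and cached on first use
df = sns.load_dataset("iris")

# Quick numeric summaries with pandas
print(df.describe())
print(df.groupby("species")["petal_length"].agg(["mean", "std"]))

# NumPy works directly on DataFrame columns, e.g. a log transform
df["log_sepal_length"] = np.log(df["sepal_length"])

# A pair plot is often enough to see the structure in a small dataset
sns.pairplot(df, hue="species")
plt.savefig("iris_pairplot.png")
```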

Modeling and Prediction

Finally, when it comes to building models and making predictions, Python and R have a plethora of options available. Libraries like [scikit-learn](https://scikit-learn.org), [statsmodels](https://www.statsmodels.org/stable/index.html), and [TensorFlow](https://www.tensorflow.org/) in Python, or [caret](https://topepo.github.io/caret/), [randomForest](https://cran.r-project.org/web/packages/randomForest/randomForest.pdf), and [xgboost](https://xgboost.readthedocs.io/en/stable/) in R, provide powerful machine learning algorithms and statistical models that can be applied to a wide range of problems. What’s more, these libraries are open source and have extensive documentation and community support, making it easy to learn and apply new techniques without needing specialized training or expensive software licenses.
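As a rough sketch, a baseline scikit-learn workflow can be this short (the dataset, model, and split here are arbitrary choices for illustration, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# A bundled toy dataset keeps the example reproducible without any downloads
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a baseline model and check how it does on held-out data
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```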

Simplicity is key; embrace it and you will learn a lot faster than trying to glean insights from some poorly trained AI model.

ps. Any “IDE” more extensive than VIM/EMACS/~~nano~~ are unnecessary 🙂

Comments (25)

  1. Tools are created to solve problems. If you haven’t encountered those problems it isn’t obvious why you should use a (more advanced) tool over an obvious approach.

    People tend to go in different directions:

    * camp A follows the flavour of the month and they’re stuck with a microservices architecture running on managed k8s in the cloud that creates *no* value whatsoever. They’re doing it for the sake of doing it.
    * Camp B wants to do everything in the simplest way possible which means they’re forever doing the same thing be it Excel, notebooks, data warehousing or dashboards (depending on their maturity) even when there’s room to grow in IT/data maturity and create more value.

    Both are bad.

    **Something that I think about is the correlation between big tech and their fancy stuff.** Is the relationship causal in the sense that big tech -> fancy stuff, *or* is it fancy stuff -> big tech?

    You can rule the latter out because managed K8S doesn’t magically bring you to FAANG’s revenue. What we can say however is that big tech isn’t shy of using advanced tools/methods when they’re necessary so the approach of camp B is certainly wrong as well.

    I think it’s all down to realising that you’re in *science* which means that you need to experiment within your domain and also (on a small scale) with your tools.

  2. As a long-time emacs user (I primarily use VSCode now), I think you discredit yourself by arguing that simplicity is key and, oh yeah, the average DS should use VIM or emacs 😀.

    More seriously, I think the part that is missing is investing in domain knowledge. Sometimes a quick analysis is all that is needed and you can use simple tools; sometimes, though, you're part of a much larger org with more complicated needs and you need to invest in serious tooling.

  3. Counterpoint, I know how the other shit works. So I’m reaching out here to figure out how the new shit works.

  4. Really? I thought we could just throw everything into a LLM and it will solve all our problems!

  5. >Any “IDE” more extensive than VIM/EMACS/nano are unnecessary 🙂

    “640K ought to be enough for anybody.”

  6. You need to be on LLMs to survive the coming wave

  7. Reads like a gpt generated post

  8. What’s valuable, and not as easy: a department or company-wide curated *internal* tools library to solve particular problems.

    Command line or API, but with a higher standard than one-offs. This means source code control, builds, releases, code reviews, tests, automated CI if appropriate, documentation and internal help mechanisms.

    This needs to live longer than any one employee’s tenure of course. It needs management support and resources.

  9. Tend to agree. This community has a tendency to gravitate toward high-complexity solutions for low-complexity needs.

  10. W8, I thought that was the fancy stuff :O

  11. Broadly agree, though in case anyone used this as a “what to learn” guide I would swap caret for Tidymodels as it’s a successor that improves on the original in most ways, including ease.

  12. True but you definitely need better than Excel

  13. I will just add: Why are you jumping to a neural network before you even bothered to try a linear regression?!

  14. To piggyback on the IDE comment… it is one of the weirdest things to get stuck on. It's so minuscule to data science as a profession that I'm just confused why you felt the need to throw it in.

  15. httpx > requests

    polars > pandas

  16. I only know y = mx + c . Please give me DS job….

  17. Lol vscode & jupyter-notebooks are very much required.

  18. Data science depends on good data. There's a lot needed to make that happen. For customer modeling you often need a CDP. For understanding customer behavior, you need the clickstream data processed as signals ready for modeling; for big data systems, you need features to be engineered at scale.

    All of these are dependencies on other data solutions doing their job. A lot of the data collection steps you described are positioned as solvable using libraries in Python. For most organizations that's not possible. I agree with a lot of what you're saying, but I want to highlight these precursors for it to be possible.

  19. It can be so frustrating to work with people who want to show off. I get young recent grads wanting to prove themselves, but if a simple tool works better, use the simple tool. Over-fitting is unacceptable, and everyone doing it knows that they are doing it. Furthermore, it is just best practice to keep things simple when possible (and not unrealistically bring down variance).

    When I was young I was so frustrated that more experienced analysts were “holding us back,” and now I realize how tolerant and wise my superiors actually were.

  20. caret? tidymodels or mlr3 these days. Plus torch.

  21. >ps. Any “IDE” more extensive than VIM/EMACS/nano are unnecessary 🙂

    I do all my DS work in assembly, damn kids and their fancy “VIMS”

  22. Totally agree with you, besides one point. One day, I did a job interview with the data head of a company. He was not so fond of doing data processing in Python. For him, problems come when it's necessary to refresh some code: it takes an amount of time related to the length of the code, and sometimes you are stuck for days on a problem. Whereas if you use an ETL, the code review is more obvious and faster (according to him). What do you think of this?
