A List of the First Things You Should Learn as an Aspiring Data Scientist

Anderson L. Amaral
May 27, 2018

Since my first data-related job, eleven years ago, working for one of the largest statistical agencies in the world, the government-owned Brazilian Institute of Geography and Statistics (IBGE), I have witnessed considerable progress in the sector. At that time, honestly, I had no idea how important data-related tools would become. The Data Analyst job title has given rise to many other “job titles”, from data architect to data engineer and data scientist. With that in mind, in the following lines I have listed what I believe are the first things one should learn as an aspiring data scientist:

  • First of all: absolutely go for MOOCs. The amount of knowledge you can acquire in these courses is amazing: edX, Coursera, Udacity, Udemy, LinkedIn Learning, DataCamp. Then put that knowledge into practice with real open data and, more precisely, with real cases (preferably by publishing studies on data.world, Medium, Kaggle, LinkedIn and so on).
  • Case studies: this is second on the list because I have no doubt that it is one of the most important skills. There is no point in knowing many “clichés” without knowing real case studies, including ML deployment. Also, it is worth remembering that a data scientist’s life is not based on an easy, organized CSV file like the ones on Kaggle, ready for you to apply machine learning models; that is just the least complex part.
  • Getting to know different sorts of data and data issues, such as duplicated rows, missing values, date splits, strings to be converted into numeric (integer or float) data, and so on.
  • How to identify important metrics: R², RMSE, recall, F1-score and many others.
  • The most commonly used programming languages, such as Python, R, Julia and Go, and libraries such as scikit-learn, TensorFlow/Keras, PyTorch, ggplot2, Matplotlib, Seaborn and pandas.
  • An overview of how algorithms work: weak learners, the tendency to overfit or underfit, and a good understanding of the bias-variance trade-off.
  • In some companies (especially low-budget startups) you are still required to know when to use Random Forest, XGBoost, neural nets or SVC, and the advantages of one over the others in a general view (which is inaccurate most of the time). However, this is much less of an issue with the advent of augmented machine learning tools, as I mention below.
  • Nowadays, in 2018, if you know some automated/augmented machine learning tools, such as the ones from Amazon, Google and Microsoft, as well as smaller (but important) ones such as auto-sklearn, H2O, BigML, Weka, DataRobot, MLJAR and PredicSis, you are a step ahead of most data scientists in the market, who use only Python, R and MATLAB.
  • Many statistical analyses are inaccurate. Being able to identify those inaccuracies and “fake” studies is paramount.
  • An understanding of basic statistical ideas such as p-values, confidence intervals, probability theory and time series.
  • Knowledge of data preprocessing tools such as Power BI, Alteryx and Pentaho Data Integration, including, obviously, SQL and NoSQL skills.
  • Project scope identification skills and an understanding of the lifecycle of data science projects in general, as it varies a lot depending on the type of data set, company, budget, staff and so on.
  • An important, if not the most important, skill: being able to communicate results to non-experts and to understand requests from decision-makers. Business acumen is getting more and more important with the arrival of the many data science tools mentioned above. That’s why having a PhD in a STEM field and trying for a data-related position in which business acumen is strictly necessary is very likely a recipe for failure: no matter how good you are technically, you must be able to communicate your ideas properly.
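To make two of the more technical items on the list concrete, here is a minimal sketch in plain Python of the data issues mentioned above (duplicated rows, missing values, date splits, string-to-number conversion) and of some of the metrics (RMSE, R², recall, F1-score). The sample rows, the column names and the choice to fill missing prices with the mean are all assumptions made up for illustration; they are not taken from any real data set.

```python
import math
from datetime import datetime

# Hypothetical raw records, invented for illustration only.
RAW_ROWS = [
    {"id": "1", "date": "2018-05-27", "price": "10.5"},
    {"id": "1", "date": "2018-05-27", "price": "10.5"},  # duplicated row
    {"id": "2", "date": "2018-06-01", "price": ""},      # missing value
    {"id": "3", "date": "2018-06-15", "price": "12"},
]


def clean_rows(rows):
    """Drop exact duplicates, split dates, cast strings, fill missing prices."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # skip rows that were already seen
            continue
        seen.add(key)
        parsed = datetime.strptime(row["date"], "%Y-%m-%d")
        cleaned.append({
            "id": int(row["id"]),
            "year": parsed.year,    # date split into its parts
            "month": parsed.month,
            "day": parsed.day,
            "price": float(row["price"]) if row["price"] else None,
        })
    # Fill missing prices with the mean of the observed ones
    # (one simple choice among many; assumes at least one price is present).
    observed = [r["price"] for r in cleaned if r["price"] is not None]
    mean_price = sum(observed) / len(observed)
    for r in cleaned:
        if r["price"] is None:
            r["price"] = mean_price
    return cleaned


def rmse(y_true, y_pred):
    """Root mean squared error for a regression model."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def recall_and_f1(y_true, y_pred, positive=1):
    """Recall and F1-score for one class of a classification model."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return recall, f1


if __name__ == "__main__":
    for row in clean_rows(RAW_ROWS):
        print(row)
    print(rmse([3.0, 5.0], [2.0, 6.0]))
    print(recall_and_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```

In day-to-day work you would more likely reach for pandas (for example drop_duplicates and fillna) and for the functions in sklearn.metrics, but writing these by hand once is a good way to understand what the libraries are computing.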

And the last one, but no less important: study every single day. Deep reinforcement learning, deep adversarial learning, IoT and many of the hottest data science topics are popping up every day. If I were to summarize the list above, the main skill for an aspiring data scientist would surely be: never stop learning! Being able to learn new tools is essential in this absurdly fast-paced subject.

Anderson L. Amaral

Data Science Consultant, Brazilian Jiu-Jitsu brown belt. Writing for fun!