With a median pay almost twice the average salary in the US, there is a lot of incentive to become a data scientist today. However, there is no data science without math, which creates a dilemma for those who’ve struggled with this subject. How much math does it take to become a data scientist, and just how hard does it get?
In this article, we examine math’s place in data science. We’ll give you an idea of how much math you’ll need to know, the main mathematical concepts applied in this career, their importance, and how you can get better at the math you’ll need as a data scientist.
The short answer is yes, math is important. If you’re interested in a data science career but have no math background, you’ll likely struggle. The good news is that the most fundamental mathematical concepts in data science can be learned, even if you don’t have much math experience.
Still, we wanted to answer two questions for our readers in more detail (Do I need to know math for data science? Is it a requirement for becoming a data scientist?) and provide an overview of some of the most useful mathematical concepts practitioners use.
The bottom line is this: math, or more broadly the quantitative reasoning skills a math background provides, is essential for many day-to-day job tasks.
The amount of math you’ll need to know depends on the type of data science role you want. Some roles only require understanding the fundamentals, while others require using advanced mathematical concepts.
Additionally, you don’t need to be familiar with every subfield of mathematics. Although some applications require knowledge of less common concepts, the mathematical foundations of data science come from three main subfields: statistics and probability, linear algebra, and calculus.
Math shows up in data science in both obvious and less obvious ways. When statistical techniques are used for exploratory data analysis, the role of math is evident. However, when a data scientist uses tools such as NumPy, PyTorch, or Matplotlib, most of the math is done behind the scenes.
When data scientists analyze data, they try to identify the mathematical rules or formulas that govern relationships between data points. These formulas enable them to take model solutions and use them to solve problems in the real world. Math also helps them correctly interpret the analysis output to draw more accurate conclusions.
A good understanding of math also helps when building tools used to collect, process, analyze, and visualize data. The same principles used to create graphs, charts, and other visualizations on paper are used in their software equivalents. Math is also necessary for anyone who wants to modify or optimize data science tools built using mathematical algorithms.
Data science uses a broad range of mathematical concepts, including geometry. However, certain fields in mathematics are considered fundamental for data scientists.
Before we had data science, we had statistics and probability. These two fields gave statisticians and other experts the tools needed to make data-based predictions in the early days, forming the foundation of what we now call data science.
Statistical techniques enable data scientists to properly collect, process, and analyze data before drawing conclusions from it. Probability, in turn, enables them to quantify uncertainty, e.g., when estimating the value of missing information or assessing the likelihood that a result is correct. Some commonly used statistical concepts are:
Example of Statistics in Data Science: You have been given employee data and are required to identify if there is a bias in recruitment based on factors such as age, race, and gender.
This seems like a straightforward problem, but jumping straight to analysis could result in incorrect conclusions. A disproportionate number of women in low-paying positions could conceal a bias against recruiting women for higher-paying positions. Something similar can happen when investigating gender bias in compensation if a few executives earn extremely high salaries or bonuses.
Using statistics to examine the data distribution can help identify outliers that would skew the result. This allows you to account for them when performing further analysis or drawing conclusions.
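To make the outlier point concrete, here is a minimal sketch using Python’s standard `statistics` module. The salary figures are invented for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only reasonable choice.

```python
import statistics

# Hypothetical annual salaries in thousands of dollars (made-up data).
# The last value stands in for a single highly paid executive.
salaries = [48, 52, 55, 58, 60, 61, 63, 67, 70, 950]


def iqr_outliers(values):
    """Return values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(salaries)
mean_all = statistics.mean(salaries)      # pulled upward by the outlier
median_all = statistics.median(salaries)  # barely affected by it
trimmed = [v for v in salaries if v not in outliers]
mean_trimmed = statistics.mean(trimmed)

print("outliers:", outliers)
print("mean:", mean_all, "median:", median_all)
print("mean without outliers:", round(mean_trimmed, 2))
```

In this toy dataset the mean is more than twice the median, which is the kind of gap that suggests checking the distribution before comparing group averages.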
Linear algebra is used to perform many computations simultaneously, also known as array programming. Python libraries such as NumPy are especially good at handling such tasks efficiently. Array programming is very useful considering the large datasets data scientists often work with. Although you can use a tool like NumPy without understanding all the underlying algorithms, understanding them can be helpful when performing advanced operations.
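As a rough illustration of array programming, the sketch below computes the same result twice: once with an explicit Python loop and once with NumPy operating on whole arrays. The prices and quantities are hypothetical values chosen only for the example.

```python
import numpy as np

# Hypothetical order data (made-up values).
prices = np.array([19.99, 5.49, 3.50, 12.00])
quantities = np.array([3, 10, 4, 1])

# Element-by-element version: one multiplication per loop iteration.
totals_loop = [p * q for p, q in zip(prices, quantities)]

# Array version: a single expression multiplies every pair at once.
totals = prices * quantities

# The dot product multiplies and sums in one step, giving total revenue.
revenue = prices @ quantities

print(totals)
print(revenue)
```

On four values the difference is only cosmetic, but on millions of rows the array version is typically much faster because the loop runs in compiled code rather than in the Python interpreter.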
Linear algebra is also used in image recognition applications. In such cases, the image is transformed into a matrix where each element is a number that represents a shade of color. This is how deep neural networks are able to work with photos.
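The idea of an image as a matrix can be sketched in a few lines of NumPy. The 3×3 "image" below is a hypothetical grayscale example where 0 means black and 255 means white. Real images are much larger, and color images usually carry an extra dimension for the color channels.

```python
import numpy as np

# A hypothetical 3x3 grayscale image: each element is one pixel's brightness.
image = np.array(
    [[0, 128, 255],
     [64, 128, 192],
     [255, 128, 0]],
    dtype=np.uint8,
)

# Inverting the image is plain arithmetic on the whole matrix.
inverted = 255 - image

# Models usually expect floating-point inputs, often scaled to the range [0, 1].
normalized = image.astype(np.float32) / 255.0

# Many models take a flat vector, so the matrix is unrolled row by row.
flat = normalized.reshape(-1)

print(inverted)
print(flat.shape)
```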
Additionally, linear algebra is employed to represent data points with two or more dimensions (independent variables) and to reduce the number of dimensions to simplify a problem. Some linear algebra concepts you’ll need to become familiar with are:
Example of Linear Algebra in Data Science: Come up with a model that accurately estimates the relationship between the value of a car in the used car market and factors such as its age, mileage, location, color, etc.
Except in rare cases, e.g., when dealing with limited editions, the expectation is that a car’s value depreciates with age and mileage. Color, location, and other factors can also have some effect on this valuation.
This type of problem is usually solved using linear regression. Linear regression finds the line that best fits the data points, typically by minimizing the sum of squared distances between the observed values and the line’s predictions. It can be done for one independent variable (simple linear regression) or multiple independent variables (multiple linear regression). Linear regression problems are solved using matrices because of the amount of computation needed when dealing with many data points. This example shows how it is done in Python.
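Below is a minimal sketch of multiple linear regression solved with matrices in NumPy. The car data is synthetic: prices are generated from a known formula (an assumption made purely so the fitted coefficients can be checked), and only age and mileage are used as predictors. A real analysis would use noisy data and would need to encode categorical factors such as color or location.

```python
import numpy as np

# Synthetic used-car data (made-up values for illustration).
age = np.array([1, 2, 3, 5, 7, 10], dtype=float)            # years
mileage = np.array([10, 30, 25, 70, 60, 130], dtype=float)  # thousands of miles

# Assumed "true" relationship, used only to generate example prices
# (in thousands of dollars): start at 30, lose 1.5 per year and 0.1 per 1,000 miles.
price = 30.0 - 1.5 * age - 0.1 * mileage

# Design matrix: a column of ones for the intercept, then one column per variable.
X = np.column_stack([np.ones_like(age), age, mileage])

# Least-squares solution of X @ coef ≈ price.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, age_effect, mileage_effect = coef

# Estimate the price of a hypothetical 4-year-old car with 50,000 miles.
predicted = np.array([1.0, 4.0, 50.0]) @ coef

print(coef)
print(predicted)
```

Because the example prices contain no noise, the fit recovers the generating coefficients almost exactly. With real data the coefficients would only be estimates, and checking the residuals would be part of the job.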
Calculus is used by data scientists when optimizing machine learning algorithms. Optimization refers to the process of reducing errors or finding the optimal value of a parameter.
One optimization algorithm widely used in ML is gradient descent. It relies on derivatives to find parameter values that result in the smallest possible error between the actual and predicted values of the dependent variable. Calculus is also used in statistical modeling, where integration is used to find the area underneath a curve. This can be seen when working with probability density functions, where the area under the curve over a range gives the probability of a value falling in that range. Core calculus concepts a data scientist should be familiar with include:
How much calculus a data scientist actually needs to know is difficult to define because most roles don’t require you to perform these operations manually. However, understanding the fundamentals is necessary to know when an algorithm is appropriate and what its limits are.
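As one illustration of integration as the area under a curve, the sketch below approximates the probability that a standard normal variable falls within one standard deviation of its mean. The trapezoid rule and the helper names (`normal_pdf`, `trapezoid_area`) are choices made for this example. In practice a statistics library would compute this directly.

```python
import math


def normal_pdf(x, mean=0.0, std=1.0):
    """Probability density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def trapezoid_area(f, a, b, n=10_000):
    """Approximate the integral of f from a to b using n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


# Area under the standard normal curve between -1 and 1.
prob = trapezoid_area(normal_pdf, -1.0, 1.0)

# Closed-form reference value using the error function.
exact = math.erf(1 / math.sqrt(2))

print(round(prob, 4), round(exact, 4))
```

Both numbers come out near 0.68, the "68%" in the familiar 68-95-99.7 rule for normal distributions.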
Machine learning tools like PyTorch and scikit-learn already have different calculus algorithms built into them, and you can use them without a deep understanding of the underlying algorithms. However, if your work will be on the heavier side of deep learning or involves coming up with new algorithms, research work, or transforming algorithms in research papers into code, you’ll need more than a basic understanding of calculus.
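To show roughly what such tools do under the hood, here is a hand-written gradient descent sketch in plain Python that fits a line by minimizing mean squared error. The data, learning rate, and step count are assumptions chosen so this small example converges. Libraries compute the derivatives automatically and use more refined update rules.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y ≈ w * x + b by repeatedly stepping against the gradient of the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up points that lie exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))
```

The learning rate matters: with a value that is too large, the updates overshoot and the error grows instead of shrinking, which is one example of the limits that a basic grasp of calculus helps you recognize.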
Perceptions of math in data science are highly subjective. Those with a STEM background or a natural inclination toward mathematical reasoning find data science math manageable or even easy, while others find it intimidating, especially at the beginning.
The math fundamentals data scientists regularly use can be challenging, but most people can make sense of them with a little effort. Data science tools also handle the most challenging aspects of the math, so you often just need to know which tools to use, when to use them, and how best to interpret the output from a mathematical point of view. If you can follow along and understand the solution of a linear regression problem from start to finish, you should be okay.
Even if you have struggled with it, you can master the math fundamentals you need in data science. The first and most crucial step is to be open to the possibility that you can become good at these fundamentals.
You should also identify the concepts you struggle with instead of assuming all math is hard. When you know what you find challenging, you can take the necessary steps to improve.
This is also an excellent opportunity to try a different teacher or learning approach. There are many online resources you can use to learn math fundamentals and more for free. The tutors range from professors at top colleges to industry professionals who teach math using real-world examples so you can see how the principles are applied to solve data science problems.
Math is often considered the most intimidating hurdle in data science, but this perception is rarely objective. Although it’s a key part of this field, most roles don’t need a deep understanding of math. Additionally, only specific areas of math are regularly applied in data science, and some have their algorithms already built into easy-to-use tools. Some applications require more math, but these are the exception, not the norm. If you struggled with math in the past, there are also more resources available today to facilitate an easy learning experience.
Interview Query strives to facilitate this easy learning experience for aspiring data scientists. Our statistics and A/B testing and probability learning paths introduce essential math concepts in a data science context. You can also access interview questions on these topics to see which math fundamentals employers are interested in. If you’d like to know more about how math is tested during interviews, we can connect you with industry experts who can coach you in the areas you need to work on before an interview.
Although you may find math intimidating, you now have the knowledge and resources to tackle the math needed to become a data scientist!