
What is Regression in Data Mining? A Deep Dive Into Regression Analysis and its Use in Data Science

Community of Thinkers & Writers
Oct 30, 2020 5:45 AM 6 min read

Immanuel Kant, one of the most influential philosophers, combined the rationalist and empiricist perspectives to advance the way humankind understands the world. Rationalists emphasised the power of the human mind to reach objective truth, whereas empiricists relied on experiments to prove a hypothesis. Kant is known for bringing the two together from a philosophical standpoint. Similarly, econometrics cannot be studied in isolation; it is an integral tool for understanding any discipline with substantial evidence. We have come a long way in making the tool more dynamic and applying it across disciplines.

This series covers a few important concepts of econometrics that remain fundamental to any decision-making process. A sound econometric approach simply reaffirms the belief in science and logic. While economics is the heart, I believe econometrics is the skeleton: it provides a framework for understanding the ideas and theories better.

The application of regression has become integral to modern-day data analytics and to the process of decision-making. To know when to use regression analysis, we must first understand what it does. Here is a simple answer that pops up when you Google its use:

Regression analysis is used when you want to predict a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, then logistic regression should be used.

Don’t worry if any of these terms sound unfamiliar. This article will help you understand their meaning, application and interpretation. 

Data Science is Ideologically Neutral 

Progressive societies hold ideological debates on many issues, and a data-driven framework helps turn those ideas into meaningful policy decisions. To illustrate, there is an ongoing debate about gender pay gaps. A recent study shows that women get paid 34% less than men. But this number is a raw comparison, and it takes further inquiry to understand the factors that actually drive the difference in pay. Although the categories are broadly defined by gender, we know that price-setting in the labour market depends on many other things. Concluding that the difference in pay is due to gender alone on the basis of this figure would be an oversimplification of the problem, and econometricians would not recommend it.

As a neutral party, we want to understand whether gender plays a role in determining an individual’s pay. If it does, the next question is to what extent it affects compensation. The first step is defining the scope of our model. In practice, numerous factors could affect a phenomenon. To perform a robust analysis, we need to restrict our equation to the most prominent factors that would influence compensation.


Interplay of Variables

In this example, data is collected on the education, work experience and gender of the individuals concerned. These are the independent variables. We are interested in the salary received by individuals, which depends on these independent variables. Remember that education, work experience and gender should not be correlated with one another. There are multiple statistical tests that can be used to check for correlations between the variables. In the following graph, there is an apparent correlation between salary and consumption of coffee. An individual’s coffee consumption may well move in the same direction as their salary. However, that does not mean an increase in salary has caused an increase in coffee consumption. Confusing correlation with causation would give a misleading picture, which we want to avoid in our analysis. Correlation does not imply causation.




Similarly, the independent variables in the model must not be correlated with one another. Education and work experience in our data set should ideally not exhibit any statistical relation (movement in the same direction), because this would create noise while estimating the causal effect. As researchers, we are curious to know the causal relationship between the independent variables (education, work experience and gender) and the dependent variable (compensation).
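
One simple way to check this condition is to compute pairwise Pearson correlations between the independent variables and flag any pair that moves too closely together. The sketch below assumes NumPy is available. The data, the extra `age` column, the 0.8 threshold and the helper name `highly_correlated_pairs` are all made up for illustration and do not come from any study.

```python
import numpy as np

# Made-up illustrative data (not from any real survey).
columns = {
    "education": [12, 16, 12, 16, 14, 14],   # years of schooling
    "experience": [2, 2, 8, 8, 5, 5],        # years of work experience
    "age": [24, 25, 31, 32, 28, 27],         # hypothetical extra variable
}

def highly_correlated_pairs(columns, threshold=0.8):
    """Return the name pairs whose absolute Pearson correlation exceeds threshold."""
    names = list(columns)
    # np.corrcoef treats each row as one variable.
    corr = np.corrcoef([columns[name] for name in names])
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                flagged.append((names[i], names[j]))
    return flagged

flagged_pairs = highly_correlated_pairs(columns)
print(flagged_pairs)
```

In this toy data, education and experience are uncorrelated by construction, while age tracks experience closely, so only that pair is flagged. The threshold is a judgment call rather than a fixed rule.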


How to Construct A Viable Model For Analysis

Only when these conditions are satisfied can we use the data. A model is then constructed to find out the extent to which education, work experience and gender affect the compensation received by individuals. The sample also plays a crucial role in determining our results. If you collect data only from a particular region, the results would be limited to that region, and there is a risk of higher standard error if they are generalised. In effect, the scope of the analysis limits how universally the results apply.




Now, we build a regression model with the available data. How does an equation help us get a deeper understanding of the issue at hand? It is the coefficients that help us understand the impact of each variable.

The dependent variable gets its value from the independent ones. Therefore, to put this mathematically,

Compensation Package = Base Value + Education + Experience + Gender

What regression does is assign values that show the impact of each of these variables on the dependent variable (compensation package). So our result would look like:

Compensation package (Y) = β0 + β1 (Education) + β2 (Experience) + β3 (Gender) + Ui (error term)

where β0 is the base value and each of the other β values describes the impact a variable has on Y, i.e. your compensation package.


It is important that we interpret the β values correctly, so as to understand our data and the bigger picture better. Here’s the general way of reading the results:

For a unit change in an independent variable, Y changes by β units, with the other variables held fixed. So, in our example, we can say that, given that education and experience are fixed, a one-unit change in the gender variable changes compensation by β3 units.
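
To make this reading of the coefficients concrete, here is a minimal sketch that fits the equation by ordinary least squares with NumPy. Everything in it is an assumption for illustration: the data is synthetic, generated from made-up coefficients with no noise, and gender is represented by a hypothetical 0/1 indicator. It is not an empirical result about pay.

```python
import numpy as np

# Synthetic data generated from made-up coefficients (an assumption of this
# sketch, not an empirical finding): base 20000, +2500 per year of education,
# +1200 per year of experience, -3000 when the gender indicator is 1.
education = np.array([12, 12, 14, 14, 16, 16, 18, 18], dtype=float)
experience = np.array([2, 6, 3, 8, 1, 10, 4, 7], dtype=float)
gender = np.array([0, 1, 0, 1, 1, 0, 1, 0], dtype=float)  # hypothetical 0/1 coding
compensation = 20000 + 2500 * education + 1200 * experience - 3000 * gender

# Design matrix: a column of ones for the intercept (β0), then the variables.
X = np.column_stack([np.ones_like(education), education, experience, gender])

# Ordinary least squares: choose β to minimise the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, compensation, rcond=None)
b0, b1, b2, b3 = beta

# Two people with identical education and experience who differ only in the
# gender indicator: the gap between their predictions equals β3.
person_a = np.array([1.0, 16.0, 5.0, 0.0])
person_b = np.array([1.0, 16.0, 5.0, 1.0])
gap = person_b @ beta - person_a @ beta

print(beta)
print(gap)
```

Because the synthetic data contains no noise, the fit recovers the coefficients it was generated from. With real data the estimates would carry uncertainty and would need to be read alongside their standard errors.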

To verify this and take a stance, we can plug in the same values for education and experience to see whether people with the same education and experience are paid differently. The only differing input value here is gender. Since we cannot input words like “Male”, “Female” and “Other” in our model, we assign numerical values to them.
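
One common way to assign those values (an option among several, not necessarily the one used for this article) is indicator coding: pick a baseline category and give every other category its own 0/1 column. Putting 0, 1 and 2 in a single column would instead impose an arbitrary ordering on the categories. The function name `encode_gender` and the choice of baseline below are hypothetical.

```python
def encode_gender(values, baseline="Male"):
    """Turn category labels into 0/1 indicator columns, one per non-baseline category.

    The baseline category is represented by all zeros, so each coefficient is
    read as a difference relative to the baseline.
    """
    categories = ["Male", "Female", "Other"]
    others = [c for c in categories if c != baseline]
    rows = []
    for value in values:
        if value not in categories:
            raise ValueError(f"unknown category: {value!r}")
        rows.append(tuple(1 if value == c else 0 for c in others))
    return others, rows

indicator_names, encoded = encode_gender(["Male", "Female", "Other", "Female"])
print(indicator_names)  # ['Female', 'Other']
print(encoded)          # [(0, 0), (1, 0), (0, 1), (1, 0)]
```

With three categories, the single gender term in the equation becomes two indicator terms, each with its own coefficient measured against the baseline.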



If the output Y (compensation package) differs for the same education and experience, then we can say that gender has a role to play in determining an individual’s compensation package. Further, it’s important to note the sign of the coefficient. A + sign denotes a positive relation: as the variable increases by a unit, the Y-value moves in the positive direction, and vice versa for a negative sign. The standard error tells us how wrong the regression model is on average, in the units of the response. Estimation usually works by minimising the errors, to get a model that fits better.
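
As a rough illustration of that last point, the sketch below fits a one-variable regression with NumPy and computes the residual standard error, i.e. the square root of the sum of squared residuals divided by the degrees of freedom. The five data points are made up, and the variable names are assumptions of this sketch.

```python
import numpy as np

# Made-up toy data: years of experience and compensation in arbitrary units.
experience = np.array([1, 2, 3, 4, 5], dtype=float)
compensation = np.array([2, 4, 5, 4, 5], dtype=float)

X = np.column_stack([np.ones_like(experience), experience])
beta, *_ = np.linalg.lstsq(X, compensation, rcond=None)

residuals = compensation - X @ beta           # observed minus fitted values
n_obs, n_params = X.shape
# Residual standard error: sqrt(sum of squared residuals / degrees of freedom).
residual_std_error = np.sqrt(np.sum(residuals ** 2) / (n_obs - n_params))

print(beta)                 # intercept and slope
print(residual_std_error)
```

The result is expressed in the same units as compensation, which is what makes it readable as a typical size of the model's miss.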


What is Logistic Regression?

Moving on to the second part of the Google result above, which reads, “If the dependent variable is dichotomous, then logistic regression should be used.” To understand this, imagine that you’re a bank and want to analyse which applicants are likely to default on loan repayments. This is where the credit score comes into the picture. The credit score is one of the many parameters taken into account to determine an applicant’s creditworthiness. Analysts need a model that simply highlights the likelihood of an individual defaulting. In this model, there are two outcomes, 0 and 1: 0 indicates no default, while 1 indicates a default. This is the dichotomous dependent variable that the result talks about.

A logistic regression model is used when the outcome in the collected data takes binary values such as yes or no. The model tells us how likely an individual is to default under the given parameters. Based on the results, a bank could take policy decisions such as setting a credit score floor. The type of regression model is chosen based on what the researcher wants to attain and interpret. Logistic regression is used to classify items, and is hence often known as a ‘classifier model’.
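
A minimal sketch of the idea, assuming NumPy and a single predictor: the model passes a linear function of the credit score through the logistic (sigmoid) function to get a probability of default between 0 and 1, and a cutoff turns that probability into a 0/1 label. The eight data points, the plain gradient-descent fit, the 0.5 cutoff and the function names are all assumptions for illustration; a real analysis would use many more variables and an established statistics library.

```python
import numpy as np

# Made-up toy data: credit scores and whether the applicant defaulted (1) or not (0).
scores = np.array([520, 560, 600, 640, 680, 720, 760, 800], dtype=float)
defaulted = np.array([1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

# Standardise the score so plain gradient descent behaves well.
mean, std = scores.mean(), scores.std()
z = (scores - mean) / std

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Fit intercept b and slope w by gradient descent on the average log-loss.
b, w = 0.0, 0.0
learning_rate = 0.5
for _ in range(5000):
    p = sigmoid(b + w * z)
    b -= learning_rate * np.mean(p - defaulted)
    w -= learning_rate * np.mean((p - defaulted) * z)

def default_probability(score):
    """Estimated probability of default for a given credit score."""
    return float(sigmoid(b + w * (score - mean) / std))

def classify(score, cutoff=0.5):
    """Label an applicant 1 (likely default) or 0 using a probability cutoff."""
    return int(default_probability(score) >= cutoff)

print(default_probability(540), classify(540))
print(default_probability(780), classify(780))
```

In this toy data the fitted slope is negative, so a higher credit score lowers the estimated probability of default. Moving the cutoff is one way a bank could express a stricter or looser lending policy.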

There are various other models that can be adopted based on the type of data and nature of the study. Data science has spread into other disciplines and progressed significantly in recent times. A conceptual understanding of regression analysis remains the foundation for proceeding with applications of Big Data, Machine Learning and the programming of Artificial Intelligence. In our upcoming series of articles, types of data, methods of regression analysis, different econometric models and their applications will be discussed.


This article was originally published on Econfinity.

