Can Machine Learning help to forecast COVID-19 infections – Part 2 – Linear Regression

This article works with data as of 25 November 2020.

My last article was about the ability to forecast new Corona infections with the help of Machine Learning. That first article set the scene by introducing the topic and the framework for the project. This article continues the series by presenting the first model: Linear Regression.

Unfortunately, the pandemic is still in full swing and new infections are at an all-time high. We also must deal with new problems such as reinfected patients (Cruickshank, 2020) or mutations of the Corona virus (Callaway, 2020).

Unfortunately, the numbers do not look encouraging either: despite several lockdowns, social distancing and wearing face masks, the number of new infections is still increasing. The virus has now reached around 219 countries, more than 53 million people have been infected and more than 1.3 million people have died (COVID-19 Coronavirus Pandemic, 2020).

Let’s turn our attention to the question for this article:

Can Linear Regression help to forecast COVID-19 new infections?

What is Linear Regression?

Linear Regression is (as the name suggests) a regression model which is widely used by all sorts of professionals in various industries. The term regression originates from Francis Galton in the 19th century. He observed that the heights of descendants of tall ancestors tend to regress down towards a normal average. This is known as regression to the mean (Galton, 1989).

Linear Regression is one of the simplest but also very effective Machine Learning algorithms. In theory it works like this: “Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.” (Department of Statistics, Yale University, 1998)

Regression often refers to problems where you must predict a continuous variable such as weight or height. This is only half the truth, since there are also regression models which can predict categorical variables, for example Logistic Regression, which uses a logistic function.

There are two forms of Linear Regression: Simple Linear Regression, where there is only one input variable (x) to predict the output (y), and Multiple Linear Regression, where we have multiple variables (x1, x2, …) to predict y.
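As a quick sketch (with made-up numbers, not the pandemic data), the two forms differ only in the shape of the input matrix:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple Linear Regression: one input column x
X_simple = np.array([[1.0], [2.0], [3.0], [4.0]])
# Multiple Linear Regression: several input columns x1, x2
X_multi = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])  # follows y = 1 + 2*x exactly

simple = LinearRegression().fit(X_simple, y)
multi = LinearRegression().fit(X_multi, y)

print(simple.intercept_, simple.coef_)  # one coefficient
print(multi.intercept_, multi.coef_)    # one coefficient per variable
```

The same estimator handles both cases; only the number of columns in X changes.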

Can Linear Regression help to forecast COVID-19 infections?

For the first (simple) example we assume that we only have one variable: we select “Days since outbreak global” as our independent variable (X) and confirmed new infections as our dependent variable (Y).

The plot for the worldwide data looks like this:

That looks good; each dot represents one day. The data becomes more spread out – the variance increases over time.

For each Machine Learning algorithm, we start with the model evaluation. To achieve that, we need to split our dataset into a train and a test part. If we hold out one week as our test set, the data looks like this:
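For time-series data the split must be chronological (no shuffling). A minimal sketch of holding out the last week, using illustrative day numbers rather than the real dataset:

```python
import numpy as np

days = np.arange(1, 301)        # days since the global outbreak (illustrative)
infections = 1700 * days        # stand-in for confirmed new infections

# Hold out the last 7 days as the test set, keep the rest for training.
# Crucially, we do NOT shuffle: the test week must lie in the "future".
X_train, X_test = days[:-7], days[-7:]
y_train, y_test = infections[:-7], infections[-7:]
print(len(X_train), len(X_test))  # 293 7
```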

So the algorithm needs to fit a line in order to predict the yellow points as accurately as possible. The line is denoted by y = intercept + coefficient * x. The intercept is the point on the y-axis through which the line passes; the coefficient describes the slope of the line. Applying Python, we find the following:

So the 200th day of the pandemic would be predicted as y = -71,924 + 1,704 * 200 = 268,876 new infections.
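This fit can be sketched with scikit-learn. The intercept and coefficient above come from the article's real dataset; the synthetic data below is only generated around the same line for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: noisy points around y = -71,924 + 1,704 * x
rng = np.random.default_rng(0)
x = np.arange(0, 300).reshape(-1, 1)   # days, as a column vector
y = -71924 + 1704 * x.ravel() + rng.normal(0, 5000, size=300)

model = LinearRegression().fit(x, y)

# Prediction for day 200 via y = intercept + coefficient * x
pred_200 = model.intercept_ + model.coef_[0] * 200
print(round(model.intercept_), round(model.coef_[0]), round(pred_200))
```

`model.predict([[200]])` gives the same value as the manual formula.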

After fitting the line, the chart looks like this:

But stop! What is actually happening when the line is fitted? Let’s dive in to understand it better.

Basically, the fit uses Ordinary Least Squares (OLS) to find the line with the minimum sum of squares, i.e. the sum of the squared differences between the predicted and the actual data (Carvalho, 2020). These differences are called “residuals”, and examples have been marked in the chart above.

What we do now is basically extend the fitted line of the training data into the future, using the values from the test set (days since the global outbreak: 301, 302, 303, …). By doing that, we can compare the test-set points to the predicted data points.

If we compare the test and the predicted values, we already see that the line does not fit well; the data looks more curvilinear. The prediction (red line) is far off the test values (orange line).

To compare the models, the following measures have been used: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and Root Mean Squared Logarithmic Error (RMSLE):
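These metrics are all available in scikit-learn. A sketch on hypothetical test and predicted values (not the article's actual numbers):

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error)

# Hypothetical: one week of actual vs. predicted new infections
y_test = np.array([500_000, 520_000, 540_000, 560_000, 580_000, 600_000, 620_000])
y_pred = np.array([430_000, 432_000, 434_000, 436_000, 438_000, 440_000, 442_000])

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# RMSLE penalizes relative rather than absolute errors, which suits
# exponentially growing quantities like infection counts
rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))
print(round(mae), round(rmse), round(rmsle, 3))
```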

Cool, so I can apply Linear Regression all the time?

We need to make sure that certain conditions (Hariharan, 2020) are satisfied. For our example we introduce a new variable, “Weekday (1-7)”, as an additional x. We are now performing Multiple Linear Regression.

Linearity: The relationship between X and Y must be linear, which we can check by inspecting the error terms (or residuals). Let’s visualize this to see it.

There is no perfect linear relationship: the model tends to underestimate at the beginning of the pandemic and again towards the end.
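This kind of systematic bias is exactly what a residual check exposes. A sketch on purely synthetic, curvilinear data (quadratic growth, as with the infection counts):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic curvilinear data: a straight line cannot follow quadratic growth
x = np.arange(0, 300, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() ** 2

model = LinearRegression().fit(x, y)
residuals = y - model.predict(x)

# A well-specified linear model leaves residuals scattered randomly around 0.
# Here the pattern is systematic: positive at both ends, negative in the middle.
print(residuals[0] > 0, residuals[150] < 0, residuals[-1] > 0)
```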

Normality of the error terms: OLS assumes that the error terms are normally distributed. The easiest way to test this is visually:

Alternatively, we can use the Anderson-Darling test for normality (Macaluso, 2018). The test statistic is above the critical value at our threshold (p = 0.05), so normality is rejected.
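The Anderson-Darling test is available in SciPy. A sketch on deliberately skewed (non-normal) synthetic residuals, mirroring the failed check:

```python
import numpy as np
from scipy import stats

# Synthetic, clearly non-normal "residuals" (right-skewed)
rng = np.random.default_rng(1)
residuals = rng.exponential(scale=1.0, size=500)

result = stats.anderson(residuals, dist='norm')
# Critical value at the 5% significance level
crit_5 = result.critical_values[list(result.significance_level).index(5.0)]
# Statistic above the critical value -> reject normality
print(result.statistic > crit_5)
```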

This means both checks have failed, and we can see that the model is biased towards underestimation. A way to fix this is to apply a nonlinear transformation such as a quadratic function.

Multicollinearity: The independent variables should not be correlated with each other, as this causes problems when interpreting the model. A high VIF (variance inflation factor) can be resolved by centering the variables (subtracting the mean from each value) (Frost, 2017). You can find more information here. In our case we are within acceptable limits for the two variables, but we need to check again with more variables.

Autocorrelation: There should be no autocorrelation of the error terms. It can occur if we miss some information which is being picked up by the model. We can test this with the Durbin-Watson test. It results in a score of 0.3, which means there is positive autocorrelation. We can fix this by adding a lagged variable (Macaluso, 2018). By adding a 7-day lagged version of confirmed new infections, we increase the score to 1.3, which comes close to the region of no autocorrelation.
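Both steps can be sketched with statsmodels and pandas, here on synthetic residuals with built-in positive autocorrelation (not the article's actual ones):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.stattools import durbin_watson

# Synthetic residuals with strong positive autocorrelation (an AR(1) process)
rng = np.random.default_rng(3)
resid = np.zeros(300)
for t in range(1, 300):
    resid[t] = 0.9 * resid[t - 1] + rng.normal()

# ~2 means no autocorrelation, <2 positive, >2 negative
dw = durbin_watson(resid)
print(round(dw, 2))

# A 7-day lagged feature, as used above to reduce the autocorrelation
infections = pd.Series(np.arange(100, 400))
lagged = infections.shift(7)  # first 7 entries become NaN and must be dropped
```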

Homoscedasticity: A rather complicated term to describe that the variance of our error terms should be consistent. We already know that this will be a challenge, since our data is a time series (Carvalho, 2020), and we know that the data is not linear and that the model underestimates most cases.

In the end we receive a very handy statistics report which shows us a lot of information about our model. We now have three variables and a very high R² score, but strong signs of multicollinearity. That makes it very challenging to draw any conclusions from the model.

Also, we find that only one of the variables is below the p-value threshold of 0.05. We could drop the other variables, since a high p-value indicates that changes in the predictor variable are not related to the response variable (Minitab, LLC, 2013).


This article examined whether Linear Regression can forecast new COVID-19 infections. We investigated a simple Linear Regression as well as the conditions which need to be satisfied for Linear Regression. In the end, it can be concluded that a prediction is possible, but there are strong doubts regarding the interpretability of the model.

It would be dangerous to draw conclusions from this model if it were used to predict future data.

Nevertheless, the model showed good performance when using multiple variables as input. An extension would be model fine-tuning in the form of variable elimination, non-linear transformations, variable centering and robustness tests such as cross-validation.

A test with more variables showed a good performance with reservation to all the points mentioned above:

At this point you should have understood the model and its implications. In the next article we will see how Machine Learning can help to tweak the parameters of the model in order to find the best fit: a technique called Gradient Descent.

Until then: Stay healthy!

You can find the full GitHub repo here and the Jupyter Notebook here. Also check out the dashboard, which is updated regularly.


Callaway, E. (2020, September 8). The coronavirus is mutating — does it matter? Retrieved November 14, 2020, from

Carvalho, T. (2020, May 25). OLS Linear Regression Basics with Python’s Scikit-learn. Retrieved November 15, 2020, from

Cruickshank, S. (2020, October 16). Coronavirus reinfection cases: what we know so far – and the vital missing clues. Retrieved November 14, 2020, from

Department of Statistics, Yale University. (1998). Linear Regression. Retrieved November 14, 2020, from

Frost, J. (2017, April 2). Multicollinearity in Regression Analysis: Problems, Detection, and Solutions. Retrieved November 21, 2020, from

Galton, F. (1989). Regression analysis. Statistical Science, pp. 81-86. doi:10.1214/ss/1177012581

Hariharan, S. (2020, January 12). Linear Regression: A complete story. Retrieved November 20, 2020, from

Macaluso, J. (2018, May 27). Testing Linear Regression Assumptions in Python . Retrieved November 24, 2020, from

Minitab, LLC. (2013, July 1). How to Interpret Regression Analysis Results: P-values and Coefficients. (T. M. Blog, Editor) Retrieved November 26, 2020, from

COVID-19 Coronavirus Pandemic. (2020, November 14). Retrieved November 27, 2020, from

* This post reflects the opinion of the author and is not a general opinion of the TDWI. As TDWI, we offer the platform to discuss all topics and viewpoints. *
