The coronavirus has already reached 215 countries. Even remote locations such as islands and ships are affected (Worldometers.info 2020). Over 7.9 million people have been infected, 433,212 have died and 4.1 million have recovered (as of 14 June). Every day, more people are infected with COVID-19. These new infections have a very high impact: governments are tying the tightening or loosening of their lockdown restrictions to this and other metrics.
This article will describe how a data-driven prediction was created and how it is being updated regularly. The result is visualized in a dashboard – I encourage you to read the article first before studying the dashboard. The link will be in the article.
Please note: This article tries out time-series algorithms on data that already exists; there will be no forward prediction. In my opinion, forward predictions should be made by professional epidemiologists.
Another note: This project is far from being finished. Over time I plan to add more features, adjust algorithms, enhance data, etc. These changes will be reflected in the articles.
Previous works and inspiration
This topic has been in the news a lot. I was inspired to do this by a Kaggle challenge: https://www.kaggle.com/c/covid19-global-forecasting-week-1. In this competition, the goal was not to create accurate forecasts but rather to identify the variables which contribute to the outbreak (Kaggle 2020).
In order to gain more knowledge in the field of time series, I decided to create my own prediction project. It uses the following algorithms:
- Linear regression
- Holt Winters (Non-optimized and optimized)
- XGBoost (Non-optimized and optimized)
Many previous prediction models utilized the SIR-Model. This model divides people into (S)usceptible to infection, (I)nfected, or (R)emoved (recovery, immunity, quarantine or death) (Brockmann 2020).
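In its standard form, the SIR model consists of three differential equations (Sasaki 2020):
$$\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad \frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I$$
Here N is the total population, β controls how strongly the disease is transmitted, and γ controls how quickly infected people are removed by death or recovery. Once people have recovered, they are considered immune (in the model environment).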
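To get a feel for how these dynamics play out, here is a minimal sketch that integrates the standard SIR equations with a simple forward-Euler scheme. The names `beta` (transmission rate) and `gamma` (removal rate), the function `simulate_sir`, and all parameter values are assumptions chosen for illustration; they are not taken from any of the cited models.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Integrate the SIR equations with a forward-Euler scheme.

    Returns one (S, I, R) tuple per day, starting at day 0.
    """
    dt = 1.0 / steps_per_day
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt  # S -> I
            new_removals = gamma * i * dt                    # I -> R
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        history.append((s, i, r))
    return history


# Hypothetical parameter values, chosen only for illustration.
history = simulate_sir(population=1_000_000, initial_infected=10,
                       beta=0.3, gamma=0.1, days=160)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print(f"Infections peak on day {peak_day}")
```

The sketch treats everyone who leaves the infected group as permanently removed, which is exactly the immunity assumption questioned in the challenges listed in this section.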
There are challenges with the data.
- It is not clear whether you are immune to the coronavirus once you have had it. The virus is so new that there is no reliable data. The question should not be whether we become immune but rather how long immunity lasts. An article from the BBC states that people infected with other coronaviruses can be re-infected within a year (Gallagher 2020).
- Determining the parameters is another challenge:
  - Some countries have reported new COVID-19 infection numbers that experts have questioned. Furthermore, testing capacity has a strong influence on the number of confirmed infections. Please see the increased number of infected people for France below.
  - On the other hand, recoveries: When does someone count as recovered? This is a classic data problem: if you ask many people, you get many answers. The counting of recoveries is not standardized and can vary from country to country. In Germany, there are strict rules for counting someone as recovered (MDR THÜRINGEN/sar 2020), but what about other countries?
I decided to try out time-series algorithms on the data. Since this approach is based only on the confirmed infections, we should keep in mind that it faces the same challenge: the numbers are highly affected by the amount of testing in each country.
It was very important to me to turn this into an end-to-end machine-learning project. This means that the model and the dashboard are updated regularly. To visualize the project, I used Miro, the next-gen free collaborative whiteboard platform.
The board is available here.
The source data is provided by Johns Hopkins University on GitHub. The repository is updated every day. For this project, the global time series of confirmed infections (CSV) was used. It comes in the following form:
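As a rough illustration of that layout and of the data preparation step (list form, new cases from cumulative numbers), here is a small sketch using only the Python standard library. It assumes the wide layout with one column per date and the identifier columns `Province/State`, `Country/Region`, `Lat` and `Long`. The sample rows are made up and are not real case counts. Summing provinces per country and the helper names `wide_to_long` and `daily_new_cases` are my assumptions, not necessarily what the notebook does.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Tiny made-up sample in the wide layout (one column per date).
# The numbers are illustrative only, not real case counts.
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20
,Country A,0.0,0.0,0,2,5,9
Region 1,Country B,0.0,0.0,1,1,4,4
Region 2,Country B,0.0,0.0,0,3,3,8
"""

ID_COLUMNS = ("Province/State", "Country/Region", "Lat", "Long")


def wide_to_long(csv_text):
    """Return {country: [(date, cumulative_confirmed), ...]} sorted by date."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        country = row["Country/Region"]
        for column, value in row.items():
            if column in ID_COLUMNS:
                continue
            day = datetime.strptime(column, "%m/%d/%y").date()
            totals[country][day] += int(value)  # assumption: sum provinces per country
    return {country: sorted(days.items()) for country, days in totals.items()}


def daily_new_cases(series):
    """Turn a cumulative series into daily new cases (day one keeps its value)."""
    result = []
    previous = 0
    for day, cumulative in series:
        result.append((day, cumulative - previous))
        previous = cumulative
    return result


long_form = wide_to_long(SAMPLE_CSV)
new_cases = {country: daily_new_cases(series) for country, series in long_form.items()}
for country, rows in new_cases.items():
    print(country, [count for _, count in rows])
```

In real data the cumulative numbers are sometimes corrected downwards, which would produce negative daily values; the sketch does not handle that case.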
The transform and modelling part
- Data preparation: Transformation into list (long) form, calculation of daily new cases from the cumulative numbers
- Machine Learning Parameters: Train/Test Split = 7 days from today. This means that the algorithms are trained on the data up to 7 days before today and tested by comparing their predictions with the actual values in the final 7-day time frame.
- Features:
  - Days since outbreak global
  - Days since outbreak in country
  - Date features (Date, Day, Weekday, Week number, Quarter, Month number)
  - Confirmed_lag_7: confirmed new infections 7 days ago
- Test Metrics:
  - Mean Absolute Error (MAE): the average size of the error. It is the mean of the absolute differences between the predicted and actual infections, so it does not consider whether an error is positive or negative (Cbap 2002).
  - Root Mean Squared Error (RMSE): the standard deviation of the residuals (prediction errors), which indicates how widely the residuals are spread. Because the errors are squared, it penalizes large errors more than MAE does and likewise ignores their direction (Cbap 2002).
  - Root Mean Squared Log Error (RMSLE): RMSE calculated on the logarithms of the predicted and actual values. It measures relative rather than absolute error and penalizes underestimates more heavily than overestimates (Cbap 2002).
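The three metrics can be written down in a few lines. The following is a minimal sketch in plain Python, not the code used in the notebook. The RMSLE variant uses log(1 + x), a common convention that also works for days with zero infections, and the example numbers are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rmsle(actual, predicted):
    """RMSE on log(1 + x); assumes all values are non-negative."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted))
        / len(actual)
    )


# Hypothetical values for a 7-day test window.
actual = [100, 120, 90, 110, 130, 95, 105]
predicted = [110, 115, 100, 100, 160, 90, 105]

print(f"MAE:   {mae(actual, predicted):.2f}")
print(f"RMSE:  {rmse(actual, predicted):.2f}")
print(f"RMSLE: {rmsle(actual, predicted):.4f}")
```

A single large miss (130 actual versus 160 predicted) pulls RMSE well above MAE here, which is the reason for looking at more than one metric.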
There are countless discussions on Kaggle and other sites about which metric is the most suitable. It is important to use several metrics in order to evaluate the models.
This framework is known as supervised learning. In supervised learning, we know what we want to predict. In this case, the algorithm tries to predict the new infections from the data it has learned from (the training set). The algorithm uses input values to predict the future value: in this example, the date features, the days since outbreak and the new infections from 7 days ago. The predicted values are then measured against the test set.
The whole framework is updated once per day. The notebook runs in Google Colab, automated via Selenium. The output is an Excel file that is saved in Google Drive. The dashboard pulls the Excel file automatically once per day.
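The supervised setup described above can be sketched as follows: build one row of features per day, hold out the last 7 days as the test set, fit on the rest, and predict the held-out days. This is a simplified illustration, not the notebook's code. The series is synthetic, the helper names are hypothetical, and the fit is a one-feature least-squares line on the 7-day lag rather than a full linear regression on all features.

```python
from datetime import date, timedelta

TEST_DAYS = 7  # hold out the most recent 7 days
LAG = 7        # new infections 7 days earlier (Confirmed_lag_7)


def build_rows(start, new_cases):
    """One feature row per day; the first LAG days have no lag value and are skipped."""
    rows = []
    for offset in range(LAG, len(new_cases)):
        day = start + timedelta(days=offset)
        rows.append({
            "date": day,
            "days_since_outbreak": offset,
            "weekday": day.weekday(),              # Monday = 0
            "week_number": day.isocalendar()[1],
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "confirmed_lag_7": new_cases[offset - LAG],
            "target": new_cases[offset],
        })
    return rows


def split_train_test(rows, test_days=TEST_DAYS):
    """Chronological split: everything before the last test_days rows is training data."""
    return rows[:-test_days], rows[-test_days:]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic series: a linear trend plus a weekly bump. Not real data.
start = date(2020, 3, 1)
new_cases = [10 + 3 * d + (5 if d % 7 == 0 else 0) for d in range(35)]

rows = build_rows(start, new_cases)
train, test = split_train_test(rows)
slope, intercept = fit_line([r["confirmed_lag_7"] for r in train],
                            [r["target"] for r in train])
predictions = [slope * r["confirmed_lag_7"] + intercept for r in test]

for r, p in zip(test, predictions):
    print(r["date"], r["target"], round(p, 1))
```

The split is chronological rather than random, so the model never sees data from the period it is tested on. The synthetic series is constructed so that the lag relation is exactly linear; real infection data will not fit this cleanly.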
Since you read everything until here – The Dashboard 😀
As a framework, I used Tableau Public. It is a very good choice if you would like to share dashboards with others via a web link.
- Selector for Location and Selector for model
- Main section. Each point represents new infections. The example shown (Linear Regression) therefore displays new infections worldwide.
  - The dark blue line shows the training set
  - The light blue line shows a 7-day moving average
  - The green line shows the prediction
  - The orange line shows the test set
- On the right you find the error metrics (MAE, RMSE and RMSLE) for the current country, across all models
- At the bottom, all models are grouped by error metric (MAE, RMSE and RMSLE). Here you can find out which model is the most suitable for the selected country.
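The 7-day moving average shown in the main section smooths out weekly reporting patterns. As a small sketch, assuming a trailing window (the dashboard may be configured differently), it could be computed like this; the function name and the input values are hypothetical.

```python
def moving_average(values, window=7):
    """Trailing moving average; positions without a full window are None."""
    averages = []
    running = 0.0
    for index, value in enumerate(values):
        running += value
        if index >= window:
            running -= values[index - window]  # drop the value that left the window
        averages.append(running / window if index >= window - 1 else None)
    return averages


# Hypothetical daily new infections.
daily = [1, 2, 3, 4, 5, 6, 7, 8]
print(moving_average(daily))
```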
This brings me to the end of my article. Please comment or send me a message if you have any questions. In the upcoming articles, I want to provide more detail on how (and why) the time-series algorithms behave as they do during the prediction.
*This article reflects the opinion of the author and is not a generally held position of TDWI. As TDWI, we provide the platform to discuss all topics and points of view.*
Brockmann, Dirk. 2020. The model. Accessed May 22, 2020. http://rocs.hu-berlin.de/corona/docs/forecast/model/.
Cbap, Akhilendra Singh. 2002. Evaluation Metrics for Regression models- MAE Vs MSE Vs RMSE vs RMSLE. Accessed June 12, 2020. https://akhilendra.com/evaluation-metrics-regression-mae-mse-rmse-rmsle/.
Gallagher, James. 2020. Coronavirus immunity: Can you catch it twice? Edited by BBC News. 28 April. Accessed May 22, 2020. https://www.bbc.com/news/health-52446965.
Kaggle. 2020. COVID19 Global Forecasting (Week 1). March. Accessed May 22, 2020. https://www.kaggle.com/c/covid19-global-forecasting-week-1/overview.
MDR THÜRINGEN/sar. 2020. So zählen wir die Corona-Fälle und Genesenen in Thüringen. 15 May. Accessed May 22, 2020. https://www.mdr.de/thueringen/coronavirus-covid-genesene-zaehlung-100.html.
Sasaki, Kai. 2020. COVID-19 dynamics with SIR model. 11 March. Accessed May 22, 2020. https://www.lewuathe.com/covid-19-dynamics-with-sir-model.html.
Worldometers.info. 2020. COVID-19 Coronavirus Pandemic. 14 June. Accessed June 27, 2020. https://www.worldometers.info/coronavirus/.