Jinming Cao1, Bin Zhao2 *
1School of Information and Mathematics, Yangtze University, Jingzhou, Hubei, China 2School of Science, Hubei University of Technology, Wuhan, Hubei, China
*Corresponding author: Dr. Bin Zhao, School of Science, Hubei University of Technology, Wuhan, Hubei, China Tel./Fax: +86 130 2851 7572
Abstract
Background
Since the novel coronavirus disease (COVID-19) first appeared in Wuhan in December 2019, it has swept the world and become a major threat to public safety. Besides threatening people’s lives, the epidemic has severely damaged the economies of many countries: large numbers of companies have closed, employment has become more difficult, and people’s lives have been greatly affected. For Hubei Province, where COVID-19 first broke out, and the United States, the most severely affected country, we establish time series models to analyze the spread of the novel coronavirus and make short-term forecasts. This will help countries better understand the development trend of the epidemic, prepare better, and intervene and treat in time to prevent further spread of the virus.
Methods
The data collected for Hubei Province cover 20 January 2020 to 28 April 2020 and include cumulative confirmed diagnoses, deaths and cures [1]. We first organize the data in Excel and then use SPSS to build time series models and carry out the statistical analysis. Because there are no missing data, we define the day as the time variable, draw time series graphs and observe the overall pattern of change. We remove the outliers, then use the SPSS Expert Modeler to find the best-fitting model for each dependent series automatically, and forecast by designating the independent variable and setting the width of the confidence interval to 95%. ACF and PACF graphs of the residuals and the Q test are used to determine whether the residuals are a white noise sequence and hence whether the model is appropriate. The Holt model is used for the cumulative confirmed diagnoses [2] in Hubei Province, and the ARIMA (1,2,0) model is used for cumulative cures and deaths [3] in Hubei Province. Because the outbreak in the United States began later than in China, we collect U.S. data from 29 February 2020 to 28 April 2020 [4], which also include cumulative confirmed diagnoses, deaths and cures. The ARIMA (2,2,6) model is used for cumulative diagnoses in the U.S., the ARIMA (0,2,0) model for cumulative deaths in the U.S., and the ARIMA (0,2,1) model for cumulative cures in the U.S.
Findings
In our modeling, the time series plots of the real and the fitted data almost overlap, so the Holt and ARIMA models we use fit the data very well. Comparing the predicted values with the real values for the same period, we find that the epidemic in Hubei Province had basically ended after May, whereas the epidemic in the United States became more severe after May; the Holt and ARIMA models are therefore also appropriate for short-term prediction of the epidemic.
Interpretation
Because the Chinese government has always put the safety of people’s lives first, it decisively locked down the cities of Hubei Province when the epidemic broke out. Following the principle that when one place is in trouble, help comes from all sides, the whole country concentrated its resources on Hubei Province, accepting the economic cost in order to save more lives. We can now clearly see that the epidemic has been brought under control in China and that the whole country is developing in a good direction. In contrast, the epidemic in the United States has been moving in a bad direction because of the government's lack of control, unwillingness to sacrifice the economy, premature return to work, and failure to call on people to wear masks.
Keywords: COVID-19; Time series analysis; Holt model; ARIMA model
1. Introduction
As a serious respiratory infectious disease, COVID-19 [5] has been spreading around the world since January 2020, seriously threatening people’s lives and daily routines. It spread in China from December 2019 and was basically under control there by May 2020. During this period, however, COVID-19 had a great impact on people’s lives and on national economic development. Coronaviruses are a large family of viruses known to cause illnesses ranging from the common cold to more serious diseases such as Middle East Respiratory Syndrome (MERS) and Severe Acute Respiratory Syndrome (SARS). The virus that causes COVID-19 is a new strain of coronavirus that had not been found in humans before.
Common signs of COVID-19 infection include respiratory symptoms, fever, cough, shortness of breath and difficulty breathing. In more severe cases, infection can cause pneumonia, severe acute respiratory syndrome, kidney failure, and even death. Unfortunately, there is currently no specific treatment for COVID-19. However, many symptoms can be managed, so treatment is based on the patient's clinical condition. In addition, supportive care for infected persons may be very effective.
In the global fight against the epidemic, we can see that China has responded most effectively. It controlled the spread of the novel coronavirus efficiently, providing other countries with valuable experience and a model for fighting the epidemic. By building COVID-19 spread models for China and the United States, we can clearly see how different policies and measures lead to different results in controlling the epidemic. This is the significance of our models. Through accurate data and rational analysis, we can inform research on the future development trend of the epidemic and on control strategies for the worldwide fight against it [6].
2. Methods
2.1 Data
The data for Hubei Province come from the official platform of the Health Commission of Hubei Province, starting from 20 January 2020. The Hubei Province data collected in this paper include cumulative deaths, cumulative cures and cumulative diagnoses from 20 January 2020 to 28 April 2020. The data for the United States come from a domestic (Chinese) data platform, the Real-time Big Data Report of the Epidemic, and include cumulative deaths, cumulative cures and cumulative diagnoses from 29 February 2020 to 28 April 2020.
2.2 The model
Using the collected data, we conduct a time series analysis of the novel coronavirus [7-9]. Because there are no missing data, we import the data into SPSS, define the day as the time variable, remove the outliers, and draw time series graphs. The Expert Modeler then automatically finds the best-fitting model for each series: cumulative deaths, cumulative cures and cumulative diagnoses.
2.3 TS Model-based method for estimation in Hubei Province
Based on the data for the cumulative number of diagnoses in Hubei Province, we use the Expert Modeler to process the data and obtain a Holt model to describe it. The corresponding equation set ① is shown below.
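For reference, Holt's linear trend method in its standard textbook form, written with the symbols of Table 1, reads as follows. This is the generic recursion and not a reproduction of the SPSS output; the fitted values of $\alpha$ and $\beta$ are not reported here, and reading $m$ as the forecast horizon is an assumption.

```latex
\begin{aligned}
S_t &= \alpha X_t + (1-\alpha)\,(S_{t-1} + T_{t-1}), \\
T_t &= \beta\,(S_t - S_{t-1}) + (1-\beta)\,T_{t-1}, \\
\hat{X}_t(m) &= S_t + m\,T_t .
\end{aligned}
```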
The mathematical symbols used above are listed in Table 1.
Table 1: Mathematical symbols used in equation set ①.
Symbol | Meaning
$t$ | Current period
$X_t$ | Actual observation in period t
$S_t$ | Estimated level at period t
$T_t$ | Estimated trend at period t
$\hat{X}_t(m)$ | Forecast made at period t for m periods ahead
$\alpha$ | Level smoothing parameter
$\beta$ | Trend smoothing parameter
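The recursion behind a Holt model can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS Expert Modeler's implementation: the function name `holt_forecast`, the initialisation (first observation as level, first difference as trend) and the parameter values in the usage line are assumptions chosen for the example.

```python
def holt_forecast(x, alpha, beta, horizon):
    """Holt's linear trend smoothing; returns forecasts for 1..horizon steps ahead."""
    if len(x) < 2:
        raise ValueError("need at least two observations")
    # Assumed initialisation: level = first observation, trend = first difference.
    level, trend = float(x[0]), float(x[1] - x[0])
    for obs in x[1:]:
        prev_level = level
        # Level update: blend the new observation with the previous one-step forecast.
        level = alpha * obs + (1 - alpha) * (prev_level + trend)
        # Trend update: blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # m-step-ahead forecast: last level plus m times the last trend.
    return [level + m * trend for m in range(1, horizon + 1)]


# Usage with made-up numbers (not the Hubei data): a perfectly linear series
# is extrapolated along the same line.
print(holt_forecast([10, 12, 14, 16, 18], alpha=0.5, beta=0.5, horizon=3))
```

Because the forecast is the last level plus a multiple of the last trend, successive forecasts from a Holt model always lie on a straight line.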
To check that the selected Holt model correctly describes the cumulative number of diagnoses, we test whether its residuals are white noise [10].
As the ACF and PACF graphs of the residuals in Figure 1 show, the autocorrelation and partial autocorrelation coefficients at all lags are not significantly different from 0. Figure 2 also shows that the P value of the Q test on the residuals is 1; that is, we cannot reject the null hypothesis, which is consistent with the residuals being a white noise sequence.
Figure 1: ACF and PACF graphs of the residuals for the cumulative number of diagnoses in Hubei Province
Figure 2: Q test on the residuals for the cumulative number of diagnoses in Hubei Province
From the analysis above, we can conclude that the Holt model can describe the cumulative number of diagnoses well.
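The residual check described above can be sketched as follows, assuming the Q test is the Ljung-Box statistic (the text does not name the variant). The function names `acf` and `ljung_box` are hypothetical, the p-value uses the closed-form chi-square tail that exists for an even number of lags, and the degrees of freedom are not reduced for estimated parameters; the input series is made up.

```python
import math


def acf(residuals, max_lag):
    """Sample autocorrelations r_1..r_max_lag of a residual series."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(1, max_lag + 1)]


def ljung_box(residuals, max_lag):
    """Ljung-Box Q statistic and its p-value against chi-square(max_lag).

    Assumption: max_lag is even, so the chi-square tail has a closed form
    and no external library is needed.
    """
    if max_lag % 2 != 0:
        raise ValueError("this sketch only handles an even max_lag")
    n = len(residuals)
    r = acf(residuals, max_lag)
    q = n * (n + 2) * sum(r[k - 1] ** 2 / (n - k) for k in range(1, max_lag + 1))
    half = q / 2
    # Survival function of a chi-square variable with max_lag degrees of freedom.
    p = math.exp(-half) * sum(half ** j / math.factorial(j)
                              for j in range(max_lag // 2))
    return q, p


# Made-up residuals that alternate in sign, i.e. clearly not white noise.
resid = [1, -1] * 10
q_stat, p_value = ljung_box(resid, max_lag=2)
bound = 1.96 / math.sqrt(len(resid))  # approximate 95% band for each ACF value
print(q_stat, p_value, [abs(v) > bound for v in acf(resid, 2)])
```

A large p-value, together with autocorrelations inside the band, is what the white noise conclusion rests on; a small p-value, as in this made-up example, would argue against the model.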
Similarly, the Expert Modeler finds that the cumulative number of cures follows the ARIMA (1, 2, 0) model. The related equations are written as follows.
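Written with the symbols of Table 2, a general ARIMA $(p, 2, q)$ model and its ARIMA $(1, 2, 0)$ special case take the standard form below. The white noise term $\varepsilon_t$ and the coefficient names $\alpha_i$ and $\theta_i$ follow the usual textbook convention and are an assumption here; the fitted coefficient values are not reported.

```latex
\Bigl(1 - \sum_{i=1}^{p} \alpha_i L^i\Bigr)(1 - L)^2 X_t
  = \Bigl(1 + \sum_{i=1}^{q} \theta_i L^i\Bigr)\varepsilon_t
\qquad\Longrightarrow\qquad
(1 - \alpha_1 L)(1 - L)^2 X_t = \varepsilon_t \quad (p = 1,\ q = 0).
```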
We again test the residuals of this model for white noise. As the ACF and PACF graphs of the residuals in Figure 3 show, the autocorrelation and partial autocorrelation coefficients at all lags are not significantly different from 0. Therefore, the ARIMA (1, 2, 0) model describes the cumulative number of cured people well.
Figure 3: ACF and PACF graphs of the residuals for the cumulative number of cures in Hubei Province
Table 2: Mathematical symbols used in equations ➁ ➂.
Symbol | Meaning
$p$ | Number of autoregressive terms
$q$ | Number of moving average terms
$L$ | Lag operator
$1 - \sum_{i=1}^{p} \alpha_i L^i$ | AR(p) model
$1 + \sum_{i=1}^{q} \theta_i L^i$ | MA(q) model
$(1 - L)^2$ | 2nd order difference
$X_t$ | Number of people on day t
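To make the ARIMA (1, 2, 0) structure concrete, the sketch below forecasts a cumulative series by differencing twice, propagating the AR(1) term on the differenced series, and integrating back. It is a minimal illustration rather than the SPSS procedure: the function name `arima_120_forecast` is hypothetical, the coefficient `alpha1` is supplied by hand instead of being estimated, and the input numbers are made up.

```python
def arima_120_forecast(x, alpha1, horizon):
    """Point forecasts from (1 - alpha1*L)(1 - L)^2 X_t = e_t, future e_t set to 0."""
    if len(x) < 3:
        raise ValueError("need at least three observations")
    history = [float(v) for v in x]
    # Last observed second difference: w_n = X_n - 2*X_{n-1} + X_{n-2}.
    w = history[-1] - 2 * history[-2] + history[-3]
    forecasts = []
    for _ in range(horizon):
        # AR(1) step on the twice-differenced series.
        w = alpha1 * w
        # Undo the second difference to get back to the cumulative scale.
        nxt = 2 * history[-1] - history[-2] + w
        history.append(nxt)  # each forecast is fed back in as if it were observed
        forecasts.append(nxt)
    return forecasts


# Usage with made-up numbers (not the Hubei data).
print(arima_120_forecast([1, 4, 9, 16, 25], alpha1=0.5, horizon=2))
```

With `alpha1 = 0` the function reduces to the ARIMA (0, 2, 0) case, a straight-line extrapolation through the last two observations.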
Like the cumulative number of people cured, the cumulative number of deaths in Hubei Province also follows the ARIMA (1, 2, 0) model. The corresponding equations and symbol meanings are the same as for equations ➁. Next, we test the residuals for white noise.
As the ACF and PACF graphs of the residuals in Figure 4 show, the autocorrelation and partial autocorrelation coefficients at all lags are not significantly different from 0. The ARIMA (1, 2, 0) model therefore also describes the cumulative death toll well.
Figure 4: ACF and PACF graphs of the residuals for the cumulative number of deaths in Hubei Province
2.4 TS Model-based method for estimation in the United States
Based on the U.S. data, we use the Expert Modeler to process the data and find that the cumulative numbers of diagnoses, deaths and cures all follow ARIMA models. However, the parameter settings of the three models are not identical. The cumulative number of diagnoses in the United States follows the ARIMA (2, 2, 6) model. The corresponding equations have the same form as equations ➁.
Next, we test the residuals of the model for white noise. As the ACF and PACF graphs of the residuals in Figure 5 show, the autocorrelation and partial autocorrelation coefficients at all lags are not significantly different from 0. Figure 6 also shows that the P value of the Q test on the residuals is 0.304; that is, we cannot reject the null hypothesis and regard the residuals as a white noise sequence.
Therefore, the ARIMA (2, 2, 6) model describes the cumulative number of diagnoses well.
Figure 5: ACF and PACF graphs of the residuals for the cumulative number of diagnoses in the U.S.
Figure 6: Q test on the residuals for the cumulative number of diagnoses in the U.S.
The cumulative number of cures follows the ARIMA (0, 2, 0) model, which is equivalent to a 2nd order difference equation. The related equation set is similar to equations ➁.
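With $p = q = 0$, the ARIMA (0, 2, 0) model reduces to a pure second-order difference. Assuming a white noise term $\varepsilon_t$ and no constant term (neither is stated in the text), it can be written out as follows; setting future noise to zero gives a straight-line forecast through the last two observations $X_{n-1}$ and $X_n$.

```latex
(1 - L)^2 X_t = X_t - 2X_{t-1} + X_{t-2} = \varepsilon_t
\quad\Longrightarrow\quad
\hat{X}_n(m) = X_n + m\,(X_n - X_{n-1}), \qquad m = 1, 2, \dots
```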
As before, we test the residuals for white noise. As the ACF and PACF graphs of the residuals in Figure 7 show, the autocorrelation and partial autocorrelation coefficients at all lags are not significantly different from 0.
Figure 7: ACF and PACF graphs of the residuals for the cumulative number of cures in the U.S.
Finally, we use the Expert Modeler to process the cumulative deaths in the U.S. and find that they follow the ARIMA (0, 2, 1) model. The related equation set again has the same form as equations ➁. We then test the residuals for white noise. As the ACF and PACF graphs of the residuals in Figure 8 show, the autocorrelation and partial autocorrelation coefficients at all lags are not significantly different from 0.
Figure 8: ACF and PACF graphs of the residuals for the cumulative number of deaths in the U.S.
3. Results
3.1 The TS Model-based method results in Hubei Province
We set the width of the confidence interval to 95% and use the Holt and ARIMA models to fit and predict the cumulative numbers of diagnoses, cures and deaths in Hubei Province, respectively. The results are shown in the following figures.
Figure 9: The cumulative number of diagnoses in Hubei Province
Figure 10: The cumulative number of cures in Hubei Province
Figure 11: The cumulative number of deaths in Hubei Province
Figures 9, 10 and 11 show that the time series plots of the real data and the fitted data almost overlap, so the Holt model and the ARIMA model fit the original data very well.
At the same time, after 28 April the epidemic in Hubei Province had been brought under control, and the cumulative number of diagnoses is not expected to increase dramatically. The Holt model and the ARIMA model can also predict the cumulative diagnoses, cumulative cures and cumulative deaths well. Table 3 lists the predicted cumulative numbers of diagnoses, cures and deaths for Days 101 to 105, obtained using equation ② and the fits shown in Figures 9-11.
Table 3: Short-term predicted values in Hubei Province
Day | Cumulative diagnoses | Cumulative cures | Cumulative deaths
101 | 68140 | 63616 | 4512
102 | 68152 | 63616 | 4512
103 | 68164 | 63616 | 4512
104 | 68176 | 63616 | 4512
105 | 68188 | 63616 | 4512
3.2 The TS Model-based method results in the United States
In our analysis of the U.S. epidemic, we also set the width of the confidence interval to 95%. The results of fitting and predicting the cumulative diagnoses, cumulative cures and cumulative deaths with the ARIMA models are shown in the following figures.
Figure 12: The cumulative number of diagnoses in the U.S.
Figure 13: The cumulative number of cures in the U.S.
Figure 14: The cumulative number of deaths in the U.S.
Figures 12, 13 and 14 show that the time series plots of the observed data and the fitted data almost overlap, so the ARIMA models fit the original data very well.
At the same time, after 28 April the cumulative numbers of diagnoses, deaths and cures are predicted to continue increasing substantially in the short term, which is related to the policies adopted by the U.S. government to combat the epidemic. Not only has the epidemic not been brought under control, but the situation has become more severe. This shows that the ARIMA model can also predict the cumulative diagnoses, cumulative deaths and cumulative cures.
The predicted cumulative numbers of diagnoses, cures and deaths for Days 61 to 65 are listed in Table 4.
Table 4: Short-term predicted values in the U.S.
Day | Cumulative diagnoses | Cumulative cures | Cumulative deaths
61 | 1049197 | 151006 | 60312
62 | 1051063 | 161873 | 62006
63 | 1080926 | 172741 | 63722
64 | 1122025 | 183609 | 65458
65 | 1169266 | 194477 | 67217
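A quick way to read Table 4 is to look at the day-to-day increments of each forecast column. The sketch below, with a hypothetical helper `daily_increments`, computes them from the tabulated values; it only restates the table and makes no new prediction.

```python
def daily_increments(values):
    """First differences of a cumulative series."""
    return [b - a for a, b in zip(values, values[1:])]


# Forecasts for Days 61 to 65, copied from Table 4.
table4 = {
    "diagnoses": [1049197, 1051063, 1080926, 1122025, 1169266],
    "cures": [151006, 161873, 172741, 183609, 194477],
    "deaths": [60312, 62006, 63722, 65458, 67217],
}

for name, column in table4.items():
    print(name, daily_increments(column))
```

The cures column grows by an almost constant amount each day, which matches the straight-line forecast that a pure second-order difference model produces, while the increments for diagnoses and deaths keep rising over the five days.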
4. Discussion
When a virus breaks out, people live in fear, worrying about their own safety, the safety of their families, and the safety of the country.
We use SPSS [11] to obtain the models we need, namely the Holt model and the ARIMA model, use these models to fit the series, and estimate the model parameters from the series values. Finally, we test the residuals of each model for white noise to check whether the model is applicable.
By analyzing COVID-19 with time series methods, we obtain the development process, direction and trend of the epidemic, and we also predict how the epidemic might develop in the future. This gives some guidance, for example on which measures should be taken to intervene in the development of the epidemic in order to save more lives.
Time series can be used not only for the analysis of infectious diseases but also in many other disciplines, such as measurement [12] and economics [13]. Both the order and the magnitude of the data in a time series carry information about the objective world and its changes, and represent dynamic processes. The main purpose of time series analysis is therefore to understand the dynamic system under consideration, predict future events, and control future events through intervention [14,15].
5. Limitation
When establishing the models, we treat data with large fluctuations as outliers. In fact, there are many more complex models that can capture these outliers. In addition, the ARIMA model is only suitable for short-term prediction: beyond a certain horizon the predicted value no longer changes, owing to the nature of the model. To work around this, we can treat the predicted values as observed values, which makes long-term prediction possible, but the gap between the predictions and the truly observed data may grow larger and larger.
Because the epidemic is only predicted in the short term, the analysis charts for the United States show the cumulative numbers of people cured, dead and diagnosed all increasing. In practice, however, these numbers should eventually reach stable values.
6. Conflict of interest: We have no conflict of interests to disclose, and the manuscript has been read and approved by all named authors.
7. Acknowledgement: This work was supported by the Philosophical and Social Sciences Research Project of Hubei Education Department (19Y049), and the Starting Research Foundation for the Ph.D. of Hubei University of Technology (BSQD2019054), Hubei Province, China.
References
- Health Commission of Hubei Province [Accessed 08 Feb 2020].
- https://voice.baidu.com/act/newpneumonia/newpneumonia/?city=America-America
- https://baike.baidu.com/item/Holt%E6%A8%A1%E5%9E%8B/22192556?fr=aladdin
- https://baike.baidu.com/item/ARIMA%E6%A8%A1%E5%9E%8B/10611682?fr=aladdin
- Baidu (2019) 2019 new coronavirus.
- Ladi Wang (2005) Study on infectious disease model and control strategy [M]. Beijing: China Science and Technology Press.
- Xuan Zhou (2019) Epidemiological characteristic and time series analysis of hand foot and mouth disease in Wuzhou city from 2014 to 2018 [D]. Guangxi: Guangxi Medical University.
- Guirong Yu, Jianhua Zhang (2005) The application of time series analysis in the study of epidemic disease [N]. Journal of Liaoning University.
- Jonathan D Cryer (2009) Time Series Analysis with Applications in R [M]. Beijing: Machinery Industry Press.
- https://baike.baidu.com/item/%E7%99%BD%E5%99%AA%E9%9F%B3/10280741?fr=aladdin
- https://baike.baidu.com/item/spss/2351375?fr=aladdin
- Tonghe Sun (2013) An Application of Time Series Analysis and Its Application in Measurement [J]. Surveying and Spatial Geographic Information 36(003): 12-13.
- Lan Gu (1994) Application of time series analysis in economy [M]. Beijing: China Statistics Press.
- Yongdao Zhou (2015) Time series analysis and application [M]. Beijing: Higher Education Press.
- Shuyuan He (2007) Applied time series analysis [M]. Beijing: Peking University Press.