Main

COVID-19, caused by SARS-CoV-2, was detected in Wuhan in December 20196. The high population density, together with increased social activities before the Chinese New Year, catalysed the outbreak; the spread of the outbreak was expedited by massive human movement during the Chunyun holiday travel season from 10 January 20203. Shortly after the confirmation of human-to-human transmission, the Chinese authorities implemented an unprecedented cordon sanitaire of Wuhan on 23 January to contain the geographical spread of the disease, followed by a series of non-pharmaceutical interventions—including suspension of all intra- and inter-city transportation, compulsory mask wearing in public places, cancellation of social gatherings and the home quarantine of individuals with presumed infections, those with COVID-19 related symptoms and their close contacts1—to reduce virus transmission. From 2 February, a strict stay-at-home policy for all residents, and the centralized isolation and quarantine of all patients, individuals suspected to have contracted the virus and their close contacts were implemented to stop household and community transmission. In addition, a city-wide door-to-door universal survey of symptoms was carried out during 17–19 February by designated community workers, to identify previously undetected symptomatic cases. These interventions—together with improved medical resources and the redeployment of healthcare personnel from all over the country—have crushed the epidemic curve and reduced the attack rate in Wuhan, with the potential to shed light on global efforts to control outbreaks of COVID-191.

Recent studies have revealed important transmission features of COVID-19, including the infectiousness of asymptomatic7,8,9,10 and presymptomatic2,11,12 individuals. Furthermore, the number of ascertained cases was much smaller than that estimated using international cases exported from Wuhan before the travel suspension3,13,14, which implies a substantial number of unascertained cases. Using reported cases from 375 cities in China, a previous modelling study concluded that a sizeable number of unascertained cases—despite having lower transmissibility—had facilitated the rapid spreading of COVID-1915. In addition, accounting for unascertained cases has refined the estimation of case fatality risk of COVID-1916. Modelling both ascertained and unascertained cases is important for interpreting transmission dynamics and epidemic trajectories.

On the basis of comprehensive epidemiological data from Wuhan1, we delineated the full dynamics of COVID-19 in the epicentre by extending the susceptible–exposed–infectious–recovered (SEIR) model to include presymptomatic infectiousness (P), unascertained cases (A) and case isolation in the hospital (H), generating a model that we name SAPHIRE (Fig. 1, Methods, Extended Data Tables 1, 2). We modelled the outbreak from 1 January 2020 across 5 time periods that were defined on the basis of key events and interventions: 1–9 January (before Chunyun), 10–22 January (Chunyun), 23 January to 1 February (cordon sanitaire), 2–16 February (centralized isolation and quarantine) and 17 February to 8 March (community screening). We assumed a constant population size of 10 million with equal numbers of daily inbound and outbound travellers (500,000 before Chunyun, 800,000 during Chunyun and 0 after cordon sanitaire)3. Furthermore, we assumed that the transmission rate and ascertainment rate did not change in the first two periods (because few interventions were implemented before 23 January), whereas these rates were allowed to vary in later periods to reflect the strengths of different interventions. We estimated these rates across periods by Markov Chain Monte Carlo (MCMC) and further converted the transmission rate into the effective reproduction number (Re) (Methods).

Fig. 1: Illustration of the SAPHIRE model.
figure 1

We extended the classic SEIR model to include seven compartments: susceptible (S), exposed (E), presymptomatic infectious (P), ascertained infectious (I), unascertained infectious (A), isolation in hospital (H) and removed (R). a, Relationship between different compartments. Two parameters of interest are r (ascertainment rate) and b (transmission rate), which are assumed to vary across time periods. b, Schematic disease course of symptomatic individuals. In this model, the unascertained compartment A includes asymptomatic and some mildly symptomatic individuals who were not detected. Although there is no presymptomatic phase for asymptomatic individuals, we treated asymptomatic as a special case of mildly symptomatic and modelled both with a ‘presymptomatic’ phase for simplicity.

We first simulated epidemic curves with two periods to validate our parameter estimation procedure (Methods, Extended Data Fig. 1). Our method could accurately estimate Re and the ascertainment rates when the model was correctly specified, and was robust to misspecification of the duration from the onset of symptoms to isolation and of the relative transmissibility of unascertained versus ascertained cases. As expected, estimates of Re were positively correlated with the specified latent and infectious periods, and the estimated ascertainment rates were positively correlated with the specified ascertainment rate in the initial state.

Using confirmed cases exported from Wuhan to Singapore (Extended Data Table 3), we conservatively estimated the ascertainment rate during the early outbreak in Wuhan to be 0.23 (95% confidence interval 0.14–0.42; unless specified otherwise, all parenthetical ranges refer to the 95% credible interval) (Methods). We then fit the daily incidences in Wuhan from 1 January to 29 February, assuming the initial ascertainment rate was 0.23, and predicted the trend from 1 March to 8 March (Methods). Our model fit the observed data well, except for the outlier on 1 February; this outlier might be due to the approximate-date records of many patients admitted to the field hospitals set up after 1 February (Fig. 2a). After a series of multifaceted public health interventions, Re decreased from 3.54 (3.40–3.67) and 3.32 (3.19–3.44) in the first two periods to 1.18 (1.11–1.25), 0.51 (0.47–0.54) and 0.28 (0.23–0.33) in the later three periods (Fig. 2b, Extended Data Tables 4, 5). We estimated the cumulative number of infections, including unascertained cases, up until 8 March to be 258,728 (204,783–320,145) if the trend of the fourth period was assumed (Fig. 2c), 818,724 (599,111–1,096,850) if the trend of the third period was assumed (Fig. 2d) or 6,302,694 (6,275,508–6,327,520) if the trend of the second period was assumed (Fig. 2e), in comparison to the estimated total infections of 249,187 (198,412–307,062) obtained by fitting data from all 5 periods (Fig. 2a). Correspondingly, these numbers translate into a 3.7%, 69.6% and 96.0% reduction of infections by the measures taken in the fifth period, the fourth and the fifth periods combined, and the last three periods combined, respectively.

Fig. 2: Modelling the COVID-19 epidemic in Wuhan.
figure 2

Parameters were estimated by fitting data from 1 January to 29 February. a, Prediction using parameters from period 5 (17 February–29 February). b, Distribution of Re estimates from 10,000 MCMC samples. In each violin plot, the white dot represents the median, the thick bar represents the interquartile range and the thin bar represents the minimum and the maximum. The mean and the 95% credible interval (in parentheses) are labelled below or above. c, Prediction using parameters from period 4 (2 February–16 February). d, Prediction using parameters from period 3 (23 January to 1 February). e, Prediction using parameters from period 2 (10 January to 22 January). The shaded areas in a, ce are 95% credible intervals, and the coloured points are the mean values based on 10,000 MCMC samples. f, Estimated number of active infectious cases in Wuhan from 1 January to 8 March.

We estimated low ascertainment rates throughout: 0.15 (0.13–0.17) for the first two periods, and 0.14 (0.11–0.17), 0.10 (0.08–0.12), and 0.16 (0.13–0.21) for the remaining three periods (Extended Data Table 6). Even with the universal screening of the community for symptoms that was implemented from 17 February to 19 February, the ascertainment rate was raised only to 0.16. On the basis of the fitted model using data from 1 January to 29 February, we projected the cumulative number of ascertained cases to be 32,577 (30,216–34,986) by 8 March, close to the reported number of 32,583. This was equivalent to an overall ascertainment rate of 0.13 (0.11–0.16) given the estimated total infections of 249,187 (198,412–307,062). The model also projected that the number of daily active infections (including presymptomatic, ascertained and unascertained infections) peaked at 55,879 (43,582–69,571) on 2 February and dropped afterwards to 701 (436–1,043) on 8 March (Fig. 2f). If the trend remained unchanged, the number of ascertained infections would have first become zero on 27 March (95% credible interval 20 March to 5 April), and the clearance of all infections would have occurred on 21 April (8 April to 12 May) (Extended Data Table 7). The first day of zero ascertained cases in Wuhan was reported on 18 March, indicating enhanced interventions in March.

We used stochastic simulations to investigate the implications of unascertained cases for continuing surveillance and interventions17 (Methods). Because of latent, presymptomatic and unascertained cases, the source of infection would not be completely cleared shortly after the first day of zero ascertained cases. We found that if control measures were lifted 14 days after the first day of zero ascertained cases, the probability of resurgence, defined as the number of active ascertained cases greater than 100, could be as high as 0.97, and the surge was predicted to occur on day 34 (27–47) after lifting controls (Fig. 3). If we were to impose a more-stringent criterion of lifting controls after observing no ascertained cases in a consecutive period of 14 days, the probability of resurgence would drop to 0.32, with possible resurgence delayed to day 42 (33–55) after lifting controls (Fig. 3). These results highlight the risk of ignoring unascertained cases in switching intervention strategies, despite our use of a simplified model.

Fig. 3: Risk of resurgence after lifting controls.
figure 3

We consider the main model (M) and the sensitivity analysis (S8) (Methods). In model M, we assume the initial ascertainment rate r0 = 0.23, and thus an overall ascertainment rate of 0.13. In S8, we assume no unascertained cases initially and thus an overall ascertainment rate of 0.47. For each model, we simulated epidemic curves on the basis of 10,000 sets of parameters from MCMC, and set the transmission rate (b), ascertainment rate (r) and population movement (n) to their values in the first period after lifting controls. Resurgence was defined as reaching over 100 active ascertained infections. a, Illustration of a simulated curve under the main model, with control measures lifted 14 days after the first day of no ascertained cases. The inset is an enlarged plot from 16 March to 28 May. b, Probability of resurgence if control measures were lifted t days after the first day of no ascertained cases, or after observing zero ascertained cases for t days consecutively. c, Expectation of time to resurgence, conditional on the occurrence of resurgence. We grouped the final 10 days (t = 21 to 30) to calculate the expected time to resurgence because of their low probability of resurgence. Key applies to both b and c.

We performed a series of sensitivity analyses to test the robustness of our results by smoothing the outlier data point on 1 February, as well as varying the lengths of latent and infectious periods, the duration from the onset of symptoms to isolation, the ratio of transmissibility in unascertained versus ascertained cases, and the initial ascertainment rate (Extended Data Tables 47, Supplementary Information). Our major findings, of a marked decrease in Re after interventions and the existence of a substantial number of unascertained cases, were robust. Consistent with simulations, the estimated ascertainment rates were positively correlated with the specified initial ascertainment rate. When we specified the initial ascertainment rate as 0.14 or 0.42, the estimated overall ascertainment rate was 0.08 (0.07–0.10) and 0.23 (0.16–0.28), respectively. If we assume an extreme scenario with no unascertained cases in the early outbreak (which we term model ‘S8’ (Supplementary Information)), the estimated ascertainment rate would be 0.47 (0.39–0.58) overall, which would represent an upper bound of the ascertainment rate. Because of the higher ascertainment rate (compared to the main analysis) in this model, we estimated a lower probability of resurgence (0.06) when lifting controls after 14 days of no ascertained cases, and the resurgence was expected to occur on day 38 (29–52) after lifting controls (Fig. 3). A simplified model that assumes complete ascertainment at any time performed substantially worse than the full model (Extended Data Table 4, Supplementary Information).

Understanding the proportion of unascertained cases and their transmissibility is critical for the prioritization of the surveillance and control measures17. Our finding of a large fraction of unascertained cases—despite the high level of surveillance in Wuhan—indicates the existence of many asymptomatic or mildly symptomatic individuals. It was previously estimated that asymptomatic individuals accounted for 18% of the infections on board the Diamond Princess Cruise ship8 and 31% of the infected Japanese individuals who were evacuated from Wuhan9. In addition, in a cohort of 210 women admitted for delivery between 22 March and 4 April in New York City (USA), 29 of 33 (88%) pregnant women infected with SARS-CoV-2 were asymptomatic10. Several reports have also highlighted the difficulty of detecting cases of COVID-19: the detection capacity varied from 11% in low-surveillance countries to 40% in high-surveillance countries18,19, and the modelling of epidemics outside of Wuhan has suggested that the ascertainment rate was 24.4% in China (excluding Hubei province)14 and 14% in Wuhan before the travel ban15. Consistent with these studies and emerging serological studies that show that seroprevalence is much higher than the reported case prevalence in cities and countries worldwide20,21,22, our analyses of data from Wuhan indicated an overall ascertainment rate between 8% and 23% (Extended Data Table 6, excluding the extreme scenario of model S8).

Our Re estimate of 3.54 (3.40–3.67) before any interventions is at the higher end of the range of the estimated R0 values of other studies that used early epidemic data from Wuhan6,23. This discrepancy might be due to the modelling of unascertained cases, more-complete case records in our analysis and/or to the different time periods analysed. If we modelled from the first case of COVID-19 reported in Wuhan, we would estimate a lower Re of 3.38 (3.28–3.48) before interventions (Extended Data Fig. 2), which remains much higher than those of SARS and MERS4,5.

Our modelling study has delineated the full-spectrum dynamics of the COVID-19 outbreak in Wuhan, and highlighted two key features of the outbreak: high covertness and high transmissibility. These two features have synergistically propelled the COVID-19 pandemic, and imposed considerable challenges to attempts to control the outbreak. However, the Wuhan case study demonstrates the effectiveness of vigorous and multifaceted containment efforts. In particular, despite the relatively low ascertainment rates (owing to mild or absent symptoms of many infected individuals), the outbreak was controlled by interventions such as wearing face masks, social distancing and quarantining close contacts1, which block transmission that stems from unascertained cases.

Given the limitations of our model as discussed below, further investigations—such as a survey of the seroprevalence of SARS-CoV-2-specific antibodies—are needed to confirm our estimates. First, owing to the delay in laboratory tests, we might have missed some cases and therefore underestimated the ascertainment rate (especially for the last period). Second, we excluded clinically diagnosed cases without laboratory confirmation to reduce false-positive diagnoses; however, this leads to an underestimation of ascertainment rates—especially for the third and fourth periods, during which many clinically diagnosed cases were reported1. The variation in the estimated ascertainment rates across periods reflects a combined effect of the evolving surveillance, interventions, medical resources and case definitions across time periods1,24. Third, our model assumes homogeneous transmission within the population and ignores heterogeneity between groups by sex, age, geographical region and socioeconomic status25. Furthermore, individual variation in infectiousness—such as superspreading events26—is known to result in a higher probability of stochastic extinction given a fixed population Re (ref. 27). We might therefore have overestimated the probability of resurgence. Finally, we could not evaluate the effect of individual interventions on the basis of an epidemic curve from a single city, because many interventions were applied simultaneously. Future work that models heterogeneous transmission between different groups, and joint analysis with data from other cities, will provide deeper insights into the effectiveness of different control strategies28,29.

Methods

Data of cases of COVID-19 in Wuhan

We analysed the daily incidence data of COVID-19, presented in figure 1 of ref. 1. In brief, information on cases of COVID-19 from 8 December 2019 to 8 March 2020 were extracted from the municipal Notifiable Disease Report System on 9 March 2020. The date of the onset of symptoms (the self-reported date of developing symptoms, such as a fever, cough or other respiratory symptoms) and the date of confirmed diagnosis were collected. For the consistency of case definition throughout the periods, we included only 32,583 individuals who had a laboratory-confirmed positive test for SARS-CoV-2 by the real-time reverse-transcription polymerase-chain-reaction (RT–PCR) assay or high-throughput sequencing of nasal and pharyngeal swab specimens. SAS software (version 9.4) was used in data collection.

Estimation of initial ascertainment rate using cases exported to Singapore

As of 10 May 2020, a total of 24 confirmed cases of COVID-19 in Singapore were reported to be imported from China, among which 16 were imported from Wuhan before the cordon sanitaire on 23 January; the first case arrived in Singapore on 18 January (Extended Data Table 3). Based on VariFlight Data (https://data.variflight.com/en/), the total number of passengers who travelled from Wuhan to Singapore between 18 January and 23 January 2020 was 2,722. Therefore, the infection rate among these passengers was 0.59% (95% confidence interval 0.30–0.88%). These individuals had an onset of symptoms between 21 January and 30 January 2020. In Wuhan, a total of 12,433 confirmed cases involved individuals who were reported to have experienced an onset of symptoms in the same period—equivalent to a cumulative infection rate of 0.124% (95% confidence interval 0.122–0.126%), assuming a population size of 10 million for Wuhan. By further assuming complete ascertainment of early cases in Singapore (which is well-known for its high level of surveillance18,19), the ascertainment rate during the early outbreak in Wuhan was estimated to be 0.23 (95% confidence interval 0.14–0.42), corresponding to 0.77 (95% confidence interval 0.58–0.86) of the infections being unascertained. This represents a conservative estimate for two reasons: (1) the assumption of perfect ascertainment in Singapore ignored potential asymptomatic individuals;8,9 and (2) the number of imported cases in which individuals experienced symptom onset between 21 January and 30 January was underestimated owing to the suspension of flights after lockdown in Wuhan. Without direct information to estimate the initial ascertainment rate before 1 January 2020, we used these results based on Singapore data to set the initial value and the prior distribution of ascertainment rates in our model, and performed sensitivity analyses under various assumptions.

The SAPHIRE model

We extended the classic SEIR model to a SAPHIRE model (Fig. 1, Extended Data Table 1), which incorporates three additional compartments to account for presymptomatic infectious individuals (P), unascertained cases (A) and cases isolated in the hospital (H). We chose to analyse data from 1 January 2020, when the Huanan Seafood Market was disinfected, and thus did not model the zoonotic force of infection3. We assumed a constant population size (N) = 10,000,000, with equal numbers of daily inbound and outbound travellers (n), in which n = 500,000 for 1–9 January, 800,000 for 10–22 January (owing to Chunyun) and 0 after the cordon sanitaire from 23 January3. We divided the population into susceptible (S), exposed (E), P, A, ascertained infectious (I), H and removed (R) individuals. We introduced compartment H because ascertained cases would have a shorter effective infectious period owing to isolation, especially when medical resources were improved1. We use italicized letters to denote the number of individuals in each corresponding compartment. The dynamics of these compartments across time (t) are described by the following set of ordinary differential equations:

$$\frac{{\rm{d}}S}{{\rm{d}}t}=n-\frac{bS(\alpha P+\alpha A+I)}{N}-\frac{nS}{N}$$
(1)
$$\frac{{\rm{d}}E}{{\rm{d}}t}=\frac{bS(\alpha P+\alpha A+I)}{N}-\frac{E}{{D}_{{\rm{e}}}}-\frac{nE}{N}$$
(2)
$$\frac{{\rm{d}}P}{{\rm{d}}t}=\frac{E}{{D}_{{\rm{e}}}}-\frac{P}{{D}_{{\rm{p}}}}-\frac{nP}{N}$$
(3)
$$\frac{{\rm{d}}A}{{\rm{d}}t}=\frac{(1-r)P}{{D}_{{\rm{p}}}}-\frac{A}{{D}_{{\rm{i}}}}-\frac{nA}{N}$$
(4)
$$\frac{{\rm{d}}I}{{\rm{d}}t}=\frac{rP}{{D}_{{\rm{p}}}}-\frac{I}{{D}_{{\rm{i}}}}-\frac{I}{{D}_{{\rm{q}}}}$$
(5)
$$\frac{{\rm{d}}H}{{\rm{d}}t}=\frac{I}{{D}_{{\rm{q}}}}-\frac{H}{{D}_{{\rm{h}}}}$$
(6)
$$\frac{{\rm{d}}R}{{\rm{d}}t}=\frac{A+I}{{D}_{{\rm{i}}}}+\frac{H}{{D}_{{\rm{h}}}}-\frac{nR}{N}$$
(7)

in which b is the transmission rate for ascertained cases (defined as the number of individuals that an ascertained case can infect per day); α is the ratio of the transmission rate of unascertained cases to that of ascertained cases; r is ascertainment rate; De is the latent period; Dp is the presymptomatic infectious period; Di is the symptomatic infectious period; Dq is the duration from illness onset to isolation; and Dh is the isolation period in hospital. Re could be computed as

$${R}_{{\rm{e}}}=\alpha b{\left({D}_{{\rm{p}}}^{-1}+\frac{n}{N}\right)}^{-1}+(1-r)\alpha b{\left({D}_{{\rm{i}}}^{-1}+\frac{n}{N}\right)}^{-1}+rb{({D}_{{\rm{i}}}^{-1}+{D}_{{\rm{q}}}^{-1})}^{-1}$$
(8)

in which the three terms represent infections contributed by presymptomatic individuals, unascertained cases and ascertained cases, respectively. We adjusted the infectious periods of each type of case by taking population movement \(\left(\frac{n}{N}\right)\) and isolation (\({D}_{{\rm{q}}}^{-1}\)) into account.

Parameter settings and initial states

Parameter settings for the main analysis are summarized in Extended Data Table 2. We set α = 0.55 according to ref. 15, assuming lower transmissibility for unascertained cases. Compartment P contains both ascertained and unascertained cases in the presymptomatic phase. We set the transmissibility of P to be the same as unascertained cases, because it has previously been reported that the majority of cases are unascertained15. We assumed an incubation period of 5.2 days and a presymptomatic infectious period of Dp = 2.3 days2,6. Thus, the latent period was De = 5.2 − 2.3 = 2.9 days. Because presymptomatic infectiousness was estimated to account for 44% of the total infections from ascertained cases2, we set the mean of total infectious period as \(({D}_{{\rm{p}}}+{D}_{{\rm{i}}})=\frac{{D}_{{\rm{p}}}}{0.44}=5.2\) days, assuming constant infectiousness across the presymptomatic and symptomatic phases of ascertained cases12—thus, the mean symptomatic infectious period was Di = 2.9 days. We set a long isolation period of Dh = 30 days, but this parameter has no effect on our fitting procedure and the final parameter estimates. The duration from the onset of symptoms to isolation was estimated to be Dq = 21, 15, 10, 6 and 3 days as the median time length from onset to confirmed diagnosis in period 1–5, respectively1.

On the basis of the settings above, we specified the initial state of the model on 31 December 2019 (Extended Data Table 1). The initial number of ascertained symptomatic cases I(0) was specified as the number of ascertained cases in which individuals experienced symptom onset during 29–31 December 2019. We assumed the initial ascertainment rate was r0, and thus the initial number of unascertained cases was \(A(0)={r}_{0}^{-1}(1-{r}_{0})I(0)\). We denoted PI(0) and EI(0) as the numbers of ascertained cases in which individuals experienced symptom onset during 1–2 January 2020 and 3–5 January 2020, respectively. Then, the initial numbers of exposed and presymptomatic individuals were set as \(E(0)={r}_{0}^{-1}{E}_{{\rm{I}}}(0)\) and \(P(0)={r}_{0}^{-1}{P}_{{\rm{I}}}(0)\), respectively. We assumed r0 = 0.23 in our main analysis, on the basis of the point estimate using the Singapore data (described in ‘Estimation of initial ascertainment rate using cases exported to Singapore’).

Estimation of parameters in the SAPHIRE model

Considering the time-varying strength of control measures, we assumed b = b12 and r = r12 for the first two periods, b = b3 and r = r3 for period 3, b = b4 and r = r4 for period 4, and b = b5 and r = r5 for period 5. We assumed that the observed number of ascertained cases in which individuals experienced symptom onset on day d—denoted as xd—follows a Poisson distribution with rate \({\lambda }_{d}=r{P}_{d-1}{D}_{{\rm{p}}}^{-1}\), in which Pd −1 is the expected number of presymptomatic individuals on day (d − 1). We fit the observed data from 1 January to 29 February (d = 1, 2, …, D, and D = 60) and used the fitted model to predict the trend from 1 March to 8 March. Thus, the likelihood function is

$$L({b}_{12},{b}_{3},{b}_{4},{b}_{5},{r}_{12},{r}_{3},{r}_{4},{r}_{5})=\mathop{\prod }\limits_{d=1}^{D}\frac{{{\rm{e}}}^{-{\lambda }_{d}}{\lambda }_{d}^{{x}_{d}}}{{x}_{d}!}$$
(9)

We estimated b12, b3, b4, b5, r12, r3, r4 and r5 by MCMC with the delayed rejection adaptive metropolis algorithm implemented in the R package BayesianTools (version 0.1.7)30. We used a non-informative flat prior of Unif(0,2) for b12, b3, b4 and b5. For r12, we used an informative prior of Beta(7.3,24.6) by matching the first two moments of the estimate using Singapore data (described in ‘Estimation of initial ascertainment rate using cases exported to Singapore’). We reparameterized r3, r4 and r5 by

$${\rm{logit}}({r}_{3})={\rm{logit}}({r}_{12})+{\delta }_{3}$$
$${\rm{logit}}({r}_{4})={\rm{logit}}({r}_{3})+{\delta }_{4}$$
$${\rm{logit}}({r}_{5})={\rm{logit}}({r}_{4})+{\delta }_{5}$$

in which \({\rm{logit}}(r)=\,\log \left(\frac{r}{1-r}\right)\). In the MCMC, we sampled δ3, δ4 and δ5 from the prior of N(0,1). We set a burn-in period of 40,000 iterations and continued to run 100,000 iterations with a sampling step size of 10 iterations. We repeated MCMC with three different sets of initial values and assessed the convergence by the trace plot and the multivariate Gelman–Rubin diagnostic31 (Supplementary Information). Estimates of parameters were presented as posterior means and 95% credible intervals from 10,000 MCMC samples. All of the analyses were performed in R (version 3.6.2) and the Gelman–Rubin diagnostic was calculated using the gelman.diag function in the R package coda (version 0.19.3).

Stochastic simulations

We used stochastic simulations to obtain the 95% credible interval of a fitted or predicted epidemic curve. Given a set of parameter values from MCMC, we performed the following multinomial random sampling:

$$({U}_{{\rm{S}}\to {\rm{E}}},{U}_{{\rm{S}}\to {\rm{O}}},{U}_{{\rm{S}}\to {\rm{S}}}) \sim {\rm{Multinomial}}({S}_{t-1};{p}_{{\rm{S}}\to {\rm{E}}},{p}_{{\rm{O}}},1-{p}_{{\rm{S}}\to {\rm{E}}}-{p}_{{\rm{O}}})$$

\(({U}_{{\rm{E}}\to {\rm{P}}},{U}_{{\rm{E}}\to {\rm{O}}},{U}_{{\rm{E}}\to {\rm{E}}}) \sim {\rm{Multinomial}}({E}_{t-1};{p}_{{\rm{E}}\to {\rm{P}}},{p}_{{\rm{O}}},1-{p}_{{\rm{E}}\to {\rm{P}}}-{p}_{{\rm{O}}})\)

$$({U}_{{\rm{P}}\to {\rm{I}}},{U}_{{\rm{P}}\to {\rm{A}}},{U}_{{\rm{P}}\to {\rm{O}}},{U}_{{\rm{P}}\to {\rm{P}}}) \sim {\rm{Multinomial}}({P}_{t-1};{p}_{{\rm{P}}\to {\rm{I}}},{p}_{{\rm{P}}\to {\rm{A}}},{p}_{{\rm{O}}},1-{p}_{{\rm{P}}\to {\rm{I}}}-{p}_{{\rm{P}}\to {\rm{A}}}-{p}_{{\rm{O}}})$$
$$({U}_{{\rm{I}}\to {\rm{H}}},{U}_{{\rm{I}}\to {\rm{R}}},{U}_{{\rm{I}}\to {\rm{I}}}) \sim {\rm{Multinomial}}({I}_{t-1};{p}_{{\rm{I}}\to {\rm{H}}},{p}_{{\rm{I}}\to {\rm{R}}},1-{p}_{{\rm{I}}\to {\rm{H}}}-{p}_{{\rm{I}}\to {\rm{R}}})$$
$$({U}_{{\rm{A}}\to {\rm{R}}},{U}_{{\rm{A}}\to {\rm{O}}},{U}_{{\rm{A}}\to {\rm{A}}}) \sim {\rm{Multinomial}}({A}_{t-1};{p}_{{\rm{A}}\to {\rm{R}}},{p}_{{\rm{O}}},1-{p}_{{\rm{A}}\to {\rm{R}}}-{p}_{{\rm{O}}})$$
$$({U}_{{\rm{H}}\to {\rm{R}}},{U}_{{\rm{H}}\to {\rm{H}}}) \sim {\rm{Multinomial}}({H}_{t-1};{p}_{{\rm{H}}\to {\rm{R}}},1-{p}_{{\rm{H}}\to {\rm{R}}})$$
$$({U}_{{\rm{R}}\to {\rm{O}}},{U}_{{\rm{R}}\to {\rm{R}}}) \sim {\rm{Multinomial}}({R}_{t-1};{p}_{{\rm{O}}},1-{p}_{{\rm{O}}})$$

in which O denotes the status of outflow population, \({p}_{{\rm{O}}}=n{N}^{-1}\) denotes the outflow probability and other quantities are status transition probabilities, including\(\,{p}_{{\rm{S}}\to {\rm{E}}}=b(\alpha {P}_{t-1}+\alpha {A}_{t-1}+{I}_{t-1}){N}^{-1}\), \({p}_{{\rm{E}}\to {\rm{P}}}={D}_{{\rm{e}}}^{-1}\), \({p}_{{\rm{P}}\to {\rm{I}}}=r{D}_{{\rm{p}}}^{-1}\), \({p}_{{\rm{P}}\to {\rm{A}}}=(1-r){D}_{{\rm{p}}}^{-1}\), \({p}_{{\rm{I}}\to {\rm{H}}}={D}_{{\rm{q}}}^{-1}\), \({p}_{{\rm{I}}\to {\rm{R}}}={p}_{{\rm{A}}\to {\rm{R}}}={D}_{{\rm{i}}}^{-1}\)  and \(\,{p}_{{\rm{H}}\to {\rm{R}}}={D}_{{\rm{h}}}^{-1}\). The SAPHIRE model described by equations (1)–(7) is equivalent to the following stochastic dynamics:

$${S}_{t}-{S}_{t-1}=n-{U}_{{\rm{S}}\to {\rm{E}}}-{U}_{{\rm{S}}\to {\rm{O}}}$$
(10)
$${E}_{t}-{E}_{t-1}={U}_{{\rm{S}}\to {\rm{E}}}-{U}_{{\rm{E}}\to {\rm{P}}}-{U}_{{\rm{E}}\to {\rm{O}}}$$
(11)
$${P}_{t}-{P}_{t-1}={U}_{{\rm{E}}\to {\rm{P}}}-{U}_{{\rm{P}}\to {\rm{A}}}-{U}_{{\rm{P}}\to {\rm{I}}}-{U}_{{\rm{P}}\to {\rm{O}}}$$
(12)
$${A}_{t}-{A}_{t-1}={U}_{{\rm{P}}\to {\rm{A}}}-{U}_{{\rm{A}}\to {\rm{R}}}-{U}_{{\rm{A}}\to {\rm{O}}}$$
(13)
$${I}_{t}-{I}_{t-1}={U}_{{\rm{P}}\to {\rm{I}}}-{U}_{{\rm{I}}\to {\rm{H}}}-{U}_{{\rm{I}}\to {\rm{R}}}$$
(14)
$${H}_{t}-{H}_{t-1}={U}_{{\rm{I}}\to {\rm{H}}}-{U}_{{\rm{H}}\to {\rm{R}}}$$
(15)
$${R}_{t}-{R}_{t-1}={U}_{{\rm{A}}\to {\rm{R}}}+{U}_{{\rm{I}}\to {\rm{R}}}+{U}_{{\rm{H}}\to {\rm{R}}}-{U}_{{\rm{R}}\to {\rm{O}}}$$
(16)

We repeated the stochastic simulations for all 10,000 sets of parameter values sampled by MCMC to construct the 95% credible interval of the epidemic curve by the 2.5 and 97.5 percentiles at each time point.

Prediction of epidemic ending date and the risk of resurgence

Using the stochastic simulations described in ‘Stochastic simulations’, we predicted the first day of no new ascertained cases and the date of clearance of all active infections in Wuhan, assuming continuation of the same control measures as the last period (that is, same parameter values).

We also evaluated the risk of outbreak resurgence after lifting control measures. We considered lifting all controls (1) at t days after the first day of zero ascertained cases, or (2) after a consecutive period of t days with no ascertained cases. After lifting controls, we set the transmission rate b, ascertainment rate r and population movement n to be the same as the first period, and continued the stochastic simulation to the stationary state. Time to resurgence was defined as the number of days from lifting controls to when the number of active ascertained cases (I) reached 100. We performed 10,000 simulations with 10,000 sets of parameter values sampled from MCMC (as described in ‘Estimation of parameters in the SAPHIRE model’). We calculated the probability of resurgence as the proportion of simulations in which resurgence occurred, as well as the time to resurgence conditional on the occurrence of resurgence.

Simulation study for method validation

To validate the method, we performed two-period stochastic simulations (equations (10) to (16)) with transmission rate b = b1 = 1.27, ascertainment rate r = r1 = 0.2, daily population movement n = 500,000, and duration from illness onset to isolation Dq = 20 days for the first period (so that Re = 3.5 according to equation (8)), and b = b2 = 0.41, r = r2 = 0.4, n = 0 and Dq = 5 for the second period (so that Re = 1.2 according to equation (8)). Lengths of both periods were set to 15 days, and the initial ascertainment rate was set to r0 = 0.3, and the other parameters and initial states were set as those in our main analysis (Extended Data Tables 1, 2). We repeated stochastic simulations 100 times to generate 100 datasets. For each dataset, we applied our MCMC method to estimate b1, b2, r1 and r2, and set all other parameters and initial values to be the same as the true values. We translated b1 and b2 into (Re)1 and (Re)2 according to equation (8), and focused on evaluating the estimates of (Re)1, (Re)2, r1 and r2. We also tested the robustness to misspecification of the latent period De, presymptomatic infectious period Dp, symptomatic infectious period Di, duration from illness onset to isolation Dq, ratio of transmissibility between unascertained and ascertained cases α, and initial ascertainment rate r0. In each test, we changed the specified value of a parameter (or initial state) to be 20% lower or higher than its true value, and kept all other parameters unchanged. When we changed the value of r0, we adjusted the initial states A(0), P(0) and E(0) according to Extended Data Table 1.

For each simulated dataset, we ran the MCMC method with 20,000 burn-in iterations and an additional 30,000 iterations. We sampled parameter values from every 10 iterations, resulting in 3,000 MCMC samples. We took the mean across 3,000 MCMC samples as the final estimates and display results for 100 repeated simulations.

Sensitivity analyses for the real data

We designed nine sensitivity analyses to test the robustness of our results from real data. For each of the sensitivity analyses, we fixed parameters and initial states to be the same as the main analysis except for those mentioned below. For analysis (S1), we adjust the reported incidences from 29 January to 1 February to their average. We suspect the spike of incidences on 1 February might be caused by approximate-date records among some patients admitted to the field hospitals after 2 February. The actual dates for illness onset for these patients were likely to be spread between 29 January and 1 February. For analysis (S2), we assume an incubation period of 4.1 days (lower 95% confidence interval from ref. 6) and a presymptomatic infectious period of 1.1 days (the lower 95% confidence interval from ref. 2 is 0.8 days, but our discrete stochastic model requires Dp > 1), equivalent to set De = 3 and Dp = 1.1, and adjust P(0) and E(0) accordingly. For analysis (S3), we assume an incubation period of 7 days (upper 95% confidence interval from ref. 6) and a presymptomatic infectious period of 3 days (upper 95% confidence interval from ref. 2), equivalent to set De = 4 and Dp = 3, and adjust P(0) and E(0) accordingly. For analysis (S4), we assume the transmissibility of the unascertained cases is α = 0.46 (lower 95% confidence interval from ref. 15) of the ascertained cases. For analysis (S5), we assume the transmissibility of the unascertained cases is α = 0.62 (upper 95% confidence interval from ref. 15) of the ascertained cases. For analysis (S6), we assume the initial ascertainment rate is r0 = 0.14 (lower 95% confidence interval of the estimate using Singapore data) and adjust A(0), P(0) and E(0) accordingly. For analysis (S7), we assume the initial ascertainment rate is r0 = 0.42 (upper 95% confidence interval of the estimate using Singapore data) and adjust A(0), P(0) and E(0) accordingly. For analysis (S8), we assume the initial ascertainment rate is r0 = 1 (theoretical upper limit) and adjust A(0), P(0) and E(0) accordingly. For analysis (S9), we assume no unascertained cases by fixing r0 = r12 = r3 = r4 = r5 = 1. We compared this simplified model to the full model using the Bayes factor.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.