= About =
[[Main_Page|Back to main page]]
<blockquote style="background-color: #ececec; padding: 10px; border:solid thin grey">
The post below was written on 22 May 2020. This boxed comment is written from a later perspective (5 December 2020) and is slightly more opinionated than the May post, which was intended to be neutral in tone.
At the time of writing the post, such low HITs seemed rather unlikely (to me anyway) based on the dynamics of the first waves around the world, but now I believe the evidence is almost conclusive that such low HITs are impossible. We have had second waves in places with high attack rates from a previous first wave. For example, New York City is now experiencing [https://www.nytimes.com/interactive/2020/nyregion/new-york-city-coronavirus-cases.html a resurgence] despite [https://www.nbcnewyork.com/news/local/cuomo-outlines-reopening-roadmap-for-new-york-as-daily-deaths-hit-lowest-level-in-weeks/2390949/ an estimated 24.7%] having been infected by late April, and despite [https://www.bbc.co.uk/news/world-us-canada-54906483 ongoing suppression measures], together suggesting the HIT is well above 25% for NYC. There is a similar story in London, Stockholm and other places. It is possible to imagine an argument that the HIT is as low as 25% in NYC, though it would be quite strained, relying on the assumptions that the present restrictions are having no suppressive effect at all and that the outbreaks are happening in different areas from before. It seems to me more likely that the HIT for NYC is considerably higher.
Since the May 2020 paper, two of the original authors have published [https://arxiv.org/abs/2008.00098 a new paper] with a more streamlined presentation of their argument where they find the analytic formula noted in the appendix of the post below.
Since then, these authors and others have published [https://www.medrxiv.org/content/10.1101/2020.07.23.20160762v3 a new paper] (Aguas et al, November 2020) seeking to deduce HITs for European countries by fitting their model to case numbers, using certain values for the level of NPIs (non-pharmaceutical interventions, disease suppression by distancing, hygiene, stay-at-home etc.). They conclude that the HITs for the studied countries are around 10-20%, which I do not find believable for the reasons given above (and others). There is [https://www.medrxiv.org/content/10.1101/2020.12.01.20242289v1 a paper] (Fox et al, December 2020) which reanalyses Aguas et al using different assumptions (which I find more persuasive) about the suppression levels, and arrives at estimates of 60-80% for HITs. The key difference is that Aguas et al assumes that suppression measures have had no effect in Europe since August 2020, whereas Fox et al assumes that there has been fairly strong suppression all the time since March 2020.
As of 3 December 2020, one of the authors has [https://twitter.com/mgmgomes1/status/1334572853056983043 reaffirmed her belief in a HIT of around 20%].
'''Additional note''': To summarise the main objection to the paper, its mechanism hinges on there being a wide variation in ''susceptibility'' (propensity to catch the disease), but the existence of superspreaders (people who transmit the disease a lot more than average) only proves there is a wide variation in ''transmissibility''. Unless superspreaders are also supercatchers, there is no reason to expect them to be preferentially removed from the susceptible pool, which means the existence of superspreaders alone gives no reason to expect a lower HIT by this mechanism. So the evidence for superspreaders (which is fairly solid) only becomes evidence for this low-HIT mechanism if superspreaders arise because they are generally more sociable (in which case you'd expect them to catch the disease more), not if they are just emitting more (or more potent) viral particles. Since we don't know a priori (at least, as of May 2020) which of these is true, it becomes an empirical question - it can only be decided by observation, not just mathematics, and as far as I can see, observation has come down on the side of a HIT that is not a great deal lower than the naive $$1-1/R_0$$.
</blockquote>
 
= Article =
 
  
 
Note on [https://www.medrxiv.org/content/10.1101/2020.04.27.20081893v2 "Individual variation in susceptibility or exposure to SARS-CoV-2 lowers the herd immunity threshold"] by M. Gabriella M. Gomes et al.
 
This paper made a media splash early in May 2020 with [https://www.spectator.co.uk/article/herd-immunity-may-only-need-a-10-per-cent-infection-rate headlines] such as "Herd immunity may only need 10-20 per cent of people to be infected". The proposed mechanism is that the people who are most susceptible to the infection are also those most involved in spreading it, but they become immune at a faster rate than others, so the disease naturally tends to immunise the key players sooner than the rest.
  
 
= Summary =
 
(Opinion-based) I believe that studying heterogeneities will be important because the herd immunity threshold may indeed be lower than the simplistically-calculated value, and I also believe that the Gomes model is a potentially useful way to introduce heterogeneities that could be applied in general as a meta-method to more realistic/non-toy models. However, for reasons given below, I believe that the actual numerical results in this paper can only be taken as illustrative and should not be taken as representing reality.
= Discussion =
* This paper uses two variants of a standard SEIR model. The first is the "susceptibility model", where the population is divided into classes according to a parameter that governs to what extent they are likely to become infected. Passing on the infection is still assumed to occur uniformly. The second is the "connectivity model" which takes this further by assuming that infectivity is proportional to susceptibility. I suppose the latter model is called "connectivity" because it is equivalent to changing the connectivity of the underlying graph - making it random subject to a degree distribution. Note that throughout, the term "susceptibility" refers to the potential to catch the disease, not the severity of outcome.
* Varying the infectivity alone would just reduce to a standard SEIR model because everyone (all infectivity classes) would be infected at the same rate. By contrast, in the two models above the susceptibility classes are depleted at different rates from each other.
* As the authors conclude, there is an effect where the herd immunity threshold (HIT) in the two models above is lower than that predicted by simple homogeneous modelling ($$1-1/R_0$$). This is a valuable point to make given that some seem to be taking as read that the herd immunity threshold must be 60-70%, and we obviously very much need to know when herd immunity arises, or at least to what extent we should expect to feel the effects of partial herd immunity. (Though the idea that the HIT is lower in the presence of heterogeneity will of course be familiar to epidemiologists.)
* I doubt that a HIT as low as 10% is at all likely. It arises in this paper's model from an extreme distribution of susceptibilities where there is a big concentration at the low end. I couldn't find any evidence presented in the paper for a particular susceptibility distribution, or for a particular variance of it. There are references to sources which give evidence or arguments for a certain distribution (or dispersion) of individual infectivities, and it seems the authors of the present paper intend this to inform the distribution of individual susceptibilities, but I think this is a different thing. Of course, the present authors are aware of this, but as there is scant information to go on they are just making the best use of what there is. (More on this below.)
  
* Contrary to the practice in the paper, I believe the Coefficient of Variation (CV), or other dispersion parameter that is basically a function of the mean and variance, is not, for these purposes, a suitable parameter to use to measure how much the susceptibility distribution varies from a point value (constant). To demonstrate this we can choose different (hypothetical) distributions with the same CV and get a huge range of HITs.
* There is actually a nice clean way to evaluate the HIT directly from the susceptibility distribution under the given models.
* In their model, social distancing / locking down only affects the overall time parameter. It's unclear how realistic this is, but under this model one can decouple the question of what the HIT is from the question of how to manage (by distancing etc.) the flow of the disease. As such, the focus here is mainly on what the HIT is.
  
* The present paper shows what happens if you "artificially" add in a variation in susceptibility, i.e., treat everyone uniformly except for a susceptibility parameter controlled by a given distribution. An alternative approach would be to use a real-life contact graph, or some version of it. Age- and location-based contact information has been studied for this purpose, e.g., [https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005697 Prem et al] and [https://www.sciencedirect.com/science/article/pii/S1755436518300306 Klepac et al]. Such a calculation is carried out in [https://arxiv.org/abs/2005.03085 Britton et al] using age- and activity-based mixing matrices. They find, inter alia, that if $$R_0=3$$ then the HIT is reduced from $$66.7\%$$ to $$49.1\%$$. This empirically-derived distribution has the merit of being justifiable, but (and I'm speculating here) I wonder if it understates the full heterogeneity, in which case there might be an argument for trying a hybrid method where you artificially introduce a little more variability (somehow calibrated to reality), maybe in the manner of the present paper, on top of the empirically-found mixing matrices.
  
= In more detail =
  
The lowest HIT estimates in this paper arise from using a Gamma distribution with $$CV=3$$ for the susceptibility distribution, which corresponds to shape parameter $$k=1/9$$. This has a large chunk of its probability at the very low end: it is saying that 63% of the population has susceptibility less than 0.09 (relative to a mean of 1) and 50% has susceptibility less than 0.01. With this in mind, it's not surprising that we end up with low HIT estimates because, roughly speaking, in this case you only need to induce herd immunity amongst the minority susceptible population.
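
These quantiles are easy to check numerically. A small sketch (my own, not from the paper), using SciPy's Gamma distribution with shape $$k=1/9$$ and scale 9 so that the mean is 1 and $$CV=3$$:

<pre>
from scipy.stats import gamma

# Gamma with shape k=1/9 and scale 9: mean = k*scale = 1, CV = 1/sqrt(k) = 3
k = 1 / 9
dist = gamma(a=k, scale=9)

print(dist.cdf(0.09))  # ~0.63: fraction of the population with susceptibility below 0.09
print(dist.cdf(0.01))  # ~0.50: fraction of the population with susceptibility below 0.01
</pre>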
  
Note that a similar-sounding argument to this is not correct: if the infectivities, but not susceptibilities, were clustered near 0 then it would still be true that you only need to induce herd immunity amongst the minority infective population, but that wouldn't occur naturally because the infective portion of the population wouldn't be getting infected any more than the non-infective portion.
  
There is some evidence that infectivity is strongly clustered like this (and has a long tail - i.e., superspreaders). The present paper cites [https://wellcomeopenresearch.org/articles/5-67 Endo et al], which suggests a dispersion of something like $$k=1/10$$ for SARS-CoV-2. This relies (slightly optimistically in my opinion) on knowledge of the seed infections in different countries to derive its result, but perhaps more importantly this cited paper is making a statement about infectivity, not susceptibility. Also cited is the classic 2005 paper [https://www.nature.com/articles/nature04153 Lloyd-Smith et al] on superspreading for the original SARS outbreak. This estimates a dispersion parameter of $$k=0.16$$ and also gives evidence that the Gamma family is a good one to use (because a Poisson with parameter Gamma is a Negative Binomial, and they find evidence for the Negative Binomial family being a good description), but again this is talking about infectivity not susceptibility, so I believe it isn't directly applicable to the situation of the present paper.
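
As an aside, the Poisson-with-Gamma-parameter point is easy to check by simulation. A small sketch (my own; the shape and scale values are illustrative only, and it uses the fact that the resulting counts should be negative binomial with SciPy parameters $$n=k$$ and $$p=1/(1+\theta)$$):

<pre>
from scipy.stats import gamma, poisson, nbinom

k, theta = 0.16, 10.0   # Gamma shape (dispersion) and scale - illustrative values only

# Draw individual rates from a Gamma, then Poisson counts given those rates
rates = gamma(a=k, scale=theta).rvs(size=200_000, random_state=1)
counts = poisson.rvs(rates, random_state=2)

# Compare the empirical distribution of counts with NegBin(n=k, p=1/(1+theta))
for x in range(5):
    print(x, (counts == x).mean(), nbinom.pmf(x, k, 1 / (1 + theta)))
</pre>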
  
Regardless of the somewhat shaky evidence, is it nevertheless possible that the real susceptibility distribution looks like a Gamma with shape $$1/9$$? I suspect this is unlikely. As far as I am aware anyone can catch the disease, and there aren't any known huge differences in susceptibility to catching the disease amongst subpopulations. The best candidate might be age, since the disease is so strongly age-dependent in terms of severity, but while young people may be a little less susceptible to catching it, it's clearly (from any prevalence survey you look at) far from true that the younger 50% of the population (median age in the UK is 40.5 years) are less than $$1/100$$ as susceptible to catching it compared with the average. Or if susceptibility variation is mainly arising from connectivity, then a shape $$1/9$$ Gamma distribution is still not right, because it's clearly not the case (in normal times, which is what we are talking about here, since we are investigating the question of herd immunity after dropping restrictions) that more than 50% of the population lives alone and essentially never meets anyone at all. Arguments against lower values of $$CV$$ become progressively less clear-cut - I won't pursue this further.
  
To illustrate how the HIT doesn't properly depend on $$CV$$, I tried using a "two-point" distribution parameterized by $$x, y$$ and $$p$$, where $$P(X=x)=p$$, and $$P(X=y)=1-p$$. Fixing the mean to be 1 and the variance to be $$CV^2$$ leaves a free parameter that may as well be $$x$$, and I look at the opposite extremes of $$x=0$$ and $$x=0.9$$. There are superspreaders in both of these cases (though more extreme in the $$x=0.9$$ case), but the $$x=0$$ case also has a lot of (for want of a better term) "superhermits" - i.e., like Gamma at $$CV=3$$, lots of the distribution is concentrated at or near 0. Of course these distributions are unrealistic as real-world examples. They are just there to make the point that $$CV$$ isn't a good characterisation of spread for the purposes of calculating the HIT.
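
For concreteness, fixing the mean at 1 and the variance at $$CV^2$$ determines the other two parameters: if I have the algebra right, $$p=CV^2/((1-x)^2+CV^2)$$ and $$y=(1-px)/(1-p)$$. A small sketch (my own helper, not from the paper) showing the two cases used here:

<pre>
def two_point(x, cv):
    """Parameters (p, y) of the two-point distribution P(X=x)=p, P(X=y)=1-p,
    chosen so that the mean is 1 and the coefficient of variation is cv (needs x < 1)."""
    p = cv**2 / ((1 - x)**2 + cv**2)
    y = (1 - p * x) / (1 - p)
    return p, y

for x in (0.0, 0.9):
    p, y = two_point(x, 3.0)
    mean = p * x + (1 - p) * y
    cv = (p * x**2 + (1 - p) * y**2 - mean**2) ** 0.5 / mean
    print(f"x={x}: p={p:.4f}, y={y:.2f}, mean={mean:.3f}, CV={cv:.3f}")
    # x=0:   p=0.9000, y=10.00  (90% "superhermits" who are never exposed)
    # x=0.9: p=0.9989, y=91.00  (rare, extreme high-susceptibility individuals)
</pre>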
  
(To help my understanding, and to check the paper, I [https://github.com/alex1770/Covid-19/blob/master/variation.py reimplemented their models in Python]. This gives a good match to the output of version 1 of their paper in "susceptibility" mode, though not in "connectivity" mode. Possibly the discrepancy in the latter mode is due to a difference in our initial conditions. In any case there is only a discrepancy in the progress of the infection, not in the HITs, which are anyway separately calculable - see below.)
  
Version 2 of the paper came out after I tried these examples, and in it the authors also try out a different family of distributions (lognormal) to test robustness and dependence of their results on a particular family (Gamma). Using lognormal they do in fact get much bigger answers for the HIT than they do for Gamma in the $$CV=3$$ case, but as far as I can see this is not mentioned in the main body of their paper. It's also relevant that the lognormal family doesn't have a free parameter to vary like the two-point distribution has, so you can't make it particularly extreme like you can with the two-point family.
  
As an illustration, the HIT values for the four distributions, in "susceptibility" mode at $$R_0=3$$, all with $$CV=3$$, are: Two-point at x=0: 6.7%, Gamma: 10.4%, Lognormal: 22.3%, Two-point at x=0.9: 63.0%. We see there is a large variation in HIT for the same $$CV$$, so it doesn't make sense to think of HIT as a function of $$CV$$. In "connectivity" mode the values are more consistent: Two-point at x=0: 6.7%, Gamma: 5.6%, Lognormal: 3.8%, Two-point at x=0.9: 1.3%. See also the graphs below.
  
= Appendix - direct formulae =
  
As it happens, there is a nice clean procedure to directly calculate the HIT under the authors' model, so there is no need for simulation, and no dependence on particular characteristics such as the timing of the switching on and off of social distancing. (The reason for this is that $$S_t(x)=S_0(x)e^{-\Lambda(t)x}$$. That is, the $$t$$-dependence and $$x$$-dependence decouple.) I don't think the authors used this.
  
For the Gamma family there even happens to be a completely closed-form formula. Using the "susceptibility" model it's
  
 
\[\text{HIT} = 1-R_0^{-(1+CV^2)^{-1}},\] and with the "connectivity" model it's
 
 
\[\text{HIT} = 1-R_0^{-(1+2CV^2)^{-1}}.\]
 
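
The first of these can be checked directly (a derivation sketch of mine, not taken from the paper): for a Gamma-distributed $$X$$ with mean 1 and $$CV^2=1/k$$ (shape $$k$$, scale $$1/k$$) we have $$E[e^{-\Lambda X}]=(1+\Lambda/k)^{-k}$$ and hence $$E[Xe^{-\Lambda X}]=(1+\Lambda/k)^{-(k+1)}$$, so the susceptibility-model threshold condition $$R_0 E[Xe^{-\Lambda X}]=1$$ from the general procedure below gives $$1+\Lambda/k=R_0^{1/(k+1)}$$ and therefore

\[\text{HIT} = 1-E[e^{-\Lambda X}] = 1-R_0^{-k/(k+1)} = 1-R_0^{-(1+CV^2)^{-1}}.\]

The connectivity-model formula follows in the same way from the condition $$R_0 E[X^2e^{-\Lambda X}]=E[X^2]$$.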
  
In general the procedure for any distribution of initial susceptibilities/connectivities, written as the random variable $$X$$ with $$E[X]=1$$, is as follows:
  
 
* Susceptibility: choose $$\Lambda\ge0$$ such that $$R_0 E[Xe^{-\Lambda X}]=1$$, then $$\text{HIT}=1-E[e^{-\Lambda X}]$$.
 
* Connectivity: choose $$\Lambda\ge0$$ such that $$R_0 E[X^2e^{-\Lambda X}]=E[X^2]$$, then $$\text{HIT}=1-E[e^{-\Lambda X}]$$.
  
 
(If you need a susceptibility distribution with $$E[X]\neq1$$, and also want to keep the notation of this paper, and want $$R_0$$ to retain its conventional meaning of the initial branching factor, then you need to absorb the factor of $$E[X]$$ into $$R_0$$ and rescale $$X$$ to have mean 1.)
 
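
Here is a minimal Python sketch of this procedure (my own, separate from the reimplementation linked above; it assumes NumPy and SciPy, and the helper names are mine). The expectation $$E[g(X)]$$ is passed in as a function, so discrete and continuous distributions can both be plugged in, and the printed values reproduce the $$R_0=3$$, $$CV=3$$ figures quoted earlier together with the closed-form Gamma formula.

<pre>
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gamma

def hit(expect, R0, mode="susceptibility"):
    """Herd immunity threshold for a susceptibility/connectivity distribution X with E[X]=1.

    expect(g) must return E[g(X)]; mode is "susceptibility" or "connectivity".
    """
    if mode == "susceptibility":
        # threshold condition: R0 * E[X e^{-Lambda X}] = 1
        f = lambda lam: R0 * expect(lambda x: x * np.exp(-lam * x)) - 1.0
    else:
        # threshold condition: R0 * E[X^2 e^{-Lambda X}] = E[X^2]
        ex2 = expect(lambda x: x ** 2)
        f = lambda lam: R0 * expect(lambda x: x ** 2 * np.exp(-lam * x)) - ex2
    lam = brentq(f, 0.0, 1e3)                        # solve for Lambda >= 0
    return 1.0 - expect(lambda x: np.exp(-lam * x))  # HIT = 1 - E[e^{-Lambda X}]

R0, CV = 3.0, 3.0
k = 1.0 / CV ** 2                                    # Gamma shape giving mean 1, CV = 3

# E[g(X)] for the Gamma case, approximated by Monte Carlo (10^6 samples, fixed seed)
xs = gamma(a=k, scale=1.0 / k).rvs(size=10 ** 6, random_state=1)
gamma_expect = lambda g: g(xs).mean()

# E[g(X)] for the two-point x=0 case: P(X=0)=0.9, P(X=10)=0.1 (mean 1, CV=3), exact
twopoint_expect = lambda g: 0.9 * g(0.0) + 0.1 * g(10.0)

print(hit(gamma_expect, R0))                         # ~0.104  (Gamma, susceptibility)
print(hit(gamma_expect, R0, mode="connectivity"))    # ~0.056  (Gamma, connectivity)
print(hit(twopoint_expect, R0))                      # ~0.067  (two-point x=0)
print(1.0 - R0 ** (-1.0 / (1.0 + CV ** 2)))          # closed-form check: 0.1040...
</pre>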
We may now compare the solid curves of fig. 3 from the paper (HIT as a function of $$CV$$, assuming a Gamma distribution), and those of fig. S22 (HIT as a function of $$CV$$, assuming a lognormal distribution), with those given by the above procedure. We also add in the two-point $$x=0$$ and two-point $$x=0.9$$ distributions for comparison. It can be seen that there is a wide variation in behaviour for a given $$CV$$, depending on what distribution is used.
[[File:HITgraph_susceptibility.png|600px]] [[File:HITgraph_connectivity.png|600px]]
