Background

It is of considerable interest to know now how well our vaccines are performing against the Variants of Concern and Variants under Investigation (VOCs and VUIs) that are currently in circulation. Can the variants evade vaccines or not? This knowledge would help us in a number of important policy areas, including how much we need to clamp down on the VOCs, whether hospitals need to plan for a variant-driven resurgence this year, whether/how to restrict entry to the country and whether we need to modify vaccines to better deal with such variants.

Here we present a simple suggestion for making use of information that should already be available (though it is not currently published) and that could, with minimal work and more-or-less in real time, give some idea of our vaccines' efficacy against at least two of the important variants present in the UK right now (May 2021): B.1.617 and B.1.351. The idea is that publicly available data on variant counts (such as the feed from the Sanger centre) could be segregated into counts of those who had and had not been vaccinated. Provided each variant occurs at least a few tens of times, that should be enough to deduce something about the efficacy of vaccines against it, without presenting a privacy concern in terms of revealing private health data about identifiable people.

The proposed process is related to a case-matched observational study, such as those carried out in Israel (Dagan et al, Feb 2021) and England (Public Health England, Feb 2021), but with some differences:

Those studies investigated vaccine efficacy against infection (vs non-infection). Here it is proposed to investigate differential vaccine efficacy against two specific variants (B.1.1.7 vs another; for example B.1.1.7 vs B.1.617).
Those studies require a certain amount of work to find matches for hundreds of thousands of people. They also require access to a database that has to remain closed for privacy reasons, limiting the number of people that can work on it and check the results, and slowing down the process.
The process proposed here would just examine existing confirmed cases after they have been found, so there would be no need to process hundreds of thousands of people to find a few who might catch the disease in the future (given it's currently a time of low disease prevalence).
There should be minimal privacy concerns in simply returning a count of people with a particular variant and a particular vaccine status.
The process here would only determine relative vaccine efficacy, comparing one variant (B.1.1.7) with another (e.g., B.1.617). To determine the actual vaccine efficacy against a variant, one could then make use of the previously-discovered vaccine efficacy against B.1.1.7. So this process would have the effect of piggy-backing on previous hard work in determining efficacy against B.1.1.7.

The UK is well-placed to uncover this information because a significant portion of the population is now vaccinated (about 50% with at least one dose at the time of writing) and it has an extensive testing and genomic surveillance programme, currently sequencing about 10,000 new cases per week. Information on variants in the UK is published about twice a month in Technical Briefings. Summary total counts of VOCs/VUIs appear weekly on this government cases data page. There is a more detailed and curated feed from the Sanger centre, and also a country-wide raw feed from COG-UK.

The trial data for the Oxford/AstraZeneca vaccine (AZ) against B.1.351 is currently very limited. There was a study (Madhi et al, March 2021) carried out in South Africa where the point efficacy against symptomatic disease was 22%, but there were so few cases that the 95% confidence interval was -50% to 60%, so we are still in the dark to a large extent. (There was also a matter of the dose spacing probably not being optimal for AZ, and the participants were all young so there wasn't a chance to measure efficacy against severe disease.)

The idea

Normally when calculating vaccine efficacy, one would have to go to a lot of trouble to prevent or remove the effects of confounding. The problem arises if being vaccinated is not independent of exposure to (or susceptibility to) the disease. Traditional antidotes to this problem include using a randomised controlled trial (RCT), or an observational study using a system of matched pairs as mentioned above. Both of these are resource-intensive and can take some time to report back.

However, we have an extra resource here which means we can construct a measure of vaccine efficacy against variants that requires minimal effort but has a chance of being unconfounded enough to tell us something real. As we are studying two variants at once we can use the B.1.1.7 infection rate as a baseline exposure and measure the efficacy of the vaccine against the other variant using a cross-ratio.

Suppose, among the unvaccinated population, there are $$A$$ cases of variant $$0$$ and $$B$$ cases of variant $$1$$, and in the vaccinated population, $$C$$ cases of variant $$0$$ and $$D$$ of variant $$1$$. Knowing $$A$$, what would we expect $$C$$ to be? It's very hard to say, since it depends not only on how effective the vaccine is, but also a host of other factors such as how many people are vaccinated versus unvaccinated, whether vaccinated people tend to have higher or lower risk to start with, whether their behaviours and so exposures are different after vaccination, and so on. Similarly, it's hard to predict $$D$$ from $$B$$. But, if the vaccine works equally well against both variants, we do expect that the ratio $$C:A$$ will be about the same as the ratio $$D:B$$ - most of these factors are unknown, but they act equally on the two variants and so cancel. Hence $$(D/B)/(C/A)=(AD)/(BC)$$ should be around $$1$$.

If the vaccine is $$\rho$$ times less effective against variant $$1$$ than considered above, but everything else is unchanged, then we expect $$D$$ to be around $$\rho$$ times larger than it would otherwise have been, so $$(AD)/(BC)$$ should be around $$\rho$$, and in general $$(AD)/(BC)$$ is a measure of how much less effective the vaccine is against variant $$1$$ than it is against variant $$0$$.

Writing it out with indexes for convenience later, let's say $$N_{ij}$$ denotes the number of people who have been diagnosed with variant $$i$$ and have vaccination status $$j$$. Here $$i=u,0,1$$ with $$i=u$$ corresponding to no infection (not used here, but mentioning it for completeness), $$i=0$$ to B.1.1.7, and $$i=1$$ to the variant under investigation (such as B.1.617). For simplicity here we're taking vaccination status to be a binary $$j=0$$ (unvaccinated) or $$1$$ (vaccinated), ignoring dose timing and distinctions between vaccines. Then

\[\hat\rho=\frac{N_{00}N_{11}}{N_{01}N_{10}}\]

is the basic form of the measure of relative vaccine efficacy as a ratio of relative risks considered here. If $$\hat\rho<1$$ then variant $$1$$ responds to the vaccine better than variant $$0$$ does, and vice-versa if $$\hat\rho>1$$. See below for more refined measures.
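As a minimal sketch of the point estimate above, the cross-ratio can be computed directly from a 2x2 table of counts. The function name `cross_ratio`, the nested-list layout `n[i][j]` and the counts are assumptions made for illustration; the counts are made up and are not real surveillance data.

```python
def cross_ratio(n):
    """Point estimate rho_hat = N00*N11 / (N01*N10).

    n[i][j] is the number of cases of variant i (0 = baseline variant,
    1 = variant under investigation) with vaccination status j
    (0 = unvaccinated, 1 = vaccinated).
    """
    if n[0][1] == 0 or n[1][0] == 0:
        raise ValueError("cross-ratio is undefined when N01 or N10 is zero")
    return (n[0][0] * n[1][1]) / (n[0][1] * n[1][0])


# Made-up counts, for illustration only.
counts = [[60000, 5000],  # variant 0: unvaccinated, vaccinated
          [340, 60]]      # variant 1: unvaccinated, vaccinated
rho_hat = cross_ratio(counts)
print(f"rho_hat = {rho_hat:.2f}")
```

A value above 1 here would suggest the vaccine performs worse, in relative-risk terms, against variant 1 than against variant 0.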

Discussion of the cross-ratio

$$\def\PP{\mathbb{P}}\def\EE{\mathbb{E}}\def\ind{\perp \!\!\! \perp}$$ We're going to make use of two assumptions, the second of which is negotiable and will be weakened later.

1. The chance of an infection with variant $$i$$ given vaccine status $$j$$ ($$i=0,1$$ and $$j=0,1$$) decomposes as $$E_i\alpha_{ij}$$, where $$E_i=E_i(p)$$ depends on the person and $$\alpha_{ij}$$ are four constants.
2. $$\EE[E_1|V=j]/\EE[E_0|V=j]$$ does not depend on $$j$$ (regarding the set of people as the probability space).

The first of these assumptions is saying that the effect of vaccination ($$\alpha_{i1}/\alpha_{i0}$$) is multiplicative. Multiplicativity is a common assumption in analysis of vaccines: to be able to talk about "efficacy" as a single number implies that there is an effect multiplier that applies across a wide range of conditions. $$E_i=E_i(p)$$ can loosely be thought of as exposure of person $$p$$ to variant $$i$$ combined with that person's susceptibility, however it isn't necessarily a quantitative measure of exposure-to-virus as there is no reason to expect the probability of becoming infected to be proportional to viral load. $$E_i$$ is really a baseline risk, which can by assumption be modified multiplicatively by the presence of a vaccine.

The second of these assumptions is the heart of the matter. It is saying that we haven't tended to vaccinate people who are more at risk from one variant than another. We'll revisit the plausibility of this below, but let's first see the consequence of it. If $$Y=Y(p)$$ denotes the infected status of person $$p$$ ($$u$$, $$0$$ or $$1$$), then

\[ \begin{equation*} \begin{split} \PP(Y=1,V=j)/\PP(Y=0,V=j) &= \EE(E_1\alpha_{1j}|V=j)/\EE(E_0\alpha_{0j}|V=j) \;\;\text{(by assumption 1)} \\ &= C\frac{\alpha_{1j}}{\alpha_{0j}}\;\;\text{ (for some constant, }C\text{, by assumption 2), so} \\ \frac{\PP(Y=0,V=0)\PP(Y=1,V=1)}{\PP(Y=0,V=1)\PP(Y=1,V=0)} &= \frac{\alpha_{00}\alpha_{11}}{\alpha_{01}\alpha_{10}} = \frac{\alpha_{11}/\alpha_{10}}{\alpha_{01}/\alpha_{00}} = \frac{\text{RR}_1}{\text{RR}_0} = \rho \end{split} \end{equation*} \]

The left hand side is estimated by $$\hat\rho$$, while the right hand side is the ratio of the vaccine effects, as relative risks, against variants 1 and 0 $$(\text{RR}_i=\alpha_{i1}/\alpha_{i0})$$, and ends up as something you might call the relative relative risk, $$\rho$$. So under these assumptions we have an easy estimate of the relative vaccine efficacy against one variant versus another. Since there is already a lot of existing knowledge about vaccine efficacy against B.1.1.7, and since B.1.1.7 is currently the dominant variant, we would want to make the comparison between ($$i=0$$) B.1.1.7 and ($$i=1$$) another variant, whereupon we would gain some knowledge of vaccine efficacy against the other variant.
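The cancellation in this derivation can be checked numerically. The sketch below builds a hypothetical population in which vaccination is strongly correlated with baseline risk (so a naive relative risk is confounded), but in which both assumptions hold. All numbers, including `alpha`, the risk range and the vaccination rule, are invented for illustration. It works with expected counts rather than random case counts, so the result is exact up to rounding.

```python
import random


def expected_counts(people, alpha):
    """Expected case counts n[i][j] when each person's risk is E_i * alpha[i][j]."""
    n = [[0.0, 0.0], [0.0, 0.0]]
    size = [0, 0]  # number of unvaccinated / vaccinated people
    for e0, e1, v in people:
        n[0][v] += e0 * alpha[0][v]
        n[1][v] += e1 * alpha[1][v]
        size[v] += 1
    return n, size


rng = random.Random(1)
# Hypothetical multipliers: RR_0 = 0.2, RR_1 = 0.5, so the true rho is 2.5.
alpha = [[1.0, 0.2], [1.0, 0.5]]

people = []
for _ in range(10000):
    e0 = rng.uniform(0.001, 0.01)  # baseline risk for variant 0
    e1 = 0.05 * e0                 # assumption 2 holds: E_1/E_0 is the same for everyone
    v = 1 if rng.random() < 0.9 - 60 * e0 else 0  # low-risk people are vaccinated more often
    people.append((e0, e1, v))

n, size = expected_counts(people, alpha)
rho_hat = (n[0][0] * n[1][1]) / (n[0][1] * n[1][0])
naive_rr0 = (n[0][1] / size[1]) / (n[0][0] / size[0])  # confounded estimate of RR_0

print(f"cross-ratio = {rho_hat:.3f} (true rho = 2.5)")
print(f"naive RR_0  = {naive_rr0:.3f} (true RR_0 = 0.2)")
```

In this construction the naive relative risk for variant 0 is biased because vaccinated people had lower baseline risk, while the cross-ratio still recovers $$\rho$$, because the unknown exposure terms appear in both numerator and denominator.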

Note that existing immunity from past infections, as well as vaccination, will modify the infection probabilities, and $$\rho$$ really measures the relative efficacy of the vaccine against the two variants in the presence of this existing immunity. This is arguably what you want to know, but it means that the answer doesn't generalise so easily to different parts of the world. Under certain assumptions, such as existing immunity working in the same way as vaccinations, you could recover the "raw" relative efficacy, $$\rho_0$$, that is, what it would be if there were no existing immunity. However, the difference between $$\rho$$ and $$\rho_0$$ will probably be fairly small under present circumstances, because at the moment it seems likely that $$N_{i0}/N_{i1}\gg pq/((1-p)(1-q))$$ where $$p$$ is the proportion of people already infected and $$q$$ is the proportion vaccinated.

Estimator and credible intervals

The point estimate,

\[\hat\rho=\frac{N_{00}N_{11}}{N_{01}N_{10}}\]

of

\[\rho=\frac{\alpha_{00}\alpha_{11}}{\alpha_{01}\alpha_{10}}\]

would be a bit crude because the best estimate of a product (or ratio) isn't the product (or ratio) of the best estimates. It also doesn't give you a confidence interval (or credible interval) for $$\rho$$. There are many ways to deal with this. One way is to treat $$N_{ij}$$ as samples from a Poisson distribution whose parameters, $$\gamma_{ij}$$, we'll use instead of $$N_{ij}$$ to form the cross ratio. Such parameters can conveniently be given a very wide Gamma prior, $$\Gamma(\epsilon,\epsilon^{-1})$$ in (shape,scale) notation, which is easy to calculate with but also justifiable for small $$\epsilon$$ because it's uninformative. The posterior distribution of $$\gamma_{ij}$$ is then $$\Gamma(N_{ij}+\epsilon,(1+\epsilon)^{-1})=(\text{const})\Gamma(N_{ij}+\epsilon,1)$$ and the estimate for $$\rho$$ is the random variable

\[\frac{\gamma_{00}\gamma_{11}}{\gamma_{01}\gamma_{10}}.\]

This is easy to work with, either by simulating the four Gamma random variables and seeing what the empirical distribution of the cross ratio is, or by direct integration (it's the product of two Beta prime distributions whose densities are elementary functions, so it's just a simple integration in one variable). One can then take (e.g.) the 2.5% and 97.5% points of the distribution in the usual way to get a 95% Bayesian credible interval (which in practice will end up similar to a 95% confidence interval).
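The simulation route can be sketched in a few lines using only the Python standard library. The function names, the number of draws and the seed are arbitrary choices for illustration, and the counts passed in at the end are made up. Note that `random.gammavariate` needs a strictly positive shape, so a zero count requires $$\epsilon>0$$.

```python
import random


def cross_ratio_samples(n, eps=0.0, draws=100_000, seed=0):
    """Sorted draws of gamma00*gamma11/(gamma01*gamma10), gamma_ij ~ Gamma(N_ij + eps, 1)."""
    rng = random.Random(seed)

    def g(count):
        # Shape N_ij + eps, scale 1; the common scale factor cancels in the cross-ratio.
        return rng.gammavariate(count + eps, 1.0)

    return sorted(
        (g(n[0][0]) * g(n[1][1])) / (g(n[0][1]) * g(n[1][0]))
        for _ in range(draws)
    )


def credible_interval(samples, level=0.95):
    """Equal-tailed interval read off the empirical distribution of sorted samples."""
    tail = (1.0 - level) / 2.0
    lower = samples[int(tail * len(samples))]
    upper = samples[int((1.0 - tail) * len(samples)) - 1]
    return lower, upper


# Made-up counts n[i][j], for illustration only.
samples = cross_ratio_samples([[60000, 5000], [340, 60]])
lower, upper = credible_interval(samples)
print(f"95% credible interval for rho: {lower:.2f} to {upper:.2f}")
```

Direct integration would avoid Monte Carlo noise, but with this many draws the noise is far smaller than the width of the interval, so simulation is likely to be good enough in practice.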

Example using made-up data (in order to see roughly how narrow a credible interval we might expect):

Suppose $$N_{00}=60000, N_{01}=5000, N_{10}=340, N_{11}=60$$. The totals $$N_{00}+N_{01}=65000$$ and $$N_{10}+N_{11}=400$$ are very roughly right for the respective counts of B.1.1.7 and B.1.617.2 over the months of March and April, but the proportions of breakthrough infections $$5000/65000$$ and $$60/400$$ are made-up for illustration. Using these numbers (and using $$\epsilon=0$$) the 95% credible interval for $$\rho$$ would be $$1.58-2.76$$. So in this (fictitious) example we would conclude that B.1.617.2 responded to vaccines $$1.58$$ to $$2.76$$ times worse, in terms of relative risk, than B.1.1.7 did. In general, assuming $$N_{11}$$, the number of breakthrough infections in the minority variant, is much the smallest of these numbers, the ratio of the upper and lower end of this credible interval will be about a factor of $$e^{4/\sqrt{N_{11}}}$$, so accuracy mainly depends on finding a decent number of such breakthrough infections. But the worst case, where vaccines are significantly impaired against a variant, should be readily detectable with few datapoints because the centre of the estimate will be significantly above 1 and the range will be relatively narrow.
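The $$e^{4/\sqrt{N_{11}}}$$ rule of thumb can be motivated by a rough delta-method sketch. This is an approximation added here for illustration, assuming the four counts behave like independent Poisson variables (as in the Gamma construction above) and are not too small:

\[\text{Var}(\log\hat\rho)\approx\frac{1}{N_{00}}+\frac{1}{N_{01}}+\frac{1}{N_{10}}+\frac{1}{N_{11}}\approx\frac{1}{N_{11}}\quad\text{when }N_{11}\text{ is much the smallest count,}\]

so a 95% interval for $$\log\rho$$ is roughly $$\log\hat\rho\pm1.96/\sqrt{N_{11}}$$, and

\[\frac{\rho_{\text{upper}}}{\rho_{\text{lower}}}\approx e^{2\times1.96/\sqrt{N_{11}}}\approx e^{4/\sqrt{N_{11}}}.\]

With $$N_{11}=60$$ this gives $$e^{4/\sqrt{60}}\approx1.68$$, compared with $$2.76/1.58\approx1.75$$ from the made-up example; most of the gap comes from the neglected $$1/N_{10}$$ term.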

Discussion of plausibility of assumption 2

How likely is it that $$\EE[E_1|V=j]/\EE[E_0|V=j]$$ does not depend on $$j$$?

If instead of comparing two variants we were trying in the traditional way to compare "infection" vs "no infection", we could do the same calculations with variants 0 and 1 replaced with "no infection" and "infection" respectively. We would end up with the odds ratio (odds of catching disease given vaccinated divided by odds given unvaccinated) familiar to normal vaccine studies. However, this would obviously be in danger of being confounded because it's very possible that being vaccinated is correlated with being exposed to infection. That is the reason that we have randomised vaccine trials or carefully matched observational trials.

How does what we're doing here differ from such a case-matched observational trial? We are arguing that it's justifiable to avoid a detailed matching process here because we're comparing two variants, rather than comparing an infection and non-infection. What would a matching process look like here? In a normal infected/non-infected study, for each vaccinated person you'd have to seek out an unvaccinated person (or people) with similar characteristics. The goal would be to try to find a matched person who has similar exposure to the disease (which ends up meaning the matched person has a similar risk of catching the disease in the counterfactual situation that they weren't vaccinated). In such a study you'd try to ensure this by conditioning on enough relevant personal information such as (in the case of the PHE study) age, region, care home status and sex, and then hope that the disease exposure is uniform within each (small) subgroup so defined.

Our task is easier: all we care about is that the vaccinated group has similar relative exposure to each of the two variants under consideration. As such, we're in much better shape because there is less scope for vaccinations to be picking out people who are more likely to get one variant than another than there is for vaccinations to be picking out people who are more likely to be infected at all. It may even be that we can get away with no subdivision at all.

Of course, such a bias could still occur. For example, visitors from India or South Africa who were not part of the UK vaccination programme are more likely to be exposed to B.1.617.x or B.1.351 and may have different vaccination probabilities from UK residents. However, we're in luck in that (non-B.1.1.7) variant cases are already segregated into categories such as recent travellers, surge testing and special studies: see Sanger Centre information. (We would probably want to exclude recent travellers, but include the results of surge testing.)

There could be a residual/secondary confounding effect from foreign travel, in that UK residents who have been in contact with travellers might tend to have a different average vaccination coverage than the rest of the country, but this is hopefully a much smaller effect because as residents they would be part of the same vaccination programme as everyone else.

In general, we need to be cautious about factors that could result in vaccination being correlated with exposure to the particular variant under scrutiny. Where the main reservoir of a variant is abroad, it makes sense to consider foreign travel as above. It also would make sense to consider whether belonging to a particular ethnic or racial group could be a confounding variable, since it is possible that people of a given ethnicity may have more contact with people from countries which have a high proportion of that ethnicity, and it is known that vaccine uptake is different in different ethnic groups. Fortunately in the case of India and B.1.617, it so happens that the vaccination rate among the (self-described) Indian ethnicity in the UK is very high - approximately equal to the national average - so Indian ethnicity is probably not a confounding factor for this analysis of B.1.617.

Another possible confounding variable is time: more people are vaccinated now than in January, and some important variants have only recently made an appearance, so this creates a dependency of relative variant prevalence on vaccination status, mediated by time. But the large majority of sequenced B.1.617 and B.1.351 have been from the last month or so (April), during which period the vaccination levels in the country have not changed enormously. Since there are currently plenty of B.1.1.7 infections, we could restrict the B.1.1.7 cases to having a similar time (and location) profile to those of the variants being studied.

There is another way to reduce the effect of confounding with our data, which is to condition on confounding variables as discussed in the next section.

Improved estimate based on subgroups

We may be able to get a better estimate of $$\rho$$ by conditioning on confounding variables, partitioning the cases into subsets. The idea here would be to condition on easily available information, such as location or age of person. Let's use $$C=C(p)$$ to denote a possible confounding variable (such as age or location), so that instead of plain $$N_{ij}$$, we use $$N_{ijk}$$ to be the number of people diagnosed with variant $$i$$ with vaccination status $$j$$ and subset $$k$$. For example, $$k$$ could represent the month number or it could index one of the seven NHS regions of England, or age band of the person, or some combination of these. It could be of deconfounding benefit to condition on location if it happens that, for example, people in London tend to have more international connections and so are exposed more to variants that are dominant in other countries.

Assumption two would then be replaced by a conditional version:

2'. For any $$k$$, $$\EE[E_1|V=j,C=k]/\EE[E_0|V=j,C=k]$$ does not depend on $$j$$.

In plain terms this is saying, for example with $$C$$ representing the month of infection, "In any given month, knowing vaccination status tells you nothing about relative average exposure to variants 0 and 1". Or if $$C$$ represents the person's NHS region, then it would be saying "In any given NHS region, knowing vaccination status tells you nothing about relative average exposure to variants 0 and 1."

As before, we could let $$\gamma_{ijk}$$ be distributed as \[\gamma_{ijk}\sim \Gamma(N_{ijk}+\epsilon,1),\] then form \[\hat\rho_k=\frac{\gamma_{00k}\gamma_{11k}}{\gamma_{01k}\gamma_{10k}},\] and define $$\rho$$ as the common distribution of the $$\hat\rho_k$$ after they are conditioned to be equal. (As a technical point, of course you need to say how you are conditioning on a probability zero event. This can be made well-defined by defining the conditioned distribution as the normalised product of the pdfs, which makes sense after fixing the choice of variable, such as $$\hat\rho_k$$. In practice $$\log(\hat\rho_k)$$ may be a nicer choice than $$\hat\rho_k$$ but it shouldn't make a big difference.) I think it's worth mentioning that computationally this is very easy: it's a few lines of code and should run in a fraction of a second.

In reality we'd not want to decompose into more than a few subsets because (at the time of writing) $$N_{ij}$$ for $$i=1$$ is only of the order of $$100$$ or so and we can't subdivide this too much. However, this process should be robust to some subgroups, $$k$$, having small (or zero) counts, $$N_{1jk}$$, because they would just generate a very wide distribution, $$\hat\rho_k$$, that wouldn't have much effect on $$\rho$$.
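One possible way to carry out the subgroup combination is sketched below, using the normalised product of pdfs in the variable $$\log\hat\rho_k$$. This is an illustration under stated assumptions rather than a definitive implementation: the densities are approximated by histograms of simulated draws, the grid limits, bin count and $$\epsilon=0.5$$ are arbitrary choices, and the two subgroups of counts are made up.

```python
import math
import random


def log_rho_samples(n, eps, draws, rng):
    """Draws of log(g00*g11/(g01*g10)) with g_ij ~ Gamma(N_ij + eps, 1)."""
    def lg(count):
        return math.log(rng.gammavariate(count + eps, 1.0))
    return [lg(n[0][0]) + lg(n[1][1]) - lg(n[0][1]) - lg(n[1][0])
            for _ in range(draws)]


def histogram(samples, lo, hi, bins):
    """Bin counts on a fixed grid; draws outside [lo, hi) are dropped."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        b = math.floor((x - lo) / width)
        if 0 <= b < bins:
            counts[b] += 1
    return counts


def combined_log_rho(subgroups, eps=0.5, draws=50_000,
                     lo=-3.0, hi=5.0, bins=400, seed=0):
    """Normalised product over subgroups k of the (histogram) pdfs of log rho_k."""
    rng = random.Random(seed)
    product = [1.0] * bins
    for n in subgroups:
        h = histogram(log_rho_samples(n, eps, draws, rng), lo, hi, bins)
        product = [p * c for p, c in zip(product, h)]
    total = sum(product)
    if total == 0:
        raise ValueError("subgroup distributions do not overlap on this grid")
    width = (hi - lo) / bins
    grid = [lo + (b + 0.5) * width for b in range(bins)]
    return grid, [p / total for p in product]


def rho_quantile(grid, weights, q):
    """Quantile of the combined distribution, returned on the rho (not log) scale."""
    acc = 0.0
    for x, w in zip(grid, weights):
        acc += w
        if acc >= q:
            return math.exp(x)
    return math.exp(grid[-1])


# Two made-up subgroups (e.g. two regions), each as n[i][j]; illustration only.
subgroups = [[[40000, 3000], [200, 35]],
             [[20000, 2000], [140, 25]]]
grid, weights = combined_log_rho(subgroups)
low, mid, high = (rho_quantile(grid, weights, q) for q in (0.025, 0.5, 0.975))
print(f"combined rho: median {mid:.2f}, 95% interval {low:.2f} to {high:.2f}")
```

A subgroup with very small counts produces a nearly flat histogram over the region that matters, so multiplying by it changes the combined distribution very little, which is the robustness property described above.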

Measuring vaccine efficacy against variants
