Sun, 27 Dec 2020 12:41
Welcome! In this post, we'll be taking a character-by-character look at the source code of the BioNTech/Pfizer SARS-CoV-2 mRNA vaccine.
I want to thank the large cast of people who spent time previewing this article for legibility and correctness. All mistakes remain mine though, but I would love to hear about them quickly at bert@hubertnet.nl or @PowerDNS_Bert
Now, these words may be somewhat jarring - the vaccine is a liquid that gets injected in your arm. How can we talk about source code?
This is a good question, so let's start off with a small part of the very source code of the BioNTech/Pfizer vaccine, also known as BNT162b2, also known as Tozinameran, also known as Comirnaty.
First 500 characters of the BNT162b2 mRNA. Source: World Health Organization
The BNT162b2 mRNA vaccine has this digital code at its heart. It is 4284 characters long, so it would fit in a bunch of tweets. At the very beginning of the vaccine production process, someone uploaded this code to a DNA printer (yes), which then converted the bytes on disk to actual DNA molecules.
A Codex DNA BioXp 3200 DNA printer
Out of such a machine come tiny amounts of DNA, which after a lot of biological and chemical processing end up as RNA (more about which later) in the vaccine vial. A 30 microgram dose turns out to actually contain 30 micrograms of RNA. In addition, there is a clever lipid (fatty) packaging system that gets the mRNA into our cells.
RNA is the volatile 'working memory' version of DNA. DNA is like the flash drive storage of biology. DNA is very durable, internally redundant and very reliable. But much like computers do not execute code directly from a flash drive, before something happens, code gets copied to a faster, more versatile yet far more fragile system.
For computers, this is RAM, for biology it is RNA. The resemblance is striking. Unlike flash memory, RAM degrades very quickly unless lovingly tended to. The reason the Pfizer/BioNTech mRNA vaccine must be stored in the deepest of deep freezers is the same: RNA is a fragile flower.
Each RNA character weighs on the order of 0.53·10⁻²¹ grams, meaning there are around 6·10¹⁶ characters in a single 30 microgram vaccine dose. Expressed in bytes (at 2 bits per character), this is around 14 petabytes, although it must be said this consists of around 13,000 billion repetitions of the same 4284 characters. The actual informational content of the vaccine is just over a kilobyte. SARS-CoV-2 itself weighs in at around 7.5 kilobytes.
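For those who like to check the sums, here is a back-of-the-envelope sketch in Python. It only uses the figures quoted in the text; the 2 bits per character is an assumption that follows from there being four possible characters, and the variable names are made up for this illustration.

```python
# Rough arithmetic on the size of a vaccine dose, using figures from the text.
GRAMS_PER_CHARACTER = 0.53e-21  # approximate mass of one RNA character
DOSE_GRAMS = 30e-6              # one 30 microgram dose
SEQUENCE_LENGTH = 4284          # characters in the BNT162b2 mRNA
BITS_PER_CHARACTER = 2          # assumption: 4 possible characters -> 2 bits

characters_per_dose = DOSE_GRAMS / GRAMS_PER_CHARACTER
bytes_per_dose = characters_per_dose * BITS_PER_CHARACTER / 8
copies_per_dose = characters_per_dose / SEQUENCE_LENGTH
payload_bytes = SEQUENCE_LENGTH * BITS_PER_CHARACTER / 8

print(f"characters per dose: {characters_per_dose:.1e}")
print(f"petabytes per dose:  {bytes_per_dose / 1e15:.1f}")
print(f"copies of the mRNA:  {copies_per_dose:.1e}")
print(f"unique payload:      {payload_bytes:.0f} bytes")
```

The numbers are only as good as the per-character mass estimate, so treat the output as an order-of-magnitude figure.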
The briefest bit of background
DNA is a digital code. Unlike computers, which use 0 and 1, life uses A, C, G and U/T (the 'nucleotides', 'nucleosides' or 'bases').
In computers we store the 0 and 1 as the presence or absence of a charge, or as a current, as a magnetic transition, or as a voltage, or as a modulation of a signal, or as a change of reflectivity. Or in short, the 0 and 1 are not some kind of abstract concept - they live as electrons and in many other physical embodiments.
In nature, A, C, G and U/T are molecules, stored as chains in DNA (or RNA).
In computers, we group 8 bits into a byte, and the byte is the typical unit of data being processed.
Nature groups 3 nucleotides into a codon, and this codon is the typical unit of processing. A codon contains 6 bits of information (2 bits per DNA character, 3 characters = 6 bits). This means 2⁶ = 64 different codon values.
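To make the 'codon as a 6-bit value' idea concrete, here is a small Python sketch. The particular mapping of bases to bit patterns (A=0, C=1, G=2, U=3) is an arbitrary choice for illustration; nature does not number its bases.

```python
from itertools import product

BASES = "ACGU"
# Hypothetical 2-bit encoding, picked only for this example.
BITS = {base: index for index, base in enumerate(BASES)}


def codon_to_int(codon: str) -> int:
    """Pack a three-character codon into a 6-bit integer."""
    value = 0
    for base in codon:
        value = (value << 2) | BITS[base]
    return value


# Enumerate every possible codon: 4 * 4 * 4 combinations.
all_codons = ["".join(combo) for combo in product(BASES, repeat=3)]

print(len(all_codons))
print(codon_to_int("AAA"), codon_to_int("UUU"))
```

Every codon lands on its own value between 0 and 63, which is all that '6 bits' means here.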
Pretty digital so far. When in doubt, head to the WHO document with the digital code to see for yourself.
Some further reading is available here - this link ('What is life') might help make sense of the rest of this page. Or, if you like video, I have two hours for you.
So what does that code DO?
The idea of a vaccine is to teach our immune system how to fight a pathogen, without us actually getting ill. Historically this has been done by injecting a weakened or incapacitated (attenuated) virus, plus an 'adjuvant' to scare our immune system into action. This was a decidedly analogue technique involving billions of eggs (or insects). It also required a lot of luck and loads of time. Sometimes a different (unrelated) virus was also used.
An mRNA vaccine achieves the same thing ('educate our immune system') but in a laser-like way. And I mean this in both senses - very narrow but also very powerful.
So here is how it works. The injection contains volatile genetic material that describes the famous SARS-CoV-2 'Spike' protein. Through clever chemical means, the vaccine manages to get this genetic material into some of our cells.
These then dutifully start producing SARS-CoV-2 Spike proteins in large enough quantities that our immune system springs into action. Confronted with Spike proteins, and (importantly) tell-tale signs that cells have been taken over, our immune system develops a powerful response against multiple aspects of the Spike protein AND the production process.
And this is what gets us to the 95% effective vaccine.
The source code!
Let's start at the very beginning, a very good place to start. The WHO document has this helpful picture:
This is a sort of table of contents. We'll start with the 'cap', actually depicted as a little hat.
Much like you can't just plonk opcodes in a file on a computer and run it, the biological operating system requires headers, has linkers and things like calling conventions.
The code of the vaccine starts with the following two nucleotides:
GA
This can be compared very much to every DOS and Windows executable starting with MZ, or UNIX scripts starting with #!. In both life and operating systems, these two characters are not executed in any way. But they have to be there because otherwise nothing happens.
The mRNA 'cap' has a number of functions. For one, it marks code as coming from the nucleus. In our case of course it doesn't, our code comes from a vaccination. But we don't need to tell the cell that. The cap makes our code look legit, which protects it from destruction.
The initial two GA nucleotides are also chemically slightly different from the rest of the RNA. In this sense, the GA has some out-of-band signaling on it.
The ''five-prime untranslated region''
Some lingo here. RNA molecules can only be read in one direction. Confusingly, the part where the reading begins is called the 5' or 'five-prime'. The reading stops at the 3' or three-prime end.
Life consists of proteins (or things made by proteins). And these proteins are described in RNA. When RNA gets converted into proteins, this is called translation.
Here we have the 5' untranslated region ('UTR'), so this bit does not end up in the protein:
GAAΨAAACΨAGΨAΨΨCΨΨCΨGGΨCCCCACAGACΨCAGAGAGAACCCGCCACC
Here we encounter our first surprise. The normal RNA characters are A, C, G and U. U is also known as 'T' in DNA. But here we find a Ψ, what is going on?
This is one of the exceptionally clever bits about the vaccine. Our body runs a powerful antivirus system (''the original one''). For this reason, cells are extremely unenthusiastic about foreign RNA and try very hard to destroy it before it does anything.
This is somewhat of a problem for our vaccine - it needs to sneak past our immune system. Over many years of experimentation, it was found that if the U in RNA is replaced by a slightly modified molecule, our immune system loses interest. For real.
So in the BioNTech/Pfizer vaccine, every U has been replaced by 1-methyl-3'-pseudouridylyl, denoted by Ψ. The really clever bit is that although this replacement Ψ placates (calms) our immune system, it is accepted as a normal U by relevant parts of the cell.
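At the level of the text listing, this substitution is nothing more than a character replacement. The Python sketch below is purely a string-level analogy (the real work is chemistry, not find-and-replace), and the function names are invented for this example.

```python
PSI = "Ψ"

# The 5' UTR as listed above, including the two cap nucleotides.
UTR_MODIFIED = "GAAΨAAACΨAGΨAΨΨCΨΨCΨGGΨCCCCACAGACΨCAGAGAGAACCCGCCACC"


def to_plain(rna: str) -> str:
    """Show the sequence with ordinary U, as the ribosome effectively reads it."""
    return rna.replace(PSI, "U")


def to_modified(rna: str) -> str:
    """Swap every U for the modified nucleoside symbol."""
    return rna.replace("U", PSI)


plain = to_plain(UTR_MODIFIED)
print(plain)
print(to_modified(plain) == UTR_MODIFIED)
```

The round trip works because every single U is replaced; no plain U is left anywhere in the listing.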
In computer security we also know this trick - it sometimes is possible to transmit a slightly corrupted version of a message that confuses firewalls and security solutions, but that is still accepted by the backend servers - which can then get hacked.
We are now reaping the benefits of fundamental scientific research performed in the past. The discoverers of this Ψ technique had to fight to get their work funded and then accepted. We should all be very grateful, and I am sure the Nobel prizes will arrive in due course.
Many people have asked, could viruses also use the Ψ technique to beat our immune systems? In short, this is extremely unlikely. Life simply does not have the machinery to build 1-methyl-3'-pseudouridylyl nucleotides. Viruses rely on the machinery of life to reproduce themselves, and this facility is simply not there. The mRNA vaccines quickly degrade in the human body, and there is no possibility of the Ψ-modified RNA replicating with the Ψ still in there. ''No, Really, mRNA Vaccines Are Not Going To Affect Your DNA'' is also a good read.
Ok, back to the 5' UTR. What do these 51 characters do? As with everything in nature, almost nothing has one clear function.
When our cells need to translate RNA into proteins, this is done using a machine called the ribosome. The ribosome is like a 3D printer for proteins. It ingests a strand of RNA and based on that it emits a string of amino acids, which then fold into a protein.
Source: Wikipedia user Bensaccount
This is what we see happening above. The black ribbon at the bottom is RNA. The ribbon appearing in the green bit is the protein being formed. The things flying in and out are amino acids plus adaptors to make them fit on RNA.
This ribosome needs to physically sit on the RNA strand for it to get to work. Once seated, it can start forming proteins based on further RNA it ingests. From this, you can imagine that it can't yet read the parts where it lands on first. This is just one of the functions of the UTR: the ribosome landing zone. The UTR provides 'lead-in'.
In addition to this, the UTR also contains metadata: when should translation happen? And how much? For the vaccine, they took the most 'right now' UTR they could find, taken from the alpha globin gene. This gene is known to robustly produce a lot of proteins. In previous years, scientists had already found ways to optimize this UTR even further (according to the WHO document), so this is not quite the alpha globin UTR. It is better.
The S glycoprotein signal peptide
As noted, the goal of the vaccine is to get the cell to produce copious amounts of the Spike protein of SARS-CoV-2. Up to this point, we have mostly encountered metadata and ''calling convention'' stuff in the vaccine source code. But now we enter the actual viral protein territory.
We still have one layer of metadata to go however. Once the ribosome (from the splendid animation above) has made a protein, that protein still needs to go somewhere. This is encoded in the ''S glycoprotein signal peptide (extended leader sequence)''.
The way to see this is that at the beginning of the protein there is a sort of address label - encoded as part of the protein itself. In this specific case, the signal peptide says that this protein should exit the cell via the ''endoplasmic reticulum''. Even Star Trek lingo is not as fancy as this!
The ''signal peptide'' is not very long, but when we look at the code, there are differences between the viral and vaccine RNA:
(Note that for comparison purposes, I have replaced the fancy modified Ψ by a regular RNA U)
           3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3
Virus:   AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU
Vaccine: AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUU
               !   !   !   !   ! ! ! !     !   !   !   !   !
So what is going on? I have not accidentally listed the RNA in groups of 3 letters. Three RNA characters make up a codon. And every codon encodes for a specific amino acid. The signal peptide in the vaccine consists of exactly the same amino acids as in the virus itself.
So how come the RNA is different?
There are 4³=64 different codons, since there are 4 RNA characters, and there are three of them in a codon. Yet there are only 20 different amino acids. This means that multiple codons encode for the same amino acid.
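To see this redundancy in numbers, the sketch below builds the standard genetic code in Python and counts how many codons map to each amino acid. The compact 64-letter string is the usual way of writing the standard table, with codons ordered U, C, A, G at each position and '*' marking a stop; the variable names are mine.

```python
from collections import Counter
from itertools import product

# Standard genetic code: codons ordered U, C, A, G at each position,
# one-letter amino acid codes, '*' for stop.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_amino_acid = Counter(CODON_TABLE.values())
distinct_amino_acids = set(CODON_TABLE.values()) - {"*"}

print(len(CODON_TABLE), "codons,", len(distinct_amino_acids), "amino acids")
for amino_acid, count in codons_per_amino_acid.most_common():
    print(amino_acid, count)
```

Leucine, Serine and Arginine each get six codons, while Methionine and Tryptophan have to make do with one.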
Life uses the following nearly universal table for mapping RNA codons to amino acids:
The RNA codon table (Wikipedia)
In this table, we can see that the modifications in the vaccine (UUU -> UUC) are all synonymous. The vaccine RNA code is different, but the same amino acids and the same protein come out.
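You do not have to take my word for it. This Python sketch translates both versions of the signal peptide with the standard codon table and compares the results. The `translate` helper is a minimal illustration written for this page, not a tool used by anyone making vaccines.

```python
from itertools import product

# Standard genetic code: codons ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

VIRUS = "AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU"
VACCINE = "AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUU"


def translate(rna: str) -> str:
    """Translate RNA (spaces ignored) into one-letter amino acid codes."""
    rna = rna.replace(" ", "")
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna), 3))


virus_protein = translate(VIRUS)
vaccine_protein = translate(VACCINE)
print(virus_protein)
print(vaccine_protein)
print("same protein:", virus_protein == vaccine_protein)
```

Different RNA in, identical string of amino acids out.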
If we look closely, we see that the majority of the changes happen in the third codon position, noted with a '3' above. And if we check the universal codon table, we see that this third position indeed often does not matter for which amino acid is produced.
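Counting where in the codon each change sits can be sketched in a few lines of Python, using the signal peptide fragment listed earlier. The variable names are again just for illustration.

```python
VIRUS = "AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU".split()
VACCINE = "AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUU".split()

# changes_per_position[0] counts changes in the first codon position, etc.
changes_per_position = [0, 0, 0]
for virus_codon, vaccine_codon in zip(VIRUS, VACCINE):
    for position in range(3):
        if virus_codon[position] != vaccine_codon[position]:
            changes_per_position[position] += 1

print(changes_per_position)
```

Each count corresponds to the '!' marks in the listing: the first number is changes in position one, the last is changes in position three.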
So, the changes are synonymous, but then why are they there? Looking closely, we see that all changes except one lead to more Cs and Gs.
So why would you do that? As noted above, our immune system takes a very dim view of 'exogenous' RNA, RNA code coming from outside the cell. To evade detection, the 'U' in the RNA was already replaced by a Ψ.
However, it turns out that RNA with a higher amount of Gs and Cs is also converted more efficiently into proteins. And this has been achieved in the vaccine RNA by replacing many characters with Gs and Cs wherever this was possible.
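The shift towards G and C is easy to measure on the signal peptide fragment shown earlier. This is a small Python sketch; `gc_count` is a made-up helper name, and 48 characters is of course a tiny sample of the full 4284.

```python
VIRUS = "AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU"
VACCINE = "AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUU"


def gc_count(rna: str) -> int:
    """Count the G and C characters in an RNA string (spaces ignored)."""
    rna = rna.replace(" ", "")
    return rna.count("G") + rna.count("C")


for name, rna in (("virus", VIRUS), ("vaccine", VACCINE)):
    length = len(rna.replace(" ", ""))
    count = gc_count(rna)
    print(f"{name}: {count} of {length} characters are G or C ({count / length:.0%})")
```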
I'm slightly fascinated by the one change that did not lead to an additional C or G, the CCA -> CCU modification. If anyone knows the reason, please let me know! Note that I'm aware that some codons are more common than others in the human genome, but I also read that this does not influence translation speed a lot.
The actual Spike protein
The next 3777 characters of the vaccine RNA are similarly 'codon optimized' to add a lot of C's and G's. In the interest of space I won't list all the code here, but we are going to zoom in on one exceptionally special bit. This is the bit that makes it work, the part that will actually help us return to life as normal:
                  *   *
          L   D   K   V   E   A   E   V   Q   I   D   R   L   I   T   G
Virus:   CUU GAC AAA GUU GAG GCU GAA GUG CAA AUU GAU AGG UUG AUC ACA GGC
Vaccine: CUG GAC CCU CCU GAG GCC GAG GUG CAG AUC GAC AGA CUG AUC ACA GGC
          L   D   P   P   E   A   E   V   Q   I   D   R   L   I   T   G
           !     !!! !!        !   !       !   !   !   ! !
Here we see the usual synonymous RNA changes. For example, in the first codon we see that CUU is changed into CUG. This adds another 'G' to the vaccine, which we know helps enhance protein production. Both CUU and CUG encode for the amino acid 'L' or Leucine, so nothing changed in the protein.
When we compare the entire Spike protein in the vaccine, all changes are synonymous like this.. except for two, and this is what we see here.
The third and fourth codons above represent actual changes. The K and V amino acids there are both replaced by 'P' or Proline. For 'K' this required three changes ('!!!') and for 'V' it required only two ('!!').
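The two real changes can be picked out mechanically by translating both fragments and comparing them amino acid by amino acid. The Python sketch below does this for the 16 codons listed above; the indices it reports are positions within this fragment only, not within the whole Spike protein.

```python
from itertools import product

# Standard genetic code: codons ordered U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

VIRUS = "CUU GAC AAA GUU GAG GCU GAA GUG CAA AUU GAU AGG UUG AUC ACA GGC".split()
VACCINE = "CUG GAC CCU CCU GAG GCC GAG GUG CAG AUC GAC AGA CUG AUC ACA GGC".split()

# Collect (index in fragment, amino acid in virus, amino acid in vaccine)
# for every codon whose translation differs.
substitutions = [
    (index, CODON_TABLE[virus_codon], CODON_TABLE[vaccine_codon])
    for index, (virus_codon, vaccine_codon) in enumerate(zip(VIRUS, VACCINE))
    if CODON_TABLE[virus_codon] != CODON_TABLE[vaccine_codon]
]

silent_changes = sum(
    1
    for virus_codon, vaccine_codon in zip(VIRUS, VACCINE)
    if virus_codon != vaccine_codon
    and CODON_TABLE[virus_codon] == CODON_TABLE[vaccine_codon]
)

print(substitutions)
print("synonymous codon changes:", silent_changes)
```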
It turns out that these two changes enhance the vaccine efficiency enormously.
So what is happening here? If you look at a real SARS-CoV-2 particle, you can see the Spike protein as, well, a bunch of spikes:
SARS virus particles (Wikipedia)
The spikes are mounted on the virus body ('the nucleocapsid protein'). But the thing is, our vaccine is only generating the spikes itself, and we're not mounting them on any kind of virus body.
It turns out that, unmodified, freestanding Spike proteins collapse into a different structure. If injected as a vaccine, this would indeed cause our bodies to develop immunity.. but only against the collapsed spike protein.
And the real SARS-CoV-2 shows up with the spiky Spike. The vaccine would not work very well in that case.
So what to do? In 2017 it was described how putting a double Proline substitution in just the right place would make the SARS-CoV-1 and MERS S proteins take up their 'pre-fusion' configuration, even without being part of the whole virus. This works because Proline is a very rigid amino acid. It acts as a kind of splint, stabilising the protein in the state we need to show to the immune system.
The people that discovered this should be walking around high-fiving themselves incessantly. Unbearable amounts of smugness should be emanating from them. And it would all be well deserved.
Update! I have been contacted by the McLellan lab, one of the groups behind the Proline discovery. They tell me the high-fiving is subdued because of the ongoing pandemic, but they are pleased to have contributed to the vaccines. They also stress the importance of many other groups, workers and volunteers.
The end of the protein, next steps
If we scroll through the rest of the source code, we encounter some small modifications at the end of the Spike protein:
          V   L   K   G   V   K   L   H   Y   T   s
Virus:   GUG CUC AAA GGA GUC AAA UUA CAU UAC ACA UAA
Vaccine: GUG CUG AAG GGC GUG AAA CUG CAC UAC ACA UGA UGA
          V   L   K   G   V   K   L   H   Y   T   s   s
               !   !   !   !     ! !   !          !
At the end of a protein we find a 'stop' codon, denoted here by a lowercase 's'. This is a polite way of saying that the protein should end here. The original virus uses the UAA stop codon, the vaccine uses two UGA stop codons, perhaps just for good measure.
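In code, a stop codon behaves much like a string terminator: translation simply halts at the first one. Here is a minimal Python sketch; `translate_until_stop` is a name I made up, and real ribosomes are rather more involved than a for loop.

```python
from itertools import product

# Standard genetic code: codons ordered U, C, A, G at each position,
# with '*' marking the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

VIRUS_END = "GUG CUC AAA GGA GUC AAA UUA CAU UAC ACA UAA"
VACCINE_END = "GUG CUG AAG GGC GUG AAA CUG CAC UAC ACA UGA UGA"


def translate_until_stop(rna: str) -> str:
    """Translate codon by codon and halt at the first stop codon."""
    protein = []
    for codon in rna.split():
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_until_stop(VIRUS_END))
print(translate_until_stop(VACCINE_END))
```

The second UGA in the vaccine is never reached in this sketch, since the first one already ends the protein.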
The 3' Untranslated Region
Much like the ribosome needed some lead-in at the 5' end, where we found the 'five prime untranslated region', at the end of a protein we find a similar construct called the 3' UTR.
Many words could be written about the 3' UTR, but here I quote what the Wikipedia says: ''The 3'-untranslated region plays a crucial role in gene expression by influencing the localization, stability, export, and translation efficiency of an mRNA .. despite our current understanding of 3'-UTRs, they are still relative mysteries''.
What we do know is that certain 3'-UTRs are very successful at promoting protein expression. According to the WHO document, the BioNTech/Pfizer vaccine 3'-UTR was picked from ''the amino-terminal enhancer of split (AES) mRNA and the mitochondrial encoded 12S ribosomal RNA to confer RNA stability and high total protein expression''. To which I say, well done.
The AAAAAAAAAAAAAAAAAAAAAA end of it all
The very end of mRNA is polyadenylated. This is a fancy way of saying it ends on a lot of AAAAAAAAAAAAAAAAAAA. Even mRNA has had enough of 2020 it appears.
mRNA can be reused many times, but as this happens, it also loses some of the A's at the end. Once the A's run out, the mRNA is no longer functional and gets discarded. In this way, the 'poly-A' tail is protection from degradation.
Studies have been done to find out what the optimal number of A's at the end is for mRNA vaccines. I read in the open literature that this peaked at 120 or so.
The BNT162b2 vaccine ends with:
                                     ****** ****
UAGCAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAGCAUAU GACUAAAAAA AAAAAAAAAA
AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAA
This is 30 A's, then a ''10 nucleotide linker'' (GCAUAUGACU), followed by another 70 A's.
I suspect that what we see here is the result of further proprietary optimization to enhance protein expression even more.
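If you don't feel like counting A's by hand, a regular expression will do it. The Python sketch below pulls the tail apart into its three pieces. The pattern hard-codes the linker quoted above and the four characters (UAGC) that precede the tail in the listing, so it is specific to this one sequence.

```python
import re

# The end of the BNT162b2 mRNA as listed above, with the spaces removed.
TAIL = (
    "UAGCAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAGCAUAU GACUAAAAAA AAAAAAAAAA "
    "AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAAAAAAAA AAAA"
).replace(" ", "")

# A run of A's, the 10 nucleotide linker, and a second run of A's.
match = re.fullmatch(r"UAGC(A+)(GCAUAUGACU)(A+)", TAIL)
assert match is not None, "tail does not have the expected layout"

first_run, linker, second_run = match.groups()
print(len(first_run), linker, len(second_run))
```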
Summarising
With this, we now know the exact mRNA contents of the BNT162b2 vaccine, and for most parts we understand why they are there:
- The CAP to make sure the RNA looks like regular mRNA
- A known successful and optimized 5' untranslated region (UTR)
- A codon optimized signal peptide to send the Spike protein to the right place (copied 100% from the original virus)
- A codon optimized version of the original spike, with two 'Proline' substitutions to make sure the protein appears in the right form
- A known successful and optimized 3' untranslated region
- A slightly mysterious poly-A tail with an unexplained 'linker' in there
The codon optimization adds a lot of G and C to the mRNA. Meanwhile, using Ψ (1-methyl-3'-pseudouridylyl) instead of U helps evade our immune system, so the mRNA stays around long enough so we can actually help train the immune system.
Further reading/viewing
In 2017 I held a two hour presentation on DNA, which you can view here. Like this page it is aimed at computer people.
In addition, I've been maintaining a page on 'DNA for programmers' since 2001.
You might also enjoy this introduction to our amazing immune system.
Finally, this listing of my blog posts has quite some DNA, SARS-CoV-2 and COVID related material.