Today, @deepmind in partnership with @emblebi launches the #AlphaFold Protein Structure Database in a landmark moment for UK science 🔽 pic.twitter.com/AGj2dtkCh5
Highly accurate protein structure prediction with AlphaFold - Amazing - this will open up incredible opportunities for discovery & translation @DeepMind @emblebi - agree with @ewanbirney "One of the most important data sets since mapping of human genome" www.nature.com/articles/s41586-021-03819-2
DeepMind offers AI tool to predict shape of all human proteins www.ft.com/content/fbcc9af4-8dcd-4385-85d5-59c180175b67 via @financialtimes
A transformative moment for biology & beyond - Highly accurate protein structure prediction for the human proteome; with predictions freely available via a public database hosted by @emblebi alphafold.ebi.ac.uk/ @DeepMind www.nature.com/articles/s41586-021-03828-1
DeepMind and several research partners have released a database containing the 3D structures of nearly every protein in the human body, as computationally determined by the breakthrough protein folding system demonstrated last year, AlphaFold. The freely available database represents an enormous advance and convenience for scientists across hundreds of disciplines and domains, and may very well form the foundation of a new phase in biology and medicine.
The AlphaFold Protein Structure Database is a collaboration between DeepMind, the European Bioinformatics Institute and others, and consists of hundreds of thousands of protein sequences with their structures predicted by AlphaFold — and the plan is to add millions more to create a “protein almanac of the world.”
“We believe that this work represents the most significant contribution AI has made to advancing the state of scientific knowledge to date, and is a great example of the kind of benefits AI can bring to society,” said DeepMind founder and CEO Demis Hassabis.
If you’re not familiar with proteomics in general — and it’s quite natural if that’s the case — the best way to think about this is perhaps in terms of another major effort: that of sequencing the human genome. As you may recall from the late ’90s and early ’00s, this was a huge endeavor undertaken by a large group of scientists and organizations across the globe and over many years. The genome, finished at last, has been instrumental to the diagnosis and understanding of countless conditions, and in the development of drugs and treatments for them.
It was, however, just the beginning of the work in that field — like finishing all the edge pieces of a giant puzzle. And one of the next big projects everyone turned their eyes toward in those years was understanding the human proteome — which is to say all the proteins used by the human body and encoded into the genome.
The problem with the proteome is that it’s much, much more complex. Proteins, like DNA, are sequences of known molecules; in DNA these are the four familiar bases (adenine, guanine, cytosine and thymine), but in proteins they are the 20 amino acids (each of which is coded by a triplet of bases in genes). This in itself creates a great deal more complexity, but it’s only the start. The sequences aren’t simply “code” but actually twist and fold into tiny molecular origami machines that accomplish all kinds of tasks within our body. It’s like going from binary code to a complex language that manifests objects in the real world.
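To make the step from bases to amino acids concrete, here is a minimal Python sketch that translates a stretch of coding DNA into a protein sequence using the standard genetic code. The function name `translate` and the toy sequence are illustrative assumptions, not part of any DeepMind tooling.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), listed in TCAG order.
# One-letter amino acid codes; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Reads three bases at a time and stops at the first stop codon.
    Trailing bases that do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))               # 64 possible triplets
    print(len(set(AMINO_ACIDS) - {"*"}))  # 20 amino acids
    print(translate("ATGGCCTGGTAA"))      # hypothetical toy gene -> MAW
```

Sixty-four triplets map onto just 20 amino acids plus stop signals, which is why several different codons can spell the same amino acid.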
Practically speaking, this means that the proteome is not just 20,000 sequences of hundreds of amino acids each; each one of those sequences also has a physical structure and function. And one of the hardest parts of understanding them is figuring out what shape is made from a given sequence. This is generally done experimentally using something like x-ray crystallography, a long, complex process that may take months or longer to figure out a single protein, and that’s if you happen to have the best labs and techniques at your disposal. The structure can also be predicted computationally, though the process had never been good enough to actually rely on until AlphaFold came along.
I’m only compressing this long history into one paragraph because it was extensively covered at the time, but it’s hard to overstate how sudden and complete this advance was. This was a problem that stumped the best minds in the world for decades, and it went from “we maybe have an approach that kind of works, but extremely slowly and at great cost” to “accurate, reliable, and can be done with off-the-shelf computers” in the space of a year.
The specifics of DeepMind’s advances and how it achieved them I will leave to specialists in the fields of computational biology and proteomics, who will no doubt be picking apart and iterating on this work over the coming months and years. It’s the practical results that concern us today, as the company employed its time since the publication of AlphaFold 2 (the version shown in 2020) not just tweaking the model, but running it… on every single protein sequence they could get their hands on.
The result is that 98.5% of the human proteome is now “folded,” as they say, meaning there is a predicted structure that the AI model is confident enough in (and, importantly, we are confident enough in its confidence) to say it represents the real thing. Oh, and they also folded the proteomes of 20 other organisms, like yeast and E. coli, amounting to about 350,000 protein structures total. It’s by far, by orders of magnitude, the largest and best collection of this absolutely crucial information.
All that will be made available as a freely browsable database, into which any researcher can simply plug a sequence or protein name and immediately be given the 3D structure. The details of the process and database can be found in a paper published today in the journal Nature.
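As a rough illustration of what "plugging in a protein" might look like programmatically, the Python sketch below builds a database page address from a UniProt accession. The host name comes from the announcement; the `/entry/<accession>` path layout, the helper name `entry_url`, and the example accession P69905 (human hemoglobin subunit alpha) are assumptions made for illustration and may not match the live site.

```python
from urllib.parse import quote

# Host taken from the announcement; the "/entry/<accession>" path layout is an
# assumption about how the site addresses individual proteins.
ALPHAFOLD_DB_HOST = "https://alphafold.ebi.ac.uk"


def entry_url(uniprot_accession: str) -> str:
    """Build a (hypothetical) database page URL for one UniProt accession."""
    accession = uniprot_accession.strip().upper()
    if not accession.isalnum():
        raise ValueError(f"not a plausible accession: {uniprot_accession!r}")
    return f"{ALPHAFOLD_DB_HOST}/entry/{quote(accession)}"


if __name__ == "__main__":
    # P69905 is used purely as an example identifier.
    print(entry_url("P69905"))
```

The sketch only constructs a string and makes no network request, so it can be checked without depending on the site being reachable.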
“The database as you’ll see it tomorrow, it’s a search bar, it’s almost like Google search for protein structures,” said Hassabis in an interview with TechCrunch. “You can view it in the 3D visualizer, zoom around it, interrogate the genetic sequence… and the nice thing about doing it with EMBL-EBI is it’s linked to all their other databases. So you can immediately go and see related genes, And it’s linked to all these other databases, you can see related genes, related in other organisms, other proteins that have related functions, and so on.”
“As a scientist myself, who works on an almost unfathomable protein,” said EMBL-EBI’s Edith Heard (she didn’t specify which protein), “it’s really exciting to know that you can find out what the business end of a protein is now, in such a short time — it would have taken years. So being able to access the structure and say ‘aha, this is the business end,’ you can then focus on trying to work out what that business end does. And I think this is accelerating science by steps of years, a bit like being able to sequence genomes did decades ago.”
So new is the very idea of being able to do this that Hassabis said he fully expects the entire field to change — and change the database along with it.
“Structural biologists are not yet used to the idea that they can just look up anything in a matter of seconds, rather than take years to experimentally determine these things,” he said. “And I think that should lead to whole new types of approaches to questions that can be asked and experiments that can be done. Once we start getting wind of that, we may start building other tools that cater to this sort of serendipity: What if I want to look at 10,000 proteins related in a particular way? There isn’t really a normal way of doing that, because that isn’t really a normal question anyone would ask currently. So I imagine we’ll have to start producing new tools, and there’ll be demand for that once we start seeing how people interact with this.”
That includes derivative and incrementally improved versions of the software itself, which has been released in open source along with a great deal of development history. Already we have seen an independently developed system, RoseTTAFold, from researchers at the University of Washington’s Baker Lab, which extrapolated from AlphaFold’s performance last year to create something similar yet more efficient — though DeepMind seems to have taken the lead again with its latest version. But the point was made that the secret sauce is out there for all to use.
The Drugs for Neglected Diseases Initiative (DNDI) focuses, as you might guess, on diseases that are rare enough that they don’t warrant the kind of attention and investment from major pharmaceutical companies and medical research outfits that would potentially result in discovering a treatment.
“This is a very practical problem in clinical genetics, where you have a suspected series of mutations, of changes in an affected child, and you want to try and work out which one is likely to be the reason why our child has got a particular genetic disease. And having widespread structural information, I am almost certain will improve the way we can do that,” said EMBL-EBI’s Ewan Birney in a press call ahead of the release.
Ordinarily, examining the proteins suspected of being at the root of a given problem would be expensive and time-consuming, and for diseases that affect relatively few people, money and time are in short supply when they can be applied to more common problems like cancers or dementia-related diseases. But with the ability to simply call up the structures of 10 healthy proteins and 10 mutated versions of the same, insights may appear in seconds that might otherwise have taken years of painstaking experimental work. (The drug discovery and testing process still takes years, but maybe now it can start tomorrow for Chagas disease instead of in 2025.)
Lest you think too much is resting on a computer’s prediction of experimentally unverified results, in another, totally different case, some of the painstaking work had already been done. John McGeehan of the University of Portsmouth, with whom DeepMind partnered for another potential use case, explained how this affected his team’s work on plastic decomposition.
“When we first sent our seven sequences to the DeepMind team, for two of those we already had experimental structures. So we were able to test those when they came back, and it was one of those moments, to be honest, when the hairs stood up on the back of my neck,” said McGeehan. “Because the structures that they produced were identical to our crystal structures. In fact, they contained even more information than the crystal structures were able to provide in certain cases. We were able to use that information directly to develop faster enzymes for breaking down plastics. And those experiments are already underway, immediately. So the acceleration to our project here is, I would say, multiple years.”
The plan, over the next year or two, is to make predictions for every single known and sequenced protein, somewhere in the neighborhood of a hundred million. And for the most part (the few structures not susceptible to this approach seem to make themselves known quickly) biologists should be able to have great confidence in the results.
The process AlphaFold uses to predict structures is, in some cases, better than experimental options. And although there is an amount of uncertainty in how any AI model achieves its results, Hassabis was clear that this is not just a black box.
“For this particular case, I think explainability was not just a nice-to-have, which often is the case in machine learning, but it was a must-have, given the seriousness of what we wanted it to be used for,” he said. “So I think we’ve done the most we’ve ever done on a particular system to make the case with explainability. So there’s both explainability on a granular level on the algorithm, and then explainability in terms of the outputs, as well the predictions and the structures, and how much you should or shouldn’t trust them, and which of the regions are the reliable areas of prediction.”
Nevertheless, his description of the system as “miraculous” attracted my special sense for potential headline words. Hassabis said that there’s nothing miraculous about the process itself, but rather that he’s a bit amazed that all their work has produced something so powerful.
“This was by far the hardest project we’ve ever done,” he said. “And, you know, even when we know every detail of how the code works, and the system works, and we can see all the outputs, it’s still just still a bit miraculous when you see what it’s doing… that it’s taking this, this 1D amino acid chain and creating these beautiful 3D structures, a lot of them aesthetically incredibly beautiful, as well as scientifically and functionally valuable. So it was more a statement of a sort of wonder.”
The impact of AlphaFold and the proteome database won’t be felt at large for some time, but it will almost certainly lead to some serious short-term and long-term breakthroughs, as early partners have testified. But that doesn’t mean that the mystery of the proteome is solved completely. Not by a long shot.
As noted above, the complexity of the genome is nothing compared to that of the proteome at a fundamental level, but even with this major advance we have only scratched the surface of the latter. AlphaFold solves a very specific, though very important problem: given a sequence of amino acids, predict the 3D shape that sequence takes in reality. But proteins don’t exist in a vacuum; they’re part of a complex, dynamic system in which they are changing their conformation, being broken up and reformed, responding to conditions, the presence of elements or other proteins, and indeed then reshaping themselves around those.
In fact, many of the human proteins for which AlphaFold gave only a middling level of confidence in its predictions may be fundamentally “disordered” proteins that are too variable to pin down the way a more static one can be (in which case the low confidence would itself be validated as an accurate indicator of that type of protein). So the team has its work cut out for it.
“It’s time to start looking at new problems,” said Hassabis. “Of course, there are many, many new challenges. But the ones you mentioned, protein interaction, protein complexes, ligand binding, we’re working actually on all these things, and we have early, early stage projects on all those topics. But I do think it’s worth taking, you know, a moment to just talk about delivering this big step… it’s something that the computational biology community’s been working on for 20, 30 years, and I do think we have now broken the back of that problem.”
Read full article at TechCrunch
22 July, 2021 - 02:02pm
DeepMind plans to release hundreds of millions of protein structures for free
Proteins are long, complex molecules that perform numerous tasks in the body, from building tissue to fighting disease. Their purpose is dictated by their structure, which folds like origami into complex and irregular shapes. Understanding how a protein folds helps explain its function, which in turn helps scientists with a range of tasks — from pursuing fundamental research on how the body works, to designing new medicines and treatments.
Previously, determining the structure of a protein relied on expensive and time-consuming experiments. But last year DeepMind showed it can produce accurate predictions of a protein’s structure using AI software called AlphaFold. Now, the company is releasing hundreds of thousands of predictions made by the program to the public.
“I see this as the culmination of the entire 10-year-plus lifetime of DeepMind,” company CEO and co-founder Demis Hassabis told The Verge. “From the beginning, this is what we set out to do: to make breakthroughs in AI, test that on games like Go and Atari, [and] apply that to real-world problems, to see if we can accelerate scientific breakthroughs and use those to benefit humanity.”
There are currently around 180,000 protein structures available in the public domain, each produced by experimental methods and accessible through the Protein Data Bank. DeepMind is releasing predictions for the structure of some 350,000 proteins across 20 different organisms, including animals like mice and fruit flies, and bacteria like E. coli. (There is some overlap between DeepMind’s data and pre-existing protein structures, but exactly how much is difficult to quantify because of the nature of the models.) Most significantly, the release includes predictions for 98 percent of all human proteins, around 20,000 different structures, which are collectively known as the human proteome. It isn’t the first public dataset of human proteins, but it is the most comprehensive and accurate.
If they want, scientists can download the entire human proteome for themselves, says AlphaFold’s technical lead John Jumper. “There is a HumanProteome.zip effectively, I think it’s about 50 gigabytes in size,” Jumper tells The Verge. “You can put it on a flash drive if you want, though it wouldn’t do you much good without a computer for analysis!”
After launching this first tranche of data, DeepMind plans to keep adding to the store of proteins, which will be maintained by Europe’s flagship life sciences lab, the European Molecular Biology Laboratory (EMBL). By the end of the year, DeepMind hopes to release predictions for 100 million protein structures, a dataset that will be “transformative for our understanding of how life works,” according to Edith Heard, director general of the EMBL.
The data will be free in perpetuity for both scientific and commercial researchers, says Hassabis. “Anyone can use it for anything,” the DeepMind CEO noted at a press briefing. “They just need to credit the people involved in the citation.”
Understanding a protein’s structure is useful for scientists across a range of fields. The information can help design new medicines, synthesize novel enzymes that break down waste materials, and create crops that are resistant to viruses or extreme weather. Already, DeepMind’s protein predictions are being used for medical research, including studying the workings of SARS-CoV-2, the virus that causes COVID-19.
New data will speed these efforts, but scientists note it will still take a lot of time to turn this information into real-world results. “I don’t think it’s going to be something that changes the way patients are treated within the year, but it will definitely have a huge impact for the scientific community,” Marcelo C. Sousa, a professor at the University of Colorado’s biochemistry department, told The Verge.
Scientists will have to get used to having such information at their fingertips, says DeepMind senior research scientist Kathryn Tunyasuvunakool. “As a biologist, I can confirm we have no playbook for looking at even 20,000 structures, so this [amount of data] is hugely unexpected,” Tunyasuvunakool told The Verge. “To be analyzing hundreds of thousands of structures — it’s crazy.”
Notably, though, DeepMind’s software produces predictions of protein structures rather than experimentally determined models, which means that in some cases further work will be needed to verify the structure. DeepMind says it spent a lot of time building accuracy metrics into its AlphaFold software, which ranks how confident it is for each prediction.
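For readers curious how such per-prediction confidence might be consumed in practice, here is a hedged Python sketch. It assumes, beyond anything stated in the article, that a predicted model arrives as a PDB-format text file with a 0 to 100 per-residue confidence score stored in the B-factor column, and that scores are bucketed with cutoffs at 90, 70 and 50. The function names and the toy records are hypothetical.

```python
from collections import Counter


def residue_confidence(pdb_text: str) -> dict[int, float]:
    """Map residue number -> confidence score, read from C-alpha ATOM records.

    Assumes fixed-column PDB format with the score in the B-factor field
    (columns 61-66), which is an assumption about the file layout.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def confidence_band(score: float) -> str:
    """Bucket a 0-100 score into four bands (cutoffs are assumptions)."""
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"


def make_ca_line(serial: int, res_name: str, res_seq: int, score: float) -> str:
    """Format a toy C-alpha ATOM record with made-up coordinates."""
    return (
        f"ATOM  {serial:5d}  CA  {res_name:3s} A{res_seq:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{score:6.2f}"
    )


if __name__ == "__main__":
    # Three invented residues, one per line, purely for demonstration.
    toy = "\n".join([
        make_ca_line(1, "MET", 1, 42.10),
        make_ca_line(2, "ALA", 2, 88.50),
        make_ca_line(3, "TRP", 3, 96.30),
    ])
    scores = residue_confidence(toy)
    print(Counter(confidence_band(s) for s in scores.values()))
```

Reading only the C-alpha atom of each residue keeps one score per residue, which is enough to decide which stretches of a model deserve experimental follow-up.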
Helen Walden, a professor of structural biology at the University of Glasgow, tells The Verge that DeepMind’s data will “significantly ease” research bottlenecks, but that “the laborious, resource-draining work of doing the biochemistry and biological evaluation of, for example, drug functions” will remain.
Sousa, who has previously used data from AlphaFold in his work, says for scientists the impact will be felt immediately. “In our collaboration we had with DeepMind, we had a dataset with a protein sample we’d had for 10 years, and we’d never got to the point of developing a model that fit,” he says. “DeepMind agreed to provide us with a structure, and they were able to solve the problem in 15 minutes after we’d been sitting on it for 10 years.”
Proteins are constructed from chains of amino acids, which come in 20 different varieties in the human body. As any individual protein can be comprised of hundreds of individual amino acids, each of which can fold and twist in different directions, it means a molecule’s final structure has an incredibly large number of possible configurations. One estimate is that the typical protein can be folded in 10^300 ways — that’s a 1 followed by 300 zeroes.
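The scale of that estimate is easier to appreciate with a little arithmetic. The Python sketch below works in base-10 logarithms so the numbers never overflow; the figures of 300 residues, 10 possible states per residue, and a sampling rate of a trillion conformations per second are illustrative assumptions chosen to reproduce the 10^300 figure, not values from DeepMind.

```python
import math


def log10_conformations(n_residues: int, states_per_residue: float) -> float:
    """Return log10 of states_per_residue ** n_residues without overflow."""
    return n_residues * math.log10(states_per_residue)


def log10_years_to_enumerate(log10_count: float, log10_rate_per_second: float) -> float:
    """Return log10 of the years needed to visit every conformation once."""
    seconds_per_year = 365.25 * 24 * 3600
    return log10_count - log10_rate_per_second - math.log10(seconds_per_year)


if __name__ == "__main__":
    # Hypothetical protein: 300 residues, 10 states each -> 10^300 shapes.
    count = log10_conformations(300, 10)
    print(f"about 10^{count:.0f} conformations")
    # Hypothetical search speed: 10^12 conformations per second.
    years = log10_years_to_enumerate(count, 12)
    print(f"about 10^{years:.1f} years to try them all")
```

Even at that generous sampling rate, exhaustive search would take vastly longer than the age of the universe, which is why prediction methods have to learn patterns rather than enumerate shapes.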
Because proteins are too small to examine with microscopes, scientists have had to indirectly determine their structure using expensive and complicated methods like nuclear magnetic resonance and X-ray crystallography. The idea of determining the structure of a protein simply by reading a list of its constituent amino acids has been long theorized but difficult to achieve, leading many to describe it as a “grand challenge” of biology.
In recent years, though, computational methods — particularly those using artificial intelligence — have suggested such analysis is possible. With these techniques, AI systems are trained on datasets of known protein structures and use this information to create their own predictions.
DeepMind’s AlphaFold program has been upgraded since last year’s CASP competition and is now 16 times faster. “We can fold an average protein in a matter of minutes, most cases seconds,” says Hassabis. The company also released the underlying code for AlphaFold last week as open-source, allowing others to build on its work in the future.
Liam McGuffin, a professor at Reading University who developed some of the UK’s leading protein-folding software, praised the technical brilliance of AlphaFold, but also noted that the program’s success relied on decades of prior research and public data. “DeepMind has vast resources to keep this database up to date and they are better placed to do this than any single academic group,” McGuffin told The Verge. “I think academics would have got there in the end, but it would have been slower because we’re not as well resourced.”
Many scientists The Verge spoke to noted the generosity of DeepMind in releasing this data for free. After all, the lab is owned by Google-parent Alphabet, which has been pouring huge amounts of resources into commercial healthcare projects. DeepMind itself loses a lot of money each year, and there have been numerous reports of tensions between the company and its parent firm over issues like research autonomy and commercial viability.
Hassabis, though, tells The Verge that the company always planned to make this information freely available, and that doing so is a fulfillment of DeepMind’s founding ethos. He stresses that DeepMind’s work is used in lots of places at Google — “almost anything you use, there’s some of our technology that’s part of that under the hood” — but that the company’s primary goal has always been fundamental research.
“The agreement when we got acquired is that we are here primarily to advance the state of AGI and AI technologies and then use that to accelerate scientific breakthroughs,” says Hassabis. “[Alphabet] has plenty of divisions focused on making money,” he adds, noting that DeepMind’s focus on research “brings all sorts of benefits, in terms of prestige and goodwill for the scientific community. There’s many ways value can be attained.”
Hassabis predicts that AlphaFold is a sign of things to come — a project that shows the huge potential of artificial intelligence to handle messy problems like human biology.
“I think we’re at a really exciting moment,” he says. “In the next decade, we, and others in the AI field, are hoping to produce amazing breakthroughs that will genuinely accelerate solutions to the really big problems we have here on Earth.”
The key to understanding our basic biological machinery is its architecture. The chains of amino acids that comprise proteins twist and turn to make the most confounding of 3D shapes. It is this elaborate form that explains protein function; from enzymes that are crucial to metabolism to antibodies that fight infectious attacks.
Despite years of onerous and expensive lab work that began in the 1950s, scientists have only decoded the structure of a fraction of human proteins. DeepMind’s AI program, AlphaFold, has predicted the structure of nearly all 20,000 proteins expressed by humans. In an independent benchmark test that compared predictions to known structures, the system was able to predict the shape of a protein to a good standard 95% of the time.
DeepMind, which has partnered with the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI), hopes the database will help researchers to analyse how life works at an atomic scale by unpacking the apparatus that drives some diseases, make strides in the field of personalised medicine, create more nutritious crops and develop “green enzymes” that can break down plastic.
“The applications are actually limited only by our imagination – but at a more fundamental level, the AlphaFold database will increase our understanding of how proteins function, and their role in the fundamental processes of life,” said Prof Edith Heard, the director-general of the EMBL.
“This understanding means we can be better equipped to unravel the molecular mechanisms of life and accelerate our pursuits to protect and treat human health, as well as the health of our planet, and making this tool open access will accelerate the power of research discovery and innovation for scientists around the world.”
“I almost fell off my chair in just excitement and amazement that this longstanding problem of how proteins fold had been solved,” said Prof Ewan Birney, the director of the EMBL-EBI, after the results were first presented in November.
“This dataset is rather like the human genome … and it’s this dataset where we start some new bits of science that we weren’t able to do beforehand. I’m very excited to start walking down that road.”
Computers can now rapidly and reliably predict the 3D shape of most proteins, such as this structure from a fruit fly.
Last week, two groups unveiled the culmination of years of work by computer scientists, biologists, and physicists: advanced modeling programs that can predict the precise 3D atomic structures of proteins and some molecular complexes. And now, the biggest payoff of that work has arrived. One of those teams reports today it has used its newly minted artificial intelligence (AI) programs to solve the structures of 350,000 proteins from humans and 20 model organisms, such as Escherichia coli bacteria, yeast, and fruit flies, all mainstays of biological research. In the coming months, the group says it plans to expand its list of modeled proteins to cover all cataloged proteins, some 100 million molecules.
“It’s pretty overwhelming,” says John Moult, a protein folding expert at the University of Maryland, Shady Grove, who runs a biennial competition called the Critical Assessment of protein Structure Prediction (CASP). Moult says structural biologists have dreamed for decades that accurate computer models would one day augment extremely precise protein shapes derived from experimental methods such as x-ray crystallography. “I never thought the dream would come true,” Moult says.
Both programs use AI to spot folding patterns in vast databases of solved protein structures. The programs compute the most likely structure of unknown proteins by also considering basic physical and biological rules governing how neighboring amino acids in a protein interact. In their paper, Baek and Baker used RoseTTAFold to create a structure database of hundreds of G-protein coupled receptors, a class of common drug targets.
A database of DeepMind’s new protein predictions, assembled with collaborators at the European Molecular Biology Laboratory (EMBL), is freely accessible online. “It’s fantastic they have made this available,” Baker says. “It will really increase the pace of research.”
Because the 3D structure of a protein largely dictates its function, the DeepMind library is apt to help biologists sort out how thousands of unknown proteins do their jobs. “We at EMBL believe this will be transformative to understanding how life works,” says the lab’s director general, Edith Heard.
DeepMind collaborators say AlphaFold2 has already spurred the development of novel enzymes that break down plastics in the environment more quickly than those found previously and led to novel possibilities for drugs to treat neglected diseases. “This will be one of the most important data sets since the mapping of the human genome,” says Ewan Birney, director of EMBL’s European Bioinformatics Institute.
The impacts aren’t likely to stop there. The predictions will help experimentalists who solve structures, Baek says. Data from x-ray crystallography and cryo–electron microscopy experiments can be difficult to interpret, Baek and others say, and having a model can help. “In the short term, it will boost structure determination efforts,” she predicts. “And over time it will also slowly replace [experimental] structural determination efforts.”
If that happens, structural biologists won’t find themselves out of work. Baker notes that both experimental and computational scientists are already beginning to turn their efforts to the more complex challenge of understanding exactly which proteins interact with one another and what molecular changes happen during these interactions. “It’s going to reset the field,” Baker says. “It’s a very exciting time.”
The human genome holds the instructions for more than 20,000 proteins. But only about one-third of those have had their 3D structures determined experimentally. And in many cases, those structures are only partially known.
Now, a transformative artificial intelligence (AI) tool called AlphaFold, which has been developed by Google’s sister company DeepMind in London, has predicted the structure of nearly the entire human proteome (the full complement of proteins expressed by an organism). In addition, the tool has predicted almost complete proteomes for various other organisms, ranging from mice and maize (corn) to the malaria parasite (see ‘Folding options’).
The more than 350,000 protein structures, which are available through a public database, vary in their accuracy. But researchers say the resource — which is set to grow to 130 million structures by the end of the year — has the potential to revolutionize the life sciences.
“It’s totally transformative from my perspective. Having the shapes of all these proteins really gives you insight into their mechanisms,” says Christine Orengo, a computational biologist at University College London (UCL).
“This is the biggest contribution an AI system has made so far to advancing scientific knowledge. I don’t think it’s a stretch to say that,” says Demis Hassabis, co-founder and chief executive of DeepMind.
But researchers emphasize that the data dump is a beginning, not an end. They will want to validate the predictions and, more importantly, apply them to experiments that were hitherto impossible. “It’s an amazing first step, that we have all this data on that scale,” says David Jones, a UCL computational biologist who advised DeepMind on an earlier iteration of AlphaFold.
DeepMind stunned the life-sciences community last year, when an updated version of AlphaFold swept a biennial protein-prediction exercise called CASP (Critical Assessment of Protein Structure Prediction). In this long-running competition, which has traditionally been the domain of academics, researchers predict the structures of proteins whose structures have been experimentally solved, but not yet made public.
With this added efficiency, the DeepMind team set out to predict the structures of nearly every known protein encoded by the human genome, as well as those of 20 model organisms. The structures are available in a database maintained by EMBL-EBI (the European Molecular Biology Laboratory European Bioinformatics Institute) in Hinxton, UK.
Even the less-accurate predictions might offer insights. Biologists think that a large proportion of human proteins and those of other eukaryotes (organisms with cells that have nuclei) contain regions that are inherently disordered and take on a defined structure only in concert with other molecules. “Many proteins are just wiggly in solution, they don’t have a fixed structure,” says AlphaFold lead researcher John Jumper. Some of the regions that AlphaFold predicted with low confidence match up with those that biologists suspect are disordered, says Pushmeet Kohli, head of AI for science at DeepMind.
Determining how individual proteins interact with other cellular players is one of the greatest challenges to the AlphaFold predictions, say researchers. For the CASP competition, most of its predictions were of independently folding units of a protein, called domains. But the human proteome, and those of other organisms, contains proteins with multiple domains that fold semi-independently. Human cells also contain molecules made of multiple chains of interacting proteins, such as receptors on cell membranes.
The approximately 365,000 structure predictions deposited this week should swell to 130 million — nearly half of all known proteins — by the year’s end, says Sameer Velankar, a structural bioinformatician at EMBL-EBI. The database will be updated as new proteins are identified and predictions improved. “This is not a resource you expect to have access to,” says Tunyasuvunakool, and she is eager to see what scientists come up with.
Researchers are already using AlphaFold and related tools to help make sense of experimental data generated using X-ray crystallography and cryo-electron microscopy. Marcelo Sousa, a biochemist at the University of Colorado Boulder, used AlphaFold to make models from X-ray data of proteins that bacteria use to evade an antibiotic called colistin. The parts of the experimental model that differed from the AlphaFold prediction were typically regions to which the software had assigned low confidence, Sousa notes, a sign that AlphaFold is accurately predicting its limits.
Still, biologists will want to continue benchmarking these predictions to experimental data to get a better handle on their reliability, says Venki Ramakrishnan, a structural biologist at the MRC Laboratory of Molecular Biology in Cambridge, UK. “We need to be able to trust these data,” adds Orengo.
Jones is impressed with what the network has achieved. But he says that many of the models predicted by AlphaFold could have been generated with earlier software developed by academics. “For most proteins, those results are probably good enough for quite a lot of the things you want to do.” Scientists dead-set on obtaining the structure of any particular protein could probably succeed using experimental approaches.
But the availability of so many protein structures is likely to mark a “paradigm shift” in biology, says Mohammed AlQuraishi, a computational biologist at Columbia University in New York City who works on protein-structure prediction. His field has spent so much time and energy on predicting accurate protein structures on this scale that it hasn’t yet worked out what to do with such resources. “Everything we do today that relies on a protein sequence, we can now do with protein structure.”
Orengo hopes that the database will help her to better understand the structural constraints of proteins. She has mapped a database of known proteins into about 5,000 ‘structural families’, but about half of the proteins in the database are excluded because there is nothing else like them for which a structure has been determined. AlphaFold’s predictions could help uncover new shapes, she says. “We’ll really see what folding space looks like.”
Jones expects AlphaFold will lead to a lot of soul-searching among biologists about what to do with so many structures — and the ease of creating many more. “There will be conferences. Now we’ve got 130 million models, how does this change our view of biology? It may be it doesn’t change it,” he says. “I suspect it will.”
Jumper, J. et al. Nature https://doi.org/10.1038/s41586-021-03819-2 (2021).
Tunyasuvunakool, K. et al. Nature https://doi.org/10.1038/s41586-021-03828-1 (2021).
22 July, 2021 - 10:06am
Scientists on Thursday unveiled the most exhaustive database yet of the proteins that form the building blocks of life, in a breakthrough observers said would "fundamentally change biological research".
Every cell in every living organism is triggered to perform its function by proteins that deliver constant instructions to maintain health and ward off infection.
Unlike the genome -- the complete sequence of human genes that encode cellular life -- the human proteome is constantly changing in response to genetic instructions and environmental stimuli.
Understanding how proteins operate within cells, and the shapes into which they "fold", has fascinated scientists for decades.
But determining each protein's precise function through direct experimentation is painstaking.
Fifty years of research have until now yielded experimentally determined structures for only 17 percent of the human proteome's amino acids, the subunits of proteins.
On Thursday, researchers at Google's DeepMind and the European Molecular Biology Laboratory (EMBL) unveiled a database of 20,000 proteins expressed by the human genome, freely and openly available online.
They also included more than 350,000 proteins from 20 organisms such as bacteria, yeast and mice that scientists rely on for research.
To create the database, scientists used a state-of-the-art machine learning programme that was able to accurately predict the shape of proteins based on their amino acid sequences.
Instead of spending months using multi-million dollar equipment, they trained their AlphaFold system on a database of 170,000 known protein structures.
The AI then used an algorithm to make accurate predictions of the shape of 58 percent of all proteins within the human proteome.
This more than doubled the number of high-accuracy human protein structures that researchers had identified during 50 years of direct experimentation, essentially overnight.
The potential applications are enormous, from researching genetic diseases and combating anti-microbial resistance to engineering more drought-resistant crops.
Paul Nurse, winner of the 2001 Nobel Prize for Medicine and director of the Francis Crick Institute, said Thursday's release was "a great leap for biological innovation".
"With this resource freely and openly available, the scientific community will be able to draw on collective knowledge to accelerate discovery, ushering in a new era for AI-enabled biology," he said.
John McGeehan, director for the Centre for Enzyme Innovation at the University of Portsmouth, whose team is developing enzymes capable of consuming single-use plastic waste, said AlphaFold had revolutionised the field.
"What took us months and years to do, AlphaFold was able to do in a weekend. I feel like we have just jumped at least a year ahead of where we were yesterday," he said.
The ability to predict a protein's shape from its amino acid sequence using a computer rather than experimentation is already helping scientists in a number of research fields.
AlphaFold is already being used in research into cures for diseases that disproportionately affect poorer countries.
One US-based team is using the AI prediction to study ways of overcoming strains of drug-resistant bacteria.
Another group is using the database to better understand how SARS-CoV-2, the virus that causes Covid-19, bonds with human cells.
Venki Ramakrishnan, winner of the 2009 Nobel Prize for Chemistry, said Thursday's research, published in the journal Nature, was a "stunning advance" in biological research.
He said AlphaFold had essentially solved the so-called "protein-folding problem", the challenge of determining the 3D structure of a given protein from its amino acid sequence, which had puzzled scientists for half a century.
Given that the number of shapes a protein could theoretically take is astronomically large, the protein-folding problem was partly one of processing power.
The task was so daunting that in 1969 US molecular biologist Cyril Levinthal famously theorised that it would take longer than the age of the known universe to enumerate all possible protein configurations using brute calculation.
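The scale of Levinthal's estimate can be reproduced with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the chain length, the number of conformations per amino acid and the sampling rate are textbook-style assumptions, not figures from the article or from Levinthal's own paper.

```python
# Back-of-the-envelope version of Levinthal's argument.
# All three inputs are illustrative assumptions, not values from the article.
RESIDUES = 100               # assumed length of a small protein
STATES_PER_RESIDUE = 3       # assumed conformations available to each residue
SAMPLES_PER_SECOND = 1e13    # assumed rate of trying conformations

SECONDS_PER_YEAR = 3.156e7
AGE_OF_UNIVERSE_YEARS = 1.38e10  # roughly 13.8 billion years


def brute_force_years(residues=RESIDUES,
                      states=STATES_PER_RESIDUE,
                      rate=SAMPLES_PER_SECOND):
    """Years needed to enumerate every conformation one at a time."""
    conformations = states ** residues  # exact integer, e.g. 3**100
    return conformations / rate / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = brute_force_years()
    ratio = years / AGE_OF_UNIVERSE_YEARS
    print(f"Exhaustive search: about {years:.1e} years")
    print(f"That is about {ratio:.1e} times the age of the universe")
```

Even with these generous assumptions, exhaustive enumeration takes many orders of magnitude longer than the universe has existed, which is why brute calculation was never a realistic route to a protein's shape.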
But with AlphaFold capable of performing a mind-dizzying number of calculations every second, the problem stood no chance when faced with AI and algorithms.
"It has occurred long before many people in the field would have predicted," Ramakrishnan said.
"It will be exciting to see the many ways in which it will fundamentally change biological research."
22 July, 2021 - 10:00am
Back in December 2020, DeepMind took the world of biology by surprise when it solved a 50-year grand challenge with AlphaFold, an AI tool that predicts the structure of proteins. Last week the London-based company published full details of that tool and released its source code.
Now the firm has announced that it has used its AI to predict the shapes of nearly every protein in the human body, as well as the shapes of hundreds of thousands of other proteins found in 20 of the most widely studied organisms, including yeast, fruit flies, and mice. The breakthrough could allow biologists from around the world to understand diseases better and develop new drugs.
So far the trove consists of 350,000 newly predicted protein structures. DeepMind says it will predict and release the structures for more than 100 million more in the next few months—more or less all proteins known to science.
“Protein folding is a problem I've had my eye on for more than 20 years,” says DeepMind cofounder and CEO Demis Hassabis. “It’s been a huge project for us. I would say this is the biggest thing we’ve done so far. And it’s the most exciting in a way, because it should have the biggest impact in the world outside of AI.”
Proteins are made of long ribbons of amino acids, which twist themselves up into complicated knots. Knowing the shape of a protein’s knot can reveal what that protein does, which is crucial for understanding how diseases work and developing new drugs—or identifying organisms that can help tackle pollution and climate change. Figuring out a protein’s shape takes weeks or months in the lab. AlphaFold can predict shapes to the nearest atom in a day or two.
The new database should make life even easier for biologists. AlphaFold might be available for researchers to use, but not everyone will want to run the software themselves. “It’s much easier to go and grab a structure from the database than it is running it on your own computer,” says David Baker of the Institute for Protein Design at the University of Washington, whose lab has built its own tool for predicting protein structure, called RoseTTAFold, based on AlphaFold’s approach.
In the last few months Baker’s team has been working with biologists who were previously stuck trying to figure out the shape of proteins they were studying. “There's a lot of pretty cool biological research that's been really sped up,” he says. A public database containing hundreds of thousands of ready-made protein shapes should be an even bigger accelerator.
“It looks astonishingly impressive,” says Tom Ellis, a synthetic biologist at Imperial College London studying the yeast genome, who is excited to try the database. But he cautions that most of the predicted shapes have not yet been verified in the lab.
In the new version of AlphaFold, predictions come with a confidence score that the tool uses to flag how close it thinks each predicted shape is to the real thing. Using this measure, DeepMind found that AlphaFold predicted shapes for 36% of human proteins with an accuracy that is correct down to the level of individual atoms. This is good enough for drug development, says Hassabis.
Previously, after decades of work, only 17% of the proteins in the human body had had their structures identified in the lab. If AlphaFold’s predictions are as accurate as DeepMind says, the tool has more than doubled this number in just a few weeks.
Even predictions that are not fully accurate at the atomic level are still useful. For more than half of the proteins in the human body, AlphaFold has predicted a shape that should be good enough for researchers to figure out the protein’s function. The rest of AlphaFold’s current predictions are either incorrect, or are for the third of proteins in the human body that don’t have a structure at all until they bind with others. “They’re floppy,” says Hassabis.
“The fact that it can be applied at this level of quality is an impressive thing,” says Mohammed AlQuraishi, a systems biologist at Columbia University who has developed his own software for predicting protein structure. He also points out that having structures for most of the proteins in an organism will make it possible to study how these proteins work as a system, not just in isolation. “That’s what I think is most exciting,” he says.
DeepMind is releasing its tools and predictions for free and will not say if it has plans for making money from them in future. It is not ruling out the possibility, however. To set up and run the database, DeepMind is partnering with the European Molecular Biology Laboratory, an international research institution that already hosts a large database of protein information.