Tartu scientists study 'dark matter' of proteine universe

Scientists at the University of Tartu have used a large-scale computational effort to identify and annotate new families of proteins that were not visible to science before. What helped to make this leap to the unknown was the computational tool that illuminates the “dark matter” of the natural protein universe with high predictive accuracy.

Tanel Tenson, a molecular biologist and professor of biotechnology at the University of Tartu, explained that the “atlas of everything” reveals the entire network of interconnections in the protein universe, which makes it possible to abandon the previous method of studying proteins one at a time, and discover hitherto completely unknown families of proteins.

By exploring the network of “dark matter,” Tenson and his colleagues discovered 290 putative new protein families and defined a new superfamily of toxin-antitoxin systems, which they also experimentally validated.

“It is curious how every time the gene technology makes a leap forward, it has resulted from studies on how bacteria defend themselves against viruses,” he said. “In this study we have also predicted a new family of proteins that protects bacteria from viruses.”

Proteins and gene expression

The genes in our DNA determine how our body makes all the biomolecules it needs for its functions.

The process of expressing a gene – producing its corresponding protein – begins with “transcription,” the transfer of DNA information to a messenger molecule (mRNA), and ends with the process called “translation,” or the synthesis of an actual protein molecule, Tenson explained.

“Genes have to be ‘transcribed’ and ‘translated’ into proteins because it is a protein that does all the work for us,” he said. “The protein sequences are the active ones. Our genetic material on its own won’t do anything, so it needs to be converted into proteins, or rather into the amino acid sequences of proteins.”

“And, of course, when people discovered that we have DNA – a genetic material – and that we have the protein sequences, which are derived from it in this way, scientists started working on deciphering the essential proteins, one by one.”

“They started with those considered the most important – e.g., insulin, a hormone that regulates our uptake of glucose, or hemoglobin, which transports oxygen in our blood. So, one by one, we started to accumulate this knowledge,” he went on.

“What is new about our study, however, is that we decided to ask not what we want to know, like a particular type of protein, but what is out there that we still know nothing about, but might be very important to us,” Tenson explained about the team’s collaborative effort now published in Nature.

Protein families as ancestry trees

“So scientists started to sequence the DNA, or genomes, and we can now really understand some pieces of the inheritable material. We have more and more of these sequences.

“But the problem is that usually sequences from protein families also collect evolutions, so to speak, they collect mutations. And so we actually have something like a protein family tree. They are essentially like human family trees, drawn in the same way you would draw your ancestral tree,” Tenson explained.

“Since each [protein expression] is unique, there is always an ancestor protein with an ancestor sequence, and when they are inherited, they also undergo sequence changes.”

Tenson and his team set out to evaluate the entire protein universe to identify previously unstudied “family trees” and attempt to understand their biological functions.

“We have studied some protein families extensively, but we can also find protein families derived from our DNA sequences, which we know are there, and nobody knows what they might be doing,” he outlined.

Dark matter of the protein universe

The scientists have now developed the computational tool and used it to evaluate large amounts of data to discover entirely new patterns of proteins, whose function would be explained – or annotated – with the help of the same tool, essentially by seeing and analyzing the interconnected patterns made now visible in the big picture of the “protein universe.”

“What is the big picture of protein families? What do the families that no one has ever studied do? Because there is evidence of them in our DNA sequence, and we can ‘see’ them, so now do you go about studying what you know nothing about?” he asked.

One might wonder why there are so many visible protein structures that are entirely “dark” to science. “Because there are simply so many of them. People study only those that they consider most important,” the professor explained.

The team demonstrated the success of their method with the discovery of an entirely new family of proteins, as well as the explanation of their function.

“We had a couple of examples that we validated: So from the mere gene sequence, we had some early suspicion what they might be doing, we added some experimental parts based on the database of the dark matter, and discovered interesting ideas about what these might be doing. We also tried it out experimentally and it just worked out,” Tenson said.

a, Starting from the clusters in UniRef50, we collected all the functional annotations for all included UniProtKB and UniParc entries, including domain (D), coiled-coil (CC) and intrinsically disordered (IDPs) predictions and excluding all of those with putative, hypothetical, uncharacterized and DUF in their names. C_x corresponds to the coverage of an annotation, C_i corresponds to the functional brightness across the entire sequence. We selected the protein with the highest full-length annotation coverage (that is, brightness, C_i) as the functional representative of each cluster. b, From the collected UniRef50 clusters, we selected those with a structural representative with pLDDT greater than 90 in the AFDB v.4, and constructed a large-scale sequence similarity network by all-against-all MMseqs2 searches, representing the sequence landscape of more than 6 million UniRef50 clusters.

Prediction success, or how to find things in the dark?

“We have annotated the big part of our natural protein universe, but most of it is still this dark matter – so we have no idea what these sequences do. You see things, but you don’t know what they are,” he went on.

“We constructed the ‘dark matter atlas’ solely from the data in our genomic sequences, and while we were working on this database something interesting happened, we realized that we had an idea what some of these non-annotated protein families might be doing,” he said.

So from the large-scale picture that started to emerge the team could “predict” protein families which no one ever studied before and what the biological functions of these families might be.

“Of course, there are too many unknowns. We can’t verify experimentally all of them, but we are working with one protein class in viruses now.”

AI cannot tell you things that you don’t know

We are entering an era of protein sequence and structure annotation where hundreds of millions of predicted protein structures are now available.

But for many proteins, science has little information about their biological functions. “The proteins can be predicted from the genomic sequences, but we have no idea what they do. That’s the dark matter – we don’t know what the proteins are doing,” Tenson explained regarding the difficulty.

“Normally, when you sequence a new genome, you can annotate proteins by homology: It’s like if you have a piece of writing, and then you go and make some changes to it, but then you go back and compare the two versions. You know that one is an edited version of the other. So this is a homology-based approach to protein annotation,” he said.

But with so many new protein sequences becoming visible, many are too difficult to annotate using standard homology-based approaches, and “then we have this other part of the universe that we can’t annotate at all – dark matter – because using homology, we don’t find any similar proteins that have been already studied. So that’s really dark matter.”

“Also, for a long time, we were not very good at predicting the three-dimensional structures of proteins. We talked about the amino acid sequences, but to be active, they fold into three-dimensional structures, which was a really difficult step: We had the protein sequences, but we had to also understand what the protein structure looked like. All of our experimental methods in this respect are very time-consuming,” Tenson continued.

“You can attempt to apply various methods, but it’s a laborious process. In this regard, the AI and the AlphaFold tool was a real breakthrough. Essentially, it takes into account all previously determined protein structures, including three-dimensional structures, and from this, the AI can often predict the 3D structures of new proteins as well,” Tenson explained.

In their study, the scientist examined the extent to which the AlphaFold has made these “dark matter” predictions accurate.

Uniport is a database of all the knowledge on proteins that there is, it has all the annotated proteins known to science, but not yet their 3D structures. And there is the AlphaFold, which is itself not a database, but a tool trained on various databases that already also contain 3D structures.

“AlphaFold was quite a big breakthrough; however, the AI can only predict similarities, not entirely new families of proteins hitherto unknown,” Tenson went on.

“As with any AI, you need a training set, the AI can’t tell you things you don’t know, so you need to have something similar in the database, and then the AI will be much better at predictions,” he said.

Visualized and searchable ‘dark matter’

The new interactive, annotated, and searchable web database that Tenson and his colleagues have compiled, is available here.

“The tool is still developing in the sense that when we get more data, we can put more data in, and then we will get better predictions as well. So the more sequences we put in, the more information we get out, and the more experimental data we put in, the more ideas about the unknown will become available,” Tenson said.

“We are mostly partnering with people interested in antibiotics and antibiotic resistance. And we have people affiliated with us, but I think that the core team is about 15 people. We have been collaborating with health practitioners quite a lot as well,” he added.

“Unfortunately, the tool is not automatic at the moment, but I hope it will be fully integrated with the Uniprot database, for example, so that users don’t need to go into a separate database but have access to it immediately,” he went on.

Why it is challenging to integrate is a very good question, he added. “If you have ever worked with multiple databases managed by different institutions, you understand what it takes… now we are talking about the personalized e-state in Estonia, and our minister says that we need to allocate €200 million to take the next step,” Tenson made a comparison of the investments necessary to fully develop such integration.

We’re not at the full picture of the protein universe yet, “but we’re getting there; it’s like a curve that’s moving closer and closer to saturation, so we’re not there yet, but I think we’re now moving rapidly toward that end,” he said.

How to find things in the dark?

“We have already figured out so many details, now we need to have a large-scale picture,” Tenson said.

“Similar attempts to ours have been made before only with the annotated protein sequences,” he added.

The new tool shows the universe that the protein inhabits and all its other families.

“If you’re interested in a protein or have the genomic sequence, then use our tool to figure out what it’s doing,” the professor explained, on how to use the new tool.

“So if people sequence the genome of a new organism, it could help to get ideas flowing about all the protein families that constitute its dark matter. The sequencing power is already so large and it’s growing exponentially, and we don’t want to stop at just seeing these sequences, we also want to know what these sequences are doing.”

“You can enter your protein sequence into the tool’s search engine, or you could also enter the genome sequence, and then you at least know what part of the universe your protein is in,” he said. “You will get help with ideas about which experiments you could do to annotate these sequences.”

This article was originally published on the Estonian Public Broadcasting online news portal ERR . Written by Kristina Kersa.

If the world of proteins fascinates you and their role in life excites you, dive into more protein-powered discoveries on our webpage and read more about how we could be Boosting Meat Quality with Plant-Based Ingredients!