Exercise 4 - Evolution and Bioinformatics

In this exercise, you will reflect on some of the principles of evolution covered in the lecture and on how they shape the way we use machine learning tools.

Part 1: Language models and alignments

We talked a lot about protein language models in the lecture. Read through this blog post and then work through this notebook. Which tasks does the model tackle? Do you think it performs well? What other tasks could you imagine using this model for?

After this, read the FoldSeek paper. FoldSeek is an algorithm for aligning and searching protein structures, a task that is significantly harder than sequence alignment because structures have to be compared as 3D objects rather than as linear strings of letters. How do the authors still manage to align structures efficiently (what is their trick)? What are the advantages and disadvantages of their approach?

Part 2: Bioinformatics Coding

There are many different packages that people use in bioinformatics. The one we will use in this exercise is called Biotite. We use it because it is highly efficient, very well written and clearly documented. For more details about the background of the package, you can watch the optional resource videos on the lesson page.

First, go through the main tutorial and try to understand how the package is structured. Why do you think the authors chose to separate different parts of the application into different namespaces, such as structure or sequence? What are the advantages and disadvantages of this design?

On the examples page, you can find many example usages for Biotite, for example for pairwise sequence alignment or for protein property visualisation. Choose one of the following tasks:

  1. Find a different example on the examples page and explain what it does and how it works.
  2. Align the three GTPases HRas, KRas and NRas, following the alignment tutorial on the examples page, and describe how the conserved and mutated positions you see relate to the biology. If you want to learn more about UniProt (the database where the sequences come from), you can watch this video.
  3. Visualise the hydropathy of the protein 4m48 using the example code. What do you observe? How does this relate to the structure and function of the protein?
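
If you pick task 3, it can help to see what a hydropathy plot computes before you run the Biotite example. The following is a minimal, dependency-free sketch and not the code from the examples page. It assumes the Kyte-Doolittle hydropathy scale and a sliding window of 7 residues; the example you adapt may use a different scale or window size. The function name hydropathy_profile and the toy sequence are made up for illustration and have nothing to do with 4m48.

```python
# Kyte-Doolittle hydropathy scale (Kyte & Doolittle, 1982).
# Positive values are hydrophobic, negative values are hydrophilic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Return the mean hydropathy of every full window along the sequence.

    Hypothetical helper for illustration: the result has
    len(sequence) - window + 1 entries, one per window start position.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(scores[start:start + window]) / window
        for start in range(len(scores) - window + 1)
    ]


# Toy sequence (NOT a real protein): a hydrophobic stretch
# flanked by charged residues on both sides.
toy = "KKDEE" + "LLIVVAALLIV" + "EEDKK"
profile = hydropathy_profile(toy, window=7)

# Index of the window with the highest mean hydropathy.
peak = max(range(len(profile)), key=profile.__getitem__)
print(f"{len(profile)} windows, most hydrophobic window starts at residue {peak + 1}")
for start, value in enumerate(profile, start=1):
    print(f"{start:3d} {value:6.2f}")
```

Averaging over a window smooths out single residues, so long stretches of high values stand out. When you look at your plot for 4m48, think about how the length of such stretches compares to the structural elements you see in the protein.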