Overview

This project aimed to read amino acid sequences in FASTA format and return, for each one, a label sequence indicating whether each amino acid belongs to an alpha-helix secondary structure. This task is often described as a sequence-to-sequence (seq2seq) problem.
Two constraints applied to this project: the saved weights could not exceed 500 kB, and NumPy was the only dependency allowed at inference time.
The final model selected was an RNN with LSTM units. The model was trained using Keras in Google Colab, and the forward pass was implemented in NumPy in the src/SSPred.py file, which can be found in the GitHub repository.
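
To illustrate what a NumPy-only LSTM forward pass involves, here is a minimal sketch. It is not the code in src/SSPred.py. It assumes a single unidirectional LSTM layer with the weight layout that Keras exports (`kernel`, `recurrent_kernel`, `bias`, with gates ordered input, forget, cell, output) and the default tanh/sigmoid activations. The function names and the layer sizes in the usage note are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic function used for the LSTM gates."""
    return 1.0 / (1.0 + np.exp(-x))


def lstm_forward(x, kernel, recurrent_kernel, bias):
    """Run one LSTM layer over a single sequence.

    x:                (timesteps, input_dim) input features
    kernel:           (input_dim, 4 * units)
    recurrent_kernel: (units, 4 * units)
    bias:             (4 * units,)
    Returns the hidden state at every timestep, shape (timesteps, units).
    Assumption: gates are stored in the order input, forget, cell, output.
    """
    units = recurrent_kernel.shape[0]
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state
    outputs = []
    for x_t in x:
        # All four gate pre-activations in one matrix product.
        z = x_t @ kernel + h @ recurrent_kernel + bias
        i = sigmoid(z[:units])                   # input gate
        f = sigmoid(z[units:2 * units])          # forget gate
        g = np.tanh(z[2 * units:3 * units])      # candidate cell state
        o = sigmoid(z[3 * units:])               # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)


def lstm_param_count(input_dim, units):
    """Number of trainable parameters in one LSTM layer."""
    return 4 * (units * (input_dim + units) + units)
```

The parameter count gives a rough way to check a candidate layer against the 500 kB budget. For example, a hypothetical layer with 20 input features and 64 units has `lstm_param_count(20, 64)` = 21,760 parameters, or about 87 kB when stored as 32-bit floats, before counting any embedding or output layers.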

Report Abstract (Full Report)

Fast and reliable protein secondary structure prediction is desirable when the structure of an amino acid sequence has not been resolved using techniques such as X-ray crystallography, nuclear magnetic resonance spectroscopy, or cryo-electron microscopy. This report evaluated the two-state protein secondary structure prediction accuracy of non-parametric, probabilistic, and deep learning methods on a labelled dataset of 5326 FASTA sequences. The models assessed were a k-Nearest Neighbours classifier (KNN), a Categorical Naive Bayes classifier (CNB), a Hidden Markov Model (HMM), and a Recurrent Neural Network (RNN). Techniques from Natural Language Processing (NLP) were also applied, such as an n-gram sequence representation. This representation improved validation accuracy across all models except the HMM, which cannot use an n-gram input.

The RNN achieved a validation accuracy of 72%, the highest among the models. The results of this work suggest that models and pre-processing techniques from the field of NLP may be well suited to protein structure prediction. However, the RNN also has limitations: its parameters take up a large amount of storage, and it requires input sequences of equal length.
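
The n-gram representation and the equal-length requirement mentioned above can be sketched as follows. This is an illustration under assumptions, not the report's pre-processing: it uses overlapping n-grams with a stride of one residue, reserves id 0 for both padding and unseen n-grams, and right-pads every sequence to the longest one in the batch. The function names and the example sequences are hypothetical.

```python
import numpy as np


def to_ngrams(sequence, n=3):
    """Split a residue string into overlapping n-grams (stride of one)."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def build_vocab(sequences, n=3):
    """Map each distinct n-gram to an integer id, starting at 1.

    Id 0 is left free so it can be used for padding.
    """
    vocab = {}
    for seq in sequences:
        for gram in to_ngrams(seq, n):
            vocab.setdefault(gram, len(vocab) + 1)
    return vocab


def encode_and_pad(sequences, vocab, n=3, max_len=None):
    """Encode sequences as n-gram ids and right-pad them to equal length.

    Assumption: unseen n-grams share id 0 with the padding value.
    Sequences longer than max_len are truncated.
    """
    encoded = [[vocab.get(g, 0) for g in to_ngrams(s, n)] for s in sequences]
    if max_len is None:
        max_len = max(len(ids) for ids in encoded)
    batch = np.zeros((len(encoded), max_len), dtype=np.int64)
    for row, ids in enumerate(encoded):
        ids = ids[:max_len]
        batch[row, :len(ids)] = ids
    return batch


sequences = ["MKVLA", "MKV"]  # toy sequences, not from the dataset
vocab = build_vocab(sequences)
batch = encode_and_pad(sequences, vocab)
print(to_ngrams("MKVLA"))  # ['MKV', 'KVL', 'VLA']
print(batch.tolist())      # [[1, 2, 3], [1, 0, 0]]
```

Note that a sequence of length L yields L - n + 1 overlapping n-grams, so a per-residue labelling scheme would need a rule for aligning labels with n-grams; the sketch leaves that choice open.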