Prediction of potential drug ligands that bind to Beta-Lactamases
September 7, 2021
I developed this project as the midterm assignment for the Machine Learning Zoomcamp. The idea and the dataset were proposed by Data Professor as part of a collaborative Open Bioinformatics Research Project that is still in progress.
Figure 1. 3D structure of a beta-lactamase (PDB ID: 2q9n). Retrieved from https://commons.wikimedia.org/wiki/File:PDB_2q9n_EBI.png. Used under a CC0 licence.
Background
This project aims to predict the activity of molecules that have been experimentally tested for binding to beta-lactamases. Some of these enzymes allow multidrug-resistant bacteria, or superbugs, to inactivate a wide range of penicillin-like antibiotics, a form of antimicrobial resistance (AMR). According to the World Health Organization, AMR is one of the top ten global public health threats facing humanity in this century, so it is important to search for compounds that can combat these superbugs and help prevent AMR. You can find detailed information about AMR and beta-lactamases in this blog.
Dataset
The dataset for this project consists of 136 CSV files describing interactions between small molecules and beta-lactamases.
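With this many files, a first step is usually to read them into a single table. The snippet below is a minimal sketch of that idea using only the standard library. The function name `load_csv_folder`, the folder layout, and the added `source_file` column are assumptions for illustration; the post does not describe how the files were actually combined.

```python
import csv
from pathlib import Path


def load_csv_folder(folder):
    """Read every *.csv file in `folder` and return one list of row dicts.

    Each row also records the name of the file it came from, which makes
    it easier to trace a measurement back to its source file.
    """
    rows = []
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = path.name  # hypothetical extra column
                rows.append(row)
    return rows
```

Calling `load_csv_folder("data")` on a hypothetical `data` directory would return one dictionary per interaction record, keyed by whatever column headers the files contain.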
Data preparation and feature matrix
The feature matrix used to train the machine learning models was obtained by calculating molecular descriptors from the canonical SMILES of the molecules. The descriptors used here are molecular fingerprints: property profiles of molecules, represented as vectors in which each element encodes the presence or the frequency of a structural feature. The fingerprints were extracted from the SMILES with the PaDEL software, following the instructions from this video.
PaDEL offers 12 fingerprint types, but for this project I calculated only 10 of them, because KlekotaRothFingerprintCount and KlekotaRothFingerprinter took too long to compute. The target protein in this project was beta-lactamase AmpC.
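To make the idea of a fingerprint vector concrete, here is a toy sketch of a presence vector and a count vector. It is not how PaDEL works: real fingerprints use proper substructure matching on the molecular graph, whereas this example only searches for fragment text inside the SMILES string. The `FEATURES` list and both function names are made up for illustration.

```python
# Hypothetical structural features, written as SMILES fragments.
# Plain substring search is only a stand-in for real substructure
# matching, so this is an illustration of the vector format, nothing more.
FEATURES = ["C(=O)O", "c1ccccc1", "N", "S", "Cl"]


def presence_fingerprint(smiles):
    """Binary vector: 1 if the fragment text occurs in the SMILES, else 0."""
    return [int(fragment in smiles) for fragment in FEATURES]


def count_fingerprint(smiles):
    """Count vector: non-overlapping occurrences of each fragment text."""
    return [smiles.count(fragment) for fragment in FEATURES]


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(presence_fingerprint(aspirin))  # [1, 1, 0, 0, 0]
print(count_fingerprint(aspirin))     # [2, 1, 0, 0, 0]
```

Stacking one such vector per molecule gives a feature matrix with one row per molecule and one column per structural feature, which is the shape a classifier expects.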
Machine Learning Models
For this project, I tested three machine learning models on a binary classification task: Logistic Regression, Random Forest, and XGBoost. I chose the pChEMBL value as the target variable. To fine-tune the hyperparameters, I used scikit-learn's GridSearchCV class.
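The following is a minimal sketch of a grid search for one of these models, run on synthetic data so that it is self-contained. Several things here are assumptions and not taken from the project: the post does not say how the continuous pChEMBL value was turned into class labels, so the cutoff of 6.0 is purely illustrative, and the parameter grid, the ROC AUC scoring, and the random data are likewise placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a fingerprint matrix: 200 molecules x 30 bits.
X = rng.integers(0, 2, size=(200, 30))

# Synthetic stand-in for pChEMBL values, loosely tied to the first 3 bits.
pchembl = 5.0 + 0.6 * X[:, :3].sum(axis=1) + rng.normal(0, 0.3, size=200)

# ASSUMPTION: a molecule counts as active when pChEMBL >= 6.0.
ACTIVE_CUTOFF = 6.0
y = (pchembl >= ACTIVE_CUTOFF).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical grid over the regularisation strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="roc_auc",
    cv=5,  # 5-fold cross-validation on the training split
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
# score() uses the scoring passed above, so this is ROC AUC.
print("held-out ROC AUC:", search.score(X_test, y_test))
```

The same pattern applies to the other two models by swapping the estimator and the grid, for example a grid over `n_estimators` and `max_depth` for a random forest.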
Additional information
The GitHub repository of this project contains the complete material: the Jupyter notebook with the exploratory data analysis and the selection of the best model, the Python scripts for training and validation, the implementation of the best model as a web service using Flask, the deployment to the cloud with Heroku, and further details.
- Categories:
- Drug Discovery, Machine Learning, Bioinformatics
- Tags:
- Python