Prediction of potential drug ligands that bind to Beta-Lactamases

September 7, 2021

I developed this project as the midterm assignment for the Machine Learning Zoomcamp. Data Professor proposed the idea and the dataset as part of a collaborative Open Bioinformatics Research Project that is still in progress.

Figure 1. 3D structure of a beta-lactamase (PDB ID: 2q9n). Retrieved from https://commons.wikimedia.org/wiki/File:PDB_2q9n_EBI.png. Used under a CC0 licence.

Background

This project aims to evaluate the activity of molecules that have been experimentally tested for binding to beta-lactamases. Some of these enzymes allow multidrug-resistant bacteria, or superbugs, to inactivate a wide range of penicillin-like antibiotics, which is one mechanism of antimicrobial resistance (AMR). According to the World Health Organization, AMR is one of the top ten global public health threats facing humanity, so it is important to search for compounds that could combat these superbugs. You can find detailed information about AMR and beta-lactamases in this blog.

Dataset

The dataset for this project consists of 136 CSV files containing information on interactions between small molecules and beta-lactamases.
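A dataset split across many CSV files is usually combined into a single table before any analysis. The following is a minimal sketch of that step using only the Python standard library. The function name `load_csv_records` and the added `source_file` key are hypothetical, and nothing here assumes the actual column names of the project's files.

```python
import csv
from pathlib import Path


def load_csv_records(directory):
    """Read every *.csv file in `directory` into one list of row dicts."""
    records = []
    # Sort the paths so the result does not depend on filesystem ordering.
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Hypothetical extra key: remember which file a row came from.
                row["source_file"] = path.name
                records.append(row)
    return records
```

Calling `load_csv_records("data")` on a folder of CSV files (the folder name is only an example) would return one list with a dictionary per row, which can then be filtered or passed to a data-frame library.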

Data preparation and feature matrix

The feature matrix used to train the machine learning models was obtained by calculating molecular descriptors from the canonical SMILES of the molecules. The descriptors used here are molecular fingerprints: property profiles of molecules, represented as vectors in which each element encodes the presence or the frequency of a structural feature. I extracted the fingerprints from the SMILES strings with the PaDEL software, following the instructions in this video.
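To make the idea of a fingerprint vector concrete, here is a toy illustration of the difference between a presence (binary) vector and a frequency (count) vector. It is not how PaDEL works: real fingerprints match substructures on the molecular graph, whereas this sketch only searches for text patterns in a SMILES string. The feature list `TOY_FEATURES` and the function `toy_fingerprint` are hypothetical.

```python
# Hypothetical feature list. Real fingerprints use proper substructure
# matching on the molecular graph, not plain text search in SMILES.
TOY_FEATURES = ["C(=O)O", "N", "c1ccccc1", "S"]


def toy_fingerprint(smiles, count=False):
    """Return one vector element per feature in TOY_FEATURES."""
    counts = [smiles.count(pattern) for pattern in TOY_FEATURES]
    if count:
        return counts  # frequency of each feature
    return [1 if c > 0 else 0 for c in counts]  # presence of each feature


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
binary_fp = toy_fingerprint(aspirin)             # [1, 0, 1, 0]
count_fp = toy_fingerprint(aspirin, count=True)  # [2, 0, 1, 0]
```

Stacking one such vector per molecule gives a feature matrix with one row per molecule and one column per structural feature.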

PaDEL offers 12 fingerprint types, but for this project I calculated 10 of them, because KlekotaRothFingerprintCount and KlekotaRothFingerprinter took too long to compute. The target protein was beta-lactamase AmpC.

Machine Learning Models

For this project, I tested three machine learning models (Logistic Regression, Random Forest, and XGBoost) on a binary classification task. I chose the pChEMBL value as the target variable. To fine-tune hyperparameters, I used scikit-learn's GridSearchCV class.
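The tuning step can be sketched as follows, under several assumptions that do not come from the project itself: the data is synthetic, the activity cutoff of 6.0 used to turn pChEMBL values into binary labels is hypothetical, and the parameter grid is only an example. Only a Random Forest is shown, but the same pattern applies to the other models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)

# Synthetic stand-in for a fingerprint matrix: 200 molecules x 32 binary bits.
X = rng.integers(0, 2, size=(200, 32))

# Synthetic pChEMBL-like values that depend on the first four bits plus noise.
pchembl = 4.0 + X[:, :4].sum(axis=1) + rng.normal(0.0, 0.5, size=200)

# ASSUMPTION: molecules at or above this cutoff are labelled active (1).
ACTIVITY_CUTOFF = 6.0
y = (pchembl >= ACTIVITY_CUTOFF).astype(int)

# Example grid only; the values searched in the project are not stated here.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # area under the ROC curve, averaged over the folds
    cv=3,               # 3-fold cross-validation
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

GridSearchCV fits one model per parameter combination and fold, so the grid is kept small here; larger grids increase the running time multiplicatively.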

Additional information

Further details are available in the GitHub repository of this project, including the Jupyter notebook with the exploratory data analysis and the selection of the best model, the Python scripts for training and validation, the implementation of the best model as a web service using Flask, and its deployment to the cloud with Heroku.

Categories:
Drug Discovery Machine Learning Bioinformatics
Tags:
Python