If you intend to create predictive models to determine, for example, the probability of a compound interacting with a particular target protein, you will surely be dealing with data that needs cleaning, or shall we say, curation. Depending on your needs, you will always have to perform minor or major processing of your data, whatever the source (ChEMBL, PubChem, etc.), before you can use it effectively.
In the course of performing data curation, I strongly suggest that you make periodic backups of your database, especially whenever you make a major change or a series of small changes. That way, you won't have to go back to square one whenever you make even a small mistake.
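If you keep your working tables in PostgreSQL, a backup can be as simple as a timestamped pg_dump call. Here is a minimal sketch in Python, assuming a hypothetical database name and that pg_dump is on your PATH; adapt it to whatever system you actually use:

```python
# Minimal backup sketch: dump a PostgreSQL database to a timestamped archive.
# "chembl_curation" is a hypothetical database name.
import subprocess
from datetime import datetime

def backup_database(dbname: str = "chembl_curation") -> str:
    outfile = f"{dbname}_{datetime.now():%Y%m%d_%H%M%S}.dump"
    # -Fc writes a compressed, custom-format archive restorable with pg_restore
    subprocess.run(["pg_dump", "-Fc", dbname, "-f", outfile], check=True)
    return outfile

# backup_database()  # call after every major change or batch of small ones
```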
Based on my experience, I will enumerate some simple tips, some of which might even be just common sense, that might help you a little with your endeavor.
Data management and editing tools
With regard to tools or software, basically any database management system (PostgreSQL, SQLite, etc.) can help, as can Microsoft Excel or other spreadsheet applications (open source or otherwise). If you are using Linux, command line-based editing can also be done in combination with those previously mentioned; vim, emacs and other similar editors are available for that. A database management system is usually the best tool you can use, but it has its limitations, so the use of other methods or software applications is also recommended. It really boils down to which approach you are familiar with and can work with ease and efficiency.
Dealing with assay data
There are a lot of databases that can provide you with assay data for machine learning purposes. In our case, we rely on the ChEMBL database. However, as with other databases, ChEMBL data cannot be used as is for this purpose; there is much to be done to tailor it for such. When creating an initial data table, we generally gather the following information: compound id (ChEMBL_ID), target (protein accession), type (IC50, EC50, etc.), relation (=, <, >, <=, >=), value (activity value), unit (nM, mM, etc.), assay id and assay description.
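As an illustration of how such an initial table could be pulled together, here is a sketch assuming a local PostgreSQL load of the ChEMBL dump with its standard schema and a hypothetical connection string; the column aliases are just the names I use in the rest of this post:

```python
# Sketch: assemble the initial assay table from a local ChEMBL PostgreSQL dump.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql:///chembl")  # hypothetical local database

query = """
SELECT md.chembl_id          AS compound_id,
       cs.accession          AS target_accession,
       act.standard_type     AS type,
       act.standard_relation AS relation,
       act.standard_value    AS value,
       act.standard_units    AS unit,
       a.chembl_id           AS assay_id,
       a.description         AS assay_description
FROM activities act
JOIN molecule_dictionary md ON md.molregno = act.molregno
JOIN assays a               ON a.assay_id = act.assay_id
JOIN target_dictionary td   ON td.tid = a.tid
JOIN target_components tc   ON tc.tid = td.tid
JOIN component_sequences cs ON cs.component_id = tc.component_id
"""

initial_table = pd.read_sql(query, engine)
```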
The hard work actually comes after creating the initial table. The following steps discuss how you should proceed once you have the initial table prepared.
Eliminate clutter from your data
You may sometimes, inadvertently, obtain information that will not contribute anything useful to the training of your model. You will not notice such data unless you deliberately look for it and eliminate it. You will notice it eventually even if you don't, but by then you may have already wasted a lot of time processing useless data. I have enumerated two specific steps below regarding this matter.
*Eliminate data that do not contain activity values or units:
This is obviously very important because it will save you a lot of time, especially when you are dealing with a lot of data. The absence of an activity value basically indicates that the record is useless, so there is no need to continue processing it. For data with no unit information, however, the lack of a unit may indicate that the value is a negative log of, for example, IC50 or EC50, that is, a pIC50 or pEC50 value. You then decide how to deal with such data: you could either discard it or convert the values back to molarity units. In my case, I treat every useful data point preciously, so I always make the conversion (see the sketch after this list).
*Eliminate the types of assay data that do not fit your requirements:
Let us say you only require assay data containing IC50 or EC50 values with molarity units. Then you would need to retain only the data that fit these criteria. Be careful, however, with regard to molarity units. For example, the nM (nanomolar) unit can also be expressed as nmol/L, so data containing units written in that way (pmol/L, umol/L, etc.) should also be retained.
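Both clean-up steps can be sketched in pandas, using the hypothetical column names from the earlier query. The pIC50 handling reflects my own choice of converting rather than discarding; confirm that the unit-less values really are negative logs before converting them:

```python
# Sketch: drop useless rows, recover pIC50/pEC50 values, keep only molarity units.
import pandas as pd

MOLARITY_UNITS = {"nM", "uM", "mM", "M", "pM",
                  "nmol/L", "umol/L", "mmol/L", "mol/L", "pmol/L"}

def clean_activities(df: pd.DataFrame) -> pd.DataFrame:
    # Rows with no activity value at all are useless.
    df = df.dropna(subset=["value"]).copy()

    # Rows typed as IC50/EC50 but missing a unit may actually hold pIC50/pEC50.
    # pIC50 = -log10(IC50 in mol/L), so IC50 [nM] = 10 ** (9 - pIC50).
    # Only convert after confirming the values really are negative logs.
    no_unit = df["unit"].isna() & df["type"].isin(["IC50", "EC50"])
    df.loc[no_unit, "value"] = 10 ** (9 - df.loc[no_unit, "value"])
    df.loc[no_unit, "unit"] = "nM"

    # Keep only the assay types and molarity units required.
    return df[df["type"].isin(["IC50", "EC50"]) & df["unit"].isin(MOLARITY_UNITS)]
```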
Make your data uniform
You will not be able to create a model that makes accurate predictions unless you achieve a minimum amount of uniformity among your data. Foremost of these requirements is that all values must be converted so that every record has the same unit. Say, for example, you want to use values with nM units. You should recalculate the values based on the original units and, at the same time, change the unit labels to nM. This is easily done by issuing a single command within a database management system such as PostgreSQL. Of course, you would have to issue a similar command as many times as there are units to convert, which is really not that many.
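Those one-line updates could look like the following sketch, issued here through SQLAlchemy; the table and column names are the hypothetical ones used above, and the loop simply repeats the same statement once per unit:

```python
# Sketch: convert every accepted molarity unit to nM inside the database.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///chembl")  # hypothetical local database

# Multiplicative factors into nM for the units we accept.
TO_NM = {"pM": 1e-3, "pmol/L": 1e-3,
         "uM": 1e3,  "umol/L": 1e3,
         "mM": 1e6,  "mmol/L": 1e6,
         "nmol/L": 1.0}

with engine.begin() as conn:
    for unit, factor in TO_NM.items():
        conn.execute(
            text("UPDATE initial_table "
                 "SET value = value * :factor, unit = 'nM' "
                 "WHERE unit = :unit"),
            {"factor": factor, "unit": unit},
        )
```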
Invalid entries may reduce the amount of data that you can use. You might be able to increase your data a little by finding misspelled entries that become useful once the necessary corrections are made. However, you should not make these corrections unless you are 100% sure and have made the necessary confirmation.
Entry mistakes with regard to case (uppercase, lowercase or a mix of them) are easy to find, and depending on your preference, you may or may not need to correct them. In my case, I usually correct them to the proper, generally used form, which makes them easy to inspect visually.
Entries for relational symbols should also be corrected. It is advisable to convert ambiguous entries, when possible, to one of the following symbols: =, <, >, >=, <=, as in the sketch below.
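A small mapping in pandas is enough for this kind of normalization; the mapping below is only an illustration, so extend it to whatever ambiguous entries actually occur in your data:

```python
# Sketch: normalize relation symbols (and an example case fix) on the DataFrame
# from the earlier sketches.
RELATION_MAP = {
    "=": "=", "~": "=",      # "~" treated as "=" here is an assumption; verify first
    "<": "<", "<<": "<",
    ">": ">", ">>": ">",
    ">=": ">=", "=>": ">=",
    "<=": "<=", "=<": "<=",
}

# Anything not in the map becomes NaN, which makes the leftovers easy to
# inspect manually before deciding what to do with them.
df["relation"] = df["relation"].str.strip().map(RELATION_MAP)

# Example case correction, e.g. "nm"/"NM" -> "nM"; only apply once you are sure.
df["unit"] = df["unit"].str.strip().replace({"nm": "nM", "NM": "nM"})
```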
Dealing with compound data
Most likely, when you create predictive models for different target proteins, you are going to incorporate compound structure information in some form within the predictive model itself. Specifically, compound structure information is incorporated as compound descriptors, which are calculated from compound structure data (SDF, SMILES).
Preprocessing
Before compound descriptors can be generated, compound structure data should undergo preprocessing, normally referred to as standardization. This includes neutralization, desalting, retrieval of the parent structure and other related processes. There are a lot of approaches out there to perform these steps, but in our case, we have adopted the ChEMBL Structure Pipeline, based on RDKit, with some modifications to get our desired results. Details will be discussed in another blog post soon.
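To give an idea of what the unmodified pipeline looks like, here is a minimal sketch assuming the chembl_structure_pipeline package and its standardize_mol / get_parent_mol helpers; it does not include the modifications we apply in our own workflow:

```python
# Sketch: standardize a SMILES string with the (unmodified) ChEMBL Structure Pipeline.
from typing import Optional
from rdkit import Chem
from chembl_structure_pipeline import standardizer

def standardize_smiles(smiles: str) -> Optional[str]:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    std_mol = standardizer.standardize_mol(mol)        # normalization, neutralization
    parent, _ = standardizer.get_parent_mol(std_mol)   # desalting / parent structure
    return Chem.MolToSmiles(parent)

# Example: a sodium carboxylate salt comes back as the neutral parent structure.
print(standardize_smiles("CC(=O)Oc1ccccc1C(=O)[O-].[Na+]"))
```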
Once standardization is complete, the standardized SDF or SMILES is converted to canonical SMILES using the open source application Open Babel. You may ask, “Why should I use Open Babel?” The truth is, you can use any application you are familiar with, as long as you use the same one whenever you prepare your compound structure data, both during model creation and during prediction. Then you may ask, “Why should I always use the same application?” You can also create canonical SMILES with other applications such as RDKit. However, as you will discover, although they all call the output canonical SMILES, different applications canonicalize in different ways and produce different strings. That is why using the same application every time is a must: you will get no surprises when you perform your predictions, and results will almost always be reproducible.
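For reference, here is a sketch of that step assuming Open Babel 3.x with its Python bindings installed (the obabel command line tool gives the same result). The point of wrapping it in one function is the consistency argument above: run both training and prediction structures through the exact same code.

```python
# Sketch: Open Babel canonical SMILES via the pybel bindings.
from openbabel import pybel

def to_canonical_smiles(smiles: str) -> str:
    mol = pybel.readstring("smi", smiles)
    # "can" is Open Babel's canonical SMILES output format; the written string
    # may carry a trailing title/newline, so keep only the first token.
    return mol.write("can").split()[0]
```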
Compound descriptor calculation
There are also lots of methods by which you can generate compound descriptors (Open Babel, RDKit, others). However, we have been using alvaDesc for several years now, and before that its predecessor software, DRAGON. We have found it very reliable for our purpose and believe it has helped us create accurate predictive models. SDF can be used as input for this process, but when dealing with lots of compound data, for example from the ChEMBL database, we simply use the canonical SMILES generated with Open Babel. This saves a lot of time compared to using SDF as input.
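Since alvaDesc is commercial software, here is an illustration of descriptor calculation from SMILES using RDKit, one of the open-source options mentioned above; it is not our actual alvaDesc workflow, just a sketch of the general idea:

```python
# Sketch: physicochemical descriptors and an ECFP4-like fingerprint from SMILES.
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

def descriptors_from_smiles(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # All RDKit 2D descriptors as a name -> value dictionary.
    physchem = {name: fn(mol) for name, fn in Descriptors.descList}
    # Morgan fingerprint with radius 2 (ECFP4-like), 2048 bits.
    ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    return physchem, ecfp4
```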
Dealing with protein data
You might also need to incorporate protein sequence information in your predictive model, which requires converting it to a certain form such as a descriptor. Previously, we used the PROFEAT 2016 web server to generate protein descriptors. However, in a paper we published recently, where we studied the performance of different combinations of compound and protein descriptors, we found that the combination of the alvaDesc compound descriptor and a multiple sequence alignment (MSA) protein descriptor gave the best performance among the combinations tested (ECFP4/MSA, alvaDesc/PROFEAT, ECFP4/PROFEAT, alvaDesc/ProtVec, ECFP4/ProtVec). If you want more details, you can download the paper here.