First of all, it is possible to compute scalar coupling constants using equations from quantum mechanics, but the calculations take an enormous amount of time. Moreover, those techniques are very expensive!
On the other hand, ML engineers are seen as heroes nowadays because, at least according to some people, they can solve almost anything with ML algorithms. Since researchers had long been searching for a simpler way to calculate scalar coupling constants, the members of CHemistry and Mathematics in Phase Space (CHAMPS) hosted a competition on Kaggle named ‘Predicting Molecular Properties’. So, can ML engineers find a simpler way?
Nowadays, AI is being used in many different domains, e.g., coaching non-league football teams and professional tennis players, or creating AI news anchors. As we are always looking for opportunities to apply AI in various domains of knowledge, when we came across this competition on Kaggle, we were instantly excited to jump into the data and get our hands dirty.
Description of the Data:
CHAMPS provided us with a training dataset and a testing dataset.
The training dataset contained 4,658,147 scalar coupling observations for 85,003 unique molecules, and the test dataset contained 2,505,542 scalar coupling observations for 45,772 unique molecules. These molecules contained only the atoms: carbon (C), hydrogen (H), nitrogen (N), fluorine (F), and oxygen (O).
There were 8 different types of scalar coupling: 1JHC, 1JHN, 2JHH, 2JHC, 2JHN, 3JHH, 3JHC, and 3JHN. Fluorine coupling was not represented in this dataset.
J coupling is an indirect interaction between the nuclear spins of 2 atoms in a magnetic field. The number that comes before the J in the J coupling types (1J, 2J, 3J) denotes the number of bonds between the atoms that are coupling.
So 1J, 2J, 3J coupling will have 1, 2, and 3 bonds between the atoms, respectively. If we look at the distribution of the distance between atoms in the different types of couplings below, we can see that 1J has the lowest distance between atoms, and 3J has the highest distance between atoms, with 2J somewhere in the middle.
As the number of bonds between coupled atoms increases, so does the distance between them. The distance feature therefore encodes information about the spatial arrangement of the atoms, which can help a model predict the J coupling constant more accurately.
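The distance feature described above can be computed directly from the atom coordinates. Below is a minimal sketch, assuming the coupling pairs carry `molecule_name`, `atom_index_0`, and `atom_index_1` columns and the structures table carries `molecule_name`, `atom_index`, `x`, `y`, `z` (the layout of the competition's `structures.csv`); the function name is our own for illustration.

```python
import numpy as np
import pandas as pd

def add_distance_feature(pairs: pd.DataFrame, structures: pd.DataFrame) -> pd.DataFrame:
    """Attach the Euclidean distance between the two coupled atoms."""
    for i in (0, 1):
        # Bring in the coordinates of atom i of each pair via a merge.
        coords = structures[["molecule_name", "atom_index", "x", "y", "z"]].rename(
            columns={"atom_index": f"atom_index_{i}",
                     "x": f"x_{i}", "y": f"y_{i}", "z": f"z_{i}"})
        pairs = pairs.merge(coords, on=["molecule_name", f"atom_index_{i}"])
    # Euclidean distance between the two sets of Cartesian coordinates.
    deltas = pairs[["x_0", "y_0", "z_0"]].values - pairs[["x_1", "y_1", "z_1"]].values
    pairs["distance"] = np.linalg.norm(deltas, axis=1)
    return pairs
```
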
Understanding what properties of molecules affect the scalar coupling constant is the key to training a model that can accurately predict these values for future experiments.
Some properties that affect the scalar coupling constants are:
- Distance: the Euclidean distance between the given Cartesian coordinates of the two atoms.
- N_bonds: the number of bonds on a specific atom.
- J-type: one of the 8 possible coupling types, as seen in the previous figure.
In addition to these properties, we engineered many statistical features based on the geometric features provided for all the molecules. We also incorporated various features generated by other Kagglers who generously made their notebooks public. For generating chemical features, we used a Python library named ‘Open Babel’.
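The statistical features were mostly per-group aggregates of geometric quantities. Here is a minimal sketch of the idea, assuming a `distance` column already exists; the exact column names and aggregates are illustrative, not the precise set we used.

```python
import pandas as pd

def add_group_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Add aggregate statistics of the distance feature, grouped two ways:
    per molecule and per coupling type."""
    for keys, label in [(["molecule_name"], "molecule"), (["type"], "type")]:
        grouped = df.groupby(keys)["distance"]
        # transform() broadcasts each group statistic back to every row.
        df[f"dist_mean_{label}"] = grouped.transform("mean")
        df[f"dist_std_{label}"] = grouped.transform("std")
        df[f"dist_min_{label}"] = grouped.transform("min")
        df[f"dist_max_{label}"] = grouped.transform("max")
    return df
```
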
We used 2 categories of algorithms: gradient-boosting algorithms and graph neural networks.
From early on, we saw that results improved when we trained 8 separate models, one for each J-type. These 8 models have a similar architecture but different parameter values, because each model is trained only on the data for its J-type.
We tweaked the models' parameters, added new features, and so on to improve the score, and then used graph neural networks to improve it further.
We produced many sets of predictions, selected the 12 best, ensembled them, and submitted the ensembled output to Kaggle.
The ensembled predictions always scored better than any stand-alone set.
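Our ensembling step blended the selected prediction sets. A simple average is the most basic version of this, sketched below under the assumption that all prediction arrays are aligned row-for-row with the submission file; weighted blends are a common refinement.

```python
import numpy as np

def ensemble(prediction_sets):
    """Blend several aligned prediction arrays by simple averaging."""
    stacked = np.vstack(prediction_sets)  # shape: (n_models, n_rows)
    return stacked.mean(axis=0)
```
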
Difficulties We Faced:
This competition required a lot of domain knowledge in fields related to molecular chemistry. Moreover, it is apparent from the solutions published by the top scorers that outputs from graph neural networks beat those of other types of models. Since we didn't start out with graph neural networks, we didn't have much time to refine our results with them. We think our score would have been better if we had gotten the most out of our graph neural networks.
Evaluation of the Outputs:
Submissions were evaluated on the log of the mean absolute error (MAE), calculated for each scalar coupling type and then averaged across the types:

score = (1/T) · Σₜ log(MAEₜ)

where T is the number of coupling types. This way, a 1% decrease in MAE for one type provides the same improvement in the score as a 1% decrease for another type.
The formula simply means that a lower value of the score (note the negative sign) indicates better model performance! For example, −1.90 is better than −1.59 because −1.90 is less than −1.59. Okay, now let's have a look at how our models performed:
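The metric above is straightforward to implement for local validation. Below is a minimal sketch; the function name is our own.

```python
import numpy as np

def group_log_mae(y_true, y_pred, types):
    """Competition metric: log of the MAE per coupling type,
    averaged over all types present."""
    y_true, y_pred, types = map(np.asarray, (y_true, y_pred, types))
    scores = []
    for t in np.unique(types):
        mask = types == t
        mae = np.abs(y_true[mask] - y_pred[mask]).mean()
        scores.append(np.log(mae))
    return float(np.mean(scores))
```
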
| Model | Description | Best public score on Kaggle |
| --- | --- | --- |
| XGBoost 1 | A single model with only the given features | |
| LGBM | A single model with only statistical features | |
| AdaBoost | A single model with only statistical features | |
| Graph neural network | A single model with only the given features | |
| XGBoost 2 + GNN | Ensemble of the XGBoost 2 and GNN models | |
Final score on Kaggle:
The competition organizers evaluate the outputs using two different scores:
- Outputs submitted before the submission deadline are scored on a secret subset of the test dataset. This subset is called the public dataset, and the score on it is the public score.
- Similarly, another secret, disjoint subset of the test dataset, called the private dataset, is used to evaluate outputs after the submission deadline. The score on the private dataset is called the private score.
So, here are our public and private scores:
| | Score | Rank |
| --- | --- | --- |
| Public score | -1.90 | 294 (out of 2749 teams) |
| Private score | -1.89 | 299 (out of 2749 teams) |
So, the scores show that our team finished in the top 11% of all the teams that participated in this competition!
We had to deal with a lot of challenges in this competition. First of all, it required some background and experience in molecular chemistry. Despite having little background in the field, we predicted the outputs quite well! This gives us confidence that we can make progress even on datasets where we lack detailed domain know-how.
Beyond the data itself, we had to tweak several types of boosting algorithms and models, which broadened our knowledge. Now we can readily apply these boosting algorithms to other types of datasets in our work.
Moreover, we constantly fought against a scarcity of time. We had to schedule our tasks in advance to get the most out of the limited time available, so this competition improved our time-management skills as well!
Last but not least, we worked with graph neural networks, a type of machine learning model that was completely new to us and that we had not applied to any project before. Now that we have some experience with graph neural networks, we can use them in our future projects!
Special thanks to Hiperdyne and Koozyt for allowing us to participate in this competition!