Molecules, SMILES and transformers
Recently, I’ve been training ML models to generate molecules, in particular transformers trained on their SMILES representation (see the MolDecod GitHub repo). I thought I would write a short blog post to give some context on what molecules and SMILES strings are, and how we can train transformer models on them.
What are molecules?
Molecules are the fundamental units of chemical compounds. They consist of two or more atoms connected through chemical bonds. The arrangement of these atoms determines the molecule's properties and behavior. Molecules range from simple structures like water (H2O) to complex compounds like caffeine (C8H10N4O2).
Molecules form the basis of life processes, medicines, materials...
What are SMILES?
SMILES (Simplified Molecular Input Line Entry System) is a string notation used to represent molecular structures. It encodes a molecule's graph of atoms and bonds into a one-dimensional string format, making molecular information easier to store, transmit, and process.
SMILES uses ASCII characters to describe molecular structure, including atoms, bonds, branches, and cycles. This system provides a concise way to represent molecules while maintaining readability for both machines and humans. Its compact nature makes it particularly useful for computational chemistry and database applications.
The SMILES representation for caffeine is "CN1C=NC2=C1C(=O)N(C(=O)N2C)C".
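As a quick illustration (a sketch assuming the open-source RDKit library is installed, e.g. via pip), a few lines of Python can parse this string back into a molecular graph and re-export a canonical SMILES:

```python
# Sketch using RDKit, an open-source cheminformatics toolkit (assumed installed).
from rdkit import Chem

caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

mol = Chem.MolFromSmiles(caffeine)   # parse the string into a molecular graph
if mol is not None:                  # MolFromSmiles returns None for invalid SMILES
    print(mol.GetNumAtoms())         # number of heavy atoms (14 for caffeine)
    print(Chem.MolToSmiles(mol))     # canonical SMILES as produced by RDKit
```

The same round trip (parse, then re-export) is also a simple way to check whether a generated string is a valid molecule.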
How can we use them with transformer models?
SMILES strings can serve as valuable input data for training machine learning models, particularly transformer models. Transformers process SMILES as sequences of tokens (often individual characters), much as they process natural language text, leveraging the attention mechanism to capture relationships between different parts of the molecule.
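Concretely, a character-level tokenizer can be as simple as the sketch below (for illustration only; the helper names are hypothetical and MolDecod's actual tokenizer may differ):

```python
# Minimal character-level SMILES tokenizer (illustrative sketch, not MolDecod's).
def build_vocab(smiles_list):
    # Collect every character seen in the corpus, plus special tokens.
    chars = sorted({ch for s in smiles_list for ch in s})
    vocab = ["<pad>", "<bos>", "<eos>"] + chars
    return {tok: i for i, tok in enumerate(vocab)}

def encode(smiles, vocab):
    # Wrap the character sequence with begin/end tokens and map each char to an id.
    return [vocab["<bos>"]] + [vocab[ch] for ch in smiles] + [vocab["<eos>"]]

vocab = build_vocab(["CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "CCO"])
print(encode("CCO", vocab))
```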
Transformer models are well suited to SMILES strings because they can process variable-length inputs and learn long-range dependencies within the molecular structure. This matters for complex molecules, where atoms that are far apart in the SMILES string may still be chemically bonded or interact with each other.
The attention mechanism in transformers allows the model to focus on relevant parts of the SMILES string, like ring structures, functional groups, and their relationships, even if they are separated by many characters in the string.
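As a rough sketch of the mechanism (a toy single attention head in PyTorch, with arbitrary example dimensions), every token of the embedded SMILES sequence computes a weight for every other token and mixes their values accordingly:

```python
import torch
import torch.nn.functional as F

# Toy single-head self-attention over an embedded SMILES sequence (sketch only).
seq_len, d_model = 32, 64                      # arbitrary example sizes
x = torch.randn(1, seq_len, d_model)           # (batch, tokens, embedding dim)

W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)

q, k, v = W_q(x), W_k(x), W_v(x)
scores = q @ k.transpose(-2, -1) / d_model ** 0.5   # similarity between every pair of tokens
weights = F.softmax(scores, dim=-1)                  # attention weights: who attends to whom
out = weights @ v                                    # each token becomes a weighted mix of the others
print(weights.shape)                                 # (1, 32, 32): one row of weights per token
```

In a decoder-only model like the one described below, a causal mask is also applied so that each position only attends to earlier tokens.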
I recently trained a “small” transformer for molecule generation (MolDecod), using a decoder-only (GPT-like) architecture and rotary positional encoding.
With 5 million parameters and trained on 2.7 million molecules, it achieves, at a sampling temperature of 0.5: 98% validity, 95% uniqueness, 85% diversity, and 87% novelty.
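For context on what the temperature means here, this is a minimal sketch of temperature-scaled sampling (assuming PyTorch and a hypothetical vocabulary size; the function name is made up for illustration):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.5):
    # Dividing logits by a temperature < 1 sharpens the distribution,
    # making generation more conservative (typically more valid, less diverse).
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # draw one token id

# Hypothetical example: logits over a 40-token SMILES vocabulary.
logits = torch.randn(40)
print(sample_next_token(logits).item())
```

Repeating this step until an end-of-sequence token is produced yields a full SMILES string, which can then be checked for validity as shown earlier.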
Check out the link below if you want to see more!