data

smiles_data_layer

class data.smiles_data_layer.SmilesDataset(*args, **kwds)[source]

Bases: torch.utils.data.dataset.Dataset

Creates a dataset for SMILES-property data.

Parameters
  • filename (str) – full path to the dataset file. The dataset file must be a csv file.

  • cols_to_read (list) – list specifying which columns to read from the dataset file. Can be of any length; cols_to_read[0] is used as the index of the SMILES column, and cols_to_read[1:] as indices of the label columns.

  • delimiter (str) – column delimiter in filename. Default is ','.

  • tokens (list) – list of unique tokens from SMILES. If not specified, tokens will be extracted from the provided dataset.

  • pad (bool) – whether to pad SMILES. If True, SMILES are padded from the right and then flipped. Default is True.

  • augment (bool) – argument specifying whether to augment SMILES.
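The cols_to_read convention above can be sketched with plain csv parsing. This is an illustration of the column-index semantics only, not the SmilesDataset implementation; the function name read_smiles_labels is hypothetical.

```python
import csv
import io

def read_smiles_labels(text, cols_to_read, delimiter=","):
    # Hypothetical sketch: cols_to_read[0] selects the SMILES column,
    # cols_to_read[1:] select the label columns, as described above.
    smiles, labels = [], []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        smiles.append(row[cols_to_read[0]])
        labels.append([float(row[i]) for i in cols_to_read[1:]])
    return smiles, labels

data = "CCO,0.5,1\nc1ccccc1,1.2,0\n"
smi, lab = read_smiles_labels(data, cols_to_read=[0, 1])
# smi == ["CCO", "c1ccccc1"], lab == [[0.5], [1.2]]
```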

graph_data_layer

smiles_protein_data_layer

class data.smiles_protein_data_layer.SmilesProteinDataset(*args, **kwds)[source]

Bases: torch.utils.data.dataset.Dataset

vanilla_data_layer

class data.vanilla_data_layer.VanillaDataset(filename, cols_to_read, features, delimiter=',', tokens=None)[source]

Bases: object

smiles_enumerator

class data.smiles_enumerator.Iterator(n, batch_size, shuffle, seed)[source]

Bases: object

Abstract base class for data iterators.

# Arguments

n: Integer, total number of samples in the dataset to loop over.
batch_size: Integer, size of a batch.
shuffle: Boolean, whether to shuffle the data between epochs.
seed: Random seed for data shuffling.
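The n / batch_size / shuffle / seed contract can be sketched as a minimal index iterator. This is an assumption-laden sketch (class name IndexIterator is hypothetical), not the actual Iterator implementation.

```python
import random

class IndexIterator:
    # Minimal sketch: loop over n sample indices in batches,
    # optionally reshuffling at the start of each epoch.
    def __init__(self, n, batch_size, shuffle=False, seed=None):
        self.n, self.batch_size, self.shuffle = n, batch_size, shuffle
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.order = list(range(self.n))
        if self.shuffle:
            self.rng.shuffle(self.order)
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= self.n:
            self.reset()  # start a new epoch
        batch = self.order[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return batch

it = IndexIterator(n=5, batch_size=2, shuffle=False)
# next(it) → [0, 1], then [2, 3], then [4]
```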

reset()[source]
class data.smiles_enumerator.SmilesEnumerator(charset='@C)(=cOn1S2/H[N]\\', pad=120, leftpad=True, isomericSmiles=True, enum=True, canonical=False)[source]

Bases: object

SMILES Enumerator, vectorizer and devectorizer.

# Arguments

charset: string containing the characters for the vectorization; can also be generated via the .fit() method.
pad: Length of the vectorization.
leftpad: Add spaces to the left of the SMILES.
isomericSmiles: Generate SMILES containing information about stereogenic centers.
enum: Enumerate the SMILES during transform.
canonical: use canonical SMILES during transform (overrides enum).
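The vectorize/devectorize round trip implied by charset, pad, and leftpad can be sketched for a single string in pure Python. The helper names are hypothetical and this is not the class's Numpy-based implementation.

```python
def vectorize(smiles, charset, pad, leftpad=True):
    # One-hot encode one SMILES string against a fixed charset,
    # space-padding on the left (leftpad=True) or right.
    padded = smiles.rjust(pad) if leftpad else smiles.ljust(pad)
    idx = {c: i for i, c in enumerate(charset)}
    return [[1 if idx[ch] == j else 0 for j in range(len(charset))]
            for ch in padded]

def devectorize(vect, charset):
    # Invert the one-hot encoding and strip the pad spaces
    # (the reverse_transform step).
    return "".join(charset[row.index(1)] for row in vect).strip()

charset = " CO=c1()"  # must include the pad character " "
v = vectorize("CCO", charset, pad=5)
# devectorize(v, charset) == "CCO"
```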

property charset
fit(smiles, extra_chars=[], extra_pad=5)[source]

Performs extraction of the charset and length of a SMILES dataset, and sets self.pad and self.charset.

# Arguments

smiles: Numpy array or Pandas series containing SMILES as strings.
extra_chars: List of extra chars to add to the charset (e.g. "\" when "/" is present).
extra_pad: Extra padding to add before or after the SMILES vectorization.

randomize_smiles(smiles)[source]

Performs a randomization of a SMILES string. The input SMILES must be RDKit-sanitizable.

reverse_transform(vect)[source]

Performs a conversion of vectorized SMILES back to SMILES strings. The charset must be the same one used for vectorization.

# Arguments

vect: Numpy array of vectorized SMILES.

transform(smiles)[source]

Performs an enumeration (randomization) and vectorization of a Numpy array of SMILES strings.

# Arguments

smiles: Numpy array or Pandas series containing SMILES as strings.

class data.smiles_enumerator.SmilesIterator(x, y, smiles_data_generator, batch_size=32, shuffle=False, seed=None, dtype=<class 'numpy.float32'>)[source]

Bases: data.smiles_enumerator.Iterator

Iterator yielding data from a SMILES array.

# Arguments

x: Numpy array of SMILES input data.
y: Numpy array of target data.
smiles_data_generator: Instance of SmilesEnumerator to use for random SMILES generation.
batch_size: Integer, size of a batch.
shuffle: Boolean, whether to shuffle the data between epochs.
seed: Random seed for data shuffling.
dtype: dtype to use for the returned batch. Set to keras.backend.floatx if using Keras.

next()[source]

For Python 2.x.

# Returns

The next batch.

utils

class data.utils.DummyDataLoader(batch_size)[source]

Bases: object

class data.utils.DummyDataset(*args, **kwds)[source]

Bases: torch.utils.data.dataset.Dataset

data.utils.augment_smiles(smiles, labels, n_augment=5)[source]
data.utils.canonize_smiles(smiles, sanitize=True)[source]
Takes a list of SMILES strings and returns a list of their canonical SMILES.

Args:

smiles (list): list of SMILES strings.
sanitize (bool): whether to sanitize the SMILES. For the definition of sanitized SMILES, see www.rdkit.org/docs/api/rdkit.Chem.rdmolops-module.html#SanitizeMol

Output:

new_smiles (list): list of canonical SMILES, with NaNs where a SMILES string is invalid or unsanitized (when sanitize=True).

When sanitize=True, the function is analogous to sanitize_smiles(smiles, canonize=True).

data.utils.create_loader(dataset, batch_size, shuffle=True, num_workers=1, pin_memory=False, sampler=None)[source]
data.utils.cut_padding(samples, lengths, padding='left')[source]
data.utils.get_fp(smiles, n_bits=2048)[source]
data.utils.get_tokens(smiles, tokens=None)[source]

Returns a list of unique tokens, a token-to-index dictionary, and the number of unique tokens from a list of SMILES.

Parameters
  • smiles (list) – list of SMILES strings to tokenize.

  • tokens (str) – string of tokens, or None. If None, tokens will be extracted from the dataset.

Returns

tokens (list): list of unique tokens (the SMILES alphabet).
token2idx (dict): dictionary mapping each token to its index.
num_tokens (int): number of unique tokens.
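A character-level sketch of this token extraction can be written in a few lines. The name get_tokens_sketch is hypothetical, and the deterministic sorted ordering is an assumption made here for reproducibility, not necessarily what get_tokens does.

```python
def get_tokens_sketch(smiles_list):
    # Collect the unique characters across all SMILES and
    # index them in sorted (deterministic) order.
    tokens = sorted(set("".join(smiles_list)))
    token2idx = {t: i for i, t in enumerate(tokens)}
    return tokens, token2idx, len(tokens)

tokens, token2idx, num = get_tokens_sketch(["CCO", "c1ccccc1"])
# tokens == ['1', 'C', 'O', 'c'], num == 4
```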

data.utils.mol2image(x, n=2048)[source]
data.utils.pad_sequences(seqs, max_length=None, pad_symbol=' ')[source]
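Padding to a common length can be sketched as follows. Returning the final length alongside the padded sequences is a choice made for this illustration; the real pad_sequences return value may differ.

```python
def pad_sequences_sketch(seqs, max_length=None, pad_symbol=" "):
    # Right-pad every sequence with pad_symbol up to max_length,
    # inferring max_length from the longest sequence if not given.
    if max_length is None:
        max_length = max(len(s) for s in seqs)
    return [s + pad_symbol * (max_length - len(s)) for s in seqs], max_length

padded, length = pad_sequences_sketch(["CCO", "c1ccccc1"])
# padded == ["CCO     ", "c1ccccc1"], length == 8
```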
data.utils.process_graphs(smiles, node_attributes, get_atomic_attributes, edge_attributes, get_bond_attributes=None, kekulize=True)[source]
data.utils.process_smiles(smiles, sanitized=False, target=None, augment=False, pad=True, tokenize=True, tokens=None, flip=False, allowed_tokens=None)[source]
data.utils.read_smi_file(filename, unique=True)[source]

Reads SMILES from a file. The file must contain one SMILES string per line, with a token at the end of each line.

Args:

filename (str): path to the file.
unique (bool): return only unique SMILES. If unique=True, the returned list contains only unique copies.

Returns:

smiles (list): list of SMILES strings from the specified file.
success (bool): whether the operation completed successfully.
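The read-with-deduplication behavior can be sketched with the stdlib; read_smi_sketch is a hypothetical stand-in, not the library function.

```python
import os
import tempfile

def read_smi_sketch(filename, unique=True):
    # One SMILES per line; optionally keep only the first
    # occurrence of each string (order-preserving dedup).
    with open(filename) as f:
        smiles = [line.strip() for line in f if line.strip()]
    if unique:
        smiles = list(dict.fromkeys(smiles))
    return smiles, True

fd, path = tempfile.mkstemp(suffix=".smi")
with os.fdopen(fd, "w") as f:
    f.write("CCO\nCCO\nc1ccccc1\n")
smi, ok = read_smi_sketch(path, unique=True)
os.remove(path)
# smi == ["CCO", "c1ccccc1"], ok is True
```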

data.utils.read_smiles_property_file(path, cols_to_read, delimiter=',', keep_header=False)[source]
data.utils.sanitize_smiles(smiles, canonize=True, min_atoms=- 1, max_atoms=- 1, return_num_atoms=False, allowed_tokens=None, allow_charges=False, return_max_len=False, logging='warn')[source]

Takes a list of SMILES strings and returns a list of their sanitized versions. For the definition of sanitized SMILES, see http://www.rdkit.org/docs/api/rdkit.Chem.rdmolops-module.html#SanitizeMol

Args:

smiles (list): list of SMILES strings.
canonize (bool): whether to return canonical SMILES.
min_atoms (int): minimum allowed number of atoms.
max_atoms (int): maximum allowed number of atoms.
return_num_atoms (bool): return an additional array of atom counts.
allowed_tokens (iterable, optional): set of allowed tokens.
allow_charges (bool): allow nonzero charges on atoms.
logging ("warn", "info", "none"): logging level.

Output:

new_smiles (list): list of SMILES, with NaNs where a SMILES string is invalid or unsanitized. If canonize=True, returns a list of canonical SMILES.

When canonize=True, the function is analogous to canonize_smiles(smiles, sanitize=True).

data.utils.save_smi_to_file(filename, smiles, unique=True)[source]

Takes a path to a file and a list of SMILES strings, and writes the SMILES to the specified file.

Args:

filename (str): path to the file.
smiles (list): list of SMILES strings.
unique (bool): whether to write only unique copies.

Output:

success (bool): whether the operation completed successfully.

data.utils.save_smiles_property_file(path, smiles, labels, delimiter=',')[source]
data.utils.seq2tensor(seqs, tokens, flip=True)[source]
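The token-to-index mapping behind seq2tensor can be sketched in pure Python. The flip behavior shown here is an assumption, inferred from the flipped-padding behavior described for SmilesDataset above; seq2tensor_sketch is a hypothetical stand-in for the Numpy-based function.

```python
def seq2tensor_sketch(seqs, tokens, flip=True):
    # Map each character of each sequence to its token index,
    # optionally reversing the sequence (assumed flip semantics).
    token2idx = {t: i for i, t in enumerate(tokens)}
    out = []
    for s in seqs:
        idx = [token2idx[c] for c in s]
        out.append(idx[::-1] if flip else idx)
    return out

tensor = seq2tensor_sketch(["CCO"], tokens=["C", "O"], flip=False)
# tensor == [[0, 0, 1]]
```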
data.utils.time_since(since)[source]