data

smiles_data_layer

class data.smiles_data_layer.SmilesDataset(*args, **kwds)[source]

Bases: torch.utils.data.dataset.Dataset

Creates a dataset for SMILES-property data.

Parameters
  • filename (str) – full path to the dataset file. The dataset file must be a csv file.

  • cols_to_read (list) – list specifying which columns to read from the dataset file. Can be of any length; cols_to_read[0] is used as the index of the SMILES column, and cols_to_read[1:] as indices of the label columns.

  • delimiter (str) – column delimiter in filename. Default is ','.

  • tokens (list) – list of unique tokens from SMILES. If not specified, tokens will be extracted from the provided dataset.

  • pad (bool) – whether to pad SMILES. If True, SMILES are padded from the right and then flipped. Default is True.

  • augment (bool) – argument specifying whether to augment SMILES.
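The cols_to_read convention above can be sketched with plain csv parsing. This is an illustration of the column-index semantics only, not the SmilesDataset implementation; the function name read_smiles_labels is hypothetical.

```python
import csv
import io

def read_smiles_labels(text, cols_to_read, delimiter=","):
    # Hypothetical sketch: cols_to_read[0] selects the SMILES column,
    # cols_to_read[1:] select the label columns, as described above.
    smiles, labels = [], []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        smiles.append(row[cols_to_read[0]])
        labels.append([float(row[i]) for i in cols_to_read[1:]])
    return smiles, labels

data = "CCO,0.5,1\nc1ccccc1,1.2,0\n"
smi, lab = read_smiles_labels(data, cols_to_read=[0, 1])
# smi == ["CCO", "c1ccccc1"], lab == [[0.5], [1.2]]
```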

graph_data_layer

smiles_protein_data_layer

class data.smiles_protein_data_layer.SmilesProteinDataset(*args, **kwds)[source]

Bases: torch.utils.data.dataset.Dataset

vanilla_data_layer

class data.vanilla_data_layer.VanillaDataset(filename, cols_to_read, features, delimiter=',', tokens=None)[source]

Bases: object

smiles_enumerator

class data.smiles_enumerator.Iterator(n, batch_size, shuffle, seed)[source]

Bases: object

Abstract base class for data iterators.

# Arguments

n: Integer, total number of samples in the dataset to loop over.
batch_size: Integer, size of a batch.
shuffle: Boolean, whether to shuffle the data between epochs.
seed: Random seed for data shuffling.
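The n / batch_size / shuffle / seed contract can be sketched as a minimal index iterator. This is an assumption-laden sketch (class name IndexIterator is hypothetical), not the actual Iterator implementation.

```python
import random

class IndexIterator:
    # Minimal sketch: loop over n sample indices in batches,
    # optionally reshuffling at the start of each epoch.
    def __init__(self, n, batch_size, shuffle=False, seed=None):
        self.n, self.batch_size, self.shuffle = n, batch_size, shuffle
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.order = list(range(self.n))
        if self.shuffle:
            self.rng.shuffle(self.order)
        self.pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self.pos >= self.n:
            self.reset()  # start a new epoch
        batch = self.order[self.pos:self.pos + self.batch_size]
        self.pos += self.batch_size
        return batch

it = IndexIterator(n=5, batch_size=2, shuffle=False)
# next(it) → [0, 1], then [2, 3], then [4]
```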

reset()[source]
class data.smiles_enumerator.SmilesEnumerator(charset='@C)(=cOn1S2/H[N]\\', pad=120, leftpad=True, isomericSmiles=True, enum=True, canonical=False)[source]

Bases: object

SMILES Enumerator, vectorizer and devectorizer.

# Arguments

charset: string containing the characters for the vectorization; can also be generated via the .fit() method.
pad: Length of the vectorization.
leftpad: Add spaces to the left of the SMILES.
isomericSmiles: Generate SMILES containing information about stereogenic centers.
enum: Enumerate the SMILES during transform.
canonical: use canonical SMILES during transform (overrides enum).
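The vectorize/devectorize round trip implied by charset, pad, and leftpad can be sketched for a single string in pure Python. The helper names are hypothetical and this is not the class's Numpy-based implementation.

```python
def vectorize(smiles, charset, pad, leftpad=True):
    # One-hot encode one SMILES string against a fixed charset,
    # space-padding on the left (leftpad=True) or right.
    padded = smiles.rjust(pad) if leftpad else smiles.ljust(pad)
    idx = {c: i for i, c in enumerate(charset)}
    return [[1 if idx[ch] == j else 0 for j in range(len(charset))]
            for ch in padded]

def devectorize(vect, charset):
    # Invert the one-hot encoding and strip the pad spaces
    # (the reverse_transform step).
    return "".join(charset[row.index(1)] for row in vect).strip()

charset = " CO=c1()"  # must include the pad character " "
v = vectorize("CCO", charset, pad=5)
# devectorize(v, charset) == "CCO"
```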

property charset
fit(smiles, extra_chars=[], extra_pad=5)[source]

Performs extraction of the charset and length of a SMILES dataset, and sets self.pad and self.charset.

# Arguments

smiles: Numpy array or Pandas series containing SMILES as strings.
extra_chars: List of extra chars to add to the charset (e.g. "\" when "/" is present).
extra_pad: Extra padding to add before or after the SMILES vectorization.

randomize_smiles(smiles)[source]

Performs a randomization of a SMILES string. The input SMILES must be RDKit-sanitizable.

reverse_transform(vect)[source]

Performs a conversion of vectorized SMILES back to SMILES strings. The charset must be the same one used for vectorization.

# Arguments

vect: Numpy array of vectorized SMILES.

transform(smiles)[source]

Performs an enumeration (randomization) and vectorization of a Numpy array of SMILES strings.

# Arguments

smiles: Numpy array or Pandas series containing SMILES as strings.

class data.smiles_enumerator.SmilesIterator(x, y, smiles_data_generator, batch_size=32, shuffle=False, seed=None, dtype=<class 'numpy.float32'>)[source]

Bases: data.smiles_enumerator.Iterator

Iterator yielding data from a SMILES array.

# Arguments

x: Numpy array of SMILES input data.
y: Numpy array of target data.
smiles_data_generator: Instance of SmilesEnumerator to use for random SMILES generation.
batch_size: Integer, size of a batch.
shuffle: Boolean, whether to shuffle the data between epochs.
seed: Random seed for data shuffling.
dtype: dtype to use for the returned batch. Set to keras.backend.floatx if using Keras.

next()[source]

For Python 2.x.

# Returns

The next batch.

utils

class data.utils.DummyDataLoader(batch_size)[source]

Bases: object

class data.utils.DummyDataset(*args, **kwds)[source]

Bases: torch.utils.data.dataset.Dataset

data.utils.augment_smiles(smiles, labels, n_augment=5)[source]
data.utils.canonize_smiles(smiles, sanitize=True)[source]
Takes a list of SMILES strings and returns a list of their canonical SMILES.

Args:

smiles (list): list of SMILES strings.
sanitize (bool): whether to sanitize the SMILES. For the definition of sanitized SMILES, see www.rdkit.org/docs/api/rdkit.Chem.rdmolops-module.html#SanitizeMol

Output:

new_smiles (list): list of canonical SMILES, with NaNs where a SMILES string is invalid or unsanitized (when sanitize=True).

When sanitize=True, the function is analogous to sanitize_smiles(smiles, canonize=True).

data.utils.create_loader(dataset, batch_size, shuffle=True, num_workers=1, pin_memory=False, sampler=None)[source]
data.utils.cut_padding(samples, lengths, padding='left')[source]
data.utils.get_fp(smiles, n_bits=2048)[source]
data.utils.get_tokens(smiles, tokens=None)[source]

Returns a list of unique tokens, a token-to-index dictionary, and the number of unique tokens from a list of SMILES.

Parameters
  • smiles (list) – list of SMILES strings to tokenize.

  • tokens (str) – string of tokens, or None. If None, tokens will be extracted from the dataset.

Returns

tokens (list): list of unique tokens (the SMILES alphabet).
token2idx (dict): dictionary mapping each token to its index.
num_tokens (int): number of unique tokens.
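A character-level sketch of this token extraction can be written in a few lines. The name get_tokens_sketch is hypothetical, and the deterministic sorted ordering is an assumption made here for reproducibility, not necessarily what get_tokens does.

```python
def get_tokens_sketch(smiles_list):
    # Collect the unique characters across all SMILES and
    # index them in sorted (deterministic) order.
    tokens = sorted(set("".join(smiles_list)))
    token2idx = {t: i for i, t in enumerate(tokens)}
    return tokens, token2idx, len(tokens)

tokens, token2idx, num = get_tokens_sketch(["CCO", "c1ccccc1"])
# tokens == ['1', 'C', 'O', 'c'], num == 4
```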

data.utils.mol2image(x, n=2048)[source]
data.utils.pad_sequences(seqs, max_length=None, pad_symbol=' ')[source]
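Padding to a common length can be sketched as follows. Returning the final length alongside the padded sequences is a choice made for this illustration; the real pad_sequences return value may differ.

```python
def pad_sequences_sketch(seqs, max_length=None, pad_symbol=" "):
    # Right-pad every sequence with pad_symbol up to max_length,
    # inferring max_length from the longest sequence if not given.
    if max_length is None:
        max_length = max(len(s) for s in seqs)
    return [s + pad_symbol * (max_length - len(s)) for s in seqs], max_length

padded, length = pad_sequences_sketch(["CCO", "c1ccccc1"])
# padded == ["CCO     ", "c1ccccc1"], length == 8
```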
data.utils.process_graphs(smiles, node_attributes, get_atomic_attributes, edge_attributes, get_bond_attributes=None, kekulize=True)[source]
data.utils.process_smiles(smiles, sanitized=False, target=None, augment=False, pad=True, tokenize=True, tokens=None, flip=False, allowed_tokens=None)[source]
data.utils.read_smi_file(filename, unique=True)[source]

Reads SMILES from a file. The file must contain one SMILES string per line, with a token at the end of each line.

Args:

filename (str): path to the file.
unique (bool): return only unique SMILES. If unique=True, the returned list contains only unique copies.

Returns:

smiles (list): list of SMILES strings from the specified file.
success (bool): whether the operation completed successfully.
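The read-with-deduplication behavior can be sketched with the stdlib; read_smi_sketch is a hypothetical stand-in, not the library function.

```python
import os
import tempfile

def read_smi_sketch(filename, unique=True):
    # One SMILES per line; optionally keep only the first
    # occurrence of each string (order-preserving dedup).
    with open(filename) as f:
        smiles = [line.strip() for line in f if line.strip()]
    if unique:
        smiles = list(dict.fromkeys(smiles))
    return smiles, True

fd, path = tempfile.mkstemp(suffix=".smi")
with os.fdopen(fd, "w") as f:
    f.write("CCO\nCCO\nc1ccccc1\n")
smi, ok = read_smi_sketch(path, unique=True)
os.remove(path)
# smi == ["CCO", "c1ccccc1"], ok is True
```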

data.utils.read_smiles_property_file(path, cols_to_read, delimiter=',', keep_header=False)[source]
data.utils.sanitize_smiles(smiles, canonize=True, min_atoms=- 1, max_atoms=- 1, return_num_atoms=False, allowed_tokens=None, allow_charges=False, return_max_len=False, logging='warn')[source]

Takes a list of SMILES strings and returns a list of their sanitized versions. For the definition of sanitized SMILES, see http://www.rdkit.org/docs/api/rdkit.Chem.rdmolops-module.html#SanitizeMol

Args:

smiles (list): list of SMILES strings.
canonize (bool): whether to return canonical SMILES.
min_atoms (int): minimum allowed number of atoms.
max_atoms (int): maximum allowed number of atoms.
return_num_atoms (bool): return an additional array of atom counts.
allowed_tokens (iterable, optional): set of allowed tokens.
allow_charges (bool): allow nonzero charges on atoms.
logging ("warn", "info", "none"): logging level.

Output:

new_smiles (list): list of SMILES, with NaNs where a SMILES string is invalid or unsanitized. If canonize=True, returns a list of canonical SMILES.

When canonize=True, the function is analogous to canonize_smiles(smiles, sanitize=True).

data.utils.save_smi_to_file(filename, smiles, unique=True)[source]

Takes a path to a file and a list of SMILES strings, and writes the SMILES to the specified file.

Args:

filename (str): path to the file.
smiles (list): list of SMILES strings.
unique (bool): whether to write only unique copies.

Output:

success (bool): whether the operation completed successfully.

data.utils.save_smiles_property_file(path, smiles, labels, delimiter=',')[source]
data.utils.seq2tensor(seqs, tokens, flip=True)[source]
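The token-to-index mapping behind seq2tensor can be sketched in pure Python. The flip behavior shown here is an assumption, inferred from the flipped-padding behavior described for SmilesDataset above; seq2tensor_sketch is a hypothetical stand-in for the Numpy-based function.

```python
def seq2tensor_sketch(seqs, tokens, flip=True):
    # Map each character of each sequence to its token index,
    # optionally reversing the sequence (assumed flip semantics).
    token2idx = {t: i for i, t in enumerate(tokens)}
    out = []
    for s in seqs:
        idx = [token2idx[c] for c in s]
        out.append(idx[::-1] if flip else idx)
    return out

tensor = seq2tensor_sketch(["CCO"], tokens=["C", "O"], flip=False)
# tensor == [[0, 0, 1]]
```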
data.utils.time_since(since)[source]