data
smiles_data_layer
class data.smiles_data_layer.SmilesDataset(*args, **kwds)
Bases: torch.utils.data.dataset.Dataset
Creates a dataset for SMILES-property data.
- Parameters
filename (str) – full path to the dataset file. The dataset file must be a CSV file.
cols_to_read (list) – list of column indices to read from the dataset file. The list may be of any length: cols_to_read[0] is used as the index of the column containing SMILES, and cols_to_read[1:] are used as the indices of the columns containing label values.
delimiter (str) – column delimiter in filename. Default is ",".
tokens (list) – list of unique tokens from SMILES. If not specified, tokens will be extracted from the provided dataset.
pad (bool) – whether to pad SMILES. If True, SMILES will be padded from the right and then flipped. Default is True.
augment (bool) – whether to augment SMILES.
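The cols_to_read convention can be illustrated with a small standalone sketch. It does not use SmilesDataset itself: read_smiles_csv is a hypothetical helper written for this example, the input is a toy in-memory CSV with made-up values, and treating labels as floats is an assumption.

```python
import csv
import io


def read_smiles_csv(handle, cols_to_read, delimiter=","):
    """Split each row into a SMILES string and its label values.

    Hypothetical helper (not part of the documented API):
    cols_to_read[0] is the SMILES column index and
    cols_to_read[1:] are the label column indices.
    """
    smiles, labels = [], []
    for row in csv.reader(handle, delimiter=delimiter):
        smiles.append(row[cols_to_read[0]])
        # Assumption: labels are numeric and can be parsed as floats.
        labels.append([float(row[i]) for i in cols_to_read[1:]])
    return smiles, labels


# Toy file without a header row: id, SMILES, two made-up property values.
toy_csv = io.StringIO("0,CCO,0.5,1.0\n1,c1ccccc1,2.0,3.5\n")
smiles, labels = read_smiles_csv(toy_csv, cols_to_read=[1, 2, 3])
print(smiles)  # ['CCO', 'c1ccccc1']
print(labels)  # [[0.5, 1.0], [2.0, 3.5]]
```

Passing cols_to_read=[1] in this sketch would return the SMILES column with an empty label list per row, which is how a list of "various length" could be read.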
graph_data_layer
smiles_protein_data_layer
vanilla_data_layer
smiles_enumerator
class data.smiles_enumerator.Iterator(n, batch_size, shuffle, seed)
Bases: object
Abstract base class for data iterators.
- Arguments
n: Integer, total number of samples in the dataset to loop over.
batch_size: Integer, size of a batch.
shuffle: Boolean, whether to shuffle the data between epochs.
seed: Random seed for data shuffling.
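How n, batch_size, shuffle and seed interact can be sketched in a few lines of plain Python. This is an illustration of the general idea for a single pass over the data, not the class's implementation; batch_indices is a hypothetical name, and the actual class may reseed or reshuffle differently between epochs.

```python
import random


def batch_indices(n, batch_size, shuffle, seed=None):
    """Yield lists of sample indices that cover range(n) exactly once."""
    order = list(range(n))
    if shuffle:
        # A private Random instance keeps the shuffle reproducible
        # for a given seed without touching the global random state.
        random.Random(seed).shuffle(order)
    for start in range(0, n, batch_size):
        # The last batch is shorter when n is not a multiple of batch_size.
        yield order[start:start + batch_size]


batches = list(batch_indices(n=7, batch_size=3, shuffle=False))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```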
class data.smiles_enumerator.SmilesEnumerator(charset='@C)(=cOn1S2/H[N]\\', pad=120, leftpad=True, isomericSmiles=True, enum=True, canonical=False)
Bases: object
SMILES enumerator, vectorizer and devectorizer.
- Arguments
charset: String containing the characters for the vectorization. Can also be generated via the .fit() method.
pad: Length of the vectorization.
leftpad: Add spaces to the left of the SMILES.
isomericSmiles: Generate SMILES containing information about stereogenic centers.
enum: Enumerate the SMILES during transform.
canonical: Use canonical SMILES during transform (overrides enum).
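The roles of charset, pad and leftpad can be shown with a minimal one-hot sketch. The functions vectorize and devectorize below are hypothetical stand-ins written for this example; they use nested lists rather than arrays and skip enumeration and canonicalization, so they only approximate what the class's vectorizer might do.

```python
def vectorize(smiles, charset, pad, leftpad=True):
    """One-hot encode a SMILES string as a pad x len(charset) nested list."""
    if len(smiles) > pad:
        raise ValueError("SMILES is longer than pad")
    char_to_idx = {c: i for i, c in enumerate(charset)}
    onehot = [[0] * len(charset) for _ in range(pad)]
    # With leftpad the empty (all-zero) rows come first.
    offset = pad - len(smiles) if leftpad else 0
    for pos, char in enumerate(smiles):
        onehot[offset + pos][char_to_idx[char]] = 1
    return onehot


def devectorize(onehot, charset):
    """Invert vectorize: all-zero rows are padding and are skipped."""
    return "".join(charset[row.index(1)] for row in onehot if 1 in row)


charset = "@C)(=cOn1S2/H[N]\\"  # default charset from the signature above
encoded = vectorize("CC(=O)N", charset, pad=10)
print(devectorize(encoded, charset))  # CC(=O)N
```

A character that is missing from charset raises a KeyError in this sketch, which is one reason to derive the charset from the data before vectorizing.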
property charset
fit(smiles, extra_chars=[], extra_pad=5)
Extracts the charset and SMILES length from a SMILES dataset and sets self.charset and self.pad.
- Arguments
smiles: Numpy array or Pandas series containing SMILES as strings.
extra_chars: List of extra characters to add to the charset (e.g. “\” when “/” is present).
extra_pad: Extra padding to add before or after the SMILES vectorization.
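One plausible reading of this description is sketched below, assuming the charset is the set of all characters seen plus extra_chars, and the pad is the longest SMILES length plus extra_pad. fit_charset is a hypothetical standalone function; it returns the two values instead of setting attributes, and the sorting is added only to make the result reproducible.

```python
def fit_charset(smiles, extra_chars=(), extra_pad=5):
    """Return (charset, pad) derived from a list of SMILES strings."""
    chars = set("".join(smiles)) | set(extra_chars)
    # Sorted only so the result is deterministic; a set has no fixed order.
    charset = "".join(sorted(chars))
    # Assumption: pad is the longest SMILES plus the extra padding.
    pad = max(len(s) for s in smiles) + extra_pad
    return charset, pad


charset, pad = fit_charset(["CCO", "C/C=C/C"], extra_chars=["\\"])
print(repr(charset), pad)  # '/=CO\\' 12
```

The sketch uses an immutable tuple as the default for extra_chars, which avoids the shared mutable default that a literal list default would introduce.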
randomize_smiles(smiles)
Performs a randomization of a SMILES string. The SMILES must be RDKit sanitizable.
class data.smiles_enumerator.SmilesIterator(x, y, smiles_data_generator, batch_size=32, shuffle=False, seed=None, dtype=<class 'numpy.float32'>)
Bases: data.smiles_enumerator.Iterator
Iterator yielding data from a SMILES array.
- Arguments
x: Numpy array of SMILES input data.
y: Numpy array of target data.
smiles_data_generator: Instance of SmilesEnumerator to use for random SMILES generation.
batch_size: Integer, size of a batch.
shuffle: Boolean, whether to shuffle the data between epochs.
seed: Random seed for data shuffling.
dtype: dtype to use for the returned batch. Set to keras.backend.floatx if using Keras.
utils
data.utils.canonize_smiles(smiles, sanitize=True)
Takes a list of SMILES strings and returns a list of their canonical SMILES.
- Args:
smiles (list): list of SMILES strings.
sanitize (bool): whether to sanitize SMILES or not. For the definition of sanitized SMILES, see www.rdkit.org/docs/api/rdkit.Chem.rdmolops-module.html#SanitizeMol
- Output:
new_smiles (list): list of canonical SMILES, with NaN in place of any SMILES string that is invalid or unsanitized (when ‘sanitize = True’).
When ‘sanitize = True’ the function is analogous to: sanitize_smiles(smiles, canonize=True).
data.utils.create_loader(dataset, batch_size, shuffle=True, num_workers=1, pin_memory=False, sampler=None)
data.utils.get_tokens(smiles, tokens=None)
Returns the list of unique tokens, the token-to-index dictionary and the number of unique tokens from a list of SMILES.
- Parameters
smiles (list) – list of SMILES strings to tokenize.
tokens (string) – string of tokens or None. If None, tokens will be extracted from the dataset.
- Returns
tokens (list): list of unique tokens/SMILES alphabet.
token2idx (dict): dictionary mapping each token to its index.
num_tokens (int): number of unique tokens.
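The three return values can be illustrated with a small sketch. It assumes character-level tokens (so a two-letter element such as Cl would be split into two tokens) and a sorted alphabet; get_tokens_sketch is a hypothetical stand-in, and the ordering and exact return types of the documented function may differ.

```python
def get_tokens_sketch(smiles, tokens=None):
    """Return (tokens, token2idx, num_tokens) for a list of SMILES strings."""
    if tokens is None:
        # Character-level alphabet, sorted so indices are reproducible.
        tokens = sorted(set("".join(smiles)))
    else:
        # A caller-supplied string of tokens is used as given.
        tokens = list(tokens)
    token2idx = {tok: i for i, tok in enumerate(tokens)}
    return tokens, token2idx, len(tokens)


tokens, token2idx, num_tokens = get_tokens_sketch(["CCO", "C=O"])
print(tokens)      # ['=', 'C', 'O']
print(token2idx)   # {'=': 0, 'C': 1, 'O': 2}
print(num_tokens)  # 3
```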
data.utils.process_graphs(smiles, node_attributes, get_atomic_attributes, edge_attributes, get_bond_attributes=None, kekulize=True)
data.utils.process_smiles(smiles, sanitized=False, target=None, augment=False, pad=True, tokenize=True, tokens=None, flip=False, allowed_tokens=None)
data.utils.read_smi_file(filename, unique=True)
Reads SMILES from a file. The file must contain one SMILES string per line, each terminated by a newline character.
- Args:
filename (str): path to the file.
unique (bool): return only unique SMILES.
- Returns:
smiles (list): list of SMILES strings from the specified file. If ‘unique=True’ this list contains only unique copies.
success (bool): whether the operation completed successfully.
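The file format and the (smiles, success) return pair can be sketched with the standard library alone. read_smi_sketch is a hypothetical stand-in: returning an empty list with success=False on an I/O error, and keeping first-seen order when de-duplicating, are assumptions of this example rather than documented behavior.

```python
import os
import tempfile


def read_smi_sketch(filename, unique=True):
    """Return (smiles, success); success is False if the file cannot be read."""
    try:
        with open(filename, "r") as handle:
            # Drop the trailing newline and skip blank lines.
            smiles = [line.strip() for line in handle if line.strip()]
    except OSError:
        return [], False
    if unique:
        # De-duplicate while keeping first-seen order.
        smiles = list(dict.fromkeys(smiles))
    return smiles, True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.smi")
    with open(path, "w") as handle:
        handle.write("CCO\nc1ccccc1\nCCO\n")
    smiles, success = read_smi_sketch(path)
    missing, missing_ok = read_smi_sketch(os.path.join(tmp, "absent.smi"))

print(smiles, success)      # ['CCO', 'c1ccccc1'] True
print(missing, missing_ok)  # [] False
```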
data.utils.sanitize_smiles(smiles, canonize=True, min_atoms=-1, max_atoms=-1, return_num_atoms=False, allowed_tokens=None, allow_charges=False, return_max_len=False, logging='warn')
Takes a list of SMILES strings and returns a list of their sanitized versions. For the definition of sanitized SMILES, see http://www.rdkit.org/docs/api/rdkit.Chem.rdmolops-module.html#SanitizeMol
- Args:
smiles (list): list of SMILES strings.
canonize (bool): whether to return canonical SMILES or not.
min_atoms (int): minimum allowed number of atoms.
max_atoms (int): maximum allowed number of atoms.
return_num_atoms (bool): return an additional array of atom numbers.
allowed_tokens (iterable, optional): set of allowed tokens.
allow_charges (bool): allow nonzero charges on atoms.
logging (“warn”, “info”, “none”): logging level.
- Output:
new_smiles (list): list of SMILES, with NaN in place of any SMILES string that is invalid or unsanitized. If ‘canonize = True’, returns a list of canonical SMILES.
When ‘canonize = True’ the function is analogous to: canonize_smiles(smiles, sanitize=True).
data.utils.save_smi_to_file(filename, smiles, unique=True)
Takes a path to a file and a list of SMILES strings and writes the SMILES to the specified file.
- Args:
filename (str): path to the file.
smiles (list): list of SMILES strings.
unique (bool): whether to write only unique copies or not.
- Output:
success (bool): whether the operation completed successfully.
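A minimal sketch of the write path, assuming one SMILES per line and a boolean result that is False on an I/O error. save_smi_sketch is a hypothetical stand-in, and the order in which unique entries are written is an assumption of this example.

```python
import os
import tempfile


def save_smi_sketch(filename, smiles, unique=True):
    """Write one SMILES per line; return True on success, False on I/O error."""
    if unique:
        # De-duplicate while keeping first-seen order.
        smiles = list(dict.fromkeys(smiles))
    try:
        with open(filename, "w") as handle:
            for s in smiles:
                handle.write(s + "\n")
    except OSError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.smi")
    ok = save_smi_sketch(path, ["CCO", "CCO", "C=O"])
    with open(path) as handle:
        written = handle.read()

print(ok, repr(written))  # True 'CCO\nC=O\n'
```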