chainer_chemistry.datasets.molnet.get_molnet_dataset

chainer_chemistry.datasets.molnet.get_molnet_dataset(dataset_name, preprocessor=None, labels=None, split=None, frac_train=0.8, frac_valid=0.1, frac_test=0.1, seed=777, return_smiles=False, return_pdb_id=False, target_index=None, task_index=0, **kwargs)

Downloads, caches, and preprocesses a MoleculeNet dataset.

Parameters:
  • dataset_name (str) – MoleculeNet dataset name. For details on MoleculeNet, refer to the official site. For the dataset names available in chainer_chemistry, refer to molnet_config.py.
  • preprocessor (BasePreprocessor) – Preprocessor. It should be chosen based on the network to be trained. If it is None, the default AtomicNumberPreprocessor is used.
  • labels (str or list) – List of target labels.
  • split (str or BaseSplitter or None) – How to split the dataset into train, validation, and test sets. If None, this function uses the splitter recommended by MoleculeNet. Alternatively, you can pass an instance of BaseSplitter or choose one of ‘random’, ‘stratified’, and ‘scaffold’.
  • return_smiles (bool) – If set to True, the SMILES array is also returned.
  • return_pdb_id (bool) – If set to True, PDB ID array is also returned. This argument is only used when you select ‘pdbbind_smiles’.
  • target_index (list or None) – List of example indices used to extract part of the dataset. If None (default), all examples are parsed.
  • task_index (int) – Index of the target task in the dataset used for stratification. Used only by the stratified splitter.
Returns (dict):
Dictionary that contains the dataset, already split into train, validation, and test sets, and a 1-d numpy array with dtype=object (string) holding the SMILES of each example, or None.
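
Example:
The following is a minimal usage sketch, not part of the original documentation. It wraps a call that uses only the keyword arguments shown in the signature above. The helper name load_molnet_split is hypothetical, and the dataset name 'bbbp' is an assumption; check molnet_config.py for the names actually available. The import is done inside the function so that the snippet can be read and loaded without chainer_chemistry installed.

```python
def load_molnet_split(dataset_name="bbbp", seed=777):
    """Hypothetical helper: load a MoleculeNet dataset with a random 80/10/10 split.

    The default dataset name 'bbbp' is an assumption; see molnet_config.py
    for the names that chainer_chemistry actually supports.
    """
    # Lazy import: chainer_chemistry is only required when the helper is called.
    from chainer_chemistry.datasets.molnet import get_molnet_dataset

    return get_molnet_dataset(
        dataset_name,
        split="random",      # one of 'random', 'stratified', 'scaffold'
        frac_train=0.8,      # fractions mirror the defaults in the signature
        frac_valid=0.1,
        frac_test=0.1,
        seed=seed,           # fixes the random split for reproducibility
        return_smiles=True,  # also return the SMILES array in the result
    )
```

Calling result = load_molnet_split() returns the dictionary described under Returns. Since the key names are not listed here, inspecting sorted(result.keys()) is a safe first step before indexing into the result.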