Data

The Data class creates a data object that handles reading inputs and outputs.

For example, mydata = Data() creates a Data object mydata. After reading input and output data with read_inputdata() and read_outputdata(), the data are stored as NumPy arrays in two dictionaries, which can be accessed as mydata.input_dict['key'] and mydata.output_dict['key'], where 'key' refers to one of the input or output variables listed below. A minimal usage sketch follows the two key lists.

Input data dictionary keys:

  • 'fingerprints': 3D array [num_images, num_atoms, num_fingerprints] of the fingerprints.

  • 'atom_type': 2D array [num_images, num_atoms] of the atom element types, numbered sequentially starting from 1.

  • 'volume': 2D array [num_images, 1] of the cell volume; the trailing dimension of size 1 is kept for matrix multiplication.

  • 'dGdr': 4D array [num_images, num_der_pairs, num_fingerprints, 3] of the derivatives of the fingerprints w.r.t. atom coordinates; the last dimension holds the three Cartesian components.

  • 'neighbor_atom_coord': 4D array [num_images, num_der_pairs, 3, 1] of the neighbor atom coordinates; the trailing dimension of size 1 is added for convenience of matrix multiplication in TensorFlow.

  • 'center_atom_id': 2D array [num_images, num_der_pairs] of the center atom IDs.

  • 'neighbor_atom_id': 2D array [num_images, num_der_pairs] of the neighbor atom IDs; note that the neighbors may be ghost atoms.

Output data dictionary keys:

  • 'pe': 1D array [num_images] of potential energy.

  • 'force': 3D array [num_images, num_atoms, 3] of atomic force.

  • 'stress': 2D array [num_images, 9] of the stress tensor with 9 components (6 of which are independent).
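As an illustration, a minimal sketch of the typical workflow is shown below. The directory names and wildcard file names ('./descriptors', 'dump_fp.*', 'dump_der.*', './xyz', 'structure.*.extxyz') are placeholders, not files shipped with the package.

    from atomdnn.data import Data

    mydata = Data()

    # read fingerprints and their derivatives written by LAMMPS
    mydata.read_inputdata('./descriptors', 'dump_fp.*', 'dump_der.*')

    # read energies, forces and stresses from extxyz files
    mydata.read_outputdata('./xyz', 'structure.*.extxyz', format='extxyz')

    # the data are stored as NumPy arrays in two dictionaries
    print(mydata.input_dict['fingerprints'].shape)   # (num_images, num_atoms, num_fingerprints)
    print(mydata.output_dict['force'].shape)         # (num_images, num_atoms, 3)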

Class

class atomdnn.data.Data(descriptors_path=None, fp_filename=None, der_filename=None, xyzfile_path=None, xyzfile_name=None, format='extxyz', image_num=None, skip=0, verbose=False, silent=False, read_der=True, **kwargs)[source]

Create a Data object, with the option to read inputs and outputs at construction time. The parameters are explained in read_inputdata() and read_outputdata().
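For example, the reading can also be done at construction time (a sketch; the paths and file names are placeholders):

    mydata = Data(descriptors_path='./descriptors', fp_filename='dump_fp.*',
                  der_filename='dump_der.*', xyzfile_path='./xyz',
                  xyzfile_name='structure.*.extxyz', format='extxyz')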

read_inputdata(descriptors_path, fp_filename, der_filename=None, image_num=None, skip=0, append=False, verbose=False, silent=False, read_der=True)[source]

Read input data using read_fingerprints_from_lmpdump() and read_der_from_lmpdump(). A usage sketch follows the parameter list.

Parameters:
  • descriptors_path – directory containing the descriptor files

  • fp_filename – file name(s) of the descriptor files; use the wildcard '*' for multiple, numerically ordered files

  • der_filename – file name(s) of the derivative files; use the wildcard '*' for multiple, numerically ordered files

  • image_num – number of images to read; if None, read all files matched by fp_filename

  • skip (int) – number of images to skip

  • append (bool) – if True, append the inputs to the already existing data object

  • verbose (bool) – if True, print the names of the files being read

  • read_der (bool) – if True, read the fingerprint derivatives
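For example (a sketch; the path and file name are placeholders), the derivatives may be skipped, e.g. when only energies are used:

    mydata.read_inputdata('./descriptors', 'dump_fp.*', read_der=False)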

read_fingerprints_from_lmpdump(descriptors_path, fp_filename, image_num=None, skip=0, append=False, verbose=False, silent=False)[source]

Read descriptors (fingerprints), atom_type and volume from the descriptor files created with LAMMPS, and save them into the data object.

read_der_from_lmpdump(descriptor_path, der_filename, image_num=None, skip=0, append=False, verbose=False, silent=False)[source]

Read the derivatives of the fingerprints w.r.t. atom coordinates (dGdr), along with neighbor_atom_coord, center_atom_id and neighbor_atom_id.
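The two lower-level readers can also be called directly on an existing Data object (a sketch; the path and file names are placeholders):

    mydata.read_fingerprints_from_lmpdump('./descriptors', 'dump_fp.*')
    mydata.read_der_from_lmpdump('./descriptors', 'dump_der.*')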

read_outputdata(xyzfile_path, xyzfile_name, format='extxyz', image_num=None, skip=0, append=False, verbose=False, silent=False, read_force=True, read_stress=True, **kwargs)[source]

Read outputs (energy, force and stress) from extxyz files. A usage sketch follows the parameter list.

Parameters:
  • xyzfile_path – directory containing a series of input atomic structure files

  • xyzfile_name – atomic structure file name; use the wildcard '*' for multiple, numerically ordered files

  • format – 'lammps-data', 'extxyz', 'vasp', etc. See the complete list at https://wiki.fysik.dtu.dk/ase/ase/io/io.html#ase.io.read. 'extxyz' is recommended.

  • read_force (bool) – if True, read forces; make sure the extxyz files contain force data

  • read_stress (bool) – if True, read stresses; make sure the extxyz files contain stress data

  • image_num – number of images to read; if None, read all files matched by xyzfile_name

  • append (bool) – if True, append the outputs to the already existing data object

  • verbose (bool) – if True, print the names of the extxyz files being read

  • kwargs – used to pass optional file styles to the reader
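For example (a sketch; the path and file name are placeholders), stresses can be left out if the extxyz files do not contain them:

    mydata.read_outputdata('./xyz', 'structure.*.extxyz', format='extxyz', read_stress=False)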

shuffle()[source]

Shuffle the data.

slice(start=None, end=None)[source]

Slice the data between image start and image end, and return both the input and output dictionaries. Indexing starts from 1.
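For example (a sketch, assuming the input dictionary is returned first):

    inputs, outputs = mydata.slice(1, 100)   # images 1 through 100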

get_input_dict(start=None, end=None)[source]

Return the input dictionaries from image start to image end. Indexing starts from 1. If end is not provided, return only the dictionary of image start.

get_output_dict(start=None, end=None)[source]

Return the output dictionaries from image start to image end. Indexing starts from 1. If end is not provided, return only the dictionary of image start.
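For example (a sketch):

    first_inputs = mydata.get_input_dict(1)         # inputs of image 1 only
    some_outputs = mydata.get_output_dict(1, 100)   # outputs of images 1 to 100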

convert_data_to_tensor()[source]

Convert the input and output data to TensorFlow tensors. This can speed up data manipulation with TensorFlow functions.
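A tf.data.Dataset can then be built from the two dictionaries. The from_tensor_slices call below is one possible way, shown as a sketch rather than the package's own helper:

    import tensorflow as tf

    mydata.convert_data_to_tensor()
    dataset = tf.data.Dataset.from_tensor_slices((mydata.input_dict, mydata.output_dict))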

check_data()[source]

Check the consistency of the input and output data.

append(apdata, read_force=True, read_stress=True)[source]

Append a second dataset (apdata) to this data object.
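For example, assuming data1 and data2 are two Data objects built from different sets of structures (a sketch):

    data1.append(data2)   # the images of data2 are appended to data1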

Functions

These functions are used to manipulate TensorFlow datasets.

atomdnn.data.split_dataset(dataset, train_pct, val_pct=None, test_pct=None, shuffle=False, data_size=None)[source]

Split the TensorFlow dataset into training, validation and test sets. A usage sketch follows the return description.

Parameters:
  • dataset – TensorFlow dataset

  • train_pct – the percentage of data used for training

  • val_pct – the percentage of data used for validation

  • test_pct – the percentage of data used for testing

  • shuffle (bool) – if True, shuffle the dataset

  • data_size (int) – number of data points to use; if None, use all data in the dataset

Returns:

training, validation and test datasets

Return type:

TensorFlow dataset
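For example, assuming dataset is a TensorFlow dataset built from the data (e.g. as sketched above), and assuming the split arguments are given as fractions (a sketch):

    from atomdnn.data import split_dataset

    train_ds, val_ds, test_ds = split_dataset(dataset, 0.7, val_pct=0.15, test_pct=0.15, shuffle=True)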

atomdnn.data.get_input_dict(dataset)[source]
Parameters:
  • dataset – TensorFlow dataset

Returns:

input dictionary, see Data for the structure of the dictionary

Return type:

dictionary

atomdnn.data.get_output_dict(dataset)[source]
Parameters:
  • dataset – TensorFlow dataset

Returns:

output dictionary, see Data for the structure of the dictionary

Return type:

dictionary
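For example, the input and output dictionaries of a split can be recovered afterwards (a sketch, assuming train_ds comes from the split above):

    from atomdnn.data import get_input_dict, get_output_dict

    train_inputs = get_input_dict(train_ds)
    train_outputs = get_output_dict(train_ds)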

atomdnn.data.slice_dataset(dataset, start, end)[source]

Get a slice of the dataset.

Parameters:
  • dataset – input dataset

  • start – starting index

  • end – ending index

Returns:

TensorFlow dataset
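For example (a sketch; the 0-based starting index shown is an assumption):

    from atomdnn.data import slice_dataset

    subset = slice_dataset(dataset, 0, 100)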