Datasets
XANESX
This dataset is compatible with the models MLP, CNN, LSTM,
AE-MLP, AE-CNN, AEGAN-MLP.
It supports both forward train_xyz and reverse train_xanes training modes.
The XANESX dataset is a general-purpose dataset for
representing featurised molecular structures and XANES spectra.
Each sample corresponds to a single molecule and is stored as a
serialized data object with the attributes of input features (x), target features (y),
and the associated energy grid (e). Molecular structures are
converted into fixed-length feature representations
using the descriptor(s) defined in the Descriptors section.
XANES spectra are read from text files and can be optionally transformed during preprocessing.
Input file:
type: xanesxparams(dict, optional):preload(int,true): If true, preload the entire dataset into RAM; otherwise load samples on-the-fly.fourier(float,false): If true, apply Fourier transformation to the XANES spectragaussian(int,false): If true, apply Gaussian transformation to the XANES spectrawidths_eV(float,[0.5, 1.0, 2.0, 4.0]): Widths (eV) of the Gaussian basis functionbasis_stride(int,4): Stride between Gaussian centers on the energy grid
- Example:
dataset: type: xanesx root_path: data/fe/processed_train xyz_path: data/fe/xyz_train xanes_path: data/fe/xanes_train params: fourier: True
Graph
This dataset is compatible with the models GNN
and supports the forward training mode train_xyz only.
The Graph dataset represents molecular structures as graphs. Each molecule is encoded as a graph derived from its atomic geometry, where atom are represented as nodes and interatomic relationships are represented as edges.
Node features include one-hot–encoded atomic numbers and an indicator specifying the absorber atom.
Edge features are generated using a wACSF-style radial Gaussian expansion,
capturing local geometric information over a fixed radial grid defined by r_min, r_max, and the number of points n.
Graph-level attributes are computed using the descriptor(s) defined in the Descriptors section.
The target features XANES spectra (y) and the corresponding energy grid (e) are defined as additional global attributes.
Input file:
type: graphparams(dict, optional):preload(int,true): If true, preload the entire dataset into RAM; otherwise load samples on-the-fly.n(int,16): Number of radial basis functions used to encode each edge.r_min(float,0.0): Minimum interatomic distance for the radial edge encoding.r_max(float,4.0): Maximum interatomic distance for the radial edge encoding.fourier(float,false): If true, apply Fourier transformation to the XANES spectra.gaussian(int,false): If true, apply Gaussian transformation to the XANES spectra.widths_eV(float,[0.5, 1.0, 2.0, 4.0]): Widths (eV) of the Gaussian basis function.basis_stride(int,4): Stride between Gaussian centers on the energy grid.
- Example:
dataset: type: graph root_path: data/graph-set/processed_train xyz_path: data/graph-set/xyz_train xanes_path: data/graph-set/xanes_train params: r_max: 5.0
E3EEmbed
This dataset is compatible with the E3EEmbed models
and supports the forward training mode train_xyz only.
The e3eembed dataset represents molecular structures using absorber-centred atomic environments derived directly from Cartesian geometry. Each structure is defined by its atomic numbers (z), Cartesian coordinates (pos), and a mask indicating valid atoms within the system.
Atoms are treated as nodes, with the absorbing atom explicitly identified (by convention index 0). Node features are not provided as explicit one-hot encodings; instead, atomic numbers are mapped to learnable embeddings within the model.
Geometric relationships between atoms are not precomputed as edge features. Instead, relative positions with respect to the absorber atom are constructed on-the-fly, enabling the model to compute interatomic distances, unit vectors, and neighbourhood information dynamically.
This formulation allows the use of equivariant message passing, where rotationally consistent features are built directly from interatomic vectors and radial basis expansions during the forward pass, rather than relying on fixed descriptors such as wACSF.
Optional graph-level attributes may be included using the descriptor(s) defined in the :ref:Descriptors <descriptors> section. The target XANES spectra (y) and the corresponding energy grid (e) are stored as global attributes associated with each structure.
Input file:
type: graphparams(dict, optional):preload(int,true): If true, preload the entire dataset into RAM; otherwise load samples on-the-fly.fourier(float,false): If true, apply Fourier transformation to the XANES spectra.gaussian(int,false): If true, apply Gaussian transformation to the XANES spectra.widths_eV(float,[0.5, 1.0, 2.0, 4.0]): Widths (eV) of the Gaussian basis function.basis_stride(int,4): Stride between Gaussian centers on the energy grid.
- Example:
dataset: type: graph root_path: data/graph-set/processed_train xyz_path: data/graph-set/xyz_train xanes_path: data/graph-set/xanes_train params:
Multihead
This dataset is compatible with the models Multihead-MLP,
Multihead-CNN.
It supports both forward train_xyz and reverse train_xanes training modes.
The Multihead dataset extends the XANESX dataset to support multi-output (multi-head) learning,
where a single molecular structure is associated with multiple XANES spectra.
The dataset accepts multiple xyz_path and xanes_path entries.
Each entry corresponds to a distinct prediction head.
During preprocessing, the dataset automatically matches files with the same stem across all provided paths.
Each sample corresponds to a single molecule and is stored as a
serialized data object with the attributes of input features (x), target features (y),
the associated energy grid (e), head index (head_idx), and head name (head_name).
Molecular structures are
converted into fixed-length feature representations using
the descriptor(s) defined in the Descriptors section.
XANES spectra are read from text files and can be optionally transformed
(fourier or gaussian transformations) during preprocessing.
The head index identifies the output head associated with the spectrum,
and head name derived from the directory from which the spectrum was loaded.
type: multiheadparams(dict, optional):preload(int,true): If true, preload the entire dataset into RAM; otherwise load samples on-the-fly.n(int,16): Number of radial basis functions used to encode each edge.r_min(float,0.0): Minimum interatomic distance for the radial edge encoding.r_max(float,4.0): Maximum interatomic distance for the radial edge encoding.fourier(float,false): If true, apply Fourier transformation to the XANES spectra.gaussian(int,false): If true, apply Gaussian transformation to the XANES spectra.widths_eV(float,[0.5, 1.0, 2.0, 4.0]): Widths (eV) of the Gaussian basis function.basis_stride(int,4): Stride between Gaussian centers on the energy grid.
- Example:
dataset: type: multihead root_path: data/multihead/processed_train xyz_path: - data/multihead/train/xyz1 - data/multihead/train/xyz2 - data/multihead/train/xyz3 xanes_path: - data/multihead/train/xanes1 - data/multihead/train/xanes2 - data/multihead/train/xanes3 params: preload: False
Transformer
This dataset is compatible with the models Transformer
and supports the forward training mode train_xyz only.
The Transformer dataset can be applied for transformer-based architectures in XANESNET.
Each sample corresponds to a single molecule and is encoded with a set of attributes of
per-atom MACE embeddings (mace), auxiliary descriptors (desc),
atomic positions (pos), atomic weights (weight), and padding masks (mask),
XANES spectra as target features (y) and associated energy grid (e) .
Both the MACE and auxiliary descriptors are configurable via the :ref:Descriptors <descriptors> section. XANES spectra are read from text files and can be optionally transformed during preprocessing.
type: transformerparams(dict, optional):preload(int,true): If true, preload the entire dataset into RAM; otherwise load samples on-the-fly.fourier(float,false): If true, apply Fourier transformation to the XANES spectra.gaussian(int,false): If true, apply Gaussian transformation to the XANES spectra.widths_eV(float,[0.5, 1.0, 2.0, 4.0]): Widths (eV) of the Gaussian basis function.basis_stride(int,4): Stride between Gaussian centers on the energy grid.
- Example:
dataset: type: transformer root_path: data/fe/processed_train xyz_path: data/fe/xyz_train xanes_path: data/fe/xanes_train params: fourier: True
EnvEmbed
This dataset is compatible with the models EnvEmbed.
It supports the forward training mode train_xyz only.
The EnvEmbed dataset is designed for environment embedding-based architectures in XANESNET.
Each sample corresponds to a single molecule and is represented with a set of attributes including
descriptor features (desc) computed from the molecular structure
(configurable via the Descriptors), length of the descriptor features (lengths),
distances of all atoms to the absorber atom (dist),
coefficient of Gaussian basis transformation of the spectra (c*),
and XANES spectra as target features (y).
type: envembedparams(dict, optional):preload(int,true): If true, preload the entire dataset into RAM; otherwise load samples on-the-fly.widths_eV(float,[0.5, 1.0, 2.0, 4.0]): Widths (eV) of the Gaussian basis function.basis_stride(int,4): Stride between Gaussian centers on the energy grid.
- Example:
dataset: type: envembed root_path: data/fe/processed_train xyz_path: data/fe/xyz_train xanes_path: data/fe/xanes_train params: preload: True