Structure Analyzer/Featurizer (SAF)

Software version 0.0.1
Last updated June 23, 2025.

Structure Analyzer/Featurizer (SAF) is a Python package that generates geometric features (interatomic distances, atomic environment information, and coordination numbers) from a folder of CIF (Crystallographic Information File) files.

Citation

If you use SAF in your scientific publication, please cite the following:

as well as the cifkit package, which is the engine of SAF for coordination environment analysis:

Publications and scientific utility

Structure features include interatomic distances, information on atomic environments, and coordination numbers:

  • 94 binary structural features

  • 134 ternary structural features

  • 182 quaternary structural features

SAF was originally developed to determine the coordination number and geometry for each crystallographic site in complex structures [1]. Later, we included interactive functionality for experimentalists and data scientists to generate structural features. These features have been used as input data for ML models to predict crystal structures and their properties [2].

In the above Digital Discovery paper, we describe the performance of SAF in combination with CAF for generating compositional and structural numerical features for ML applications in crystal classification of binary compounds. Figures 1 and 2 below compare the performance of our developments (SAF and CAF) with existing feature generation methods such as JARVIS, MAGPIE, mat2vec, and OLED.


Note

Figure 1: PLS-DA latent value plot using the first two latent value dimensions: (a) JARVIS, (b) MAGPIE, (c) mat2vec, (d) OLED (all sets of features were generated with CBFV), and our developments – (e) CAF and (f) SAF.


Note

Figure 2: SAF + CAF PLS-DA plot.

See also

What’s the difference between SAF and CAF? SAF generates structural features based on crystal structures (CIF files), while CAF generates compositional features based on chemical formulas. You can learn more about CAF at https://bobleesj.github.io/composition-analyzer-featurizer/.

Publications using SAF

Here is a list of publications using SAF for materials analysis and data-driven materials synthesis:

Getting started

We have a command-line Python application. Please visit the Getting started page to learn how to generate features from a folder containing .cif files.

Scope

The current version supports the processing of binary, ternary, and quaternary .cif files containing the following elements:

[Image: table of elements supported by SAF (SAF-supported-elements-table.png)]

Note

The Pauling CN 12 radii values for some gases (N, O, F, Cl, Br, and I) as well as Tc and Sm were interpolated using Gaussian Process Regression. The CIF radii for the aforementioned gases were compiled as averages of low-temperature structures from Persson’s CIF database.
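As a rough illustration of the binary, ternary, and quaternary scope described above, the sketch below pre-screens a folder of .cif files by counting the distinct element symbols in each file's `_chemical_formula_sum` entry. This is not SAF's or cifkit's implementation. The function names are hypothetical, and the sketch assumes the formula is written on the same line as the tag, which is not true of every CIF file.

```python
import re
from pathlib import Path

# Labels for the element counts named in the Scope section.
LABELS = {2: "binary", 3: "ternary", 4: "quaternary"}


def formula_from_cif(text):
    """Return the value of _chemical_formula_sum, or None if it is absent.

    Assumption: the value sits on the same line as the tag.
    """
    match = re.search(
        r"^_chemical_formula_sum[ \t]+(.+)$", text, flags=re.MULTILINE
    )
    if match is None:
        return None
    # Drop surrounding whitespace and optional quotes around the formula.
    return match.group(1).strip().strip("'\"")


def classify_formula(formula):
    """Label a formula by its number of distinct element symbols."""
    elements = set(re.findall(r"[A-Z][a-z]?", formula))
    return LABELS.get(len(elements), "unsupported")


def classify_folder(folder):
    """Map each .cif file name in `folder` to a label (hypothetical helper)."""
    results = {}
    for path in sorted(Path(folder).glob("*.cif")):
        formula = formula_from_cif(path.read_text(errors="ignore"))
        results[path.name] = classify_formula(formula) if formula else "unknown"
    return results
```

For example, `classify_formula("Er3 Co2 In4")` returns `"ternary"`. A check like this only looks at the formula; whether every element is in the supported-elements table would need a separate lookup.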

How to ask for help

  • Do you have any feature requests? Please feel free to open an issue on GitHub using the Bug Report or Feature Request template.

  • Do you have any questions about running the code? Please feel free to reach out to Sangjoon Bob Lee at bobleesj@gmail.com.

  • Do you want to learn how to publish scientific software? SAF is developed and maintained using the Level 5 package standards provided in scikit-package.

How you can contribute to SAF

  • Did you find SAF helpful? You can show support by starring the GitHub repository and recommending it to colleagues.

  • Did you find any bugs? Please feel free to report them by creating a new issue so that we can fix them as soon as possible.

See also

Do you want to learn how to use GitHub and develop Python packages to reuse your code? Please feel free to reach out to Sangjoon Bob Lee (bobleesj@gmail.com). There are resources you can use to get started, such as scikit-package.


Contributors

  • Anton Oliynyk - CUNY Hunter College

  • Arnab Dutta - IIT Kharagpur

  • Nikhil Kumar Barua - University of Waterloo

  • Nishant Yadav - IIT Kharagpur

  • Sangjoon Bob Lee - Columbia University

  • Siddha Sankalpa Sethi - IIT Kharagpur

Acknowledgements

scikit-package is used to accelerate maintaining and developing this Python package. cifkit is used to determine the coordination number and environment of each crystallographic site from each .cif file.