Framework
From raw DFT records to a calibrated two-stage predictor.
The dissertation decomposes bandgap prediction into the parts the data actually asks for: a phase decision for the zero-inflated spike, then a positive-gap regressor for nonmetallic materials.
52.2%
Metallic share
Eg = 0eV ENTRIES
95,920
Nonmetal subset
SENT TO STAGE 2
72% / 8% / 20%
Train / Calibration / Test
PROPORTIONAL PHASE SPLIT
40,098
Holdout test
WITHHELD MATERIALS
Feature path
The feature matrix is small enough to train fast, but still grounded in chemistry.
The notebook begins with Materials Project entries and uses compositional statistics as a portable representation of atomic, compositional, and structural information. The final 87-feature matrix is designed for fast training, repeatable scaling, and leakage-safe validation.
Materials Project API records provide composition, crystal structure, and DFT-PBE bandgap labels.
GPU chunk-wise featurization expands each material into 145 Magpie-style compositional descriptors.
Variance, collinearity, and intra-group filtering compress the matrix to 87 high-signal descriptors.
Target encoding and scaling are fit only on training data to avoid leakage into calibration or test splits.
Pipeline Architecture
Two-stage hybrid pipeline overview.
The diagram illustrates data ingestion from Materials Project records, GPU-resident featurization, a stage-one classifier that routes metals and nonmetals, followed by a stage-two ensemble regressor and a conformal prediction layer producing calibrated uncertainty intervals.

PIPELINE EXECUTION
The Two-Stage Hybrid Architecture.
GPU-resident curation
The corpus is processed on an NVIDIA GeForce RTX 3050 with an approximately 280 MB in-memory footprint, making 200k-entry screening practical on consumer-grade hardware.
Classifier hurdle
A tuned binary classifier separates metals from nonmetals. Lowering the decision threshold to 0.28 prioritizes nonmetal recall, reducing false negatives from 976 to 411.
Nonmetal regressor
Only positive-bandgap entries are passed to an Optuna-tuned XGBoost ensemble trained on log(1 + Eg), isolating the continuous prediction problem from the zero spike.
Bias and uncertainty layer
Bin-wise correction reduces high-energy tail bias, while split conformal prediction converts residuals into calibrated 90% and 95% prediction intervals.
MATHEMATICAL MODELING
Target Transformations & Objective Functions.
Stage 1 target
y maps to 0 for metals, y maps to 1 for nonmetals
The classifier assigns Eg = 0 eV directly when a material is routed as metallic. This prevents the continuous regressor from being trained against the dense zero spike and reduces regression-toward-zero behavior.
Stage 2 target
log(1 + Eg) for positive bandgaps
The positive-gap regressor works on a transformed target to stabilize right-skewed errors, then maps predictions back to electron volts for downstream screening and conformal prediction intervals.