Framework

From raw DFT records to a calibrated two-stage predictor.

The dissertation decomposes bandgap prediction into the parts the data actually asks for: a phase decision for the zero-inflated spike, then a positive-gap regressor for nonmetallic materials.

52.2%

Metallic share

Eg = 0 eV ENTRIES

95,920

Nonmetal subset

SENT TO STAGE 2

72% / 8% / 20%

Train / Calibration / Test

PROPORTIONAL PHASE SPLIT

40,098

Holdout test

WITHHELD MATERIALS
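A proportional phase split of this kind can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the function name `phase_stratified_split`, the seed, and the synthetic labels are assumptions, and only the 72% / 8% / 20% fractions and the roughly 52/48 metal-to-nonmetal ratio come from the figures above.

```python
import numpy as np

def phase_stratified_split(is_nonmetal, fractions=(0.72, 0.08, 0.20), seed=0):
    """Return train/calibration/test indices that preserve the metal/nonmetal ratio."""
    is_nonmetal = np.asarray(is_nonmetal, dtype=bool)
    rng = np.random.default_rng(seed)
    parts = {"train": [], "calibration": [], "test": []}
    for phase in (False, True):  # split metals and nonmetals separately
        idx = rng.permutation(np.flatnonzero(is_nonmetal == phase))
        n_train = int(round(fractions[0] * idx.size))
        n_cal = int(round(fractions[1] * idx.size))
        parts["train"].append(idx[:n_train])
        parts["calibration"].append(idx[n_train:n_train + n_cal])
        parts["test"].append(idx[n_train + n_cal:])  # remainder goes to test
    return {name: np.sort(np.concatenate(chunks)) for name, chunks in parts.items()}

# Synthetic labels only: 522 metals (0) and 478 nonmetals (1).
labels = np.array([0] * 522 + [1] * 478)
split = phase_stratified_split(labels)
sizes = {name: idx.size for name, idx in split.items()}
test_nonmetal_share = labels[split["test"]].mean()
```

Splitting each phase separately keeps the zero-gap share nearly identical across the three partitions, so the calibration and test sets see the same class balance as training.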

Feature path

The feature matrix is small enough to train fast, but still grounded in chemistry.

The notebook begins with Materials Project entries and uses compositional statistics as a portable representation of atomic, compositional, and structural information. The final 87-feature matrix is designed for fast training, repeatable scaling, and leakage-safe validation.
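To make "compositional statistics" concrete, the sketch below computes fraction-weighted statistics of one elemental property, in the spirit of Magpie-style descriptors. It is an illustration under assumptions: the helper `composition_stats` and the tiny property table are hypothetical, and only Pauling electronegativity for four elements is included, whereas a full featurizer covers many properties.

```python
# Pauling electronegativities for a handful of elements (illustrative subset).
PAULING_EN = {"Na": 0.93, "Cl": 3.16, "Ti": 1.54, "O": 3.44}

def composition_stats(composition, prop):
    """Fraction-weighted statistics of one elemental property for a composition.

    composition: mapping of element symbol to amount, e.g. {"Ti": 1, "O": 2}
    prop: mapping of element symbol to the property value
    """
    total = sum(composition.values())
    fracs = {el: amount / total for el, amount in composition.items()}
    values = [prop[el] for el in fracs]
    mean = sum(frac * prop[el] for el, frac in fracs.items())
    avg_dev = sum(frac * abs(prop[el] - mean) for el, frac in fracs.items())
    return {
        "mean": mean,
        "avg_dev": avg_dev,
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }

tio2 = composition_stats({"Ti": 1, "O": 2}, PAULING_EN)
```

Repeating the same statistics over many elemental properties is how one formula expands into a descriptor vector.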

a.

Materials Project API records provide composition, crystal structure, and DFT-PBE bandgap labels.

b.

GPU chunk-wise featurization expands each material into 145 Magpie-style compositional descriptors.

c.

Variance, collinearity, and intra-group filtering compress the matrix to 87 high-signal descriptors.

d.

Target encoding and scaling are fit only on training data to avoid leakage into calibration or test splits.
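Steps c and d can be sketched in a few lines. This is a simplified illustration, not the dissertation's code: the thresholds `var_tol` and `corr_max` are placeholder values, the greedy collinearity rule is one possible choice, and intra-group filtering and target encoding are omitted. The point it shows is that both the filter and the scaler are fit on training rows only.

```python
import numpy as np

def select_features(X_train, var_tol=1e-8, corr_max=0.95):
    """Column indices kept after a variance filter and a greedy collinearity filter."""
    candidates = [j for j in range(X_train.shape[1]) if X_train[:, j].var() > var_tol]
    selected = []
    for j in candidates:
        # Keep column j only if it is not highly correlated with a column already kept.
        if all(abs(np.corrcoef(X_train[:, j], X_train[:, k])[0, 1]) < corr_max
               for k in selected):
            selected.append(j)
    return selected

def fit_scaler(X_train):
    """Standardization statistics computed from training rows only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # guard against division by zero
    return mean, std

def apply_scaler(X, mean, std):
    return (X - mean) / std

# Synthetic matrix: column 1 is constant, column 2 nearly duplicates column 0.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base,
    np.ones(200),
    2.0 * base + 1e-3 * rng.normal(size=200),
    rng.normal(size=200),
])
X_train, X_held = X[:150], X[150:]

kept = select_features(X_train)            # decided on training rows only
mean, std = fit_scaler(X_train[:, kept])   # statistics from training rows only
train_scaled = apply_scaler(X_train[:, kept], mean, std)
held_scaled = apply_scaler(X_held[:, kept], mean, std)
```

The held-out rows are transformed with the training statistics and never contribute to them, which is the leakage-safe property step d describes.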

Pipeline Architecture

Two-stage hybrid pipeline overview.

The diagram illustrates data ingestion from Materials Project records, GPU-resident featurization, a stage-one classifier that routes metals and nonmetals, followed by a stage-two ensemble regressor and a conformal prediction layer producing calibrated uncertainty intervals.

Figure: pipeline architecture diagram.

PIPELINE EXECUTION

The Two-Stage Hybrid Architecture.

01 RAPIDS cuDF

GPU-resident curation

The corpus is processed on an NVIDIA GeForce RTX 3050 with an approximately 280 MB in-memory footprint, making 200k-entry screening practical on consumer-grade hardware.

02 XGBoost gate

Classifier hurdle

A tuned binary classifier separates metals from nonmetals. Lowering the decision threshold to 0.28 prioritizes nonmetal recall, reducing false negatives (nonmetals misrouted as metals) from 976 to 411.
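The effect of the threshold can be shown on a toy example. The probabilities and labels below are synthetic, the helper names are hypothetical, and the comparison cutoff of 0.5 is an assumption (the text above does not state the starting threshold); only the 0.28 value comes from the pipeline description.

```python
import numpy as np

def false_negatives(y_true, p_nonmetal, threshold):
    """Count nonmetals (label 1) that the gate would route as metals."""
    predicted_nonmetal = p_nonmetal >= threshold
    return int(np.sum((y_true == 1) & ~predicted_nonmetal))

def false_positives(y_true, p_nonmetal, threshold):
    """Count metals (label 0) that the gate would route as nonmetals."""
    predicted_nonmetal = p_nonmetal >= threshold
    return int(np.sum((y_true == 0) & predicted_nonmetal))

# Synthetic classifier outputs: four nonmetals followed by four metals.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
p_nonmetal = np.array([0.90, 0.60, 0.40, 0.30, 0.35, 0.20, 0.10, 0.05])

fn_default = false_negatives(y_true, p_nonmetal, 0.50)  # assumed comparison cutoff
fn_lowered = false_negatives(y_true, p_nonmetal, 0.28)  # threshold from the text
fp_default = false_positives(y_true, p_nonmetal, 0.50)
fp_lowered = false_positives(y_true, p_nonmetal, 0.28)
```

In this toy case the lower cutoff recovers both missed nonmetals at the cost of one extra metal passed to the regressor, which is the trade the gate accepts when it favors recall.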

03 5-model ensemble

Nonmetal regressor

Only positive-bandgap entries are passed to an Optuna-tuned XGBoost ensemble trained on log(1 + Eg), isolating the continuous prediction problem from the zero spike.

04 Conformal PI

Bias and uncertainty layer

Bin-wise correction reduces high-energy tail bias, while split conformal prediction converts residuals into calibrated 90% and 95% prediction intervals.
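Split conformal prediction can be sketched as below. This is a generic illustration that assumes absolute residuals in eV as the conformity score and symmetric intervals clipped at zero; the dissertation's exact score and the bin-wise correction step are not reproduced here, and the function names and calibration data are made up for the example.

```python
import numpy as np

def conformal_halfwidth(cal_true, cal_pred, coverage):
    """Interval half-width from calibration residuals for a target coverage level."""
    scores = np.sort(np.abs(np.asarray(cal_true) - np.asarray(cal_pred)))
    n = scores.size
    rank = int(np.ceil((n + 1) * coverage))  # finite-sample corrected rank
    if rank > n:
        return float("inf")  # too few calibration points for this coverage
    return float(scores[rank - 1])

def prediction_interval(pred, halfwidth):
    """Symmetric interval around the prediction, clipped at 0 eV."""
    pred = np.asarray(pred, dtype=float)
    return np.maximum(pred - halfwidth, 0.0), pred + halfwidth

# Synthetic calibration set: 19 residuals of magnitude 0.05, 0.10, ..., 0.95 eV.
signs = np.where(np.arange(19) % 2 == 0, 1.0, -1.0)
cal_pred = np.full(19, 2.0)
cal_true = cal_pred + signs * 0.05 * np.arange(1, 20)

q90 = conformal_halfwidth(cal_true, cal_pred, 0.90)
q95 = conformal_halfwidth(cal_true, cal_pred, 0.95)
lower, upper = prediction_interval([0.5, 3.0], q90)
```

The half-width is taken from calibration data that the regressor never trained on, which is what allows the stated coverage to hold without distributional assumptions beyond exchangeability.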

MATHEMATICAL MODELING

Target Transformations & Objective Functions.

Stage 1 target

y = 0 for metals (Eg = 0 eV); y = 1 for nonmetals (Eg > 0 eV)

The classifier assigns Eg = 0 eV directly when a material is routed as metallic. This prevents the continuous regressor from being trained against the dense zero spike and reduces regression-toward-zero behavior.
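The routing rule can be written as a single vectorized step. The sketch assumes the stage-one probability of the nonmetal class and the stage-two prediction in eV are already available; `hurdle_predict` and the sample values are hypothetical, while the 0.28 threshold is the one quoted for the gate.

```python
import numpy as np

def hurdle_predict(p_nonmetal, stage2_gap_ev, threshold=0.28):
    """Combine the two stages: 0 eV for routed metals, regressor output otherwise."""
    p_nonmetal = np.asarray(p_nonmetal, dtype=float)
    stage2_gap_ev = np.asarray(stage2_gap_ev, dtype=float)
    return np.where(p_nonmetal >= threshold, stage2_gap_ev, 0.0)

# Three synthetic materials: one clear metal, one borderline, one clear nonmetal.
final_gap = hurdle_predict([0.10, 0.30, 0.95], [1.2, 0.8, 3.0])
```

The regressor's output for the first material is ignored because the gate has already assigned it 0 eV.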

Stage 2 target

log(1 + Eg) for positive bandgaps

The positive-gap regressor works on a transformed target to stabilize right-skewed errors, then maps predictions back to electron volts for downstream screening and conformal prediction intervals.
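The transform and its inverse can be sketched as follows. The helper names are hypothetical, and clipping the back-transformed value at 0 eV is an assumption added for safety rather than something the text specifies.

```python
import numpy as np

def to_log_target(gap_ev):
    """Training target for stage 2: log(1 + Eg), defined for positive gaps only."""
    gap_ev = np.asarray(gap_ev, dtype=float)
    if np.any(gap_ev <= 0):
        raise ValueError("stage 2 expects strictly positive bandgaps")
    return np.log1p(gap_ev)

def to_ev(log_pred):
    """Map a stage 2 prediction back to electron volts, clipped at 0 eV."""
    return np.clip(np.expm1(np.asarray(log_pred, dtype=float)), 0.0, None)

gaps = np.array([0.1, 0.5, 1.0, 2.0, 8.0])
round_trip = to_ev(to_log_target(gaps))

# The same error of 0.1 in log space maps to different errors in eV.
ev_error_small_gap = float(to_ev(np.log1p(0.5) + 0.1) - 0.5)
ev_error_large_gap = float(to_ev(np.log1p(8.0) + 0.1) - 8.0)
```

Because an error in log space scales with (1 + Eg) after back-transformation, large gaps are not allowed to dominate the training loss the way they would in raw eV.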