In-Class Kaggle: Predicting crystal structure from X-ray diffraction

  • Who: a solo project for an in-class Kaggle competition
  • When: Sept. 2020 - Mar. 2021
  • Where: NANO 281 Fall 2020 Lab3
  • What: Predicting the crystal structure (14 Bravais lattices) from tabular XRD data
  • How: TensorFlow 2.x (Conv-1D). CV-stacking. Ensembling. Scikit-learn. LightGBM. CatBoost. XGBoost. AdaBoost. RandomForest. Data transformation (n-root transformation). 3-step fine-tuning (custom Adam, Adamax, Adadelta) of self-built models. Learning-rate control (ReduceLROnPlateau). A sketch of how these pieces fit together follows this list.
  • Learn: Handling structured (tabular) datasets. Building a Conv-1D architecture from scratch. Statistical data analysis. Data augmentation methods for XRD datasets.
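
The listed pieces fit together roughly as below. This is a minimal sketch, not the competition code: the pattern length, filter counts, epoch counts, and the placeholder data are all illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks

N_POINTS, N_CLASSES = 1800, 14  # assumed XRD pattern length; 14 Bravais lattices

# Placeholder data standing in for the real tabular XRD dataset (assumption).
x_train = np.random.rand(512, N_POINTS, 1); y_train = np.random.randint(0, N_CLASSES, 512)
x_val   = np.random.rand(128, N_POINTS, 1); y_val   = np.random.randint(0, N_CLASSES, 128)

def build_conv1d():
    """A small VGG-flavored 1D-CNN for XRD spectra."""
    return models.Sequential([
        layers.Input(shape=(N_POINTS, 1)),
        layers.Conv1D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])

model = build_conv1d()
# Learning-rate control: halve the LR when validation loss plateaus.
reduce_lr = callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3)

# 3-step fine-tuning: re-compile with a different optimizer at each stage
# (Adam -> Adamax -> Adadelta); the learning rates here are assumptions.
for opt in (optimizers.Adam(1e-3), optimizers.Adamax(5e-4),
            optimizers.Adadelta(learning_rate=1.0)):
    model.compile(optimizer=opt, loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=3, batch_size=128, callbacks=[reduce_lr])
```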

  • Through several tests with hyperparameter-optimized ML models, including boosting (LightGBM, CatBoost, XGBoost, and others) and bagging (RandomForest and ExtraTrees), the best accuracy was achieved with a 1D-CNN architecture. Based on the 1D-CNN models with 3-step fine-tuning and CV-stacking (n-folds: 40; sketched after this list), I reached 2nd place on the public LB and finally ranked 1st place on the private LB! Yeah! nano281
  • This result used no data augmentation and no ensembling with the other ML models, so I believe there is room to improve accuracy by enlarging the dataset (augmentation). LightGBM was a very powerful model on this structured dataset, but my self-built pseudo-VGG16 model showed better performance (a quick LightGBM baseline is sketched after this list for comparison).
  • The repository is open to the public but does not include all materials and information: the project is still ongoing, and its proprietary rights belong to Prof. Ong, even though I am one of the contributors to the repository and project. Please contact me if you would like to see the details.
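
The CV-stacking step works as sketched below: each base model is trained on 39 of 40 folds and predicts the held-out fold, so every sample gets an out-of-fold prediction, and those predictions train a meta-learner. The base and meta models shown (RandomForest, logistic regression) are illustrative stand-ins; in the project, 1D-CNN fold models were stacked.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

N_CLASSES = 14
X = np.random.rand(1400, 50)          # placeholder XRD feature table (assumption)
y = np.repeat(np.arange(N_CLASSES), 100)  # 100 samples per Bravais-lattice class

def oof_probabilities(model_factory, X, y, n_splits=40):
    """Out-of-fold class probabilities: each sample is predicted only by
    a model that never saw it during training."""
    oof = np.zeros((len(X), N_CLASSES))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in skf.split(X, y):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        oof[val_idx] = model.predict_proba(X[val_idx])
    return oof

# Base-model OOF predictions become features for a simple meta-learner.
oof_rf = oof_probabilities(lambda: RandomForestClassifier(n_estimators=100), X, y)
meta = LogisticRegression(max_iter=1000).fit(oof_rf, y)
```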
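For the LightGBM comparison mentioned above, a baseline of roughly this shape is a natural starting point. This is a sketch only; the tuned competition parameters are not public, and the data here is a placeholder.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

X = np.random.rand(1400, 50)       # placeholder XRD feature table (assumption)
y = np.repeat(np.arange(14), 100)  # 14 Bravais-lattice labels

# Multiclass objective is inferred from the labels by the sklearn wrapper.
clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
print(cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())
```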

  • After the Kaggle competition, the following techniques were added and merged into the original code. Together they brought a significant performance gain, reaching 92.5% classification accuracy. Items (1) and (5) are sketched after this list; item (4) is sketched after the backbone links at the end.
    • (1) Data augmentation - from a classmate
    • (2) ICSD dataset - improved dataset quality (from a classmate)
    • (3) A deeper model with shorter FC layers - from pseudo-VGG16 to pseudo-VGG19
    • (4) Global Average Pooling (GAP) top layers - replacing the FC layers
    • (5) Data transformation (n-root transformation)
  • Featured slides: xrd_1, xrd_2, xrd_3, xrd_4, xrd_5, xrd_6, xrd_7, xrd_8
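
Item (1), data augmentation for XRD patterns, is sketched below under an assumption about its form (small pattern shifts plus intensity noise, mimicking sample-displacement and counting errors); the classmate's actual method is not described in this post.

```python
import numpy as np

def augment_xrd(pattern, rng, max_shift=5, noise_scale=0.01):
    """Return a perturbed copy of a 1-D XRD intensity array."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(pattern, shift)               # small 2-theta offset
    noisy = shifted + rng.normal(0.0, noise_scale, size=pattern.shape)
    return np.clip(noisy, 0.0, None)                # keep intensities >= 0

rng = np.random.default_rng(0)
pattern = np.random.default_rng(1).random(1800)     # placeholder XRD pattern
augmented = [augment_xrd(pattern, rng) for _ in range(5)]
```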
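Item (5), the n-root transformation, compresses the large dynamic range of XRD peak intensities so weak reflections are not drowned out by strong ones. A minimal sketch; the value of n used in the project is not stated here, so n=4 is an assumption.

```python
import numpy as np

def n_root_transform(intensity, n=4):
    """Map intensities x -> x**(1/n) to flatten their dynamic range."""
    return np.power(np.asarray(intensity, dtype=float), 1.0 / n)

peaks = np.array([10000.0, 500.0, 20.0])
print(n_root_transform(peaks))  # [10.0, ~4.73, ~2.11] -> much flatter scale
```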

  • Model backbone structures with FC layers: p16, p19

  • Model backbone structures with GAP layers: p16_GAP, p19_GAP
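
The two head variants behind these backbones (item 4 above) can be sketched as follows; the filter counts and pattern length are illustrative assumptions, not the exact p16/p19 configurations.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def backbone(x):
    """VGG-like Conv1D stages; filter counts are assumptions."""
    for filters in (64, 128, 256):
        x = layers.Conv1D(filters, 3, activation="relu", padding="same")(x)
        x = layers.MaxPooling1D(2)(x)
    return x

inp = layers.Input(shape=(1800, 1))  # assumed XRD pattern length
feat = backbone(inp)

# FC head (p16/p19 style): Flatten -> Dense -> softmax.
fc_head = layers.Dense(14, activation="softmax")(
    layers.Dense(256, activation="relu")(layers.Flatten()(feat)))

# GAP head (p16_GAP/p19_GAP style): GlobalAveragePooling -> softmax.
gap_head = layers.Dense(14, activation="softmax")(
    layers.GlobalAveragePooling1D()(feat))

model_fc = models.Model(inp, fc_head)
model_gap = models.Model(inp, gap_head)
```

The GAP head removes the large Flatten-to-Dense weight matrix, which is where most of the parameter savings come from.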