Distributed machine learning pipeline built on the FRACTAL (FRench ALS Clouds from TArgeted Landscapes) LiDAR dataset, which contains 9.3 billion 3D points. A Random Forest classifier was trained for 7-class semantic segmentation, with preprocessing and feature engineering implemented in Spark.
Key feature engineering steps included vertical coordinate normalization, NDVI computation from the red and near-infrared spectral bands, and intensity-derived metrics. Scalability was evaluated across a range of executor and memory configurations on Spark clusters with AWS S3 integration.
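
As a rough illustration of two of the feature engineering steps named above, the following plain-Python sketch computes NDVI and a normalized height per point. It is not the project's code: the field names (`tile_id`, `z`, `red`, `nir`) are hypothetical, and subtracting the lowest z in each tile is only one assumed way to normalize the vertical coordinate.

```python
from collections import defaultdict


def ndvi(nir, red):
    """NDVI = (NIR - Red) / (NIR + Red); returns 0.0 when both bands are zero."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def add_features(points):
    """Add z_norm and ndvi to each point.

    points: list of dicts with hypothetical keys tile_id, z, red, nir.
    """
    # Assumption: the lowest z in each tile serves as a crude ground reference.
    z_min = defaultdict(lambda: float("inf"))
    for p in points:
        z_min[p["tile_id"]] = min(z_min[p["tile_id"]], p["z"])

    return [
        {
            **p,
            "z_norm": p["z"] - z_min[p["tile_id"]],
            "ndvi": ndvi(p["nir"], p["red"]),
        }
        for p in points
    ]


# Tiny made-up sample, purely for illustration.
sample = [
    {"tile_id": "a", "z": 102.0, "red": 50, "nir": 150},
    {"tile_id": "a", "z": 100.0, "red": 80, "nir": 80},
    {"tile_id": "b", "z": 12.5, "red": 0, "nir": 0},
]
features = add_features(sample)
```

In PySpark the same per-tile minimum could be expressed as a window aggregate, for example `F.min("z").over(Window.partitionBy("tile_id"))`, so that the normalization runs in a distributed fashion rather than in a Python loop.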
Apache Spark, PySpark, Random Forest, Docker, AWS S3, LiDAR point cloud data