Cloud-Enabled MQTT Data Analytics and Attack Classification System
14763/18763 Systems and Tool Chains for AI Engineers Final Project
This individual final project for my Systems and Tool Chains course focused on detecting and classifying cyberattacks by analyzing large-scale MQTT (Message Queuing Telemetry Transport) network traffic. MQTT is a lightweight messaging protocol widely used in Internet of Things (IoT) applications, and its broad adoption makes it a frequent target of cyberattacks. The objective was to create a cloud-deployable system capable of ingesting and processing MQTT network data to identify and classify network-based attacks.
I used Apache Spark to import a 10-million-row MQTT dataset into a Jupyter environment, then wrote the full dataset to a PostgreSQL database for structured analysis. I queried the database to understand and document feature semantics, which informed the design of a Spark-based preprocessing pipeline. The pipeline pruned columns, handled outliers, and assembled the remaining features into vectors for model input. I trained and compared multiple machine learning models, including logistic regression and random forest in SparkML and shallow and deep neural networks in PyTorch. To improve performance, I tuned hyperparameters and used k-fold cross-validation to evaluate each configuration. I deployed the entire project to Google Cloud Platform on Dataproc clusters, which enabled scalable parallel processing and end-to-end pipeline evaluation.
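
The k-fold cross-validation and hyperparameter tuning step mentioned above can be illustrated without Spark or PyTorch. What follows is a minimal, library-free sketch of the mechanics only, not the project's code: the one-feature toy data, the nearest-neighbor classifier, and every function name (k_fold_indices, knn_predict, cross_validate, grid_search) are assumptions made up for illustration. Each candidate hyperparameter value is scored by training on k-1 folds and measuring accuracy on the held-out fold, and the value with the best mean score is selected.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k disjoint, near-equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def knn_predict(train_x, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = sum(train_y[i] for i in nearest[:n_neighbors])
    return 1 if 2 * votes > n_neighbors else 0


def cross_validate(xs, ys, n_neighbors, k=5, seed=0):
    """Mean held-out accuracy over k folds for one hyperparameter value."""
    accuracies = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held_out_set = set(held_out)
        # Train only on rows outside the held-out fold.
        train = [i for i in range(len(xs)) if i not in held_out_set]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct = sum(
            knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return mean(accuracies)


def grid_search(xs, ys, candidates, k=5, seed=0):
    """Score every candidate with k-fold CV and return (best, all scores)."""
    scores = {c: cross_validate(xs, ys, c, k, seed) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (made up): one numeric feature; label 1 = attack, 0 = benign.
xs = [float(v) for v in range(20)] + [float(v) for v in range(100, 120)]
ys = [0] * 20 + [1] * 20
best, scores = grid_search(xs, ys, candidates=[1, 3, 5])
print(best, scores)
```

In a Spark setting, the same pattern is typically expressed with a parameter grid and a cross-validating estimator rather than hand-written loops; the loops are spelled out here only to make the train/held-out split explicit.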