Cloud-Enabled MQTT Data Analytics and Attack Classification System

14763/18763 Systems and Tool Chains for AI Engineers Final Project

This individual final project for my Systems and Tool Chains course focused on detecting and classifying cyberattacks by analyzing large-scale MQTT (Message Queuing Telemetry Transport) network traffic. MQTT is a lightweight messaging protocol widely used in Internet of Things (IoT) applications, and its broad adoption makes it a frequent target of cyberattacks. The objective was to create a cloud-deployable system capable of ingesting and processing MQTT network data to identify and classify network-based attacks.

I used Apache Spark to import a 10-million-row MQTT dataset into a Jupyter environment, then wrote the full dataset to a PostgreSQL database for structured analysis. I queried the database to understand and document feature semantics, which supported the creation of a Spark-based preprocessing pipeline. The pipeline included column pruning, outlier handling, and vectorization of features for model input. I trained and compared multiple machine learning models, including logistic regression and random forest in SparkML and shallow and deep neural networks in PyTorch. To improve performance, I applied k-fold cross-validation and hyperparameter tuning. The entire project was deployed to Google Cloud Platform using Dataproc clusters to enable scalable parallel processing and end-to-end pipeline evaluation.
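The three preprocessing steps named above (column pruning, outlier handling, and feature vectorization) can be illustrated with a minimal, dependency-free Python sketch. This is not the project's Spark pipeline: the column names, the sample rows, and the choice of IQR-based clipping for outliers are all assumptions made for illustration only.

```python
from statistics import quantiles

# Hypothetical sample rows; column names and values are illustrative only,
# not the real dataset schema.
rows = [
    {"frame_no": 1, "tcp_len": 60.0, "msg_type": "connect", "label": "legitimate"},
    {"frame_no": 2, "tcp_len": 62.0, "msg_type": "publish", "label": "legitimate"},
    {"frame_no": 3, "tcp_len": 64.0, "msg_type": "publish", "label": "flood"},
    {"frame_no": 4, "tcp_len": 66.0, "msg_type": "connect", "label": "legitimate"},
    {"frame_no": 5, "tcp_len": 5000.0, "msg_type": "publish", "label": "flood"},
]


def prune_columns(rows, keep):
    """Column pruning: keep only the columns listed in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


def clip_outliers(rows, column, k=1.5):
    """Outlier handling (assumed strategy): clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles([row[column] for row in rows], n=4, method="inclusive")
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [{**row, column: min(max(row[column], low), high)} for row in rows]


def vectorize(rows, numeric, categorical):
    """Vectorization: numeric values first, then one-hot encoded categories."""
    vocab = {col: sorted({row[col] for row in rows}) for col in categorical}
    vectors = []
    for row in rows:
        vec = [float(row[col]) for col in numeric]
        for col in categorical:
            vec.extend(1.0 if row[col] == value else 0.0 for value in vocab[col])
        vectors.append(vec)
    return vectors, vocab


pruned = prune_columns(rows, ["tcp_len", "msg_type", "label"])
clipped = clip_outliers(pruned, "tcp_len")
features, vocab = vectorize(clipped, numeric=["tcp_len"], categorical=["msg_type"])

for vec, row in zip(features, clipped):
    print(vec, row["label"])
```

In a Spark setting the same ideas would map onto DataFrame column selection and a feature-assembly stage; the sketch only shows the logic of each step on a handful of rows.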

Figure: Block diagram of the system for detecting denial-of-service attacks carried out via the MQTT protocol

The best-performing model achieved 84% accuracy in classifying network attacks. The system processed high-volume MQTT traffic in a cloud environment, and the pipeline handled ingestion, preprocessing, model inference, and evaluation in a scalable and maintainable way.
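To show how an accuracy figure like this is typically estimated and used to pick a configuration, here is a minimal plain-Python sketch of k-fold cross-validation combined with a small hyperparameter grid search. It is an illustration under stated assumptions, not the project's code: a toy nearest-neighbor classifier stands in for the actual SparkML and PyTorch models, and the data, labels, and grid values are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def knn_predict(train_x, train_y, x, n_neighbors):
    """Toy stand-in model: majority label among the nearest training points."""
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    votes = Counter(train_y[i] for i in by_distance[:n_neighbors])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(xs, ys, n_neighbors, k=5, seed=0):
    """Mean held-out accuracy over k folds for one hyperparameter value."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        preds = [knn_predict(train_x, train_y, xs[i], n_neighbors) for i in held_out]
        scores.append(accuracy([ys[i] for i in held_out], preds))
    return sum(scores) / len(scores)


def grid_search(xs, ys, grid, k=5):
    """Score every value in the grid and return the best one with all scores."""
    results = {value: cross_val_accuracy(xs, ys, value, k) for value in grid}
    best = max(results, key=results.get)
    return best, results


# Hypothetical, well-separated two-class data (labels are illustrative).
xs = [[i * 0.1, 0.0] for i in range(10)] + [[10 + i * 0.1, 0.0] for i in range(10)]
ys = ["legitimate"] * 10 + ["dos"] * 10

grid = [1, 3, 15]  # candidate neighbor counts (assumed values)
best, results = grid_search(xs, ys, grid)
print("best n_neighbors:", best)
print("mean accuracy per setting:", results)
```

Each setting is scored only on rows it was not trained on, so the comparison between settings reflects held-out performance rather than fit to the training data.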

This project integrated cloud computing, database design, machine learning, and data engineering to solve a real-world network security problem. It demonstrated how scalable tools like Apache Spark, PostgreSQL, and GCP can be combined to build robust analytics pipelines. Through this work, I developed practical experience in cloud-based ML deployment and real-time attack classification, and gained a deeper understanding of designing secure, scalable systems for IoT networks.