Gesture Recognition Based Screen Control

24-782 Machine Learning and Artificial Intelligence for Engineers

Hand gesture recognition (HGR) is an intuitive and increasingly relevant interface for human-computer interaction, with applications ranging from AR/VR to assistive technology. However, many high-performing systems rely on depth sensors or optical flow, which increase latency, cost, and complexity. This project aimed to develop a lightweight, real-time gesture recognition pipeline that operates solely on RGB video input, enabling accurate screen control with minimal computational overhead. By combining MediaPipe Hands for landmark extraction with a Slow–Fast temporal convolutional network (TCN), we built a system designed for everyday environments and commodity hardware.
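The feature stage can be illustrated with a short sketch: MediaPipe Hands yields 21 (x, y, z) landmarks per frame, which are flattened into a 63-dimensional vector per frame before being passed to the temporal network. The wrist-relative re-centering shown here is an illustrative normalization choice, not necessarily the one used in the project.

```python
import numpy as np

def landmarks_to_features(landmarks):
    """Convert per-frame hand landmarks into flat feature vectors.

    landmarks: array of shape (T, 21, 3), one row of 21 (x, y, z)
    MediaPipe keypoints per frame.
    Returns: array of shape (T, 63).
    """
    landmarks = np.asarray(landmarks, dtype=np.float32)
    # Re-center each frame on the wrist (landmark 0) so features are
    # invariant to where the hand sits in the camera frame (our assumption).
    centered = landmarks - landmarks[:, :1, :]
    return centered.reshape(len(centered), -1)
```

A sequence of such vectors, stacked over time, forms the input to the temporal classifier.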

We used the IPN Hand dataset and a custom seven-class screen-control dataset as benchmarks. MediaPipe Hands provided frame-wise keypoints, which were fed into the dual-path Slow–Fast TCN for temporal gesture classification. The slow path captured long-term dynamics, while the fast path responded to short-term motion. We trained the model first on the IPN dataset and then fine-tuned it on clean and messy versions of our custom dataset to adapt it to real-world screen-control gestures. The model was implemented in PyTorch, with training governed by learning-rate scheduling and iterative fine-tuning. Performance was compared against the IPN benchmark model using RGB-only and RGB+flow inputs.
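The dual-path design can be sketched as follows. This is a minimal, hypothetical PyTorch version for illustration: the slow path sees a temporally subsampled copy of the landmark sequence with more channels, the fast path sees every frame with fewer channels, and the two are pooled over time, fused, and classified. Layer names, channel widths, and the stride `alpha` are our assumptions, not the project's exact configuration.

```python
import torch
import torch.nn as nn

class SlowFastTCN(nn.Module):
    """Illustrative dual-path temporal ConvNet over landmark sequences.

    Input: (batch, 63, T) — 63 landmark features per frame, T frames.
    """
    def __init__(self, in_dim=63, num_classes=7, alpha=4):
        super().__init__()
        self.alpha = alpha  # temporal stride of the slow path
        # Slow path: subsampled frames, wider channels -> long-term dynamics.
        self.slow = nn.Sequential(
            nn.Conv1d(in_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fast path: full frame rate, narrow channels -> short-term motion.
        self.fast = nn.Sequential(
            nn.Conv1d(in_dim, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(128 + 32, num_classes)

    def forward(self, x):  # x: (B, 63, T)
        s = self.slow(x[:, :, ::self.alpha])  # keep every alpha-th frame
        f = self.fast(x)
        # Average-pool each path over time, then fuse and classify.
        z = torch.cat([s.mean(dim=2), f.mean(dim=2)], dim=1)
        return self.head(z)
```

With a 32-frame clip, `SlowFastTCN()(torch.randn(2, 63, 32))` produces class logits of shape `(2, 7)` for the seven screen-control gestures.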

Figure: Our model architecture.

Our RGB-only model achieved 88.84% accuracy on the IPN dataset with 18.85 ms latency, outperforming the RGB-only benchmark (83.59%) while matching its speed, and exceeding the RGB+flow benchmark in both accuracy and efficiency. On our custom messy dataset, the model reached 98.85% accuracy, demonstrating strong generalization to noisy, real-world inputs. The dual-path architecture and the use of MediaPipe landmarks proved key to maintaining performance under occlusion and under variations in hand position and camera angle.

This project demonstrated that high-accuracy, real-time gesture recognition is achievable with RGB-only input and lightweight TCNs. Our system rivals state-of-the-art multimodal approaches while remaining deployable on commodity hardware. The modular design, use of few-shot learning, and robustness to noise make it practical for real-world screen-control applications. Limitations include handedness imbalance, dependency on stable landmark detection, and evaluation only on laptop-class GPUs. Future work will address these by expanding the dataset, incorporating domain adaptation, and optimizing for mobile deployment through pruning and quantization.
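The compression step mentioned above could look roughly like this: magnitude-based pruning of the convolutional weights, followed by dynamic int8 quantization of the linear layers. The model below is a stand-in, and the 50% pruning ratio is an illustrative choice; the real pipeline would tune these against accuracy.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for the gesture classifier (not the project's actual network).
model = nn.Sequential(
    nn.Conv1d(63, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 16, 7),  # assumes clips of 16 frames
)

# Zero out the 50% smallest-magnitude weights in each conv layer.
for m in model.modules():
    if isinstance(m, nn.Conv1d):
        prune.l1_unstructured(m, name="weight", amount=0.5)
        prune.remove(m, "weight")  # bake the sparsity into the weight tensor

# Dynamic quantization stores Linear weights as int8 and dequantizes
# on the fly, shrinking the model for mobile deployment.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```

Pruning alone does not reduce latency without sparse kernels, but combined with quantization it cuts model size substantially, which matters most for on-device deployment.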