When Text Is Not Enough: Structural Limits of Text-Only Transformer-Based Emotion Classification
Synopsis
This study investigates whether the limitations observed in text-only transformer-based emotion classification pipelines reflect implementation shortcomings or structural constraints inherent to unimodal modeling. A pipeline was constructed from unscripted dialogue from MasterChef Polska, combining automated speech-to-text transcription, neural machine translation, and benchmarking across SVM, Bi-LSTM, and RoBERTa architectures. Although the fine-tuned RoBERTa model achieved substantially higher accuracy (0.755) than the SVM and Bi-LSTM models, confusion matrix analysis and explainable AI techniques revealed persistent structural asymmetries: uneven performance across emotion categories, confusion between the high-arousal emotions anger and joy, and translation-induced distortions. Evaluation against automated labels further exposed a “Ground Truth Paradox,” in which models are validated against one another's outputs rather than against human-verified labels. The findings indicate that increased architectural capacity improves performance but does not resolve the structural limitations of text-only emotion classification.






