Comparative Evaluation of Attention Mechanisms Across CNN Architectures and Image Classification Datasets
Convolutional Neural Networks (CNNs) have established themselves as fundamental tools in computer vision, demonstrating strong performance across tasks including image classification, object detection, and medical image analysis. Their key strength lies in the ability to automatically extract hierarchical feature representations from visual inputs while preserving spatial structure. Incorporating attention mechanisms into CNN architectures has led to notable performance gains by enabling networks to focus on the most informative features, inspired by biological visual attention. Among recent attention approaches, Squeeze-and-Excitation (SE), Convolutional Block Attention Module (CBAM), Coordinate Attention (CA), and Efficient Multi-scale Attention (EMA) have shown considerable promise across a range of network designs. This study conducts a systematic comparative evaluation of these four attention mechanisms integrated into four architecturally distinct CNNs (ResNet-18, ResNet-50, EfficientNet-B0, and GoogLeNet), tested on datasets of increasing complexity: MNIST (handwritten digits), CIFAR-10 (natural objects), CIFAR-100 (fine-grained categories), and AppleLeaf9 (plant disease classification). All models were trained under uniform CUDA-accelerated settings and assessed using accuracy or F1-score, chosen according to dataset characteristics. Findings indicate that the benefit of attention mechanisms is closely tied to dataset complexity, yielding substantial improvements on challenging multi-class datasets such as CIFAR-100 while offering limited gains on simpler benchmarks like MNIST. Performance also varied across architectures, with deeper networks and different design paradigms responding distinctly to each attention module. These results offer practical guidance for selecting suitable attention-architecture combinations in both research and applied settings.
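To make the squeeze-excite-scale pattern underlying the evaluated SE module concrete, the following is a minimal NumPy sketch, not the implementation used in this study; the channel count, reduction ratio, and random weights are purely illustrative.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation applied to one feature map x of shape (C, H, W)."""
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = x.mean(axis=(1, 2))
    # Excitation: bottleneck MLP (ReLU then sigmoid) produces per-channel gates
    h = np.maximum(0.0, w1 @ z + b1)            # (C/r,) hidden descriptor
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))    # (C,) gates in (0, 1)
    # Scale: reweight each channel map by its learned importance
    return x * s[:, None, None]

# Illustrative example: C = 4 channels, reduction ratio r = 2
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
w1, b1 = rng.standard_normal((2, 4)), np.zeros(2)
w2, b2 = rng.standard_normal((4, 2)), np.zeros(4)
y = se_block(x, w1, b1, w2, b2)
```

CBAM extends this idea with a spatial attention map alongside the channel gates, while CA and EMA factorize attention along coordinate directions and scales; all share the same reweight-the-features principle shown above.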