Ensemble methods for RF signal classification combine predictions from multiple neural networks to achieve higher accuracy than any individual model. However, modern neural networks often exhibit poor calibration: their confidence scores do not reflect actual prediction accuracy [1]. This miscalibration is particularly problematic in weighted ensemble voting, where model probabilities directly influence the final decision.
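As a minimal sketch of why calibration matters for weighted voting, the following NumPy example fuses per-model class probabilities by weighted averaging; the function name, weights, and probability values are illustrative assumptions, not taken from the paper. An overconfident member can dominate the fused decision even when it carries a lower voting weight.

```python
import numpy as np

def weighted_soft_vote(model_probs, weights):
    """Combine per-model class probabilities by weighted averaging.

    model_probs: (n_models, n_classes) array, each row one model's softmax output.
    weights:     (n_models,) array of per-model voting weights.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize weights to sum to 1
    fused = weights @ np.asarray(model_probs)  # weighted average of probabilities
    return fused.argmax(), fused               # predicted class and fused distribution

# Illustrative example: an overconfident model can dominate the vote.
probs = [[0.55, 0.45],   # well-calibrated model, mild preference for class 0
         [0.02, 0.98]]   # overconfident model strongly favoring class 1
label, fused = weighted_soft_vote(probs, weights=[0.6, 0.4])
print(label, fused)      # class 1 wins despite the second model's lower weight
```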
We address confidence calibration in RF ensemble classifiers through temperature scaling applied to individual model logits before weighted aggregation. Our contributions include: (1) systematic measurement of calibration quality using the expected calibration error (ECE) and maximum calibration error (MCE) metrics, (2) analysis of how miscalibration affects utility under confidence-based abstention, (3) temperature scaling optimization for the ensemble probability paths (sketched below), and (4) integration hooks for production RF classification systems.
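The sketch below illustrates the core mechanism under stated assumptions: each member's temperature is fit on a held-out calibration split (a coarse grid search over negative log-likelihood stands in for gradient-based optimization), ECE is computed by binning top-class confidences, and the calibrated probabilities are fused with the same weighted vote as above. Function names, the grid range, bin count, and weights are illustrative, not the paper's exact implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick T minimizing negative log-likelihood on a held-out calibration set."""
    logits, labels = np.asarray(logits, float), np.asarray(labels)
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        p = softmax(logits / T)                                   # scale logits by T
        nll = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: sample-weighted average of |accuracy - mean confidence| per confidence bin."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    conf = probs.max(axis=1)
    correct = probs.argmax(axis=1) == labels
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

def calibrated_ensemble_probs(logits_list, temperatures, weights):
    """Temperature-scale each member's logits, then fuse with a weighted vote.

    logits_list:  one (n_samples, n_classes) array per ensemble member.
    temperatures: one fitted scalar T per member.
    weights:      per-member voting weights (illustrative).
    """
    weights = np.asarray(weights, float)
    weights = weights / weights.sum()
    scaled = [softmax(np.asarray(l) / T) for l, T in zip(logits_list, temperatures)]
    return np.tensordot(weights, np.stack(scaled), axes=1)  # (n_samples, n_classes)
```

In this sketch each member gets its own temperature before aggregation, so a single overconfident model is damped prior to the weighted vote rather than corrected after fusion; the grid search is a simple stand-in and could be replaced by any scalar optimizer.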