🖼️ RF Modulation Ensembles: Voting Strategy Comparison
This image summarizes the core results from the paper “Majority vs Weighted vs Stacked Voting in RF Modulation Ensembles”.
1. Key Findings
- Accuracy: All three methods (majority, weighted, and stacked) achieved 1.000 accuracy at $K=3$ models.
- Calibration: Stacked voting demonstrated the best calibration, with the lowest Expected Calibration Error (ECE = 0.333) at $K=3$, compared to $0.654$ for both majority and weighted.
- General Performance: Weighted voting is generally expected to outperform majority when confidences are calibrated, while stacked can exceed both if there are diverse base-model errors and enough meta-data.
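The three voting rules above can be sketched as follows. This is a minimal illustration, not the paper's code: the toy probability vectors, the linear meta-layer, and its weights (`meta_w`, `meta_b`) are all hypothetical.

```python
import numpy as np

def majority_vote(probs):
    # probs: (K, C) per-model class probabilities; each model casts one hard vote
    votes = np.argmax(probs, axis=1)
    return int(np.bincount(votes, minlength=probs.shape[1]).argmax())

def weighted_vote(probs):
    # soft (confidence-weighted) vote: average the probability vectors, then argmax;
    # this is where calibrated confidences help weighted beat hard majority
    return int(probs.mean(axis=0).argmax())

def stacked_vote(probs, meta_w, meta_b):
    # stacked: a trained meta-learner (here a hypothetical linear layer) maps the
    # flattened K*C base-model outputs to final class scores
    scores = probs.reshape(-1) @ meta_w + meta_b
    return int(np.argmax(scores))

# Toy example: K=3 models, C=2 classes (e.g. two modulation types).
# Two models lean weakly toward class 0; one leans strongly toward class 1.
probs = np.array([[0.6, 0.4],
                  [0.7, 0.3],
                  [0.1, 0.9]])
print(majority_vote(probs))   # → 0 (two of three hard votes go to class 0)
print(weighted_vote(probs))   # → 1 (the strong 0.9 confidence outweighs the weak votes)

# Hypothetical meta-weights that simply sum per-class probabilities.
meta_w = np.tile(np.eye(2), (3, 1))
print(stacked_vote(probs, meta_w, np.zeros(2)))  # → 1
```

The example shows how the same base-model outputs can yield different ensemble decisions: hard majority ignores confidence, soft weighting uses it, and stacking can learn an arbitrary combination of both.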
2. Accuracy vs. Model Count ($K$)
| # Models (K) | Majority | Weighted | Stacked |
|---|---|---|---|
| 1 | $\approx 0.0$ | $\approx 0.0$ | $\approx 0.0$ |
| 2 | $\approx 0.0$ | $\approx 0.0$ | $\approx 0.35$ |
| 3 | 1.000 | 1.000 | 1.000 |
| 4 | 1.000 | 1.000 | 1.000 |

Observation: Stacked voting showed a noticeable accuracy advantage at $K=2$ models before all three methods converged at $K=3$.
3. Performance Metrics at Max Model Count
| Metric | Majority | Weighted | Stacked |
|---|---|---|---|
| Time-to-First-Byte (TTFB, p50) at $K=4$ | 3.2 ms | 3.2 ms | 3.4 ms |
| Expected Calibration Error (ECE) at $K=3$ | 0.654 | 0.654 | 0.333 |
| Macro-F1 at $K=3$ | 0.400 | 0.400 | 0.400 |

The data suggests that stacked voting is slightly slower in median TTFB at $K=4$ models but provides significantly better calibration.
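For reference, Expected Calibration Error bins predictions by confidence and takes the weighted average of the gap between accuracy and mean confidence within each bin. A minimal sketch follows; the equal-width 10-bin scheme and the toy data are assumptions, not details from the paper:

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    # ECE = sum over bins of (bin weight) * |bin accuracy - bin mean confidence|
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = (predictions[mask] == labels[mask]).mean()   # accuracy in this bin
        conf = confidences[mask].mean()                    # mean confidence in this bin
        ece += mask.mean() * abs(acc - conf)               # weight by bin population
    return ece

# Toy check: a predictor that is 80% confident and correct 80% of the time
# is perfectly calibrated, so its ECE is 0.
conf = np.full(10, 0.8)
pred = np.zeros(10, dtype=int)
lab = np.array([0] * 8 + [1] * 2)
print(expected_calibration_error(conf, pred, lab))  # → 0.0
```

Lower ECE means the ensemble's reported confidence tracks its actual accuracy, which is why the 0.333 for stacked voting indicates better calibration than the 0.654 of the other two methods.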
Would you like to know more about the Stacked meta-learner used in the study or the different input types to the classifier?