OpenBench-AR: Reproducible Benchmarks for RF-to-AR Systems
Overall Assessment: This submission (aimed at a workshop such as MLSys Artifact Evaluation or ReproNLP) proposes OpenBench-AR, an open-source suite for standardizing evaluations of RF-to-AR systems (e.g., Wi-Fi/UWB sensing integrated with AR overlays). It addresses a pressing reproducibility problem in ML/HCI by bundling traces, metrics schemas, and auto-generation scripts for figures and tables. The work is timely given the reproducibility crises in ML [1,2], and it fills a gap in RF-AR research, which often lacks traceable workflows. Strengths include practicality (one-command reproduction) and alignment with artifact-evaluation tracks.
Novelty and Significance: RF-AR systems (e.g., for triage or threat detection) are emerging, but fragmented evaluations hinder progress—for example, custom datasets without provenance [1]. OpenBench-AR’s end-to-end reproduction pipeline (traces + metrics + LaTeX generation) is a useful contribution, enabling “one-command” figure reproduction across hardware. It captures “dynamic interactions” [2] via versioned traces and manifests, promoting traceability. Significance is high for niche communities (e.g., wearable AR in obscured environments), with the potential to standardize benchmarks such as latency and power in RF pipelines. However, the work is not groundbreaking; similar suites exist (e.g., Open RL Benchmark, NCBench for genomics). Searches found no exact matches for this title or related RF-AR benchmarks, suggesting originality, but the authors should cite more related work (e.g., MLPerf for ML reproducibility). As part of a potential series (tying to the author’s prior works on Triage-AR, RF Biomarker Sensing, Network-Degraded Ops, and Glass UX, which share themes such as RF fusion and AR overlays), it could unify their evaluations.
Technical Quality: The design is robust: traces include raw RF data (CSV at 200 Hz), labels, system logs (CPU/power), and environment parameters—comprehensive for reproduction. A JSON schema for metrics (latency breakdown, FPS, power, user performance) ensures structure, with manifests linking results to commits and hardware for provenance. Scripts that emit PGFPlots/CSV output are innovative for LaTeX workflows, reducing manual errors (e.g., via a `make figures` target). The client simulator for trace generation is practical for controlled reproduction. Limitations: the traces are simulator-based and lack real hardware diversity (e.g., phone vs. glasses), and the metrics focus on the system level rather than end users (e.g., situational-awareness scores). No details are given on dataset size or diversity (e.g., scenarios such as occlusion or motion). The artifact is “ready for evaluation tracks,” which is commendable, but the authors should provide a GitHub link or size statistics. The suite ties well to RF-AR pipelines (e.g., from the prior works), but the trace format should be formalized with schema examples (see the sketch below).
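To make the schema-formalization suggestion concrete, here is a minimal sketch of what a metrics record with its provenance manifest and a JSON-Schema validation could look like. All field names (manifest, git_commit, latency_ms, fps, power_w, etc.) are illustrative assumptions, not the paper's actual format:

```python
# Hypothetical sketch of an OpenBench-AR metrics record plus JSON-Schema
# validation; field names are assumptions for illustration only.
from jsonschema import validate  # pip install jsonschema

METRICS_SCHEMA = {
    "type": "object",
    "required": ["manifest", "latency_ms", "fps", "power_w"],
    "properties": {
        # Manifest ties every result to code, hardware, and trace for provenance.
        "manifest": {
            "type": "object",
            "required": ["git_commit", "hardware", "trace_id"],
            "properties": {
                "git_commit": {"type": "string", "pattern": "^[0-9a-f]{7,40}$"},
                "hardware": {"type": "string"},  # e.g., "phone" or "glasses"
                "trace_id": {"type": "string"},  # versioned RF trace identifier
            },
        },
        # End-to-end latency broken down by pipeline stage, in milliseconds.
        "latency_ms": {
            "type": "object",
            "properties": {
                "rf_capture": {"type": "number", "minimum": 0},
                "inference": {"type": "number", "minimum": 0},
                "render": {"type": "number", "minimum": 0},
            },
        },
        "fps": {"type": "number", "minimum": 0},
        "power_w": {"type": "number", "minimum": 0},
    },
}

record = {
    "manifest": {"git_commit": "a1b2c3d", "hardware": "sim", "trace_id": "t-0001"},
    "latency_ms": {"rf_capture": 5.0, "inference": 21.3, "render": 8.2},
    "fps": 31.5,
    "power_w": 3.4,
}
validate(instance=record, schema=METRICS_SCHEMA)  # raises ValidationError on mismatch
```

Publishing a snippet like this alongside the manifest specification would let third parties validate their own traces before submitting results.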
Evaluation: The demonstration (reproducing latency/FPS/power on prior pipelines) is promising but incomplete in the provided text; page 2 likely details the results. It shows reproduction across hardware, which is a key claim. The metrics are appropriate (e.g., energy per alert, miss rate; a hypothetical computation is sketched below). Weaknesses: no quantitative results appear here (e.g., reproduction time savings, error rates); I assume page 2 contains the figures and tables. There is no user study or comparison to ad-hoc reproduction. For artifact evaluation, a dockerized setup would ease independent runs.
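For clarity on the derived metrics named above: energy per alert is presumably total energy (power integrated over time) divided by the number of alerts raised. A hypothetical computation over a trace's system log might look like the following; the column names (t_s, power_w, event) and file layout are assumptions, not the paper's actual log format:

```python
# Hypothetical "energy per alert" and "miss rate" computation from a system
# log CSV; column names and event labels are illustrative assumptions.
import csv

def energy_and_miss_rate(log_path: str) -> tuple[float, float]:
    """Integrate power over time, divide by alert count, and report miss rate."""
    total_energy_j = 0.0
    alerts = misses = 0
    prev_t = None
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):  # assumed columns: t_s, power_w, event
            t = float(row["t_s"])
            if prev_t is not None:
                # Trapezoid-free approximation: power * elapsed time.
                total_energy_j += float(row["power_w"]) * (t - prev_t)
            prev_t = t
            if row["event"] == "alert":
                alerts += 1
            elif row["event"] == "miss":
                misses += 1
    energy_per_alert = total_energy_j / alerts if alerts else float("inf")
    miss_rate = misses / (alerts + misses) if (alerts + misses) else 0.0
    return energy_per_alert, miss_rate
```

Pinning down definitions like these in the metrics schema would prevent different groups from reporting incomparable numbers under the same metric name.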
Clarity and Presentation: Concise and clear, with a logical structure: problem, then design (traces, metrics, generation). The examples (script commands) aid usability, and the README guidance is a plus. Minor issues: the references are sparse (only two); expand to five or more (e.g., on RF datasets). No typos were found. The anonymized manuscript is clean for blind review.
Ethical Considerations: The suite promotes open science and lowers barriers to entry—a positive. No evident risks (e.g., no privacy-sensitive data in the traces).
Relation to Screenshots/Series: This meta-paper complements the series: it standardizes evaluations for Triage-AR (AR triage), RF Biomarker Sensing (phone RF fusion as a K9 replacement), Network-Degraded Ops (DTN for AR resilience), and Glass UX (multi-target AR UI). If these are by the same authors (e.g., Benjamin J. Gilbert in non-anonymized versions), the suite unifies their pipeline and should be noted as related work.
Recommendations for Revision:
- Major: Include sample results/figures from page 2; add a real-vs-simulated trace comparison; expand the references.
- Minor: Detail dataset statistics (e.g., hours of traces); provide an artifact URL; formalize the JSON schema with a snippet (along the lines sketched above).
- Publication Fit: Ideal for SysML or CHI workshops on reproducibility. Searches (Sep 20, 2025) found no prior publication—submit for an artifact badge.
A solid tool for the community; with the full content available, the impact would be stronger.