Flash Attention-Inspired Queuing for Ultra-Low Latency Communication Networks
We present a queuing subsystem that adapts FlashAttention ideas to message middleware: (i) FlashQueue, a priority queue with an optional async event loop; and (ii) MemoryMappedFlashQueue, which adds a “hot” SRAM-like buffer in front of a cold, HBM-like backing priority queue. We report latency, cache-hit ratio, and throughput under synthetic workloads, showing predictable wins from hot-buffer admission control and async event loops.
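The FlashQueue design pairs a priority queue with an async event loop that drains messages in priority order. A minimal sketch of that pattern using Python's `asyncio.PriorityQueue` is shown below; the function name and message labels are illustrative, not the paper's actual API.

```python
import asyncio

async def flashqueue_demo():
    # Priority queue drained by an async consumer loop.
    # Lower number = higher priority; tuples sort by priority first.
    q = asyncio.PriorityQueue()
    for prio, msg in [(2, "warm"), (0, "urgent"), (1, "normal")]:
        await q.put((prio, msg))

    drained = []
    while not q.empty():
        prio, msg = await q.get()
        drained.append(msg)
        q.task_done()
    return drained

print(asyncio.run(flashqueue_demo()))  # ['urgent', 'normal', 'warm']
```

In a real deployment the consumer would `await q.get()` indefinitely while producers enqueue concurrently; the bounded demo loop above just makes the priority ordering visible.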
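The hot/cold split behind MemoryMappedFlashQueue can be sketched as a small fixed-capacity "hot" heap (SRAM-like) with admission control that demotes the least urgent entry to a larger "cold" backing heap (HBM-like). The class and parameter names below are assumptions for illustration, not the paper's implementation; the cache-hit ratio simply counts pops served from the hot tier.

```python
import heapq

class TwoTierQueue:
    """Sketch of a hot/cold priority queue with admission control.

    Lower numeric priority = more urgent. The hot buffer holds at
    most `hot_capacity` of the most urgent pending messages.
    """

    def __init__(self, hot_capacity=8):
        self.hot_capacity = hot_capacity
        self.hot = []   # small heap standing in for fast (SRAM-like) storage
        self.cold = []  # large backing heap standing in for slow (HBM-like) storage
        self.hits = 0
        self.total = 0

    def push(self, priority, msg):
        # Admission control: every message enters the hot tier, but on
        # overflow the least urgent hot entry is demoted to the cold tier.
        heapq.heappush(self.hot, (priority, msg))
        if len(self.hot) > self.hot_capacity:
            demoted = max(self.hot)   # O(n), fine for a small hot buffer
            self.hot.remove(demoted)
            heapq.heapify(self.hot)
            heapq.heappush(self.cold, demoted)

    def pop(self):
        # Serve whichever tier holds the globally most urgent message,
        # counting a cache hit when it comes from the hot buffer.
        self.total += 1
        if self.hot and (not self.cold or self.hot[0] <= self.cold[0]):
            self.hits += 1
            return heapq.heappop(self.hot)
        return heapq.heappop(self.cold)

    def cache_hit_ratio(self):
        return self.hits / self.total if self.total else 0.0
```

For example, with `hot_capacity=2`, pushing priorities 5, 1, 3, 4, 2 leaves {1, 2} in the hot tier and {3, 4, 5} in the cold tier, so the first two pops are hot-tier hits and the overall cache-hit ratio is 0.4.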