About
The 2nd Workshop on Hot Topics in System Infrastructure (HotInfra'24) provides a unique forum for cutting-edge research on system infrastructure and platforms. Researchers and engineers can share their recent research results and experiences, as well as discuss new challenges and opportunities in building next-generation system infrastructures, such as AI infrastructure, sustainable data centers, and edge/cloud computing infrastructure. The topics span the full system stack, with a focus on the design and implementation of system infrastructures; relevant areas include hardware architecture, operating systems, runtime systems, and applications.
Call for Papers
The HotInfra workshop solicits three types of paper submissions: regular research papers, industry papers, and work-in-progress papers.
- Regular research papers may include studies that have been published at top-tier systems and architecture conferences in the past year. We encourage submissions that showcase new concepts and approaches for developing new and emerging system infrastructures.
- Industry papers are encouraged to present recent trends and demands of real system infrastructures and to discuss the challenges and experiences of building such infrastructures from an industry perspective.
- Work-in-progress papers are encouraged to present new and crazy ideas for building future system infrastructures. We will favor submissions with great potential to inspire interesting discussions, so it is fine if the work has only an early version of the system prototype.
HotInfra'24 welcomes submissions on any topics related to system infrastructure and platforms. Specific areas include but are not limited to:
- Systems architecture and hardware devices
- Operating systems and runtime systems support
- Resource management and task scheduling
- Empirical evaluation of real infrastructures
- Security and reliability of new infrastructures
- Energy efficiency and renewable energy supply
- Emerging applications and cloud services enabled by new infrastructures
- Hardware virtualization
- Rack-scale infrastructure
- Software-defined data centers
- System-building approaches
Submission Guidelines
HotInfra'24 submissions must be no longer than three double-column pages, excluding references. For regular research papers, we follow a single-blind review policy; for industry papers and work-in-progress papers, we follow a double-blind review policy. All accepted papers can be presented in the poster session by default. We will post presentation slides and accepted papers on the workshop website. Authors may extend and publish their work at other conferences and in journals after HotInfra. The HotInfra'24 workshop will also feature invited talks from industry and academia.
Please submit your work here.
Camera-Ready Guidelines
Please use the provided LaTeX template to prepare your camera-ready paper. The camera-ready paper should be no longer than three double-column pages excluding references. Please submit your camera-ready paper via HotCRP by clicking the "Edit submission" button and uploading your PDF file in the "Final version" field.
Important Dates
- Submission Deadline: September 8, 2024 (extended from September 1, 2024)
- Author Notifications: September 30, 2024
- Camera-ready Paper due: October 14, 2024
- Workshop: November 3, 2024
Workshop Program
Location: Salon K, Hilton Austin
Breakfast
Opening Remarks
Keynote Talk
A Case for Scale-out AI
Christos Kozyrakis (Stanford)
Abstract
AI workloads, both training and inference, are now driving datacenter infrastructure development. With the rise of large foundational models, the prevalent approach to AI systems has mirrored that of supercomputing. We are designing hardware and software systems for large, synchronous HPC jobs with significant emphasis on kernel optimization for linear algebra operations. This talk will argue that AI infrastructure should instead adopt a scale-out systems approach. We will review motivating examples across the hardware/software interface and discuss opportunities for further improvements in scale-out AI systems.
Bio
Christos Kozyrakis is a computer architecture researcher at Nvidia Research and a professor of Electrical Engineering and Computer Science at Stanford University. He is currently working on cloud computing technology, systems design for artificial intelligence, and artificial intelligence for systems design. Christos holds a PhD from UC Berkeley. He is a fellow of the ACM and the IEEE. He has received the ACM SIGARCH Maurice Wilkes Award, the ISCA Influential Paper Award, the ASPLOS Influential Paper Award, the NSF CAREER Award, the Okawa Foundation Research Grant, and faculty awards from IBM, Microsoft, and Google.
Short Break
Session I: AI Platforms: Challenges and Opportunities
Session Chair: Christina Giannoula (University of Toronto)
- Silent Data Corruptions in AI Systems: Evaluation and Mitigation
Abstract: The increasing complexity, heterogeneity, and scale of AI hardware systems make them increasingly susceptible to hardware faults, e.g., silent data corruption (SDC). Tackling this challenge requires answering the question: How to evaluate and mitigate the impact of SDCs on AI systems? For evaluation, we propose a novel quantitative metric, Parameter Vulnerability Factor (PVF), inspired by the architectural vulnerability factor (AVF) in computer architecture, aiming to standardize the evaluation of SDC impact on AI models. PVF focuses on SDCs occurring in model parameters – we define a model parameter's PVF as the probability that a corruption in that particular model parameter would result in an incorrect output. Through extensive fault injection (FI), we obtain PVF for a set of open-source models including DLRM, CNN, and BERT, as well as three of Meta's production ranking and recommendation models, based on which we present unique insights. For mitigation, we propose Dr. DNA, a novel approach to detect and mitigate SDCs by formulating and extracting a set of unique SDC signatures from the Distribution of neuron activations (DNA). Through an extensive evaluation across 10 different models, results show that Dr. DNA achieves a 100% SDC detection rate for most cases, a 95% detection rate on average, and a >90% detection rate across all cases, representing a 20%-70% improvement over baselines. Dr. DNA also mitigates the impact of SDCs by recovering model performance with <1% memory overhead and <2.5% latency overhead.
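The PVF definition above lends itself to a simple Monte Carlo estimate: repeatedly corrupt one parameter, run inference, and count how often the output becomes incorrect. The sketch below is only an illustration of that idea under assumed interfaces, not the authors' fault-injection tooling; `model.get_param`, `model.set_param`, `model.predict`, and `is_correct` are hypothetical placeholders.

```python
import random
import struct

def flip_random_bit(value: float) -> float:
    """Flip one random bit of a float32 parameter (a simple single-bit SDC model)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << random.randrange(32)
    (corrupted,) = struct.unpack("<f", struct.pack("<I", bits))
    return corrupted

def estimate_pvf(model, param_name, eval_input, is_correct, trials=1000):
    """Estimate a parameter's PVF: the fraction of injected corruptions
    that lead the model to produce an incorrect output."""
    original = model.get_param(param_name)        # hypothetical accessor
    incorrect = 0
    for _ in range(trials):
        model.set_param(param_name, flip_random_bit(original))
        if not is_correct(model.predict(eval_input)):
            incorrect += 1
        model.set_param(param_name, original)     # restore before the next trial
    return incorrect / trials
```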
- Hardware-Assisted Virtualization of Neural Processing Units for Cloud Platforms
Abstract: Cloud platforms today have been deploying hardware accelerators like neural processing units (NPUs) for powering machine learning (ML) inference services. To maximize the resource utilization while ensuring reasonable quality of service, a natural approach is to virtualize NPUs for efficient resource sharing for multi-tenant ML services. However, virtualizing NPUs for modern cloud platforms is not easy. This is not only due to the lack of system abstraction support for NPU hardware, but also due to the lack of architectural and ISA support for enabling fine-grained dynamic operator scheduling for virtualized NPUs.
We present Neu10, a holistic NPU virtualization framework. We investigate virtualization techniques for NPUs across the entire software and hardware stack. Neu10 consists of (1) a flexible NPU abstraction called vNPU, which enables fine-grained virtualization of the heterogeneous compute units in a physical NPU (pNPU); (2) a vNPU resource allocator that enables a pay-as-you-go computing model and flexible vNPU-to-pNPU mappings for improved resource utilization and cost-effectiveness; and (3) an ISA extension of modern NPU architectures for facilitating fine-grained tensor operator scheduling for multiple vNPUs. We implement Neu10 based on a production-level NPU simulator. Our experiments show that Neu10 improves the throughput of ML inference services by up to 1.4x and reduces the tail latency by up to 4.6x, while improving the NPU utilization by 1.2x on average, compared to state-of-the-art NPU sharing approaches.
- Decluttering the Data Mess in LLM Training
Abstract: Training large language models (LLMs) presents new challenges for managing training data due to ever-growing model and dataset sizes. State-of-the-art LLMs are trained over trillions of tokens that are aggregated from a cornucopia of different datasets, forming collections such as RedPajama, Dolma, or FineWeb. However, as the data collections grow and cover more and more data sources, managing them becomes time-consuming, tedious, and prone to errors. The proportion of data with different characteristics (e.g., language, topic, source) has a huge impact on model performance.
In this work-in-progress paper, we discuss three challenges we observe for training LLMs due to the lack of system support for managing and mixing data collections. Then, we present Mixtera, a system to support LLM training data management.
Coffee Break
Session II: Cloud Efficiency: New Approaches, New Opportunities
Session Chair: Apoorve Mohan (IBM)
- SoK: Virtualization Challenges and Techniques in Serverless Computing
Abstract: This systematization of knowledge (SoK) paper summarizes the discussion of virtualization challenges and the corresponding techniques specific to serverless computing. We examine virtualization solutions, including paravirtualization, containers, lightweight hypervisors and kernels, and unikernels, and their applicability to serverless. Then, we discuss several challenges, including cold-start optimization, resource co-location, benchmarking, and the research-production gap, hoping to inspire future research.
- Harmonizing Diverse Compute Resources for Efficiency
Abstract: Online services are characterized by significant load fluctuations at fine-grained intervals, even when coarse-grained load measurements indicate a relatively stable load. Running such services on virtual machines (VMs) rented from a cloud provider like AWS, which is a typical way to deploy online applications today, is inefficient due to the need to overprovision VM capacity to meet the SLO under variable load. In contrast, serverless computing is highly elastic but is prohibitively expensive for serving a large volume of requests. We thus argue for combining the different types of compute (i.e., VM and serverless instances) to achieve both cost-efficiency and elasticity. Our results show that hybrid compute is more cost-effective than an optimal VM-only allocation that provisions just enough resources to meet the SLO using perfect knowledge of future load.
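As a toy illustration of this argument (with invented prices, a hypothetical load trace, and a deliberately simplified cost model rather than the paper's), the sketch below compares a VM-only allocation sized for peak load against a hybrid allocation that sizes VMs for the baseline load and absorbs bursts with serverless invocations.

```python
import math

VM_COST_PER_HOUR = 0.10          # assumed price of one VM ($/hour)
VM_CAPACITY_RPS = 100            # requests/sec one VM can serve within the SLO
SERVERLESS_COST_PER_REQ = 2e-6   # assumed price per serverless invocation

def vm_only_cost(load_rps, hours=1.0):
    """Provision enough VMs for the peak of the fluctuating load."""
    vms = math.ceil(max(load_rps) / VM_CAPACITY_RPS)
    return vms * VM_COST_PER_HOUR * hours

def hybrid_cost(load_rps, hours=1.0):
    """Provision VMs for the baseline load; spill the bursts to serverless."""
    vms = math.ceil(min(load_rps) / VM_CAPACITY_RPS)
    spill_rps = sum(max(0, rps - vms * VM_CAPACITY_RPS) for rps in load_rps)
    seconds_per_sample = hours * 3600 / len(load_rps)
    return (vms * VM_COST_PER_HOUR * hours
            + spill_rps * seconds_per_sample * SERVERLESS_COST_PER_REQ)

# A spiky one-hour trace sampled per minute: 200 rps baseline with short bursts to 1000 rps.
trace = [200] * 55 + [1000] * 5
print(vm_only_cost(trace), hybrid_cost(trace))  # hybrid comes out cheaper than peak-sized VMs
```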
- Designing Cloud Servers for Lower Carbon
Abstract: To mitigate climate change, we must reduce carbon emissions from hyperscale cloud computing. We find that cloud compute servers cause the majority of emissions in a general-purpose cloud. Thus, we motivate designing carbon-efficient compute server SKUs, or GreenSKUs, using recently-available low-carbon server components. To this end, we design and build three GreenSKUs using low-carbon components, such as energy-efficient CPUs, reused old DRAM via CXL, and reused old SSDs.
We detail several challenges that limit GreenSKUs' carbon savings at scale and may prevent their adoption by cloud providers. To address these challenges, we develop a novel methodology and associated framework, GSF (Green SKU Framework), that enables a cloud provider to systematically evaluate a GreenSKU's carbon savings at scale. We implement GSF within Microsoft Azure's production constraints to evaluate our three GreenSKUs' carbon savings. Using GSF, we show that our most carbon-efficient GreenSKU reduces emissions per core by 28% compared to currently-deployed cloud servers. When designing GreenSKUs to meet applications' performance requirements, we reduce emissions by 15%. When incorporating overall data center overheads, our GreenSKU reduces Azure's net cloud emissions by 8%.
- OpenInfra: A Co-simulation Framework for the Infrastructure Nexus
Abstract: Critical infrastructures like datacenters, power grids, and water systems are interdependent, forming complex "infrastructure nexuses" that require co-optimization for efficiency, resilience, and sustainability. We present OpenInfra, a co-simulation framework designed to model these interdependencies by integrating domain-specific simulators for datacenters, power grids, and cooling systems but focusing on stitching them together for end-to-end experimentation. OpenInfra enables seamless integration of diverse simulators and flexible configuration of infrastructure interactions. Our evaluation demonstrates its ability to simulate large-scale infrastructure dynamics, including 7,392 servers over 100+ hours.
Lunch
Keynote Talk
Challenges and Opportunities in Datacenter Power and Sustainability in the AI Era
Ricardo Bianchini (Microsoft)
Abstract
As society's interest in generative AI models and their capabilities continues to soar, we are witnessing an unprecedented surge in compute demand. This surge is stressing every aspect of the cloud ecosystem at a time when hyperscale providers are striving to become carbon-neutral. In this talk, I will address the challenges in managing the power, energy, and sustainability of this expanding AI infrastructure. I will also overview some of my team's early efforts to tackle these challenges and explore potential research avenues going forward. Ultimately, we will need a massive research and development effort to create a more sustainable and efficient future for AI.
Bio
Dr. Ricardo Bianchini is a Technical Fellow and Corporate Vice President at Microsoft Azure, where he leads the team responsible for managing Azure's Compute workload, server capacity, and datacenter infrastructure with a strong focus on efficiency and sustainability. Before joining Azure in 2022, Ricardo led the Systems Research Group and the Cloud Efficiency team at Microsoft Research (MSR). During his tenure at MSR, he created research projects in power efficiency and intelligent resource management that resulted in large-scale production systems across Microsoft. Prior to joining Microsoft in 2014, he was a Professor at Rutgers University, where he conducted research in datacenter power and energy management, energy-aware storage systems, energy-aware load distribution across datacenters, and leveraging renewable energy in datacenters. Ricardo is a Fellow of both the ACM and IEEE.
Short Break
Session III: AI Systems: Algorithms, Networking, and Hardware
Session Chair: Yuqi Xue (UIUC)
- Immediate Communication for Distributed AI Tasks
Abstract: Large AI models have necessitated efficient communication strategies across multi-GPU and multi-node infrastructures due to their increasing complexity. Current methods focusing on inter-operator overlaps fail when dependencies exist, leading to underutilized hardware. DistFuse addresses this by enabling fine-grained overlapping of computation and communication, triggering communication as soon as data is ready and reducing latency. Initial experiments show up to a 44.3% reduction in the communication latency of Llama3-70B inference on a single node, demonstrating its potential to accelerate diverse AI workloads.
- Do Large Language Models Need a Content Delivery Network?
Abstract: As the use of large language models (LLMs) expands rapidly, so does the range of knowledge needed to supplement various LLM queries. Thus, enabling flexible and efficient injection of new knowledge in LLM inference is critical. Three high-level options exist: (i) embedding the knowledge in the LLM's weights (i.e., fine-tuning), (ii) including the knowledge as part of the LLM's text input (i.e., in-context learning), or (iii) injecting the KV caches of the new knowledge into the LLM during prefill. This paper argues that, although fine-tuning and in-context learning are popular, using KV caches as the medium of knowledge could simultaneously enable more modular management of knowledge injection and more efficient LLM serving with low cost and fast response. To realize these benefits, we envision a Knowledge Delivery Network (KDN), a new system component in LLM services that dynamically optimizes the storage, transfer, and composition of KV caches across LLM engines and other compute and storage resources. We believe that, just like content delivery networks (CDNs), such as Akamai, enabled the success of the Internet ecosystem through their efficient data delivery, KDNs will be critical to the success of LLM applications through their efficient knowledge delivery. We have open-sourced a KDN prototype at https://github.com/LMCache/LMCache.
- Scaling Deep Learning Computation over the Inter-core Connected Intelligence Processor
Abstract: As AI chips incorporate numerous parallelized cores to scale deep learning (DL) computing, chips like Graphcore IPU enable high-bandwidth and low-latency inter-core links. They allow each core to directly access other cores' scratchpad memory, which enables new parallel computing paradigms. However, without proper support for the inter-core connections in current DL compilers, it is hard to exploit the benefits of this new architecture. We present T10, the first DL compiler to exploit the inter-core bandwidth and distributed on-chip memory on AI chips. To formulate the computation and communication patterns of tensor operators in the new architecture, T10 introduces a distributed tensor abstraction rTensor. T10 maps a DNN model to execution plans with a compute-shift pattern, by partitioning DNN computation into sub-operators and mapping them to cores, so that the cores can exchange data following predictable patterns. T10 alleviates unnecessary inter-core communications, makes globally optimized trade-offs between on-chip memory usage and inter-core communication overhead, and selects the best execution plan from a vast optimization space.
- Optimizing Transformer Inference with Selective Distillation: Layerwise Conversion to Linear Attention
Abstract: Transformer-like architectures with softmax attention have demonstrated exceptional performance in language modeling by capturing long-term token dependencies and ensuring efficient training. However, their scalability is constrained by the quadratic computation costs incurred during inference. To mitigate this issue, subquadratic models with SSMs and linear attention have been introduced; however, such models achieve lower accuracy than Transformers. In this work, we aim to get the best of both worlds: we seek to convert carefully selected attention layers in a Transformer to gated linear attention layers, without sacrificing accuracy. Specifically, we analyze the performance benefits of subquadratic models and propose a distillation method that progressively converts the attention layer to a gated linear attention layer. While selecting layers to convert, we leverage downstream task accuracy as a criterion to minimize the accuracy loss caused by conversion. Our evaluation demonstrates that task accuracy is an effective criterion for adaptive distillation.
Coffee Break
Session IV: Resource Scheduling: Not Only for Efficiency But Also for Societal Good
Session Chair: Jinghan Sun (UIUC)
- Adaptive Resource Allocation to Enhance the Kubernetes Performance for Large-Scale Clusters
Abstract: The advent of cloud computing has led to a dramatic increase in the deployment of hyper-scale, diverse workloads in containerized form on cloud infrastructures. This expansion necessitates the management of numerous large-scale clusters. However, Kubernetes (k8s), the industry standard for container orchestration, faces challenges with low scheduling throughput and high request latency in such environments. We identify resource contention among components co-located on the same master node as a primary bottleneck. To address this, we introduce a lightweight framework designed to enhance the performance and scalability of Kubernetes clusters. Our approach adaptively allocates resources among co-located components, improving their overall performance. Implemented as a non-intrusive solution, our framework shows significant improvements in evaluations across various cluster scales, with a 7.3x increase in cluster scheduling throughput and a 37.3% reduction in request latency, surpassing the performance of vanilla Kubernetes and baseline resource allocation strategies.
- Mewz: Lightweight Execution Environment for WebAssembly with High Isolation and Portability using Unikernels
Abstract: Cloud computing requires isolation and portability for workloads. Virtual machines (for isolation) and containers (for portability) are widely used to meet these requirements today. However, using VMs and containers together entails two problems. First, VMs and containers have overheads that degrade performance. Second, container images depend on host operating systems and architectures. To solve these problems, we propose a new system that distributes applications as WebAssembly (Wasm) and runs them as unikernels. Additionally, we propose a mechanism to convert a Wasm application into a unikernel image with the Wasm binary AoT-compiled to native code. We evaluated the performance of the system by running a simple HTTP server compiled into Wasm. The results show that it runs Wasm applications with lower overhead than existing technologies.
- Demographic Bias in Web Scheduling Systems
Abstract: Modern web systems, especially scheduling systems, often adopt a "performance-first" approach, where they prioritize sending quick responses to the end user to improve Quality of Experience (QoE). We posit that putting performance first, however, might make scheduling systems introduce request priorities that may cause responses to be biased against certain user demographics. For example, to improve QoE, existing scheduling systems often prioritize requests that face lower network delays, which may implicitly cause biases against some requests that originate from rural areas.
To validate our hypothesis, we systematically study and define demographic bias for scheduling systems. We investigate whether existing scheduling systems can (unintentionally) introduce demographic bias to improve performance, precipitating discrimination against certain user demographics. We detail two case studies to show that demographic bias can occur in open-source schedulers. We then design BiasCheck, a scheduler that meets performance goals while reducing bias. We demonstrate that BiasCheck significantly reduces demographic bias while causing only a few SLO violations, compared to existing schedulers.
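To make the hypothesis concrete, here is a toy sketch (not BiasCheck, and with invented request delays) of a "performance-first" policy that serves low-network-delay requests first; under load, urban requests consistently drain before rural ones.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    priority: float                          # lower network delay = served earlier
    region: str = field(compare=False)

def performance_first_order(requests):
    """Serve requests with the lowest network delay first (a common QoE tactic)."""
    heap = [Request(priority=r["net_delay_ms"], region=r["region"]) for r in requests]
    heapq.heapify(heap)
    return [heapq.heappop(heap).region for _ in range(len(heap))]

# Invented delays: rural links tend to have higher RTTs than urban ones.
pending = [
    {"region": "urban", "net_delay_ms": 12},
    {"region": "rural", "net_delay_ms": 85},
    {"region": "urban", "net_delay_ms": 18},
    {"region": "rural", "net_delay_ms": 95},
]
print(performance_first_order(pending))   # urban requests drain first; rural ones wait
```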
Short Break
Session V: Infrastructure Development: New Tools with New Techniques
Session Chair: Pantea Zardoshti (Microsoft)
- Revisiting Distributed Programming in the CXL Era
Abstract: As Moore's Law slows, distributed programming is increasingly prioritized to enhance system performance through horizontal scaling. This paper revisits the paradigms of message passing (MP) and distributed shared memory (DSM) in light of evolving interconnect technologies, particularly Compute Express Link (CXL). DSM's unified memory space offers simplicity in distributed programming by abstracting away data communication complexities, but MP remains more prevalent due to its flexibility and programming friendliness. We explore the memory sharing and pooling capabilities of CXL, which enable low-latency, coherent data transfers between multiple hosts. We address the complexities involved in managing shared state, particularly in the context of partial failures – termed Partial Failure Resilient DSM (RDSM). We thus propose a memory management system named CXLSHM that leverages reference counting and distributed vector clocks for robust memory management. To be compatible with the message passing model, we introduce a CXL-based RPC named HydraRPC, which utilizes CXL shared memory to bypass data copying and serialization.
- Towards Optimal Remote JIT Compilation Scheduling for the JVM
Abstract: In the Java Virtual Machine (JVM), Just-In-Time (JIT) compilation is used to speed up Java applications; however, JIT compilation incurs significant memory and CPU runtime overheads. JITServer is a disaggregated JIT compiler in the Eclipse OpenJ9 JVM – it decouples the JIT compiler from the JVM, thereby reducing the CPU and memory requirements of JVM client applications.
JITServer schedules remote compilation requests from connected clients in first-come, first-served (FCFS) order. We show that when a JITServer instance is under high load, the scheduler performs better when it considers client information for each compilation request. Prioritizing requests from newly connected clients helps those clients warm up rapidly, but risks starving requests from older clients. We developed a scheduling algorithm we call ALDCF (Alternating Least Done Client First) that prioritizes requests from clients with fewer completed compilation requests, while reducing starvation of older clients by alternating with FCFS. In our experiments, ALDCF reduces the average client JVM runtime by up to 9% compared to FCFS, while controlling starvation.
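A minimal sketch of the described policy, reconstructed from this abstract rather than taken from the authors' implementation: alternate between serving the request whose client has the fewest completed compilations and plain FCFS, so that older clients are not starved.

```python
from collections import deque

class ALDCFScheduler:
    """Alternating Least Done Client First, as described in the abstract:
    odd picks favor the client with the fewest completed compilations,
    even picks fall back to FCFS to bound starvation of older clients."""

    def __init__(self):
        self.queue = deque()      # (client_id, request) in arrival order
        self.completed = {}       # client_id -> compilations served so far
        self.turn = 0

    def submit(self, client_id, request):
        self.completed.setdefault(client_id, 0)
        self.queue.append((client_id, request))

    def next_request(self):
        if not self.queue:
            return None
        self.turn += 1
        if self.turn % 2 == 0:                    # FCFS turn
            client_id, request = self.queue.popleft()
        else:                                      # least-done-client turn
            client_id, request = min(self.queue, key=lambda e: self.completed[e[0]])
            self.queue.remove((client_id, request))
        # Simplification: count a compilation as "done" when it is dispatched.
        self.completed[client_id] += 1
        return client_id, request
```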
- Lotus: Characterize Architecture Level CPU-based Preprocessing in Machine Learning Pipelines
Abstract: With the slowdown of Moore's law, data-intensive workloads require significant compute capacity and power. In machine learning (ML) pipelines, preprocessing has significant compute and power requirements, involving tasks such as loading, decoding, and applying transformations. To combat the preprocessing bottleneck, substantial efforts have been made, with prior works introducing optimizations such as CPU and I/O parallelism, accelerator offloading, on-node/distributed computing, as well as caching strategies. However, less attention has been paid to optimizing the preprocessing pipeline as a function of the CPU architecture's efficiency in executing it. The performance of preprocessing tasks is intrinsically linked to the CPU's microarchitecture, memory hierarchy, and the efficiency of its instruction pipeline.
Our work helps close this gap by introducing LOTUS, a profiling tool specifically designed for the preprocessing stage of ML pipelines to aid future optimizations across the stack. The insights provided by LOTUS enable practitioners to evaluate the limitations of their CPU architecture and the efficiency of their preprocessing pipelines across different configurations, such as varying the number of preprocessing workers.
Short Talk Session
Session Chair: Jinghan Sun (UIUC)
- LLM Inference Performance on Chiplet-based Architectures and Systems
Abstract: Large Language Models (LLMs) have become increasingly prevalent, enabling a wide range of tasks across various platforms, from handheld devices and wearables to large-scale datacenters. Many of these applications, such as co-pilots and chatbots, rely on multi-billion-parameter decoder-only LLMs, which require significant computational resources to achieve desired performance metrics. As LLM workloads continue to evolve and demand more substantial computational resources, it is essential to explore innovative approaches to improve their performance.
One promising approach is the Multi-Chip-Module (MCM) architecture, which offers high-performance computing, storage, and network capabilities. However, the performance characteristics of LLMs on MCM architectures are not yet fully understood. To address this knowledge gap, we conducted a series of carefully designed experiments to investigate LLM inference performance on various MCM architectures. Our study provides detailed sensitivity analyses of die-to-die bandwidth, cache policies, and chiplet configurations, offering valuable insights into optimizing LLM performance on MCM architectures.
- Thallus: An RDMA-based Columnar Data Transport Protocol
Abstract: The volume of data generated and stored in contemporary global data centers is experiencing exponential growth. This rapid data growth necessitates efficient processing and analysis to extract valuable business insights. In distributed data processing systems, data exchanges between compute servers contribute significantly to the total data processing time in sufficiently large clusters, necessitating efficient data transport protocols.
Traditionally, data transport frameworks such as JDBC and ODBC have used TCP/IP-over-Ethernet as their underlying network protocol. Such frameworks require serializing the data into a single contiguous buffer before handing it off to the network card, primarily due to the requirement of contiguous data in TCP/IP. In OLAP use cases, this serialization process is costly for columnar data batches as it involves numerous memory copies that hurt data transport duration and overall data processing performance. We study the serialization overhead in the context of a widely-used columnar data format, Apache Arrow, and propose leveraging RDMA to transport Arrow data over InfiniBand in a zero-copy manner. We design and implement Thallus, an RDMA-based columnar data transport protocol for Apache Arrow based on the Thallium framework from the Mochi ecosystem, compare it with a purely Thallium RPC-based implementation, and show that substantial performance improvements can be achieved by using RDMA for columnar data transport.
- Towards Static Analysis of Interrupt Blocking Time in Operating System Kernels
Abstract: Interrupts are integral to the interaction between an OS and its managed devices. They notify the OS that something requires immediate attention. However, they also disrupt ongoing executions by design and, therefore, have to be inhibited in some cases. This process requires utmost care, as unnecessarily long interrupt blocking times negatively impact system performance. In the worst case, missed interrupts may even result in catastrophic failures. The associated cause analysis for delayed interrupt reception is often time-consuming and costly due to the complex interaction between devices and the system. Especially on commodity server systems with many devices, the cause of problems related to interrupt latencies is often masked.
In this paper, we make an initial effort towards providing means for developers of general-purpose OSs to minimize the worst-case interrupt-delivery latency by identifying excessive interrupt blocking times. By leveraging symbolic execution of LLVM IR, we provide insights for the system developer into potentially problematic code paths where interrupts are masked. We currently work with x86_64, but the analysis is applicable to other architectures with minimal adaptations.
- Delayed Privilege Escalation for Fast System Calls
Abstract: Executing system calls requires switching the privilege level when entering the kernel. For high-performance applications, these costly switches can be omitted by directly linking with the kernel in the form of library operating systems or unikernels. However, this approach comes at the cost of compromising security as it removes user-/kernel-space separation. In other contexts, on-demand rather than proactive action allows paying overhead costs only when necessary.
We propose that single-address systems allow for an on-demand approach for system calls. When the operating-system data is hidden by the sheer size of the address space, elevated privileges are only necessary when performing privileged hardware operations. We believe that single-address systems would benefit from such an on-demand approach.
We introduce a concept that allows us to defer the privilege-level switch until it is needed, that is, when the first privileged instruction is encountered. We present a prototype implementation of the concept as an adaptation of the Linux kernel.
- Redesigning Edge-Centric Micro-Datacenters for Efficient Multitenancy
Abstract: Hazard monitoring systems depend on micro datacenters (MicroDCs) in remote areas for early disaster detection and response, and continuous environmental monitoring. These MicroDCs, limited by resources and energy, require efficient multi-tenant resource management, as traditional datacenter solutions for isolation and resource partitioning are insufficient in their dynamic environments. This paper focuses on memory resource sharing across multitenant applications in MicroDCs, demonstrating through experimental analysis of a real-time object detection framework and a NoSQL database that static partitioning leads to suboptimal performance and energy consumption. Our results reveal a non-linear relationship between performance, energy consumption and memory allocation, highlighting the potential for significant energy savings through optimized resource management.
SOSP Conference Reception
List of Papers Accepted as Posters
- LLM Inference Performance on Chiplet-based Architectures and Systems
Abstract: Large Language Models (LLMs) have become increasingly prevalent, enabling a wide range of tasks across various platforms, from handheld devices and wearables to large-scale datacenters. Many of these applications, such as co-pilots and chatbots, rely on multi-billion-parameter decoder-only LLMs, which require significant computational resources to achieve desired performance metrics. As LLM workloads continue to evolve and demand more substantial computational resources, it is essential to explore innovative approaches to improve their performance.
One promising approach is the Multi-Chip-Module (MCM) architecture, which offers high-performance computing, storage, and network capabilities. However, the performance characteristics of LLMs on MCM architectures are not yet fully understood. To address this knowledge gap, we conducted a series of carefully designed experiments to investigate LLM inference performance on various MCM architectures. Our study provides detailed sensitivity analyses of die-to-die bandwidth, cache policies, and chiplet configurations, offering valuable insights into optimizing LLM performance on MCM architectures.
- Towards Static Analysis of Interrupt Blocking Time in Operating System Kernels
Abstract: Interrupts are integral to the interaction between an OS and its managed devices. They notify the OS that something requires immediate attention. However, they also disrupt ongoing executions by design and, therefore, have to be inhibited in some cases. This process requires utmost care, as unnecessarily long interrupt blocking times negatively impact system performance. In the worst case, missed interrupts may even result in catastrophic failures. The associated cause analysis for delayed interrupt reception is often time-consuming and costly due to the complex interaction between devices and the system. Especially on commodity server systems with many devices, the cause of problems related to interrupt latencies is often masked.
In this paper, we make an initial effort towards providing means for developers of general-purpose OSs to minimize the worst-case interrupt-delivery latency by identifying excessive interrupt blocking times. By leveraging symbolic execution of LLVM IR, we provide insights for the system developer into potentially problematic code paths where interrupts are masked. We currently work with x86_64, but the analysis is applicable to other architectures with minimal adaptations.
- Evaluating Infrastructure as Code: Key Metrics and Performance Benchmarks
Abstract: Organizations are increasingly adopting Infrastructure as Code (IaC) to automate the management and provisioning of cloud resources, moving away from manual configurations. Despite its growing usage, there is a lack of comprehensive research evaluating the performance of IaC tools in large-scale, real-world scenarios with complex architectures. This paper addresses this gap by comparing Terraform and AWS CloudFormation, two leading IaC tools, across key performance metrics such as CPU usage, memory consumption, system time, and user time. Using the TrainTicket project, which encompasses 47 microservices, we evaluate both tools in a controlled environment to provide insights into their effectiveness in managing large-scale cloud infrastructures. Our findings offer valuable guidance for organizations choosing between Terraform and CloudFormation in enterprise-scale deployments.
- Thallus: An RDMA-based Columnar Data Transport Protocol
Abstract: The volume of data generated and stored in contemporary global data centers is experiencing exponential growth. This rapid data growth necessitates efficient processing and analysis to extract valuable business insights. In distributed data processing systems, data exchanges between compute servers contribute significantly to the total data processing time in sufficiently large clusters, necessitating efficient data transport protocols.
Traditionally, data transport frameworks such as JDBC and ODBC have used TCP/IP-over-Ethernet as their underlying network protocol. Such frameworks require serializing the data into a single contiguous buffer before handing it off to the network card, primarily due to the requirement of contiguous data in TCP/IP. In OLAP use cases, this serialization process is costly for columnar data batches as it involves numerous memory copies that hurt data transport duration and overall data processing performance. We study the serialization overhead in the context of a widely-used columnar data format, Apache Arrow, and propose leveraging RDMA to transport Arrow data over InfiniBand in a zero-copy manner. We design and implement Thallus, an RDMA-based columnar data transport protocol for Apache Arrow based on the Thallium framework from the Mochi ecosystem, compare it with a purely Thallium RPC-based implementation, and show that substantial performance improvements can be achieved by using RDMA for columnar data transport.
- Delayed Privilege Escalation for Fast System Calls
Abstract: Executing system calls requires switching the privilege level when entering the kernel. For high-performance applications, these costly switches can be omitted by directly linking with the kernel in the form of library operating systems or unikernels. However, this approach comes at the cost of compromising security as it removes user-/kernel-space separation. In other contexts, on-demand rather than proactive action allows paying overhead costs only when necessary.
We propose that single-address systems allow for an on-demand approach for system calls. When the operating-system data is hidden by the sheer size of the address space, elevated privileges are only necessary when performing privileged hardware operations. We believe that single-address systems would benefit from such an on-demand approach.
We introduce a concept that allows us to defer the privilege-level switch until it is needed, that is, when the first privileged instruction is encountered. We present a prototype implementation of the concept as an adaptation of the Linux kernel.
- Industrial Trending on AI System Reliability: A Brief Review
Abstract: As AI systems grow in complexity and scale, particularly with the training of large language models (LLMs) that require large-scale GPU clusters, ensuring their reliability becomes increasingly critical. In large-scale training jobs, a single GPU failure can disrupt the entire job, affecting thousands of interconnected GPUs. We present a brief literature review of the latest industry trends and efforts, focusing on the reliability and fault tolerance of large-scale AI infrastructure, from companies such as Meta, Microsoft, ByteDance, and Alibaba. The review is structured around two main themes: (1) hardware failure analysis and top root causes, and (2) fault tolerance with hardware validation, failure detection, and checkpointing-based recovery.
- Towards Using Partitioned GPU Virtual Functions for Mixture of Experts
Abstract: Recent advancements in large language models (LLMs) have shown that smaller, fine-tuned models can achieve comparable or better performance than larger general-purpose models on domain-specific knowledge, even when quantized. However, these models suffer from several issues in production systems: under-utilization of memory and potential data security risks. We propose a new method of mixture of experts (MoE) inference utilizing GPU partitioning combined with single-root I/O virtualization (SRIOV), enabling better utilization of GPU memory and scalability, while ensuring model weights remain secure. LLMs today come in a variety of sizes and quantization levels, each with its own memory requirement. Using SRIOV, we can partition the GPU into one or more virtual functions (VFs), altering allocated memory and compute to fit the needs of these LLMs. With AMD Instinct™ MI300X, for example, one VF can have 24 to 192GB of high bandwidth memory (HBM), scaling up to 1.5TB per node. These SRIOV-enabled virtual machines also address the load imbalance inherent in MoE models, eliminating the need for an auxiliary loss for load balancing, while maintaining fast interconnect between all components, providing low latency during inference. Additionally, isolation capabilities built into SRIOV ensure native data security as virtual functions are isolated from each other, creating the possibility of new use cases where different vendors may provide their own expert to the mixture.
- Redesigning Edge-Centric Micro-Datacenters for Efficient Multitenancy
Abstract: Hazard monitoring systems depend on micro datacenters (MicroDCs) in remote areas for early disaster detection and response, and continuous environmental monitoring. These MicroDCs, limited by resources and energy, require efficient multi-tenant resource management, as traditional datacenter solutions for isolation and resource partitioning are insufficient in their dynamic environments. This paper focuses on memory resource sharing across multitenant applications in MicroDCs, demonstrating through experimental analysis of a real-time object detection framework and a NoSQL database that static partitioning leads to suboptimal performance and energy consumption. Our results reveal a non-linear relationship between performance, energy consumption and memory allocation, highlighting the potential for significant energy savings through optimized resource management.