In early 2024, the 1 Billion Row Challenge (1BRC) went viral. The goal was simple yet daunting: process a file with 1 billion rows of temperature measurements and calculate the minimum, maximum, and mean temperatures for each weather station in the dataset. This benchmark pushed developers to explore extreme optimization techniques. The winning solution completed the challenge in a mere 1.535 seconds, decisively debunking the long-standing myth that Java is inherently slow.
The question I aim to answer in this blog is: how was this incredible feat achieved, and what techniques can we apply to make Java code faster?
The first and simplest approach is to leverage the ability of modern processors to handle multiple tasks simultaneously—also known as parallelization. Parallelization makes work faster by dividing a task into smaller parts that can be completed simultaneously. Instead of processing everything one step at a time (sequentially), parallelization assigns different parts of the work to multiple processors or cores.
Java offers powerful libraries for parallelization, making it easy to divide tasks and process them simultaneously. Key tools include the Fork/Join Framework for divide-and-conquer problems, Parallel Streams for effortless parallel data operations, and the Executor Framework for managing threads. Features like CompletableFuture enable asynchronous parallel tasks, while third-party libraries like Akka and RxJava support advanced concurrency. These tools simplify parallelization, boosting performance by leveraging modern multi-core processors.
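As a minimal sketch of the parallel-streams route, the aggregation can be expressed in a few lines. The `Measurement` record and the station data here are illustrative placeholders; the real challenge parses these values out of a text file:

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelStats {
    // Hypothetical record for one parsed row of the input.
    record Measurement(String station, double temp) {}

    static Map<String, DoubleSummaryStatistics> aggregate(List<Measurement> data) {
        // parallelStream() splits the work across the common ForkJoinPool,
        // so all available cores participate in the aggregation.
        return data.parallelStream()
                .collect(Collectors.groupingByConcurrent(
                        Measurement::station,
                        Collectors.summarizingDouble(Measurement::temp)));
    }

    public static void main(String[] args) {
        var stats = aggregate(List.of(
                new Measurement("Oslo", -3.2),
                new Measurement("Oslo", 4.8),
                new Measurement("Lima", 21.0)));
        stats.forEach((station, s) -> System.out.printf(
                "%s min=%.1f max=%.1f mean=%.1f%n",
                station, s.getMin(), s.getMax(), s.getAverage()));
    }
}
```

`DoubleSummaryStatistics` conveniently tracks min, max, sum, and count in one pass, which is exactly the shape of result the challenge asks for.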
The fastest solutions took this approach to the next level. A common bottleneck in parallel processing is that files are often read sequentially, even when the data is processed in parallel. Without delving too deeply into the details, the fastest developers overcame this by writing custom code to split the text into chunks optimized for the number of cores and CPU cache sizes, allowing the file to be read and processed simultaneously across all cores—fully maximizing their utilization. Understanding your architecture is key.
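A rough sketch of such chunking is below. For brevity it operates on an in-memory `ByteBuffer`; the fast solutions would pass a memory-mapped buffer obtained from `FileChannel.map()` so the file's bytes are never copied into the heap. The station data in `main` is made up for illustration:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class ChunkSplitter {
    record Chunk(long start, long end) {}

    // Divide the input into roughly equal chunks, one per worker, then nudge
    // each boundary forward to the next '\n' so no line is split in half.
    static List<Chunk> split(ByteBuffer mem, int parts) {
        long size = mem.limit();
        List<Chunk> chunks = new ArrayList<>();
        long start = 0;
        for (int i = 1; i <= parts; i++) {
            long end = (i == parts) ? size : size * i / parts;
            while (end < size && mem.get((int) end) != '\n') end++;
            if (end < size) end++; // include the newline in this chunk
            chunks.add(new Chunk(start, end));
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        var buf = ByteBuffer.wrap("Oslo;4.8\nLima;21.0\nPune;31.5\n".getBytes());
        split(buf, 2).forEach(System.out::println);
    }
}
```

Each worker thread then parses its own `Chunk` independently, and the per-chunk results are merged at the end.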
Working hard isn’t enough; you also need to work smart. As each byte of every line is read, the data must also be interpreted. Since the input was a CSV-style file (semicolon-separated in the challenge), this involved identifying the separators and efficiently parsing each line into meaningful pieces for further processing.
A simple way to handle this is by using regular expressions, often the go-to approach in everyday programming. However, for extreme performance, regular expressions are too slow due to their sequential nature. To optimize processing, it’s necessary to break down each line into bytes and process them in parallel, leveraging the capabilities of modern CPUs.
The first approach is using Java’s Vector API, which enables SIMD (Single Instruction, Multiple Data) operations. SIMD allows a single instruction to process multiple bytes simultaneously, significantly boosting performance. For example, instead of checking each character one by one, the Vector API processes entire chunks of data (e.g., 128 bits) in parallel, identifying target values like separator characters in a single operation. This method is ideal for maximizing CPU core utilization; however, it does have some limitations, such as incompatibility with GraalVM.
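A sketch of this idea with the incubating Vector API is shown below (it must be run with `--add-modules jdk.incubator.vector`). The class and method names are illustrative, not taken from any contestant's solution:

```java
// Requires: --add-modules jdk.incubator.vector
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class SimdFind {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    // Find the first occurrence of `target` at or after `from`,
    // comparing 16 bytes per instruction on SSE/NEON hardware.
    static int indexOf(byte[] data, byte target, int from) {
        int i = from;
        int upper = from + SPECIES.loopBound(data.length - from);
        for (; i < upper; i += SPECIES.length()) {
            var chunk = ByteVector.fromArray(SPECIES, data, i);
            var mask = chunk.eq(target);        // one compare covers 16 lanes
            if (mask.anyTrue()) return i + mask.firstTrue();
        }
        for (; i < data.length; i++) {          // scalar tail for the leftovers
            if (data[i] == target) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] line = "Hamburg;12.0".getBytes();
        System.out.println(indexOf(line, (byte) ';', 0)); // prints 7
    }
}
```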
The second approach involves masks, which efficiently process data at the byte level. A mask is an array that marks positions of target values (e.g., 1 for a match, 0 otherwise) using branchless comparisons. Masks avoid the performance penalties of conditional logic, ensuring predictable, high-speed processing. While lower-level than the Vector API, masks provide a scalable and hardware-agnostic solution for parallel data parsing.
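A minimal sketch of the branchless idea, with names invented for illustration: instead of an `if`, the comparison result is converted to 0 or 1 arithmetically, so the CPU's branch predictor never comes into play.

```java
public class SeparatorMask {
    // Build a 0/1 mask of separator positions with no branches in the loop
    // body: the match test is pure arithmetic, so its cost is identical
    // regardless of the data.
    static int[] mask(byte[] data, byte target) {
        int[] m = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            int x = (data[i] ^ target) & 0xFF; // 0 only when the bytes match
            m[i] = (x - 1) >>> 31;             // 1 on match, 0 otherwise
        }
        return m;
    }

    public static void main(String[] args) {
        byte[] line = "a;b;".getBytes();
        System.out.println(java.util.Arrays.toString(mask(line, (byte) ';')));
        // prints [0, 1, 0, 1]
    }
}
```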
When processing data, temporary results need to be stored efficiently while handling the next piece of data. A commonly used data structure for this purpose is the standard Java hash map (HashMap), a reliable solution for key-value storage. However, while HashMap is versatile, it can become a performance bottleneck due to hash collisions and the overhead involved in resolving them.
To address this, some developers explored alternative storage methods. One particularly intriguing approach was creating a custom hash map with forward probing. A standard Java HashMap works by transforming a key into a hash value and storing the value at an index determined by the hash. However, hash collisions occur when multiple keys are assigned the same index. In such cases, the HashMap stores these values in a list. Retrieving a value then requires accessing the list and traversing it to find the desired value, which can slow down performance.
In contrast, a hash map using forward probing avoids storing values with the same index in a list. Instead, it searches for the next available slot in its array and places the value there. This approach is faster, provided collisions are kept to a minimum. Since the set of weather stations in this case was fixed, developers could design a custom hash function optimized to minimize collisions. By fully understanding the constraints of their data, they were able to implement an extremely efficient data layout.
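A minimal sketch of such a forward-probing (linear-probing) table is below. The capacity, the use of `String.hashCode()`, and the stats tracked are all simplifying assumptions; contestants used custom hashes over raw bytes and tracked min and max as well:

```java
public class ProbingMap {
    // Capacity is a power of two so `hash & MASK` replaces the modulo,
    // and is far larger than the ~10k distinct stations, keeping the
    // table sparse and collisions rare.
    static final int CAPACITY = 1 << 14;
    static final int MASK = CAPACITY - 1;
    final String[] keys = new String[CAPACITY];
    final double[] sums = new double[CAPACITY];
    final int[] counts = new int[CAPACITY];

    void add(String station, double temp) {
        int i = station.hashCode() & MASK;
        // On a collision, walk forward to the next free or matching slot
        // instead of chasing a per-bucket list of nodes.
        while (keys[i] != null && !keys[i].equals(station)) {
            i = (i + 1) & MASK;
        }
        keys[i] = station;
        sums[i] += temp;
        counts[i]++;
    }

    double mean(String station) {
        int i = station.hashCode() & MASK;
        while (!station.equals(keys[i])) i = (i + 1) & MASK;
        return sums[i] / counts[i];
    }

    public static void main(String[] args) {
        var m = new ProbingMap();
        m.add("Oslo", 1.0);
        m.add("Oslo", 3.0);
        System.out.println(m.mean("Oslo")); // prints 2.0
    }
}
```

Because the slots live in flat primitive arrays rather than linked node objects, probing also walks memory sequentially, which is friendly to the CPU cache.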
This approach is less about optimizing the code itself and more about optimizing how Java runs the code. As you may know, Java source code is first compiled into an intermediate format called bytecode, which is platform-independent. At runtime, the JVM (Java Virtual Machine) uses a Just-In-Time (JIT) compiler to convert this bytecode into native machine code specific to the underlying hardware. This process allows Java programs to achieve performance close to that of native applications.
However, JIT compilation happens during program execution, which can create overhead. The JVM spends time analyzing and optimizing code paths at runtime, leading to initial "warm-up" delays. This can slow down execution, especially for performance-critical or short-lived applications.
To address this, contestants used GraalVM, which improves runtime performance by leveraging an Ahead-Of-Time (AOT) compiler. GraalVM compiles Java bytecode into native machine code before the program runs, eliminating the need for JIT compilation during execution. This reduces warm-up times and improves startup performance. Additionally, GraalVM applies advanced optimization techniques to generate highly efficient native code, resulting in faster execution overall, particularly for applications that demand high performance or quick startup times.
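In practice this looks roughly as follows, assuming GraalVM is installed and on the `PATH`; the class and file names here are placeholders:

```shell
# Compile to bytecode as usual, then AOT-compile to a native binary
# with GraalVM's native-image tool.
javac CalculateAverage.java
native-image CalculateAverage

# The result is a standalone executable with no JVM startup or JIT warm-up.
./calculateaverage measurements.txt
```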
For some developers, even this wasn’t fast enough. As you may know, Java uses a Garbage Collector (GC) to clean up objects that are no longer referenced. Without this process, a program running long enough would eventually run out of memory. However, while the Garbage Collector is working, the CPU is occupied and not running your program, which can cause slowdowns.
To remedy this, some developers simply turned off the Garbage Collector. (Technically, they used the Epsilon GC, a collector that allocates memory but never reclaims it.) This is feasible for short-lived programs because all memory is automatically released once the program finishes execution. By disabling the GC, developers eliminated this overhead, allowing the CPU to focus entirely on the program and achieve faster execution times.
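Enabling Epsilon takes two real HotSpot flags; the heap size and class name below are illustrative:

```shell
# Epsilon is a no-op collector: it hands out memory but never reclaims it.
# Pre-sizing the heap (-Xms equal to -Xmx) also avoids resizing pauses.
java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC \
     -Xms8g -Xmx8g CalculateAverage
```

If the program allocates more than the heap can hold, it dies with an `OutOfMemoryError`, so this only works when total allocation is known to fit.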
One of the key benefits of using Java over languages like C or C++ is its strong emphasis on safety. Java is designed to minimize the risk of causing harm to your system while running code. Each Java application operates within its own dedicated memory space and cannot access or interfere with memory allocated to other processes. This isolation, often referred to as a sandbox, ensures that Java programs cannot disrupt other processes or compromise system stability.
Java is also memory-safe, meaning it prevents unauthorized access to memory. Unlike C or C++, where developers manage memory manually, Java uses built-in mechanisms to handle memory allocation and cleanup. For example, Java’s automatic Garbage Collector ensures that unused objects are safely removed without risking memory leaks or invalid memory references. Combined with its robust runtime checks and strict type system, Java significantly reduces vulnerabilities, making it a highly secure choice for developers.
However, this level of safety comes at a cost: it can slow down your code. To achieve even greater performance, some developers bypassed these safety features by using Java's Unsafe library for direct memory manipulation. The Unsafe library allows developers to perform low-level operations, such as directly allocating and accessing memory, circumventing Java’s built-in memory management and safety checks.
One specific optimization involved bypassing the unmapping of data, which is typically handled automatically by Java’s memory management system. By taking control of these processes, developers were able to reduce overhead and execute operations significantly faster.
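A small sketch of what `sun.misc.Unsafe` access looks like is below. The class name is invented for illustration; note that `Unsafe` is unsupported API, and its memory-access methods are deprecated in recent JDKs in favor of the Foreign Function & Memory API:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class OffHeap {
    static final Unsafe UNSAFE = getUnsafe();

    static Unsafe getUnsafe() {
        try {
            // Unsafe is not meant for application code, so it has to be
            // fished out of its private static field via reflection.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        // Allocate 8 bytes outside the Java heap: no bounds checks,
        // no GC tracking -- and no safety net if the address is wrong.
        long addr = UNSAFE.allocateMemory(8);
        UNSAFE.putLong(addr, 42L);
        System.out.println(UNSAFE.getLong(addr)); // prints 42
        UNSAFE.freeMemory(addr);
    }
}
```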
The above-mentioned techniques sacrifice safety for speed and should, therefore, be used only in extreme cases when a developer truly knows what they are doing.
How to write fast code: