1 Billion Row Challenge Napkin Math


Photo by Clinton Naik on Unsplash

Problem Overview

You need to write a program that processes 1 billion rows of weather station data to calculate the minimum, mean, and maximum temperatures for each station. The data is stored in a text file of more than 10 GB, which presents a significant computational challenge.

This problem is an excellent exercise in data structure optimization and performance engineering. It offers a chance to explore various techniques across different programming languages to achieve efficient processing while adhering to constraints.

Key Challenges

Napkin Math of Input and Output Data

Is it always 10 GB?

We can calculate the size of the input more precisely. Although a rough estimate of the input.txt file size already exists, let's work it out ourselves.

Each row consists of a weather station name and a temperature value, separated by a semicolon. Example:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
Hamburg;34.2
St. John's;15.2
Cracow;12.6
...
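Before doing the math, it helps to see the task itself. Here is a minimal, unoptimized Python sketch of the required aggregation; the function name and the assumption that the data sits in a local file are mine, not part of the challenge:

```python
from collections import defaultdict

def aggregate(path):
    """Compute (min, mean, max) per station from a 1BRC-style file."""
    # station -> [min, max, running sum, count]
    stats = defaultdict(lambda: [float("inf"), float("-inf"), 0.0, 0])
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Each row is "StationName;Temperature".
            name, _, value = line.rstrip("\n").partition(";")
            t = float(value)
            s = stats[name]
            s[0] = min(s[0], t)
            s[1] = max(s[1], t)
            s[2] += t
            s[3] += 1
    # Sort stations alphabetically, as the challenge output requires.
    return {name: (s[0], s[2] / s[3], s[1]) for name, s in sorted(stats.items())}
```

This naive version would be far too slow for a billion rows, but it pins down exactly what has to be computed.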

Official Constraints

Here is a summary of the constraints from the task description:

- The file contains exactly 1,000,000,000 rows.
- A station name is a UTF-8 string of 1 to 100 bytes, containing no ; or newline characters.
- There are at most 10,000 unique station names.
- A temperature is a value between -99.9 and 99.9, always with exactly one fractional digit.

These constraints are deliberately generous: they cover far more than typical real-life data, leaving plenty of headroom.

Nature of the Name

The input data consists of real-life entries where the first part of the line, up to the semicolon (;), is the weather station name. Typically, we can assume this is a city name. Most city names are short, usually well under 30 bytes even when multi-byte UTF-8 characters are involved, though a few outliers can be much longer.

For practical purposes, we will use an estimated average length of 15 bytes for the city names. This conservative estimate keeps the initial calculations simple, although the official constraint allows up to 100 bytes.
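A quick check shows why byte length, not character count, is the right unit for UTF-8 station names (the sample names here are illustrative):

```python
# Character count vs. UTF-8 byte count for a few station names.
for name in ["Hamburg", "St. John's", "Zürich", "Thiès"]:
    print(name, len(name), len(name.encode("utf-8")))
# Accented characters push the byte length above the character count:
# "Zürich" is 6 characters but 7 bytes in UTF-8.
```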

Nature of the Temperature

The temperature value is predictable within certain ranges. The Earth's recorded extremes are about -89.2 °C (Vostok Station) and 56.7 °C (Death Valley), both comfortably inside the challenge's allowed range of -99.9 to 99.9.

These values translate to string representations of 4 to 5 bytes: "-89.2" takes 5 bytes, while "56.7" takes 4.

Given that extreme cold is rarer than moderate and hot temperatures, we'll assume an average temperature string length of 4 bytes.
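A quick sanity check of those string lengths, using the extremes and a few everyday readings:

```python
# Temperature values exactly as they would appear in the file.
samples = ["-99.9", "-89.2", "-5.0", "8.9", "12.6", "38.8", "56.7", "99.9"]
for s in samples:
    print(s, len(s))
# The worst case ("-99.9") is 5 bytes; typical above-zero readings are
# 3-4 bytes, which makes 4 bytes a reasonable average.
```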

Calculating the Input


Photo by Francis Bouffard on Unsplash

Let’s go wild and use just the official constraints, then compare with optimistic estimations.

Maximum File Size

Per row

100 bytes (station name) + 1 byte (semicolon) + 5 bytes (temperature) + 1 byte (newline) = 107 bytes

Total

1,000,000,000 rows * 107 bytes/row = 107,000,000,000 bytes ≈ 99.6 GiB

This is a lot. With these limitations, the resulting file should be extremely large. It would be interesting to see the results of the participants processing a 100 GiB file.
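The same worst-case arithmetic as a few lines of Python, for a sanity check:

```python
ROWS = 1_000_000_000
NAME = 100   # bytes, official maximum for a station name
SEMI = 1     # ';'
TEMP = 5     # "-99.9" in the worst case
NL = 1       # '\n'

row = NAME + SEMI + TEMP + NL     # 107 bytes per row
total = ROWS * row                # 107,000,000,000 bytes
print(f"{total:,} bytes ≈ {total / 2**30:.2f} GiB")
```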

Estimated Optimistic File Size

Per row

15 bytes (station name) + 1 byte (semicolon) + 4 bytes (temperature) + 1 byte (newline) = 21 bytes

Total

1,000,000,000 rows * 21 bytes/row = 21,000,000,000 bytes ≈ 19.5 GiB

Oops, a nearly 20 GiB file. That is twice as large as the size mentioned in the challenge details. Even if we use 13 bytes for the city name and 3 bytes for the temperature (an 18-byte row), it would still be a roughly 17 GiB file.
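And the optimistic estimate in Python, including the leaner 13-byte/3-byte variant mentioned above:

```python
ROWS = 1_000_000_000

optimistic = ROWS * (15 + 1 + 4 + 1)   # 21 bytes per row
lean = ROWS * (13 + 1 + 3 + 1)         # 18 bytes per row

print(f"optimistic: {optimistic / 2**30:.2f} GiB")
print(f"lean:       {lean / 2**30:.2f} GiB")
```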

Calculating the Output

Let’s estimate the size of the output data:

For each weather station, the output format is StationName=min/mean/max, where min, mean, and max are each formatted with one fractional digit and take up to 5 bytes (including a possible minus sign), and entries are separated by ", ".

Thus, the worst case for each station is:

Total per station: 100 bytes (station name) + 1 byte (=) + 5 bytes (min) + 1 byte (/) + 5 bytes (mean) + 1 byte (/) + 5 bytes (max) + 2 bytes (, ) = 120 bytes

With up to 10,000 unique stations, the total output size is:

Total for all stations: 120 bytes/station * 10,000 stations = 1,200,000 bytes ≈ 1.1 MiB

Note that the enclosing braces {...} are omitted from this calculation: they add only two bytes in total, and the final entry ends with } instead of the ", " separator, so the result barely changes.
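The output estimate as Python:

```python
STATIONS = 10_000
# name + '=' + min + '/' + mean + '/' + max + ', '
entry = 100 + 1 + 5 + 1 + 5 + 1 + 5 + 2    # 120 bytes per station, worst case
total = STATIONS * entry
print(f"{total:,} bytes ≈ {total / 2**20:.2f} MiB")
```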

Summary

In summary, processing an estimated 19 GiB input file into a 1.1 MiB output file is a reduction of more than four orders of magnitude. This reflects how effectively per-station statistics summarize the raw temperature data, and it underscores the importance of efficient data processing when handling large datasets.


That’s all folks!