I've been working on a side project to log real-time traffic statistics across Singapore and make the historic data publicly available. The real-time APIs from LTA refresh data at 1-5 minute intervals. While I'm only querying the APIs once every 15 minutes, some of the data is really rich, and file sizes add up quickly. This accumulates as storage costs on Google Cloud Platform. In particular, the traffic-speed-bands API provides speed bands (10 km/h ranges) for almost 60k road stretches across Singapore every 5 minutes.

First, I removed relatively constant fields and stored them as metadata. This was somewhat effective, as many of the constant fields were road or place names: strings which took considerable space. However, that didn't seem to be sufficient in my quest to keep costs for the project to a minimum. I was processing the data using Pandas, and as Pandas itself is able to handle compression, things were pretty convenient. I just had to figure out which compression mode to use.

To compare the different compression formats, I first generated CSV files for the 6 datasets I was logging data for. Each dataset was different in terms of average size and the types of data stored. Some had purely floats, while some had a mix of strings and floats. I ran these for 9 days, once every 15 minutes. These totalled approximately 864 files per dataset (slightly fewer due to errors with some runs). For some datasets, the number of rows is almost always the same in each query, as they come from specific meters/measurement points along roads. Other APIs are not tagged to specific locations, such as availability of taxis or incidents reported. These datasets have no fixed number of rows, and sizes vary widely.

Next, I opened the CSV files with Pandas as a dataframe and saved them with each of the compression formats. I compared the start and end sizes, as well as the write times. Noting that speed performance over time may vary, I shuffled the files before processing to ensure that such performance differences would not accumulate at any particular dataset or date range. Finally, I opened each of the compressed files and compared the time required to load each of them as a dataframe, also shuffling the files before running the tasks.

Here's an interactive version of the charts below. On the horizontal axis is the original file size in bytes, on a log scale. The compression ratio, or 1 - (size of compressed file divided by original size), is on the vertical axis. A higher number indicates that the output file is smaller (as a ratio of the original). A number below 0 indicates the file got larger.

From the shape of the chart, it appears that larger files benefit more from compression. On the far end of this range, size savings are over 90%. At smaller file sizes, compression may even make the file larger.

compression=None

For compression=None, the ratio hovers around 0, indicating not much change to file size, as expected. However, around the middle portion there are a bunch of files which got larger after compression, which was interesting. It appears that most of these were from a dataset which consists of only floats (taxi-availability).

Generally, compression='zip' (yellow line) performs worse than the other compression formats, though the difference narrows as file sizes get larger.

compression='gzip'

Compression ratios of the remaining 3 formats are much closer, though of the 3, compression='gzip' (red line) seems to have the lowest compression ratios across the spectrum.

compression='bz2' and compression='xz'

Between compression='bz2' (orange) and compression='xz' (green), the results are slightly more interesting. Looking at the bunch of lines to the center-right, the higher 3 belong to one dataset (taxi-availability) and the lower 3 belong to another (traffic-images-detections). The group for taxi-availability is ordered orange-green-red (bz2-xz-gzip) from best to worst ratios, whereas for traffic-images-detections, xz outperforms bz2. The traffic-images-detections group, while achieving less compression than the taxi-availability group, experiences greater compression ratios as size gets larger. It also appears that for the taxi-availability group, compression ratios are flat across the group. I'm not sure of the exact reasons, but a point to note is that taxi-availability consists of only floats, while traffic-images-detections is a mix of strings and floats.
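The constant-field trick mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual schema: the column names (`link_id`, `road_name`, `speed_band`) are hypothetical stand-ins for fields that stay the same between queries.

```python
import pandas as pd

# Illustrative snapshot: each link_id always maps to the same
# road_name, so the strings can be stored once as metadata instead
# of being repeated in every 15-minute CSV. (Hypothetical columns,
# not the project's actual schema.)
df = pd.DataFrame({
    "link_id": [101, 102, 103],
    "road_name": ["PIE", "AYE", "CTE"],
    "speed_band": [4.0, 2.0, 5.0],
})

# Stored once, out of band.
metadata = dict(zip(df["link_id"], df["road_name"]))

# Stored every run: ids and floats only, much smaller on disk.
slim = df.drop(columns=["road_name"])

# The full frame can be rebuilt on demand from the metadata.
restored = slim.assign(road_name=slim["link_id"].map(metadata))
```

Dropping repeated strings helps, but as noted above it only goes so far, which is what motivated the compression comparison.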
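The size-and-write-time measurement described in the article can be sketched as below. This is a minimal sketch with a synthetic floats-only dataframe standing in for the logged CSV files; the column names and file names are illustrative.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for one logged snapshot (floats only,
# like the taxi-availability dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "latitude": rng.uniform(1.2, 1.5, 5000),
    "longitude": rng.uniform(103.6, 104.0, 5000),
})

results = {}
with tempfile.TemporaryDirectory() as tmp:
    # Uncompressed baseline for the ratio.
    base = os.path.join(tmp, "snapshot.csv")
    df.to_csv(base, index=False)
    base_size = os.path.getsize(base)

    # Pandas picks the codec from the `compression` argument
    # (or infers it from the file extension).
    for codec in ["zip", "gzip", "bz2", "xz"]:
        path = os.path.join(tmp, f"snapshot.{codec}")
        start = time.perf_counter()
        df.to_csv(path, index=False, compression=codec)
        write_time = time.perf_counter() - start
        # Ratio as plotted: 1 - compressed/original.
        # Higher is better; below 0 means the file grew.
        ratio = 1 - os.path.getsize(path) / base_size
        results[codec] = (ratio, write_time)

for codec, (ratio, secs) in sorted(results.items()):
    print(f"{codec:4s} ratio={ratio:.2f} write={secs * 1000:.1f} ms")
```

Timing a single write like this is noisy; averaging over many files, as the article does, gives more stable numbers.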
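The read-back timing can be sketched similarly. Again a minimal sketch: the files are generated on the spot rather than taken from the project, and Pandas infers the codec from the file extension on read as well as on write.

```python
import os
import random
import tempfile
import time

import pandas as pd

tmpdir = tempfile.mkdtemp()

# Write one file per codec; the extension tells Pandas which
# codec to use on both write and read.
df = pd.DataFrame({"value": range(10_000)})
paths = []
for ext in ["zip", "gz", "bz2", "xz"]:
    path = os.path.join(tmpdir, f"snapshot.csv.{ext}")
    df.to_csv(path, index=False)
    paths.append(path)

# Shuffle so that any drift in machine performance over the run
# does not accumulate against one particular format, mirroring
# the approach described above.
random.shuffle(paths)

load_times = {}
for path in paths:
    start = time.perf_counter()
    loaded = pd.read_csv(path)
    load_times[path.rsplit(".", 1)[-1]] = time.perf_counter() - start

for ext, secs in sorted(load_times.items()):
    print(f".{ext:3s} load={secs * 1000:.1f} ms")
```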