The short answer is no, there is nothing "pathological" about parquet;
it should be more or less as well suited to wide tables as Arrow's IPC
format. Both formats require additional metadata when there are more
columns, and compressibility may differ (although .arrows data is
often left uncompressed).
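If you want to see the footer growth directly, something like this
works (a quick sketch; the shapes and paths here are arbitrary):

import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
# Two tiny files with the same number of cells but different shapes.
tall = pa.Table.from_arrays([np.random.rand(10_000)], names=['v'])
wide = pa.Table.from_arrays(list(np.random.rand(100, 100)),
                            names=[f'f{i}' for i in range(100)])
pq.write_table(tall, '/tmp/meta_tall.pq')
pq.write_table(wide, '/tmp/meta_wide.pq')
# serialized_size is the size of the Thrift footer; it grows with the
# number of column chunks.
print(pq.ParquetFile('/tmp/meta_tall.pq').metadata.serialized_size)
print(pq.ParquetFile('/tmp/meta_wide.pq').metadata.serialized_size)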
Can you provide your test script? I don't get quite the same results.
For my test I created two tables: one that was 10,000 columns by 8,000
rows and one that was 80,000,000 rows in 1 column. There is simply
more metadata when you have 10k columns, and less opportunity for
compression. As a result the file sizes were 611M for the tall and
739M for the wide, so the wide file requires about 20% more space.
Reading times (hot-in-cache reads) were ~0.73s for the tall and ~0.84s
for the wide, so the wide takes about 15% more time to read. This
seems about right to me.
## Writing script
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

TALL_ROWS = 80_000_000
TALL_COLS = 1
WIDE_ROWS = 8_000
WIDE_COLS = 10_000

# Both tables hold the same 80M random doubles, just shaped differently.
tall_data = np.random.rand(TALL_COLS, TALL_ROWS)
wide_data = np.random.rand(WIDE_COLS, WIDE_ROWS)

tall_table = pa.Table.from_arrays([tall_data[0]], names=["values"])
pq.write_table(tall_table, '/tmp/tall.pq')

wide_names = [f'f{i}' for i in range(WIDE_COLS)]
wide_table = pa.Table.from_arrays(wide_data, names=wide_names)
pq.write_table(wide_table, '/tmp/wide.pq')
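As an aside, the wide layout still lets you control row group length
at write time and read back a column subset; a sketch continuing from
the script above (365 rows per group is just the one-year example from
Joris's mail, and the column names are the generated ones):

# ~1 year of daily rows per row group (365 is illustrative).
pq.write_table(wide_table, '/tmp/wide_rg.pq', row_group_size=365)
# Read back only a few stocks (columns).
subset = pq.read_table('/tmp/wide_rg.pq', columns=['f0', 'f1', 'f2'])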
## Reading script
import pyarrow.parquet as pq

# Swap in '/tmp/wide.pq' to read the wide file.
table = pq.read_table('/tmp/tall.pq')
print(table.num_rows)
print(table.num_columns)
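The hot-in-cache timings can be reproduced with something like the
below (a sketch; the first read is just a warm-up to populate the OS
cache, and perf_counter is one way to measure):

import time
import pyarrow.parquet as pq
for path in ['/tmp/tall.pq', '/tmp/wide.pq']:
    pq.read_table(path)  # warm-up read
    start = time.perf_counter()
    pq.read_table(path)
    print(path, time.perf_counter() - start)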
On Tue, Jul 13, 2021 at 6:23 AM Martin Percossi <[email protected]> wrote:
>
> An alternative representation would be to have a single settlement price
> column, and add a stock_id column. Instead of a single row for each time
> step, you would now have, say, 10K rows - one for each stock.
>
> I think this will yield better performance.
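>
> Something like this (a pandas sketch; all names are illustrative):
>
> import pandas as pd
> import numpy as np
> dates = pd.date_range('1991-01-01', periods=8, freq='D')
> stocks = [f's{i}' for i in range(4)]
> wide_df = pd.DataFrame(np.random.rand(8, 4), index=dates, columns=stocks)
> # Melt dates-by-stocks into one (date, stock_id, settle) row per cell.
> long_df = (wide_df.rename_axis('date').reset_index()
>            .melt(id_vars='date', var_name='stock_id', value_name='settle'))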
>
> On Tue, 13 Jul 2021, 18:12 Joris Peeters, <[email protected]> wrote:
>>
>> Hello,
>>
>> Sending to user@arrow, as that appears the best place for parquet questions
>> atm, but feel free to redirect me.
>>
>> My objective is to store financial data in Parquet files, and read it out
>> fast.
>> The columns represent stocks (~= 10,000 or so), and each row is a date (~=
>> 8000, e.g. 30 years). Values are e.g. settlement prices. I might want to use
>> short row groups of e.g. a year each, for quickly getting to smaller date
>> ranges, or query for a subset of columns (stocks).
>>
>> The appeal of parquet is that I could store all of this stuff in one file,
>> and use the row-groups + column-select for slicing, rather than have a ton
>> of smaller files etc. Would also integrate well with various ML tech.
>>
>> When doing some basic performance testing, with random data, I noticed that
>> the performance for tables with many columns seems fairly poor. I've
>> attached a little benchmark script - see output at the bottom.
>>
>> Stylised conclusions:
>> - Reading/writing a "tall" (nrows >> ncols) dataframe is much more
>> performant than a "wide" dataframe.
>> - With the Arrow format (as opposed to parquet), the difference is much
>> smaller.
>> - Similar results on Windows & Linux, and for Arrow's parquet vs fastparquet.
>>
>> Is there something pathological about the parquet format that manifests in
>> this regime, or is it rather that the code might not have been optimised for
>> this? Aware that ncols >> nrows is not ideal, but was hoping for less of a
>> cliff.
>>
>> Happy to dig in, but polling experts first.
>>
>> Best,
>> -J
>>
>> >python benchmark.py
>> 2021-07-13 16:31:54.786 INFO Writing parquet to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
>> 2021-07-13 16:31:55.123 INFO Written.
>> 2021-07-13 16:31:55.123 INFO Writing parquet to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
>> 2021-07-13 16:31:57.155 INFO Written.
>> 2021-07-13 16:31:57.155 INFO Writing parquet to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_fpq.pq
>> [FastParquet]
>> 2021-07-13 16:31:57.789 INFO Written.
>> 2021-07-13 16:31:57.790 INFO Writing parquet to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_fpq.pq
>> [FastParquet]
>> 2021-07-13 16:32:03.613 INFO Written.
>> 2021-07-13 16:32:03.613 INFO Reading parquet from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
>> 2021-07-13 16:32:03.890 INFO Read.
>> 2021-07-13 16:32:03.899 INFO Reading parquet from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
>> 2021-07-13 16:32:08.727 INFO Read.
>> 2021-07-13 16:32:08.737 INFO Reading parquet from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq
>> [FastParquet]
>> 2021-07-13 16:32:08.983 INFO Read.
>> 2021-07-13 16:32:08.991 INFO Reading parquet from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq
>> [FastParquet]
>> 2021-07-13 16:32:11.580 INFO Read.
>> 2021-07-13 16:32:11.589 INFO Writing Arrow to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
>> 2021-07-13 16:32:13.057 INFO Arrow written.
>> 2021-07-13 16:32:13.078 INFO Writing Arrow to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
>> 2021-07-13 16:32:13.425 INFO Arrow written.
>> 2021-07-13 16:32:13.434 INFO Reading Arrow from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
>> 2021-07-13 16:32:13.620 INFO Read.
>> 2021-07-13 16:32:13.637 INFO Reading Arrow from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
>> 2021-07-13 16:32:13.711 INFO Read.
>>
>>