The short answer is no, there is nothing "pathological" about parquet;
it should be more or less as well suited to wide tables as Arrow's IPC
format. Both formats require additional metadata when there are more
columns, and compressibility may differ (although .arrows data is
often left uncompressed).
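If you want to see the footer growth directly, something like this
works (a quick sketch; the shapes and paths here are arbitrary):

import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
# Two tiny files with the same number of cells but different shapes.
tall = pa.Table.from_arrays([np.random.rand(10_000)], names=['v'])
wide = pa.Table.from_arrays(list(np.random.rand(100, 100)),
                            names=[f'f{i}' for i in range(100)])
pq.write_table(tall, '/tmp/meta_tall.pq')
pq.write_table(wide, '/tmp/meta_wide.pq')
# serialized_size is the size of the Thrift footer; it grows with the
# number of column chunks.
print(pq.ParquetFile('/tmp/meta_tall.pq').metadata.serialized_size)
print(pq.ParquetFile('/tmp/meta_wide.pq').metadata.serialized_size)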
Can you provide your test script? I don't get quite the same results.
For my test I created two tables: one that was 10,000 columns by 8,000
rows and one that was 80,000,000 rows in 1 column. There is simply
more metadata when you have 10k columns, and less opportunity for
compression. As a result the file sizes were 611M for the tall and
739M for the wide, so the wide file requires about 20% more space.
Reading times (hot-in-cache reads) were ~0.73s for the tall and ~0.84s
for the wide, so the wide takes about 15% more time to read. This
seems about right to me.
## Writing script
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

TALL_ROWS = 80_000_000
TALL_COLS = 1
WIDE_ROWS = 8_000
WIDE_COLS = 10_000

# Both tables hold the same 80M random doubles, just shaped differently.
tall_data = np.random.rand(TALL_COLS, TALL_ROWS)
wide_data = np.random.rand(WIDE_COLS, WIDE_ROWS)

tall_table = pa.Table.from_arrays([tall_data[0]], names=["values"])
pq.write_table(tall_table, '/tmp/tall.pq')

wide_names = [f'f{i}' for i in range(WIDE_COLS)]
wide_table = pa.Table.from_arrays(wide_data, names=wide_names)
pq.write_table(wide_table, '/tmp/wide.pq')
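As an aside, the wide layout still lets you control row group length
at write time and read back a column subset; a sketch continuing from
the script above (365 rows per group is just the one-year example from
Joris's mail, and the column names are the generated ones):

# ~1 year of daily rows per row group (365 is illustrative).
pq.write_table(wide_table, '/tmp/wide_rg.pq', row_group_size=365)
# Read back only a few stocks (columns).
subset = pq.read_table('/tmp/wide_rg.pq', columns=['f0', 'f1', 'f2'])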
## Reading script
import pyarrow.parquet as pq

# Swap in '/tmp/wide.pq' to read the wide file.
table = pq.read_table('/tmp/tall.pq')
print(table.num_rows)
print(table.num_columns)
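The hot-in-cache timings can be reproduced with something like the
below (a sketch; the first read is just a warm-up to populate the OS
cache, and perf_counter is one way to measure):

import time
import pyarrow.parquet as pq
for path in ['/tmp/tall.pq', '/tmp/wide.pq']:
    pq.read_table(path)  # warm-up read
    start = time.perf_counter()
    pq.read_table(path)
    print(path, time.perf_counter() - start)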
On Tue, Jul 13, 2021 at 6:23 AM Martin Percossi <[email protected]> wrote:
>
> An alternative representation would be to have a single settlement price
> column, and add a stock_id column. Instead of a single row for each time
> step, you would now have, say, 10K rows - one for each stock.
>
> I think this will yield better performance.
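>
> Something like this (a pandas sketch; all names are illustrative):
>
> import pandas as pd
> import numpy as np
> dates = pd.date_range('1991-01-01', periods=8, freq='D')
> stocks = [f's{i}' for i in range(4)]
> wide_df = pd.DataFrame(np.random.rand(8, 4), index=dates, columns=stocks)
> # Melt dates-by-stocks into one (date, stock_id, settle) row per cell.
> long_df = (wide_df.rename_axis('date').reset_index()
>            .melt(id_vars='date', var_name='stock_id', value_name='settle'))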
>
> On Tue, 13 Jul 2021, 18:12 Joris Peeters, <[email protected]> wrote:
>>
>> Hello,
>>
>> Sending to user@arrow, as that appears the best place for parquet questions
>> atm, but feel free to redirect me.
>>
>> My objective is to store financial data in Parquet files, and read it out
>> fast.
>> The columns represent stocks (~= 10,000 or so), and each row is a date (~=
>> 8000, e.g. 30 years). Values are e.g. settlement prices. I might want to use
>> short row groups of e.g. a year each, for quickly getting to smaller date
>> ranges, or query for a subset of columns (stocks).
>>
>> The appeal of parquet is that I could store all of this stuff in one file,
>> and use the row-groups + column-select for slicing, rather than have a ton
>> of smaller files etc. Would also integrate well with various ML tech.
>>
>> When doing some basic performance testing, with random data, I noticed that
>> the performance for tables with many columns seems fairly poor. I've
>> attached a little benchmark script - see output at the bottom.
>>
>> Stylised conclusions:
>> - Reading/writing a "tall" (nrows >> ncols) dataframe is much more
>> performant than a "wide" dataframe.
>> - With the Arrow format (as opposed to parquet), the difference is much
>> smaller.
>> - Similar results on Windows & Linux, and for Arrow's parquet vs fastparquet.
>>
>> Is there something pathological about the parquet format that manifests in
>> this regime, or is it rather that the code might not have been optimised for
>> this? Aware that ncols >> nrows is not ideal, but was hoping for less of a
>> cliff.
>>
>> Happy to dig in, but polling experts first.
>>
>> Best,
>> -J
>>
>> >python benchmark.py
>> 2021-07-13 16:31:54.786 INFO Writing parquet to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
>> 2021-07-13 16:31:55.123 INFO Written.
>> 2021-07-13 16:31:55.123 INFO Writing parquet to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
>> 2021-07-13 16:31:57.155 INFO Written.
>> 2021-07-13 16:31:57.155 INFO Writing parquet to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_fpq.pq
>> [FastParquet]
>> 2021-07-13 16:31:57.789 INFO Written.
>> 2021-07-13 16:31:57.790 INFO Writing parquet to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_fpq.pq
>> [FastParquet]
>> 2021-07-13 16:32:03.613 INFO Written.
>> 2021-07-13 16:32:03.613 INFO Reading parquet from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
>> 2021-07-13 16:32:03.890 INFO Read.
>> 2021-07-13 16:32:03.899 INFO Reading parquet from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
>> 2021-07-13 16:32:08.727 INFO Read.
>> 2021-07-13 16:32:08.737 INFO Reading parquet from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq
>> [FastParquet]
>> 2021-07-13 16:32:08.983 INFO Read.
>> 2021-07-13 16:32:08.991 INFO Reading parquet from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq
>> [FastParquet]
>> 2021-07-13 16:32:11.580 INFO Read.
>> 2021-07-13 16:32:11.589 INFO Writing Arrow to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
>> 2021-07-13 16:32:13.057 INFO Arrow written.
>> 2021-07-13 16:32:13.078 INFO Writing Arrow to
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
>> 2021-07-13 16:32:13.425 INFO Arrow written.
>> 2021-07-13 16:32:13.434 INFO Reading Arrow from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
>> 2021-07-13 16:32:13.620 INFO Read.
>> 2021-07-13 16:32:13.637 INFO Reading Arrow from
>> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
>> 2021-07-13 16:32:13.711 INFO Read.
>>
>>