An alternative representation would be to have a single settlement price
column, and add a stock_id column. Instead of a single row for each time
step, you would now have, say, 10K rows - one for each stock.

I think this will yield better performance.
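
To make the suggestion concrete, here is a minimal sketch of the reshaping with pandas. It assumes the wide frame is indexed by date with one column per stock; the helper names (wide_to_long, write_long_parquet), the column name "date", and the rows_per_group value are illustrative choices of mine, not anything from the benchmark script. I have not timed this against the wide layout, so treat it as a starting point for measurement.

```python
import numpy as np
import pandas as pd


def wide_to_long(wide: pd.DataFrame) -> pd.DataFrame:
    """Turn a date-by-stock table into (date, stock_id, settlement_price) rows."""
    long = (
        wide.rename_axis("date")   # name the index so reset_index yields a "date" column
        .reset_index()
        .melt(id_vars="date", var_name="stock_id", value_name="settlement_price")
    )
    # Sort by date first so that consecutive rows (and hence row groups)
    # cover contiguous date ranges.
    return long.sort_values(["date", "stock_id"], ignore_index=True)


def write_long_parquet(long: pd.DataFrame, path: str, rows_per_group: int = 1_000_000) -> None:
    """Write the long table with pyarrow; rows_per_group is an assumed tuning knob."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pandas(long, preserve_index=False)
    pq.write_table(table, path, row_group_size=rows_per_group)


# Tiny stand-in for the real data: 5 dates x 3 stocks of random prices.
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=5, freq="D")
stocks = [f"S{i:04d}" for i in range(3)]
wide = pd.DataFrame(rng.random((len(dates), len(stocks))), index=dates, columns=stocks)

long = wide_to_long(wide)
print(long.head())
```

With this layout the file has three columns regardless of how many stocks there are, and selecting a subset of stocks becomes a filter on stock_id rather than a column projection.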

On Tue, 13 Jul 2021, 18:12 Joris Peeters, <[email protected]>
wrote:

> Hello,
>
> Sending to user@arrow, as that appears the best place for parquet
> questions atm, but feel free to redirect me.
>
> My objective is to store financial data in Parquet files, and read it out
> fast.
> The columns represent stocks (~= 10,000 or so), and each row is a date (~=
> 8000, e.g. 30 years). Values are e.g. settlement prices. I might want to
> use short row groups of e.g. a year each, for quickly getting to smaller
> date ranges, or query for a subset of columns (stocks).
>
> The appeal of parquet is that I could store all of this stuff in one file,
> and use the row-groups + column-select for slicing, rather than have a ton
> of smaller files etc. Would also integrate well with various ML tech.
>
> When doing some basic performance testing, with random data, I noticed
> that the performance for tables with many columns seems fairly poor. I've
> attached a little benchmark script - see output at the bottom.
>
> Stylised conclusions:
> - Reading/writing a "tall" (nrows >> ncols) dataframe is *much* more
> performant than a "wide" dataframe.
> - with the Arrow format (as opposed to parquet), the difference is much
> smaller.
> - Similar results on Windows & Linux, and for Arrow's parquet vs
> fastparquet.
>
> Is there something pathological about the parquet format that manifests in
> this regime, or is it rather that the code might not have been optimised
> for this? Aware that ncols >> nrows is not ideal, but was hoping for less
> of a cliff.
>
> Happy to dig in, but polling experts first.
>
> Best,
> -J
>
> >python benchmark.py
> 2021-07-13 16:31:54.786 INFO     Writing parquet to
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
> 2021-07-13 16:31:55.123 INFO     Written.
> 2021-07-13 16:31:55.123 INFO     Writing parquet to
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
> 2021-07-13 16:31:57.155 INFO     Written.
> 2021-07-13 16:31:57.155 INFO     Writing parquet to
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_fpq.pq
> [FastParquet]
> 2021-07-13 16:31:57.789 INFO     Written.
> 2021-07-13 16:31:57.790 INFO     Writing parquet to
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_fpq.pq
> [FastParquet]
> 2021-07-13 16:32:03.613 INFO     Written.
> 2021-07-13 16:32:03.613 INFO     Reading parquet from
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
> 2021-07-13 16:32:03.890 INFO     Read.
> 2021-07-13 16:32:03.899 INFO     Reading parquet from
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
> 2021-07-13 16:32:08.727 INFO     Read.
> 2021-07-13 16:32:08.737 INFO     Reading parquet from
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq
> [FastParquet]
> 2021-07-13 16:32:08.983 INFO     Read.
> 2021-07-13 16:32:08.991 INFO     Reading parquet from
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq
> [FastParquet]
> 2021-07-13 16:32:11.580 INFO     Read.
> 2021-07-13 16:32:11.589 INFO     Writing Arrow to
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
> 2021-07-13 16:32:13.057 INFO     Arrow written.
> 2021-07-13 16:32:13.078 INFO     Writing Arrow to
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
> 2021-07-13 16:32:13.425 INFO     Arrow written.
> 2021-07-13 16:32:13.434 INFO     Reading Arrow from
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
> 2021-07-13 16:32:13.620 INFO     Read.
> 2021-07-13 16:32:13.637 INFO     Reading Arrow from
> C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
> 2021-07-13 16:32:13.711 INFO     Read.
>
>
>
