An alternative representation would be to have a single settlement price column and add a stock_id column. Instead of a single row per time step, you would then have, say, 10K rows per time step - one for each stock.
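Roughly, something like the sketch below (untested, assuming pandas + pyarrow; the toy data, the file name "prices_tall.pq", the stock naming and the 252-days-per-year row-group sizing are just illustrative):

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # toy "wide" frame: one row per date, one column per stock
    dates = pd.date_range("2020-01-01", periods=100, freq="B")
    stocks = [f"S{i:05d}" for i in range(50)]
    wide = pd.DataFrame(np.random.rand(len(dates), len(stocks)),
                        index=dates, columns=stocks)

    # melt into a tall frame: one row per (date, stock_id) pair
    tall = (
        wide.reset_index()
            .melt(id_vars="index", var_name="stock_id",
                  value_name="settlement_price")
            .rename(columns={"index": "date"})
    )

    # one trading year x number of stocks per row group, so a date-range
    # query only has to touch a few row groups
    pq.write_table(
        pa.Table.from_pandas(tall, preserve_index=False),
        "prices_tall.pq",
        row_group_size=252 * len(stocks),
    )

    # read back only a couple of stocks via a filter on stock_id
    subset = pq.read_table(
        "prices_tall.pq",
        filters=[("stock_id", "in", ["S00001", "S00002"])],
    ).to_pandas()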
I think this will yield better performance.

On Tue, 13 Jul 2021, 18:12 Joris Peeters, <[email protected]> wrote:

> Hello,
>
> Sending to user@arrow, as that appears the best place for parquet
> questions atm, but feel free to redirect me.
>
> My objective is to store financial data in Parquet files, and read it
> out fast.
> The columns represent stocks (~10,000 or so), and each row is a date
> (~8,000, e.g. 30 years). Values are e.g. settlement prices. I might want
> to use short row groups of e.g. a year each, for quickly getting to
> smaller date ranges, or query for a subset of columns (stocks).
>
> The appeal of Parquet is that I could store all of this stuff in one
> file, and use the row groups + column select for slicing, rather than
> have a ton of smaller files etc. It would also integrate well with
> various ML tech.
>
> When doing some basic performance testing, with random data, I noticed
> that the performance for tables with many columns seems fairly poor.
> I've attached a little benchmark script - see output at the bottom.
>
> Stylised conclusions:
> - Reading/writing a "tall" (nrows >> ncols) dataframe is *much* more
>   performant than a "wide" dataframe.
> - With the Arrow format (as opposed to Parquet), the difference is much
>   smaller.
> - Similar results on Windows & Linux, and for Arrow's parquet vs
>   fastparquet.
>
> Is there something pathological about the Parquet format that manifests
> in this regime, or is it rather that the code might not have been
> optimised for this? Aware that ncols >> nrows is not ideal, but was
> hoping for less of a cliff.
>
> Happy to dig in, but polling experts first.
>
> Best,
> -J
>
> >python benchmark.py
> 2021-07-13 16:31:54.786 INFO Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
> 2021-07-13 16:31:55.123 INFO Written.
> 2021-07-13 16:31:55.123 INFO Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
> 2021-07-13 16:31:57.155 INFO Written.
> 2021-07-13 16:31:57.155 INFO Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_fpq.pq [FastParquet]
> 2021-07-13 16:31:57.789 INFO Written.
> 2021-07-13 16:31:57.790 INFO Writing parquet to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_fpq.pq [FastParquet]
> 2021-07-13 16:32:03.613 INFO Written.
> 2021-07-13 16:32:03.613 INFO Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [Arrow]
> 2021-07-13 16:32:03.890 INFO Read.
> 2021-07-13 16:32:03.899 INFO Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [Arrow]
> 2021-07-13 16:32:08.727 INFO Read.
> 2021-07-13 16:32:08.737 INFO Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall_apq.pq [FastParquet]
> 2021-07-13 16:32:08.983 INFO Read.
> 2021-07-13 16:32:08.991 INFO Reading parquet from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide_apq.pq [FastParquet]
> 2021-07-13 16:32:11.580 INFO Read.
> 2021-07-13 16:32:11.589 INFO Writing Arrow to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
> 2021-07-13 16:32:13.057 INFO Arrow written.
> 2021-07-13 16:32:13.078 INFO Writing Arrow to C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
> 2021-07-13 16:32:13.425 INFO Arrow written.
> 2021-07-13 16:32:13.434 INFO Reading Arrow from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_wide.arrows
> 2021-07-13 16:32:13.620 INFO Read.
> 2021-07-13 16:32:13.637 INFO Reading Arrow from C:\Users\jpeeter\AppData\Local\Temp\tmpstgfosrp\example_tall.arrows
> 2021-07-13 16:32:13.711 INFO Read.
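PS - for comparison, the wide layout the original mail describes (one column per stock, year-long row groups, column-subset reads) would look roughly like this. Again only a sketch, assuming pyarrow; the file name "prices_wide.pq", the stock naming and the column choices are made up for illustration:

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # toy wide frame: one row per date, one column per stock, with the
    # date kept as an explicit column so it can be projected too
    dates = pd.date_range("1991-01-01", periods=8000, freq="B")
    stocks = [f"S{i:05d}" for i in range(500)]
    wide = pd.DataFrame(np.random.rand(len(dates), len(stocks)), columns=stocks)
    wide.insert(0, "date", dates)

    # ~252 trading days per row group, i.e. roughly one year per group
    pq.write_table(
        pa.Table.from_pandas(wide, preserve_index=False),
        "prices_wide.pq",
        row_group_size=252,
    )

    # column projection: read back only a handful of stocks
    subset = pq.read_table(
        "prices_wide.pq",
        columns=["date", "S00001", "S00002"],
    ).to_pandas()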
