I would guess that because the 187 MB file is generated from 100,000+ files, and each input file is one row, you are hitting what is essentially a pathological case for Dataset/Arrow. Namely, the Dataset code isn't going to consolidate input batches together, so each input batch (= 1 row) makes it into the output file as its own record batch. There is some metadata per batch (= per row!), which inflates the storage requirements and completely inhibits compression. (You're effectively running LZ4 on each individual value in each row!)
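Not your exact data, of course, but here is a rough sketch of the failure mode I mean (assuming a recent pyarrow); the two file sizes are the thing to compare:

    import pyarrow as pa
    import pyarrow.feather as feather

    # Mimic one record batch per input file: 100,000 single-row batches.
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1.0]), pa.array(["x"])], names=["a", "b"]
    )
    fragmented = pa.Table.from_batches([batch] * 100_000)

    # Same data, consolidated into one chunk per column.
    combined = fragmented.combine_chunks()

    # Both use the default compression (LZ4, if available), but the
    # fragmented file carries metadata for every batch and compresses
    # each tiny batch separately.
    feather.write_feather(fragmented, "fragmented.arrow")
    feather.write_feather(combined, "combined.arrow")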
To check this, can you call combine_chunks() [1] after to_table() and before write_feather()? This will combine the record batches (well, technically, the chunks of the ChunkedArrays) into a single record batch, and I would guess you'll end up with something similar to the to_pandas().to_feather() case (without bouncing through Pandas). You could also check this with to_table().column(0).num_chunks (I'd expect it to be equal to to_table().num_rows).

[1]: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.combine_chunks
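Untested, and the paths are placeholders, but roughly something like this (assuming the per-simulation files are Feather/IPC):

    import pyarrow.dataset as ds
    import pyarrow.feather as feather

    dataset = ds.dataset("path/to/simulation/results", format="feather")
    table = dataset.to_table()

    # If the chunk count is close to the row count, the table is carrying
    # one tiny chunk per input file.
    print(table.num_rows, table.column(0).num_chunks)

    # Consolidate the chunks before writing the combined file.
    feather.write_feather(table.combine_chunks(), "combined.arrow")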
On Sun, Oct 9, 2022, at 21:12, Nikhil Makan wrote:
> Hi Will,
>
> For clarity, the simulation files do get written to Azure Blob Storage;
> however, to simplify things I have not tried to read the data directly
> from the cloud storage. I have downloaded it first and then loaded it
> into a dataset locally (which takes 12 mins). The arrow file for each
> simulation is produced through pandas: the raw results of each simulation
> get read in using Pandas and written to an arrow file after a simulation
> is complete. The 100,000+ files are therefore in the arrow format using
> LZ4 compression.
>
> The table retrieved from the dataset object is then written to an arrow
> file using feather.write_feather, which again by default uses LZ4.
>
> Do you know if there is any way to inspect the two files or tables to get
> more information about them? I can't understand how I have two arrow
> files, one of which is 187 MB and the other 5.5 MB, yet both with the
> same compression (LZ4), schema, shape and nbytes when read in.
>
> I can even read in the 187 MB arrow file and write it back to disk and it
> remains 187 MB, so there is definitely some property of the arrow table
> that I am not seeing. It is not necessarily a property of just the file.
>
> Kind regards
> Nikhil Makan
>
>
> On Mon, Oct 10, 2022 at 1:38 PM Will Jones <[email protected]> wrote:
>> Hi Nikhil,
>>
>>> To do this I have chosen to use the Dataset API which reads all the
>>> files (this takes around 12 mins)
>>
>> Given the number of files (100k+ right?) this does seem surprising,
>> especially if you are using a remote filesystem like Azure or S3.
>> Perhaps you should consider having your application record the file
>> paths of each simulation result; then the Datasets API doesn't need to
>> spend time resolving all the file paths.
>>
>>> The part I find strange is the size of the combined arrow file (187 MB)
>>> on disk and the time it takes to write and read that file (> 1 min).
>>
>> On the size, you should check which compression you are using. There are
>> some code paths that write uncompressed data by default and some code
>> paths that do compress. The Pandas to_feather() uses LZ4 by default;
>> it's possible the other way you are writing isn't. See IPC write
>> options [1].
>>
>> On the time to read, that seems very long for local, and even for remote
>> (Azure?).
>>
>> Best,
>>
>> Will Jones
>>
>> [1]
>> https://arrow.apache.org/docs/python/generated/pyarrow.ipc.IpcWriteOptions.html#pyarrow.ipc.IpcWriteOptions
>>
>>
>> On Sun, Oct 9, 2022 at 5:08 PM Nikhil Makan <[email protected]> wrote:
>>> Hi All,
>>>
>>> I have a situation where I am running a lot of simulations in parallel
>>> (100k+). The results of each simulation get written to an arrow file.
>>> A result file contains around 8 columns and 1 row.
>>>
>>> After all simulations have run, I want to be able to merge all these
>>> files into a single arrow file.
>>>
>>> * To do this I have chosen to use the Dataset API, which reads all the
>>>   files (this takes around 12 mins)
>>> * I then call to_table() on the dataset to bring it into memory
>>> * Finally I write the table to an arrow file using
>>>   feather.write_feather()
>>>
>>> Unfortunately I do not have a reproducible example, but hopefully the
>>> answer to this question won't require one.
>>>
>>> The part I find strange is the size of the combined arrow file (187 MB)
>>> on disk and the time it takes to write and read that file (> 1 min).
>>>
>>> Alternatively, I can convert the table to a pandas dataframe by calling
>>> to_pandas() and then use to_feather() on the dataframe. The resulting
>>> file on disk is now only 5.5 MB and naturally writes and opens in a
>>> flash.
>>>
>>> I feel like this has something to do with partitions and how the table
>>> is being structured coming from the dataset API, which is then
>>> preserved when writing to a file.
>>>
>>> Does anyone know why this would be the case, and how I can achieve the
>>> same outcome without the intermediate step of converting to pandas?
>>>
>>> Kind regards
>>> Nikhil Makan
>>>
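P.S. On the question above about inspecting the two files: a Feather V2 file is an Arrow IPC file, so you can open it with the IPC reader and look at how many record batches it contains, and you can also pin the compression explicitly when writing so both code paths are compared on equal footing. Untested, and the file names are placeholders:

    import pyarrow.feather as feather
    import pyarrow.ipc as ipc

    # Count the record batches in each file; I'd expect ~100,000 in the
    # 187 MB file and only a handful in the 5.5 MB one.
    for path in ("combined_from_dataset.arrow", "combined_via_pandas.arrow"):
        reader = ipc.open_file(path)
        print(path, reader.num_record_batches)

    # Rewrite with consolidated chunks and explicit LZ4 compression.
    table = feather.read_table("combined_from_dataset.arrow")
    feather.write_feather(table.combine_chunks(), "rewritten.arrow",
                          compression="lz4")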
