That's it! Thanks David. to_table().column(0).num_chunks was indeed equal to to_table().num_rows (one chunk per row), so combine_chunks() merged them all into a single chunk.
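For the archives, the fix in code looks roughly like this (a minimal sketch only; the "results/" path, format="feather" and the output file name are placeholders for my actual setup):

import pyarrow.dataset as ds
import pyarrow.feather as feather

# Read the 100k+ single-row Arrow files as one dataset.
dataset = ds.dataset("results/", format="feather")
table = dataset.to_table()

# David's diagnostic: one chunk per input file, i.e. one chunk per row.
print(table.column(0).num_chunks, table.num_rows)

# Consolidate everything into a single chunk before writing, so the
# per-batch metadata disappears and LZ4 can actually compress the columns.
feather.write_feather(table.combine_chunks(), "combined.arrow")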
However, I need to unpack the finer details of this a bit more:

- Given I have over 100k simulations producing these 'single-row' files, what is the best way to handle this afterwards, or is there a better way to store the results from the start? The simulations all run in parallel, so writing to the same file is challenging, and I don't want to go down the route of implementing a database of some sort. Would storing them as uncompressed Arrow files improve performance when reading them in with the Dataset API, or is there another, more efficient way to combine them?
- What is optimal for chunks? Naturally a chunk for each row is not efficient. Leaning on some knowledge I have of Spark, the idea there is to partition your data so it can be spread across the nodes in the cluster; too few partitions meant an under-utilised cluster. What should be done with chunks, and do we have control over setting this? (To make this concrete, I have put a rough sketch of what I have been experimenting with at the bottom of this mail.)

Kind regards
Nikhil Makan

On Mon, Oct 10, 2022 at 2:21 PM David Li <[email protected]> wrote:

> I would guess that because the 187 MB file is generated from 100,000+ files, and each input file is one row, you are hitting basically a pathological case for Dataset/Arrow. Namely, the Dataset code isn't going to consolidate input batches together, and so each input batch (= 1 row) is making it into the output file. And there's some metadata per batch (= per row!), inflating the storage requirements and completely inhibiting compression. (You're running LZ4 on each individual value in each row!)
>
> To check this, can you call combine_chunks() [1] after to_table, before write_feather? This will combine the record batches (well, technically, chunks of the ChunkedArrays) into a single record batch, and I would guess you'll end up with something similar to the to_pandas().to_feather() case (without bouncing through Pandas).
>
> You could also check this with to_table().column(0).num_chunks (I'd expect this would be equal to to_table().num_rows).
>
> [1]: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.combine_chunks
>
> On Sun, Oct 9, 2022, at 21:12, Nikhil Makan wrote:
>
> Hi Will,
>
> For clarity, the simulation files do get written to Azure Blob Storage; however, to simplify things I have not tried to read the data directly from the cloud storage. I have downloaded it first and then loaded it into a dataset locally (which takes 12 mins). The Arrow file for each simulation is produced through Pandas: the raw results of each simulation are read in using Pandas and written to an Arrow file after the simulation completes. The 100,000+ files are therefore in the Arrow format using LZ4 compression.
>
> The table retrieved from the dataset object is then written to an Arrow file using feather.write_feather, which again uses LZ4 by default.
>
> Do you know if there is any way to inspect the two files or tables to get more information about them? I can't understand how I have two Arrow files, one 187 MB and the other 5.5 MB, when both have the same compression (LZ4), schema, shape and nbytes when read in.
>
> I can even read in the 187 MB Arrow file and write it back to disk and it remains 187 MB, so there is definitely some property of the Arrow table that I am not seeing. It is not necessarily a property of just the file.
>
> Kind regards
> Nikhil Makan
>
> On Mon, Oct 10, 2022 at 1:38 PM Will Jones <[email protected]> wrote:
>
> Hi Nikhil,
>
> > To do this I have chosen to use the Dataset API which reads all the files (this takes around 12 mins)
>
> Given the number of files (100k+ right?) this does seem surprising, especially if you are using a remote filesystem like Azure or S3. Perhaps you should consider having your application record the file paths of each simulation result; then the Datasets API doesn't need to spend time resolving all the file paths.
>
> > The part I find strange is the size of the combined arrow file (187 mb) on disk and time it takes to write and read that file (> 1 min).
>
> On the size, you should check which compression you are using. There are some code paths that write compressed data by default and some that don't. The Pandas to_feather() uses LZ4 by default; it's possible the other way you are writing isn't compressing. See IPC write options [1].
>
> On the time to read, that seems very long for local, and even for remote (Azure?).
>
> Best,
> Will Jones
>
> [1]: https://arrow.apache.org/docs/python/generated/pyarrow.ipc.IpcWriteOptions.html#pyarrow.ipc.IpcWriteOptions
>
> On Sun, Oct 9, 2022 at 5:08 PM Nikhil Makan <[email protected]> wrote:
>
> Hi All,
>
> I have a situation where I am running a lot of simulations in parallel (100k+). The results of each simulation get written to an Arrow file. A result file contains around 8 columns and 1 row.
>
> After all simulations have run I want to be able to merge all these files into a single Arrow file.
>
> - To do this I have chosen to use the Dataset API, which reads all the files (this takes around 12 mins)
> - I then call to_table() on the dataset to bring it into memory
> - Finally I write the table to an Arrow file using feather.write_feather()
>
> Unfortunately I do not have a reproducible example, but hopefully the answer to this question won't require one.
>
> The part I find strange is the size of the combined Arrow file (187 MB) on disk and the time it takes to write and read that file (> 1 min).
>
> Alternatively, I can convert the table to a pandas DataFrame by calling to_pandas() and then use to_feather() on the DataFrame. The resulting file on disk is now only 5.5 MB and naturally writes and opens in a flash.
>
> I feel like this has something to do with partitions and how the table is being structured coming from the dataset API, which is then preserved when writing to a file.
>
> Does anyone know why this would be the case, and how to achieve the same outcome as with the intermediate step of converting to pandas?
>
> Kind regards
> Nikhil Makan
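P.S. The rough sketch of my chunking experiments mentioned above. I am not sure these are the right knobs, so treat it as a sketch only; the toy table and the 64,000 chunk size are arbitrary values I picked, not recommendations:

import pyarrow as pa
import pyarrow.feather as feather

# A stand-in table playing the role of the combined results.
table = pa.table({"x": list(range(1_000_000)), "y": [1.0] * 1_000_000})

# combine_chunks() leaves every column in a single chunk; to_batches()
# lets me re-split that into record batches of a chosen maximum size.
batches = table.combine_chunks().to_batches(max_chunksize=64_000)
print(len(batches))

# write_feather() also exposes a chunksize argument that controls the
# size of the record batches it writes to the file.
feather.write_feather(table.combine_chunks(), "combined.arrow", chunksize=64_000)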
