Hi Nikhil,

> To do this I have chosen to use the Dataset API which reads all the files
> (this takes around 12 mins)

Given the number of files (100k+, right?), this doesn't seem that
surprising, especially if you are using a remote filesystem like Azure or
S3. Perhaps you could have your application record the file path of each
simulation result as it is written; then the Dataset API doesn't need to
spend time discovering and resolving all the file paths.
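For example, a rough sketch (the manifest file name and the format argument
are assumptions on my part; adjust to however you track the outputs):

    import pyarrow.dataset as ds

    # Paths recorded by the simulation runner (e.g. in a manifest file),
    # so the Dataset API can skip listing the directory tree entirely.
    with open("results_manifest.txt") as f:
        paths = [line.strip() for line in f if line.strip()]

    dataset = ds.dataset(paths, format="feather")
    table = dataset.to_table()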

> The part I find strange is the size of the combined arrow file (187 mb) on
> disk and the time it takes to write and read that file (> 1min).

On the size, you should check which compression you are using. Some code
paths write uncompressed data by default and some compress by default.
Pandas' to_feather() uses LZ4 compression by default; it's possible the
path you are using to write the combined table isn't compressing at all.
See the IPC write options [1].
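For example, something like this (a sketch; the file names are made up)
makes the compression explicit when writing the combined table, either
through write_feather() or through the lower-level IPC writer:

    import pyarrow as pa
    import pyarrow.feather as feather

    # Request LZ4 explicitly instead of relying on per-code-path defaults.
    feather.write_feather(table, "combined.arrow", compression="lz4")

    # Equivalent via the IPC file writer and IpcWriteOptions [1].
    options = pa.ipc.IpcWriteOptions(compression="lz4")
    with pa.ipc.new_file("combined_ipc.arrow", table.schema,
                         options=options) as writer:
        writer.write_table(table)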

On the time to read, more than a minute seems very long for a local file,
and long even for a remote one (Azure?).
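If you want to compare directly, a quick check like this (the file names
are placeholders for your two outputs) would show whether the read time
tracks the file size:

    import time
    import pyarrow.feather as feather

    for path in ("combined_from_dataset.arrow", "combined_via_pandas.arrow"):
        start = time.perf_counter()
        table = feather.read_table(path, memory_map=True)
        elapsed = time.perf_counter() - start
        print(f"{path}: {table.num_rows} rows, "
              f"{table.num_columns} columns, {elapsed:.2f} s")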

Best,

Will Jones

[1]
https://arrow.apache.org/docs/python/generated/pyarrow.ipc.IpcWriteOptions.html#pyarrow.ipc.IpcWriteOptions


On Sun, Oct 9, 2022 at 5:08 PM Nikhil Makan <[email protected]> wrote:

> Hi All,
>
> I have a situation where I am running a lot of simulations in parallel
> (100k+). The results of each simulation get written to an arrow file. A
> result file contains around 8 columns and 1 row.
>
> After all simulations have run I want to be able to merge all these files
> into a single arrow file.
>
>    - To do this I have chosen to use the Dataset API which reads all the
>    files (this takes around 12 mins)
>    - I then call to_table() on the dataset to bring it into memory
>    - Finally I write the table to an arrow file using
>    feather.write_feather()
>
> Unfortunately I do not have a reproducible example, but hopefully the
> answer to this question won't require one.
>
> The part I find strange is the size of the combined arrow file (187 mb) on
> disk and the time it takes to write and read that file (> 1min).
>
> Alternatively I can convert the table to a pandas dataframe by calling
> to_pandas() and then use to_feather() on the dataframe. The resulting file
> on disk is now only 5.5 mb and naturally writes and opens in a flash.
>
> I feel like this has something to do with partitions and how the table
> coming from the Dataset API is structured, which is then preserved when
> writing to a file.
>
> Does anyone know why this would be the case, and how to achieve the same
> outcome without the intermediate step of converting to pandas?
>
> Kind regards
> Nikhil Makan
>
>
