I would guess that because the 187 MB file is generated from 100,000+ files, 
and each input file is one row, you are hitting what is basically a 
pathological case for Dataset/Arrow. Namely, the Dataset code isn't going to 
consolidate input batches together, so each input batch (= 1 row) makes it into 
the output file as its own record batch. There is some metadata per batch 
(= per row!), inflating the storage requirements and completely inhibiting 
compression. (You're running LZ4 on each individual value in each row!)
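
Here's a rough standalone sketch (synthetic data, not your schema) of the 
effect of one-row batches on an LZ4-compressed Feather file:

    import os
    import pyarrow as pa
    import pyarrow.feather as feather

    # 100,000 one-row batches, mimicking one record batch per input file.
    batches = [pa.RecordBatch.from_pydict({"x": [float(i)], "y": [i]})
               for i in range(100_000)]
    many_chunks = pa.Table.from_batches(batches)
    one_chunk = many_chunks.combine_chunks()

    feather.write_feather(many_chunks, "many_chunks.arrow", compression="lz4")
    feather.write_feather(one_chunk, "one_chunk.arrow", compression="lz4")

    # Per-batch metadata plus per-batch compression make the first file far
    # larger than the second, even though the data is identical.
    print(os.path.getsize("many_chunks.arrow"), os.path.getsize("one_chunk.arrow"))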

To check this, can you call combine_chunks() [1] after to_table() and before 
write_feather()? This will combine the record batches (well, technically, the 
chunks of the ChunkedArrays) into a single record batch, and I would guess 
you'll end up with something similar to the to_pandas().to_feather() case 
(without bouncing through Pandas). 

You could also check this with to_table().column(0).num_chunks (I'd expect it 
to be equal to to_table().num_rows).
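
For example, something like this sketch (assuming the per-simulation files are 
Feather/IPC; "results_dir" and the output path are just placeholders):

    import pyarrow.dataset as ds
    import pyarrow.feather as feather

    table = ds.dataset("results_dir", format="feather").to_table()

    # If each input file ended up as its own chunk, these two numbers match.
    print(table.column(0).num_chunks, table.num_rows)

    # Consolidate all chunks into a single batch per column before writing.
    combined = table.combine_chunks()
    feather.write_feather(combined, "combined.arrow", compression="lz4")

You can also inspect a Feather V2 file directly: 
pyarrow.ipc.open_file(path).num_record_batches should tell you how many record 
batches made it into the file.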

[1]: 
https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.combine_chunks

On Sun, Oct 9, 2022, at 21:12, Nikhil Makan wrote:
> Hi Will,
> 
> For clarity, the simulation files do get written to Azure Blob Storage; 
> however, to simplify things I have not tried to read the data directly from 
> the cloud storage. I have downloaded it first and then loaded it into a 
> dataset locally (which takes 12 mins). The arrow file for each simulation is 
> produced through pandas: the raw results of each simulation get read in using 
> Pandas and written to an arrow file after the simulation is complete. The 
> 100,000+ files are therefore in the arrow format using LZ4 compression.
> 
> The table retrieved from the dataset object is then written to an arrow file 
> using feather.write_feather, which again uses LZ4 by default.
> 
> Do you know if there is any way to inspect the two files or tables to get 
> more information about them? I can't understand how I have two arrow files, 
> one 187 MB and the other 5.5 MB, yet both with the same LZ4 compression, 
> schema, shape and nbytes when read in.
> 
> I can even read the 187 MB arrow file back in and write it back to disk and 
> it remains 187 MB, so there is definitely some property of the arrow table 
> that I am not seeing; it is not necessarily a property of just the file.
> 
> Kind regards
> Nikhil Makan
> 
> 
> On Mon, Oct 10, 2022 at 1:38 PM Will Jones <[email protected]> wrote:
>> Hi Nikhil,
>> 
>>> To do this I have chosen to use the Dataset API which reads all the files 
>>> (this takes around 12 mins)
>> 
>> Given the number of files (100k+ right?) this does seem surprising. 
>> Especially if you are using a remote filesystem like Azure or S3. Perhaps 
>> you should consider having your application record the file paths of each 
>> simulation result; then the Datasets API doesn't need to spend time 
>> resolving all the file paths.
>> 
>>> The part I find strange is the size of the combined arrow file (187 mb) on 
>>> disk and time it takes to write and read that file (> 1min).
>> 
>> On the size, you should check which compression you are using. Some code 
>> paths write uncompressed data by default and some compress by default. The 
>> Pandas to_feather() uses LZ4 by default; it's possible the other way you are 
>> writing isn't compressing. See the IPC write options [1].
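>> 
>> For example (a rough sketch; "table" and the output path are placeholders), 
>> if you're going through feather.write_feather you can set the compression 
>> explicitly rather than relying on the default:
>> 
>>     import pyarrow.feather as feather
>> 
>>     # compression can be "lz4", "zstd", or "uncompressed"
>>     feather.write_feather(table, "combined.arrow", compression="lz4")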
>> 
>> On the time to read, that seems very long for local, and even for remote 
>> (Azure?).
>> 
>> Best,
>> 
>> Will Jones
>> 
>> [1] 
>> https://arrow.apache.org/docs/python/generated/pyarrow.ipc.IpcWriteOptions.html#pyarrow.ipc.IpcWriteOptions
>>  
>> 
>> On Sun, Oct 9, 2022 at 5:08 PM Nikhil Makan <[email protected]> wrote:
>>> Hi All,
>>> 
>>> I have a situation where I am running a lot of simulations in parallel 
>>> (100k+). The results of each simulation get written to an arrow file. A 
>>> result file contains around 8 columns and 1 row.
>>> 
>>> After all simulations have run I want to merge all these files into a 
>>> single arrow file.
>>>  * To do this I have chosen to use the Dataset API which reads all the 
>>> files (this takes around 12 mins)
>>>  * I then call to_table() on the dataset to bring it into memory
>>>  * Finally I write the table to an arrow file using feather.write_feather()
>>> Unfortunately I do not have a reproducible example, but hopefully the 
>>> answer to this question won't require one.
>>> 
>>> The part I find strange is the size of the combined arrow file (187 mb) on 
>>> disk and time it takes to write and read that file (> 1min).
>>> 
>>> Alternatively I can convert the table to a pandas dataframe by calling 
>>> to_pandas() and then use to_feather() on the dataframe. The resulting file 
>>> on disk is now only 5.5 MB and naturally writes and opens in a flash.
>>> 
>>> I feel like this has something to do with partitions and how the table 
>>> coming from the Dataset API is structured, which is then preserved when 
>>> writing to a file.
>>> 
>>> Does anyone know why this would be the case, and how to achieve the same 
>>> outcome as the version with the intermediate step of converting to pandas?
>>> 
>>> Kind regards
>>> Nikhil Makan
>>> 
