Use case:

   - process rows in Java from many Feather files generated from pandas
   DataFrames, and convert them to another format
   - each file contains ~2000+ float8 columns and 200,000 rows
   - we do not need the whole file in memory at once
   - we need to process files in parallel

Problem:

   - Reading these files takes a lot of memory, even when reading just a
   single file.
   - Reading one such file with ArrowFileReader fails if directMemorySize
   is set below 2.5 GB (with swap set to 0).
   - Reading one such file with the Dataset API fails if system memory is
   below 7 GB (with swap set to 0).
   - Also noted that saving this file from pandas with a call to
   to_feather fails when RAM is below 9 GB (swap is 0), even though the
   DataFrame itself is only 3 GB.


Is there a way to read Feather files that does not take so much memory?
It seems that each file loads in batches of 4, yet the memory used is far
more than a single batch's worth.
I can provide sample code if that helps.
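The reading loop I have in mind looks roughly like this. It is a
self-contained sketch, assuming arrow-vector and arrow-memory-netty are on
the classpath; the tiny in-memory file, the two demo batches, and the 64 MB
allocator cap are made up for illustration, not from my real code:

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.util.List;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.Float8Vector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.ArrowFileWriter;
import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;
import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel;

public class FeatherBatchDemo {

    /** Writes a tiny two-batch IPC file in memory, then reads it back one
     *  record batch at a time; returns the row count of each batch read. */
    public static long[] writeAndReadBatches() throws Exception {
        Schema schema = new Schema(List.of(Field.nullable(
                "x", new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE))));
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        // Write two record batches of 3 and 2 rows (stand-in for a real file).
        try (BufferAllocator alloc = new RootAllocator();
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, alloc);
             ArrowFileWriter writer =
                     new ArrowFileWriter(root, null, Channels.newChannel(out))) {
            writer.start();
            Float8Vector x = (Float8Vector) root.getVector("x");
            for (int rows : new int[] {3, 2}) {
                x.allocateNew(rows);
                for (int i = 0; i < rows; i++) {
                    x.setSafe(i, i * 1.5);
                }
                root.setRowCount(rows);
                writer.writeBatch();
            }
            writer.end();
        }

        // Read back: each loadNextBatch() call replaces the previous batch
        // in the reused VectorSchemaRoot, so ideally only ~one batch should
        // be resident at a time (the 64 MB cap is an arbitrary demo value).
        try (BufferAllocator alloc = new RootAllocator(64L * 1024 * 1024);
             ArrowFileReader reader = new ArrowFileReader(
                     new ByteArrayReadableSeekableByteChannel(out.toByteArray()),
                     alloc)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            long[] counts = new long[reader.getRecordBlocks().size()];
            int b = 0;
            while (reader.loadNextBatch()) {
                counts[b++] = root.getRowCount();
            }
            return counts;
        }
    }

    public static void main(String[] args) throws Exception {
        for (long c : writeAndReadBatches()) {
            System.out.println("batch rows: " + c);
        }
    }
}
```

With the real files, the observed memory use is far higher than this
batch-at-a-time pattern would suggest.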

Thank you
