Use case:
- Process rows in Java from many Feather files generated from pandas DataFrames and convert them to another format.
- Files contain ~2000+ float8 columns and 200,000 rows each.
- We do not need to have the whole file in memory.
- We need to process files in parallel.
Problem:
- Reading these files takes a lot of memory, even when reading just one file.
- Reading one such file with ArrowFileReader fails if directMemorySize is set below 2.5 GB (with swap set to 0).
- Reading one such file with the Dataset API fails if system memory is below 7 GB (with swap set to 0).
- Saving this file from pandas with a call to to_feather also fails when RAM is below 9 GB (swap is 0), even though the DataFrame itself is only 3 GB.

Is there a way to read Feather files that would not take so much space? It seems that each file loads in batches of 4, but the memory taken is much more than one batch. I can provide the sample code if that helps. Thank you
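For context, this is roughly the shape of our reading loop (a minimal sketch, not our exact code; the file path argument and the unbounded allocator limit are placeholders). It uses the standard Arrow Java IPC API: one VectorSchemaRoot is reused, and loadNextBatch() replaces the previous batch's buffers, so we expected memory usage to stay near one batch's size:

```java
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowFileReader;
import org.apache.arrow.vector.ipc.SeekableReadChannel;

public class FeatherBatchReader {
    public static void main(String[] args) throws Exception {
        // Allocator limit is a placeholder; in practice we hit the
        // directMemorySize ceiling long before any allocator limit.
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             FileChannel channel =
                     FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ);
             ArrowFileReader reader =
                     new ArrowFileReader(new SeekableReadChannel(channel), allocator)) {

            // The root is reused across batches; each loadNextBatch()
            // call releases the previous batch's buffers and fills new ones.
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            while (reader.loadNextBatch()) {
                // Process root.getRowCount() rows here, then continue
                // to the next batch without keeping references to vectors.
                System.out.println("batch rows: " + root.getRowCount());
            }
        }
    }
}
```

Even with this batch-at-a-time pattern, direct memory usage climbs far beyond the size of a single batch.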
