Re: PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger than memory parquet files)

2020-09-24 Thread Antoine Pitrou
For the record: this has been opened on JIRA as ARROW-10052. Here is the analysis I posted there (pasted): I'm not sure there's anything surprising. Running this thing a bit (in debug mode!), I see that RSS usage grows by 500-1000 bytes for each column chunk (that is, each column in a row group) …
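[Editorial sketch, not code from the thread: a minimal example of the incremental-write pattern being discussed. The file name, schema, and row contents are made up for illustration.]

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

# Write many small row groups into one file without holding the whole
# dataset in memory. Each write_table() call emits at least one row group,
# and the writer accumulates per-column-chunk metadata until close(),
# which is consistent with the slow per-column-chunk RSS growth described above.
with pq.ParquetWriter("out.parquet", schema) as writer:
    for i in range(1000):
        batch = pa.table({"id": [i], "value": [float(i)]}, schema=schema)
        writer.write_table(batch)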

RE: PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger than memory parquet files)

2020-09-22 Thread Lee, David
Hi, I’ve tried both with little success. I made a JIRA: https://issues.apache.org/jira/browse/ARROW-10052 …

Re: PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger than memory parquet files)

2020-09-21 Thread Niklas B
Hi, I’ve tried both with little success. I made a JIRA: https://issues.apache.org/jira/browse/ARROW-10052. Looking at it now that I've made a minimal example, I see something I didn't see/realize before, which is that while the memory usage is i…

Re: PyArrow: Incrementally using ParquetWriter without keeping entire dataset in memory (larger than memory parquet files)

2020-09-19 Thread Micah Kornfield
Hi Niklas, Two suggestions: * Try adjusting row_group_size on write_table [1] to a smaller-than-default value. If I read the code correctly, this is currently 64 million rows [2], which seems potentially too high as a default (I'll open a JIRA about this). * If this is on Linux/Mac, try setting the …
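[Editorial sketch of the first suggestion, not Micah's code: pass an explicit row_group_size to write_table so each row group stays well below the library default. The file name and the 64_000 value are illustrative assumptions, not recommendations from the thread.]

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
table = pa.table(
    {"id": list(range(1_000_000)), "value": [0.0] * 1_000_000},
    schema=schema,
)

# row_group_size caps the number of rows per row group, so a large table
# is split into several smaller row groups instead of one huge one.
with pq.ParquetWriter("smaller_row_groups.parquet", schema) as writer:
    writer.write_table(table, row_group_size=64_000)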