GitHub user alexeyroytman created a discussion: [C++][Parquet] High memory usage on large Parquet file reading with Project and Filter, with Scanner::Scan()
Hello. I'm working on a proof of concept that uses Arrow C++ to read a large Parquet file with a partial projection and a filter. (I've done the same earlier in Java, with the Apache Parquet project.) I followed an example that uses `ScannerBuilder::Project()` and `ScannerBuilder::Filter()`. In my case, the process uses over 300% CPU, grows beyond 30 GB of virtual memory, starts thrashing, and at some point is killed by the OOM killer.

Some background on the file contents:

1. For now, I create the Parquet file myself and fully control it.
2. Its schema consists of D0..D9 (10 non-nullable strings stored as BINARY) and M0..M4 (5 nullable DOUBLEs).
3. Each D(i) is randomly selected from a relatively small set of strings (up to thousands).
4. M0 has a random value; the other M(j)s are null.
5. For D5 and D6, I enable dictionary encoding, statistics, and a Bloom filter.
6. For the M(j)s, I disable dictionary encoding, statistics, and Bloom filters.
7. The rows are not sorted.
8. The Parquet file is uncompressed.
9. The Parquet file contains 1 billion rows; its size is 9.2 GiB. (Creating the file takes ~50 minutes.)
10. The original comma-separated file in tabular form is 96.5 GiB (8.1 GiB when compressed with `gzip -9`).

The reading context:

1. I project all D(i) columns plus M0 and M1.
2. I filter on: `(D5==s51 || D5==s52) && (D6==s61 || D6==s62)`.
3. The expected number of matching rows is ~77 million (out of 1 billion).
4. I use `ScannerBuilder::Project()`, `ScannerBuilder::Filter()`, and then `Scanner::Scan()`.
5. Inside `Scanner::Scan()`, some batches have 0 rows; I'm not sure whether this is relevant.

I traced the large memory allocations (500 MiB and above); all of them go through `PoolBuffer::Resize()` and/or `PoolBuffer::Reserve()`. The stack traces show:

1. `parquet::TypedDecoder<parquet::PhysicalType<(parquet::Type::type)6> >::DecodeArrowNonNull(int, parquet::EncodingTraits<parquet::PhysicalType<(parquet::Type::type)6> >::Accumulator*)`
2. `TransferColumnData(parquet::internal::RecordReader*, std::unique_ptr<parquet::ColumnChunkMetaData, std::default_delete<parquet::ColumnChunkMetaData> >, std::shared_ptr<arrow::Field> const&, parquet::ColumnDescriptor const*, parquet::arrow::ReaderContext const*, std::shared_ptr<arrow::ChunkedArray>*)`

GitHub link: https://github.com/apache/arrow/discussions/49976
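For context, here is a minimal sketch of how such a scan might be set up with the Arrow C++ Dataset API, including the readahead and batch-size knobs that bound how much decoded data is held in flight at once (these are the usual levers when a scan over a large file consumes too much memory). The file path and the literal values `s51`/`s52`/`s61`/`s62` are placeholders, not taken from the original post; this is a sketch under those assumptions, not the poster's actual code.

```cpp
// Sketch: scan a single local Parquet file with projection + filter,
// capping in-flight memory via BatchSize/BatchReadahead/FragmentReadahead.
// Requires Arrow C++ built with the dataset and filesystem modules.
#include <arrow/api.h>
#include <arrow/compute/expression.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>
#include <iostream>

namespace ds = arrow::dataset;
namespace cp = arrow::compute;

arrow::Status ScanFile(const std::string& path) {  // path is a placeholder
  auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::FileSystemFactoryOptions options;
  ARROW_ASSIGN_OR_RAISE(
      auto factory,
      ds::FileSystemDatasetFactory::Make(fs, {path}, format, options));
  ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());

  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  ARROW_RETURN_NOT_OK(builder->Project(
      {"D0", "D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9",
       "M0", "M1"}));
  // (D5==s51 || D5==s52) && (D6==s61 || D6==s62); literals are placeholders.
  ARROW_RETURN_NOT_OK(builder->Filter(cp::and_(
      cp::or_(cp::equal(cp::field_ref("D5"), cp::literal("s51")),
              cp::equal(cp::field_ref("D5"), cp::literal("s52"))),
      cp::or_(cp::equal(cp::field_ref("D6"), cp::literal("s61")),
              cp::equal(cp::field_ref("D6"), cp::literal("s62"))))));
  // Bound how much decoded data is buffered at once.
  ARROW_RETURN_NOT_OK(builder->BatchSize(64 * 1024));  // rows per batch
  ARROW_RETURN_NOT_OK(builder->BatchReadahead(4));     // batches buffered
  ARROW_RETURN_NOT_OK(builder->FragmentReadahead(1));  // files in flight
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());

  int64_t rows = 0;
  ARROW_ASSIGN_OR_RAISE(auto batches, scanner->ScanBatches());
  for (arrow::Result<ds::TaggedRecordBatch> maybe_batch : batches) {
    ARROW_ASSIGN_OR_RAISE(auto tagged, maybe_batch);
    rows += tagged.record_batch->num_rows();  // 0-row batches are normal
  }
  std::cout << "matched rows: " << rows << std::endl;
  return arrow::Status::OK();
}
```

`ScanBatches()` pulls batches through a bounded pipeline, so lowering `BatchSize` and the readahead values trades throughput for a smaller peak footprint; the older `Scanner::Scan()` API mentioned in the post predates some of these knobs.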
