Samarth Jain created PARQUET-1641:
-------------------------------------

             Summary: Parquet pages for different columns cannot be read in parallel
                 Key: PARQUET-1641
                 URL: https://issues.apache.org/jira/browse/PARQUET-1641
             Project: Parquet
          Issue Type: Improvement
            Reporter: Samarth Jain


All ColumnChunkPageReader instances use the same decompressor. 

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L1286]

<code>
BytesInputDecompressor decompressor =
    options.getCodecFactory().getDecompressor(descriptor.metadata.getCodec());
return new ColumnChunkPageReader(decompressor, pagesInChunk, dictionaryPage);
</code>
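
To make the sharing concrete, here is a minimal sketch (the class name, Configuration, and page-size argument are mine, assuming CodecFactory's public (Configuration, pageSize) constructor) showing that repeated lookups hand back the identical cached object:

<code>
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.CodecFactory;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class DecompressorSharingDemo {
  public static void main(String[] args) {
    // Arbitrary page size; only the caching behavior matters here.
    CodecFactory factory = new CodecFactory(new Configuration(), 64 * 1024);
    CodecFactory.BytesDecompressor d1 = factory.getDecompressor(CompressionCodecName.GZIP);
    CodecFactory.BytesDecompressor d2 = factory.getDecompressor(CompressionCodecName.GZIP);
    // Every ColumnChunkPageReader for a GZIP column receives this one object.
    System.out.println(d1 == d2); // true
  }
}
</code>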


The CodecFactory caches the decompressor for each codec type, returning the
same instance on every getDecompressor(codecName) call. See the caching
happening here:

[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L197]

<code>
@Override
public BytesDecompressor getDecompressor(CompressionCodecName codecName) {
  BytesDecompressor decomp = decompressors.get(codecName);
  if (decomp == null) {
    decomp = createDecompressor(codecName);
    decompressors.put(codecName, decomp);
  }
  return decomp;
}
</code>
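
A hypothetical repro sketch of the failure mode (class name, thread counts, and payload are mine): several threads decompressing through the shared instance, as parallel page reads for different columns would, race on the stateful Hadoop Decompressor held inside it.

<code>
import java.io.ByteArrayOutputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.zip.GZIPOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.hadoop.CodecFactory;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class SharedDecompressorRepro {
  public static void main(String[] args) throws Exception {
    CodecFactory factory = new CodecFactory(new Configuration(), 64 * 1024);
    // The same cached instance every ColumnChunkPageReader would receive.
    CodecFactory.BytesDecompressor shared =
        factory.getDecompressor(CompressionCodecName.GZIP);

    // A small gzip payload standing in for a compressed page.
    byte[] raw = new byte[4096];
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(raw);
    }
    byte[] page = bos.toByteArray();

    // Hammer the shared decompressor from several threads, as parallel
    // page reads for different columns would. Since the underlying Hadoop
    // Decompressor is stateful, expect corrupt output or IOExceptions
    // rather than clean decompression.
    ExecutorService pool = Executors.newFixedThreadPool(8);
    for (int t = 0; t < 8; t++) {
      pool.submit(() -> {
        for (int i = 0; i < 10_000; i++) {
          try {
            shared.decompress(BytesInput.from(page), raw.length).toByteArray();
          } catch (Exception e) {
            System.err.println("race surfaced: " + e);
          }
        }
      });
    }
    pool.shutdown();
  }
}
</code>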


If multiple threads try to read pages belonging to different columns, they run
into thread-safety issues. This prevents applications from increasing read
throughput by parallelizing page reads across columns.
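
One possible direction, offered only as a sketch against the current CodecFactory (it reuses the existing createDecompressor(codecName) helper and would need java.util.Map/HashMap imports), is to key the cache per thread so concurrent readers never share the stateful decompressor:

<code>
// Hypothetical change inside CodecFactory: cache decompressors per thread
// instead of per codec, so concurrent page readers never observe each
// other's in-flight state. Trades a little memory for thread safety.
private final ThreadLocal<Map<CompressionCodecName, BytesDecompressor>> decompressors =
    ThreadLocal.withInitial(HashMap::new);

@Override
public BytesDecompressor getDecompressor(CompressionCodecName codecName) {
  return decompressors.get().computeIfAbsent(codecName, this::createDecompressor);
}
</code>

An alternative with the same effect would be for ParquetFileReader to request a fresh, non-cached decompressor per ColumnChunkPageReader and release it once the chunk's pages are consumed.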



--
This message was sent by Atlassian Jira
(v8.3.2#803003)