[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-04 Wes McKinney (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569210#comment-16569210 ]

Wes McKinney commented on PARQUET-1370:
---

I have opened some issues related to buffering / concurrent IO in C++, e.g. 
https://issues.apache.org/jira/browse/ARROW-501

[~rgruener] As of 0.10.0, the pyarrow file handles implement the RawIOBase interface.

I don't think it would be too difficult to add a buffering reader to the Parquet 
hot path with a configurable buffer size. We already have a 
{{BufferedInputStream}}, which may help.
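
For illustration, here is a minimal Python sketch of what such a buffering layer 
does (a made-up class, not the actual {{BufferedInputStream}} API): many small 
page reads get served out of one larger scan.
{code:python}
class MiniBufferedReader:
    """Illustrative only: serve many small reads from one large scan."""
    def __init__(self, raw, buffer_size=512 * 1024):
        self._raw = raw                  # underlying file-like object
        self._buffer_size = buffer_size  # configurable, per the comment above
        self._buf = b""
        self._pos = 0                    # read cursor inside self._buf

    def read(self, n):
        if self._pos + n > len(self._buf):
            # Refill with one large filesystem read instead of many small ones.
            leftover = self._buf[self._pos:]
            self._buf = leftover + self._raw.read(max(n, self._buffer_size))
            self._pos = 0
        out = self._buf[self._pos:self._pos + n]
        self._pos += n
        return out{code}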

> Read consecutive column chunks in a single scan
> ---
>
> Key: PARQUET-1370
> URL: https://issues.apache.org/jira/browse/PARQUET-1370
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Robert Gruener
>Priority: Major
>
> Currently parquet-cpp issues a filesystem read for every single data page; 
> see 
> [https://github.com/apache/parquet-cpp/blob/a0d1669cf67b055cd7b724dea04886a0ded53c8f/src/parquet/column_reader.cc#L181]
> For remote filesystems this can be very inefficient when reading many small 
> columns. The Java implementation already reads consecutive column chunks (and 
> the resulting pages) in a single scan; see 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L786]
>  
> This might be a bit difficult to do, as it would require changing a lot of 
> the code structure, but it would certainly be valuable for workloads concerned 
> with optimal read performance.
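
As a rough sketch of the coalescing idea described above (a hypothetical 
{{coalesce_ranges}} helper, not actual parquet-cpp or parquet-mr code): merge 
column chunk byte ranges that touch, so each merged range costs one read.
{code:python}
def coalesce_ranges(chunks, max_gap=0):
    """Merge (offset, length) byte ranges that touch or sit within
    max_gap bytes of each other, so each merged range is one read."""
    merged = []
    for offset, length in sorted(chunks):
        prev = merged[-1] if merged else None
        if prev and offset <= prev[0] + prev[1] + max_gap:
            merged[-1] = (prev[0], max(prev[1], offset + length - prev[0]))
        else:
            merged.append((offset, length))
    return merged

# Three consecutive 1 MB column chunks collapse into a single 3 MB scan:
print(coalesce_ranges([(0, 2**20), (2**20, 2**20), (2 * 2**20, 2**20)]))
# -> [(0, 3145728)]{code}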





[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Robert Gruener (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568428#comment-16568428 ]

Robert Gruener commented on PARQUET-1370:
-

I see. In my case I am using the file handle from the pyarrow HDFS class, which 
does not seem to implement the RawIOBase API. It should be pretty easy to work 
around, though.
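
For example, a thin {{io.RawIOBase}} shim (hypothetical, and untested against 
the actual pyarrow HDFS class) would let {{io.BufferedReader}} accept such a 
handle:
{code:python}
import io

class RawIOAdapter(io.RawIOBase):
    """Hypothetical shim: expose a read/seek/tell file handle as RawIOBase."""
    def __init__(self, f):
        self._f = f

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, pos, whence=io.SEEK_SET):
        self._f.seek(pos, whence)
        return self._f.tell()

    def tell(self):
        return self._f.tell()

    def readinto(self, b):
        data = self._f.read(len(b))
        b[:len(data)] = data
        return len(data)

# buffered = io.BufferedReader(RawIOAdapter(hdfs_file), 512 * 1024){code}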






[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Uwe L. Korn (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568403#comment-16568403 ]

Uwe L. Korn commented on PARQUET-1370:
--

I'm doing the same; my code looks as follows:
{code:python}
import io
from pyarrow.parquet import ParquetFile

reader = …some file handle…  # any io.RawIOBase file handle
reader = io.BufferedReader(reader, 512 * 1024)  # 512 KiB buffer
parquet_file = ParquetFile(reader){code}
This was so simple that I thought it might not be relevant for now. Having a 
general C++ implementation of {{io.BufferedReader}} in Arrow C++ might be a 
simpler approach to our problem. Using {{io.BufferedReader}} probably involves 
some additional memory copies and overhead, as we have to switch between Python 
and C++ often.

(In my case, the file handle comes from [https://github.com/mbr/simplekv] / 
[https://github.com/blue-yonder/storefact].)

 






[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Robert Gruener (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568384#comment-16568384 ]

Robert Gruener commented on PARQUET-1370:
-

Thanks for the tip [~xhochy]! We are reading Parquet through pyarrow, though, 
so I don't think it would be as straightforward as adding that buffer. Unless 
there is something I am not seeing?

Either way, it would be nice not to have to worry about this as a user of the 
library.






[jira] [Commented] (PARQUET-1370) Read consecutive column chunks in a single scan

2018-08-03 Uwe L. Korn (JIRA)


[ https://issues.apache.org/jira/browse/PARQUET-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568368#comment-16568368 ]

Uwe L. Korn commented on PARQUET-1370:
--

[~rgruener] I was also plagued by this issue, but I wrapped my file handle in 
[https://docs.python.org/3/library/io.html#io.BufferedReader] and that gave me 
sufficient performance. This was especially useful for me as I'm working with 
object stores like S3 or Azure Blob Storage, where consecutive reads of 40 KB 
or 512 KB make nearly no difference and the HTTP request overhead is the main 
bottleneck.
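
To make the request-overhead point concrete, a back-of-envelope model; the page 
count, sizes, and latency here are assumed numbers for illustration, not 
measurements:
{code:python}
# Hypothetical: 1000 pages of 40 KB each, 50 ms latency per remote request.
pages, page_size = 1000, 40 * 1024
buffer_size = 512 * 1024
latency = 0.05  # seconds per request (assumed)

naive = pages * latency                     # one request per page
requests = pages * page_size / buffer_size  # coalesced 512 KB reads
buffered = requests * latency

print(f"per-page: {naive:.0f}s, buffered: {buffered:.1f}s")
# per-page: 50s, buffered: 3.9s{code}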




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)