[jira] [Commented] (DRILL-4800) Improve parquet reader performance

2018-09-24 Thread Bridget Bevens (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626663#comment-16626663
 ] 

Bridget Bevens commented on DRILL-4800:
---

Docs were updated here: 
https://drill.apache.org/docs/asynchronous-parquet-reader/ 

> Improve parquet reader performance
> --
>
> Key: DRILL-4800
> URL: https://issues.apache.org/jira/browse/DRILL-4800
> Project: Apache Drill
> Issue Type: Improvement
> Reporter: Parth Chandra
> Assignee: Parth Chandra
> Priority: Major
> Labels: doc-complete
> Fix For: 1.9.0
>
>
> Reported by a user in the field:
> We're generally getting read speeds of about 100-150 MB/s/node on the PARQUET
> scan operator. This seems a little low given the number of drives on the node
> (24). We're looking for options to improve the performance of this
> operator, as most of our queries are I/O bound.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-11-03 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633538#comment-15633538
 ] 

Parth Chandra commented on DRILL-4800:
--

Yes. Updated the gist. 



[jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-11-03 Thread Jinfeng Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15632981#comment-15632981
 ] 

Jinfeng Ni commented on DRILL-4800:
---

In the configuration parameter doc 
https://github.com/parthchandra/drill/wiki/Drill-Parquet-Scan-Configuration,  
the default value of `store.parquet.reader.columnreader.async` would be false, 
right?
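
For reference, a session-level toggle for this option could be generated as in
the sketch below. This is an illustration only: the option name comes from the
comment above, the ALTER SESSION SET form is Drill's usual syntax for session
options, and the helper name alter_session_sql is hypothetical. The sketch does
not assert what the shipped default is.

```python
def alter_session_sql(option: str, value: bool) -> str:
    """Build a Drill ALTER SESSION statement for a boolean option.

    Hypothetical helper for illustration; it only formats a string and
    does not connect to a Drillbit.
    """
    # Drill option names contain dots, so they are quoted with backticks.
    literal = "true" if value else "false"
    return f"ALTER SESSION SET `{option}` = {literal}"


# Option name taken from the comment above.
ASYNC_COLUMN_READER = "store.parquet.reader.columnreader.async"

if __name__ == "__main__":
    print(alter_session_sql(ASYNC_COLUMN_READER, True))
```

The generated statement can be pasted into any Drill SQL client to compare
query times with the option on and off.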




[jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-10-31 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623876#comment-15623876
 ] 

Parth Chandra commented on DRILL-4800:
--

Some perf numbers on a new setup (so these are not directly comparable with the
ones in the proposal doc):

Other reader (target): 17,142 ms
DRILL 1.9.0 SNAPSHOT: 24,517 ms
#1 DRILL 1.9.0 with buffering: 18,287 ms
#2 DRILL 1.9.0 with Async Page Reader + buffering: 18,055 ms
#3 DRILL 1.9.0 with Async Page Reader + buffering + Parallel decoding: 16,281 ms

Under concurrent loads:
#1 and #2 scale linearly and, with 5 concurrent queries, take ~42s.
#3 also scales linearly but degrades much faster, taking ~59s.

Under #3, CPU starts becoming a bottleneck.

For the PR, I'm proposing to go with #2 on by default, since it is still much
better than what we have.

Note that these numbers are from a scan heavy query on the TPCH lineitem table. 
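
To make the comparison easier to read, the timings above can be turned into
relative improvements over the 1.9.0 SNAPSHOT baseline, and into a slowdown
factor under 5 concurrent queries. The sketch below only restates the numbers
from this comment; the function and variable names are hypothetical, and the
concurrent figures are the approximate "~42s" and "~59s" values, so the factors
are rough.

```python
# Single-query timings in milliseconds, copied from the comment above.
BASELINE_MS = 24_517  # DRILL 1.9.0 SNAPSHOT
TIMINGS_MS = {
    "other reader (target)": 17_142,
    "#1 buffering": 18_287,
    "#2 async page reader + buffering": 18_055,
    "#3 async page reader + buffering + parallel decoding": 16_281,
}

# Approximate wall-clock time in ms with 5 concurrent queries ("~42s", "~59s").
CONCURRENT_MS = {
    "#1 buffering": 42_000,
    "#2 async page reader + buffering": 42_000,
    "#3 async page reader + buffering + parallel decoding": 59_000,
}


def percent_faster(baseline_ms: float, candidate_ms: float) -> float:
    """Reduction in elapsed time relative to the baseline, in percent."""
    return round(100.0 * (baseline_ms - candidate_ms) / baseline_ms, 1)


def slowdown_factor(single_ms: float, concurrent_ms: float) -> float:
    """How many times longer a query takes under concurrent load."""
    return round(concurrent_ms / single_ms, 1)


if __name__ == "__main__":
    for name, ms in TIMINGS_MS.items():
        print(f"{name}: {percent_faster(BASELINE_MS, ms)}% faster than baseline")
    for name, ms in CONCURRENT_MS.items():
        print(f"{name}: {slowdown_factor(TIMINGS_MS[name], ms)}x under 5 queries")
```

On these numbers, #1 and #2 come out roughly 25% and 26% faster than the
baseline and #3 roughly 34% faster, while #3 slows down by a noticeably larger
factor under 5 concurrent queries than #1 and #2 do.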







[jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-10-31 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623475#comment-15623475
 ] 

Parth Chandra commented on DRILL-4800:
--

Configuration parameters are documented here: 
https://github.com/parthchandra/drill/wiki/Drill-Parquet-Scan-Configuration




[jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-10-12 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570177#comment-15570177
 ] 

Parth Chandra commented on DRILL-4800:
--

Submitted PR 611: https://github.com/apache/drill/pull/611
[~jaltekruse] Would you like to take a look?
Also added some documentation on the configurable options here: 
https://github.com/parthchandra/drill/wiki/Parquet-file-reading-performance-improvement

I'll post some performance numbers in the next couple of days.




[jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-08-05 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409753#comment-15409753
 ] 

Parth Chandra commented on DRILL-4800:
--

Updated the doc to include more configurable options and metrics based on 
feedback. Added an open item to improve operator stats to handle the proposed 
changes.



[jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-07-25 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392388#comment-15392388
 ] 

Parth Chandra commented on DRILL-4800:
--

Good point. I'll include that in the benchmarking phase after making the first 
set of changes. 



[jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-07-23 Thread Tomer Shiran (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390860#comment-15390860
 ] 

Tomer Shiran commented on DRILL-4800:
-

This looks interesting and definitely valuable. One other thing that would be 
worth measuring and comparing is the throughput when there are multiple queries 
(perhaps 10 & 50?) running. In a real-world scenario, there will be many BI 
queries running concurrently, so CPU efficiency will be as important as great 
pipelining.



[jira] [Commented] (DRILL-4800) Improve parquet reader performance

2016-07-22 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390046#comment-15390046
 ] 

Parth Chandra commented on DRILL-4800:
--

I've been working on this in the background, collecting data on the parquet reader. 
Here's a summary and a proposal:
https://github.com/parthchandra/drill/wiki/Parquet-file-reading-performance-improvement/_edit
Also, the same proposal is in a Google doc so folks can comment 
(https://docs.google.com/document/d/1FK2LWlazgSLWa_5_WDyt52lYATu8m6UWaWhr591R3ZI/edit?usp=sharing).





