[jira] [Commented] (DRILL-4800) Improve parquet reader performance
[ https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16626663#comment-16626663 ]

Bridget Bevens commented on DRILL-4800:
---------------------------------------

Docs were updated here: https://drill.apache.org/docs/asynchronous-parquet-reader/

> Improve parquet reader performance
> ----------------------------------
>
>                 Key: DRILL-4800
>                 URL: https://issues.apache.org/jira/browse/DRILL-4800
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>            Priority: Major
>              Labels: doc-complete
>             Fix For: 1.9.0
>
>
> Reported by a user in the field -
> We're generally getting read speeds of about 100-150 MB/s/node on PARQUET
> scan operator. This seems a little low given the number of drives on the node
> - 24. We're looking for options we can improve the performance of this
> operator as most of our queries are I/O bound.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
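To put the read speeds quoted in the issue description in perspective, here is a rough back-of-the-envelope sketch of the per-drive throughput they imply. It assumes reads are spread evenly across all 24 drives, which the report does not state, and the function name is only illustrative.

```python
# Figures taken from the issue description.
DRIVES_PER_NODE = 24
REPORTED_NODE_MB_PER_S = (100, 150)  # low and high end of the reported range


def per_drive_mb_per_s(node_mb_per_s: float, drives: int) -> float:
    """Average throughput per drive, assuming an even spread (an assumption)."""
    return node_mb_per_s / drives


for node_rate in REPORTED_NODE_MB_PER_S:
    rate = per_drive_mb_per_s(node_rate, DRIVES_PER_NODE)
    print(f"{node_rate} MB/s/node over {DRIVES_PER_NODE} drives = {rate:.2f} MB/s/drive")
```

Under that assumption each drive would be delivering only about 4 to 6 MB/s on average, which is consistent with the reporter's sense that the scan is not using the available disks well.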
[jira] [Commented] (DRILL-4800) Improve parquet reader performance
[ https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633538#comment-15633538 ]

Parth Chandra commented on DRILL-4800:
--------------------------------------

Yes. Updated the gist.
[jira] [Commented] (DRILL-4800) Improve parquet reader performance
[ https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15632981#comment-15632981 ]

Jinfeng Ni commented on DRILL-4800:
-----------------------------------

In the configuration parameter doc
https://github.com/parthchandra/drill/wiki/Drill-Parquet-Scan-Configuration,
the default value of `store.parquet.reader.columnreader.async` would be false, right?
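For readers who want to try the option named in the question above, here is a minimal sketch that builds the statement to turn it on for one session. `ALTER SESSION SET` is Drill's usual syntax for session options; the helper name is hypothetical, and whether the option exists under this exact name in a given release should be checked against the configuration doc linked in the comment.

```python
# Option name as quoted in the comment; everything else here is illustrative.
COLUMN_READER_ASYNC = "store.parquet.reader.columnreader.async"


def alter_session_statement(option: str, enabled: bool) -> str:
    """Build a Drill ALTER SESSION statement for a boolean option.

    Hypothetical helper: it only formats a string and does not talk to Drill.
    """
    sql_value = "true" if enabled else "false"
    # Backticks are required because the option name contains dots.
    return f"ALTER SESSION SET `{option}` = {sql_value}"


print(alter_session_statement(COLUMN_READER_ASYNC, True))
```

The resulting statement can be pasted into any Drill SQL client; passing `False` produces the statement that restores the default discussed in this thread.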
[jira] [Commented] (DRILL-4800) Improve parquet reader performance
[ https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623876#comment-15623876 ]

Parth Chandra commented on DRILL-4800:
--------------------------------------

Some perf numbers on a new setup (so these are not directly comparable with the ones in the proposal doc) -

Other reader (target): 17,142 ms
DRILL 1.9.0 SNAPSHOT: 24,517 ms
#1 DRILL 1.9.0 with buffering: 18,287 ms
#2 DRILL 1.9.0 with Async Page Reader + buffering: 18,055 ms
#3 DRILL 1.9.0 with Async Page Reader + buffering + Parallel decoding: 16,281 ms

Under concurrent loads -
#1, #2 scale linearly and with 5 concurrent queries take ~42s
#3 also scales linearly, but degrades much faster, taking ~59s
Under #3, CPU starts becoming a bottleneck

For the PR, I'm proposing to go with #2 on by default, since it is still much better than what we have.

Note that these numbers are from a scan heavy query on the TPCH lineitem table.
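The single-query timings in the comment above are easier to compare as ratios against the 1.9.0 SNAPSHOT baseline. The sketch below only restates the posted numbers; the function names and short labels are illustrative, not anything from the PR.

```python
# Single-query timings in milliseconds, copied from the comment.
BASELINE_MS = 24_517  # DRILL 1.9.0 SNAPSHOT
TARGET_MS = 17_142    # other reader (target)

TIMINGS_MS = {
    "#1 buffering": 18_287,
    "#2 async page reader + buffering": 18_055,
    "#3 async page reader + buffering + parallel decoding": 16_281,
}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def percent_reduction(baseline_ms: float, candidate_ms: float) -> float:
    """Percentage of the baseline run time that the candidate saves."""
    return 100.0 * (baseline_ms - candidate_ms) / baseline_ms


for label, ms in TIMINGS_MS.items():
    print(
        f"{label}: {speedup(BASELINE_MS, ms):.2f}x faster, "
        f"{percent_reduction(BASELINE_MS, ms):.1f}% less time, "
        f"{ms - TARGET_MS:+d} ms vs target"
    )
```

On these numbers #1 and #2 are each roughly a third faster than the baseline and about a second behind the target reader, while #3 beats the target. That matches the reasoning in the comment: #2 captures most of the single-query gain without the CPU cost that makes #3 degrade under concurrent load.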
[jira] [Commented] (DRILL-4800) Improve parquet reader performance
[ https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623475#comment-15623475 ]

Parth Chandra commented on DRILL-4800:
--------------------------------------

Configuration parameters are documented here: https://github.com/parthchandra/drill/wiki/Drill-Parquet-Scan-Configuration
[jira] [Commented] (DRILL-4800) Improve parquet reader performance
[ https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570177#comment-15570177 ]

Parth Chandra commented on DRILL-4800:
--------------------------------------

Submitted PR 611: https://github.com/apache/drill/pull/611

[~jaltekruse] Would you like to take a look?

Also added some documentation on the configurable options here: https://github.com/parthchandra/drill/wiki/Parquet-file-reading-performance-improvement

I'll post some performance numbers in the next couple of days.
[jira] [Commented] (DRILL-4800) Improve parquet reader performance
[ https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15409753#comment-15409753 ]

Parth Chandra commented on DRILL-4800:
--------------------------------------

Updated the doc to include more configurable options and metrics based on feedback.
Added an open item to improve operator stats to handle the proposed changes.
[jira] [Commented] (DRILL-4800) Improve parquet reader performance
[ https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392388#comment-15392388 ]

Parth Chandra commented on DRILL-4800:
--------------------------------------

Good point. I'll include that in the benchmarking phase after making the first set of changes.
[jira] [Commented] (DRILL-4800) Improve parquet reader performance
[ https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390860#comment-15390860 ]

Tomer Shiran commented on DRILL-4800:
-------------------------------------

This looks interesting and definitely valuable. One other thing that would be worth measuring and comparing is the throughput when there are multiple queries (perhaps 10 & 50?) running. In a real-world scenario, there will be many BI queries running concurrently, so CPU efficiency will be as important as great pipelining.
[jira] [Commented] (DRILL-4800) Improve parquet reader performance
[ https://issues.apache.org/jira/browse/DRILL-4800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15390046#comment-15390046 ]

Parth Chandra commented on DRILL-4800:
--------------------------------------

Been working on this in the background, collecting data on the parquet reader. Here's a summary and a proposal -
https://github.com/parthchandra/drill/wiki/Parquet-file-reading-performance-improvement/_edit

Also, the same proposal in a Google doc so folks can comment
(https://docs.google.com/document/d/1FK2LWlazgSLWa_5_WDyt52lYATu8m6UWaWhr591R3ZI/edit?usp=sharing)