[jira] [Updated] (DRILL-5207) Improve Parquet scan pipelining
     [ https://issues.apache.org/jira/browse/DRILL-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suresh Ollala updated DRILL-5207:
---------------------------------
    Reviewer: Kunal Khatua  (was: Sudheesh Katkam)

> Improve Parquet scan pipelining
> -------------------------------
>
>                 Key: DRILL-5207
>                 URL: https://issues.apache.org/jira/browse/DRILL-5207
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>              Labels: doc-impacting
>             Fix For: 1.10.0
>
>
> The parquet reader's async page reader is not quite efficiently pipelined.
> The default size of the disk read buffer is 4MB while the page reader reads
> ~1MB at a time. The Parquet decode is also processing 1MB at a time. This
> means the disk is idle while the data is being processed. Reducing the buffer
> to 1MB will reduce the time the processing thread waits for the disk read
> thread.
> Additionally, since the data to process a page may be more or less than 1MB,
> a queue of pages will help so that the disk scan does not block (until the
> queue is full), waiting for the processing thread.
> Additionally, the BufferedDirectBufInputStream class reads from disk as soon
> as it is initialized. Since this is called at setup time, this increases the
> setup time for the query and query execution does not begin until this is
> completed.
> There are a few other inefficiencies - options are read every time a page
> reader is created. Reading options can be expensive.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
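The queue of pages proposed in the description can be illustrated with a bounded producer/consumer sketch. This is not Drill's code: the class name `PagePipelineSketch`, the `run` method, its parameters, and the end-of-stream sentinel are all hypothetical, and the "disk read" and "decode" steps are stand-ins. It only shows the idea that the reading thread blocks when the queue is full rather than after every page.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; none of these names come from Drill.
final class PagePipelineSketch {
    // Sentinel object the reader enqueues after the last page (compared by identity).
    private static final byte[] END_OF_STREAM = new byte[0];

    /**
     * Pushes pageCount pages of pageSize bytes through a bounded queue
     * and returns the total number of bytes the consumer "decoded".
     */
    static long run(int pageCount, int pageSize, int queueCapacity) {
        BlockingQueue<byte[]> pages = new ArrayBlockingQueue<>(queueCapacity);

        // Stand-in for the disk read thread: it keeps reading ahead and
        // blocks in put() only while the queue is full.
        Thread diskReader = new Thread(() -> {
            try {
                for (int i = 0; i < pageCount; i++) {
                    byte[] page = new byte[pageSize]; // pretend this came from disk
                    pages.put(page);
                }
                pages.put(END_OF_STREAM);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "disk-reader");
        diskReader.start();

        // Stand-in for the processing thread: it blocks in take() only
        // while the queue is empty.
        long decoded = 0;
        try {
            while (true) {
                byte[] page = pages.take();
                if (page == END_OF_STREAM) {
                    break;
                }
                decoded += page.length; // pretend this is the Parquet decode
            }
            diskReader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining pages", e);
        }
        return decoded;
    }

    public static void main(String[] args) {
        // Eight 1 MB pages through a queue that holds at most two pages.
        System.out.println(run(8, 1024 * 1024, 2));
    }
}
```

The queue capacity bounds how far the reader can run ahead of the decoder, which is the trade-off the description hints at: a deeper queue absorbs more variation in per-page processing time at the cost of more buffered memory.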
[jira] [Updated] (DRILL-5207) Improve Parquet scan pipelining
     [ https://issues.apache.org/jira/browse/DRILL-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Parth Chandra updated DRILL-5207:
---------------------------------
    Labels: doc-impacting  (was: ready-to-commit)

> Improve Parquet scan pipelining
> -------------------------------
>
>                 Key: DRILL-5207
>                 URL: https://issues.apache.org/jira/browse/DRILL-5207
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>              Labels: doc-impacting
>             Fix For: 1.10.0
[jira] [Updated] (DRILL-5207) Improve Parquet scan pipelining
     [ https://issues.apache.org/jira/browse/DRILL-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sudheesh Katkam updated DRILL-5207:
-----------------------------------
    Labels: ready-to-commit  (was: )

> Improve Parquet scan pipelining
> -------------------------------
>
>                 Key: DRILL-5207
>                 URL: https://issues.apache.org/jira/browse/DRILL-5207
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>              Labels: ready-to-commit
>             Fix For: 1.10.0
[jira] [Updated] (DRILL-5207) Improve Parquet scan pipelining
     [ https://issues.apache.org/jira/browse/DRILL-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khatua updated DRILL-5207:
--------------------------------
    Fix Version/s:     (was: 1.10)
                   1.10.0

> Improve Parquet scan pipelining
> -------------------------------
>
>                 Key: DRILL-5207
>                 URL: https://issues.apache.org/jira/browse/DRILL-5207
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 1.10.0
[jira] [Updated] (DRILL-5207) Improve Parquet scan pipelining
     [ https://issues.apache.org/jira/browse/DRILL-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zelaine Fong updated DRILL-5207:
--------------------------------
    Reviewer: Sudheesh Katkam

Assigned Reviewer to [~sudheeshkatkam]

> Improve Parquet scan pipelining
> -------------------------------
>
>                 Key: DRILL-5207
>                 URL: https://issues.apache.org/jira/browse/DRILL-5207
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.9.0
>            Reporter: Parth Chandra
>            Assignee: Parth Chandra
>             Fix For: 1.10

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)