Re: Batch Sizing for Parquet Flat Reader

2018-03-25 Thread salim achouche
I have updated the document with more design details. On Thu, Feb 8, 2018 at 5:42 PM, salim achouche wrote: > The following document >

Re: Batch Sizing for Parquet Flat Reader

2018-03-04 Thread Aman Sinha
Hi Paul, thanks for your comments. I have added my thoughts in the DRILL-6147 JIRA as well. Regarding the hangout, let me find out about availability of other folks too and will circle back with you. thanks, Aman On Sun, Mar 4, 2018 at 1:23 PM, Paul Rogers wrote: >

Re: Batch Sizing for Parquet Flat Reader

2018-03-04 Thread Paul Rogers
Hi Aman, To follow up, we should look at all sides of the issue. One factor overlooked in my previous note is that code now is better than code later. DRILL-6147 is available today and will immediately give users a performance boost. The result set loader is large and will take some months to

Re: Batch Sizing for Parquet Flat Reader

2018-03-04 Thread Paul Rogers
Hi Aman, Please see my comment in DRILL-6147. For the hangout to be productive, perhaps we should create test cases that will show the benefit of DRILL-6147 relative to the result set loader. The test case of interest has three parts: * Multiple variable-width fields (say five) with a large

Re: Batch Sizing for Parquet Flat Reader

2018-03-04 Thread Aman Sinha
Hi all, with reference to DRILL-6147, given the overlapping approaches, I feel we should have a separate hangout session with interested parties to discuss the details. Let me know and I can set one up. Aman On Mon, Feb 12, 2018 at 8:50

Re: Batch Sizing for Parquet Flat Reader

2018-02-12 Thread Padma Penumarthy
If our goal is not to allocate more than 16MB for individual vectors, to avoid external fragmentation, I guess we can also take that into consideration in our calculations to figure out the outgoing number of rows. The math might become more complex. But, the main point, like you said, is
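For illustration, the kind of calculation being discussed might look like the sketch below. The 16 MB cap comes from the Netty chunk size mentioned in the thread; every name and the surrounding method are assumptions for illustration, not code from DRILL-6147.

    // Sketch: choose an outgoing row count such that no individual value
    // vector grows past the 16 MB Netty chunk size and the batch as a
    // whole stays inside its memory budget. All names are illustrative.
    static int computeOutgoingRowCount(double[] avgColumnWidths,
                                       long batchMemoryBudget,
                                       int hardRowLimit) {
      final long VECTOR_SIZE_LIMIT = 16L * 1024 * 1024;   // 16 MB per vector
      double rowWidth = 0;
      long rowsByVectorLimit = hardRowLimit;
      for (double width : avgColumnWidths) {
        rowWidth += width;
        if (width > 0) {
          // Rows allowed before this one column's vector exceeds 16 MB
          rowsByVectorLimit = Math.min(rowsByVectorLimit,
              (long) (VECTOR_SIZE_LIMIT / width));
        }
      }
      long rowsByBudget = rowWidth > 0
          ? (long) (batchMemoryBudget / rowWidth)
          : hardRowLimit;
      return (int) Math.max(1,
          Math.min(hardRowLimit, Math.min(rowsByBudget, rowsByVectorLimit)));
    }

The per-vector term is what makes "the math more complex": the row count is bounded by the widest column as well as by the overall budget.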

Re: Batch Sizing for Parquet Flat Reader

2018-02-12 Thread Paul Rogers
Agreed that allocating vectors up front is another good improvement. The average batch size approach gets us 80% of the way to the goal: it limits batch size and allows vector preallocation. What it cannot do is limit individual vector sizes. Nor can it ensure that the resulting batch is

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Padma Penumarthy
With the average row size method, since I know the number of rows and the average size of each column, I am planning to use that information to allocate the required memory for each vector upfront. This should help avoid the copy that happens every time we double a vector and also improve memory utilization. Thanks Padma >
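A rough sketch of that upfront-allocation idea, under the assumption that the target row count and per-column average widths are already known. The allocateNew calls follow Drill's FixedWidthVector / VariableWidthVector interfaces, but the surrounding class, sizing map, and defaults are illustrative, not tested code from the thread.

    import java.util.Map;

    import org.apache.drill.exec.record.VectorContainer;
    import org.apache.drill.exec.record.VectorWrapper;
    import org.apache.drill.exec.vector.FixedWidthVector;
    import org.apache.drill.exec.vector.ValueVector;
    import org.apache.drill.exec.vector.VariableWidthVector;

    // Sketch: size each vector once, up front, from the planned row count and
    // the observed average column widths, instead of letting vectors start
    // small and double (and copy) repeatedly as values arrive.
    class UpfrontVectorAllocator {

      void preallocate(VectorContainer container,
                       int targetRowCount,
                       Map<String, Integer> avgColumnWidthBytes) {
        for (VectorWrapper<?> wrapper : container) {
          ValueVector vector = wrapper.getValueVector();
          String column = vector.getField().getName();
          if (vector instanceof FixedWidthVector) {
            // Fixed-width vectors only need the row count
            ((FixedWidthVector) vector).allocateNew(targetRowCount);
          } else if (vector instanceof VariableWidthVector) {
            // Variable-width vectors also need a data-buffer estimate:
            // observed average width times the planned number of rows
            int dataBytes = avgColumnWidthBytes.getOrDefault(column, 8) * targetRowCount;
            ((VariableWidthVector) vector).allocateNew(dataBytes, targetRowCount);
          } else {
            vector.allocateNew();   // fall back to the default allocation
          }
        }
      }
    }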

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Paul Rogers
Hi Salim. Thanks much for the detailed explanation! You clearly have developed a deep understanding of the Parquet code and its impact on CPU and I/O performance. My comments are more from a holistic perspective of Drill as a whole. Far too much to discuss on the dev list. I've added your

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Paul Rogers
Parth notes: Also note that memory allocations by Netty greater than the 16MB chunk size are returned to the OS when the memory is freed. Both this document and the original document on memory fragmentation state incorrectly that such memory is not released back to the OS. A quick thought

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread salim achouche
Paul, I cannot thank you enough for your help and guidance! You are right that columnar readers will have a harder time balancing resource requirements and performance. Nevertheless, DRILL-6147 is a starting point; it should allow us to gain knowledge and accordingly refine our strategy as we

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Paul Rogers
One more thought: > > 3) Assuming that you go with the average batch size calculation approach, The average batch size approach is a quick and dirty approach for non-leaf operators that can observe an incoming batch to estimate row width. Because Drill batches are large, the law of large
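For illustration, the observation-based estimate a non-leaf operator can make might look like the following sketch; the names are assumptions, not Drill APIs.

    // Sketch: estimate an output row count from an observed incoming batch.
    // With large batches the sample mean row width is a stable estimate
    // (law of large numbers), so sizing on the average works well for
    // operators that can see their input.
    static int estimateOutputRowCount(long incomingBatchBytes,
                                      int incomingRowCount,
                                      long outputMemoryBudget,
                                      int maxRowCount) {
      if (incomingRowCount == 0) {
        return maxRowCount;                       // nothing observed yet
      }
      double avgRowWidth = (double) incomingBatchBytes / incomingRowCount;
      long rowsThatFit = (long) (outputMemoryBudget / avgRowWidth);
      return (int) Math.max(1, Math.min(maxRowCount, rowsThatFit));
    }

A columnar reader is a leaf operator with no incoming batch to observe, which is part of why, as noted elsewhere in the thread, readers have a harder time balancing resource requirements and performance.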

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread salim achouche
Thanks Parth for your feedback! I am planning to enhance the document based on the received feedback and the prototype I am currently working on! Regards, Salim On Sun, Feb 11, 2018 at 2:36 PM, salim achouche wrote: > Thanks Paul for your feedback! let me try to answer

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread salim achouche
Thanks Paul for your feedback! Let me try to answer some of your questions / comments: Duplicate Implementation - I am not contemplating two different implementations, one for Parquet and another for the rest of the code. Instead, I am reacting to the fact that we have two different processing

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Paul Rogers
Hi All, Perhaps this topic needs just a bit more thought and discussion to avoid working at cross purposes. I've outlined the issues, and a possible path forward, in a comment to DRILL-6147. Quick summary: creating a second batch size implementation just for Parquet will be very difficult once

Re: Batch Sizing for Parquet Flat Reader

2018-02-11 Thread Parth Chandra
Thanks Salim. Can you add this to the JIRA/design doc? Also, I would venture to suggest that the section on predicate pushdown could be made clearer. Also, since you're proposing the average batch size approach with overflow handling, some detail on the proposed changes to the framework would be

Re: Batch Sizing for Parquet Flat Reader

2018-02-09 Thread salim achouche
Thank you Parth for providing feedback; please find my answers below: I have created Apache JIRA DRILL-6147 for tracking this improvement. > 2) Not sure where you were going with the predicate pushdown section and how it pertains to

Re: Batch Sizing for Parquet Flat Reader

2018-02-09 Thread Parth Chandra
Is there a JIRA for this? It would be useful to capture the comments in the JIRA. Note that the document itself is not comment-able as it is shared with view-only permissions. Some thoughts in no particular order: 1) The page-based statistical approach is likely to run into trouble with the encoding

Batch Sizing for Parquet Flat Reader

2018-02-08 Thread salim achouche
The following document describes a proposal for enforcing batch sizing constraints (count and memory) within the Parquet Reader (Flat Schema). Please feel free to take a look and provide feedback.
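To make the two constraints concrete, here is a minimal, hypothetical sketch of the enforcement loop; all helper names are assumptions made for illustration, and the actual design is in the linked document and DRILL-6147.

    // Sketch: fill a batch until either the row-count cap or the memory
    // budget would be exceeded, then hand the batch downstream.
    int readBatch(long memoryBudgetBytes, int maxRows) {
      int rows = 0;
      long bytes = 0;
      while (hasMoreRows() && rows < maxRows) {            // count constraint
        long nextRowBytes = estimateNextRowBytes();        // hypothetical helper
        if (bytes + nextRowBytes > memoryBudgetBytes) {    // memory constraint
          break;
        }
        bytes += readNextRowIntoVectors();                 // hypothetical helper
        rows++;
      }
      return rows;                                         // rows in this batch
    }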