Re: Column index testing break down

2019-03-07 Thread Wes McKinney
hi Tim,

On Thu, Mar 7, 2019 at 11:52 AM Tim Armstrong wrote:
> I think you and I have different priors on this, Wes. It's definitely not clear-cut. I think it's an interesting point to discuss and it's unfortunate that you feel that way.
>
> Partially the current state of things is due to

Re: Column index testing break down

2019-03-07 Thread Tim Armstrong
I think you and I have different priors on this, Wes. It's definitely not clear-cut. I think it's an interesting point to discuss and it's unfortunate that you feel that way. Partially the current state of things is due to path-dependence, but there are some parts of the Impala runtime that it's

Re: Column index testing break down

2019-03-07 Thread Wes McKinney
It makes me very sad that Impala has this bespoke Parquet implementation. I never really understood the benefit of doing the work there rather than in Apache Parquet. I never found the arguments "why" I've heard over the years (that the implementation needed to be tightly coupled to the Impala

Re: Column index testing break down

2019-03-07 Thread Anna Szonyi
Hi Wes, Zoltan has created a C++ implementation for Impala. We would be happy to contribute it to Parquet cpp when we have time or if someone is keen on getting it in sooner and wants to take it over, we would be happy to review it. Feel free to check it out and chime in to the review for the

Re: Column index testing break down

2019-03-06 Thread Wes McKinney
Is there anyone who might be able to take on the project of implementing this in C++? We're having an increasing number of C++ Parquet users nowadays.

On Tue, Mar 5, 2019 at 9:54 AM Anna Szonyi wrote:
> Hi dev@ community,
>
> This week I would like to ask for some feedback on the testing we've

Re: Column index testing break down

2019-03-05 Thread Anna Szonyi
Hi dev@ community, This week I would like to ask for some feedback on the testing we've been sending out. We've been sharing the most important test cases we've created for the write path of the parquet column index feature, and now we would like to hear from you! Is there anything else you feel is

Re: Column index testing break down

2019-02-25 Thread Anna Szonyi
Hi dev@, After a week off, this week we have an excerpt from our internal data interoperability testing, which tests compatibility between Hive, Spark and Impala over Avro and Parquet. This test case is tailor-made to test specific layouts so that files written using parquet-mr can be read by any

Re: Column index testing break down

2019-02-11 Thread Anna Szonyi
Hi dev@, Last week we had a twofer: e2e tool and integration test validating the contract of column indexes/indices (whether all values are between min and max and, if set, whether the boundary order is correct). There are some takeaways and corrections to be made to the former (like the max->min typo)
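
The contract mentioned in the message above (every value lies between its page's min and max, and the boundary order, when set, is correct) can be illustrated with a small in-memory check. This is only a sketch under stated assumptions, not the e2e tool or the integration test discussed in the thread: `PageStats`, `values_within_bounds` and `boundary_order_holds` are hypothetical names, values are assumed to be plain integers, and null pages are ignored. The order names follow the `BoundaryOrder` enum of the Parquet format (UNORDERED, ASCENDING, DESCENDING).

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PageStats:
    """Hypothetical stand-in for one page's entry in a column index."""
    min_value: int
    max_value: int
    values: Sequence[int]  # non-null values stored in the page (assumed ints)


def values_within_bounds(pages: List[PageStats]) -> bool:
    """Check that every value lies in [min_value, max_value] of its own page."""
    return all(
        page.min_value <= value <= page.max_value
        for page in pages
        for value in page.values
    )


def boundary_order_holds(pages: List[PageStats], order: str) -> bool:
    """Check that page mins and page maxs both follow the declared order."""
    if order == "UNORDERED":
        return True  # nothing is promised, so nothing can be violated
    if order == "ASCENDING":
        in_order = lambda a, b: a <= b
    elif order == "DESCENDING":
        in_order = lambda a, b: a >= b
    else:
        raise ValueError(f"unknown boundary order: {order!r}")

    mins = [page.min_value for page in pages]
    maxs = [page.max_value for page in pages]
    # Compare each page's bound with the next page's bound.
    return all(
        in_order(a, b)
        for bounds in (mins, maxs)
        for a, b in zip(bounds, bounds[1:])
    )


pages = [
    PageStats(min_value=1, max_value=5, values=[1, 3, 5]),
    PageStats(min_value=5, max_value=9, values=[6, 9]),
]
print(values_within_bounds(pages))
print(boundary_order_holds(pages, "ASCENDING"))
```

Keeping the two checks separate mirrors the two halves of the contract: a file can have correct min/max bounds and still declare a boundary order that its pages do not follow.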

Re: Column index testing break down

2019-02-06 Thread Anna Szonyi
Hi Ryan, Thanks for the quick feedback! We also have an integration test version of the contract validation, which uses in-memory data and validates the min/max/boundaryorder, as well as the correctness of the filtering - this would have been the next installment, so *spoiler alert* :)

Column index testing break down

2019-02-04 Thread Anna Szonyi
Hi dev@, As part of acquiring the final sign offs for the parquet release, we had another discussion with Ryan about the testing of column indexes. He asked us to break down our testing into "fun size" test bites and send them out incrementally to the community, so everyone would have time to