Re: Inconsistent handling of schema in Avro tables

2018-07-16 Thread Todd Lipcon
On Thu, Jul 12, 2018 at 5:07 PM, Bharath Vissapragada <bhara...@cloudera.com.invalid> wrote:
> On Thu, Jul 12, 2018 at 12:03 PM Todd Lipcon wrote:
> >
> > So, I think my proposal here is:
> >
> > 1. Query behavior on existing tables
> > - If the table-level format is non-Avro,
> > - AND the

Re: Inconsistent handling of schema in Avro tables

2018-07-12 Thread Bharath Vissapragada
On Thu, Jul 12, 2018 at 12:03 PM Todd Lipcon wrote:
> Again there's inconsistency with Hive: the presence of a single Avro
> partition doesn't change the table-level schema.
>
> The interesting thing is that, when I modified Impala to have a similar
> behavior, I got the following error from the

Re: Inconsistent handling of schema in Avro tables

2018-07-12 Thread Todd Lipcon
Again there's inconsistency with Hive: the presence of a single Avro partition doesn't change the table-level schema. The interesting thing is that, when I modified Impala to have a similar behavior, I got the following error from the backend when trying to query the data: WARNINGS: Unresolvable

Re: Inconsistent handling of schema in Avro tables

2018-07-11 Thread Todd Lipcon
Turns out it's even a bit messier. The presence of one or more Avro partitions can change the types of existing columns, even if there is no explicit Avro schema specified for the table: https://gist.github.com/5018d6ff50f846c72762319eb7cf5ca8 Not quite sure how to handle this one in a world

Re: Inconsistent handling of schema in Avro tables

2018-07-11 Thread Bharath Vissapragada
Agreed.
On Wed, Jul 11, 2018 at 8:55 PM Todd Lipcon wrote:
> Your commit message there makes sense, Bharath -- we should set
> 'avroSchema' in the descriptor in case any referenced partition is avro,
> because the scanner needs that info. However, we don't need to also
> override the

Re: Inconsistent handling of schema in Avro tables

2018-07-11 Thread Todd Lipcon
Your commit message there makes sense, Bharath -- we should set 'avroSchema' in the descriptor in case any referenced partition is Avro, because the scanner needs that info. However, we don't need to also override the table-level schema. So, I think we can preserve the fix that you made while also
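A minimal sketch of the separation described above, not Impala's actual code: the descriptor carries an Avro schema whenever any referenced partition is Avro, while the table-level column list is passed through unchanged. Every name here (`Partition`, `Table`, `Descriptor`, `build_descriptor`) is hypothetical, and the sketch simply assumes the Avro schema is already available on the table object.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Partition:
    name: str
    file_format: str  # e.g. "AVRO", "PARQUET", "TEXT"


@dataclass
class Table:
    columns: List[str]                 # table-level schema, e.g. "id int"
    partitions: List[Partition]
    avro_schema: Optional[str] = None  # assumption: schema JSON is stored here


@dataclass
class Descriptor:
    columns: List[str]
    avro_schema: Optional[str]


def build_descriptor(table: Table, referenced: List[Partition]) -> Descriptor:
    """Attach the Avro schema only when a referenced partition needs it."""
    needs_avro = any(p.file_format.upper() == "AVRO" for p in referenced)
    return Descriptor(
        # The table-level columns are copied as-is, never overridden.
        columns=list(table.columns),
        # The scanner-facing schema is set only if an Avro partition is read.
        avro_schema=table.avro_schema if needs_avro else None,
    )
```

With this shape, a query that touches only non-Avro partitions gets a descriptor with no Avro schema, and in both cases the table-level columns stay exactly as declared.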

Re: Inconsistent handling of schema in Avro tables

2018-07-11 Thread Bharath Vissapragada
I added this functionality, where adding an Avro partition in a mixed-partition table resets the table-level schema. While I don't exactly remember why we chose this path, I do recall that we debated quite a bit

Re: Inconsistent handling of schema in Avro tables

2018-07-11 Thread Edward Capriolo
I know that Hive can deal with the schema being different per partition, but I really struggle to understand why someone would want to do this. If someone asked me to support a mixed Avro/Parquet table I would suggest they create a view. If they kept insisting I would reply "Well, it is your funeral."

Re: Inconsistent handling of schema in Avro tables

2018-07-11 Thread Tim Armstrong
The behaviour of Avro schemas in all these cases has always been rather mysterious to me. Before you wrote this email I would have assumed that Impala's behaviour would be like Hive's behaviour. I agree with the principle that the creation of a partition without changes to table metadata

Inconsistent handling of schema in Avro tables

2018-07-11 Thread Todd Lipcon
Hey folks, I'm trying to understand the current behavior of tables that contain partitions of mixed format, specifically when one or more partitions are stored as Avro. Impala seems to be doing a number of things which I find surprising, and I'm not sure if they are intentional or should be