Well, at least I'm not alone here.

I think it must be time to set some ground rules for these things: what it
means to support an evolving schema and what is needed to eliminate ETL.

I trust that enforcing a strict schema "just because we think we can" would
go against the principles of such rules.

We moved all our stuff to Avro to avoid various problems with type handling
(Double being assumed for null values, etc.), and to be hit with this, after
all that work, is like a slap in the face with two pilchards (more here:
https://www.youtube.com/watch?v=IhJQp-q1Y1s)

Regards,
 -Stefán

On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <[email protected]> wrote:

> Sigh of relief is premature.  Nobody has committed to carrying this
> interpretation forward.
>
>
>
> On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <[email protected]>
> wrote:
>
> > /me sighs of relief
> >
> > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <[email protected]>
> > wrote:
> >
> > > Actually, even without multiple storage types, this could be radically
> > > confusing.
> > >
> > > If I have many avro files that are partitioned into directories, then
> > > queries that use the partitioning to limit the files that I see could
> > > include or exclude more recent files that have added a new field.
> > >
> > > That means that a query would succeed or fail according to which date
> > > range I use for the query.
> > >
> > > That seems pretty radically bad.
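> > >
> > > A hypothetical illustration (made-up paths and field names): if only
> > > the newest partition contains files that carry new_field, then
> > >
> > >   SELECT new_field FROM dfs.`/data/events` WHERE dir0 = '2015-12-15';
> > >
> > > could validate, while
> > >
> > >   SELECT new_field FROM dfs.`/data/events` WHERE dir0 = '2015-11-01';
> > >
> > > would fail with "Column 'new_field' not found in any table", even
> > > though only the date range changed.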
> > >
> > >
> > >
> > >
> > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > This simply cannot be the desired behavior!
> > > >
> > > > This prevents us from using a field from a changing schema with dir0
> > > > sub-selection (directory pruning), as the altered/full schema is never
> > > > part of the query and it subsequently fails.
> > > >
> > > > Drill should, IMO, never have rules that depend on the underlying
> > > > storage type. If the query runs with JSON and Parquet then it should
> > > > work for Avro as well.
> > > >
> > > > I'm hoping this strict schema validation is all just a
> > > > misunderstanding.
> > > >
> > > > Regards,
> > > >  -Stefán
> > > >
> > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <[email protected]>
> > > > wrote:
> > > >
> > > > > For Avro files, we first construct the schema, and this schema is
> > > > > used for validating queries. So, if there are any errors in the
> > > > > query (like invalid field references) it will fail fast. As of
> > > > > now, for other file formats, query validation (checking for
> > > > > invalid field references) does not happen; the schema is
> > > > > constructed at run time, hence nulls for invalid fields.
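> > > > >
> > > > > As a sketch of the difference (hypothetical paths, not from the
> > > > > thread), projecting a column that exists in no file behaves
> > > > > differently per format:
> > > > >
> > > > >   SELECT missing_col FROM dfs.`/data/events_json`;
> > > > >   -- JSON: schema built at run time, missing_col returns NULL
> > > > >
> > > > >   SELECT missing_col FROM dfs.`/data/events_avro`;
> > > > >   -- Avro: schema built up front, query fails with VALIDATION ERROR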
> > > > >
> > > > >
> > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'm getting the following error when querying Avro files:
> > > > > >
> > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1,
> > > > > > column 57: Column 'some_col' not found in any table
> > > > > >
> > > > > > It's true that the field is in none of the tables I'm
> > > > > > targeting, in that particular query, but that does not mean that
> > > > > > it is in none of the possible files I could be querying.
> > > > > >
> > > > > > We use Avro to get the benefits of the schema but I never
> > > > > > expected Drill to enforce it this way.
> > > > > >
> > > > > > Why do unresolved columns not return null?
> > > > > >
> > > > > > This makes no sense to me as I think a fundamental trait of
> > > > > > Drill, when trying to eliminate ETL, is to return null for any
> > > > > > missing fields.
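> > > > > >
> > > > > > A sketch of what I expected (made-up path; some_col as in the
> > > > > > error above):
> > > > > >
> > > > > >   SELECT some_col FROM dfs.`/path/to/avro_files`;
> > > > > >
> > > > > > returning NULL for records from files whose schema lacks
> > > > > > some_col, rather than the whole query failing validation.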
> > > > > >
> > > > > > Please advise.
> > > > > >
> > > > > > Regards,
> > > > > >  -Stefán
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Kamesh.
> > > > >
> > > >
> > >
> >
>
