I think the main problem you're hitting is that we should be taking a union of the schemas of all files. In that case, as long as the field is in at least a single file, we're going to let the field through.
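
As a rough sketch of that behavior (the directory layout, the dfs workspace, and the ts/new_field column names below are hypothetical), a query pruned to older files would still validate against the union of all files' schemas and simply return NULL where the field is absent:

    -- Suppose new_field exists only in files under /data/events/2015-12,
    -- not in the older files under /data/events/2015-11.
    SELECT ts, new_field
    FROM dfs.`/data/events`
    WHERE dir0 = '2015-11';

    -- Today this can fail with:
    --   VALIDATION ERROR: Column 'new_field' not found in any table
    -- With a union of all files' schemas, the query validates and
    -- returns NULL for new_field in the pruned partition.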
There is a balance to strike between early termination and the flexibility we need to provide. If someone types a field that is guaranteed not to be in the data, the thinking is that we should fail the query early, as that is probably a mistake on the user's part. If it could be a valid field, we proceed with execution and return null for it until we find something. That is the goal, anyway.

Clearly we have a bug here, as we should never reject a possible or known field. I think of fields in three categories: known, possible, and impossible. Impossible fields should fail to validate. Possible and known fields should validate and execute.

With regard to Ted's concern: I agree that applying a filter shouldn't fail a query. That means we will either have to consider the complete union schema before pruning files, or consider all fields as either known or possible after pruning files.

Stefán, if you haven't already, please open a bug reporting that known fields are failing to validate in Avro, and we will fix it shortly. Sorry about the bug.

On Dec 14, 2015 10:51 PM, "Stefán Baxter" <[email protected]> wrote:

> Well, at least I'm not alone here.
>
> I think it must be time to set some ground rules for these things, what it means to support an evolving schema, and what is needed to eliminate ETL.
>
> I trust that enforcing a strict schema "just because we think we can" must go against the principles of such rules.
>
> We moved all our stuff to Avro to avoid various problems with type handling (assuming Double on nulls, etc.), and to be hit with this, after all that work, is like a slap in the face with two pilchards (more here: https://www.youtube.com/watch?v=IhJQp-q1Y1s).
>
> Regards,
> -Stefán
>
> On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <[email protected]> wrote:
>
> > Sigh of relief is premature. Nobody has committed to carrying this interpretation forward.
> >
> > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <[email protected]> wrote:
> >
> > > /me sighs of relief
> > >
> > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <[email protected]> wrote:
> > >
> > > > Actually, even without multiple storage types, this could be radically confusing.
> > > >
> > > > If I have many Avro files that are partitioned into directories, then queries that use the partitioning to limit the files that I see could include or exclude more recent files that have added a new field.
> > > >
> > > > That means that a query would succeed or fail according to which date range I use for the query.
> > > >
> > > > That seems pretty radically bad.
> > > >
> > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > This simply cannot be the desired behavior!
> > > > >
> > > > > This prevents using a field from a changing schema with dir0 sub-selection (directory pruning), as the altered/full schema is never part of the query and it subsequently fails.
> > > > >
> > > > > Drill should, IMO, never have rules that depend on the underlying storage type. If the query runs with JSON and Parquet, then it should work for Avro as well.
> > > > >
> > > > > I'm hoping this strict schema validation is all just a misunderstanding.
> > > > >
> > > > > Regards,
> > > > > -Stefán
> > > > >
> > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <[email protected]> wrote:
> > > > >
> > > > > > For Avro files, we first construct the schema, and this schema is used for validating queries. So, if there are any errors in the query (like invalid field references), it will fail fast. As of now, for other file formats, query validation (checking for invalid field references) does not happen; the schema is constructed at run time, and hence invalid fields return nulls.
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I'm getting the following error when querying Avro files:
> > > > > > >
> > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57: Column 'some_col' not found in any table
> > > > > > >
> > > > > > > It's true that the field is in none of the tables I'm targeting in that particular query, but that does not mean that it is in none of the possible files I could be querying.
> > > > > > >
> > > > > > > We use Avro to get the benefits of the schema, but I never expected Drill to enforce it this way.
> > > > > > >
> > > > > > > Why do unresolved columns not return null?
> > > > > > >
> > > > > > > This makes no sense to me, as I think a fundamental tenet of Drill, when trying to eliminate ETL, is to return null for any missing fields.
> > > > > > >
> > > > > > > Please advise.
> > > > > > >
> > > > > > > Regards,
> > > > > > > -Stefán
> > > > > >
> > > > > > --
> > > > > > Kamesh.
