Re: Avro - Schema is good - Schema validation is bad

Stefán Baxter Wed, 16 Dec 2015 11:58:50 -0800

Hi,

Directory pruning is great. It allows us, for example, to do efficient
date-range queries even when our data is arranged in a day- or week-based
directory structure.


We would like to be able to run the same query for all this data even
though the schema has changed slightly (new fields added) over time.

For me there are two things in this scenario that are unreasonable:

   1. For Drill to have to get the schema for all possible files (union
   based) to validate queries
   - adding 100s of *irrelevant* files to the mix

   2. For Drill to fail the query because a field is not found in the
   sub-set (directory pruned sub-set)

The current approach results in option 2 and the proposed solution results
in option 1 (as I understand it).

We would be perfectly happy with unknown fields resulting in null as there
are many ways to deal with null values built into Drill.
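
As a minimal illustration of the null-on-missing behaviour described above
(plain Python, not Drill code; the function name `project_with_nulls` and the
sample rows are hypothetical), a projection could simply yield null for any
field a record does not carry:

```python
from typing import Any, Dict, Iterable, List


def project_with_nulls(
    records: Iterable[Dict[str, Any]], columns: List[str]
) -> List[Dict[str, Any]]:
    """Project each record onto `columns`; fields a record lacks become None."""
    return [{col: rec.get(col) for col in columns} for rec in records]


# Hypothetical rows: the older files predate the `some_col` field.
old_rows = [{"id": 1}, {"id": 2}]
new_rows = [{"id": 3, "some_col": "x"}]

# A "pruned" query that only touches the older files still succeeds.
pruned = project_with_nulls(old_rows, ["id", "some_col"])

# The same query over all files returns the value where it exists.
full = project_with_nulls(old_rows + new_rows, ["id", "some_col"])
```

Under this sketch the same column list works whether or not the pruned
sub-set happens to include a file with the newer field.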

Hopefully this a) makes sense and b) is acceptable.

Enforcing a strict schema for Avro could be an optional feature (IMO).

Regards,
  -Stefán

On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <[email protected]> wrote:

> I think the main problem you're hitting is that we should do a union of all
> files. In that case, as long as the field is in a single file, we're going
> to let the field through.
>
> There is a balance between early termination and flexibility that we must
> strike. If someone types a field and it is guaranteed to not be in the
> data, the thinking is we should fail the query early as that is probably a
> mistake on the user's part.  If it could be a valid field, we proceed with
> execution and null it out until we find something.  That is the goal
> anyway. Clearly we have a bug here as we should never deny a possible or
> known field.
>
> I think of fields in three categories: known, possible, impossible.
> Impossible fields should fail to validate. Possible and known fields should
> validate and execute.
>
> With regards to Ted's concern: I agree that applying a filter shouldn't
> fail a query. That means we will either have to consider the complete union
> Schema before pruning files or consider all fields as either known or
> possible after pruning files.
>
> Stefan, if you haven't already, please open a bug that known fields are
> failing to validate in Avro and we will fix shortly. Sorry about the bug.
> On Dec 14, 2015 10:51 PM, "Stefán Baxter" <[email protected]>
> wrote:
>
> > Well, at least I'm not alone here.
> >
> > I think it must be time to set some ground rules for these things and what
> > it means to support evolving schema and what is needed to eliminate ETL.
> >
> > I trust that enforcing a strict schema "just because we think we can" must
> > go against the principles of such rules.
> >
> > We moved all our stuff to Avro to avoid various problems with type handling
> > (assuming Double on nulls etc.) and to be hit with this, after all that
> > work, is like a slap in the face with two pilchards (more here:
> > https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> >
> > Regards,
> >  -Stefán
> >
> > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <[email protected]>
> > wrote:
> >
> > > Sigh of relief is premature.  Nobody has committed to carrying this
> > > interpretation forward.
> > >
> > >
> > >
> > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <[email protected]>
> > > > wrote:
> > >
> > > > /me sighs of relief
> > > >
> > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <[email protected]>
> > > > wrote:
> > > >
> > > > > > Actually, even without multiple storage types, this could be radically
> > > > > > confusing.
> > > > >
> > > > > > If I have many Avro files that are partitioned into directories, then
> > > > > > queries that use the partitioning to limit the files that I see could
> > > > > > include or exclude more recent files that have added a new field.
> > > > >
> > > > > > That means that a query would succeed or fail according to which date
> > > > > > range I use for the query.
> > > > >
> > > > > That seems pretty radically bad.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <[email protected]>
> > > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > This simply can not be the desired behavior!
> > > > > >
> > > > > > > This prevents us from using a field from a changing schema with dir0
> > > > > > > sub-selection (directory pruning), as the altered/full schema is never
> > > > > > > part of the query and it subsequently fails.
> > > > > >
> > > > > > > Drill should, IMO, never have rules that are dependent on the underlying
> > > > > > > storage type. If the query runs with JSON and Parquet then it should work
> > > > > > > for Avro as well.
> > > > > >
> > > > > > > I'm hoping this strict schema validation is all just a misunderstanding.
> > > > > >
> > > > > > Regards,
> > > > > >  -Stefán
> > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <[email protected]> wrote:
> > > > > >
> > > > > > > > For Avro files, we first construct the schema, and this schema is used for
> > > > > > > > validating queries. So, if there are any errors in the query (like
> > > > > > > > invalid field references) it will fail fast. As of now, for other file
> > > > > > > > formats, query validation (checking for invalid field references) does not
> > > > > > > > happen, and at run time, it constructs the schema for them and hence
> > > > > > > > returns nulls for invalid fields.
> > > > > > >
> > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <[email protected]>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I'm getting the following error when querying Avro files:
> > > > > > > >
> > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57:
> > > > > > > > > Column 'some_col' not found in any table
> > > > > > > >
> > > > > > > > > It's true that the field is in none of the tables I'm targeting in that
> > > > > > > > > particular query, but that does not mean that it is in none of the
> > > > > > > > > possible files I could be querying.
> > > > > > > > >
> > > > > > > > > We use Avro to get the benefits of the schema but I never expected Drill
> > > > > > > > > to enforce it this way.
> > > > > > > >
> > > > > > > > > Why do unresolved columns not return null?
> > > > > > > > >
> > > > > > > > > This makes no sense to me as I think a fundamental trait of Drill, when
> > > > > > > > > trying to eliminate ETL, is to return null for any missing fields.
> > > > > > > >
> > > > > > > > Please advise.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >  -Stefán
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Kamesh.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
