Re: Avro - Schema is good - Schema validation is bad

Kamesh Fri, 25 Dec 2015 06:58:43 -0800

To implement schema validation for Avro files, we are considering the
following steps. Please share your suggestions before we implement them.


   - Schema validation will be configurable: the user can enable or disable
   it by setting a configuration property, as with Pig's AvroStorage
   <https://cwiki.apache.org/confluence/display/PIG/AvroStorage>.
   - If the schema validation flag is set, we validate against the union
   schema of all the files in a directory, collected recursively.
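
To make the two steps above concrete, here is a minimal sketch in plain
Python. It is only an illustration under stated assumptions, not Drill code:
the names union_fields, validate and the strict flag are hypothetical, and
each file's Avro schema is stood in for by a simple dict of field names.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical stand-in for per-file Avro schemas: path -> {field name: Avro type}.
FileSchemas = Dict[str, Dict[str, str]]


def union_fields(schemas: FileSchemas, root: str) -> Set[str]:
    """Collect every field name declared by any file under `root` (recursively)."""
    prefix = root.rstrip("/") + "/"
    fields: Set[str] = set()
    for path, schema in schemas.items():
        if path.startswith(prefix):
            fields.update(schema)
    return fields


def validate(query_fields: Iterable[str], schemas: FileSchemas, root: str,
             strict: bool = True) -> List[str]:
    """Return the query fields missing from the union schema.

    With strict=False (validation switched off) nothing is rejected.
    """
    if not strict:
        return []
    known = union_fields(schemas, root)
    return [f for f in query_fields if f not in known]


# Assumed example data: a newer file adds the field 'some_col'.
schemas = {
    "/data/2015/12/13/a.avro": {"id": "long", "name": "string"},
    "/data/2015/12/14/b.avro": {"id": "long", "name": "string", "some_col": "string"},
}
```

Validating against the whole tree accepts some_col, while validating against
only one day's directory rejects it. That difference is the reason for taking
the union over the full directory rather than over a subset of it.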


On Fri, Dec 18, 2015 at 9:17 AM, Stefán Baxter <[email protected]>
wrote:

> Hi Kamesh,
>
> This is, strictly speaking, not the same issue, even though both concern
> Avro schema validation.
>
> Regards,
>  -Stefán
>
> On Fri, Dec 18, 2015 at 2:17 AM, Kamesh <[email protected]> wrote:
>
> > If there are any suggestions, can we take them to the JIRA? I believe
> > there is already a JIRA for this.
> >
> >
> https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070
> >
> >
> > On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > Directory pruning is great. It allows us, for example, to do efficient
> > > date-range queries even when our data is arranged in a day- or
> > > week-based directory structure.
> > >
> > > We would like to be able to run the same query for all this data even
> > > though the schema has changed slightly (new fields added) over time.
> > >
> > > For me there are two things in this scenario that are unreasonable:
> > >
> > >    1. For Drill to have to get the schema for all possible files (union
> > >    based) to validate queries
> > >    - adding 100s of *irrelevant* files to the mix
> > >
> > >    2. For Drill to fail the query because a field is not found in the
> > >    sub-set (directory pruned sub-set)
> > >
> > > The current approach results in option 2, and the proposed solution
> > > results in option 1 (as I understand it).
> > >
> > > We would be perfectly happy with unknown fields resulting in null, as
> > > there are many ways to deal with null values built into Drill.
> > >
> > > Hopefully this a) makes sense and b) is acceptable.
> > >
> > > Enforcing a strict schema for Avro could be an optional feature (IMO).
> > >
> > > Regards,
> > >   -Stefán
> > >
> > > On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <[email protected]>
> > > wrote:
> > >
> > > > I think the main problem you're hitting is that we should do a union
> > > > of all files. In that case, as long as the field is in a single file,
> > > > we're going to let the field through.
> > > >
> > > > There is a balance between early termination and flexibility that we
> > > > must provide. If someone types a field and it is guaranteed to not be
> > > > in the data, the thinking is we should fail the query early, as that
> > > > is probably a mistake on the user's part. If it could be a valid
> > > > field, we proceed with execution and null it out until we find
> > > > something. That is the goal anyway. Clearly we have a bug here, as we
> > > > should never deny a possible or known field.
> > > >
> > > > I think of fields in three categories: known, possible, impossible.
> > > > Impossible fields should fail to validate. Possible and known fields
> > > > should validate and execute.
> > > >
> > > > With regard to Ted's concern: I agree that applying a filter
> > > > shouldn't fail a query. That means we will either have to consider
> > > > the complete union schema before pruning files or consider all fields
> > > > as either known or possible after pruning files.
> > > >
> > > > Stefan, if you haven't already, please open a bug that known fields
> > > > are failing to validate in Avro and we will fix it shortly. Sorry
> > > > about the bug.
> > > >
> > > > On Dec 14, 2015 10:51 PM, "Stefán Baxter" <[email protected]>
> > > > wrote:
> > > >
> > > > > Well, at least I'm not alone here.
> > > > >
> > > > > I think it must be time to set some ground rules for these things
> > > > > and what it means to support an evolving schema and what is needed
> > > > > to eliminate ETL.
> > > > >
> > > > > I trust that enforcing a strict schema "just because we think we
> > > > > can" must go against the principles of such rules.
> > > > >
> > > > > We moved all our stuff to Avro to avoid various problems with type
> > > > > handling (assuming Double on nulls etc.), and to be hit with this,
> > > > > after all that work, is like a slap in the face with two pilchards
> > > > > (more here: https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> > > > >
> > > > > Regards,
> > > > >  -Stefán
> > > > >
> > > > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Sigh of relief is premature. Nobody has committed to carrying this
> > > > > > interpretation forward.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > /me sighs of relief
> > > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <[email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Actually, even without multiple storage types, this could be
> > > > > > > > radically confusing.
> > > > > > > >
> > > > > > > > If I have many Avro files that are partitioned into directories,
> > > > > > > > then queries that use the partitioning to limit the files that I
> > > > > > > > see could include or exclude more recent files that have added a
> > > > > > > > new field.
> > > > > > > >
> > > > > > > > That means that a query would succeed or fail according to which
> > > > > > > > date range I use for the query.
> > > > > > > >
> > > > > > > > That seems pretty radically bad.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <[email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > This simply cannot be the desired behavior!
> > > > > > > > >
> > > > > > > > > This prevents using a field from a changing schema with dir0
> > > > > > > > > sub-selection (directory pruning), as the altered/full schema
> > > > > > > > > is never part of the query and it subsequently fails.
> > > > > > > > >
> > > > > > > > > Drill should, IMO, never have rules that are dependent on the
> > > > > > > > > underlying storage type. If the query runs with JSON and
> > > > > > > > > Parquet then it should work for Avro as well.
> > > > > > > > >
> > > > > > > > > I'm hoping this strict schema validation is all just a
> > > > > > > > > misunderstanding.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >  -Stefán
> > > > > > > > >
> > > > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > For Avro files, we first construct the schema, and this
> > > > > > > > > > schema is used for validating queries. So, if there are any
> > > > > > > > > > errors in the query (such as invalid field references), it
> > > > > > > > > > will fail fast. As of now, for other file formats, query
> > > > > > > > > > validation (checking for invalid field references) does not
> > > > > > > > > > happen; the schema is constructed at run time, and hence
> > > > > > > > > > invalid fields come back as nulls.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <[email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I'm getting the following error when querying Avro files:
> > > > > > > > > > >
> > > > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1,
> > > > > > > > > > > column 57: Column 'some_col' not found in any table
> > > > > > > > > > >
> > > > > > > > > > > It's true that the field is in none of the tables I'm
> > > > > > > > > > > targeting in that particular query, but that does not mean
> > > > > > > > > > > that it is in none of the possible files I could be
> > > > > > > > > > > querying.
> > > > > > > > > > >
> > > > > > > > > > > We use Avro to get the benefits of the schema, but I never
> > > > > > > > > > > expected Drill to enforce it this way.
> > > > > > > > > > >
> > > > > > > > > > > Why do unresolved columns not return null?
> > > > > > > > > > >
> > > > > > > > > > > This makes no sense to me, as I think a fundamental trait
> > > > > > > > > > > of Drill, when trying to eliminate ETL, is to return null
> > > > > > > > > > > for any missing fields.
> > > > > > > > > > >
> > > > > > > > > > > Please advise.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > >  -Stefán
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Kamesh.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Kamesh.
> >
>



-- 
Kamesh.
