Hi Kamesh,

This is, strictly speaking, not the same issue, even though they have the
Avro schema validation aspect in common.

Regards,
 -Stefán

On Fri, Dec 18, 2015 at 2:17 AM, Kamesh <[email protected]> wrote:

> If there are any suggestions, can we take them to the JIRA? I believe there is
> already a JIRA for this.
>
> https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070
>
>
> On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi,
> >
> > Directory pruning is great. It allows us, for example, to do efficient
> > date-range queries even when our data is arranged in a day or week based
> > directory structure.
> >
> > We would like to be able to run the same query for all this data even
> > though the schema has changed slightly (new fields added) over time.
> >
> > For me there are two things in this scenario that are unreasonable:
> >
> >    1. For Drill to have to get the schema for all possible files (union
> >    based) to validate queries
> >    - adding 100s of *irrelevant* files to the mix
> >
> >    2. For Drill to fail the query because a field is not found in the
> >    sub-set (directory pruned sub-set)
> >
> > The current approach results in option 2 and the proposed solution results
> > in option 1 (As I understand it)
> >
> > We would be perfectly happy with unknown fields resulting in null as there
> > are many ways to deal with null values built into Drill.
> >
> > Hopefully this a) makes sense and b) is acceptable.
> >
> > Enforcing a strict schema for Avro could be an optional feature (IMO).
> >
> > Regards,
> >   -Stefán
> >
> > On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <[email protected]>
> > wrote:
> >
> > > I think the main problem you're hitting is that we should do a union of all
> > > files. In that case, as long as the field is in a single file, we're going
> > > to let the field through.
> > >
> > > There is a balance between early termination and flexibility that we must
> > > provide. If someone types a field and it is guaranteed to not be in the
> > > data, the thinking is we should fail the query early as that is probably a
> > > mistake on the user's part.  If it could be a valid field, we proceed with
> > > execution and null it out until we find something.  That is the goal
> > > anyway. Clearly we have a bug here as we should never deny a possible or
> > > known field.
> > >
> > > I think of fields in three categories: known, possible, impossible.
> > > Impossible fields should fail to validate. Possible and known fields should
> > > validate and execute.
> > >
> > > With regards to Ted's concern: I agree that applying a filter shouldn't
> > > fail a query. That means we will either have to consider the complete union
> > > Schema before pruning files or consider all fields as either known or
> > > possible after pruning files.
> > >
> > > Stefan, if you haven't already, please open a bug that known fields are
> > > failing to validate in Avro and we will fix shortly. Sorry about the bug.
> > > On Dec 14, 2015 10:51 PM, "Stefán Baxter" <[email protected]>
> > > wrote:
> > >
> > > > Well, at least I'm not alone here.
> > > >
> > > > I think it must be time to set some ground rules for these things and what
> > > > it means to support evolving schema and what is needed to eliminate ETL.
> > > >
> > > > I trust that enforcing a strict schema "just because we think we can" must
> > > > go against the principles of such rules.
> > > >
> > > > We moved all our stuff to Avro to avoid various problems with type handling
> > > > (assuming Double on nulls etc.) and to be hit with this, after all that
> > > > work, is like a slap in the face with two pilchards (more here:
> > > > https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> > > >
> > > > Regards,
> > > >  -Stefán
> > > >
> > > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <[email protected]>
> > > > wrote:
> > > >
> > > > > Sigh of relief is premature.  Nobody has committed to carrying this
> > > > > interpretation forward.
> > > > >
> > > > >
> > > > >
> > > > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <[email protected]>
> > > > > > wrote:
> > > > >
> > > > > > /me sighs of relief
> > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <[email protected]>
> > > > > > > wrote:
> > > > > >
> > > > > > > > Actually, even without multiple storage types, this could be radically
> > > > > > > > confusing.
> > > > > > > >
> > > > > > > > If I have many avro files that are partitioned into directories, then
> > > > > > > > queries that use the partitioning to limit the files that I see could
> > > > > > > > include or exclude more recent files that have added a new field.
> > > > > > > >
> > > > > > > > That means that a query would succeed or fail according to which date
> > > > > > > > range I use for the query.
> > > > > > >
> > > > > > > That seems pretty radically bad.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <[email protected]>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > This simply can not be the desired behavior!
> > > > > > > >
> > > > > > > > > This prevents us from using a field from a changing schema with dir0
> > > > > > > > > sub-selection (directory pruning) as the altered/full schema is never
> > > > > > > > > part of the query and it subsequently fails.
> > > > > > > >
> > > > > > > > > Drill should, IMO, never have rules that are dependent on the underlying
> > > > > > > > > storage type. If the query runs with JSON and Parquet then it should work
> > > > > > > > > for Avro as well.
> > > > > > > >
> > > > > > > > > I'm hoping this strict schema validation is all just a misunderstanding.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >  -Stefán
> > > > > > > >
> > > > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > > For Avro files, we first construct the schema, and this schema is used for
> > > > > > > > > > validating queries. So, if there are any errors in the query (like
> > > > > > > > > > invalid field references), it will fail fast. As of now, for other file
> > > > > > > > > > formats, query validation (checking for invalid field references) does not
> > > > > > > > > > happen; the schema is constructed at run time, and hence invalid fields
> > > > > > > > > > return nulls.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <[email protected]>
> > > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I'm getting the following error when querying Avro files:
> > > > > > > > > >
> > > > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57:
> > > > > > > > > > > Column 'some_col' not found in any table
> > > > > > > > > >
> > > > > > > > > > > It's true that the field is in none of the tables I'm targeting, in that
> > > > > > > > > > > particular query, but that does not mean that it is in none of the possible
> > > > > > > > > > > files I could be querying.
> > > > > > > > > >
> > > > > > > > > > > We use Avro to get the benefits of the schema but I never expected Drill to
> > > > > > > > > > > enforce it this way.
> > > > > > > > > >
> > > > > > > > > > > Why do unresolved columns not return null?
> > > > > > > > > > >
> > > > > > > > > > > This makes no sense to me as I think a fundamental trait of Drill, when
> > > > > > > > > > > trying to eliminate ETL, is to return null for any missing fields.
> > > > > > > > > >
> > > > > > > > > > Please advise.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > >  -Stefán
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Kamesh.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Kamesh.
>
