Hi Kamesh,

This is, strictly speaking, not the same issue, even though the two have the Avro schema validation aspect in common.

Regards,
-Stefán

On Fri, Dec 18, 2015 at 2:17 AM, Kamesh <[email protected]> wrote:

> If there are any suggestions, can we take them to the JIRA? I feel there is already a JIRA for this.
>
> https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070
>
> On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter <[email protected]> wrote:
>
> > Hi,
> >
> > Directory pruning is great. It allows us, for example, to do efficient date-range queries even when our data is arranged in a day or week based directory structure.
> >
> > We would like to be able to run the same query for all this data even though the schema has changed slightly (new fields added) over time.
> >
> > For me there are two things in this scenario that are unreasonable:
> >
> > 1. For Drill to have to get the schema for all possible files (union based) to validate queries
> >    - adding 100s of *irrelevant* files to the mix
> >
> > 2. For Drill to fail the query because a field is not found in the sub-set (directory pruned sub-set)
> >
> > The current approach results in option 2 and the proposed solution results in option 1 (as I understand it).
> >
> > We would be perfectly happy with unknown fields resulting in null, as there are many ways to deal with null values built into Drill.
> >
> > Hopefully this a) makes sense and b) is acceptable.
> >
> > Enforcing a strict schema for Avro could be an optional feature (IMO).
> >
> > Regards,
> > -Stefán
> >
> > On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau <[email protected]> wrote:
> >
> > > I think the main problem you're hitting is that we should do a union of all files. In that case, as long as the field is in a single file, we're going to let the field through.
> > >
> > > There is a balance between early termination and flexibility that we must provide.
> > > If someone types a field and it is guaranteed to not be in the data, the thinking is we should fail the query early as that is probably a mistake on the user's part. If it could be a valid field, we proceed with execution and null it out until we find something. That is the goal anyway. Clearly we have a bug here as we should never deny a possible or known field.
> > >
> > > I think of fields in three categories: known, possible, impossible. Impossible fields should fail to validate. Possible and known fields should validate and execute.
> > >
> > > With regards to Ted's concern: I agree that applying a filter shouldn't fail a query. That means we will either have to consider the complete union Schema before pruning files or consider all fields as either known or possible after pruning files.
> > >
> > > Stefan, if you haven't already, please open a bug that known fields are failing to validate in Avro and we will fix shortly. Sorry about the bug.
> > >
> > > On Dec 14, 2015 10:51 PM, "Stefán Baxter" <[email protected]> wrote:
> > >
> > > > Well, at least I'm not alone here.
> > > >
> > > > I think it must be time to set some ground rules for these things and what it means to support evolving schema and what is needed to eliminate ETL.
> > > >
> > > > I trust that enforcing a strict schema "just because we think we can" must go against the principles of such rules.
> > > >
> > > > We moved all our stuff to Avro to avoid various problems with type handling (assuming Double on nulls etc.) and to be hit with this, after all that work, is like a slap in the face with two pilchards (more here: https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> > > >
> > > > Regards,
> > > > -Stefán
> > > >
> > > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning <[email protected]> wrote:
> > > >
> > > > > Sigh of relief is premature.
> > > > > Nobody has committed to carrying this interpretation forward.
> > > > >
> > > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter <[email protected]> wrote:
> > > > >
> > > > > > /me sighs of relief
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning <[email protected]> wrote:
> > > > > >
> > > > > > > Actually, even without multiple storage types, this could be radically confusing.
> > > > > > >
> > > > > > > If I have many avro files that are partitioned into directories, then queries that use the partitioning to limit the files that I see could include or exclude more recent files that have added a new field.
> > > > > > >
> > > > > > > That means that a query would succeed or fail according to which date range I use for the query.
> > > > > > >
> > > > > > > That seems pretty radically bad.
> > > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > This simply cannot be the desired behavior!
> > > > > > > >
> > > > > > > > This prevents us from using a field from a changing schema with dir0 sub-selection (directory pruning), as the altered/full schema is never part of the query and it subsequently fails.
> > > > > > > >
> > > > > > > > Drill should, IMOP, never have rules that are dependent on the underlying storage type. If the query runs with JSON and Parquet then it should work for Avro as well.
> > > > > > > >
> > > > > > > > I'm hoping this strict schema validation is all just a misunderstanding.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > -Stefán
> > > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > For Avro files, we first construct the schema, and this schema is used for validating queries. So, if there are any errors in the query (like the invalid field references) it will fail fast. As of now, for other file formats, query validation (checking for invalid field reference) does not happen, and at run time, it constructs the schema for them and hence nulls for invalid fields.
> > > > > > > > >
> > > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I'm getting the following error when querying Avro files:
> > > > > > > > > >
> > > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57: Column 'some_col' not found in any table
> > > > > > > > > >
> > > > > > > > > > It's true that the field is in none of the tables I'm targeting, in that particular query, but that does not mean that it is in none of the possible files I could be querying.
> > > > > > > > > >
> > > > > > > > > > We use Avro to get the benefits of the schema but I never expected Drill to enforce it this way.
> > > > > > > > > >
> > > > > > > > > > Why do unresolved columns not return null?
> > > > > > > > > >
> > > > > > > > > > This makes no sense to me, as I think a fundamental trait of Drill, when trying to eliminate ETL, is to return null for any missing fields.
> > > > > > > > > >
> > > > > > > > > > Please advise.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > -Stefán
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Kamesh.
>
> --
> Kamesh.
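
To make the three categories from Jacques's message (known, possible, impossible) concrete, here is a minimal sketch in Python. It is not Drill's implementation: the names `FieldStatus`, `classify_field` and `validate` are hypothetical, and each file's schema is assumed to be a plain set of field names. Note that telling "possible" apart from "impossible" requires the schemas of all files, which is exactly the cost raised as option 1 in the thread; the alternative Jacques mentions is to treat every unresolved field as possible after pruning.

```python
from enum import Enum


class FieldStatus(Enum):
    KNOWN = "known"            # present in a file that survives directory pruning
    POSSIBLE = "possible"      # absent after pruning, present in some other file
    IMPOSSIBLE = "impossible"  # absent from every file's schema


def classify_field(field, pruned_schemas, all_schemas):
    """Classify one referenced field against per-file schemas (sets of names)."""
    if any(field in schema for schema in pruned_schemas):
        return FieldStatus.KNOWN
    if any(field in schema for schema in all_schemas):
        return FieldStatus.POSSIBLE
    return FieldStatus.IMPOSSIBLE


def validate(fields, pruned_schemas, all_schemas):
    """Fail early only on impossible fields; possible ones would be read as null."""
    statuses = {f: classify_field(f, pruned_schemas, all_schemas) for f in fields}
    impossible = sorted(f for f, s in statuses.items() if s is FieldStatus.IMPOSSIBLE)
    if impossible:
        raise ValueError("Column(s) not found in any table: " + ", ".join(impossible))
    return statuses


# Hypothetical layout: an older directory without 'some_col', a newer one with it.
old_day = {"id", "ts"}
new_day = {"id", "ts", "some_col"}

# The query is pruned to the old directory only, yet 'some_col' still validates.
statuses = validate(["id", "some_col"], [old_day], [old_day, new_day])
```

With this rule, the query from the original report would validate for any date range, and only a field that appears in no file at all (a likely typo) would fail early.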
