For schema validation of Avro files, we are thinking of taking the following steps. Please share your suggestions before we implement this.
- The schema validation feature will be configurable: the user can choose whether to enable it by setting a configuration property, similar to Pig AvroStorage <https://cwiki.apache.org/confluence/display/PIG/AvroStorage>.
- If the schema validation flag is set, we can consider the union schema of all the files in a directory, recursively.

On Fri, Dec 18, 2015 at 9:17 AM, Stefán Baxter wrote:

> Hi Kamesh,
>
> This is, strictly speaking, not the same issue, even though the two have
> the Avro schema validation aspect in common.
>
> Regards,
> -Stefán
>
> On Fri, Dec 18, 2015 at 2:17 AM, Kamesh wrote:
>
> > If there are any suggestions, can we take them to the JIRA? I feel there
> > is already a JIRA for this.
> >
> > https://issues.apache.org/jira/browse/DRILL-4120?focusedCommentId=15048070&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15048070
> >
> > On Thu, Dec 17, 2015 at 1:28 AM, Stefán Baxter wrote:
> >
> > > Hi,
> > >
> > > Directory pruning is great. It allows us, for example, to do efficient
> > > date-range queries even when our data is arranged in a day- or
> > > week-based directory structure.
> > >
> > > We would like to be able to run the same query over all this data even
> > > though the schema has changed slightly (new fields added) over time.
> > >
> > > For me there are two things in this scenario that are unreasonable:
> > >
> > > 1. For Drill to have to get the schema for all possible files (union
> > >    based) to validate queries
> > >    - adding 100s of *irrelevant* files to the mix
> > >
> > > 2. For Drill to fail the query because a field is not found in the
> > >    sub-set (directory-pruned sub-set)
> > >
> > > The current approach results in option 2 and the proposed solution
> > > results in option 1 (as I understand it).
> > >
> > > We would be perfectly happy with unknown fields resulting in null, as
> > > there are many ways to deal with null values built into Drill.
> > >
> > > Hopefully this a) makes sense and b) is acceptable.
> > >
> > > Enforcing a strict schema for Avro could be an optional feature (IMO).
> > >
> > > Regards,
> > > -Stefán
> > >
> > > On Wed, Dec 16, 2015 at 2:18 PM, Jacques Nadeau wrote:
> > >
> > > > I think the main problem you're hitting is that we should do a union
> > > > of all files. In that case, as long as the field is in a single file,
> > > > we're going to let the field through.
> > > >
> > > > There is a balance between early termination and flexibility that we
> > > > must provide. If someone types a field and it is guaranteed not to be
> > > > in the data, the thinking is we should fail the query early, as that
> > > > is probably a mistake on the user's part. If it could be a valid
> > > > field, we proceed with execution and null it out until we find
> > > > something. That is the goal anyway. Clearly we have a bug here, as we
> > > > should never deny a possible or known field.
> > > >
> > > > I think of fields in three categories: known, possible, impossible.
> > > > Impossible fields should fail to validate. Possible and known fields
> > > > should validate and execute.
> > > >
> > > > With regard to Ted's concern: I agree that applying a filter
> > > > shouldn't fail a query. That means we will either have to consider
> > > > the complete union schema before pruning files or consider all fields
> > > > as either known or possible after pruning files.
> > > >
> > > > Stefan, if you haven't already, please open a bug that known fields
> > > > are failing to validate in Avro and we will fix it shortly. Sorry
> > > > about the bug.
> > > >
> > > > On Dec 14, 2015 10:51 PM, "Stefán Baxter" wrote:
> > > >
> > > > > Well, at least I'm not alone here.
> > > > >
> > > > > I think it must be time to set some ground rules for these things:
> > > > > what it means to support an evolving schema and what is needed to
> > > > > eliminate ETL.
> > > > >
> > > > > I trust that enforcing a strict schema "just because we think we
> > > > > can" must go against the principles of such rules.
> > > > >
> > > > > We moved all our stuff to Avro to avoid various problems with type
> > > > > handling (assuming Double on nulls etc.) and to be hit with this,
> > > > > after all that work, is like a slap in the face with two pilchards
> > > > > (more here: https://www.youtube.com/watch?v=IhJQp-q1Y1s)
> > > > >
> > > > > Regards,
> > > > > -Stefán
> > > > >
> > > > > On Tue, Dec 15, 2015 at 1:10 AM, Ted Dunning wrote:
> > > > >
> > > > > > Sigh of relief is premature. Nobody has committed to carrying
> > > > > > this interpretation forward.
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 11:44 AM, Stefán Baxter wrote:
> > > > > >
> > > > > > > /me sighs of relief
> > > > > > >
> > > > > > > On Mon, Dec 14, 2015 at 7:28 PM, Ted Dunning wrote:
> > > > > > >
> > > > > > > > Actually, even without multiple storage types, this could be
> > > > > > > > radically confusing.
> > > > > > > >
> > > > > > > > If I have many Avro files that are partitioned into
> > > > > > > > directories, then queries that use the partitioning to limit
> > > > > > > > the files that I see could include or exclude more recent
> > > > > > > > files that have added a new field.
> > > > > > > >
> > > > > > > > That means that a query would succeed or fail according to
> > > > > > > > which date range I use for the query.
> > > > > > > >
> > > > > > > > That seems pretty radically bad.
> > > > > > > >
> > > > > > > > On Mon, Dec 14, 2015 at 9:33 AM, Stefán Baxter wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > This simply cannot be the desired behavior!
> > > > > > > > >
> > > > > > > > > It prevents using a field from a changing schema with dir0
> > > > > > > > > sub-selection (directory pruning), as the altered/full
> > > > > > > > > schema is never part of the query and it subsequently fails.
> > > > > > > > >
> > > > > > > > > Drill should, IMO, never have rules that are dependent on
> > > > > > > > > the underlying storage type. If the query runs with JSON
> > > > > > > > > and Parquet then it should work for Avro as well.
> > > > > > > > >
> > > > > > > > > I'm hoping this strict schema validation is all just a
> > > > > > > > > misunderstanding.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > -Stefán
> > > > > > > > >
> > > > > > > > > On Mon, Dec 14, 2015 at 3:28 PM, Kamesh wrote:
> > > > > > > > >
> > > > > > > > > > For Avro files, we first construct the schema, and this
> > > > > > > > > > schema is used for validating queries. So, if there are
> > > > > > > > > > any errors in the query (like invalid field references),
> > > > > > > > > > it will fail fast. As of now, for other file formats,
> > > > > > > > > > query validation (checking for invalid field references)
> > > > > > > > > > does not happen; the schema is constructed at run time,
> > > > > > > > > > and hence invalid fields come back as nulls.
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 14, 2015 at 2:36 PM, Stefán Baxter wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > I'm getting the following error when querying Avro
> > > > > > > > > > > files:
> > > > > > > > > > >
> > > > > > > > > > > Error: VALIDATION ERROR: From line 1, column 48 to line 1, column 57:
> > > > > > > > > > > Column 'some_col' not found in any table
> > > > > > > > > > >
> > > > > > > > > > > It's true that the field is in none of the tables I'm
> > > > > > > > > > > targeting in that particular query, but that does not
> > > > > > > > > > > mean that it is in none of the possible files I could
> > > > > > > > > > > be querying.
> > > > > > > > > > >
> > > > > > > > > > > We use Avro to get the benefits of the schema, but I
> > > > > > > > > > > never expected Drill to enforce it this way.
> > > > > > > > > > >
> > > > > > > > > > > Why do unresolved columns not return null?
> > > > > > > > > > >
> > > > > > > > > > > This makes no sense to me, as I think a fundamental
> > > > > > > > > > > trait of Drill, when trying to eliminate ETL, is to
> > > > > > > > > > > return null for any missing fields.
> > > > > > > > > > >
> > > > > > > > > > > Please advise.
> > > > > > > > > > >
> > > > > > > > > > > Regards,
> > > > > > > > > > > -Stefán
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Kamesh.
> >
> > --
> > Kamesh.

--
Kamesh.
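To make the proposal at the top of this thread easier to discuss, here is a rough sketch of the behaviour being described: a union schema over all directories, Jacques's known/possible/impossible categories, and an optional strict flag. It is not Drill code. Schemas are reduced to sets of field names, and `SCHEMAS_BY_DIR`, `union_fields`, `classify`, `validate`, `project` and the `strict` flag are all hypothetical names made up for illustration. The reading of "known" as "present in the pruned subset" and "possible" as "present only elsewhere in the union" is an assumption as well.

```python
from typing import Dict, Iterable, List, Optional, Set

# Hypothetical stand-in for the Avro schemas found under each directory.
# Only field names are modelled; real Avro schemas also carry types.
SCHEMAS_BY_DIR: Dict[str, Set[str]] = {
    "2015/12/01": {"id", "ts"},
    "2015/12/15": {"id", "ts", "some_col"},  # newer files added a field
}


def union_fields(schemas: Iterable[Set[str]]) -> Set[str]:
    """Union of field names across a collection of file schemas."""
    merged: Set[str] = set()
    for schema in schemas:
        merged |= schema
    return merged


def classify(field: str, pruned_dirs: List[str]) -> str:
    """Classify a field as 'known', 'possible' or 'impossible'.

    Assumption: 'known' means the field is in the directory-pruned subset,
    'possible' means it appears only in the full union, and 'impossible'
    means it appears in no file at all.
    """
    if field in union_fields(SCHEMAS_BY_DIR[d] for d in pruned_dirs):
        return "known"
    if field in union_fields(SCHEMAS_BY_DIR.values()):
        return "possible"
    return "impossible"


def validate(fields: List[str], pruned_dirs: List[str], strict: bool) -> None:
    """Fail fast on impossible fields, but only when strict validation is on."""
    if not strict:
        return  # validation disabled: missing fields simply read as null
    bad = [f for f in fields if classify(f, pruned_dirs) == "impossible"]
    if bad:
        raise ValueError(f"Column(s) not found in any file: {bad}")


def project(record: Dict[str, object], fields: List[str]) -> Dict[str, Optional[object]]:
    """Read the requested fields from one record, nulling out missing ones."""
    return {f: record.get(f) for f in fields}
```

With this sketch, a query for `some_col` restricted to `2015/12/01` passes validation even in strict mode, because the field is "possible", and `project` returns it as `None` for the older records. Only a field that appears in no file at all is rejected, and only when `strict` is set. Note that the sketch still needs the union over every directory to tell "possible" from "impossible", which is the cost Stefán raises as option 1.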
