Re: Handling schema change in blocking operators

2018-11-06 Thread Paul Rogers
Hi Boaz, As noted earlier, it would be wonderful if Drill could handle schema changes on the fly, using only the information in the files as they are read, and with only a few code changes. Alas, such is not the case. Question: is the goal to have schema changes somewhat less often (but they

Re: Handling schema change in blocking operators

2018-11-06 Thread Boaz Ben-Zvi
Hi Paul, (_a_) Having a "schema file" sounds like a contradiction to calling Drill "schema free"; maybe we could "sweep it under the mat" by creating a new convention for scanners, such that if a scanner has multiple files to read (e.g. f1.csv, f2.csv, ...), then if there's some file named

Re: Handling schema change in blocking operators

2018-11-06 Thread Paul Rogers
Hi Aman, I would completely agree with the analysis -- except for the fact that we can't create a general solution, only a patchwork of incomplete ad-hoc solutions. The question is not whether it would be useful to have a general solution (it would), rather whether it is technically possible

Re: Handling schema change in blocking operators

2018-11-06 Thread Aman Sinha
Hi Paul, Thanks for the feedback! I am completely in favor of doing the schema discovery and schema hinting. But even on this list in the past we have discussed other use cases such as IoT devices where schema-on-read is needed (I think it was in the context of the 'death of schema-on-read'

Re: Handling schema change in blocking operators

2018-11-05 Thread Paul Rogers
Hi Aman, Thanks much for the write-up. My two cents, FWIW. As the history of this list has shown, I've fought with the schema change issue multiple times: in sort, in JSON, in the row set loader framework, and in writing the "Data Engineering" chapter in the Learning Drill book. What I have

Handling schema change in blocking operators

2018-11-05 Thread Aman Sinha
Hi all, While we continue to enhance the schema provision and metastore aspects in Drill, we should also explore what it means to be truly schema-less such that we can better handle {semi, un}structured data, data sitting in DBs that store JSON documents (e.g. Mongo, MapR-DB). The blocking