This is a bug in handling schema change. I would not expect this to happen in case your second query had select * rather than an explicit projection, select t.cool. Can you file a JIRA?
On Mon, Jun 22, 2015 at 9:20 AM, Andries Engelbrecht < [email protected]> wrote: > Ted, > > Agree fully, it definitely seems like a reproducible bug that should be > filed and resolved. > > > —Andries > > > > On Jun 22, 2015, at 9:14 AM, Ted Dunning <[email protected]> wrote: > > > Andries, > > > > That sounds like a reasonable suggestion, but the real problem is that it > > appears that having the field initially and then having the field be > > missing is OK, but if it is missing first and then present Drill blows a > > gasket. > > > > I think it looks like a bug. Very good and simple demo. > > > > > > > > On Mon, Jun 22, 2015 at 8:53 AM, Andries Engelbrecht < > > [email protected]> wrote: > > > >> A couple of things to try that I have found useful in the past. > >> > >> Pending if you want inner or outer joins, you may want to look at using > >> predicates to eliminate records that are not relevant to the join and > can > >> complicate the work Drill has to do. > >> > >> ie. add predicate "orders.cool is not null” > >> > >> Not only does it filter out the records that are not of interest (and > can > >> cause other challenges), but normally if you can apply predicates to > >> queries to reduce the working set that Drill has to join it can > >> substantially improve the performance for large data sets. Joins tend > to be > >> one of the more expensive operators in any execution engine, where > >> predicates tend to be a much easier operation to execute at large scale. > >> > >> —Andries > >> > >> > >> On Jun 22, 2015, at 7:19 AM, Christopher Matta <[email protected]> wrote: > >> > >>> I can confirm that this is reproducible: > >>> > >>> orders/111.json: > >>> > >>> { > >>> "tax" : 10, > >>> "id" : 111, > >>> "cust_id" : 333, > >>> "total" : 12, > >>> "demo" :10 > >>> } > >>> > >>> orders/222.json: > >>> > >>> { > >>> "cool": 20, > >>> "id" : 222, > >>> "cust_id" : 111, > >>> "total" : 12 > >>> } > >>> > >>> 1st query: > >>> > >>> 0: jdbc:drill:zk=sen11:5181,sen12:5181> SELECT customers.id, > orders.cool > >>> . . . . . . . . . . . . . . . . . . . > FROM > >>> `maprfs.cmatta`.`test/customers/*.json` customers, > >>> . . . . . . . . . . . . . . . . . . . > > >>> `maprfs.cmatta`.`test/orders/*.json` orders > >>> . . . . . . . . . . . . . . . . . . . > WHERE customers.id = > >> orders.cust_id > >>> . . . . . . . . . . . . . . . . . . . > AND customers.country = > 'FRANCE'; > >>> +------+-------+ > >>> | id | cool | > >>> +------+-------+ > >>> | 333 | null | > >>> +------+-------+ > >>> 1 row selected (0.258 seconds) > >>> > >>> Now change orders/111.json by moving the cool field from 222.json to > >>> 111.json: > >>> > >>> { > >>> "cool": 20, > >>> "tax" : 10, > >>> "id" : 111, > >>> "cust_id" : 333, > >>> "total" : 12, > >>> "demo" :10 > >>> } > >>> > >>> And removing cool from orders/222.json: > >>> > >>> { > >>> "id" : 222, > >>> "cust_id" : 111, > >>> "total" : 12 > >>> } > >>> > >>> Re-run the query: > >>> > >>> : jdbc:drill:zk=sen11:5181,sen12:5181> SELECT customers.id, > orders.cool > >>> . . . . . . . . . . . . . . . . . . . > FROM > >>> `maprfs.cmatta`.`test/customers/*.json` customers, > >>> . . . . . . . . . . . . . . . . . . . > > >>> `maprfs.cmatta`.`test/orders/*.json` orders > >>> . . . . . . . . . . . . . . . . . . . > WHERE customers.id = > >> orders.cust_id > >>> . . . . . . . . . . . . . . . . . . . > AND customers.country = > 'FRANCE'; > >>> java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR: > >>> java.lang.IllegalStateException: Failure while reading vector. > >>> Expected vector class of > >>> org.apache.drill.exec.vector.NullableIntVector but was holding vector > >>> class org.apache.drill.exec.vector.NullableVarCharVector. > >>> > >>> Fragment 0:0 > >>> > >>> [Error Id: 04e231ee-8bad-4ad2-aff3-6c0273befd2f on > >> se-node11.se.lab:31010] > >>> at sqlline.IncrementalRows.hasNext(IncrementalRows.java:73) > >>> at > >> > sqlline.TableOutputFormat$ResizingRowsProvider.next(TableOutputFormat.java:87) > >>> at sqlline.TableOutputFormat.print(TableOutputFormat.java:118) > >>> at sqlline.SqlLine.print(SqlLine.java:1583) > >>> at sqlline.Commands.execute(Commands.java:852) > >>> at sqlline.Commands.sql(Commands.java:751) > >>> at sqlline.SqlLine.dispatch(SqlLine.java:738) > >>> at sqlline.SqlLine.begin(SqlLine.java:612) > >>> at sqlline.SqlLine.start(SqlLine.java:366) > >>> at sqlline.SqlLine.main(SqlLine.java:259) > >>> > >>> > >>> > >>> Chris Matta > >>> [email protected] > >>> 215-701-3146 > >>> > >>> On Mon, Jun 22, 2015 at 10:13 AM, Tugdual Grall <[email protected]> > >> wrote: > >>> > >>>> Yes. > >>>> > >>>> On Mon, Jun 22, 2015 at 4:12 PM, Christopher Matta <[email protected]> > >>>> wrote: > >>>> > >>>>> Just to clarify, you run the *exact same query* once and it works, > then > >>>>> you remove say the “cool” field from orders/222.json and put it in > >>>>> orders/111.json and the next time the same query returns that error? > >>>>> > >>>>> > >>>>> Chris Matta > >>>>> [email protected] > >>>>> 215-701-3146 > >>>>> > >>>>> On Mon, Jun 22, 2015 at 9:59 AM, Tugdual Grall <[email protected]> > >> wrote: > >>>>> > >>>>>> Hello, > >>>>>> > >>>>>> In my use case I have several JSON documents that I need to query > >> using a > >>>>>> join. > >>>>>> The structure of each document can vary a lot (some fields a present > >> or > >>>>>> not > >>>>>> in documents) > >>>>>> > >>>>>> Sometimes the following exception is raised: > >>>>>> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM > ERROR: > >>>>>> java.lang.IllegalStateException: Failure while reading vector. > >> Expected > >>>>>> vector class of org.apache.drill.exec.vector.NullableIntVector but > was > >>>>>> holding vector class > >> org.apache.drill.exec.vector.NullableVarCharVector. > >>>>>> Fragment 0:0 [Error Id: 35c751bd-3ca0-4e4a-bbac-ad5823ce582f on > >>>>>> 192.168.99.13:31010] > >>>>>> > >>>>>> The queries: > >>>>>> > >>>>>> Following query works: > >>>>>> ----- > >>>>>> SELECT customers.id, orders.demo > >>>>>> FROM dfs.`/Users/tgrall/working/customers/*.json` customers, > >>>>>> dfs.`/Users/tgrall/working/orders/*.json` orders > >>>>>> WHERE customers.id = orders.cust_id > >>>>>> AND customers.country = 'FRANCE' > >>>>>> ----- > >>>>>> > >>>>>> Following query FAILS: > >>>>>> ----- > >>>>>> SELECT customers.id, orders.cool > >>>>>> FROM dfs.`/Users/tgrall/working/customers/*.json` customers, > >>>>>> dfs.`/Users/tgrall/working/orders/*.json` orders > >>>>>> WHERE customers.id = orders.cust_id > >>>>>> AND customers.country = 'FRANCE' > >>>>>> ----- > >>>>>> > >>>>>> > >>>>>> The documents: > >>>>>> > >>>>>> Here the files: > >>>>>> > >>>>>> ./customers/333.json > >>>>>> { > >>>>>> "id" : 333, > >>>>>> "name" : "Dave Smith", > >>>>>> "country" : "FRANCE" > >>>>>> } > >>>>>> > >>>>>> > >>>>>> ./orders/111.json > >>>>>> { > >>>>>> "tax" : 10, > >>>>>> "id" : 111, > >>>>>> "cust_id" : 333, > >>>>>> "total" : 12, > >>>>>> "demo" :10 > >>>>>> } > >>>>>> > >>>>>> ./orders/222.json > >>>>>> { > >>>>>> "cool":20, > >>>>>> "id" : 222, > >>>>>> "cust_id" : 111, > >>>>>> "total" : 12 > >>>>>> } > >>>>>> > >>>>>> > >>>>>> To reproduce the bug you may have to change the document (add/remove > >>>>>> cool, > >>>>>> tax fields) > >>>>>> > >>>>>> It looks like the schema is not "updated" on the fly in some case. > >>>>>> > >>>>>> Any idea how to workaround? Is that bug? > >>>>>> > >>>>>> Regards > >>>>>> Tug > >>>>>> > >>>>> > >>>>> > >>>> > >> > >> > >
