Re: Querying Nested Data in Mongo does not produce the results.
Thanks Kathleen.

SELECT camic.provenance.image.case_id caseid
FROM mongo.users.`contacts2` camic
WHERE camic.provenance.image.case_id > 10;

works for me as well. I have also resolved the bug report that I opened as
"Invalid".

Regards,
Pradeeban.

On Thu, Aug 18, 2016 at 5:33 PM, Kathleen Li <k...@maprtech.com> wrote:

> Hi Pradeeban,
>
> Please check out the following doc:
> http://drill.apache.org/docs/select-statements/
>
> Column Aliases. You cannot reference column aliases in the following
> clauses:
>
> * WHERE
> * GROUP BY
> * HAVING
>
> The following works for me without using a column alias:
>
> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT
> camic.provenance.image.case_id caseid from test camic where
> camic.provenance.image.case_id > 10;
> +---------+
> | caseid  |
> +---------+
> | 100.0   |
> +---------+
>
> Kathleen
>
> On 8/18/16, 2:03 PM, "Pradeeban Kathiravelu" <kk.pradee...@gmail.com>
> wrote:
>
> >Hi,
> >I have this simple data in a Mongo database.
> >{"_id":{"$oid":"56a784b76952647b7b51c562"},"provenance":{"image":{"case_id":100,"subject_id":"TCGA"}}}
> >
> >When I run
> >
> >SELECT camic.provenance.image.case_id caseid
> >FROM mongo.users.`contacts2` camic
> >
> >I get the below.
> >
> >+---------+
> >| caseid  |
> >+---------+
> >| 100     |
> >+---------+
> >
> >When I run
> >
> >SELECT camic.provenance.image.case_id caseid
> >FROM mongo.users.`contacts2` camic
> >WHERE caseid > 10;
> >
> >I expect the same outcome as above.
> >
> >However, I get the below (no results).
> >
> >+---------+
> >| caseid  |
> >+---------+
> >+---------+
> >
> >tail -f sqlline.log indicates the below (see the highlighted line
> >specifically).
> >
> >2016-08-18 16:56:07,337 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:foreman] INFO o.a.drill.exec.work.foreman.Foreman - Query text for query id 2849df17-8fbd-eb57-03c8-cb2181bc81c7: SELECT camic.provenance.image.case_id caseid FROM mongo.users.`contacts2` camic WHERE caseid > 10
> >2016-08-18 16:56:08,491 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.s.m.MongoScanBatchCreator - Number of record readers initialized : 1
> >2016-08-18 16:56:08,512 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State change requested AWAITING_ALLOCATION --> RUNNING
> >2016-08-18 16:56:08,512 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.w.f.FragmentStatusReporter - 2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State to report: RUNNING
> >2016-08-18 16:56:08,513 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.s.mongo.MongoRecordReader - Filters Applied : Document{{}}
> >2016-08-18 16:56:08,513 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.s.mongo.MongoRecordReader - Fields Selected : Document{{_id=0, caseid=1, provenance=1}}
> >*2016-08-18 16:56:08,514 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] WARN o.a.d.e.e.ExpressionTreeMaterializer - Unable to find value vector of path `caseid`, returning null instance.*
> >2016-08-18 16:56:08,520 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State change requested RUNNING --> FINISHED
> >2016-08-18 16:56:08,520 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.w.f.FragmentStatusReporter - 2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State to report: FINISHED
> >
> >Tested with 1.6.0 and 1.7.0.
> >Reported this as DRILL-4855
> ><https://issues.apache.org/jira/browse/DRILL-4855>
> >
> >Can someone confirm whether this is a bug, or am I simply doing something
> >that is not supported, or doing something wrong?
> >
> >Thank you.
> >Regards,
> >Pradeeban.
> >
> >--
> >Pradeeban Kathiravelu.
> >PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
> >INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa, Portugal.
> >Biomedical Informatics Software Engineer, Emory University School of Medicine.
> >
> >Blog: [Llovizna] http://kkpradeeban.blogspot.com/
> >LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03

--
Pradeeban Kathiravelu.
PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa, Portugal.
Biomedical Informatics Software Engineer, Emory University School of Medicine.

Blog: [Llovizna] http://kkpradeeban.blogspot.com/
LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03
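The resolution of this thread (repeat the full path instead of the alias in WHERE) follows from SQL's logical evaluation order, which can be restated as a plain-Python sketch (not Drill internals): the WHERE phase runs before the SELECT list assigns aliases, so `caseid` does not exist yet when the filter is evaluated.

```python
# One record shaped like the thread's Mongo document.
docs = [{"provenance": {"image": {"case_id": 100, "subject_id": "TCGA"}}}]

# WHERE phase: the alias is not yet defined, so the filter must repeat
# the full nested path (camic.provenance.image.case_id in the SQL).
filtered = [d for d in docs if d["provenance"]["image"]["case_id"] > 10]

# SELECT (projection) phase: only here does the alias `caseid` come into being.
result = [{"caseid": d["provenance"]["image"]["case_id"]} for d in filtered]
print(result)  # [{'caseid': 100}]
```

This mirrors why Kathleen's query with the repeated path returns the row while the alias-based filter silently matched nothing.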
Re: Nested/n-dimensional Arrays in Mongo/Json and Drill
Hi Sudheesh,

Probably I was not clear in my previous email. The error is due to accessing
an array from the SELECT directly, which is not supported by Drill. My
question was how to overcome this hurdle without using the expensive FLATTEN
keyword or array indices (impractical, as the data is not structured, we do
not know the number of elements, the element count can be large, and it is a
3-D array). I am planning to use Protobufs with Drill to serialize and
deserialize the data.

Anyway, since you asked, given below are the error logs when accessing the
nested array element using SELECT. (The entire query and the sample data
were given in the previous email.)

Error: SYSTEM ERROR: NullPointerException

Fragment 0:0

[Error Id: 98390161-00e7-4982-b40b-171c8f1029c3 on llovizna:31010]

  (java.lang.NullPointerException) null
    org.apache.drill.exec.vector.complex.impl.MapOrListWriterImpl.list():73
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
    org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():139
    org.apache.drill.exec.store.bson.BsonRecordReader.write():75
    org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
    org.apache.drill.exec.physical.impl.ScanBatch.next():191
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():744 (state=,code=0)

Thank you.
Regards,
Pradeeban.

On Tue, Sep 13, 2016 at 8:34 PM, Sudheesh Katkam <skat...@maprtech.com> wrote:

> Hi Pradeeban,
>
> Can you post the detailed error message?
>
> First set the option:
>
> SET `exec.errors.verbose` = true;
>
> And then run the query. The detailed output will point us to where the
> error occurred.
>
> Thank you,
> Sudheesh
>
> > On Sep 12, 2016, at 10:47 AM, Pradeeban Kathiravelu <kk.pradee...@gmail.com> wrote:
> >
> > Hi,
> > I have complex JSON data stored in Mongo.
> >
> > The data has nested arrays as shown below.
> >
> > Here, when I try to access the geometry field in SELECT queries, the
> > Drill query fails.
> >
> > select camic._id, camic.type, camic.parent_id, camic.randval,
> > camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
> > camic.bbox, camic.geometry, camic.footprint, camic.properties,
> > camic.provenance, camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as
> > camic WHERE ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1')
> > AND (camic.provenance.analysis.execution_id = 'tammy-test:7') AND
> > (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <= 1) AND
> > (camic.y >= 0) AND (camic.y <= 1));
> >
> > fails with the error below:
> >
> > *Error: SYSTEM ERRO
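For context on why the `geometry` column in particular trips the reader: in GeoJSON, a Polygon's `coordinates` member is a three-level nested list (rings of points of numbers). A minimal Python check of that shape (coordinate values below are made up, shaped like the thread's sample data):

```python
# A miniature GeoJSON Polygon, shaped like the thread's sample document
# (the coordinate values are made up for illustration).
geometry = {"type": "Polygon",
            "coordinates": [[[0.0427, 0.9368], [0.0427, 0.9369],
                             [0.0427, 0.9373], [0.0427, 0.9368]]]}

def depth(x):
    """Nesting depth of a list-of-lists; 0 for a scalar."""
    return 1 + max(map(depth, x)) if isinstance(x, list) else 0

# Three list levels: rings -> points -> numbers. This LIST-of-LIST-of-LIST
# shape is what the recursive writeToListOrMap calls in the stack trace
# above are walking when the NullPointerException is thrown.
print(depth(geometry["coordinates"]))  # 3
```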
Nested/n-dimensional Arrays in Mongo/Json and Drill
Hi,
I have complex JSON data stored in Mongo.

The data has nested arrays as shown below. Here, when I try to access the
geometry field in SELECT queries, the Drill query fails.

select camic._id, camic.type, camic.parent_id, camic.randval,
camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
camic.bbox, camic.geometry, camic.footprint, camic.properties,
camic.provenance, camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as
camic WHERE ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1')
AND (camic.provenance.analysis.execution_id = 'tammy-test:7') AND
(camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <= 1) AND
(camic.y >= 0) AND (camic.y <= 1));

fails with the error below:

*Error: SYSTEM ERROR: NullPointerException
Fragment 0:0
[Error Id: 8cd950af-91fa-4cf8-865b-265f227c8e87 on llovizna:31010] (state=,code=0)*

However, after removing geometry from the select query, it works just fine.

select camic._id, camic.type, camic.parent_id, camic.randval,
camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
camic.bbox, camic.footprint, camic.properties, camic.provenance,
camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as camic WHERE
((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1') AND
(camic.provenance.analysis.execution_id = 'tammy-test:7') AND
(camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <= 1) AND
(camic.y >= 0) AND (camic.y <= 1));

I know this is due to the array in the output. I am also aware of the
commonly suggested options.

1. Using the array indexes in the select query: This is *impractical*. I do
not know how many elements I would have in this geojson - the coordinates.
It may be millions, or as low as 3.

2. The FLATTEN keyword: I am using Drill on top of Mongo - and finding an
interesting case where Drill in a distributed execution outperforms Mongo
alone on certain queries. Using FLATTEN basically kills all the performance
benefits I otherwise have with Drill. FLATTEN is just a plain *expensive*
operation at the scale of my data (around 48 GB, though I can split it into
a few GB each).

Now, given that I have explained why I cannot use the commonly suggested
options to deal with complex Drill queries involving arrays, what are the
other alternatives? I am thinking of using Protobufs with Drill to
serialize the nested arrays. Has anyone else tried it before? Any pointers,
or any other suggestions on this overall requirement of querying nested
arrays?

Thank you.
Regards,
Pradeeban.

The sample data in Mongo (data anonymized):

> db.testUAIM2.findOne()
{
    "_id" : ObjectId("22"),
    "bbox" : [ 0.11, 0.33, 0.44, 0.44 ],
    "*geometry*" : {
        "type" : "Polygon",
        "coordinates" : [ [
            [ 0.04272915795445442, 0.9368849396705627 ],
            [ 0.04272915795445442, 0.9369083046913147 ],
            [ 0.042739588767290115, 0.9369083046913147 ],
            [ 0.042739588767290115, 0.9369550347328186 ],
            [ 0.04275001958012581, 0.9369550347328186 ],
            [ 0.04275001958012581, 0.9370251893997192 ],
            [ 0.042760446667671204, 0.9370251893997192 ],
            [ 0.042760446667671204, 0.9371420741081238 ],
            [ 0.04275001958012581, 0.9371420741081238 ],
            [ 0.04275001958012581, 0.9372121691703796 ],
            [ 0.042739588767290115, 0.9372121691703796 ],
            [ 0.042739588767290115, 0.9372823238372803 ],
            [ 0.04272915795445442, 0.9372823238372803 ],
            [ 0.04272915795445442, 0.9373056888580322 ],
            [ 0.04272915795445442, 0.9368849396705627 ]
        ] ]
    },
    "footprint" : 5,
    "properties" : { "scalar_features" : [ { "nv" : [ { "name" : "f", "valu
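To make the FLATTEN cost the email describes concrete, here is a plain-Python analogue (a sketch, not Drill code; the values are made up): each FLATTEN unnests one list level, so a 3-D array needs nested FLATTEN calls, and each call multiplies the row count until one document becomes one row per innermost element.

```python
# Miniature geometry document, shaped like the sample data above.
geometry = {"coordinates": [[[0.0427, 0.9368], [0.0427, 0.9369],
                             [0.0427, 0.9373]]]}

rings = geometry["coordinates"]                # 1 document, 1 ring
points = [p for ring in rings for p in ring]   # ~ FLATTEN once: 3 rows
values = [v for p in points for v in p]        # ~ FLATTEN twice: 6 rows
print(len(rings), len(points), len(values))    # 1 3 6
```

With millions of coordinate pairs per document, as the email anticipates, the intermediate row explosion is the cost being avoided.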
Drill with Proto Buffers or Apache Thrift
Hi,
We are evaluating Drill for data with multi-dimensional arrays. We would
like to keep the overhead low, so we decided against using flatten() to
query the multi-dimensional array. Similarly, using indices to refer to the
array elements is simply infeasible, as our array is dynamic and we will
not know the number of elements present in the array (the array represents
the coordinates in a geojson).

We are evaluating the potential of using Proto Buffers to serialize the
multi-dimensional array first, before querying the data with Drill - thus
avoiding the error:

*Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST*

Please note that while our query results include these arrays (as in
"select *"), we are not querying the array itself with Drill. Rather, we
are querying the other attributes associated with the same object. Hence it
is theoretically possible to query while the array remains serialized. Our
data is originally in JSON format, hence the complex structure.

However, we have some questions on the architectural feasibility without
draining the performance of Drill and Proto Buffers. There is no doubt that
each is highly performant on its own; however, we are skeptical about using
them combined. Is there any development effort on serialization with
Protocol Buffers and/or Apache Thrift? Any storage plugins developed, or
similar deployment architectures, as in:

*Data with multi-dimensional array -> Data with the multi-dimensional array
serialized with Protocol Buffers -> Query with Drill -> Deserialize the
multi-dimensional arrays in the query results back with Protocol Buffers*?

Please share your thoughts on this (whether you have attempted this, or
there is something that I am failing to see). We have also tried other
alternatives, such as using CTAS, and a potential change to the data source
schema from multi-dimensional arrays to a map [1]. We do not mind the
initial performance hit of the conversions - this is just a one-time cost.
What matters is the consequent read queries: they should be efficient and
fast, as when using Drill without multi-dimensional arrays.

[1] http://kkpradeeban.blogspot.com/search/label/Drill

Thank you.
Regards,
Pradeeban.

--
Pradeeban Kathiravelu.
PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa, Portugal.
Biomedical Informatics Software Engineer, Emory University School of Medicine.

Blog: [Llovizna] http://kkpradeeban.blogspot.com/
LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03
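A minimal Python sketch of the pipeline proposed above, with JSON + Base64 standing in for Protocol Buffers (this is an assumption for illustration; no Drill protobuf storage plugin is implied). The point is only that the array travels as an opaque scalar string, so the query engine never sees an unsupported LIST, and the client deserializes it after the results come back.

```python
import base64
import json

doc = {"footprint": 5,
       "geometry": {"type": "Polygon",
                    "coordinates": [[[0.0427, 0.9368], [0.0427, 0.9369]]]}}

# Write path: serialize the multi-dimensional array into an opaque string,
# so the query engine would see a VARCHAR instead of a LIST.
stored = dict(doc)
stored["geometry"] = base64.b64encode(
    json.dumps(doc["geometry"]).encode("utf-8")).decode("ascii")

# Query path: filter on scalar attributes only; the array is never touched.
matches = [d for d in [stored] if d["footprint"] >= 5]

# Read path: the client deserializes the array after the query returns.
restored = json.loads(base64.b64decode(matches[0]["geometry"]))
print(restored == doc["geometry"])  # True
```

A real protobuf version would replace the `json.dumps`/`json.loads` pair with generated message classes, but the write-query-read split is the same.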
Querying Nested Data in Mongo does not produce the results.
Hi,
I have this simple data in a Mongo database.

{"_id":{"$oid":"56a784b76952647b7b51c562"},"provenance":{"image":{"case_id":100,"subject_id":"TCGA"}}}

When I run

SELECT camic.provenance.image.case_id caseid
FROM mongo.users.`contacts2` camic

I get the below.

+---------+
| caseid  |
+---------+
| 100     |
+---------+

When I run

SELECT camic.provenance.image.case_id caseid
FROM mongo.users.`contacts2` camic
WHERE caseid > 10;

I expect the same outcome as above. However, I get the below (no results).

+---------+
| caseid  |
+---------+
+---------+

*tail -f sqlline.log* indicates the below (see the highlighted line
specifically).

2016-08-18 16:56:07,337 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:foreman] INFO o.a.drill.exec.work.foreman.Foreman - Query text for query id 2849df17-8fbd-eb57-03c8-cb2181bc81c7: SELECT camic.provenance.image.case_id caseid FROM mongo.users.`contacts2` camic WHERE caseid > 10
2016-08-18 16:56:08,491 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.s.m.MongoScanBatchCreator - Number of record readers initialized : 1
2016-08-18 16:56:08,512 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State change requested AWAITING_ALLOCATION --> RUNNING
2016-08-18 16:56:08,512 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.w.f.FragmentStatusReporter - 2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State to report: RUNNING
2016-08-18 16:56:08,513 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.s.mongo.MongoRecordReader - Filters Applied : Document{{}}
2016-08-18 16:56:08,513 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.s.mongo.MongoRecordReader - Fields Selected : Document{{_id=0, caseid=1, provenance=1}}
*2016-08-18 16:56:08,514 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] WARN o.a.d.e.e.ExpressionTreeMaterializer - Unable to find value vector of path `caseid`, returning null instance.*
2016-08-18 16:56:08,520 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.w.fragment.FragmentExecutor - 2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State change requested RUNNING --> FINISHED
2016-08-18 16:56:08,520 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0] INFO o.a.d.e.w.f.FragmentStatusReporter - 2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State to report: FINISHED

Tested with 1.6.0 and 1.7.0. Reported this as DRILL-4855
<https://issues.apache.org/jira/browse/DRILL-4855>

Can someone confirm whether this is a bug, or am I simply doing something
that is not supported, or doing something wrong?

Thank you.
Regards,
Pradeeban.

--
Pradeeban Kathiravelu.
PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa, Portugal.
Biomedical Informatics Software Engineer, Emory University School of Medicine.

Blog: [Llovizna] http://kkpradeeban.blogspot.com/
LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03
Re: Interesting Errors
If I understand correctly, he was using the character ・. Different
languages have different symbols; the same can be said about the Chinese
equivalent 。. The characters ・ and 。 are entirely different from ".".

Not sure whether this needs to be fixed. If I am not completely
misunderstanding something, he would also fail if he attempted to use that
character (in place of ".") in programming languages (Java, C, ...).

Regards,
Pradeeban.

On Tue, Aug 7, 2018 at 10:02 AM, Charles Givre wrote:

> Hello Drill Developers,
> I wanted to share an interesting development that happened yesterday. I
> was teaching a class at BlackHat, and we have a worksheet that includes a
> Drill demonstration using PyDrill. Basically, the students are asked to
> execute a query in Drill using PyDrill and then visualize the results.
>
> Anyway, a student from Japan tried this, and was getting all kinds of
> crazy errors. So I sat down and worked with him to debug. It turns out
> that the period on the Japanese keyboard maps to a different Unicode
> character than on US keyboards, and hence the queries throw errors. I
> discovered this because when I would cut/paste a query from a text file
> that I wrote, the query executed, but if we typed one in, it broke. After
> digging around a bit, I found that it was the period character.
>
> I'm not sure that this can or should be fixed, but I wanted to let people
> know about this.
>
> Best,
> — C

--
Pradeeban Kathiravelu.
Senior Systems Software Engineer, Emory University, Atlanta, GA, USA.
Ph.D. Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa, Portugal.
Université catholique de Louvain, Louvain-la-Neuve, Belgium.

Blog: [Llovizna] kkpradeeban.blogspot.com
LinkedIn: www.linkedin.com/in/kpradeeban
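The "period-like" characters discussed in this thread can be checked directly: each is a distinct Unicode code point, which is why a SQL parser expecting U+002E rejects the others just as a Java or C compiler would.

```python
import unicodedata

# The Japanese middle dot, the ideographic full stop, and the ASCII period
# from the thread are three different code points.
for ch in ("・", "。", "."):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+30FB KATAKANA MIDDLE DOT
# U+3002 IDEOGRAPHIC FULL STOP
# U+002E FULL STOP
```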