Re: Nested/n-dimensional Arrays in Mongo/Json and Drill
Hi Sudeesh, Probably I was not clear in my previous email. The error is due to accessing an array from the SELECT directly, which is not supported by Drill. My question was, how to overcome this hurdle without using the performance-hungry FLATTEN keyword or using the array indices (impractical as the data is not structured and we do not know the number of elements, and it is going to be a lot and a 3-D array). I am planning to use Protobufs with Drill to serialize and deserialize the data. Anyway, since you asked, given below is the error logs when accessing the nested array element using SELECT. (The entire query and the sample data were given in the previous email). Error: SYSTEM ERROR: NullPointerException Fragment 0:0 [Error Id: 98390161-00e7-4982-b40b-171c8f1029c3 on llovizna:31010] (java.lang.NullPointerException) null org.apache.drill.exec.vector.complex.impl.MapOrListWriterImpl.list():73 org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112 org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112 org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():139 org.apache.drill.exec.store.bson.BsonRecordReader.write():75 org.apache.drill.exec.store.mongo.MongoRecordReader.next():186 org.apache.drill.exec.physical.impl.ScanBatch.next():191 org.apache.drill.exec.record.AbstractRecordBatch.next():119 org.apache.drill.exec.record.AbstractRecordBatch.next():109 org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129 org.apache.drill.exec.record.AbstractRecordBatch.next():162 org.apache.drill.exec.record.AbstractRecordBatch.next():119 org.apache.drill.exec.record.AbstractRecordBatch.next():109 org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 org.apache.drill.exec.record.AbstractRecordBatch.next():162 org.apache.drill.exec.record.AbstractRecordBatch.next():119 org.apache.drill.exec.record.AbstractRecordBatch.next():109 org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94 org.apache.drill.exec.record.AbstractRecordBatch.next():162 org.apache.drill.exec.record.AbstractRecordBatch.next():119 org.apache.drill.exec.record.AbstractRecordBatch.next():109 org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51 org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129 org.apache.drill.exec.record.AbstractRecordBatch.next():162 org.apache.drill.exec.physical.impl.BaseRootExec.next():104 org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81 org.apache.drill.exec.physical.impl.BaseRootExec.next():94 org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257 org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251 java.security.AccessController.doPrivileged():-2 javax.security.auth.Subject.doAs():422 org.apache.hadoop.security.UserGroupInformation.doAs():1657 org.apache.drill.exec.work.fragment.FragmentExecutor.run():251 org.apache.drill.common.SelfCleaningRunnable.run():38 java.util.concurrent.ThreadPoolExecutor.runWorker():1142 java.util.concurrent.ThreadPoolExecutor$Worker.run():617 java.lang.Thread.run():744 (state=,code=0) Thank you. Regards, Pradeeban. On Tue, Sep 13, 2016 at 8:34 PM, Sudheesh Katkamwrote: > Hi Pradeeban, > > Can you post the detailed error message? > > First set the option: > > SET `exec.errors.verbose` = true; > > And then run the query. The detailed output will point us to where the > error occurred. > > Thank you, > Sudheesh > > > On Sep 12, 2016, at 10:47 AM, Pradeeban Kathiravelu < > kk.pradee...@gmail.com> wrote: > > > > Hi, > > I have a complex json data stored in Mongo. > > > > The data has nested arrays as shown below . > > > > here when I try to access the geometry field in select queries, the Drill > > query fails. > > > > select camic._id, camic.type, camic.parent_id, camic.randval, > > camic.creation_date, camic.object_type, camic.x, camic.y, > camic.normalized, > > camic.bbox, camic.geometry, camic.footprint, camic.properties, > > camic.provenance, camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` > as > > camic WHERE ((camic.provenance.image.case_id = > 'TCGA-02-0001-01Z-00-DX1') > > AND (camic.provenance.analysis.execution_id = 'tammy-test:7') AND > > (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND > (camic.y > >> = 0) AND (camic.y <= 1)); > > > > would fail with a below error: > > > > > > > > > > > > *Error: SYSTEM ERROR: NullPointerExceptionFragment 0:0[Error Id: > > 8cd950af-91fa-4cf8-865b-265f227c8e87 on llovizna:31010] (state=,code=0)* > > > > However, after removing the geometry from the select query, it would work > > just fine. > > > > select
Re: Nested/n-dimensional Arrays in Mongo/Json and Drill
Hi Pradeeban, Can you post the detailed error message? First set the option: > SET `exec.errors.verbose` = true; And then run the query. The detailed output will point us to where the error occurred. Thank you, Sudheesh > On Sep 12, 2016, at 10:47 AM, Pradeeban Kathiravelu> wrote: > > Hi, > I have a complex json data stored in Mongo. > > The data has nested arrays as shown below . > > here when I try to access the geometry field in select queries, the Drill > query fails. > > select camic._id, camic.type, camic.parent_id, camic.randval, > camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized, > camic.bbox, camic.geometry, camic.footprint, camic.properties, > camic.provenance, camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as > camic WHERE ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1') > AND (camic.provenance.analysis.execution_id = 'tammy-test:7') AND > (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND (camic.y >> = 0) AND (camic.y <= 1)); > > would fail with a below error: > > > > > > *Error: SYSTEM ERROR: NullPointerExceptionFragment 0:0[Error Id: > 8cd950af-91fa-4cf8-865b-265f227c8e87 on llovizna:31010] (state=,code=0)* > > However, after removing the geometry from the select query, it would work > just fine. > > select camic._id, camic.type, camic.parent_id, camic.randval, > camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized, > camic.bbox, camic.footprint, camic.properties, camic.provenance, > camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as camic WHERE > ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1') AND > (camic.provenance.analysis.execution_id = 'tammy-test:7') AND > (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND (camic.y >> = 0) AND (camic.y <= 1)); > > > I know this is due to the array in the output. I am also aware of the > commonly suggested options. > 1. Using the array indexes in the select query: This is *impractical*. I do > not know how many elements I would have in this geojson - the coordinates. > It may be millions or as low as 3. > > 2. Flatten keyword: I am using Drill on top of Mongo - and finding an > interesting case where Drill outperforms certain queries in a distributed > execution than just using Mongo. Using Flatten basically kills all the > performance benefits I have with Drill otherwise. Flatten is just plain > *expensive* operation for the scale of my data (around 48 GB. But I can > split them into a few GB each). > > Now given that I have explained why I cannot use the commonly suggested > options to deal with the complex Drill queries involving arrays, what are > the other alternatives? > > I am thinking of using Protobufs with Drill to serialize the nested arrays. > Has anyone else tried it before, and any pointers, or anyother suggestions > on this overall requirement of querying the nested arrays? > > Thank you. > Regards, > Pradeeban. > > > The sample data in Mongo (data anonymized). >> db.testUAIM2.findOne() > { >"_id" : ObjectId("22"), >"bbox" : [ >0.11, >0.33, >0.44, >0.44 >], >"*geometry*" : { >"type" : "Polygon", >"coordinates" : [ >[ >[ >0.04272915795445442, >0.9368849396705627 >], >[ >0.04272915795445442, >0.9369083046913147 >], >[ >0.042739588767290115, >0.9369083046913147 >], >[ >0.042739588767290115, >0.9369550347328186 >], >[ >0.04275001958012581, >0.9369550347328186 >], >[ >0.04275001958012581, >0.9370251893997192 >], >[ >0.042760446667671204, >0.9370251893997192 >], >[ >0.042760446667671204, >0.9371420741081238 >], >[ >0.04275001958012581, >0.9371420741081238 >], >[ >0.04275001958012581, >0.9372121691703796 >], >[ >0.042739588767290115, >0.9372121691703796 >], >[ >0.042739588767290115, >0.9372823238372803 >], >[ >0.04272915795445442, >0.9372823238372803 >], >[ >0.04272915795445442, >0.9373056888580322 >],