Re: Nested/n-dimensional Arrays in Mongo/Json and Drill

2016-09-14 Thread Pradeeban Kathiravelu
Hi Sudeesh,
Probably I was not clear in my previous email. The error is due to
accessing an array from the SELECT directly, which is not supported by
Drill. My question was, how to overcome this hurdle without using the
performance-hungry FLATTEN keyword or using the array indices (impractical
as the data is not structured and we do not know the number of elements,
and it is going to be a lot and a 3-D array). I am planning to use
Protobufs with Drill to serialize and deserialize the data.

Anyway, since you asked, given below is the error logs when accessing the
nested array element using SELECT. (The entire query and the sample data
were given in the previous email).


Error: SYSTEM ERROR: NullPointerException

Fragment 0:0

[Error Id: 98390161-00e7-4982-b40b-171c8f1029c3 on llovizna:31010]

  (java.lang.NullPointerException) null
org.apache.drill.exec.vector.complex.impl.MapOrListWriterImpl.list():73
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():139
org.apache.drill.exec.store.bson.BsonRecordReader.write():75
org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
org.apache.drill.exec.physical.impl.ScanBatch.next():191
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51

org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():744 (state=,code=0)


Thank you.
Regards,
Pradeeban.

On Tue, Sep 13, 2016 at 8:34 PM, Sudheesh Katkam 
wrote:

> Hi Pradeeban,
>
> Can you post the detailed error message?
>
> First set the option:
> > SET `exec.errors.verbose` = true;
>
> And then run the query. The detailed output will point us to where the
> error occurred.
>
> Thank you,
> Sudheesh
>
> > On Sep 12, 2016, at 10:47 AM, Pradeeban Kathiravelu <
> kk.pradee...@gmail.com> wrote:
> >
> > Hi,
> > I have a complex json data stored in Mongo.
> >
> > The data has nested arrays as shown below .
> >
> > here when I try to access the geometry field in select queries, the Drill
> > query fails.
> >
> > select camic._id, camic.type, camic.parent_id, camic.randval,
> > camic.creation_date, camic.object_type, camic.x, camic.y,
> camic.normalized,
> > camic.bbox, camic.geometry, camic.footprint, camic.properties,
> > camic.provenance, camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2`
> as
> > camic WHERE ((camic.provenance.image.case_id =
> 'TCGA-02-0001-01Z-00-DX1')
> > AND (camic.provenance.analysis.execution_id  = 'tammy-test:7') AND
> > (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND
> (camic.y
> >> = 0) AND (camic.y <= 1));
> >
> > would fail with a below error:
> >
> >
> >
> >
> >
> > *Error: SYSTEM ERROR: NullPointerExceptionFragment 0:0[Error Id:
> > 8cd950af-91fa-4cf8-865b-265f227c8e87 on llovizna:31010] (state=,code=0)*
> >
> > However, after removing the geometry from the select query, it would work
> > just fine.
> >
> > select 

Re: Nested/n-dimensional Arrays in Mongo/Json and Drill

2016-09-13 Thread Sudheesh Katkam
Hi Pradeeban,

Can you post the detailed error message?

First set the option:
> SET `exec.errors.verbose` = true;

And then run the query. The detailed output will point us to where the error 
occurred.

Thank you,
Sudheesh

> On Sep 12, 2016, at 10:47 AM, Pradeeban Kathiravelu  
> wrote:
> 
> Hi,
> I have a complex json data stored in Mongo.
> 
> The data has nested arrays as shown below .
> 
> here when I try to access the geometry field in select queries, the Drill
> query fails.
> 
> select camic._id, camic.type, camic.parent_id, camic.randval,
> camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
> camic.bbox, camic.geometry, camic.footprint, camic.properties,
> camic.provenance, camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as
> camic WHERE ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1')
> AND (camic.provenance.analysis.execution_id  = 'tammy-test:7') AND
> (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND (camic.y
>> = 0) AND (camic.y <= 1));
> 
> would fail with a below error:
> 
> 
> 
> 
> 
> *Error: SYSTEM ERROR: NullPointerExceptionFragment 0:0[Error Id:
> 8cd950af-91fa-4cf8-865b-265f227c8e87 on llovizna:31010] (state=,code=0)*
> 
> However, after removing the geometry from the select query, it would work
> just fine.
> 
> select camic._id, camic.type, camic.parent_id, camic.randval,
> camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
> camic.bbox, camic.footprint, camic.properties, camic.provenance,
> camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2` as camic WHERE
> ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1') AND
> (camic.provenance.analysis.execution_id  = 'tammy-test:7') AND
> (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND (camic.y
>> = 0) AND (camic.y <= 1));
> 
> 
> I know this is due to the array in the output. I am also aware of the
> commonly suggested options.
> 1. Using the array indexes in the select query: This is *impractical*. I do
> not know how many elements I would have in this geojson - the coordinates.
> It may be millions or as low as 3.
> 
> 2. Flatten keyword: I am using Drill on top of Mongo - and finding an
> interesting case where Drill outperforms certain queries in a distributed
> execution than just using Mongo. Using Flatten basically kills all the
> performance benefits I have with Drill otherwise. Flatten is just plain
> *expensive* operation for the scale of my data (around 48 GB. But I can
> split them into a few GB each).
> 
> Now given that I have explained why I cannot use the commonly suggested
> options to deal with the complex Drill queries involving arrays, what are
> the other alternatives?
> 
> I am thinking of using Protobufs with Drill to serialize the nested arrays.
> Has anyone else tried it before, and any pointers, or anyother suggestions
> on this overall requirement of querying the nested arrays?
> 
> Thank you.
> Regards,
> Pradeeban.
> 
> 
>  The sample data in Mongo (data anonymized).
>> db.testUAIM2.findOne()
> {
>"_id" : ObjectId("22"),
>"bbox" : [
>0.11,
>0.33,
>0.44,
>0.44
>],
>"*geometry*" : {
>"type" : "Polygon",
>"coordinates" : [
>[
>[
>0.04272915795445442,
>0.9368849396705627
>],
>[
>0.04272915795445442,
>0.9369083046913147
>],
>[
>0.042739588767290115,
>0.9369083046913147
>],
>[
>0.042739588767290115,
>0.9369550347328186
>],
>[
>0.04275001958012581,
>0.9369550347328186
>],
>[
>0.04275001958012581,
>0.9370251893997192
>],
>[
>0.042760446667671204,
>0.9370251893997192
>],
>[
>0.042760446667671204,
>0.9371420741081238
>],
>[
>0.04275001958012581,
>0.9371420741081238
>],
>[
>0.04275001958012581,
>0.9372121691703796
>],
>[
>0.042739588767290115,
>0.9372121691703796
>],
>[
>0.042739588767290115,
>0.9372823238372803
>],
>[
>0.04272915795445442,
>0.9372823238372803
>],
>[
>0.04272915795445442,
>0.9373056888580322
>],