Re: Querying Nested Data in Mongo does not produce the results.

2016-08-18 Thread Pradeeban Kathiravelu
Thanks Kathleen.

SELECT camic.provenance.image.case_id caseid
from mongo.users.`contacts2` camic
where camic.provenance.image.case_id > 10;

works for me as well.
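An alternative that keeps the alias usable in the filter is to push the aliased expression into a derived table, so that the outer WHERE references a real column rather than a column alias. This is a hedged, untested sketch against the same collection:

```sql
-- Untested sketch: alias the expression inside a subquery; the outer
-- WHERE then filters on a real column instead of a column alias.
SELECT t.caseid
FROM (
  SELECT camic.provenance.image.case_id AS caseid
  FROM mongo.users.`contacts2` camic
) t
WHERE t.caseid > 10;
```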

I have also resolved the bug report that I opened, marking it as "Invalid".

Regards,
Pradeeban.

On Thu, Aug 18, 2016 at 5:33 PM, Kathleen Li <k...@maprtech.com> wrote:

> Hi Pradeeban,
>
> Please check out the following doc:
> http://drill.apache.org/docs/select-statements/
>
>
> Column Aliases. You cannot reference column aliases in the following
> clauses:
>
> * WHERE
> * GROUP BY
> * HAVING
>
>
>
> The following works for me without using column alias:
> 0: jdbc:drill:zk=drill1:5181,drill2:5181,dril> SELECT
> camic.provenance.image.case_id caseid from test camic where
> camic.provenance.image.case_id > 10;
> +---------+
> | caseid  |
> +---------+
> | 100.0   |
> +---------+
>
>
>
>
>
>
>
> Kathleen
>
>
>
>
>
>
> On 8/18/16, 2:03 PM, "Pradeeban Kathiravelu" <kk.pradee...@gmail.com>
> wrote:
>
> >Hi,
> >I have this simple data in a Mongo database.
> >{"_id":{"$oid":"56a784b76952647b7b51c562"},"provenance":{"image":{"case_id":100,"subject_id":"TCGA"}}}
> >
> >When I run
> >
> >*SELECT camic.provenance.image.case_id caseid
> >FROM mongo.users.`contacts2` camic*
> >
> >I get the below.
> >
> >+---------+
> >| caseid  |
> >+---------+
> >| 100     |
> >+---------+
> >
> >When I run
> >
> >
> >
> >*SELECT camic.provenance.image.case_id caseid
> >FROM mongo.users.`contacts2` camic
> >WHERE caseid > 10;*
> >I expect the same outcome as above.
> >
> >However, I get the below (no results).
> >
> >+---------+
> >| caseid  |
> >+---------+
> >+---------+
> >
> >
> >*tail -f sqlline.log* indicates the below. (see the highlighted line
> >specifically).
> >
> >2016-08-18 16:56:07,337 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:foreman]
> >INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
> >2849df17-8fbd-eb57-03c8-cb2181bc81c7: SELECT
> camic.provenance.image.case_id
> >caseid
> >FROM mongo.users.`contacts2` camic
> >WHERE caseid > 10
> >2016-08-18 16:56:08,491 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
> >INFO  o.a.d.e.s.m.MongoScanBatchCreator - Number of record readers
> >initialized : 1
> >2016-08-18 16:56:08,512 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
> >INFO  o.a.d.e.w.fragment.FragmentExecutor -
> >2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State change requested
> >AWAITING_ALLOCATION --> RUNNING
> >2016-08-18 16:56:08,512 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
> >INFO  o.a.d.e.w.f.FragmentStatusReporter -
> >2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State to report: RUNNING
> >2016-08-18 16:56:08,513 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
> >INFO  o.a.d.e.s.mongo.MongoRecordReader - Filters Applied : Document{{}}
> >2016-08-18 16:56:08,513 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
> >INFO  o.a.d.e.s.mongo.MongoRecordReader - Fields Selected
> :Document{{_id=0,
> >caseid=1, provenance=1}}
> >*2016-08-18 16:56:08,514 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
> >WARN  o.a.d.e.e.ExpressionTreeMaterializer - Unable to find value vector
> of
> >path `caseid`, returning null instance.*
> >2016-08-18 16:56:08,520 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
> >INFO  o.a.d.e.w.fragment.FragmentExecutor -
> >2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State change requested RUNNING
> >--> FINISHED
> >2016-08-18 16:56:08,520 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
> >INFO  o.a.d.e.w.f.FragmentStatusReporter -
> >2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State to report: FINISHED
> >
> >
> >
> >Tested with 1.6.0 and 1.7.0. Reported this as DRILL-4855
> ><https://issues.apache.org/jira/browse/DRILL-4855>
> >
> >Can someone confirm whether this is a bug, or whether I am simply doing
> >something that is not supported, or doing something wrong?
> >
> >Thank you.
> >Regards,
> >Pradeeban.
> >
> >--
> >Pradeeban Kathiravelu.
> >PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
> >INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
> >Portugal.
> >Biomedical Informatics Software Engineer, Emory University School of
> >Medicine.
> >
> >Blog: [Llovizna] http://kkpradeeban.blogspot.com/
> >LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03
>
>


-- 
Pradeeban Kathiravelu.
PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
Portugal.
Biomedical Informatics Software Engineer, Emory University School of
Medicine.

Blog: [Llovizna] http://kkpradeeban.blogspot.com/
LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03


Re: Nested/n-dimensional Arrays in Mongo/Json and Drill

2016-09-14 Thread Pradeeban Kathiravelu
Hi Sudeesh,
Probably I was not clear in my previous email. The error is due to
accessing an array directly from the SELECT, which is not supported by
Drill. My question was how to overcome this hurdle without using the
expensive FLATTEN keyword or array indices (impractical, since the data
is unstructured, the number of elements is unknown and potentially very
large, and the array is 3-D). I am planning to use Protobufs with Drill
to serialize and deserialize the data.
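The serialization idea can be sketched in a few lines of Python. Here json.dumps stands in for a Protobuf encoder, and the field names follow the sample document from the previous email; this is an illustrative assumption, not something run against Drill:

```python
import json

# Sample document shaped like the data in this thread; the nested
# "coordinates" array is the part Drill cannot handle as a LIST.
doc = {
    "footprint": 5,
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0427, 0.9368], [0.0427, 0.9369]]],
    },
}

# Serialize the nested array into an opaque string (json.dumps as a
# stand-in for a Protobuf encoder) so Drill sees a plain VARCHAR.
doc["geometry"]["coordinates"] = json.dumps(doc["geometry"]["coordinates"])

# After Drill returns the scalar fields, the client deserializes the
# array back into its nested form.
coords = json.loads(doc["geometry"]["coordinates"])
```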

Anyway, since you asked, given below is the error log produced when
accessing the nested array element with SELECT. (The entire query and the
sample data were given in the previous email.)


Error: SYSTEM ERROR: NullPointerException

Fragment 0:0

[Error Id: 98390161-00e7-4982-b40b-171c8f1029c3 on llovizna:31010]

  (java.lang.NullPointerException) null
org.apache.drill.exec.vector.complex.impl.MapOrListWriterImpl.list():73
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():112
org.apache.drill.exec.store.bson.BsonRecordReader.writeToListOrMap():139
org.apache.drill.exec.store.bson.BsonRecordReader.write():75
org.apache.drill.exec.store.mongo.MongoRecordReader.next():186
org.apache.drill.exec.physical.impl.ScanBatch.next():191
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51

org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():94
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51

org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104

org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():744 (state=,code=0)


Thank you.
Regards,
Pradeeban.

On Tue, Sep 13, 2016 at 8:34 PM, Sudheesh Katkam <skat...@maprtech.com>
wrote:

> Hi Pradeeban,
>
> Can you post the detailed error message?
>
> First set the option:
> > SET `exec.errors.verbose` = true;
>
> And then run the query. The detailed output will point us to where the
> error occurred.
>
> Thank you,
> Sudheesh
>
> > On Sep 12, 2016, at 10:47 AM, Pradeeban Kathiravelu <
> kk.pradee...@gmail.com> wrote:
> >
> > Hi,
> > I have a complex json data stored in Mongo.
> >
> > The data has nested arrays as shown below .
> >
> > here when I try to access the geometry field in select queries, the Drill
> > query fails.
> >
> > select camic._id, camic.type, camic.parent_id, camic.randval,
> > camic.creation_date, camic.object_type, camic.x, camic.y,
> camic.normalized,
> > camic.bbox, camic.geometry, camic.footprint, camic.properties,
> > camic.provenance, camic.submit_date from mongo.CAMICROSCOPE.`testUAIM2`
> as
> > camic WHERE ((camic.provenance.image.case_id =
> 'TCGA-02-0001-01Z-00-DX1')
> > AND (camic.provenance.analysis.execution_id  = 'tammy-test:7') AND
> > (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <=1) AND
> (camic.y
> >> = 0) AND (camic.y <= 1));
> >
> > would fail with a below error:
> >
> >
> >
> >
> >
> > *Error: SYSTEM ERRO

Nested/n-dimensional Arrays in Mongo/Json and Drill

2016-09-12 Thread Pradeeban Kathiravelu
Hi,
I have complex JSON data stored in Mongo.

The data has nested arrays, as shown in the sample at the end of this email.

When I try to access the geometry field in a SELECT query, the Drill
query fails.

 select camic._id, camic.type, camic.parent_id, camic.randval,
camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
camic.bbox, camic.geometry, camic.footprint, camic.properties,
camic.provenance, camic.submit_date
from mongo.CAMICROSCOPE.`testUAIM2` as camic
WHERE ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1')
AND (camic.provenance.analysis.execution_id = 'tammy-test:7')
AND (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <= 1)
AND (camic.y >= 0) AND (camic.y <= 1));

fails with the following error:

*Error: SYSTEM ERROR: NullPointerException
Fragment 0:0
[Error Id: 8cd950af-91fa-4cf8-865b-265f227c8e87 on llovizna:31010] (state=,code=0)*

However, after removing the geometry from the select query, it would work
just fine.

 select camic._id, camic.type, camic.parent_id, camic.randval,
camic.creation_date, camic.object_type, camic.x, camic.y, camic.normalized,
camic.bbox, camic.footprint, camic.properties, camic.provenance,
camic.submit_date
from mongo.CAMICROSCOPE.`testUAIM2` as camic
WHERE ((camic.provenance.image.case_id = 'TCGA-02-0001-01Z-00-DX1')
AND (camic.provenance.analysis.execution_id = 'tammy-test:7')
AND (camic.footprint >= 800) AND (camic.x >= 0) AND (camic.x <= 1)
AND (camic.y >= 0) AND (camic.y <= 1));


I know this is due to the array in the output. I am also aware of the
commonly suggested options.
1. Using array indices in the SELECT query: this is *impractical*. I do
not know how many elements the GeoJSON coordinates will have; it may be
millions, or as few as 3.

2. The FLATTEN keyword: I am using Drill on top of Mongo, and finding an
interesting case where Drill in distributed execution outperforms plain
Mongo on certain queries. Using FLATTEN basically kills all the
performance benefits I otherwise get from Drill. FLATTEN is simply too
*expensive* an operation at the scale of my data (around 48 GB, though I
can split it into chunks of a few GB each).
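For context, a hedged, untested sketch of what the FLATTEN form would look like against the queries above: each nesting level of the 3-D coordinates array needs its own FLATTEN (here via a subquery), and each level multiplies the row count, which is where the cost comes from.

```sql
-- Untested sketch: one FLATTEN per nesting level of
-- geometry.coordinates; the row count multiplies at each level.
SELECT FLATTEN(t.ring) AS point
FROM (
  SELECT FLATTEN(camic.geometry.coordinates) AS ring
  FROM mongo.CAMICROSCOPE.`testUAIM2` AS camic
) t;
```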

Now given that I have explained why I cannot use the commonly suggested
options to deal with the complex Drill queries involving arrays, what are
the other alternatives?

I am thinking of using Protobufs with Drill to serialize the nested arrays.
Has anyone tried this before? Any pointers, or other suggestions, on this
overall requirement of querying nested arrays?

Thank you.
Regards,
Pradeeban.


 The sample data in Mongo (data anonymized).
> db.testUAIM2.findOne()
{
"_id" : ObjectId("22"),
"bbox" : [
0.11,
0.33,
0.44,
0.44
],
"*geometry*" : {
"type" : "Polygon",
"coordinates" : [
[
[
0.04272915795445442,
0.9368849396705627
],
[
0.04272915795445442,
0.9369083046913147
],
[
0.042739588767290115,
0.9369083046913147
],
[
0.042739588767290115,
0.9369550347328186
],
[
0.04275001958012581,
0.9369550347328186
],
[
0.04275001958012581,
0.9370251893997192
],
[
0.042760446667671204,
0.9370251893997192
],
[
0.042760446667671204,
0.9371420741081238
],
[
0.04275001958012581,
0.9371420741081238
],
[
0.04275001958012581,
0.9372121691703796
],
[
0.042739588767290115,
0.9372121691703796
],
[
0.042739588767290115,
0.9372823238372803
],
[
0.04272915795445442,
0.9372823238372803
],
[
0.04272915795445442,
0.9373056888580322
],
[
0.04272915795445442,
0.9368849396705627
]
]
]
},
"footprint" : 5,
"properties" : {
"scalar_features" : [
{
"nv" : [
{
    "name" : "f",
"valu

Drill with Proto Buffers or Apache Thrift

2016-09-16 Thread Pradeeban Kathiravelu
Hi,
We are evaluating Drill for data with multi-dimensional arrays, and we
would like to keep the overhead low. So we decided against using
flatten() to query the multi-dimensional array. Similarly, using indices
to refer to array elements is simply infeasible, as the array is dynamic
and we will not know the number of elements present (the array represents
the coordinates in a GeoJSON).

We are evaluating the potential of using Protocol Buffers to serialize the
multi-dimensional array before querying the data with Drill, thus avoiding
the error:
*Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST*

Please note that while our query results include these arrays (as in
"select *"), we are not querying the array itself with Drill. Rather, we
are querying the other attributes in the same object. Hence it is
theoretically possible to query while the array remains serialized. Our
data is originally JSON, hence the complex structure.

However, we have some questions about the architectural feasibility of
combining Drill and Protocol Buffers without draining performance. There
is no doubt that each performs well on its own; we are just skeptical
about using them together.

Is there any development effort on serialization with Protocol Buffers
and/or Apache Thrift? Any storage plugins developed, or similar
deployment architectures, as in:
*Data with multi-dimensional array -> Data with the multi-dimensional array
serialized with Protocol Buffers -> Query with Drill -> Deserialize the
multi-dimensional arrays in the query results back with Protocol Buffers* ?

Please share your thoughts on this (whether you have attempted it, or
whether there is something I am failing to see).

We have also tried other alternatives, such as using CTAS, and a
potential change of the data source schema from multi-dimensional arrays
to a map [1]. We do not mind the initial performance hit of the
conversion; it is a one-time cost. What matters is the subsequent read
queries: they should be as efficient and fast as Drill queries that do
not involve multi-dimensional arrays.
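One way to realize that one-time conversion could be Drill's own CTAS combined with CONVERT_TO, materializing the nested array as an opaque blob. This is a hedged, untested sketch (the target table name and the use of CONVERT_TO with 'JSON' here are assumptions, not something attempted in this thread):

```sql
-- Untested sketch of the one-time conversion: materialize the collection
-- with the nested array stored as an opaque JSON blob, so subsequent
-- reads never touch an unsupported LIST type.
CREATE TABLE dfs.tmp.`testUAIM2_flat` AS
SELECT camic._id,
       camic.footprint,
       CONVERT_TO(camic.geometry.coordinates, 'JSON') AS coordinates_blob
FROM mongo.CAMICROSCOPE.`testUAIM2` AS camic;

-- Subsequent reads filter on scalars only; the client deserializes
-- coordinates_blob after the query returns.
SELECT _id, coordinates_blob
FROM dfs.tmp.`testUAIM2_flat`
WHERE footprint >= 800;
```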

[1] http://kkpradeeban.blogspot.com/search/label/Drill

Thank you.
Regards,
Pradeeban.
-- 
Pradeeban Kathiravelu.
PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
Portugal.
Biomedical Informatics Software Engineer, Emory University School of
Medicine.

Blog: [Llovizna] http://kkpradeeban.blogspot.com/
LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03


Querying Nested Data in Mongo does not produce the results.

2016-08-18 Thread Pradeeban Kathiravelu
Hi,
I have this simple data in a Mongo database.
{"_id":{"$oid":"56a784b76952647b7b51c562"},"provenance":{"image":{"case_id":100,"subject_id":"TCGA"}}}

When I run

*SELECT camic.provenance.image.case_id caseid
FROM mongo.users.`contacts2` camic*

I get the below.

+---------+
| caseid  |
+---------+
| 100     |
+---------+

When I run



*SELECT camic.provenance.image.case_id caseid
FROM mongo.users.`contacts2` camic
WHERE caseid > 10;*
I expect the same outcome as above.

However, I get the below (no results).

+---------+
| caseid  |
+---------+
+---------+


*tail -f sqlline.log* indicates the below. (see the highlighted line
specifically).

2016-08-18 16:56:07,337 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:foreman]
INFO  o.a.drill.exec.work.foreman.Foreman - Query text for query id
2849df17-8fbd-eb57-03c8-cb2181bc81c7: SELECT camic.provenance.image.case_id
caseid
FROM mongo.users.`contacts2` camic
WHERE caseid > 10
2016-08-18 16:56:08,491 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
INFO  o.a.d.e.s.m.MongoScanBatchCreator - Number of record readers
initialized : 1
2016-08-18 16:56:08,512 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
INFO  o.a.d.e.w.fragment.FragmentExecutor -
2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State change requested
AWAITING_ALLOCATION --> RUNNING
2016-08-18 16:56:08,512 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
INFO  o.a.d.e.w.f.FragmentStatusReporter -
2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State to report: RUNNING
2016-08-18 16:56:08,513 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
INFO  o.a.d.e.s.mongo.MongoRecordReader - Filters Applied : Document{{}}
2016-08-18 16:56:08,513 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
INFO  o.a.d.e.s.mongo.MongoRecordReader - Fields Selected :Document{{_id=0,
caseid=1, provenance=1}}
*2016-08-18 16:56:08,514 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
WARN  o.a.d.e.e.ExpressionTreeMaterializer - Unable to find value vector of
path `caseid`, returning null instance.*
2016-08-18 16:56:08,520 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
INFO  o.a.d.e.w.fragment.FragmentExecutor -
2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State change requested RUNNING
--> FINISHED
2016-08-18 16:56:08,520 [2849df17-8fbd-eb57-03c8-cb2181bc81c7:frag:0:0]
INFO  o.a.d.e.w.f.FragmentStatusReporter -
2849df17-8fbd-eb57-03c8-cb2181bc81c7:0:0: State to report: FINISHED



Tested with 1.6.0 and 1.7.0. Reported this as DRILL-4855
<https://issues.apache.org/jira/browse/DRILL-4855>

Can someone confirm whether this is a bug, or whether I am simply doing
something that is not supported, or doing something wrong?

Thank you.
Regards,
Pradeeban.

-- 
Pradeeban Kathiravelu.
PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
Portugal.
Biomedical Informatics Software Engineer, Emory University School of
Medicine.

Blog: [Llovizna] http://kkpradeeban.blogspot.com/
LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03


Re: Interesting Errors

2018-08-07 Thread Pradeeban Kathiravelu
If I understand correctly, he was using the character ・

Different languages have different symbols.

The same thing can be said about the Chinese equivalent 。

The characters ・ and 。 are entirely different from ".".
I am not sure whether this needs to be fixed. If I am not completely
misunderstanding something, he would also fail if he attempted to use
that character (in place of ".") in programming languages (Java, C, ...).
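The point is easy to verify: these "periods" are distinct Unicode code points, not styling variants of ASCII ".". The normalization helper below is an illustrative assumption (not an existing Drill or PyDrill feature), showing one possible client-side mitigation:

```python
# These look-alike "periods" are distinct Unicode code points.
ascii_dot = "."          # U+002E FULL STOP
katakana_dot = "\u30fb"  # ・ KATAKANA MIDDLE DOT
cjk_stop = "\u3002"      # 。 IDEOGRAPHIC FULL STOP

codepoints = [hex(ord(c)) for c in (ascii_dot, katakana_dot, cjk_stop)]

# One possible client-side mitigation (an illustrative assumption):
# normalize look-alike separators before submitting a query string.
def normalize_query(query: str) -> str:
    for lookalike in ("\u30fb", "\u3002", "\uff0e"):  # ・ 。 ．
        query = query.replace(lookalike, ".")
    return query
```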

Regards,
Pradeeban.

On Tue, Aug 7, 2018 at 10:02 AM, Charles Givre  wrote:

> Hello Drill Developers,
> I wanted to share an interesting development that happened yesterday.  I
> was teaching a class at BlackHat, and we have a worksheet that includes a
> Drill demonstration using PyDrill.   Basically the students are asked to
> execute a query in Drill using PyDrill then visualize the results.
>
> Anyway, a student from Japan tried this, and was getting all kinds of
> crazy errors.  So I sat down and worked with him to debug.  It turns out
> that the period on the Japanese keyboard, maps to a different unicode
> character than on US keyboards, and hence the queries throw errors.  I
> discovered this because when I would cut/paste a query from a text file
> that I wrote, the query executed, but if we typed one in, it broke.  After
> digging around a bit, I found that it was the period character.
>
> I’m not sure that this can or should be fixed, but I wanted to let people
> know about this.
>
> Best,
> — C




-- 
Pradeeban Kathiravelu.
Senior Systems Software Engineer, Emory University, Atlanta, GA, USA.
Ph.D. Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
Portugal.
Université catholique de Louvain, Louvain-la-Neuve, Belgium.

Blog: [Llovizna] kkpradeeban.blogspot.com
LinkedIn: www.linkedin.com/in/kpradeeban