Drill with Proto Buffers or Apache Thrift

Pradeeban Kathiravelu Fri, 16 Sep 2016 12:32:09 -0700

Hi,
We are evaluating Drill for data that contains multi-dimensional arrays. We
would like to keep the overhead low, so we decided against using flatten()
to query the arrays. Similarly, referring to the array elements by index is
infeasible, as the arrays are dynamic and we do not know in advance how
many elements they contain (each array represents the coordinates in a
GeoJSON object).


We are evaluating the potential of using Protocol Buffers to serialize the
multi-dimensional array before querying the data with Drill, thereby
avoiding the error
" *Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type
LIST"*

Please note that while our query results include these arrays (as in "select
*"), we are not querying the arrays themselves with Drill. Rather, we are
querying the other attributes in the same object. Hence it is theoretically
possible to query while the array remains serialized. Our data is
originally in JSON format, hence the complex structure.

However, we have some questions about the architectural feasibility of
doing this without degrading the performance of Drill and Protocol Buffers.
No doubt both perform well on their own, but we are skeptical about using
them in combination.

Is there any development effort on serialization with Protocol Buffers
and/or Apache Thrift? Are there any storage plugins, or similar deployment
architectures, as in:
*Data with multi-dimensional array -> Data with the multi-dimensional array
serialized with Protocol Buffers -> Query with Drill -> Deserialize the
multi-dimensional arrays in the query results back with Protocol Buffers* ?
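
To make the pipeline above concrete, here is a minimal sketch of the
round trip. It is only an illustration: it uses JSON text wrapped in
base64 as a stand-in for Protocol Buffers so that it runs with the Python
standard library alone, and the helper names and the "coordinates" field
name are hypothetical. The idea is that the array is replaced by an opaque
string before the data reaches Drill, which should then see a plain string
column rather than a nested list, and is restored from the query results
afterwards.

```python
import base64
import json


def pack_coordinates(record, field="coordinates"):
    """Return a copy of record with the nested array replaced by an opaque string.

    Stand-in encoding (JSON + base64); a Protocol Buffers message
    serialized to bytes could be wrapped in base64 the same way.
    """
    packed = dict(record)
    raw = json.dumps(record[field], separators=(",", ":")).encode("utf-8")
    packed[field] = base64.b64encode(raw).decode("ascii")
    return packed


def unpack_coordinates(record, field="coordinates"):
    """Reverse pack_coordinates on a record taken from the query results."""
    restored = dict(record)
    raw = base64.b64decode(record[field])
    restored[field] = json.loads(raw.decode("utf-8"))
    return restored


# Hypothetical sample record; the other attributes stay queryable as-is.
original = {
    "name": "region-1",
    "coordinates": [[[30.0, 10.0], [40.0, 40.0]], [[20.0, 30.0]]],
}
packed = pack_coordinates(original)
assert isinstance(packed["coordinates"], str)
assert unpack_coordinates(packed) == original
```

Whether the extra encode and decode steps are cheap enough in practice is
exactly the open question of this mail; the sketch only shows that the
array never has to be interpreted by the query engine.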

Please share your thoughts on this (whether you have attempted this, or
whether there is something that I am failing to see).

We have also tried other alternatives, such as using CTAS, and the
possibility of modifying the data source schema from multi-dimensional
arrays to a map [1]. We do not mind the initial performance hit of the
conversion, as it is a one-time cost. What matters is the subsequent read
queries: they should be as efficient and fast as Drill is when
multi-dimensional arrays are not included.
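
As a rough illustration of the array-to-map alternative, the one-time
conversion could look like the sketch below. The key naming scheme ("e0",
"e1", ...) and the function names are assumptions made for this example,
not necessarily the layout described in [1].

```python
def array_to_map(value, prefix="e"):
    """Recursively replace each list with a map keyed by element position."""
    if isinstance(value, list):
        return {
            f"{prefix}{i}": array_to_map(item, prefix)
            for i, item in enumerate(value)
        }
    return value


def map_to_array(value, prefix="e"):
    """Reverse array_to_map; only meant for maps produced by it."""
    if isinstance(value, dict):
        # Sort by the numeric suffix so that "e10" comes after "e9".
        keys = sorted(value, key=lambda k: int(k[len(prefix):]))
        return [map_to_array(value[k], prefix) for k in keys]
    return value


# Hypothetical GeoJSON-like coordinates with a varying number of elements.
coords = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
as_map = array_to_map(coords)
assert as_map["e1"] == {"e0": {"e0": 5.0, "e1": 6.0}}
assert map_to_array(as_map) == coords
```

Because the arrays are dynamic, the set of keys differs from record to
record, which may matter for how the schema is inferred at query time.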

[1] http://kkpradeeban.blogspot.com/search/label/Drill

Thank you.
Regards,
Pradeeban.
-- 
Pradeeban Kathiravelu.
PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa,
Portugal.
Biomedical Informatics Software Engineer, Emory University School of
Medicine.

Blog: [Llovizna] http://kkpradeeban.blogspot.com/
LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03
