Hi,

We are evaluating Drill for data that contains multi-dimensional arrays. We would like to keep the overhead low, so we decided against using flatten() to query the multi-dimensional array. Referring to the array elements by index is also infeasible: our array is dynamic, and we do not know in advance how many elements it contains (the array represents the coordinates in a GeoJSON).
We are evaluating the potential of using Protocol Buffers to serialize the multi-dimensional array before querying the data with Drill. The aim is to avoid the error "Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST".

Please note that while our query results include these arrays (as in "select *"), we are not querying the array itself with Drill. Rather, we are querying the other attributes in the same object. Hence it is theoretically possible to query while the array remains serialized. Our data is originally in JSON format, hence the complex structure.

However, we have some questions about the architectural feasibility of this approach without draining the performance of Drill and Protocol Buffers. There is no doubt that both perform well on their own, but we are skeptical about using them in combination.

Is there any development effort on serialization with Protocol Buffers and/or Apache Thrift? Have any storage plugins been developed, or are there similar deployment architectures, as in:

Data with multi-dimensional array -> Data with the multi-dimensional array serialized with Protocol Buffers -> Query with Drill -> Deserialize the multi-dimensional arrays in the query results back with Protocol Buffers?

Please share your thoughts on this (whether you have attempted this, or whether there is something that I am failing to see).

We have also tried other alternatives, such as using CTAS, and have considered simply modifying the data source schema from multi-dimensional arrays to a map [1]. We do not mind the initial performance hit of the conversions, as this is just a one-time cost. What matters is the subsequent read queries: they should be as efficient and fast as Drill queries on data that does not include multi-dimensional arrays.

[1] http://kkpradeeban.blogspot.com/search/label/Drill

Thank you.

Regards,
Pradeeban.

--
Pradeeban Kathiravelu.
PhD Researcher, Erasmus Mundus Joint Doctorate in Distributed Computing,
INESC-ID Lisboa / Instituto Superior Técnico, Universidade de Lisboa, Portugal.
Biomedical Informatics Software Engineer, Emory University School of Medicine.

Blog: [Llovizna] http://kkpradeeban.blogspot.com/
LinkedIn: www.linkedin.com/pub/kathiravelu-pradeeban/12/b6a/b03
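
P.S. To make the proposed pipeline concrete, here is a minimal sketch of the serialize / query / deserialize round trip. It is an illustration under stated assumptions, not what we run: a Protocol Buffers version would need a compiled .proto schema, so the sketch swaps in compact JSON wrapped in base64 as a stand-in encoding. The function names (pack_coordinates, unpack_coordinates) and the sample feature are hypothetical; only the "geometry" and "coordinates" keys come from GeoJSON itself. The assumption is that Drill would see the packed field as an ordinary string column, which we have not verified here.

```python
import base64
import json


def pack_coordinates(feature: dict) -> dict:
    """Return a copy of the feature with its nested coordinates array
    replaced by a single opaque string (stand-in for a protobuf payload)."""
    packed = json.loads(json.dumps(feature))  # deep copy via a JSON round trip
    coords = packed["geometry"]["coordinates"]
    blob = json.dumps(coords, separators=(",", ":")).encode("utf-8")
    packed["geometry"]["coordinates"] = base64.b64encode(blob).decode("ascii")
    return packed


def unpack_coordinates(row: dict) -> dict:
    """Reverse pack_coordinates on a record taken from the query results."""
    restored = json.loads(json.dumps(row))  # deep copy, leave the row untouched
    blob = base64.b64decode(restored["geometry"]["coordinates"])
    restored["geometry"]["coordinates"] = json.loads(blob.decode("utf-8"))
    return restored


# Hypothetical sample record; the array depth and length are not fixed.
feature = {
    "type": "Feature",
    "properties": {"name": "sample region"},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]],
    },
}

packed = pack_coordinates(feature)     # one-time conversion before querying
restored = unpack_coordinates(packed)  # applied to rows returned by the query
```

The conversion cost is paid once when the data is written; the read path would only decode the rows that a query actually returns, since the other attributes stay queryable as they are.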