Moving the topic of non-relational data to this dedicated thread. First, a bit of context based on our use case:

* We want to do ad-hoc analysis of data coming from diverse sources such as APIs, document stores, and relational stores.
* Data are not limited to relational structures, e.g. an API returning collections of complex objects.
* Data may change structure over time, e.g. due to implementation upgrades.
* We want to use high-level declarative query languages such as SQL.

Various techniques exist to tackle non-relational data analysis, such as mapping to a relational schema or running custom code in a distributed compute cluster (MapReduce, Spark jobs, etc.) on blob data. Each has drawbacks: mapping to a relational schema adds data latency and structure-transformation effort, while computing on blob data adds query latency and cost.

We built a columnar data store for non-relational data without a pre-defined schema. For querying this data, technologies like Drill make it almost possible to work directly with non-relational data using array and map data types. However, we feel more can be done to truly make non-relational data a first-class citizen:

1) Functions on arrays and maps -- e.g. sizeOf(person.addresses), where person.addresses is an array. Using FLATTEN is not the same as working with complex objects directly.

2) Heterogeneous types -- better handling of heterogeneous data types within the same column, e.g. product.version started as numbers, but some values are now strings. Treating every value as a String is only a workaround.

3) Better storage plugin support for complex types -- we had to regenerate objects from our columnar vectors to hand to Drill, rather than feeding the vectors directly.

I don't think any of these are easy to do. Much research and thinking will be needed for a cohesive solution.

-- Jiang
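
P.S. To make points 1) and 2) above a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how Drill or our store works: the sample records, the helper names (size_of, flatten, type_histogram), and the simplified FLATTEN-style behavior are all hypothetical. It shows that an array function can answer "how many addresses?" for every row, while a flatten-style expansion, as modelled here, loses rows whose array is empty or missing, and that a mixed-type column can be inspected per type instead of being coerced to strings.

```python
# Hypothetical sample records; field names follow the examples in the post.
people = [
    {"name": "a", "addresses": [{"city": "X"}, {"city": "Y"}]},
    {"name": "b", "addresses": []},
    {"name": "c"},  # structure changed over time: no addresses field at all
]


def size_of(record, field):
    """Length of an array field computed directly on the nested value.

    A missing or non-array field counts as 0 (an assumption of this sketch).
    """
    value = record.get(field)
    return len(value) if isinstance(value, list) else 0


def flatten(records, field):
    """Simplified FLATTEN-style expansion: one output row per array element."""
    return [
        {**{k: v for k, v in r.items() if k != field}, field: item}
        for r in records
        for item in r.get(field, [])
    ]


# Point 1: the array function keeps one answer per person, including zeros.
sizes = {p["name"]: size_of(p, "addresses") for p in people}

# In this model, people with empty or missing arrays produce no flattened rows,
# so counting flattened rows cannot report a size of 0 without extra work.
flattened_names = [row["name"] for row in flatten(people, "addresses")]

# Point 2: a column that started as numbers and later gained strings.
versions = [1, 2, "2.1-beta", 3]


def type_histogram(values):
    """Count values per Python type name instead of coercing them all to str."""
    counts = {}
    for v in values:
        name = type(v).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts


# Numeric operations can then be applied to the numeric subset only.
numeric_only = [
    v for v in versions
    if isinstance(v, (int, float)) and not isinstance(v, bool)
]
```

In the sketch, size_of works on the complex value in place, whereas the flatten route needs an extra outer join or special casing to recover the empty cases; that is the gap point 1) is about.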
