Non-Relational Data

Jiang Wu Thu, 16 Jan 2020 16:47:09 -0800

Moving the topic on non-relational data to this dedicated thread.  First a
bit of context based on our use case:


* We want to do ad-hoc analyze data coming from diverse sources like APIs,
document stores, and relational stores.
* Data are not limited to relational structures, e.g. API returning complex
object collections.
* Data may change its structure over time, e.g. due to implementation
upgrades.
* We want to use high level declarative query languages such as SQL.

Various techniques exist to tackle non-relational data analysis such as
mapping to a relational schema or run custom code in a distributed compute
cluster (map-reduce, spark jobs, etc) on blob data.  These have their
drawbacks like data latency and effort on structure transformation, and
query latency and cost computing on blob data.

We built a columnar data store for non-relational data without pre-defined
schema.  For querying this data, technologies like Drill made it almost
possible to directly work with non-relational data using array and map data
types.  However, we feel more can be done to truly make non-relational data
a first class citizen:

1) functions on array and map -- e.g. sizeOf(person.addresses) where
person.addresses is an array.  Using FLATTEN is not the same as working
with complex objects directly,
2) heterogenous types -- better handling of heterogeneous data types within
the same column, e.g. product.version started as numbers, but some are
strings.  Treating every value as a String is a workaround.
3) better storage plugin support for complex types -- we had to re-generate
from our columnar vectors into objects to give to Drill, rather than
feeding vectors directly.

I don't think any of these are easy to do.  Much research and thinking will
be needed for a cohesive solution.

-- Jiang

Non-Relational Data

Reply via email to