I have noticed that data consumers prefer flat records because they
are easier to query. I have yet to find a good tool for querying
nested, semi-structured records like JSON, so a large amount of time
and effort goes into the ETL process.
Maybe one could fork the data flow: send raw records to a "raw" bin,
and send the other fork through a process that conforms each record
to a schema from a schema library.
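A minimal sketch of that fork in Python (the schema here is just a
field-to-default map standing in for a real schema library; the names
are made up):

    import json

    def fork(record, schema, raw_sink, clean_sink):
        # Fork 1: keep the raw record untouched, for later reprocessing.
        raw_sink.append(json.dumps(record))
        # Fork 2: conform the record to the schema -- keep only known
        # fields and fill in defaults for anything missing.
        clean_sink.append({f: record.get(f, d) for f, d in schema.items()})

    user_schema = {"id": None, "name": "", "email": None}
    raw, clean = [], []
    fork({"id": 1, "name": "Ann", "extra": "ignored"}, user_schema, raw, clean)
    # clean[0] == {"id": 1, "name": "Ann", "email": None}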
On 2/10/15 5:01 PM, Wai Yip Tung wrote:
While developing our schema-based data pipeline, we often run into a
debate. Should we make the schema tight and strict, so that
application errors can be tested and caught early? Or should we
design the schema to be lenient, because the schema will inevitably
evolve, and the data we find in our system often contains variations
despite our efforts to constrain it?
Over time I have observed that the difference in schools of thought
is largely related to role. The data producers, mainly the
application developers, want the schema to be strict (e.g. required
attributes, no union with 'null'). They see the schema as a debugging
tool: they expect errors to be caught by the encoder during unit
tests, and they expect the production system to raise an alarm loudly
if a bad build breaks things.
The consumers, mainly the backend data developers and the analysts,
want the schema to be lenient. The backend developers often have to
reprocess historical data, and a strict schema is often incompatible
with it, causing big problems when reading old records. They argue
that having some data, even if slightly broken, is better than having
no data.
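To make the two positions concrete (assuming Avro-style schemas, with
a made-up field, written as Python dicts rather than schema JSON),
the producer would declare the field as plainly required, while the
consumer would rather have a nullable union with a default:

    strict_field  = {"name": "email", "type": "string"}
    lenient_field = {"name": "email", "type": ["null", "string"], "default": None}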
We have been having difficulty striking a balance. It leads me to
think we may need more than a single schema in operation: perhaps the
application developer creates a strict schema, and the backend
application derives a lenient version from it in order to load all
historical data successfully.
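A rough sketch of that derivation, assuming Avro-style record schemas
represented as Python dicts (the Event record and its fields are made
up):

    import copy

    def loosen(strict_schema):
        # Derive a lenient schema: make every field optional by adding
        # "null" to its type union and giving it a null default, so
        # records written under older, different schemas still load.
        lenient = copy.deepcopy(strict_schema)
        for field in lenient["fields"]:
            t = field["type"]
            if isinstance(t, list):
                if "null" not in t:
                    field["type"] = ["null"] + t
            else:
                field["type"] = ["null", t]
            field.setdefault("default", None)
        return lenient

    strict = {
        "type": "record", "name": "Event",
        "fields": [
            {"name": "id",    "type": "long"},
            {"name": "email", "type": "string"},
        ],
    }
    lenient = loosen(strict)
    # lenient["fields"][1] == {"name": "email",
    #                          "type": ["null", "string"], "default": None}

In Avro terms, the strict schema would remain the writer schema and
something like this derived one would serve as the reader schema when
loading historical data.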
I am wondering if others have seen this kind of tension. Any thoughts
on how to address it?
Wai Yip