During our development of schema-based data pipelines, we often run into
a debate. Should we make the schema tight and strict, so that
application errors can be tested and caught early? Or should we design
the schema to be lenient, because the schema is inevitably going to
evolve, and the data we find in our system often contains variations
despite our efforts to constrain it?
Over time I have observed that the difference in schools of thought is
largely related to role. The data producers, mainly the application
developers, want the schema to be strict (e.g. required attributes, no
union with 'null'). They see this as a debugging tool. They expect
errors to be caught by the encoder during unit tests, and they expect
the production system to raise a loud alarm if a bad build breaks
things.
The consumers, mainly the data backend developers and the analysts, want
the schema to be lenient. The backend developers often have to reprocess
historical data, and a strict schema is often incompatible with it,
causing big problems when reading that data. They argue that having some
data, even if slightly broken, is better than having no data.
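
For concreteness, here is roughly the kind of field declaration each
camp asks for. I am sketching it in Avro-ish terms as Python dicts;
the "user_id" field is made up purely for illustration.

# What the producers want: a required, non-null field, so a missing
# value fails fast at encode time.
strict_field = {"name": "user_id", "type": "long"}

# What the consumers want: nullable with a default, so old or slightly
# broken records still load.
lenient_field = {
    "name": "user_id",
    "type": ["null", "long"],
    "default": None,
}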
We have had difficulty striking a balance. This leads me to think that
perhaps we need more than a single schema in operation. Perhaps an
application developer would create a strict schema, and the backend
application would derive a lenient version from it in order to load all
historical data successfully.
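
As a rough sketch of what I mean, here is a hypothetical
relax_schema() helper (not from any real library) that takes the
producers' strict Avro-style record schema and derives a lenient
reader schema by making every field nullable with a null default:

def relax_schema(strict_schema):
    """Derive a lenient copy of a strict record schema."""
    lenient = dict(strict_schema)
    lenient["fields"] = []
    for field in strict_schema["fields"]:
        ftype = field["type"]
        # Wrap non-nullable types in a union with "null". In Avro,
        # "null" must come first for a null default to be valid.
        if isinstance(ftype, list):
            if "null" not in ftype:
                ftype = ["null"] + ftype
        else:
            ftype = ["null", ftype]
        lenient["fields"].append(
            {"name": field["name"], "type": ftype, "default": None}
        )
    return lenient

strict = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "action", "type": "string"},
    ],
}

lenient = relax_schema(strict)
# Every field is now ["null", <original type>] with a null default,
# so historical records missing a field still decode instead of
# failing the whole load.

This would keep the strict schema as the single source of truth, while
the backend gets a reader schema that tolerates historical variations.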
I am wondering if others have seen this kind of tension. Any thoughts on
how to address it?
Wai Yip