While developing our schema-based data pipeline, we often run into a debate. Should we make the schema tight and strict, so that application errors can be tested and caught early? Or should we design the schema to be lenient, because the schema will inevitably evolve, and the data found in our system often contains variations despite our efforts to constrain it?

Over time I have observed that the difference in schools of thought is largely tied to people's roles. The data producers, mainly the application developers, want the schema to be strict (e.g. required attributes, no union with 'null'). They see the schema as a debugging tool: they expect errors to be caught by the encoder during unit tests, and they expect the production system to raise a loud alarm if a bad build breaks things.
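
To make that concrete, here is a hypothetical Avro-style record the way the application developers would like to declare it, written as a Python dict purely for illustration (the record and field names are made up, not our actual schema). Every field is required, with no 'null' branches and no defaults:

    # Hypothetical strict writer schema: every field is required.
    STRICT_SCHEMA = {
        "type": "record",
        "name": "PageView",
        "fields": [
            {"name": "user_id",   "type": "string"},
            {"name": "url",       "type": "string"},
            {"name": "timestamp", "type": "long"},
        ],
    }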

The consumers, mainly the data backend developers and the analysts, want the schema to be lenient. The backend developers often have to reprocess historical data, and a strict schema is frequently incompatible with it, causing big problems when reading old records. They argue that having some data, even if slightly broken, is better than having no data.

We have had difficulty striking a balance. This leads me to think that perhaps we need more than a single schema in operation. An application developer would create a strict schema, and the backend application would derive a lenient version from it in order to load all historical data successfully. A rough sketch of what that derivation might look like is below.
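
Something along these lines, assuming Avro-style schemas represented as Python dicts as in the PageView example above (the function names and relaxation rules are just a sketch of the idea, not an implementation we have):

    def make_nullable(ftype):
        """Return the field type as a union with "null" as the first
        branch, so that a default of None is valid."""
        branches = [t for t in ftype if t != "null"] if isinstance(ftype, list) else [ftype]
        return ["null"] + branches

    def derive_lenient_schema(strict_schema):
        """Derive a lenient reader schema from a strict writer schema.

        Every field becomes optional: its type is unioned with "null"
        and it gets a default of None, so records written under older
        or slightly different versions can still be decoded.
        """
        lenient = dict(strict_schema)
        lenient["fields"] = [
            {"name": f["name"], "type": make_nullable(f["type"]), "default": None}
            for f in strict_schema["fields"]
        ]
        return lenient

Applied to the strict PageView record above, every field would become ["null", <type>] with a default of None, which (in Avro at least) would let the backend read historical records that are missing those fields. Whether the analysts can live with the resulting Nones is a separate question.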

I am wondering if others have seen this kind of tension. Any thoughts on how to address it?

Wai Yip
