Hey Phillip,

Thanks for raising this. I like the idea. The question is whether this should
be implemented in Spark or in some other framework. I know that dbt has fairly
extensive support for testing your data
<https://www.getdbt.com/product/data-testing/> and for enforcing assumptions
on columns. The nice thing about dbt is that it is built from a software
engineering perspective, so all the tests (or contracts) live in version
control. Through pull requests you can collaborate on changes to the contract
and make sure a change has received enough review before it goes to
production. Hope this helps!

Kind regards,
Fokko

On Tue, 13 Jun 2023 at 04:31, Deepak Sharma <deepakmc...@gmail.com> wrote:

> Spark can be used with tools like Great Expectations to implement data
> contracts.
> I am not sure, though, whether Spark alone can do data contracts.
> I was reading a blog on data mesh and how to glue it together with data
> contracts; that's where I came across the combination of Spark and Great
> Expectations.
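>
> For what it's worth, here is a minimal sketch of the kind of check Great
> Expectations can run against a Spark DataFrame. It uses the older
> "Dataset"-style API, which newer releases have replaced with expectation
> suites and validators, so treat it as illustrative only:
>
>     from pyspark.sql import SparkSession
>     from great_expectations.dataset import SparkDFDataset
>
>     spark = SparkSession.builder.getOrCreate()
>     df = spark.createDataFrame([(42,), (150,)], ["score"])
>
>     # Wrap the DataFrame and state the contract as an expectation.
>     gdf = SparkDFDataset(df)
>     result = gdf.expect_column_values_to_be_between(
>         "score", min_value=0, max_value=100)
>     print(result.success)  # False: 150 breaks the 0-100 contract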
>
> HTH
>
> -Deepak
>
> On Tue, 13 Jun 2023 at 12:48 AM, Elliot West <tea...@gmail.com> wrote:
>
>> Hi Phillip,
>>
>> While not as fine-grained as your example, there do exist schema systems,
>> such as Avro's, that can evaluate compatible and incompatible changes to a
>> schema from the perspective of the reader, the writer, or both. This
>> provides some degree of enforcement and a means to communicate a contract.
>> Interestingly, I believe this approach has been applied to both JSON Schema
>> and Protobuf as part of the Confluent Schema Registry.
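>>
>> As a rough local illustration of those reader/writer resolution rules
>> (using fastavro here purely for convenience; the Confluent Schema Registry
>> runs comparable compatibility checks centrally):
>>
>>     import io
>>     from fastavro import parse_schema, schemaless_writer, schemaless_reader
>>
>>     # The schema the original producer wrote with.
>>     writer_schema = parse_schema({
>>         "type": "record", "name": "User",
>>         "fields": [{"name": "id", "type": "long"}],
>>     })
>>
>>     # A later reader schema that adds a field with a default -- a
>>     # compatible change from the reader's perspective.
>>     reader_schema = parse_schema({
>>         "type": "record", "name": "User",
>>         "fields": [
>>             {"name": "id", "type": "long"},
>>             {"name": "country", "type": "string", "default": "unknown"},
>>         ],
>>     })
>>
>>     buf = io.BytesIO()
>>     schemaless_writer(buf, writer_schema, {"id": 1})
>>     buf.seek(0)
>>     # Old data remains readable; the missing field takes its default.
>>     print(schemaless_reader(buf, writer_schema, reader_schema))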
>>
>> Elliot.
>>
>> On Mon, 12 Jun 2023 at 12:43, Phillip Henry <londonjava...@gmail.com>
>> wrote:
>>
>>> Hi, folks.
>>>
>>> There currently seems to be a buzz around "data contracts". From what I
>>> can tell, the proposals mainly advocate a cultural solution. Could big
>>> data tools instead be used to enforce these contracts?
>>>
>>> My questions really are: are there any plans to implement data
>>> constraints in Spark (e.g., an integer must be between 0 and 100; the date
>>> in column X must be before the date in column Y)? And if not, is there an
>>> appetite for them?
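>>>
>>> For concreteness, this is the kind of check I have in mind, hand-rolled
>>> in user code today (column names made up):
>>>
>>>     from pyspark.sql import SparkSession, functions as F
>>>
>>>     spark = SparkSession.builder.getOrCreate()
>>>     df = spark.createDataFrame(
>>>         [(50, "2023-01-01", "2023-02-01"), (120, "2023-03-01", "2023-01-01")],
>>>         ["score", "x", "y"],
>>>     )
>>>
>>>     # Rows that violate either example constraint.
>>>     bad = df.filter(
>>>         (~F.col("score").between(0, 100))
>>>         | (F.to_date("x") >= F.to_date("y"))
>>>     )
>>>     if bad.count() > 0:
>>>         raise ValueError("data contract violated")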
>>>
>>> Maybe we could associate constraints with the schema metadata and enforce
>>> them in the implementation of a FileFormatDataWriter?
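>>>
>>> Spark already allows arbitrary metadata on a StructField, so perhaps the
>>> constraint could live there. A hand-wavy sketch of the idea (the
>>> "constraint" key and the enforcement loop are entirely made up):
>>>
>>>     from pyspark.sql import SparkSession, functions as F
>>>     from pyspark.sql.types import StructType, StructField, IntegerType
>>>
>>>     spark = SparkSession.builder.getOrCreate()
>>>     schema = StructType([
>>>         StructField("score", IntegerType(), nullable=False,
>>>                     metadata={"constraint": "score BETWEEN 0 AND 100"}),
>>>     ])
>>>     df = spark.createDataFrame([(42,), (150,)], schema)
>>>
>>>     # A writer-side hook could do something like this before writing:
>>>     for field in df.schema.fields:
>>>         expr = field.metadata.get("constraint")
>>>         if expr and df.filter(~F.expr(expr)).count() > 0:
>>>             raise ValueError(f"constraint violated on {field.name}: {expr}")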
>>>
>>> Just throwing it out there and wondering what other people think. It's an
>>> area that interests me, as over half of the problems at my day job seem to
>>> be caused by dodgy data.
>>>
>>> Regards,
>>>
>>> Phillip
>>>
>>>
