On Tue, 10 Dec 2019 at 10:28, Ryan Skraba <[email protected]> wrote:

> @Roger: The CUE schema gets a +1 for the most accurate regex for
> validating names and namespaces so far! :D  It doesn't look like it's
> being applied to *every* name and namespace attribute though, or am I
> misreading?


No, you're right. As I said, it definitely needs some work (although one of
the interesting properties of CUE is that it's possible to independently add
constraints to that schema without changing the source, while preserving
all the existing constraints).

For example, to further restrict field names:

    // A stricter version of the Avro package.
    package avro
    import "github.com/heetch/cue-schema/avro"

    stricterAvro :: avro
    stricterAvro :: Field :: name: =~"^[a-zA-Z][a-zA-Z0-9]*$"
    Schema :: stricterAvro.Schema
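
To make the effect of that pattern concrete, here is a rough Python
equivalent of the name check. It is an illustration only, not part of the
CUE schema, and the helper name is made up for this example:

```python
import re

# Same character classes as the stricter CUE constraint above: one letter,
# then letters or digits only. This is narrower than the Avro spec, which
# also permits underscores in names.
STRICT_NAME = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def is_strict_field_name(name: str) -> bool:
    """Hypothetical helper: True if `name` satisfies the stricter pattern."""
    # fullmatch anchors at both ends, so no explicit ^ and $ are needed
    # and a trailing newline is not silently accepted.
    return STRICT_NAME.fullmatch(name) is not None
```

With this check a name such as "user_id" is rejected even though the Avro
spec accepts it, which is the point of layering the extra constraint on top.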


> I read the schema with just a *minimal* understanding of
> the language, but it looks like it also expects that fixed data can
> have a doc.
>

Yup, that's also something I overlooked.


> I would hope that the doc attribute in a fixed data schema could still
> be retrieved like any other metadata by schema.getObjectProp (at least
> in the Java API).  I'll check!
>
> @Jonah: I think I understand your use case a bit better -- thanks for
> the clarification!
>
> Attributes outside of the spec should be OK to use as metadata, and
> that seems like the right fit for your use case (such as the
> interesting obfuscation attribute in lenses).  Are the avro tools that
> strip non-spec-attributes/metadata doing something wrong?  I can see
> this happening if they are relying on the Parsing Canonical Form or
> the fingerprint (based on canonical form), but that is deliberate to
> remove all differences between two schemas that can be used to parse
> the same binary data.  Note that PCF also removes doc attributes.
>

Yeah, I hadn't noticed that arbitrary extra attributes are allowed. That's
a pity from my p.o.v.: it makes any validating schema substantially less
useful, since it can't report misspelled fields. The schema should
nonetheless allow them, given that the spec does.
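
A small Python sketch of the trade-off, looking only at record fields. The
helper name and the sample field are invented for illustration; the
attribute set is the one the Avro spec defines for a record field:

```python
# Attributes the Avro spec defines for a record field.
FIELD_ATTRS = {"name", "doc", "type", "default", "order", "aliases"}


def unknown_field_attrs(field: dict) -> set:
    """Hypothetical helper: attributes the spec does not define for a field."""
    return set(field) - FIELD_ATTRS


# A made-up field with one typo ("defualt") and one deliberate piece of
# metadata ("obfuscate").
field = {"name": "ssn", "type": "string", "defualt": "", "obfuscate": True}

# Both extra keys are reported. From the schema alone there is no way to
# tell the typo from the intentional metadata, so a checker that follows
# the spec has to accept both.
print(sorted(unknown_field_attrs(field)))
```

A closed check like this catches the typo but would also reject legitimate
metadata, so it can only be offered as an optional stricter layer.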

> Is there code in the avro project that is manipulating schemas and
> stripping metadata silently?  I would consider that a bug.  For
> external tools, it could either be a bug or undocumented behaviour.
>
> All my best, Ryan
>
> On Mon, Dec 9, 2019 at 5:14 PM roger peppe <[email protected]> wrote:
> >
> > Somewhat relevant, here is a CUE schema for Avro schemas that I wrote a
> > little while ago that can be used to check Avro schema compliance to a
> > degree (if you haven't heard of CUE, there's a bunch of info on it at
> > cuelang.org).
> >
> > My understanding of Avro was somewhat less then, so it's probably wrong
> > in parts, and it's definitely not as strict as it could be, but I've found
> > it useful, and it has lots of room for improvement.
> >
> >   cheers,
> >     rog.
> >
> >
> >
> > On Fri, 6 Dec 2019 at 17:43, Jonah H. Harris <[email protected]> wrote:
> >>
> >> On Fri, Dec 6, 2019 at 12:16 PM Ryan Skraba <[email protected]> wrote:
> >>>
> >>> Hello!  Yes, it looks like `fixed` is the only named complex type that
> >>> doesn't have a doc attribute.  No primitive types have the doc
> >>> attribute.
> >>>
> >>> This might be an omission, but I don't think it's inconsistent.  In my
> >>> experience, there's no compelling reason to document schemas of
> >>> primitive types, but a good practice for the fields or container types
> >>> that they're inside.  Fixed is not a primitive type, but in practice
> >>> it's used like bytes (which is).
> >>
> >>
> >> Hey, Ryan. Thanks for getting back to me so quickly.
> >>
> >> Yeah. I don't think primitive types need the doc attribute. As fixed is
> >> complex and can be an independent type, however, I thought that was
> >> inconsistent with the other complex types.
> >>
> >>>
> >>> In my opinion, I wouldn't consider it important to make the doc
> >>> attribute universal on any type/field, but I wouldn't have any strong
> >>> objection if that were the consensus.  Today, I'm pretty sure that the
> >>> Java implementation corresponds to the spec with regards to the doc
> >>> attribute.
> >>
> >>
> >> Agreed.
> >>
> >>>
> >>> As a minimum, I'd propose that the only action here is to change the
> >>> IDL guide: "Comments that begin with /** are used as the documentation
> >>> string (if applicable) for the type or field definition that follows
> >>> the comment."
> >>>
> >>> Is this what you're looking for?
> >>
> >>
> >> Yes. We're actually using the doc string to store not only a textual
> >> description of the field/type, but also a set of annotations used for event
> >> storage and data masking. The main reason we wanted doc to be consistent
> >> for all complex types (including fixed) is that it permits us to easily
> >> tell what complex objects can exist across the ecosystem directly from our
> >> schema repository. Initially, we wanted to use a separate internal
> >> attribute (similar to the lenses obfuscate attribute approach --
> >> https://docs.lenses.io/2.0/install_setup/datagovernance/index.html#data-anonymization),
> >> but we've found several Avro tools strip out all non-spec-compliant
> >> attributes. This leaves us only the doc field.
> >>
> >>> P.S. I'm very intrigued by the "thorough schema compliance checker"!
> >>> Is this something that would be shared? Would it help find other
> >>> inconsistencies in the Avro spec and implementations?
> >>
> >>
> >> Yes, this will be open-sourced.
> >>
> >> --
> >> Jonah H. Harris
> >>
>
