I'm about to make all of this even more confusing.

For pair-wise resolution when the operation is deserialization, "reader" and "writer" make sense. In a more general sense it is simply "from" and "to" -- one might move from schema A to B without serialization at all, transforming a data structure, or simply want a view of data in the form of A as if it were in B. There aren't any clear naming winners, and many names sound good for one use case but worse for others: 'source' and 'destination', 'source' and 'sink', 'original' and 'target', 'expected' and 'actual', 'reader' and 'writer', 'resolver' and 'resolvee', 'sender' and 'receiver'.

As part of AVRO-1124 I recently met in person with a few folks who needed enhancements to that ticket (the discussion and conclusion will be added there shortly, prior to the next patch version). The result is that two names are not enough, because expressing resolution of _sets_ of schemas is more complicated than pairs.

When describing a set of schemas that represent some sort of data that may have been persisted, six states are needed. The six states are made up of two dimensions:

* The "read" dimension is binary, and represents whether a schema is used for reading or not (is ever a "to", "reader", or "target").
* The "write" dimension has three states: Writer (an active "from" or "source"), Written (persisted data, not actively written), and None (not used for writing).

The naming of these will be confusing, so as part of AVRO-1124 we'll have to have names that are as clear as possible. Currently I have enumerations: ReadState.READER and ReadState.NONE; WriteState.WRITER, WriteState.WRITTEN, and WriteState.NONE. I am not a big fan of these names, and am open to suggestions. A consistent approach in naming is important. For example, I previously had WriteState.WRITTEN named WriteState.READABLE. That name best represents what the state is for, but it is extremely confusing.

These six states are tied together by one schema resolution rule:

  Schemas in state ReadState.READER must be able to read all schemas with WriteState.WRITER or WriteState.WRITTEN.

and one rule for persisting data:

  Data must not be persisted unless the corresponding schema is in state WriteState.WRITER.

Without going into the details, this allows for any schema evolution use case over a set of schemas with both ephemeral data and persisted data. Schemas can transition from one state to another, as long as the constraint rules above are met at all times.

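To make the two rules above concrete, here is a minimal Java sketch. It is an illustration only, not code from the AVRO-1124 patch: the class SchemaStatesSketch, the Tracked holder, and the canRead predicate are hypothetical names, and the predicate is a stand-in for whatever real schema resolution check is used. Only the enum names come from the text above.

```java
import java.util.List;
import java.util.function.BiPredicate;

final class SchemaStatesSketch {
    // Read dimension: is the schema ever used as a "to", "reader", or "target"?
    enum ReadState { READER, NONE }

    // Write dimension: actively written, persisted but no longer written, or unused.
    enum WriteState { WRITER, WRITTEN, NONE }

    // Hypothetical holder: one schema plus its position in the 2 x 3 state grid.
    static final class Tracked<S> {
        final S schema;
        final ReadState read;
        final WriteState write;

        Tracked(S schema, ReadState read, WriteState write) {
            this.schema = schema;
            this.read = read;
            this.write = write;
        }
    }

    // Resolution rule: every READER must be able to read every WRITER or WRITTEN
    // schema. canRead.test(reader, writer) is an assumed compatibility check.
    static <S> boolean isConsistent(List<Tracked<S>> set, BiPredicate<S, S> canRead) {
        for (Tracked<S> r : set) {
            if (r.read != ReadState.READER) continue;
            for (Tracked<S> w : set) {
                if (w.write == WriteState.NONE) continue;
                if (!canRead.test(r.schema, w.schema)) return false;
            }
        }
        return true;
    }

    // Persistence rule: data may be persisted only under a schema in state WRITER.
    static boolean mayPersist(Tracked<?> t) {
        return t.write == WriteState.WRITER;
    }

    public static void main(String[] args) {
        // Toy stand-in for resolution: a newer version can read an older one.
        BiPredicate<String, String> canRead = (r, w) -> r.compareTo(w) >= 0;

        Tracked<String> v1 = new Tracked<>("v1", ReadState.NONE, WriteState.WRITTEN);
        Tracked<String> v2 = new Tracked<>("v2", ReadState.READER, WriteState.WRITER);

        System.out.println(isConsistent(List.of(v1, v2), canRead)); // true
        System.out.println(mayPersist(v1));                         // false
        System.out.println(mayPersist(v2));                         // true
    }
}
```

A state transition would be allowed only if isConsistent still holds for the whole set afterwards. For instance, promoting "v1" to ReadState.READER in the toy example would fail, because under the toy predicate "v1" cannot read "v2".
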
"Reader" and "Writer" have been useful because they correlate well with other meaningful names -- hypothetically:

  boolean mySchema.canRead(Schema writer)

and

  boolean mySchema.canBeReadWith(Schema reader)

A naming scheme for describing schema resolution and evolution will need to work across many use cases and be useful for describing relationships between schemas. Describing only the pair-wise resolution is not enough.

On 6/8/13 12:44 AM, "Doug Cutting" <[email protected]> wrote:

> Originally I used the term 'actual' for the schema of the data written and
> 'expected' for the schema that the reader of the data wished to see it as.
> Some found those terms confusing and suggested that 'writer' and 'reader' were
> more intuitive, so we started using those instead. That unfortunately seems
> not to have resolved the confusion entirely.
>
> Perhaps we should improve the documentation around this? Do you have any
> specific suggestions about how that might be done?
>
> Doug
>
> On Jun 7, 2013 10:12 PM, "Gregory (Grisha) Trubetskoy" <[email protected]>
> wrote:
>>
>> I'm curious how the "Reader" and "Writer" terminology came about, and, most
>> importantly, whether it's as confusing to the rest of you as it is to me?
>>
>> As I understand it, the principal analogy here is from the RPC world - a
>> process A writes some Avro to process B, in which case A is the writer and B
>> is the reader.
>>
>> And there is the possibility that the schema which B may be expecting isn't
>> what A is providing, thus B may have to do some conversion on its end to grok
>> it, and Avro schema resolution rules may make this possible.
>>
>> So far so good. This is where it becomes confusing. I am lost on how the act
>> of reading or writing is relevant to the task at hand, which is conversion of
>> a value from one schema to another.
>>
>> As I read stuff on the lists and the docs, I couldn't help noticing words
>> such as "original", "first", "second", "actual", "expected" being used
>> alongside "reader" and "writer" as clarification.
>>
>> What would be wrong with "source" and "destination" schemas?
>>
>> Consider the following line (from Avro-C):
>>
>> writer_iface = avro_resolved_writer_new(writer_schema, reader_schema);
>>
>> Here "writer" in resolved_writer and writer_schema are unrelated. The former
>> refers to the fact that this interface will be modifying (writing to) an
>> object, the latter is referring to the writer (source, original, a.k.a.
>> actual) schema.
>>
>> Wouldn't this read better as:
>>
>> writer_iface = avro_resolved_writer_new(source_schema, dest_schema);
>>
>> Anyway - I just want to know if I'm missing something obvious when I think
>> that reader/writer is confusing.
>>
>> Thanks,
>>
>> Grisha
