On Mon, Nov 29, 2010 at 10:25 AM, Doug Cutting <[email protected]> wrote:

> If this worst-case transpired, I don't think it would be too difficult for
> most datasets to reconstruct the schema by examining the data.  With
> ProtocolBuffers and Thrift, if the IDL is lost you'd be in a similar,
> although simpler, situation of having to figure out field names and types.
>  Folks regularly reverse-engineer much more complex stuff than this.
>

I don't follow how this would be possible with Avro. With no type
information, how would you tell the difference between an array of ints, a
bunch of enums, a binary chunk of data, or even just a string? Thrift and
Protobufs carry the types in the encoding, so recovering the structure would
be trivial; only the meaning would need to be re-derived.
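
To make the contrast concrete, here is a hand-rolled sketch (not using the
real Avro or protobuf libraries; the values and field numbers are invented,
and I use zigzag for both, like protobuf's sint fields, to keep the
comparison apples-to-apples) of how the same three ints land on the wire:

    def varint(n):
        """Encode a non-negative integer as a base-128 varint (LSB first)."""
        out = bytearray()
        while True:
            b = n & 0x7F
            n >>= 7
            if n:
                out.append(b | 0x80)
            else:
                out.append(b)
                return bytes(out)

    def zigzag(n):
        """Avro's zigzag mapping for signed ints (small magnitudes stay small)."""
        return (n << 1) ^ (n >> 63)

    record = [5, -3, 300]      # three int fields; names unknown to the reader

    # Avro: just the three zigzag varints, nothing else.
    avro_bytes = b"".join(varint(zigzag(v)) for v in record)

    # Protobuf: a tag byte ((field_number << 3) | wire_type 0) before each value.
    pb_bytes = b"".join(varint((i << 3) | 0) + varint(zigzag(v))
                        for i, v in enumerate(record, start=1))

    print(avro_bytes.hex())    # '0a05d804' -- could be anything without the schema
    print(pb_bytes.hex())      # the tag bytes reveal "field 1: varint, field 2: varint, ..."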

That said, you could store the Id->Schema mapping in multiple places. Among
> other places, it could be in your source code repository.
>


Imagine we were going to replace MySQL's row-packing format with Avro.
Currently MySQL has one "row-packing format" per table; to add or remove a
column, one must rewrite the whole table. To mimic that behavior, we would
keep a single copy of the schema in system metadata. However, if that copy
were corrupted, the table contents would be completely unintelligible.
Storing the Avro schema in source code isn't something the client would
always do, because they created the schema with SQL and might not even care
that Avro is used inside. Of course one might argue that this is a use case
where Avro provides no value over custom-coded packings like MySQL's, but I
have a different opinion. One significant performance limit in databases is
the need to constantly unpack and repack record values to deliver them. If
the native on-disk format is a standardized one that other tools understand,
a "fast path" around the SQL parser gives users a way to avoid that overhead
when it's useful.
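
A purely hypothetical sketch of that arrangement (the table and field names
are invented), just to make the layout concrete:

    import json

    # "System metadata": one schema per table, stored exactly once.
    catalog = {
        "users": json.dumps({
            "type": "record", "name": "users",
            "fields": [{"name": "id",    "type": "int"},
                       {"name": "age",   "type": "int"},
                       {"name": "score", "type": "int"}],
        })
    }

    # The table file itself would be a bare sequence of Avro-encoded rows,
    # with no schema and no field names in it. Any Avro-aware tool (the
    # "fast path") can decode it as long as catalog["users"] survives;
    # if that one copy is corrupted, the rows are opaque bytes.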

> If you really want to keep a bit more of the descriptive information, you
> could also just consider formats that do include property names, like
> JSON (with compression). Depending on exactly what you plan to store, it
> might be a competitive choice all around.


Compressing a single copy of the JSON schema would not produce nearly a
small enough representation, because the field names would not be repeated
in the stream and so could not be compressed out. Consider a schema that is
just three named integers: the field names would make the compressed schema
much bigger than the data (compressed or not). I obviously could remove the
field names... which is the idea I mentioned in my earlier email, to store a
packed form of just the type structure.

Which means that, in order to get good compression of the field names, the
output would need to be block-compressed across multiple records, and I
don't want that requirement.
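
For what it's worth, here is roughly what I have in mind for that packed
type structure. The function name and the exact reduction are just an
illustration, not a proposal for Avro itself:

    import json

    def type_signature(schema):
        """Reduce an Avro schema (already json.loads-ed) to a name-free shape."""
        if isinstance(schema, str):          # primitive: "int", "string", ...
            return schema
        if isinstance(schema, list):         # union
            return [type_signature(s) for s in schema]
        t = schema["type"]
        if t == "record":
            return {"record": [type_signature(f["type"]) for f in schema["fields"]]}
        if t == "array":
            return {"array": type_signature(schema["items"])}
        if t == "map":
            return {"map": type_signature(schema["values"])}
        if t == "enum":
            return {"enum": len(schema["symbols"])}   # keep only the symbol count
        if t == "fixed":
            return {"fixed": schema["size"]}
        return t

    schema = json.loads('{"type": "record", "name": "r", "fields": ['
                        '{"name": "id", "type": "int"},'
                        '{"name": "age", "type": "int"},'
                        '{"name": "score", "type": "int"}]}')
    print(json.dumps(type_signature(schema)))   # {"record": ["int", "int", "int"]}

Something like that stays tiny no matter how wide the record is, and it is
enough to walk the binary data even if the full named schema is lost.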

> I don't think either Avro or Thrift is actually aimed so much for storing
> data as for transferring data; since the issue of persisting schemas does
> complicate things significantly (same is true with protobuf too, just even
> more so).


I'm not sure why you say this. I'm no longer at Google, but while I was
there protobuf was used extensively for stored data schemas. If I could
share how much "protobuf schema data" was stored, you'd probably think
'extensively' is an understatement. The Google Sawzall paper gives a glimpse
of how this was used. The schemas were recorded in a central repository (the
source control tree), but if one were ever lost, the types in the binary
format would still allow some limited ability to read the data.


> And Avro specifically seems like best fit for sequences of homogenous data
> entries (rows of DB, log entries etc). This may or may not be similar to
> your use case. But maybe there are other reasons why you have limited
> choice to just these two formats?


Actually, Thrift and Protobufs are both perfectly good binary formats for my
application, so if that were the only issue I wouldn't need to look beyond
them. As I stated in my first post, I also want an implementation that
allows clients in a variety of languages to read schemas and dynamically
interpret data; Hive/Pig is a good example. From the Google Sawzall paper
you can see that Google has this internally as part of Sawzall, but neither
the public protobuf project nor Thrift has this capability built in.
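
For example, with the standard avro Python package (the Java
GenericDatumReader behaves the same way), a client can read a file it has
never seen a generated class for; the file name here is made up:

    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    # The writer's schema is discovered from the file at runtime;
    # no code generation is involved.
    reader = DataFileReader(open("rows.avro", "rb"), DatumReader())
    for row in reader:          # each row comes back as a plain dict keyed by field name
        print(row)
    reader.close()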

Avro does have this capability to dynamically read schemas in
multi-language client code, so I came here to ask whether there was a way to
get the slightly better data safety I'd like. I believe the workaround I
mentioned earlier in the thread might be acceptable (storing a packed form
of the type structure).

Thanks again for the comments!
