Cross-posting to the Thrift dev and user lists since folks there may be
interested in this. It appears that my attempts to subscribe to
[email protected] from my work email were silently failing somewhere
along the line -- I'll try not to take it personally. ;) Some others have
experienced this too -- so if you didn't get a subscription confirmation
message, then it failed. Try from a different address, I guess. You can view
the thread here without being subscribed:
http://mail-archives.apache.org/mod_mbox/hadoop-general/200904.mbox/browser
Doug,
First, let me say that I think Avro has a lot of useful features -- features
that I would like to see fully supported in Thrift. At a minimum, I would like
for us to be able to hash out the details to guarantee that there can really be
full interoperability between Avro and Thrift. I am really interested in
working cooperatively and collaboratively on this and I am willing to put in
significant time on design and communication to help make full interoperability
possible (I am unfortunately not able to contribute code directly at this time).
Second, I think the decision about where Avro should live requires more
thought and more discussion. I'd love to hear from more folks outside of Yahoo
on this topic: so far all of the +1 votes have come from Yahoo employees. I'd
also love to hear from other folks who have significant investments in both
Thrift and Hadoop.
Some points to think about:
-- You suggest that there is not a lot in Thrift that Avro can leverage. I
think you may be overlooking the fact that Thrift has a user base and a
community of developers who are very interested in issues of cross-language
data serialization and interoperability. Thrift has committers with expertise
in a pretty big set of languages and leveraging this could get Avro's
functionality into more languages faster than the current path would. Also, there is
in fact significant overlap between Hadoop users and Thrift users at this
point, as well as significant use of Thrift in more than one Hadoop sub-project.
At the code level, Thrift contains a transport abstraction and multiple
different transport and server implementations in many different target
languages. If there were closer collaboration, Avro could certainly benefit
from leveraging the existing ones and any additional contributions in this area
would benefit both projects.
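To make the code-level point concrete, here is a minimal sketch of that
layering in Thrift's Java bindings (the class name, host, and port are mine,
and the choice of framed transport and binary protocol is just illustrative):

    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.protocol.TProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;
    import org.apache.thrift.transport.TTransportException;

    public final class ClientSetup {
        // Layer a protocol (wire format) over a transport (byte stream).
        // Either layer can be swapped out independently -- file, memory,
        // or socket transports; binary or other formats -- without
        // touching application code.
        public static TProtocol open(String host, int port)
                throws TTransportException {
            TTransport transport = new TFramedTransport(new TSocket(host, port));
            transport.open();
            return new TBinaryProtocol(transport);
        }
    }

An Avro implementation could sit directly on top of this transport layer
rather than reinventing it.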
-- You also suggest that the two are largely disjoint from a technical
perspective:
"Thrift fundamentally standardizes an API, not a data format.
Avro fundamentally is a data format specification, like XML."
I agree with the "fundamentally" part, but I don't think that framing brings to
light enough of what the two have in common and where they differ for the
purposes of this discussion.
Thrift specifies a type system, an API for data formats and transport
mechanisms, and a schema resolution algorithm, and it provides implementations
of several distinct data formats and transports.
Avro specifies a single data format, but it also brings along several other
things, including a type system, a specific RPC mechanism, and a schema
resolution algorithm.
The most significant issue is that both of them specify a type system. At the
very minimum, I would like to see Avro and Thrift come to agreement on that type
system. The fact that there is significant existing investment in the Thrift
type system by the Thrift community should weigh somewhere in this discussion.
Obviously, the technical needs of Avro will also have weight there, but where
there is room for choice, the Thrift choices should be respected. Arbitrary
changes here will make it unnecessarily painful, perhaps impossible, for Thrift
to adopt Avro directly; instead, Thrift will be forced to make an "Avro-like"
data specification, hampering interoperability for everyone.
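To make the overlap concrete, here is a rough side-by-side of the two type
systems as I read the current Avro spec against Thrift's (treat this as a
sketch, not gospel):

    Thrift:  bool, byte, i16, i32, i64, double, string, binary,
             list<T>, set<T>, map<K,V>, struct, enum
    Avro:    boolean, int, long, float, double, string, bytes, null,
             array, map, record, enum, union, fixed

Even at this level there are choices to reconcile: Thrift has byte, i16, and
set where Avro does not; Avro has float, null, union, and fixed where Thrift
does not; and Avro map keys are always strings while Thrift allows arbitrary
key types. These are exactly the kinds of details I would like to see converge.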
There may be pitfalls in the other areas of overlap as well that would prevent
real interoperability -- let's elucidate them in further discussions.
-- Avro appears to have 3 primary features that Thrift does not currently
support sufficiently:
1. Schema serialization allowing for compact representation of files containing
large numbers of records of identical types
2. Dynamic interpretation of schemas, which improves ease-of-use in dynamic
languages (like the Python Hadoop Streaming use case) -- see the example schema
below
3. Lazy partial deserialization to support "projection"
Note that features 1 and 3 are independent of whether schemas are dynamically
interpreted or compiled into static bindings.
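For reference, here is roughly what a record schema looks like in Avro's JSON
form (the record and field names are my own):

    {"type": "record", "name": "Employee",
     "fields": [
         {"name": "name", "type": "string"},
         {"name": "age",  "type": "int"}
     ]}

A dynamic implementation parses this at runtime and can then read and write
conforming records with no generated code at all.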
WRT #1: Thrift's DenseProtocol goes some distance towards this, although it
doesn't go the whole way. Thrift can easily be extended to further compact the
DenseProtocol's wire format for the special case where all fields are required.
We have previously had significant discussions on the Thrift list about doing
more in this area, but we couldn't get the folks from Hadoop who cared most
about this use case to help us capture a complete set of requirements, so there
was no strong driver for it.
WRT #2: I totally understand the case you make for dynamic interpretation in ad
hoc data processing. I would love to see Thrift enhanced to do this kind of
thing.
WRT #3: Partial deserialization seems like a really useful feature for several
use cases, not just for "projection". I think Thrift could and should be
extended to support this functionality, and it should be available for both
static bindings and dynamic schema interpretation via field names and field IDs
where possible.
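Here is a sketch of what projection could look like over Thrift's existing
protocol API in Java. The struct layout and field ID are hypothetical, but
TProtocolUtil.skip() is what generated code already uses to pass over unknown
fields:

    import org.apache.thrift.TException;
    import org.apache.thrift.protocol.TField;
    import org.apache.thrift.protocol.TProtocol;
    import org.apache.thrift.protocol.TProtocolUtil;
    import org.apache.thrift.protocol.TType;

    public final class Projection {
        // Read only field ID 1 (assumed to be a string) from a struct,
        // skipping every other field without fully decoding it.
        public static String readNameOnly(TProtocol iprot) throws TException {
            String name = null;
            iprot.readStructBegin();
            while (true) {
                TField field = iprot.readFieldBegin();
                if (field.type == TType.STOP) {
                    break;
                }
                if (field.id == 1 && field.type == TType.STRING) {
                    name = iprot.readString();             // projected field
                } else {
                    TProtocolUtil.skip(iprot, field.type); // skip, don't decode
                }
                iprot.readFieldEnd();
            }
            iprot.readStructEnd();
            return name;
        }
    }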
-- You state:
"Perhaps Thrift could be augmented to support Avro's JSON schemas and
serialization. Then it could interoperate with other Avro-based
systems. But then Thrift would have yet another serialization format,
that every language would need to implement for it to be useful..."
First, that "Perhaps" hides a lot of complexity and unless that is hashed out
ahead of time I am pretty sure the real answer will be "Thrift cannot be
augmented to support Avro directly but instead could be augmented to support
something that looks quite a bit like Avro but differs in mostly unimportant
ways." To me that seems like a shame.
Furthermore, you say that last part ("Thrift would have yet another
serialization format...") like it is a bad thing... Note that it is an explicit
design goal of Thrift to allow for multiple different serialization formats so
that lots of different use cases can be supported by the same fundamental
framework. This is a clear recognition that there is no one-size-fits-all
answer for data serialization (fast RPC vs compact archival record data vs
human readability, to name a few salient use cases). For a compelling enough
use case, there is no reason not to port new protocols across multiple
languages (generally done on an as-needed basis by someone who wants that
functionality in that language). Another great feature of the protocol
abstraction is that it allows data to be seamlessly moved from one
serialization format to another as, say, it is read out of archival storage and
sent on as RPC.
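In Java that move is just a read and a write, since every generated struct
serializes against the TProtocol interface. "LogEntry" below is a hypothetical
generated struct, and the two protocols could be backed by anything (file,
memory, socket):

    import org.apache.thrift.TException;
    import org.apache.thrift.protocol.TProtocol;

    public final class Reserialize {
        // Decode from one wire format and re-encode in another, with
        // no changes to application code.
        public static void copy(TProtocol archive, TProtocol rpc)
                throws TException {
            LogEntry entry = new LogEntry(); // hypothetical generated struct
            entry.read(archive);             // e.g. the archival format
            entry.write(rpc);                // e.g. the RPC wire format
        }
    }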
Also, doesn't Avro essentially contain "another serialization format that every
language would need to implement for it to be useful"? Seems like the same
basic set of work to me, whether it is in Avro or Thrift.
-- You state:
"Avro fundamentally is a data format specification, like XML. Thrift could
implement this specification. The Avro project includes reference
implementations, but the format is intended to be simple enough and the
specification stable enough that others might reasonably develop alternate,
independent implementations."
I think this is a bit inaccurate. First, there is the issue of type system
compatibility that I raised above, which makes it doubtful that the "could" can
be satisfied without refinement of, and collaboration on, Avro's specification.
Furthermore, the stated goal of the subproject is "for Avro to replace both
Hadoop's RPC and to be used for most Hadoop data files". This will bring in
quite a bit beyond a reference implementation of a data format specification,
especially depending on how many languages you intend to build RPC support for
(Java, Python, C++ all mentioned at some point -- others?). I don't think it is
unreasonable that a significant proportion of folks in the Hadoop community
who are also using Thrift are puzzled about why there isn't more consideration
being given to convergence between Avro and Thrift.
-- You state:
"Also, with the schema, resolving version differences is simplified.
Developers don't need to assign field numbers, but can just use names.
For performance, one can internally use field numbers while reading, to
avoid string comparisons, but developers need no longer specify these,
but can use names, as in most software. Here having the schema means we
can simplify the IDL and its versioning semantics."
The simplification comes simply from not having the field IDs in the IDL? I am
not sure why putting sequential ID numbers after each field is considered to be
so onerous. I honestly have never heard a single Thrift user complain about this.
Anyone doing more than just that is doing something advanced that wouldn't be
possible without the field IDs (like renaming a field). I think having to deal
with JSON syntax in the Avro IDL is actually more annoying for humans than the
application of field IDs, given both the added syntactic punctuation and the
increased verbosity. If the field IDs are really so objectionable, Thrift could
allow them to be optional for purely dynamic usages.
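To put the two notations side by side, here is the same toy schema (my own
example) written per each spec:

    Thrift IDL:
        struct Person {
          1: string name,
          2: i32 age
        }

    Avro JSON:
        {"type": "record", "name": "Person",
         "fields": [
             {"name": "name", "type": "string"},
             {"name": "age",  "type": "int"}
         ]}

The two field IDs cost a handful of characters; the JSON form pays in quotes,
braces, and nesting on every line.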
I also don't see why matching names is considered easier than matching numbers,
which is essentially what the versioning semantics come down to in the end. Am
I missing something here?
-- You state:
"Would you write parsers for Thrift's IDL in every language? Or would
you use JSON, as Avro does, to avoid that?"
Here I totally agree with you: a JSON IDL is better for machine parsing than
Thrift's current IDL, which is targeted more at human parsing. And given that I
agree that some form of dynamic interpretation is a useful feature, I don't see
any reason why a JSON version of the IDL couldn't become part of the picture.
Furthermore, the Thrift IDL compiler could easily be extended to accept this
JSON format as input (in addition to the current Thrift IDL) and to emit it as
output.
An alternative would just be to have the other languages bind to the Thrift IDL
parser directly -- most languages can bind to C (granted, for some it is easier
than for others) -- and get back the parsed data structure to interpret from.
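Purely as a hypothetical, the compiler's JSON output for the Person struct
above might look something like this -- nothing about Thrift's model resists
such an encoding:

    {"name": "Person",
     "fields": [
         {"id": 1, "name": "name", "type": "string"},
         {"id": 2, "name": "age",  "type": "i32"}
     ]}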
-- By making Avro a sub-project of Hadoop, I believe you will succeed in
producing an improved version of Hadoop Record IO and a better RPC mechanism
than the current Hadoop RPC. However, I don't think that this will result in a
better general RPC mechanism than Thrift, and the result will certainly be much
less performant for RPC in a wide range of applications.
Consider an alternative: making Avro a sub-project of Thrift instead, or just
implementing it directly in Thrift. In that case, I think the end result will
be a powerful and flexible "one-stop shop" for data serialization for RPC and
archival purposes, with the ability to bring in static and dynamic capabilities
as particular applications need them. To me this seems like a bigger win for
both Hadoop and for Thrift.
Thanks for reading through to this point. I look forward to further discussion.
Chad