Cross-posting to the Thrift dev and user lists since folks there may be
interested in this. It appears that my attempts to subscribe to
[email protected] from my work email were silently failing somewhere
along the line -- I'll try not to take it personally. ;) Some others have
experienced this too -- so if you didn't get a subscription confirmation
message, then it failed. Try from a different address, I guess. You can view
the thread here without being subscribed:
http://mail-archives.apache.org/mod_mbox/hadoop-general/200904.mbox/browser
Doug,
First, let me say that I think Avro has a lot of useful features -- features
that I would like to see fully supported in Thrift. At a minimum, I would like
for us to be able to hash out the details to guarantee that there can really be
full interoperability between Avro and Thrift. I am really interested in
working cooperatively and collaboratively on this and I am willing to put in
significant time on design and communication to help make full interoperability
possible (I am unfortunately not able to contribute code directly at this time).
Second, I think the decision about where Avro should live requires more
thought and more discussion. I'd love to hear from more folks outside of Yahoo
on this topic: so far all of the +1 votes have come from Yahoo employees. I'd
also love to hear from other folks who have significant investments in both
Thrift and Hadoop.
Some points to think about:
-- You suggest that there is not a lot in Thrift that Avro can leverage. I
think you may be overlooking the fact that Thrift has a user base and a
community of developers who are very interested in issues of cross-language
data serialization and interoperability. Thrift has committers with expertise
in a pretty big set of languages and leveraging this could get Avro's
functionality into more languages faster than the current path would. Also, there is
in fact significant overlap between Hadoop users and Thrift users at this
point, as well as significant use of Thrift in more than one Hadoop sub-project.
At the code level, Thrift contains a transport abstraction and multiple
different transport and server implementations in many different target
languages. If there were closer collaboration, Avro could certainly benefit
from leveraging the existing ones and any additional contributions in this area
would benefit both projects.
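To make the code-level point concrete, here is a minimal sketch of that
layering in Thrift's Java bindings (the class name, host, and port are mine,
and the choice of framed transport and binary protocol is just illustrative):

    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.protocol.TProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;
    import org.apache.thrift.transport.TTransportException;

    public final class ClientSetup {
        // Layer a protocol (wire format) over a transport (byte stream).
        // Either layer can be swapped out independently -- file, memory,
        // or socket transports; binary or other formats -- without
        // touching application code.
        public static TProtocol open(String host, int port)
                throws TTransportException {
            TTransport transport = new TFramedTransport(new TSocket(host, port));
            transport.open();
            return new TBinaryProtocol(transport);
        }
    }

An Avro implementation could sit directly on top of this transport layer
rather than reinventing it.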
-- You also suggest that the two are largely disjoint from a technical
perspective:
"Thrift fundamentally standardizes an API, not a data format.
Avro fundamentally is a data format specification, like XML."
I agree with the "fundamentally" part, but I don't think that framing brings to
light enough of what the two have in common and where they differ for the
purposes of this discussion.
Thrift specifies a type system, an API for data formats and transport
mechanisms, and a schema resolution algorithm, and it provides implementations
of several distinct data formats and transports.
Avro specifies a single data format, but it also brings along several other
things, including a type system, a specific RPC mechanism, and a schema
resolution algorithm.
The most significant issue is that both of them specify a type system. At the
very minimum, I would like to see Avro and Thrift come to agreement on that type
system. The fact that there is significant existing investment in the Thrift
type system by the Thrift community should weigh somewhere in this discussion.
Obviously, the technical needs of Avro will also have weight there, but where
there is room for choice, the Thrift choices should be respected. Arbitrary
changes here will make it unnecessarily painful, perhaps impossible, for Thrift
to adopt Avro directly; instead, Thrift will be forced to make an "Avro-like"
data specification, hampering interoperability for everyone.
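To make the overlap concrete, here is a rough side-by-side of the two type
systems as I read the current Avro spec against Thrift's (treat this as a
sketch, not gospel):

    Thrift:  bool, byte, i16, i32, i64, double, string, binary,
             list<T>, set<T>, map<K,V>, struct, enum
    Avro:    boolean, int, long, float, double, string, bytes, null,
             array, map, record, enum, union, fixed

Even at this level there are choices to reconcile: Thrift has byte, i16, and
set where Avro does not; Avro has float, null, union, and fixed where Thrift
does not; and Avro map keys are always strings while Thrift allows arbitrary
key types. These are exactly the kinds of details I would like to see converge.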
There may be pitfalls in the other areas of overlap as well that would prevent
real interoperability -- let's elucidate them in further discussions.
-- Avro appears to have 3 primary features that Thrift does not currently
support sufficiently:
1. Schema serialization allowing for compact representation of files containing
large numbers of records of identical types
2. Dynamic interpretation of schemas, which improves ease-of-use in dynamic
languages (like the Python Hadoop Streaming use case) -- see the example schema
below
3. Lazy partial deserialization to support "projection"
Note that features 1 and 3 are independent of whether schemas are dynamically
interpreted or compiled into static bindings.
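For reference, here is roughly what a record schema looks like in Avro's JSON
form (the record and field names are my own):

    {"type": "record", "name": "Employee",
     "fields": [
         {"name": "name", "type": "string"},
         {"name": "age",  "type": "int"}
     ]}

A dynamic implementation parses this at runtime and can then read and write
conforming records with no generated code at all.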
WRT #1: Thrift's DenseProtocol goes some distance towards this, although it
doesn't go the whole way. Thrift can easily be extended to further compact the
DenseProtocol's wire format for the special case where all fields are required.
We have previously had significant discussions on the Thrift list about doing
more in this area, but we couldn't get the folks from Hadoop who cared most
about this use case to help us capture a complete set of requirements, so there
was no strong driver for it.
WRT #2: I totally understand the case you make for dynamic interpretation in ad
hoc data processing. I would love to see Thrift enhanced to do this kind of
thing.
WRT #3: Partial deserialization seems like a really useful feature for several
use cases, not just for "projection". I think Thrift could and should be
extended to support this functionality, and it should be available for both
static bindings and dynamic schema interpretation via field names and field IDs
where possible.
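Here is a sketch of what projection could look like over Thrift's existing
protocol API in Java. The struct layout and field ID are hypothetical, but
TProtocolUtil.skip() is what generated code already uses to pass over unknown
fields:

    import org.apache.thrift.TException;
    import org.apache.thrift.protocol.TField;
    import org.apache.thrift.protocol.TProtocol;
    import org.apache.thrift.protocol.TProtocolUtil;
    import org.apache.thrift.protocol.TType;

    public final class Projection {
        // Read only field ID 1 (assumed to be a string) from a struct,
        // skipping every other field without fully decoding it.
        public static String readNameOnly(TProtocol iprot) throws TException {
            String name = null;
            iprot.readStructBegin();
            while (true) {
                TField field = iprot.readFieldBegin();
                if (field.type == TType.STOP) {
                    break;
                }
                if (field.id == 1 && field.type == TType.STRING) {
                    name = iprot.readString();             // projected field
                } else {
                    TProtocolUtil.skip(iprot, field.type); // skip, don't decode
                }
                iprot.readFieldEnd();
            }
            iprot.readStructEnd();
            return name;
        }
    }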
-- You state:
"Perhaps Thrift could be augmented to support Avro's JSON schemas and
serialization. Then it could interoperate with other Avro-based
systems. But then Thrift would have yet another serialization format,
that every language would need to implement for it to be useful..."
First, that "Perhaps" hides a lot of complexity and unless that is hashed out
ahead of time I am pretty sure the real answer will be "Thrift cannot be
augmented to support Avro directly but instead could be augmented to support
something that looks quite a bit like Avro but differs in mostly unimportant
ways." To me that seems like a shame.
Furthermore, you say that last part ("Thrift would have yet another
serialization format...") like it is a bad thing... Note that it is an explicit
design goal of Thrift to allow for multiple different serialization formats so
that lots of different use cases can be supported by the same fundamental
framework. This is a clear recognition that there is no one-size-fits-all
answer for data serialization (fast RPC vs compact archival record data vs
human readability, to name a few salient use cases). For a compelling enough
use case, there is no reason not to port new protocols across multiple
languages (generally done on an as-needed basis by someone who wants that
functionality in that language). Another great feature of the protocol
abstraction is that it allows data to be seamlessly moved from one
serialization format to another as, say, it is read out of archival storage and
sent on as RPC.
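In Java that move is just a read and a write, since every generated struct
serializes against the TProtocol interface. "LogEntry" below is a hypothetical
generated struct, and the two protocols could be backed by anything (file,
memory, socket):

    import org.apache.thrift.TException;
    import org.apache.thrift.protocol.TProtocol;

    public final class Reserialize {
        // Decode from one wire format and re-encode in another, with
        // no changes to application code.
        public static void copy(TProtocol archive, TProtocol rpc)
                throws TException {
            LogEntry entry = new LogEntry(); // hypothetical generated struct
            entry.read(archive);             // e.g. the archival format
            entry.write(rpc);                // e.g. the RPC wire format
        }
    }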
Also, doesn't Avro essentially contain "another serialization format that every
language would need to implement for it to be useful"? Seems like the same
basic set of work to me, whether it is in Avro or Thrift.
-- You state:
"Avro fundamentally is a data format specification, like XML. Thrift could
implement this specification. The Avro project includes reference
implementations, but the format is intended to be simple enough and the
specification stable enough that others might reasonably develop alternate,
independent implementations."
I think this is a bit inaccurate. First, there is the issue of type system
compatibility that I raised above, which makes it doubtful that the "could" can
be satisfied without refinement of, and collaboration on, Avro's specification.
Furthermore, the stated goal of the subproject is "for Avro to replace both
Hadoop's RPC and to be used for most Hadoop data files". This will bring in
quite a bit beyond a reference implementation of a data format specification,
especially depending on how many languages you intend to build RPC support for
(Java, Python, C++ all mentioned at some point -- others?). I don't think it is
unreasonable that a significant proportion of folks in the Hadoop community
who are also using Thrift are puzzled about why there isn't more consideration
being given to convergence between Avro and Thrift.
-- You state:
"Also, with the schema, resolving version differences is simplified.
Developers don't need to assign field numbers, but can just use names.
For performance, one can internally use field numbers while reading, to
avoid string comparisons, but developers need no longer specify these,
but can use names, as in most software. Here having the schema means we
can simplify the IDL and its versioning semantics."
The simplification comes simply from not having the field IDs in the IDL? I am
not sure why putting sequential ID numbers after each field is considered to be
so onerous. I honestly have never heard a single Thrift user complain about this.
Anyone doing more than just that is doing something advanced that wouldn't be
possible without the field IDs (like renaming a field). I think having to deal
with JSON syntax in the Avro IDL is actually more annoying for humans than the
application of field IDs, given both the added syntactic punctuation and the
increased verbosity. If the field IDs are really so objectionable, Thrift could
allow them to be optional for purely dynamic usages.
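To put the two notations side by side, here is the same toy schema (my own
example) written per each spec:

    Thrift IDL:
        struct Person {
          1: string name,
          2: i32 age
        }

    Avro JSON:
        {"type": "record", "name": "Person",
         "fields": [
             {"name": "name", "type": "string"},
             {"name": "age",  "type": "int"}
         ]}

The two field IDs cost a handful of characters; the JSON form pays in quotes,
braces, and nesting on every line.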
I also don't see why matching names is considered easier than matching numbers,
which is essentially what the versioning semantics come down to in the end. Am
I missing something here?
-- You state:
"Would you write parsers for Thrift's IDL in every language? Or would
you use JSON, as Avro does, to avoid that?"
Here I totally agree with you: a JSON IDL is better for machine parsing than
Thrift's current IDL, which is targeted more at human parsing. And given that I
agree that some form of dynamic interpretation is a useful feature, I don't see
any reason why a JSON version of the IDL couldn't become part of the picture.
Furthermore, the Thrift IDL compiler could easily be extended to accept this
JSON format as input (in addition to the current Thrift IDL) and to emit it as
output.
An alternative would just be to have the other languages bind to the Thrift IDL
parser directly -- most languages can bind to C (granted, for some it is easier
than for others) -- and get back the parsed data structure to interpret from.
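Purely as a hypothetical, the compiler's JSON output for the Person struct
above might look something like this -- nothing about Thrift's model resists
such an encoding:

    {"name": "Person",
     "fields": [
         {"id": 1, "name": "name", "type": "string"},
         {"id": 2, "name": "age",  "type": "i32"}
     ]}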
-- By making Avro a sub-project of Hadoop, I believe you will succeed in
producing an improved version of Hadoop Record IO and a better RPC mechanism
than the current Hadoop RPC. However, I don't think that this will result in a
better general RPC mechanism than Thrift, and the result will certainly be much
less performant for RPC in a wide range of applications.
Consider an alternative: making Avro a sub-project of Thrift instead, or just
implementing it directly in Thrift. In that case, I think the end result will
be a powerful and flexible "one-stop shop" for data serialization for RPC and
archival purposes, with the ability to bring in static and dynamic capabilities
as particular applications need them. To me this seems like a bigger win for
both Hadoop and for Thrift.
Thanks for reading through to this point. I look forward to further discussion.
Chad