I think I have a pretty good understanding of why he can't use Thrift as it
stands.

His primary use case is the same as Hadoop recordio: big files with lots of
similar records in them. So he wants to be able to put a data description
header and then have a big stream of records that conform to that header and
which don't need type fields interspersed.
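To make that concrete, here is a rough sketch of that kind of layout (the field names, the length-prefixed encoding, and the `write_stream` helper are all made up for illustration; this is not Avro's actual container format):

```python
import io
import json
import struct

# Hypothetical schema: a list of (field name, field type) pairs.
SCHEMA = {"fields": [["user", "string"], ["clicks", "int"]]}

def write_stream(records):
    """Write one JSON schema header, then records with no per-field type tags."""
    buf = io.BytesIO()
    header = json.dumps(SCHEMA).encode("utf-8")
    buf.write(struct.pack(">I", len(header)))  # 4-byte header length prefix
    buf.write(header)                          # schema written exactly once
    for rec in records:
        for fname, ftype in SCHEMA["fields"]:
            if ftype == "string":
                data = rec[fname].encode("utf-8")
                buf.write(struct.pack(">I", len(data)))  # length, then bytes
                buf.write(data)
            else:  # "int"
                buf.write(struct.pack(">i", rec[fname]))
    return buf.getvalue()

blob = write_stream([{"user": "ann", "clicks": 3}, {"user": "bob", "clicks": 7}])
```

The point is that the record bodies carry only values in schema order; the types live in the header, so a million similar records pay for the schema once.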

In addition, he wants it all to work dynamically so that, for example, a
Python script used in Hadoop Streaming can read the header and pull fields
out of records in the stream without needing the generated bindings.
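A sketch of what that dynamic reading could look like (again with a made-up length-prefixed layout, not Avro's real encoding): the script parses the JSON header at runtime and pulls fields out of each record by name, with no generated bindings anywhere.

```python
import io
import json
import struct

# Build a sample stream by hand: length-prefixed JSON schema header,
# then two records with no per-field type tags (hypothetical layout).
header = json.dumps({"fields": [["user", "string"], ["clicks", "int"]]}).encode("utf-8")
payload = struct.pack(">I", len(header)) + header
payload += struct.pack(">I", 3) + b"ann" + struct.pack(">i", 3)
payload += struct.pack(">I", 3) + b"bob" + struct.pack(">i", 7)

f = io.BytesIO(payload)
(hlen,) = struct.unpack(">I", f.read(4))
schema = json.loads(f.read(hlen).decode("utf-8"))  # parsed at runtime

def read_records(stream, schema):
    """Yield records as dicts, decoding each field per the runtime schema."""
    while True:
        raw = stream.read(4)
        if not raw:
            return  # clean end of stream at a record boundary
        rec = {}
        for i, (fname, ftype) in enumerate(schema["fields"]):
            if i > 0:
                raw = stream.read(4)
            if ftype == "string":
                (n,) = struct.unpack(">I", raw)
                rec[fname] = stream.read(n).decode("utf-8")
            else:  # "int"
                (rec[fname],) = struct.unpack(">i", raw)
        yield rec

rows = list(read_records(f, schema))
```

Nothing here knows the record type at import time; the schema in the header is the only source of field names and types, which is the property a Streaming script would want.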

I do think we could have worked (and still can work) to extend Thrift to
cover these use cases - hopefully we can convince Doug to work with us on this.

Chad


On 4/3/09 7:51 AM, "Bryan Duxbury" <[email protected]> wrote:

> I agree with Chad here. We should get in touch with Doug and see why
> he can't use Thrift.
> 
> -Bryan
> 
> On Apr 3, 2009, at 5:47 AM, Chad Walters wrote:
> 
>> > Are there any strong technical reasons why we couldn't fold Avro's
>> > functionality into Thrift?
>> >
>> > Back when I was trying to get Thrift into a position to replace Hadoop
>> > record IO, we talked about doing much of this.
>> >
>> > I think the main challenges here are:
>> > 1. Runtime parsing of schemas
>> > 2. Lack of static typing for some items
>> >
>> > I suggest that we talk with Doug, perhaps a face-to-face meeting
>> > with some
>> > of us and figure out if we can get him to come on board Thrift. At
>> > the end
>> > of the day, it will be a win for both the Thrift and the Hadoop
>> > communities
>> > for us to make these two Apache projects work closely together,
>> > especially
>> > since I know there is a huge overlap between the two communities.
>> >
>> > Chad
>> >
>> > On 4/2/09 10:48 PM, "David Reiss" <[email protected]> wrote:
>> >
>>> >> For those of you who don't have git, forrest, *and* Java 5
>>> >> (not 6! 5!) installed, I built the docs and put them online:
>>> >>
>>> >> http://www.projectornation.com/avro-doc/spec.html
>>> >>
>>> >> AFAICT, the main differences from Thrift are:
>>> >>
>>> >> - No code generation.  The schema is all in JSON files that are
>>> >> parsed
>>> >>   at runtime.  For Python, this is probably fine.  I'm not really
>>> >> clear
>>> >>   on how it looks for Java (maybe someone can look at the Java
>>> >> tests and
>>> >>   explain it to the rest of us).  For C++, this will definitely make
>>> >>   the avro objects feel clunky because you'll have to access
>>> >> properties
>>> >>   by name.  And the lists won't be statically typed.
>>> >> - The full schema is included with the messages, rather than having
>>> >>   field ids delimit the contents.  This is nice for big Hadoop files
>>> >>   since you only include the schema once.  (It was a technique that
>>> >>   we discussed for Thrift.)  For a system like (I guess) Hadoop that
>>> >>   has long-lived RPC connections with multiple messages passed, I
>>> >> guess
>>> >>   it is not that big of a deal either.  For a system like we have at
>>> >>   Facebook where the web server must connect to the feed/search/chat
>>> >>   server once for each RPC, it is a killer.
>>> >>
>>> >> --David
>> >
> 
> 
