I think I have a pretty good understanding of why he can't use Thrift as it stands.
His primary use case is the same as Hadoop recordio: big files with lots of
similar records in them. So he wants to be able to put a data description
header first and then have a big stream of records that conform to that
header and don't need type fields interspersed. In addition, he wants it
all to work dynamically so that, for example, a Python script used in
Hadoop Streaming can read the header and pull fields out of records in the
stream without needing the generated bindings.

I do think we could (and still can) work to extend Thrift to cover these
use cases -- hopefully we can convince Doug to work with us on this.

Chad

On 4/3/09 7:51 AM, "Bryan Duxbury" <[email protected]> wrote:

> I agree with Chad here. We should get in touch with Doug and see why
> he can't use Thrift.
>
> -Bryan
>
> On Apr 3, 2009, at 5:47 AM, Chad Walters wrote:
>
>> Are there any strong technical reasons why we couldn't fold Avro's
>> functionality into Thrift?
>>
>> Back when I was trying to get Thrift into a position to replace Hadoop
>> record IO, we talked about doing much of this.
>>
>> I think the main challenges here are:
>> 1. Runtime parsing of schemas
>> 2. Lack of static typing for some items
>>
>> I suggest that we talk with Doug, perhaps in a face-to-face meeting
>> with some of us, and figure out if we can get him to come on board
>> with Thrift. At the end of the day, it will be a win for both the
>> Thrift and the Hadoop communities to make these two Apache projects
>> work closely together, especially since I know there is a huge
>> overlap between the two communities.
>>
>> Chad
>>
>> On 4/2/09 10:48 PM, "David Reiss" <[email protected]> wrote:
>>
>>> For those of you who don't have git, forrest, *and* Java 5
>>> (not 6! 5!) installed, I built the docs and put them online:
>>>
>>> http://www.projectornation.com/avro-doc/spec.html
>>>
>>> AFAICT, the main differences from Thrift are:
>>>
>>> - No code generation. The schema is all in JSON files that are parsed
>>>   at runtime. For Python, this is probably fine. I'm not really clear
>>>   on how it looks for Java (maybe someone can look at the Java tests
>>>   and explain it to the rest of us). For C++, this will definitely
>>>   make the Avro objects feel clunky because you'll have to access
>>>   properties by name. And the lists won't be statically typed.
>>> - The full schema is included with the messages, rather than having
>>>   field ids delimit the contents. This is nice for big Hadoop files,
>>>   since you only include the schema once. (It was a technique that we
>>>   discussed for Thrift.) For a system like (I guess) Hadoop that has
>>>   long-lived RPC connections with multiple messages passed, I guess
>>>   it is not that big of a deal either. For a system like the one we
>>>   have at Facebook, where the web server must connect to the
>>>   feed/search/chat server once for each RPC, it is a killer.
>>>
>>> --David
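
To make the use case Chad describes concrete, here is a minimal Python
sketch of such a container: a length-prefixed JSON schema header written
once, followed by untagged records in schema field order. The encoding
(big-endian fixed-width longs, length-prefixed strings) and the names
write_file/read_records are invented for illustration only; real Avro uses
its own variable-length zig-zag encoding and container format.

import io
import json
import struct

def write_file(out, schema, records):
    # Write the schema once as a length-prefixed JSON header.
    header = json.dumps(schema).encode("utf-8")
    out.write(struct.pack(">I", len(header)))
    out.write(header)
    # Records follow in schema field order, with no per-field type tags.
    for rec in records:
        for field in schema["fields"]:
            value = rec[field["name"]]
            if field["type"] == "long":
                out.write(struct.pack(">q", value))
            elif field["type"] == "string":
                data = value.encode("utf-8")
                out.write(struct.pack(">I", len(data)))
                out.write(data)

def read_records(inp):
    # Parse the schema at runtime; no generated bindings are needed.
    (hlen,) = struct.unpack(">I", inp.read(4))
    schema = json.loads(inp.read(hlen))
    fields = schema["fields"]
    while True:
        rec = {}
        for i, field in enumerate(fields):
            if field["type"] == "long":
                raw = inp.read(8)
                if i == 0 and len(raw) < 8:
                    # End of stream at a record boundary (this sketch
                    # assumes each record starts with a long field).
                    return
                (rec[field["name"]],) = struct.unpack(">q", raw)
            elif field["type"] == "string":
                (slen,) = struct.unpack(">I", inp.read(4))
                rec[field["name"]] = inp.read(slen).decode("utf-8")
        yield rec

# Example: the kind of thing a Hadoop Streaming script could do.
schema = {"type": "record", "name": "LogEntry",
          "fields": [{"name": "ts", "type": "long"},
                     {"name": "url", "type": "string"}]}
buf = io.BytesIO()
write_file(buf, schema, [{"ts": 1, "url": "/a"}, {"ts": 2, "url": "/b"}])
buf.seek(0)
for rec in read_records(buf):
    print(rec)  # fields pulled out by name, schema known only at runtime

The reader is driven entirely by the parsed header, which is what makes
the Hadoop Streaming case work without any code generation.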

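Some rough arithmetic behind David's second point, assuming the Thrift
binary protocol's framing of 3 bytes per field (a 1-byte type tag plus a
2-byte field id) and a 1-byte stop marker per struct; the 120-byte schema
header is an assumed size for a small JSON schema, not a measured figure.

# Per-record framing cost of field-id tagging vs. a one-time schema header.
FIELDS_PER_RECORD = 2
TAG_OVERHEAD = FIELDS_PER_RECORD * 3 + 1  # bytes per record, Thrift-style
SCHEMA_HEADER = 120                       # bytes, paid once per stream

for n in (1, 100, 1_000_000):
    print(f"{n:>9} records: field-id tags = {n * TAG_OVERHEAD:>9} B, "
          f"schema once = {SCHEMA_HEADER} B")

Under these assumptions, a single-message connection (the Facebook case)
pays 7 bytes of tags versus a 120-byte schema header, while a
million-record Hadoop file pays about 7 MB in tags versus a constant
120-byte header, which is David's tradeoff in a nutshell.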