Re: Can anyone recommend an inter-language data file format?
Zhou, Yunqing wrote:
> An embedded database cannot handle large-scale data very efficiently. I have about 1 billion records, and these records should be passed through some modules. I mean a data exchange format similar to XML but more flexible and efficient.

Some candidates:

- JSON
- CSV
- Erlang-style records: (name, value, value, value)
- RDF triples in a non-XML representation

For all of these, you need to test with data that includes things like high Unicode characters and single and double quotes, to see how well they get handled.

You can actually append with XML by not having opening/closing tags for a root element; just stream the entries out to the tail of the file:

  <entry>...</entry>

To read this in an XML parser, include it inside another XML file:

  <?xml version="1.0"?>
  <!DOCTYPE log [
    <!ENTITY log SYSTEM "log.xml">
  ]>
  <file>&log;</file>

I've done this for very big files. As long as you aren't trying to load it into an in-memory DOM, things should work.

--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action  http://antbook.org/
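The append-then-wrap trick above can also be exercised without DTD entities: a sketch in Python (illustrative code, not from the thread) that streams bare <entry> elements onto the tail of a file and, at read time, feeds a synthetic root element around the stream so a SAX parser never needs the whole document in memory.

```python
import xml.sax
from xml.sax.saxutils import escape

def append_entry(path, text):
    # Append-only writer: each record is a bare <entry> element with no
    # surrounding root tags, so new records just go on the end of the file.
    with open(path, "a", encoding="utf-8") as f:
        f.write("<entry>%s</entry>\n" % escape(text))

class EntryHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.entries = []
        self._buf = None
    def startElement(self, name, attrs):
        if name == "entry":
            self._buf = []
    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)
    def endElement(self, name):
        if name == "entry":
            self.entries.append("".join(self._buf))
            self._buf = None

def read_entries(path):
    # Wrap the rootless entry stream in a synthetic root element,
    # mimicking the external-entity wrapper file, and SAX-parse it
    # incrementally instead of building a DOM.
    parser = xml.sax.make_parser()
    handler = EntryHandler()
    parser.setContentHandler(handler)
    parser.feed(b"<file>")
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            parser.feed(chunk)
    parser.feed(b"</file>")
    parser.close()
    return handler.entries
```

Because the parser is fed in chunks, memory stays flat regardless of file size, which is the property Steve is relying on.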
Re: Can anyone recommend an inter-language data file format?
Protocol buffers, thrift?

On 11/3/08 4:07 AM, Steve Loughran [EMAIL PROTECTED] wrote:
> [...]
Re: Can anyone recommend an inter-language data file format?
I've been using protocol buffers to serialize the data and then encoding them in base64 so that I can treat them like text. This obviously isn't optimal, but I'm assuming this is only a short-term solution that won't be necessary once non-Java clients become first-class citizens of the Hadoop world.

Chris

On Mon, Nov 3, 2008 at 2:24 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:
> Protocol buffers, thrift?
> [...]
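The base64 workaround can be sketched in a few lines (an illustration with invented names, not Chris's actual code). The key property is that base64 output is pure ASCII with no newline bytes, so one serialized record per line survives line-oriented text handling such as Hadoop's TextInputFormat.

```python
import base64

def encode_record(raw):
    # base64 output contains no newlines or quotes, so one encoded
    # record per line is safe for line-oriented processing.
    return base64.b64encode(raw).decode("ascii")

def decode_record(line):
    return base64.b64decode(line.strip())

def write_lines(path, records):
    with open(path, "w") as f:
        for raw in records:
            f.write(encode_record(raw) + "\n")

def read_lines(path):
    with open(path) as f:
        return [decode_record(line) for line in f]
```

The cost is roughly a 33% size increase over the raw bytes, which is the "isn't optimal" part.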
Re: Can anyone recommend an inter-language data file format?
I finally decided to use Protocol Buffers. But there is a problem: when Hadoop handles a file larger than the block size, the file will be split. How can I determine the boundary of a record in a sequence of protocol buffer records? I was thinking of using Hadoop's SequenceFile as a container, but it doesn't have a C++ API. Any advice?

On Sun, Nov 2, 2008 at 1:45 PM, Bryan Duxbury [EMAIL PROTECTED] wrote:
> Agree, we use Thrift at Rapleaf for this purpose. It's trivial to make a ThriftWritable if you want to be crafty, but you can also just use byte[]s and do the serialization and deserialization yourself.
>
> -Bryan
>
> On Nov 1, 2008, at 8:01 PM, Alex Loddengaard wrote:
>> Take a look at Thrift: http://developers.facebook.com/thrift/
>>
>> Alex
>>
>> On Sat, Nov 1, 2008 at 7:15 PM, Zhou, Yunqing [EMAIL PROTECTED] wrote:
>>> The project I am focused on has many modules written in different languages (several modules are Hadoop jobs), so I'd like to use a common record-based data file format for data exchange. XML is not efficient for appending new records. SequenceFile seems to have no API for languages other than Java, and Protocol Buffers' Hadoop API seems to be under development. Any recommendations? Thanks
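SequenceFile's answer to the boundary question is a per-file sync marker injected between records, so a reader that starts mid-split can scan forward to a known record boundary. A rough Python sketch of the same idea (fixed 4-byte length prefixes for simplicity, where protobuf tooling would typically use varint delimiting; all names here are illustrative):

```python
import os
import struct

SYNC_LEN = 16

def write_records(path, records):
    # A random per-file marker; a collision with real record bytes is
    # astronomically unlikely, which is the same bet SequenceFile makes.
    sync = os.urandom(SYNC_LEN)
    with open(path, "wb") as f:
        f.write(sync)  # header: the marker itself
        for i, rec in enumerate(records):
            if i and i % 100 == 0:
                # A length of -1 escapes the sync marker that follows.
                f.write(struct.pack(">i", -1))
                f.write(sync)
            f.write(struct.pack(">i", len(rec)))
            f.write(rec)

def read_records(path):
    with open(path, "rb") as f:
        sync = f.read(SYNC_LEN)
        out = []
        while True:
            hdr = f.read(4)
            if len(hdr) < 4:
                break
            n = struct.unpack(">i", hdr)[0]
            if n == -1:
                assert f.read(SYNC_LEN) == sync  # skip the sync marker
                continue
            out.append(f.read(n))
        return out
```

A split reader would seek to its split offset, search forward for the sync bytes, and start decoding length-prefixed records from there; since the framing is language-neutral, a C++ reader only needs the same few lines of logic.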
Re: Can anyone recommend an inter-language data file format?
Sleepycat has a Java edition: http://www.oracle.com/technology/products/berkeley-db/index.html

It has an interesting open source license. If you don't need to ship it on an install disk, you're probably good to go with that too. You could also consider Derby.

C

On Nov 1, 2008, at 7:49 PM, lamfeeling wrote:
> Consider an embedded database? Berkeley DB is written in C++ and has interfaces for many languages.
>
> On 2008-11-02 10:15:22, Zhou, Yunqing [EMAIL PROTECTED] wrote:
>> [...]
Re: Can anyone recommend an inter-language data file format?
Can Thrift be easily used in Hadoop? A lot of things would have to be written: input/output formats, Writables, a split method, etc.

On Sun, Nov 2, 2008 at 11:01 AM, Alex Loddengaard [EMAIL PROTECTED] wrote:
> Take a look at Thrift: http://developers.facebook.com/thrift/
>
> Alex
>
> [...]
Re: Can anyone recommend an inter-language data file format?
Consider talking to Doug Cutting. He is playing with the idea of a variant of JSON; I am sure he would love your help. Specifically, he is looking at a coding scheme that is easy to read, does not duplicate key names per record, and supports file splits.

C

On Nov 1, 2008, at 8:20 PM, Zhou, Yunqing wrote:
> An embedded database cannot handle large-scale data very efficiently. I have about 1 billion records, and these records should be passed through some modules. I mean a data exchange format similar to XML but more flexible and efficient.
>
> [...]
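A toy sketch of such a coding scheme (purely illustrative; the function and field names are invented, and this is not Doug Cutting's design): write the field names once in a header line, then each record as a bare JSON array of values, so key names are not repeated per record and the file stays newline-delimited, human-readable, and appendable.

```python
import json

def encode(records, fields):
    # Schema (the field names) is written once in a header line; each
    # record is just a JSON array of values in field order, so key
    # names are never duplicated per record.
    lines = [json.dumps({"fields": fields})]
    for r in records:
        lines.append(json.dumps([r[f] for f in fields]))
    return "\n".join(lines)

def decode(text):
    lines = text.splitlines()
    fields = json.loads(lines[0])["fields"]
    return [dict(zip(fields, json.loads(line))) for line in lines[1:]]
```

Because records are one per line, a split reader can seek to an offset and skip to the next newline to find a record boundary; the header line would still need to be shared with every split, which is one of the details a real design has to solve.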
Re: Can anyone recommend an inter-language data file format?
Agree, we use Thrift at Rapleaf for this purpose. It's trivial to make a ThriftWritable if you want to be crafty, but you can also just use byte[]s and do the serialization and deserialization yourself.

-Bryan

On Nov 1, 2008, at 8:01 PM, Alex Loddengaard wrote:
> Take a look at Thrift: http://developers.facebook.com/thrift/
>
> Alex
>
> [...]