Re: Can anyone recommend an inter-language data file format?

2008-11-03 Thread Steve Loughran

Zhou, Yunqing wrote:

An embedded database cannot handle large-scale data; it's not very efficient.
I have about 1 billion records.
These records should be passed through some modules.
I mean a data exchange format similar to XML, but more flexible and
efficient.



JSON
CSV
erlang-style records (name,value,value,value)
RDF-triples in non-XML representations

For all of these, you need to test with data that includes things like 
high unicode characters, single and double quotes, to see how well they 
get handled.
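A minimal Python sketch of the kind of round-trip test meant here; the field values are arbitrary examples chosen to exercise high Unicode and both quote characters.

```python
# Stress-test candidate formats by round-tripping awkward values.
import csv, io, json

tricky = {"name": 'O\'Brien said "hi"', "note": "snowman \u2603, emoji \U0001F600"}

# JSON round-trip: the library handles quoting and Unicode escapes.
assert json.loads(json.dumps(tricky)) == tricky

# CSV round-trip: the csv module quotes embedded commas/quotes itself.
buf = io.StringIO()
csv.writer(buf).writerow([tricky["name"], tricky["note"]])
buf.seek(0)
row = next(csv.reader(buf))
assert row == [tricky["name"], tricky["note"]]
```

If a format fails this kind of test, it will fail on real data at the billion-record scale.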


you can actually append with XML by omitting the root element's
opening/closing tags; just stream the entries out to the tail of the file:

<entry>...</entry>

To read this in an XML parser, include it inside another XML file:

<?xml version="1.0"?>
<!DOCTYPE file [
 <!ENTITY log SYSTEM "log.xml">
]>

<file>
&log;
</file>

I've done this for very big files; as long as you aren't trying to load
it in-memory into a DOM, things should work.
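The trick can be sketched in Python; the file names and entry count here are illustrative, and note the parser must be told to resolve external general entities (off by default in recent Pythons).

```python
# Stream bare <entry> elements into log.xml, then parse them through a
# wrapper document that pulls the log in as an external entity.
import os, tempfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

d = tempfile.mkdtemp()
log = os.path.join(d, "log.xml")
wrapper = os.path.join(d, "wrapper.xml")

# Appending is cheap: each record is a complete element, no root tag yet.
with open(log, "w", encoding="utf-8") as f:
    for i in range(3):
        f.write(f"<entry>{i}</entry>\n")

with open(wrapper, "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0"?>\n'
            '<!DOCTYPE file [ <!ENTITY log SYSTEM "log.xml"> ]>\n'
            '<file>&log;</file>\n')

class Counter(ContentHandler):
    def __init__(self):
        self.entries = 0
    def startElement(self, name, attrs):
        if name == "entry":
            self.entries += 1

parser = xml.sax.make_parser()
parser.setFeature(feature_external_ges, True)  # resolve &log; from log.xml
handler = Counter()
parser.setContentHandler(handler)
parser.parse(wrapper)
assert handler.entries == 3
```

Because this is SAX, the log file is streamed, never fully materialised in memory.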


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: Can anyone recommend an inter-language data file format?

2008-11-03 Thread Pete Wyckoff

Protocol buffers, thrift?


On 11/3/08 4:07 AM, Steve Loughran [EMAIL PROTECTED] wrote:

Zhou, Yunqing wrote:
 An embedded database cannot handle large-scale data; it's not very efficient.
 I have about 1 billion records.
 These records should be passed through some modules.
 I mean a data exchange format similar to XML, but more flexible and
 efficient.


JSON
CSV
erlang-style records (name,value,value,value)
RDF-triples in non-XML representations

For all of these, you need to test with data that includes things like
high unicode characters, single and double quotes, to see how well they
get handled.

you can actually append with XML by omitting the root element's
opening/closing tags; just stream the entries out to the tail of the file:
<entry>...</entry>

To read this in an XML parser, include it inside another XML file:

<?xml version="1.0"?>
<!DOCTYPE file [
  <!ENTITY log SYSTEM "log.xml">
]>

<file>
&log;
</file>

I've done this for very big files; as long as you aren't trying to load
it in-memory into a DOM, things should work.

--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/




Re: Can anyone recommend an inter-language data file format?

2008-11-03 Thread Chris Dyer
I've been using protocol buffers to serialize the data and then
encoding them in base64 so that I can then treat them like text.  This
obviously isn't optimal, but I'm assuming that this is only a short
term solution which won't be necessary when non-Java clients become
first class citizens of the Hadoop world.
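A hedged sketch of this base64-as-text approach; `struct` stands in for the actual protocol buffer serialization here, and the record fields are made up.

```python
# Serialize each record to bytes, then base64-encode so every record
# becomes one newline-safe line of text that streaming tools can pass
# through untouched.
import base64, struct

def encode_record(user_id: int, score: float) -> str:
    raw = struct.pack(">qd", user_id, score)   # stand-in for protobuf bytes
    return base64.b64encode(raw).decode("ascii")

def decode_record(line: str):
    raw = base64.b64decode(line)
    return struct.unpack(">qd", raw)

lines = [encode_record(i, i * 0.5) for i in range(3)]
text = "\n".join(lines)            # newline-delimited, append-friendly
for i, line in enumerate(text.splitlines()):
    uid, score = decode_record(line)
    assert (uid, score) == (i, i * 0.5)
```

The ~33% size overhead of base64 is the price paid for making binary records safe in line-oriented text processing.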

Chris

On Mon, Nov 3, 2008 at 2:24 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:

 Protocol buffers, thrift?


 On 11/3/08 4:07 AM, Steve Loughran [EMAIL PROTECTED] wrote:

 Zhou, Yunqing wrote:
 An embedded database cannot handle large-scale data; it's not very efficient.
 I have about 1 billion records.
 These records should be passed through some modules.
 I mean a data exchange format similar to XML, but more flexible and
 efficient.


 JSON
 CSV
 erlang-style records (name,value,value,value)
 RDF-triples in non-XML representations

 For all of these, you need to test with data that includes things like
 high unicode characters, single and double quotes, to see how well they
 get handled.

 you can actually append with XML by omitting the root element's
 opening/closing tags; just stream the entries out to the tail of the file:
 <entry>...</entry>

 To read this in an XML parser, include it inside another XML file:

 <?xml version="1.0"?>
 <!DOCTYPE file [
  <!ENTITY log SYSTEM "log.xml">
 ]>

 <file>
 &log;
 </file>

 I've done this for very big files; as long as you aren't trying to load
 it in-memory into a DOM, things should work.

 --
 Steve Loughran  http://www.1060.org/blogxter/publish/5
 Author: Ant in Action   http://antbook.org/





Re: Can anyone recommend an inter-language data file format?

2008-11-02 Thread Zhou, Yunqing
I finally decided to use Protocol Buffers.
But there is a problem: when Hadoop handles a file larger than the
block size, the file will be split.
How can I determine the boundaries of a sequence of protocol buffer records?
I was thinking of using Hadoop's SequenceFile as a container, but it
doesn't have a C++ API.
Any advice?
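One common answer, sketched below under assumptions: frame each record with a sync marker plus a length prefix, so a reader dropped at an arbitrary byte offset (e.g. the start of a block split) can scan forward to the next marker and resume on a record boundary. The marker bytes and records here are invented; SequenceFile uses sync markers for the same purpose.

```python
# Length-prefixed records with a sync marker for resynchronisation.
import struct

SYNC = b"\xfe\xedsync\xbe\xef"  # hypothetical 8-byte marker

def write_records(records):
    out = bytearray()
    for rec in records:
        out += SYNC + struct.pack(">I", len(rec)) + rec
    return bytes(out)

def read_from_offset(data, offset):
    """Resync at the first marker at/after offset, then read records."""
    pos = data.find(SYNC, offset)
    recs = []
    while pos != -1 and pos + len(SYNC) + 4 <= len(data):
        pos += len(SYNC)
        (n,) = struct.unpack(">I", data[pos:pos + 4])
        pos += 4
        recs.append(data[pos:pos + n])
        pos = data.find(SYNC, pos + n)
    return recs

data = write_records([b"alpha", b"bravo", b"charlie"])
# A reader starting mid-way through the first record recovers the rest.
assert read_from_offset(data, 3) == [b"bravo", b"charlie"]
assert read_from_offset(data, 0) == [b"alpha", b"bravo", b"charlie"]
```

In production the marker should be long and random enough that it is vanishingly unlikely to appear inside record payloads.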

On Sun, Nov 2, 2008 at 1:45 PM, Bryan Duxbury [EMAIL PROTECTED] wrote:

 Agree, we use Thrift at Rapleaf for this purpose. It's trivial to make a
 ThriftWritable if you want to be crafty, but you can also just use byte[]s
 and do the serialization and deserialization yourself.

 -Bryan


 On Nov 1, 2008, at 8:01 PM, Alex Loddengaard wrote:

 Take a look at Thrift:
 http://developers.facebook.com/thrift/

 Alex

 On Sat, Nov 1, 2008 at 7:15 PM, Zhou, Yunqing [EMAIL PROTECTED] wrote:

 The project I'm focused on has many modules written in different languages
 (several modules are Hadoop jobs).
 So I'd like to use a common record-based data file format for data exchange.
 XML is not efficient for appending new records.
 SequenceFile doesn't seem to have APIs for languages other than Java.
 Protocol Buffers' Hadoop API seems to be under development.
 Any recommendations?

 Thanks





Re: Can anyone recommend an inter-language data file format?

2008-11-01 Thread Chris Collins

Sleepycat has a java edition:

http://www.oracle.com/technology/products/berkeley-db/index.html

It has an interesting open source license. If you don't need to ship
it on an install disk, you're probably good to go with that too.


You could also consider Derby.

C
On Nov 1, 2008, at 7:49 PM, lamfeeling wrote:

Consider an embedded database? Berkeley DB is written in C++ and has
interfaces for many languages.






On 2008-11-02 10:15:22, Zhou, Yunqing [EMAIL PROTECTED] wrote:
The project I'm focused on has many modules written in different
languages (several modules are Hadoop jobs).
So I'd like to use a common record-based data file format for data
exchange.
XML is not efficient for appending new records.
SequenceFile doesn't seem to have APIs for languages other than Java.
Protocol Buffers' Hadoop API seems to be under development.
Any recommendations?

Thanks




Re: Can anyone recommend an inter-language data file format?

2008-11-01 Thread Zhou, Yunqing
Can Thrift be used easily in Hadoop?
A lot of things would have to be written: input/output formats, Writables,
a split method, etc.

On Sun, Nov 2, 2008 at 11:01 AM, Alex Loddengaard [EMAIL PROTECTED] wrote:

 Take a look at Thrift:
 http://developers.facebook.com/thrift/

 Alex

 On Sat, Nov 1, 2008 at 7:15 PM, Zhou, Yunqing [EMAIL PROTECTED] wrote:

  The project I'm focused on has many modules written in different languages
  (several modules are Hadoop jobs).
  So I'd like to use a common record-based data file format for data exchange.
  XML is not efficient for appending new records.
  SequenceFile doesn't seem to have APIs for languages other than Java.
  Protocol Buffers' Hadoop API seems to be under development.
  Any recommendations?
 
 
  Thanks
 



Re: Can anyone recommend an inter-language data file format?

2008-11-01 Thread Chris Collins
Consider talking to Doug Cutting. He is playing with the idea of a
variant of JSON; I am sure he would love your help. Specifically, he
is looking at a coding scheme that is easy to read, does not duplicate
key names per record, and supports file splits.
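A speculative sketch of such a format, not Doug Cutting's actual design: key names appear once in a header line, and each record is a JSON array on its own line, so files stay human-readable, append-friendly, and splittable on line boundaries.

```python
# Header-once JSON lines: keys stated a single time, records as arrays.
import json

def dump(records, keys):
    lines = [json.dumps(keys)]                      # header line with keys
    lines += [json.dumps([r[k] for k in keys]) for r in records]
    return "\n".join(lines)

def load(text):
    it = iter(text.splitlines())
    keys = json.loads(next(it))                     # read the header once
    return [dict(zip(keys, json.loads(line))) for line in it]

recs = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
assert load(dump(recs, ["id", "name"])) == recs
```

The catch for splits is that a reader starting mid-file still needs the header, so a real design would replicate or index it per split.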


C
On Nov 1, 2008, at 8:20 PM, Zhou, Yunqing wrote:


An embedded database cannot handle large-scale data; it's not very efficient.
I have about 1 billion records.
These records should be passed through some modules.
I mean a data exchange format similar to XML, but more flexible and
efficient.

On Sun, Nov 2, 2008 at 10:49 AM, lamfeeling [EMAIL PROTECTED]  
wrote:


Consider an embedded database? Berkeley DB is written in C++ and has
interfaces for many languages.





On 2008-11-02 10:15:22, Zhou, Yunqing [EMAIL PROTECTED] wrote:
The project I'm focused on has many modules written in different
languages (several modules are Hadoop jobs).
So I'd like to use a common record-based data file format for data
exchange.
XML is not efficient for appending new records.
SequenceFile doesn't seem to have APIs for languages other than Java.
Protocol Buffers' Hadoop API seems to be under development.
Any recommendations?

Thanks






Re: Can anyone recommend an inter-language data file format?

2008-11-01 Thread Bryan Duxbury
Agree, we use Thrift at Rapleaf for this purpose. It's trivial to  
make a ThriftWritable if you want to be crafty, but you can also just  
use byte[]s and do the serialization and deserialization yourself.


-Bryan

On Nov 1, 2008, at 8:01 PM, Alex Loddengaard wrote:


Take a look at Thrift:
http://developers.facebook.com/thrift/

Alex

On Sat, Nov 1, 2008 at 7:15 PM, Zhou, Yunqing [EMAIL PROTECTED]  
wrote:


The project I'm focused on has many modules written in different
languages (several modules are Hadoop jobs).
So I'd like to use a common record-based data file format for data
exchange.
XML is not efficient for appending new records.
SequenceFile doesn't seem to have APIs for languages other than Java.
Protocol Buffers' Hadoop API seems to be under development.
Any recommendations?

Thanks