[protobuf] Do Protocol Buffers read entire collection of messages into memory from file?

2010-10-19 Thread ksamdev
Hi,

I am wondering how Protocol Buffers read input files. Is the entire
file read into memory, or is some proxy technique used so that entries
are read only when required?

This is a vital feature for large lists, say a dataset with 10^9
messages.

Do Protocol Buffers use any additional archiving technique (zip, tar,
etc.) to further compress saved information?

sincerely, Sam.




Re: [protobuf] Do Protocol Buffers read entire collection of messages into memory from file?

2010-10-19 Thread Henner Zeller
On Tue, Oct 19, 2010 at 06:45, ksamdev <ksam...@gmail.com> wrote:
> Hi,
>
> I am wondering how Protocol Buffers read input files. Is the entire
> file read into memory, or is some proxy technique used so that entries
> are read only when required?

If you have a sequence of messages to process, you'd put some
container around them in the file. A very simple scheme would be
  <length-of-next-message><next-message>
That way you can read the messages one by one. This has been discussed
several times on this list.
(You are free to hide this in some "proxy technique" implementation,
though it will just complicate things without much gain.)
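
[Editor's note: as a minimal sketch of this framing, the Java protobuf
runtime already ships it as writeDelimitedTo()/parseDelimitedFrom().
"MyMessage" below stands in for any protoc-generated message class,
and process() is a hypothetical placeholder for application logic.]

    import java.io.*;

    // Editor's sketch, not from the thread. Each record on disk is
    // <varint length><message bytes>, written and read one at a time.
    class DelimitedFile {
        static void writeAll(Iterable<MyMessage> messages, File file) throws IOException {
            try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
                for (MyMessage m : messages) {
                    m.writeDelimitedTo(out);   // prefixes each message with its length
                }
            }
        }

        static void readAll(File file) throws IOException {
            try (InputStream in = new BufferedInputStream(new FileInputStream(file))) {
                MyMessage m;
                // parseDelimitedFrom returns null on a clean end-of-stream,
                // so only one message is ever held in memory at a time.
                while ((m = MyMessage.parseDelimitedFrom(in)) != null) {
                    process(m);
                }
            }
        }

        static void process(MyMessage m) { /* application logic */ }
    }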

The protocol buffer library tries to be as simple as possible,
providing only the pure serialization functionality. It gives you
everything needed to send messages over the wire or store them in
files, but you would need to do the framing yourself (adding that to
the core protocol buffer library would be beyond its scope, and you
might already have something you would like to store your data in,
such as Berkeley DB for keyed data).


> This is a vital feature for large lists, say a dataset with 10^9
> messages.

I regularly process datasets with more than 10^9 Protocol Buffer
messages and essentially store them the way I described above.
Depending on the content, it helps to use a compression scheme at the
file level (many people use gzip streams).
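
[Editor's note: a sketch of how file-level compression composes with
the delimited framing above; only the byte stream changes. This uses
the JDK's GZIPOutputStream/GZIPInputStream, with the same hypothetical
MyMessage and process() as before.]

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Same record layout as above, but the stream is gzip-compressed on disk.
    static void writeAllGzip(Iterable<MyMessage> messages, File file) throws IOException {
        try (OutputStream out = new GZIPOutputStream(new FileOutputStream(file))) {
            for (MyMessage m : messages) {
                m.writeDelimitedTo(out);
            }
        }
    }

    static void readAllGzip(File file) throws IOException {
        try (InputStream in = new GZIPInputStream(new FileInputStream(file))) {
            MyMessage m;
            while ((m = MyMessage.parseDelimitedFrom(in)) != null) {
                process(m);   // decompression is transparent; still streams one by one
            }
        }
    }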
(For larger datasets it actually makes sense to add some sort of CRC,
as disks have a noticeable error rate at that size.)
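
[Editor's note: one possible way to add that CRC, sketched with
java.util.zip.CRC32. The record layout here is an assumption, not a
standard: <4-byte length><message bytes><4-byte CRC32 of the bytes>.]

    import java.io.*;
    import java.util.zip.CRC32;

    static void writeRecord(MyMessage m, DataOutputStream out) throws IOException {
        byte[] body = m.toByteArray();
        CRC32 crc = new CRC32();
        crc.update(body);
        out.writeInt(body.length);          // fixed 32-bit length for simplicity
        out.write(body);
        out.writeInt((int) crc.getValue()); // low 32 bits hold the CRC32 value
    }

    // Throws EOFException at end of file; callers can use that to stop reading.
    static MyMessage readRecord(DataInputStream in) throws IOException {
        byte[] body = new byte[in.readInt()];
        in.readFully(body);
        CRC32 crc = new CRC32();
        crc.update(body);
        if ((int) crc.getValue() != in.readInt()) {
            throw new IOException("CRC32 mismatch: record is corrupted");
        }
        return MyMessage.parseFrom(body);
    }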

-h

> Do Protocol Buffers use any additional archiving technique (zip, tar,
> etc.) to further compress saved information?
>
> sincerely, Sam.



