Re: how to parse a file with millions of records with protobuf

2008-12-08 Thread Jon Skeet

On Dec 7, 11:45 am, nightwalker leo [EMAIL PROTECTED] wrote:
 when I try to parse an addressbook file which has 2^20 records of
 person, my program complains like this:
 libprotobuf WARNING D:\protobuf-2.0.2\src\google\protobuf\io
 \coded_stream.cc:459] Reading dangerously large protocol message.  If
 the message turns out to be larger than 67108864 bytes, parsing will
 be halted for security reasons.  To increase the limit (or to disable
 these warnings), see CodedInputStream::SetTotalBytesLimit() in google/
 protobuf/io/coded_stream.h.

 how to deal with the problem in an elegant way instead of increasing
 the limit or simply turning off the warning message?

In my C# port, I have code to write out messages as if they were a
repeated field #1 of a container type, and another class to read in
the same format in a stream manner, one entry at a time.

Would that be useful to you?

Jon
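For readers without the C# port: the framing Jon describes — each record written as a length-delimited blob and read back one at a time — can be sketched in Python. The helper names below are mine, not part of the official protobuf API, and plain byte strings stand in for serialized messages (with real protobuf you would write `message.SerializeToString()` and parse each record back with `ParseFromString`).

```python
import io

def write_varint(out, value):
    # Base-128 varint, least-significant group first (protobuf wire format).
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.write(bytes([group | 0x80]))
        else:
            out.write(bytes([group]))
            return

def read_varint(inp):
    # Returns None at clean end-of-stream, raises if a varint is truncated.
    shift, result = 0, 0
    while True:
        b = inp.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        result |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            return result
        shift += 7

def write_delimited(out, payload):
    # Length prefix, then body -- one record per call.
    write_varint(out, len(payload))
    out.write(payload)

def read_delimited(inp):
    # Yields one record at a time; None signals end of stream.
    size = read_varint(inp)
    return None if size is None else inp.read(size)
```

Because each record is framed independently, only one entry need be in memory at a time, so the 64 MB total-bytes warning never triggers.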
--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en
-~--~~~~--~~--~--~---



Re: how to parse a file with millions of records with protobuf

2008-12-08 Thread David Anderson

Do you really need to have the entire file in memory at once? Reading
64M of addresses into memory seems like the wrong approach (I could
be wrong of course, since I don't know what you're doing with them).

If you need to do something with each entry individually, you could do
chunked reads: when writing, instead of serializing the whole message
at once, build several messages of ~10M each, and write them out with
a length prefix. When reading back, use the length prefix to yield
nicely sized chunks of your address book. There may even be a nice way
to do this implicitly at the input/output stream level, if it is
aware of field boundaries, but I don't have a good enough handle on
the implementation to say.

If you need to find a specific entry in the address book, you should
sort the address book. You then chunk in the same manner, and add an
index message at the end of the file that lists the start offset of
all chunks. You can then do binary search over the chunks (even more
efficiently if the index includes the start and end keys of your
chunks, e.g. last names) to locate the chunk you want.
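The sorted-chunks-plus-index layout can be sketched like this. The file layout and helper names are my own invention, shown with plain byte blobs standing in for serialized chunk messages; chunks are assumed to be written in sorted key order.

```python
import bisect
import io
import struct

def write_chunked(out, chunks):
    # chunks: sorted list of (first_key, serialized_bytes). Each chunk gets
    # a 4-byte length prefix; an index of (offset, key) entries follows the
    # chunks, and a 4-byte trailer points at the index.
    offsets = []
    for key, blob in chunks:
        offsets.append((out.tell(), key))
        out.write(struct.pack("<I", len(blob)))
        out.write(blob)
    index_pos = out.tell()
    out.write(struct.pack("<I", len(offsets)))
    for off, key in offsets:
        kb = key.encode("utf-8")
        out.write(struct.pack("<II", off, len(kb)))
        out.write(kb)
    out.write(struct.pack("<I", index_pos))

def find_chunk(f, want_key):
    # Load the index, binary-search for the chunk that may hold want_key,
    # then read only that one chunk from disk.
    f.seek(-4, io.SEEK_END)
    (index_pos,) = struct.unpack("<I", f.read(4))
    f.seek(index_pos)
    (n,) = struct.unpack("<I", f.read(4))
    entries = []
    for _ in range(n):
        off, klen = struct.unpack("<II", f.read(8))
        entries.append((f.read(klen).decode("utf-8"), off))
    keys = [k for k, _ in entries]
    i = bisect.bisect_right(keys, want_key) - 1
    if i < 0:
        return None  # want_key sorts before the first chunk
    f.seek(entries[i][1])
    (size,) = struct.unpack("<I", f.read(4))
    return f.read(size)
```

Only the index and a single ~10M chunk are ever resident, and the per-message size limit is never approached.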

If none of these answers are satisfactory, and you really need the
entire multi-hundred-megabyte message loaded at once, I guess you can
use SetTotalBytesLimit() to raise the safety limits to whatever you
feel is necessary. But usually, when I try to load a bunch of small
messages as one monolithic block, I find that my data format isn't
adapted to what I want to do.

Hope this helps a little.
- Dave

On Sun, Dec 7, 2008 at 12:45 PM, nightwalker leo [EMAIL PROTECTED] wrote:

 when I try to parse an addressbook file which has 2^20 records of
 person, my program complains like this:
 libprotobuf WARNING D:\protobuf-2.0.2\src\google\protobuf\io
 \coded_stream.cc:459] Reading dangerously large protocol message.  If
 the message turns out to be larger than 67108864 bytes, parsing will
 be halted for security reasons.  To increase the limit (or to disable
 these warnings), see CodedInputStream::SetTotalBytesLimit() in google/
 protobuf/io/coded_stream.h.

 how to deal with the problem in an elegant way instead of increasing
 the limit or simply turning off the warning message?





Re: how to parse a file with millions of records with protobuf

2008-12-08 Thread Kenton Varda
On Sun, Dec 7, 2008 at 3:45 AM, nightwalker leo [EMAIL PROTECTED]wrote:


 when I try to parse an addressbook file which has 2^20 records of
 person, my program complains like this:
 libprotobuf WARNING D:\protobuf-2.0.2\src\google\protobuf\io
 \coded_stream.cc:459] Reading dangerously large protocol message.  If
 the message turns out to be larger than 67108864 bytes, parsing will
 be halted for security reasons.  To increase the limit (or to disable
 these warnings), see CodedInputStream::SetTotalBytesLimit() in google/
 protobuf/io/coded_stream.h.

 how to deal with the problem in an elegant way instead of increasing
 the limit or simply turning off the warning message?


The documentation for SetTotalBytesLimit() answers your question:

http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html#CodedInputStream.SetTotalBytesLimit.details




Re: Quick Hacky text_mode Parse in Python

2008-12-08 Thread Kenton Varda
Hey Petar, isn't there a patch someone was trying to submit that implements
text format parsing?  (For real, not by wrapping protoc.)  What's the status
of that?

On Mon, Dec 8, 2008 at 5:03 AM, Nicholas Reid [EMAIL PROTECTED] wrote:


 Hi All,

 Firstly, just wanted to thank Kenton and the Google team, PB2 is a
 beautiful piece of work! Thanks heaps.

 I will almost certainly go to some deep circle of Programmer's Hell
 for this, but it might be useful for someone until the guys get a
 chance to add text_mode message parsing functionality to the Python
 API. There are almost certainly more elegant ways of doing this.

 Code:

 import popen2

 def parse_text_format(message_string, generated_message_type):
     """Parses the given protobuf text_format string into a new
     instance of the given type."""

     # Should be defined globally somewhere
     PROTO_FILENAME = "person.proto"

     # Instantiate a new message
     obj = generated_message_type()

     # Wrap the protoc command-line utility; expects 'protoc' to be
     # on your PATH somewhere
     (stdout, stdin) = popen2.popen2("protoc %s --encode=%s" %
         (PROTO_FILENAME, generated_message_type.DESCRIPTOR.name),
         bufsize=1024)

     # Feed in the message_string in text_format
     stdin.write(message_string)
     stdin.close()

     # Read out the protoc-encoded binary format
     binary_string = stdout.read()
     stdout.close()

     # Parse the resulting binary representation.
     obj.ParseFromString(binary_string)
     return obj

 Example:

 Assuming person.proto contains:

 message Person {
required string name = 1;
 }

 Code:

 from person_pb2 import *
 guido = parse_text_format('name: "Guido"', Person)

 Should give you a person object which you can use for nefarious
 purposes.

 Kind regards,

 Nicholas Reid





Re: Quick Hacky text_mode Parse in Python

2008-12-08 Thread Petar Petrov
On Mon, Dec 8, 2008 at 10:21 AM, Kenton Varda [EMAIL PROTECTED] wrote:

 Hey Petar, isn't there a patch someone was trying to submit that implements
 text format parsing?  (For real, not by wrapping protoc.)  What's the status
 of that?


I'll review it today.
Hopefully the author hasn't forgotten about it.









Re: Slicing support in Python

2008-12-08 Thread Kenton Varda
On Sat, Dec 6, 2008 at 1:03 AM, Alek Storm [EMAIL PROTECTED] wrote:

 But it does give us a lot of cool functionality, like adding the same
 message to two parents, and (yes!) slicing support.  I thought this was
 common practice in C++, but it's been quite a while since I've coded it.


Nope, in the C++ world we have to worry excessively about ownership, and we
generally make defensive copies rather than trying to allow an object to be
referenced from two places.


 Is it really that useful to have ByteSize() cached for repeated fields?  If
 it's not, we get everything I mentioned above for free.  I'm genuinely not
 sure - it only comes up when serializing the message in wire_format.py.
 What do you think?


Yes, it's just as necessary as it is with optional fields.  The main problem
is that the size of a message must be written before the message contents
itself.  If, while serializing, you call ByteSize() to get this size every
time you write a message, then you'll end up computing the size of
deeply-nested messages many times (once for each outer message within which
they're nested).  Caching avoids that problem.
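A toy Python model of the cost Kenton describes (class and function names are mine, not the real protobuf internals): serializing nested messages needs each message's size before its body at every level, so an uncached recursive size function is quadratic in nesting depth, while a cached one stays linear.

```python
class Msg:
    # Toy stand-in for a nested message: a payload plus child messages.
    calls = 0  # counts size-function invocations across the whole tree

    def __init__(self, payload, children=()):
        self.payload = payload
        self.children = list(children)
        self._cached_size = None

    def byte_size_uncached(self):
        Msg.calls += 1
        # one extra byte per child for its (tiny) length prefix
        return len(self.payload) + sum(c.byte_size_uncached() + 1
                                       for c in self.children)

    def byte_size_cached(self):
        Msg.calls += 1
        if self._cached_size is None:
            self._cached_size = len(self.payload) + sum(
                c.byte_size_cached() + 1 for c in self.children)
        return self._cached_size

def serialize(msg, size_fn):
    # The wire format needs each message's size *before* its body, at
    # every nesting level -- so size_fn is consulted again per level.
    out = [size_fn(msg), msg.payload]
    for c in msg.children:
        out.extend(serialize(c, size_fn))
    return out

def chain(depth):
    # Build a depth-deep chain of singly-nested messages.
    m = Msg(b"x")
    for _ in range(depth - 1):
        m = Msg(b"x", [m])
    return m
```

For a 20-deep chain the uncached variant computes sizes 20 + 19 + ... + 1 = 210 times, while the cached one computes each size once and afterwards only pays a cache hit per level.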




Re: Slicing support in Python

2008-12-08 Thread Alek Storm
On Mon, Dec 8, 2008 at 1:16 PM, Kenton Varda [EMAIL PROTECTED] wrote:

 On Sat, Dec 6, 2008 at 1:03 AM, Alek Storm [EMAIL PROTECTED] wrote:

 Is it really that useful to have ByteSize() cached for repeated fields?
 If it's not, we get everything I mentioned above for free.  I'm genuinely
 not sure - it only comes up when serializing the message in wire_format.py.
 What do you think?


 Yes, it's just as necessary as it is with optional fields.  The main
 problem is that the size of a message must be written before the message
 contents itself.  If, while serializing, you call ByteSize() to get this
 size every time you write a message, then you'll end up computing the size
 of deeply-nested messages many times (once for each outer message within
 which they're nested).  Caching avoids that problem.


Okay, then we just need to cache the size only during serialization.  The
children's sizes are calculated and stored, then added to the parent's
size.  Write the parent size, then write the parent, then the child size,
then the child, on down the tree.  Then it's O(n) (same as we have
currently) and no ownership problems, because we can drop the weak reference
from child to parent.  Would that work?

Cheers,
Alek Storm




Doxygen and protobuf

2008-12-08 Thread Scott Stafford

Hi -

Has anyone attempted to use Doxygen with .proto files?  We're
considering extending Doxygen to support the .proto format, but
haven't started the project.  Thoughts?

Scott



Re: how to parse a file with millions of records with protobuf

2008-12-08 Thread nightwalker leo

thanks, could you give me an example plz?

On Dec 8, 4:10 pm, Jon Skeet [EMAIL PROTECTED] wrote:
 On Dec 7, 11:45 am, nightwalker leo [EMAIL PROTECTED] wrote:

  when I try to parse an addressbook file which has 2^20 records of
  person, my program complains like this:
  libprotobuf WARNING D:\protobuf-2.0.2\src\google\protobuf\io
  \coded_stream.cc:459] Reading dangerously large protocol message.  If
  the message turns out to be larger than 67108864 bytes, parsing will
  be halted for security reasons.  To increase the limit (or to disable
  these warnings), see CodedInputStream::SetTotalBytesLimit() in google/
  protobuf/io/coded_stream.h.

  how to deal with the problem in an elegant way instead of increasing
  the limit or simply turning off the warning message?

 In my C# port, I have code to write out messages as if they were a
 repeated field #1 of a container type, and another class to read in
 the same format in a stream manner, one entry at a time.

 Would that be useful to you?

 Jon