Re: how to parse a file with millions of records with protobuf
On Dec 7, 11:45 am, nightwalker leo [EMAIL PROTECTED] wrote:
> when I try to parse an addressbook file which has 2^20 records of person, my program complains like this:
>
> libprotobuf WARNING D:\protobuf-2.0.2\src\google\protobuf\io\coded_stream.cc:459] Reading dangerously large protocol message. If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
>
> How do I deal with the problem in an elegant way instead of increasing the limit or simply turning off the warning message?

In my C# port, I have code to write out messages as if they were a repeated field #1 of a container type, and another class to read back the same format in a streaming manner, one entry at a time. Would that be useful to you?

Jon

--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en
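Jon's container trick works because, on the wire, a repeated message field is just a back-to-back sequence of tagged, length-prefixed records, so entries can be written and read one at a time without ever holding the whole file in memory. Below is a minimal sketch of just the framing in Python; the helper names (`write_record`, `read_records`) are mine, not from Jon's port, and real code would pass each payload through `Person.SerializeToString()` / `Person.ParseFromString()`:

```python
import io

def encode_varint(n):
    # Protobuf base-128 varint, least-significant 7 bits first.
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        if n:
            out.append(bits | 0x80)
        else:
            out.append(bits)
            return bytes(out)

def decode_varint(stream):
    # Returns the decoded int, or None on a clean end-of-stream.
    shift, result = 0, 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            return None
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7

def write_record(stream, payload):
    # Tag for field #1, wire type 2 (length-delimited): (1 << 3) | 2 == 0x0A.
    stream.write(b"\x0a")
    stream.write(encode_varint(len(payload)))
    stream.write(payload)

def read_records(stream):
    # Yield one payload at a time; each would be parsed as a Person.
    while True:
        tag = decode_varint(stream)
        if tag is None:
            return
        assert tag == 0x0A, "expected field #1, length-delimited"
        length = decode_varint(stream)
        yield stream.read(length)
```

Because every record is tagged as field #1, the concatenation of all of them is itself a valid serialized message of a container type declared (hypothetically) as `message AddressBookFile { repeated Person person = 1; }`, so a non-streaming reader can still parse the whole file at once.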
Re: how to parse a file with millions of records with protobuf
Do you really need to have the entire file in memory at once? Reading 64M of addresses into memory seems like the wrong approach (I could be wrong of course, since I don't know what you're doing with them).

If you need to do something with each entry individually, you could do chunked reads: when writing, instead of serializing the whole message at once, build several messages of ~10M each, and write them out with a length prefix. When reading back, use the length prefix to yield nicely sized chunks of your address book. There may even be a nice way to do this implicitly at the input/output stream level, if it is aware of field boundaries, but I don't have a good enough handle on the implementation to say.

If you need to find a specific entry in the address book, you should sort the address book. You then chunk in the same manner, and add an index message at the end of the file that lists the start offset of all chunks. You can then do binary search over the chunks (even more efficiently if the index includes the start and end keys of your chunks, e.g. last names) to locate the chunk you want.

If none of these answers is satisfactory, and you really need the entire multi-hundred-megabyte message loaded at once, I guess you can use SetTotalBytesLimit() to raise the safety limits to whatever you feel is necessary. But usually, when I try to load a bunch of small messages as one monolithic block, I find that my data format isn't adapted to what I want to do.

Hope this helps a little -
Dave

On Sun, Dec 7, 2008 at 12:45 PM, nightwalker leo [EMAIL PROTECTED] wrote:
> when I try to parse an addressbook file which has 2^20 records of person, my program complains like this:
>
> libprotobuf WARNING D:\protobuf-2.0.2\src\google\protobuf\io\coded_stream.cc:459] Reading dangerously large protocol message. If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons.
> To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
>
> how to deal with the problem in an elegant way instead of increasing the limit or simply turning off the warning message?
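Dave's sorted-chunks-plus-index layout is easy to prototype. In this sketch the payload bytes stand in for serialized ~10M address-book chunks, and the fixed-width length prefix, tab-separated index, 8-byte trailer, and helper names are all my own illustrative choices rather than anything from the protobuf library:

```python
import bisect
import io
import struct

def write_chunked(stream, chunks_with_keys):
    # chunks_with_keys: list of (first_key, payload_bytes), sorted by key.
    offsets, keys = [], []
    for key, payload in chunks_with_keys:
        offsets.append(stream.tell())
        keys.append(key)
        stream.write(struct.pack(">I", len(payload)))  # 32-bit length prefix
        stream.write(payload)
    index_start = stream.tell()
    # Index: one "offset<TAB>first_key" line per chunk, written at the end.
    index_bytes = "\n".join(
        "%d\t%s" % (off, key) for off, key in zip(offsets, keys)).encode("utf-8")
    stream.write(struct.pack(">I", len(index_bytes)))
    stream.write(index_bytes)
    # Fixed-size trailer so a reader can locate the index.
    stream.write(struct.pack(">Q", index_start))

def find_chunk(stream, wanted_key):
    # Load the index via the trailer, then binary-search the start keys.
    stream.seek(-8, io.SEEK_END)
    (index_start,) = struct.unpack(">Q", stream.read(8))
    stream.seek(index_start)
    (index_len,) = struct.unpack(">I", stream.read(4))
    entries = [line.split("\t")
               for line in stream.read(index_len).decode("utf-8").split("\n")]
    offsets = [int(off) for off, _ in entries]
    keys = [key for _, key in entries]
    # Last chunk whose first key is <= wanted_key holds the entry, if any.
    i = bisect.bisect_right(keys, wanted_key) - 1
    stream.seek(offsets[max(i, 0)])
    (length,) = struct.unpack(">I", stream.read(4))
    return stream.read(length)
```

In a real file the index would itself be a small message (repeated offset/key pairs) and each chunk a serialized address-book message, so every individual parse stays comfortably under the 64MB warning threshold.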
Re: how to parse a file with millions of records with protobuf
On Sun, Dec 7, 2008 at 3:45 AM, nightwalker leo [EMAIL PROTECTED] wrote:
> when I try to parse an addressbook file which has 2^20 records of person, my program complains like this:
>
> libprotobuf WARNING D:\protobuf-2.0.2\src\google\protobuf\io\coded_stream.cc:459] Reading dangerously large protocol message. If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
>
> how to deal with the problem in an elegant way instead of increasing the limit or simply turning off the warning message?

The documentation for SetTotalBytesLimit() answers your question:
http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.io.coded_stream.html#CodedInputStream.SetTotalBytesLimit.details
Re: Quick Hacky text_mode Parse in Python
Hey Petar, isn't there a patch someone was trying to submit that implements text format parsing? (For real, not by wrapping protoc.) What's the status of that?

On Mon, Dec 8, 2008 at 5:03 AM, Nicholas Reid [EMAIL PROTECTED] wrote:
> Hi All,
>
> Firstly, just wanted to thank Kenton and the Google team, PB2 is a beautiful piece of work! Thanks heaps.
>
> I will almost certainly go to some deep circle of Programmer's Hell for this, but it might be useful for someone until the guys get a chance to add text_mode message parsing functionality to the Python API. There are almost certainly more elegant ways of doing this.
>
> Code:
>
>     import popen2
>
>     def parse_text_format(message_string, generated_message_type):
>         """Parses the given protobuf text_format into a new instance of the given type."""
>         # Should be defined globally somewhere
>         PROTO_FILENAME = "person.proto"
>         # Instantiate a new message
>         obj = generated_message_type()
>         # Wrap the protoc command-line utility; expects 'protoc' to be on your PATH somewhere
>         (stdout, stdin) = popen2.popen2(
>             "protoc %s --encode=%s" % (PROTO_FILENAME,
>                                        generated_message_type.DESCRIPTOR.name),
>             bufsize=1024)
>         # Feed in the message_string in text_format
>         stdin.write(message_string)
>         stdin.close()
>         # Read out the protoc-encoded binary format
>         binary_string = stdout.read()
>         stdout.close()
>         # Parse the resulting binary representation.
>         obj.ParseFromString(binary_string)
>         return obj
>
> Example: Assuming person.proto contains:
>
>     message Person {
>         required string name = 1;
>     }
>
> Code:
>
>     from person_pb2 import *
>     guido = parse_text_format('name: "Guido"', Person)
>
> Should give you a Person object which you can use for nefarious purposes.
>
> Kind regards,
> Nicholas Reid
Re: Quick Hacky text_mode Parse in Python
On Mon, Dec 8, 2008 at 10:21 AM, Kenton Varda [EMAIL PROTECTED] wrote:
> Hey Petar, isn't there a patch someone was trying to submit that implements text format parsing? (For real, not by wrapping protoc.) What's the status of that?

I'll review it today. Hopefully the author hasn't forgotten about it.

[quoted message from Nicholas Reid snipped]
Re: Slicing support in Python
On Sat, Dec 6, 2008 at 1:03 AM, Alek Storm [EMAIL PROTECTED] wrote:
> But it does give us a lot of cool functionality, like adding the same message to two parents, and (yes!) slicing support. I thought this was common practice in C++, but it's been quite a while since I've coded it.

Nope, in the C++ world we have to worry excessively about ownership, and we generally make defensive copies rather than trying to allow an object to be referenced from two places.

> Is it really that useful to have ByteSize() cached for repeated fields? If it's not, we get everything I mentioned above for free. I'm genuinely not sure - it only comes up when serializing the message in wire_format.py. What do you think?

Yes, it's just as necessary as it is with optional fields. The main problem is that the size of a message must be written before the message contents themselves. If, while serializing, you call ByteSize() to get this size every time you write a message, then you'll end up computing the size of deeply-nested messages many times (once for each outer message within which they're nested). Caching avoids that problem.
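Kenton's recomputation point is easy to see with a toy serializer; the tuple-based "messages" and call counter below are illustrative stand-ins, not the real wire_format.py code. Writing size-then-contents without caching makes every level recompute the sizes of everything nested beneath it:

```python
counter = {"calls": 0}

def size_of(msg):
    # msg is (payload_bytes, [child messages]); no caching anywhere.
    counter["calls"] += 1
    payload, children = msg
    return len(payload) + sum(1 + size_of(c) for c in children)

def serialize_naive(msg, out):
    # Writing a nested message means writing its size first, so we end up
    # calling size_of() again at every level of the tree.
    payload, children = msg
    out.extend(payload)
    for child in children:
        out.append(size_of(child))  # assumes each size fits in one byte
        serialize_naive(child, out)
```

For a chain of five one-byte messages this performs ten size computations instead of four; in general a depth-d nesting costs O(d^2) size work, which is exactly the cost that caching ByteSize() avoids.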
Re: Slicing support in Python
On Mon, Dec 8, 2008 at 1:16 PM, Kenton Varda [EMAIL PROTECTED] wrote:
> On Sat, Dec 6, 2008 at 1:03 AM, Alek Storm [EMAIL PROTECTED] wrote:
>> Is it really that useful to have ByteSize() cached for repeated fields? If it's not, we get everything I mentioned above for free. I'm genuinely not sure - it only comes up when serializing the message in wire_format.py. What do you think?
>
> Yes, it's just as necessary as it is with optional fields. The main problem is that the size of a message must be written before the message contents itself. If, while serializing, you call ByteSize() to get this size every time you write a message, then you'll end up computing the size of deeply-nested messages many times (once for each outer message within which they're nested). Caching avoids that problem.

Okay, then we just need to cache the size only during serialization. The children's sizes are calculated and stored, then added to the parent's size. Write the parent size, then write the parent, then the child size, then the child, on down the tree. Then it's O(n) (same as we have currently) and no ownership problems, because we can drop the weak reference from child to parent. Would that work?

Cheers,
Alek Storm
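Alek's scheme (compute child sizes bottom-up exactly once, cache them, then write size-then-contents down the tree) can be sketched in the same toy style; the `Node` class and one-shot cache here are illustrative, not the actual Python implementation:

```python
def encode_varint(n):
    # Protobuf-style base-128 varint.
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        if n:
            out.append(bits | 0x80)
        else:
            out.append(bits)
            return bytes(out)

def varint_size(n):
    # Bytes needed for the length prefix of an n-byte message.
    return len(encode_varint(n))

class Node:
    # Toy nested message: payload bytes plus sub-messages.
    def __init__(self, payload, children=()):
        self.payload = payload
        self.children = list(children)
        self._cached_size = None

    def byte_size(self):
        # Post-order traversal: each node's size is computed exactly once
        # (O(n) total) and cached for serialization to reuse.
        if self._cached_size is None:
            size = len(self.payload)
            for child in self.children:
                child_size = child.byte_size()
                size += varint_size(child_size) + child_size
            self._cached_size = size
        return self._cached_size

    def serialize(self, out):
        # Write each child's (already cached) size, then the child, on down.
        out.extend(self.payload)
        for child in self.children:
            out.extend(encode_varint(child.byte_size()))
            child.serialize(out)
```

Nothing here walks upward, so a child keeps no reference to its parent, which is the ownership win Alek is after; sharing one child under two parents also works, since the cached size depends only on the subtree.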
Doxygen and protobuf
Hi -

Has anyone attempted to use Doxygen with .proto files? We're considering extending Doxygen to support the .proto format, but haven't started the project. Thoughts?

Scott
Re: how to parse a file with millions of records with protobuf
thanks, could you give me an example please?

On Dec 8, 4:10 pm, Jon Skeet [EMAIL PROTECTED] wrote:
> On Dec 7, 11:45 am, nightwalker leo [EMAIL PROTECTED] wrote:
>> when I try to parse an addressbook file which has 2^20 records of person, my program complains like this:
>>
>> libprotobuf WARNING D:\protobuf-2.0.2\src\google\protobuf\io\coded_stream.cc:459] Reading dangerously large protocol message. If the message turns out to be larger than 67108864 bytes, parsing will be halted for security reasons. To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
>>
>> how to deal with the problem in an elegant way instead of increasing the limit or simply turning off the warning message?
>
> In my C# port, I have code to write out messages as if they were a repeated field #1 of a container type, and another class to read back the same format in a streaming manner, one entry at a time. Would that be useful to you?
>
> Jon