[protobuf] C++ ParseFromCodedStream takes too long
Hello,

I have a message type like this:

    message SearchResponse {
      enum Status {
        OK = 200;
        BAD_REQUEST = 400;
        REQUEST_TIMEOUT = 408;
        INTERNAL_SERVER_ERROR = 500;
      }
      required Status status = 1;
      message Result {
        required int32 docid = 1;
        optional float score = 2;
      }
      repeated Result result = 2;
    }

Now it is possible that the result repeats 5,000,000 times or more. At the moment I'm working with about 2,100,000 repeats and the ParseFromCodedStream function takes about 25 seconds.

My code looks like this:

    google::protobuf::io::ArrayInputStream arrayIn(buffRecive, recived);
    google::protobuf::io::CodedInputStream codedIn(&arrayIn);
    google::protobuf::io::CodedInputStream::Limit msgLimit = codedIn.PushLimit(recived);
    srchResp.ParseFromCodedStream(&codedIn);
    codedIn.PopLimit(msgLimit);

The parsing and everything else works great, but the performance is the problem. The message with 2,100,000 repeats has around 2,600,000 bytes.

Is there a way to improve the performance?

--
You received this message because you are subscribed to the Google Groups Protocol Buffers group. To unsubscribe from this group and stop receiving emails from it, send an email to protobuf+unsubscr...@googlegroups.com. To post to this group, send email to protobuf@googlegroups.com. Visit this group at http://groups.google.com/group/protobuf. For more options, visit https://groups.google.com/d/optout.
Re: [protobuf] C++ ParseFromCodedStream takes too long
On Thu, Mar 20, 2014 at 3:28 AM, christian.kilpatrick.1...@gmail.com wrote:

> Hello,
>
> I have a message type like this:
>
>     message SearchResponse {
>       enum Status {
>         OK = 200;
>         BAD_REQUEST = 400;
>         REQUEST_TIMEOUT = 408;
>         INTERNAL_SERVER_ERROR = 500;
>       }
>       required Status status = 1;
>       message Result {
>         required int32 docid = 1;
>         optional float score = 2;
>       }
>       repeated Result result = 2;
>     }
>
> Now it is possible that the result repeats 5,000,000 times or more. At the moment I'm working with about 2,100,000 repeats and the ParseFromCodedStream function takes about 25 seconds.
>
> My code looks like this:
>
>     google::protobuf::io::ArrayInputStream arrayIn(buffRecive, recived);
>     google::protobuf::io::CodedInputStream codedIn(&arrayIn);
>     google::protobuf::io::CodedInputStream::Limit msgLimit = codedIn.PushLimit(recived);
>     srchResp.ParseFromCodedStream(&codedIn);
>     codedIn.PopLimit(msgLimit);
>
> The parsing and everything else works great, but the performance is the problem. The message with 2,100,000 repeats has around 2,600,000 bytes.
>
> Is there a way to improve the performance?

If that message structure can be changed, I'd suggest the following:

    repeated int64 docid = 2 [packed=true];
    repeated float score = 3 [packed=true];

Use the low bit of docid to indicate whether there's a score (hence the int32 -> int64 change; if your docids fit in 31 bits, there's no need for that). So your encoding algorithm would look like this:

    add_docid((docid << 1) | !!has_score)
    if (has_score) add_score(score)

And your reading algorithm would be the inverse. Unfortunately this means there's no easy way to seek to a particular docid and get its score, so some post-processing on the receiving end will be necessary.
Re: [protobuf] C++ ParseFromCodedStream takes too long
The message will be changed. Score isn't optional; every docid has a score. I could try it as you suggested, without the submessage.

But I'm a little bit confused. I tried it with Java and with C++. Java can parse the response in less than a second. C++ takes 20-25 seconds. Both times it is the same message, ~2,600,000 bytes large and with ~2,100,000 repeats. How can this be? Shouldn't C++ be faster than Java?
Re: [protobuf] C++ ParseFromCodedStream takes too long
On Thu, Mar 20, 2014 at 8:40 AM, christian.kilpatrick.1...@gmail.com wrote:

> The message will be changed. Score isn't optional; every docid has a score. I could try it as you suggested, without the submessage.
>
> But I'm a little bit confused. I tried it with Java and with C++. Java can parse the response in less than a second. C++ takes 20-25 seconds. Both times it is the same message, ~2,600,000 bytes large and with ~2,100,000 repeats. How can this be?

Your sizes must be wrong. If score isn't optional, your current message should be

    (tag 2 + length varint (for submessage) + tag 1 + varint + tag 2 + float) x 2.1M

That means that each entry will take at least 9 bytes, more if the docid is greater than 127. I don't see how this can only take up 2.6 MB...

> Shouldn't C++ be faster than Java?

Yes, it should be :) I'll let someone else answer here... You could profile the C++ code and see where the time is going...

-ilia
Re: [protobuf] C++ ParseFromCodedStream takes too long
Oh yes, you are right. There was a mistake in my calculation.

Well, I'll try without the submessage. The count of docid and of score is always the same, so that is no problem for me.

But it's sad that Java takes around 2 seconds to send the message to the server, receive the response and display it, while C++ takes 20-25 seconds for this one function call alone:

    srchResp.ParseFromCodedStream(&codedIn);

:( I don't think that changing the message will save me around 20 seconds :( Too bad.
Re: [protobuf] C++ ParseFromCodedStream takes too long
On 20 March 2014 06:10, christian.kilpatrick.1...@gmail.com wrote:

> Oh yes, you are right. There was a mistake in my calculation.
>
> Well, I'll try without the submessage. The count of docid and of score is always the same, so that is no problem for me.
>
> But it's sad that Java takes around 2 seconds to send the message to the server, receive the response and display it, while C++ takes 20-25 seconds for this one function call alone:
>
>     srchResp.ParseFromCodedStream(&codedIn);
>
> :( I don't think that changing the message will save me around 20 seconds :( Too bad.

If you have many things that need to be read into the repeated field, maybe your memory allocator is running into trouble when resizing the message all the time? Since the repeated field does not know beforehand how many objects are coming, it is forced to re-allocate while parsing.

Just re-use the message object, as it can make use of the memory already allocated. Create one C++ object, then use that to parse from the stream. When you're done with it, re-use the same object for the subsequent parse.

I think it is possible to somehow tell the repeated field beforehand to reserve some capacity (similar to a vector<T>), but I don't know how right now.

-h