[protobuf] C++ ParseFromCodedStream takes too long

2014-03-20 Thread christian . kilpatrick . 1991
Hello,

I have a message type like this:

message SearchResponse {
  enum Status {
    OK = 200;
    BAD_REQUEST = 400;
    REQUEST_TIMEOUT = 408;
    INTERNAL_SERVER_ERROR = 500;
  }
  required Status status = 1;

  message Result {
    required int32 docid = 1;
    optional float score = 2;
  }
  repeated Result result = 2;
}

Now it is possible that the result repeats 5,000,000 times or more. At the
moment I'm working with about 2,100,000 repeats, and the
ParseFromCodedStream function takes about 25 seconds.

My code looks like:

// Wrap the received bytes and parse them as a single SearchResponse.
google::protobuf::io::ArrayInputStream arrayIn(buffRecive, recived);
google::protobuf::io::CodedInputStream codedIn(&arrayIn);
google::protobuf::io::CodedInputStream::Limit msgLimit =
    codedIn.PushLimit(recived);
srchResp.ParseFromCodedStream(&codedIn);
codedIn.PopLimit(msgLimit);

The parsing and everything works great, but the performance is the problem.
The message with 2,100,000 repeats has around 2,600,000 bytes.
Is there a way to improve the performance?

-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to protobuf+unsubscr...@googlegroups.com.
To post to this group, send email to protobuf@googlegroups.com.
Visit this group at http://groups.google.com/group/protobuf.
For more options, visit https://groups.google.com/d/optout.


Re: [protobuf] C++ ParseFromCodedStream takes too long

2014-03-20 Thread Ilia Mirkin
On Thu, Mar 20, 2014 at 3:28 AM, christian.kilpatrick.1...@gmail.com wrote:
> [...]
> The parsing and everything works great, but the performance is the problem.
> The message with 2,100,000 repeats has around 2,600,000 bytes.
> Is there a way to improve the performance?

If that message structure can be changed, I'd suggest the following:

repeated int64 docid = 2 [packed=true];
repeated float score = 3 [packed=true];

Use the low bit of docid to indicate whether there's a score (hence
the int32 -> int64 change... if your docids fit in 31 bits, no need
for that). So your encoding algo would be like

add_docid((docid << 1) | !!has_score);
if (has_score) add_score(score);

And your reading algo would be the inverse. Unfortunately this means
there's no easy way to seek to a particular doc id and get its score,
so some post-processing on the receiver end will be necessary.
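
A minimal sketch of that encode/decode round trip, using plain std::vector in place of the generated packed repeated fields. The names PackedResults, Result, Encode and Decode are made up for illustration and are not part of protobuf; the sketch assumes non-negative docids and a C++17 compiler (for std::optional).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the two packed repeated fields suggested above.
struct PackedResults {
    std::vector<std::int64_t> docid;  // (docid << 1) | has_score
    std::vector<float> score;         // one entry per docid whose low bit is set
};

// A docid with an optional score, as in the original Result submessage.
using Result = std::pair<std::int32_t, std::optional<float>>;

// Writer side: fold the "has a score" flag into the low bit of each docid.
PackedResults Encode(const std::vector<Result>& results) {
    PackedResults out;
    out.docid.reserve(results.size());
    for (const Result& r : results) {
        assert(r.first >= 0);  // assumption: docids are never negative
        const std::int64_t flag = r.second.has_value() ? 1 : 0;
        out.docid.push_back((static_cast<std::int64_t>(r.first) << 1) | flag);
        if (r.second) out.score.push_back(*r.second);
    }
    return out;
}

// Reader side: the inverse. Scores are consumed in order, one per set low bit.
std::vector<Result> Decode(const PackedResults& in) {
    std::vector<Result> results;
    results.reserve(in.docid.size());
    std::size_t next_score = 0;
    for (std::int64_t packed : in.docid) {
        const std::int32_t docid = static_cast<std::int32_t>(packed >> 1);
        if (packed & 1) {
            assert(next_score < in.score.size());
            results.emplace_back(docid, in.score[next_score++]);
        } else {
            results.emplace_back(docid, std::nullopt);
        }
    }
    return results;
}
```

The sequential next_score counter is the post-processing mentioned above: the score of the n-th docid can only be found by walking the flags of all docids before it.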




Re: [protobuf] C++ ParseFromCodedStream takes too long

2014-03-20 Thread christian . kilpatrick . 1991
The message will be changed. Score isn't optional. Every docid has a score.

I could try it as you suggested, without the submessage.

But I'm a little bit confused.

I tried it with Java and with C++. Java can parse the response in less than
a second. C++ takes 20-25 seconds. Both times it is the same message,
~2,600,000 bytes large and with ~2,100,000 repeats. How can this be?
Shouldn't C++ be faster than Java?



Re: [protobuf] C++ ParseFromCodedStream takes too long

2014-03-20 Thread Ilia Mirkin
On Thu, Mar 20, 2014 at 8:40 AM, christian.kilpatrick.1...@gmail.com wrote:
> The message will be changed. Score isn't optional. Every docid has a score.
>
> I could try it as you suggested, without the submessage.
>
> But I'm a little bit confused.
>
> I tried it with Java and with C++. Java can parse the response in less than
> a second. C++ takes 20-25 seconds. Both times it is the same message,
> ~2,600,000 bytes large and with ~2,100,000 repeats. How can this be?

Your sizes must be wrong. If score isn't optional, your current
message should be

(<tag 2> <length varint (for submessage)> <tag 1> <varint> <tag 2>
<float>) x 2.1M

That means that each entry will take at least 9 bytes, more if the
docid > 127. I don't see how this can only take up 2.6MB...
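
To make that arithmetic concrete, here is a small sketch that computes the size of one entry from the layout above. VarintSize and ResultEntrySize are illustrative helpers, not protobuf API, and the sketch assumes a non-negative docid and a score that is always present.

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes a value occupies as a base-128 varint (7 payload bits per byte).
inline std::size_t VarintSize(std::uint64_t value) {
    std::size_t bytes = 1;
    while (value >= 0x80) {
        value >>= 7;
        ++bytes;
    }
    return bytes;
}

// Encoded size of one `repeated Result result = 2;` entry.
// Assumption: docid is non-negative and score is always set.
inline std::size_t ResultEntrySize(std::uint32_t docid) {
    const std::size_t payload = 1 + VarintSize(docid)  // tag of field 1 + docid varint
                              + 1 + 4;                 // tag of field 2 + 32-bit float
    return 1 + VarintSize(payload) + payload;          // tag of field 2 + length + payload
}
```

With these helpers, ResultEntrySize(127) is 9 and ResultEntrySize(128) is 10, so 2,100,000 entries need at least 2,100,000 x 9 = 18,900,000 bytes, roughly 18.9MB rather than 2.6MB.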

> Shouldn't C++ be faster than Java?

Yes, it should be :) I'll let someone else answer here... You could
profile the C++ code and see where the time is going...
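
As a first step before reaching for a real profiler, a wall-clock timer around the suspect call can at least confirm where the 20-25 seconds go. This is only a sketch; TimeSeconds is a made-up helper name, not something protobuf provides.

```cpp
#include <chrono>
#include <utility>

// Runs fn once and returns the elapsed wall-clock time in seconds.
template <typename Fn>
double TimeSeconds(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

With the code from the original post this could be used as TimeSeconds([&] { srchResp.ParseFromCodedStream(&codedIn); }), and likewise around the receive step, to see which part dominates.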

  -ilia



Re: [protobuf] C++ ParseFromCodedStream takes too long

2014-03-20 Thread christian . kilpatrick . 1991
Oh yes you are right. There was a mistake in my calculation.

Well, I'll try without the submessage. The count of docid and of score is
always the same, so that is no problem for me.
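
A rough sketch of how the receiver might pair the two parallel arrays back up, assuming the docids and scores have already been copied out of the parsed message into vectors. Result and ZipResults are hypothetical names for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Result {
    std::int32_t docid;
    float score;
};

// Pairs docids[i] with scores[i]. Returns false, leaving *out untouched,
// if the two arrays do not have the same length.
bool ZipResults(const std::vector<std::int32_t>& docids,
                const std::vector<float>& scores,
                std::vector<Result>* out) {
    if (docids.size() != scores.size()) return false;
    out->clear();
    out->reserve(docids.size());
    for (std::size_t i = 0; i < docids.size(); ++i) {
        out->push_back(Result{docids[i], scores[i]});
    }
    return true;
}
```

The length check matters because nothing in the wire format enforces that two separate repeated fields stay in step; that invariant is now up to the sender.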

But it's sad that Java takes around 2 seconds to send the message to the
server, receive the response and display it, while C++ takes 20-25
seconds for this one function call alone:
srchResp.ParseFromCodedStream(&codedIn); :(
I don't think that changing the message will save me around 20 seconds :(
Too bad.



Re: [protobuf] C++ ParseFromCodedStream takes too long

2014-03-20 Thread Henner Zeller
On 20 March 2014 06:10, christian.kilpatrick.1...@gmail.com wrote:
> Oh yes you are right. There was a mistake in my calculation.
>
> Well, I'll try without the submessage. The count of docid and of score is
> always the same, so that is no problem for me.
>
> But it's sad that Java takes around 2 seconds to send the message to the
> server, receive the response and display it, while C++ takes 20-25
> seconds for this one function call alone:
> srchResp.ParseFromCodedStream(&codedIn); :(
> I don't think that changing the message will save me around 20 seconds :(
> Too bad.

If you have many things that need to be read into the repeated field,
maybe your memory allocator is running into trouble when resizing the
message all the time? Since the repeated field does not know
beforehand how many objects are coming, it is forced to re-allocate
while parsing.

Just re-use the message object, as it can make use of the memory
already allocated. Create one C++ object, then use that to parse from
the stream. When you're done with it, re-use the same object for the
subsequent parse.
I think it is possible to somehow tell the repeated field beforehand
to reserve some capacity (similar to a vector<T>), but I don't know how
right now.
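
The effect can be illustrated with std::vector as an analogy; this is not protobuf's repeated field implementation, and CountReallocations is a made-up helper. On the protobuf side, RepeatedPtrField has a Reserve() method, so something like srchResp.mutable_result()->Reserve(n) should be the counterpart of vector::reserve, assuming n is known or can be estimated before parsing.

```cpp
#include <cstddef>
#include <vector>

struct Result {
    int docid;
    float score;
};

// Appends `count` elements and returns how often the capacity changed,
// i.e. how often the vector had to move to a larger buffer.
std::size_t CountReallocations(std::vector<Result>& results, std::size_t count) {
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t before = results.capacity();
        results.push_back(Result{static_cast<int>(i), 1.0f});
        if (results.capacity() != before) ++reallocations;
    }
    return reallocations;
}
```

A fresh vector reports several reallocations for a large count, while one that was reserve()d up front, or one that was filled once and then clear()ed for re-use, reports none. That second case is the analogue of re-using the same message object across parses.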

-h

