[protobuf] Re: suggestions on improving the performance?
Any more suggestions?
[protobuf] Re: suggestions on improving the performance?
Daniel, I am hoping that my code is incorrect, but I am not sure what is wrong or what is really causing this slowness.

@Henner Zeller: sorry, I forgot to include the object length in the above example. I do store the object length for each object. I don't have issues reading the objects; the code works fine. I just want to make it run faster now. Attaching my code here.

The file format is: file header, Record1, Record2, Record3, ... Each record contains n objects of the types defined in the proto file. The 1st object of a record has a header which contains the number of objects in that record.

code (proto file)

message HeaderMessage {
  required double timestamp = 1;
  required string ric_code = 2;
  required int32 count = 3;
  required int32 total_message_size = 4;
}

message QuoteMessage {
  enum Side {
    ASK = 0;
    BID = 1;
  }
  required Side type = 1;
  required int32 level = 2;
  optional double price = 3;
  optional int64 size = 4;
  optional int32 count = 5;
  optional HeaderMessage header = 6;
}

message CustomMessage {
  required string field_name = 1;
  required double value = 2;
  optional HeaderMessage header = 3;
}

message TradeMessage {
  optional double price = 1;
  optional int64 size = 2;
  optional int64 AccumulatedVolume = 3;
  optional HeaderMessage header = 4;
}

message AlphaMessage {
  required int32 level = 1;
  required double alpha = 2;
  optional double stddev = 3;
  optional HeaderMessage header = 4;
}

/code

code (reading records from the binary file)

bool ReadNextRecord(CodedInputStream *coded_input, stdext::hash_set<std::string> &instruments) {
  uint32 objtype, objlen;
  int objectsread = 0;
  HeaderMessage *hMsg = NULL;
  TradeMessage tMsg;
  QuoteMessage qMsg;
  CustomMessage cMsg;
  AlphaMessage aMsg;
  while (1) {
    if (!coded_input->ReadLittleEndian32(&objtype)) return false;
    if (!coded_input->ReadLittleEndian32(&objlen)) return false;
    CodedInputStream::Limit lim = coded_input->PushLimit(objlen);
    switch (objtype) {
      case 2:
        qMsg.ParseFromCodedStream(coded_input);
        if (qMsg.has_header()) {
          hMsg = new HeaderMessage();
          hMsg->Clear();
          hMsg->Swap(qMsg.mutable_header());
        }
        objectsread++;
        break;
      case 3:
        tMsg.ParseFromCodedStream(coded_input);
        if (tMsg.has_header()) {
          hMsg = new HeaderMessage();
          hMsg->Clear();
          hMsg->Swap(tMsg.mutable_header());
        }
        objectsread++;
        break;
      case 4:
        aMsg.ParseFromCodedStream(coded_input);
        if (aMsg.has_header()) {
          hMsg = new HeaderMessage();
          hMsg->Clear();
          hMsg->Swap(aMsg.mutable_header());
        }
        objectsread++;
        break;
      case 5:
        cMsg.ParseFromCodedStream(coded_input);
        if (cMsg.has_header()) {
          hMsg = new HeaderMessage();
          hMsg->Clear();
          hMsg->Swap(cMsg.mutable_header());
        }
        objectsread++;
        break;
      default:
        cout << "Invalid object type " << objtype << endl;
        return false;
    }
    coded_input->PopLimit(lim);
    if (objectsread == hMsg->count()) break;
  }
  return true;
}

void ReadAllMessages(ZeroCopyInputStream *raw_input, stdext::hash_set<std::string> &instruments) {
  int item_count = 0;
  while (1) {
    CodedInputStream in(raw_input);
    if (!ReadNextRecord(&in, instruments)) break;
    item_count++;
  }
  cout << "Finished reading file. Total " << item_count << " items read." << endl;
}

/code

int
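The type-plus-length framing that ReadNextRecord relies on can be sketched without protobuf at all. The following is a toy model, not the protobuf API: `WriteRecord`/`ReadRecords` are hypothetical helpers that frame payloads with the same fixed 32-bit little-endian type and length fields that `ReadLittleEndian32` consumes above, and advance past each payload the way `PushLimit`/`PopLimit` fences a parse.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Append one little-endian uint32, mirroring the thread's LE32 header fields.
static void PutU32(std::string &out, uint32_t v) {
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<char>((v >> (8 * i)) & 0xff));
}

// Append one framed record: type, length, payload.
void WriteRecord(std::string &out, uint32_t type, const std::string &payload) {
    PutU32(out, type);
    PutU32(out, static_cast<uint32_t>(payload.size()));
    out += payload;
}

// Read every record back as (type, payload) pairs.
std::vector<std::pair<uint32_t, std::string>> ReadRecords(const std::string &in) {
    std::vector<std::pair<uint32_t, std::string>> recs;
    size_t pos = 0;
    auto getU32 = [&](uint32_t &v) -> bool {
        if (pos + 4 > in.size()) return false;
        v = 0;
        for (int i = 0; i < 4; ++i)
            v |= static_cast<uint32_t>(static_cast<unsigned char>(in[pos + i])) << (8 * i);
        pos += 4;
        return true;
    };
    uint32_t type = 0, len = 0;
    while (getU32(type) && getU32(len)) {
        if (pos + len > in.size()) break;          // truncated payload: stop cleanly
        recs.emplace_back(type, in.substr(pos, len));
        pos += len;                                 // like PopLimit: never read past the frame
    }
    return recs;
}
```

The length prefix is what lets the reader recover frame boundaries from a flat byte stream; without it, consecutive messages would merge during parsing.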
Re: [protobuf] Re: suggestions on improving the performance?
You're making a new CodedInputStream for each message -- I think that gives very poor buffering behavior. You should just pass coded_input to ReadAllMessages and keep reusing it.

Cheers,
Daniel
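Daniel's point is about buffer amortization: a reader that refills a block at a time touches the underlying stream far less often than per-byte reads, and a freshly constructed reader always starts with a cold buffer. A toy sketch of that effect (these `Source`/`BufferedReader` types are illustrative, not protobuf's actual classes):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Toy byte source that counts how many times it is asked for data.
struct Source {
    std::string data;
    size_t pos = 0;
    int calls = 0;
    size_t Read(char *dst, size_t n) {
        ++calls;
        size_t avail = std::min(n, data.size() - pos);
        std::copy(data.begin() + pos, data.begin() + pos + avail, dst);
        pos += avail;
        return avail;
    }
};

// Minimal buffered reader; a fresh instance starts with an empty buffer,
// so constructing one per message forfeits all readahead.
struct BufferedReader {
    Source &src;
    char buf[4096];
    size_t len = 0, off = 0;
    explicit BufferedReader(Source &s) : src(s) {}
    bool ReadByte(char &c) {
        if (off == len) {                       // refill one block from the source
            len = src.Read(buf, sizeof(buf));
            off = 0;
            if (len == 0) return false;
        }
        c = buf[off++];
        return true;
    }
};

// Source calls when one reused reader drains n bytes.
int CallsWithReuse(size_t n) {
    Source s; s.data.assign(n, 'x');
    BufferedReader r(s);
    char c;
    while (r.ReadByte(c)) {}
    return s.calls;
}

// Source calls when every byte is fetched unbuffered.
int CallsPerByte(size_t n) {
    Source s; s.data.assign(n, 'x');
    char c;
    while (s.Read(&c, 1) > 0) {}
    return s.calls;
}
```

Draining 8 KiB costs three source calls with a reused 4 KiB buffer versus 8193 unbuffered calls; real file I/O amplifies that gap further.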
[protobuf] Re: suggestions on improving the performance?
I was actually doing that initially, but I kept getting an error that the maximum length for a message was reached (I don't have the exact error string at the moment). This was because my input binary file is large, and it hits the coded input stream's limit very fast. I saw a post on the forum (or maybe on Stack Exchange) which suggested that I should create a new coded_input object for each message; otherwise I have to reset the limits on the coded input object. The user on that thread said it is cheap to create and destroy coded_input objects -- they are not big. Anyway, I will try again, resetting the limits on a single object instead. But then, could that be what is causing the slowness? I will try it and let you know the results.

Regards,
Alok
[protobuf] Re: suggestions on improving the performance?
Here is the link to a forum post which states why I have to set the limit:

http://markmail.org/message/km7mlmj46jgfs3rx#query:+page:1+mid:5f7q3wj2htwajjof+state:results

Excerpt from the link:

"The problem is that CodedInputStream has an internal counter of how many bytes have been read so far with the same object. In my case, there are a lot of small messages saved in the same file. I do not read them all at once and therefore do not care about large-message limits. I am safe. So, the problem can be easily solved by calling:

CodedInputStream input_stream(...);
input_stream.SetTotalBytesLimit(1e9, 9e8);

My use-case is really about storing an extremely large number (up to 1e9) of small messages, ~10K each."

My problem is the same as above, so I will have to set the limits on the coded input object.

Regards,
Alok
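The counter described in that excerpt can be modeled in a few lines. This is a hypothetical sketch, not CodedInputStream itself: `CountedReader` charges every read against one lifetime total, so a stream of tiny messages still trips the limit eventually, and resetting (which recreating the stream, or raising the limit, effectively does) lets reading continue.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a total-bytes counter: every read is charged against one
// running limit for the lifetime of the object, as in CodedInputStream.
struct CountedReader {
    size_t total_read = 0;
    size_t total_limit;
    explicit CountedReader(size_t limit) : total_limit(limit) {}
    // Fails once the lifetime total would exceed the limit,
    // even though each individual message is tiny.
    bool Consume(size_t nbytes) {
        if (total_read + nbytes > total_limit) return false;
        total_read += nbytes;
        return true;
    }
    void Reset() { total_read = 0; }   // what re-creating the stream effects
};

// Read `count` messages of `msg` bytes each, resetting every `period`
// messages (period == 0 means never); returns how many messages succeeded.
int ReadMessages(size_t count, size_t msg, size_t period, size_t limit) {
    CountedReader r(limit);
    int ok = 0;
    for (size_t i = 0; i < count; ++i) {
        if (period && i % period == 0) r.Reset();
        if (!r.Consume(msg)) break;
        ++ok;
    }
    return ok;
}
```

With a limit of 500 bytes and 10-byte messages, reading stops at 50 messages without resets but completes all 100 with a periodic reset, which is the failure mode and workaround this thread describes.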
[protobuf] Re: suggestions on improving the performance?
Google Groups link: http://groups.google.com/group/protobuf/browse_thread/thread/64a07911e3c90cd5

I tested the code with reuse of the coded input object. Not much change in the speed.

void ReadAllMessages(ZeroCopyInputStream *raw_input, stdext::hash_set<std::string> &instruments) {
  int item_count = 0;
  CodedInputStream *in = new CodedInputStream(raw_input);
  in->SetTotalBytesLimit(1e9, 9e8);
  while (1) {
    if (item_count % 20 == 0) {
      delete in;
      in = new CodedInputStream(raw_input);
      in->SetTotalBytesLimit(1e9, 9e8);
    }
    if (!ReadNextRecord(in, instruments)) break;
    item_count++;
  }
  cout << "Finished reading file. Total " << item_count << " items read." << endl;
}

I recreate the coded input object roughly every 200k objects; there are around 650k objects in total in the file. I have a feeling this slowness may come from my binary file format. Is there anything I can change so that I can read it faster -- e.g., removing optional fields and keeping the format as raw as possible?

Regards,
Alok
[protobuf] Re: suggestions on improving the performance?
Any suggestions? Experiences?

Regards,
Alok

On Jan 11, 1:16 pm, alok alok.jad...@gmail.com wrote:

My point is: should I have one message, something like

message Record {
  required HeaderMessage header = 1;
  optional TradeMessage trade = 2;
  repeated QuoteMessage quotes = 3;   // 0 or more
  repeated CustomMessage customs = 4; // 0 or more
}

or should I rather keep my file flat, as "object type, object, object type, object", without the concept of a record? Each record in the file is usually a header plus one message of any type (trade, quote or custom) -- mostly only one quote or custom message, not more. What would be faster to decode?

Regards,
Alok

On Jan 11, 12:41 pm, alok alok.jad...@gmail.com wrote:

Hi everyone,

My program is taking more time to read binary files than text files. I think the issue is with the structure of the binary files that I have designed. (Or could it be that binary decoding is slower than text-file parsing?)

The data file is a large text file with one record per row, up to 1.2 GB. The binary file is around 900 MB.

- Text file reading takes 3 minutes.
- Binary file reading takes 5 minutes.

I also saw very strange behavior. Just to see how long it takes to skim through the binary file, I started reading only the header of each message, which holds the length of the message, and then skipped that many bytes using the Skip() function of the coded_input object. I was expecting that reading through the file would take less time, but it took more than 10 minutes. Is skipping not the same as adding n bytes to the file pointer? Is it slower to skip an object than to read it?

Are there any guidelines on how the structure should be designed to get the best performance? My current structure looks as below:

message HeaderMessage {
  required double timestamp = 1;
  required string ric_code = 2;
  required int32 count = 3;
  required int32 total_message_size = 4;
}

message QuoteMessage {
  enum Side {
    ASK = 0;
    BID = 1;
  }
  required Side type = 1;
  required int32 level = 2;
  optional double price = 3;
  optional int64 size = 4;
  optional int32 count = 5;
  optional HeaderMessage header = 6;
}

message CustomMessage {
  required string field_name = 1;
  required double value = 2;
  optional HeaderMessage header = 3;
}

message TradeMessage {
  optional double price = 1;
  optional int64 size = 2;
  optional int64 AccumulatedVolume = 3;
  optional HeaderMessage header = 4;
}

The binary file format is: object type, object, object type, object, ... The 1st object of a record holds a header with n, the number of objects in that record. The next n-1 objects do not hold a header, since they all belong to the same record (same update time). The (n+1)th object then belongs to a new record and holds the header for that record.

Any advice?

Regards,
Alok

--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.
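On the Skip() question quoted above: once the length header of a frame is known, skipping the payload in memory is pure pointer arithmetic, so a skim that is slower than a full parse points at the I/O layer (buffer refills still fetching the skipped bytes) rather than at the skip itself. A toy sketch, assuming the thread's LE32 type/length framing (`GetU32` and `CountRecordsBySkipping` are hypothetical names):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode one little-endian uint32 at `pos` (the thread's frame-header format).
static bool GetU32(const std::string &in, size_t &pos, uint32_t &v) {
    if (pos + 4 > in.size()) return false;
    v = 0;
    for (int i = 0; i < 4; ++i)
        v |= static_cast<uint32_t>(static_cast<unsigned char>(in[pos + i])) << (8 * i);
    pos += 4;
    return true;
}

// Skim the stream by reading only headers and jumping over payloads;
// per record this is an index increment, no payload work at all.
int CountRecordsBySkipping(const std::string &in) {
    size_t pos = 0;
    uint32_t type = 0, len = 0;
    int n = 0;
    while (GetU32(in, pos, type) && GetU32(in, pos, len)) {
        if (pos + len > in.size()) break;
        pos += len;   // skip the payload without touching it
        ++n;
    }
    return n;
}
```

In-memory skimming like this can only be faster than parsing; on a real file, a buffered stream's Skip may still pull the skipped bytes through the buffer unless the underlying stream supports seeking.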
Re: [protobuf] Re: suggestions on improving the performance?
It's extremely unlikely that text parsing is faster than binary parsing on pretty much any message. My guess is that there's something wrong in the way you're reading the binary file -- e.g. no buffering, or possibly a bug where you hand the protobuf library multiple messages concatenated together. It'd be easier to comment if you post the code.

Cheers,
Daniel
Re: [protobuf] Re: suggestions on improving the performance?
On Fri, Jan 13, 2012 at 11:22, Daniel Wright dwri...@google.com wrote: It's extremely unlikely that text parsing is faster than binary parsing on pretty much any message. My guess is that there's something wrong in the way you're reading the binary file -- e.g. no buffering, or possibly a bug where you hand the protobuf library multiple messages concatenated together. In particular, the object type, object, object type object .. doesn't seem to include headers that describe the length of the following message, but such a separator is needed. ( http://code.google.com/apis/protocolbuffers/docs/techniques.html#streaming ) It'd be easier to comment if you post the code. Cheers Daniel On Fri, Jan 13, 2012 at 1:22 AM, alok alok.jad...@gmail.com wrote: any suggestions? experiences? regards, Alok On Jan 11, 1:16 pm, alok alok.jad...@gmail.com wrote: my point is ..should i have one message something like Message Record{ required HeaderMessage header; optional TradeMessage trade; repeated QuoteMessage quotes; // 0 or more repeated CustomMessage customs; // 0 or more } or rather should i keep my file plain as object type, object, objecttype, object without worrying about the concept of a record. Each message in file is usually header + any 1 type of message (trade, quote or custom) .. and mostly only 1 quote or custom message not more. what would be faster to decode? Regards, Alok On Jan 11, 12:41 pm, alok alok.jad...@gmail.com wrote: Hi everyone, My program is taking more time to read binary files than the text files. I think the issue is with the structure of the binary files that i have designed. (Or could it be possible that binary decoding is slower than text files parsing? ). Data file is a large text file with 1 record per row. upto 1.2 GB. Binary file is around 900 MB. ** - Text file reading takes 3 minutes to read the file. - Binary file reading takes 5 minutes. I saw a very strange behavior. 
- Just to see how long it takes to skim through binary file, i started reading header on each message which holds the length of the message and then skipped that many bytes using the Skip() function of coded_input object. After making this change, i was expecting that reading through file should take less time, but it took more than 10 minutes. Is skipping not same as adding n bytes to the file pointer? is it slower to skip the object than read it? Are their any guidelines on how the structure should be designed to get the best performance? My current structure looks as below message HeaderMessage { required double timestamp = 1; required string ric_code = 2; required int32 count = 3; required int32 total_message_size = 4; } message QuoteMessage { enum Side { ASK = 0; BID = 1; } required Side type = 1; required int32 level = 2; optional double price = 3; optional int64 size = 4; optional int32 count = 5; optional HeaderMessage header = 6; } message CustomMessage { required string field_name = 1; required double value = 2; optional HeaderMessage header = 3; } message TradeMessage { optional double price = 1; optional int64 size = 2; optional int64 AccumulatedVolume = 3; optional HeaderMessage header = 4; } Binary file format is object type, object, object type object ... 1st object of a record holds header with n number of objects in that record. next n-1 objects will not hold header since they all belong to same record (same update time). now n+1th object belongs to the new record and it will hold header for next record. Any advices? Regards, Alok -- You received this message because you are subscribed to the Google Groups Protocol Buffers group. To post to this group, send email to protobuf@googlegroups.com. To unsubscribe from this group, send email to protobuf+unsubscr...@googlegroups.com. For more options, visit this group at http://groups.google.com/group/protobuf?hl=en. 