[protobuf] Re: how to use GzipInputStream with multiple messages?

2012-02-05 Thread alok
Could someone give me a use case / example of using GzipInputStream
(GzipOutputStream) for reading a binary file with multiple object types?

Regards,
Alok

On Feb 2, 5:46 pm, alok alok.jad...@gmail.com wrote:
 Also, what is the standard way to write to and read from a gzip
 stream? I am doing something like this

 to write to the stream:
 headerMessage.SerializeToZeroCopyStream(gzip_output);

 to read from the stream:
 headerMessage.ParseFromZeroCopyStream(gzip_input,
 headerMessage.ByteSize());

 Is the above approach correct? I haven't used gzip streams with
 protocol buffers before, and I am not able to make it work.

 Regards,
 Alok

 On Feb 2, 5:18 pm, alok alok.jad...@gmail.com wrote:

  How do we implement GzipInputStream to read a file with different
  messages? I could achieve this with a coded input stream by writing
  the size of each object before the object itself. I am not sure how
  to get the same result using GzipInputStream. Could someone please
  guide me here?

  Regards,
  Alok
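
One possible shape for this (a sketch only, untested, assuming protobuf was built with zlib support and using generated classes like the HeaderMessage from the performance thread below): the gzip streams are just another ZeroCopyStream layer, so the same type+length framing used for an uncompressed file works if the CodedInputStream/CodedOutputStream is stacked on top of the gzip stream.

```cpp
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/gzip_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>

using namespace google::protobuf::io;

// Writing: fd -> FileOutputStream -> GzipOutputStream -> CodedOutputStream.
// Each message is framed with a little-endian type and length, exactly as
// in the uncompressed format discussed in the other thread.
void WriteOne(CodedOutputStream *out, google::protobuf::uint32 objtype,
              const google::protobuf::MessageLite &msg) {
  out->WriteLittleEndian32(objtype);
  out->WriteLittleEndian32(msg.ByteSize());
  msg.SerializeToCodedStream(out);
}

// Reading: fd -> FileInputStream -> GzipInputStream -> CodedInputStream.
void ReadAll(int fd) {
  FileInputStream raw_input(fd);
  GzipInputStream gzip_input(&raw_input);
  CodedInputStream coded_input(&gzip_input);

  google::protobuf::uint32 objtype, objlen;
  while (coded_input.ReadLittleEndian32(&objtype) &&
         coded_input.ReadLittleEndian32(&objlen)) {
    CodedInputStream::Limit lim = coded_input.PushLimit(objlen);
    // ... switch on objtype and ParseFromCodedStream() as in the
    //     uncompressed reader ...
    coded_input.PopLimit(lim);
  }
}  // stream destructors flush/close the gzip layer
```

Note that SerializeToZeroCopyStream on a GzipOutputStream also works for a single message, but with multiple messages the length prefixes are still needed, because ParseFromZeroCopyStream would otherwise read to the end of the stream.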

-- 
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.



[protobuf] Re: suggestions on improving the performance?

2012-01-16 Thread alok
Any more suggestions?

On Jan 16, 11:14 am, alok alok.jad...@gmail.com wrote:
 google groups link:
 http://groups.google.com/group/protobuf/browse_thread/thread/64a07911...

 I tested the code with reusing the coded input object. Not much change
 in the speed performance.

 void ReadAllMessages(ZeroCopyInputStream *raw_input,
 stdext::hash_set<std::string> instruments)
 {
         int item_count = 0;

         CodedInputStream* in = new CodedInputStream(raw_input);
         in->SetTotalBytesLimit(1e9, 9e8);
         while(1)
         {
                 if(item_count % 20 == 0){
                         delete in;
                         in = new CodedInputStream(raw_input);
                         in->SetTotalBytesLimit(1e9, 9e8);
                 }
                 if(!ReadNextRecord(in, instruments))
                         break;
                 item_count++;
         }
         cout << "Finished reading file. Total " << item_count
              << " items read." << endl;
 }

 I recreate the coded input object after every 200k objects; there are
 around 650k objects in the file in total.

 I have a feeling this slowness is because of my binary file format. Is
 there anything I can change so that I can read it faster? For example,
 removing optional fields and keeping the format as raw as possible.

 regards,
 Alok

 On Jan 16, 10:40 am, alok alok.jad...@gmail.com wrote:

  Here is the link to a forum post which states why I have to set the limit.

 http://markmail.org/message/km7mlmj46jgfs3rx#query:+page:1+mid:5f7q3w...

  excerpt from the link

  The problem is that CodedInputStream has internal counter of how many
  bytes are read so far with the same object.

  In my case, there are a lot of small messages saved in the same file.
  I do not read them at once and therefore do not care about large
  messages, limits. I am safe.

  So, the problem can be easily solved by calling:

  CodedInputStream input_stream(...);
  input_stream.SetTotalBytesLimit(1e9, 9e8);

  My use-case is really about storing extremely large number (up to 1e9)
  of small messages ~ 10K each. 

  My problem is the same as above, so I will have to set the limits on
  the coded input object.

  Regards,
  Alok

  On Jan 16, 10:26 am, alok alok.jad...@gmail.com wrote:

    I was actually doing that initially, but I kept getting an error,
    "Maximum length for a message is reached" (I don't have the exact
    error string at the moment). This was because my input binary file
    is large and it reaches the limit for coded input very fast.

    I saw a post on the forum (or maybe on Stack Exchange) which
    suggested that I should create a new coded_input object for each
    message. I have to reset the limits for the coded input object. A
    user on that thread suggested that it's easy to create and destroy
    the coded_input object. These objects are not big.

    Anyway, I will try it again by resetting the limits on this object.
    But then, would this be causing the slowness? I will try and let
    you know the results.

   Regards,
   Alok

   On Jan 16, 9:46 am, Daniel Wright dwri...@google.com wrote:

You're making a new CodedInputStream for each message -- I think that 
gives
very poor buffering behavior.  You should just pass coded_input to
ReadAllMessages and keep reusing it.

Cheers
Daniel

On Sun, Jan 15, 2012 at 4:41 PM, alok alok.jad...@gmail.com wrote:
 Daniel,

  I am hoping that my code is incorrect, but I am not sure what is
  wrong or what is really causing this slowness.

  @Henner Zeller: sorry, I forgot to include the object length in the
  above example. I do store the object length for each object. I don't
  have issues reading all the objects; the code is working fine. I just
  want to make the code run faster now.

 attaching my code here...

 File format is

 File header
 Record1, Record2, Record3

  Each record contains n objects of the types defined in the proto
  file. The 1st object carries a header which contains the number of
  objects in the record.

 code
 proto file

 message HeaderMessage {
        required double timestamp = 1;
  required string ric_code = 2;
  required int32 count = 3;
  required int32 total_message_size = 4;
 }

 message QuoteMessage {
        enum Side {
    ASK = 0;
    BID = 1;
  }
  required Side type = 1;
        required int32 level = 2;
        optional double price = 3;
        optional int64 size = 4;
        optional int32 count = 5;
        optional HeaderMessage header = 6;
 }

 message CustomMessage {
        required string field_name = 1;
        required double value = 2;
        optional HeaderMessage header = 3;
 }

 message TradeMessage {
        optional double price = 1;
        optional int64 size = 2;
        optional int64 AccumulatedVolume = 3;
        optional HeaderMessage header = 4;
 }

 message AlphaMessage

[protobuf] Re: suggestions on improving the performance?

2012-01-15 Thread alok
(int argc, _TCHAR* argv[])
{
GOOGLE_PROTOBUF_VERIFY_VERSION;

ZeroCopyInputStream *raw_input;
CodedInputStream *coded_input;
stdext::hash_set<std::string> instruments;

string filename = "S:/users/aaj/sandbox/tickdata/bin/hk/2011/2011.01.04.bin";
int fd = _open(filename.c_str(), _O_BINARY | O_RDONLY);

if( fd == -1 )
{
printf( "Error opening the file. \n" );
exit( 1 );
}

raw_input = new FileInputStream(fd);
coded_input = new CodedInputStream(raw_input);

uint32 magic_no;

coded_input->ReadLittleEndian32(&magic_no);

cout << "HEADER: " << "\t" << magic_no << endl;
cout << "Reading data objects.." << endl;
delete coded_input;
cout << td << '\n';

ReadAllMessages(raw_input, instruments);

cout << td << '\n';

delete raw_input;
_close(fd);
google::protobuf::ShutdownProtobufLibrary();

return 0;
}

/code


On Jan 14, 3:37 am, Henner Zeller henner.zel...@googlemail.com
wrote:
 On Fri, Jan 13, 2012 at 11:22, Daniel Wright dwri...@google.com wrote:
  It's extremely unlikely that text parsing is faster than binary parsing on
  pretty much any message.  My guess is that there's something wrong in the
  way you're reading the binary file -- e.g. no buffering, or possibly a bug
  where you hand the protobuf library multiple messages concatenated together.

 In particular, the
    object type, object, object type object ..
 doesn't seem to include headers that describe the length of the
 following message, but such a separator is needed.
 (http://code.google.com/apis/protocolbuffers/docs/techniques.html#stre...)


   It'd be easier to comment if you post the code.

  Cheers
  Daniel

  On Fri, Jan 13, 2012 at 1:22 AM, alok alok.jad...@gmail.com wrote:

  Any suggestions? Experiences?

  regards,
  Alok

  On Jan 11, 1:16 pm, alok alok.jad...@gmail.com wrote:
   My point is: should I have one message, something like

   Message Record{
     required HeaderMessage header;
     optional TradeMessage trade;
     repeated QuoteMessage quotes; // 0 or more
     repeated CustomMessage customs; // 0 or more
   }

   or rather should I keep my file plain, as
   object type, object, object type, object
   without worrying about the concept of a record?

   Each message in the file is usually a header + any 1 type of message
   (trade, quote or custom), and mostly only 1 quote or custom message,
   not more.

   What would be faster to decode?

   Regards,
   Alok

   On Jan 11, 12:41 pm, alok alok.jad...@gmail.com wrote:

Hi everyone,

My program is taking more time to read binary files than text
files. I think the issue is with the structure of the binary files
that I have designed. (Or could it be that binary decoding is
slower than text file parsing?)

The data file is a large text file with 1 record per row, up to 1.2 GB.
The binary file is around 900 MB.

**
 - Text file reading takes 3 minutes to read the file.
 - Binary file reading takes 5 minutes.

I saw a very strange behavior.
 - Just to see how long it takes to skim through the binary file, I
started reading the header on each message, which holds the length of
the message, and then skipped that many bytes using the Skip() function
of the coded_input object. After making this change, I was expecting
that reading through the file would take less time, but it took more
than 10 minutes. Is skipping not the same as advancing the file pointer
by n bytes? Is it slower to skip an object than to read it?

Are there any guidelines on how the structure should be designed to
get the best performance?

My current structure looks as below

message HeaderMessage {
  required double timestamp = 1;
  required string ric_code = 2;
  required int32 count = 3;
  required int32 total_message_size = 4;

}

message QuoteMessage {
        enum Side {
    ASK = 0;
    BID = 1;
  }
  required Side type = 1;
        required int32 level = 2;
        optional double price = 3;
        optional int64 size = 4;
        optional int32 count = 5;
        optional HeaderMessage header = 6;

}

message CustomMessage {
        required string field_name = 1;
        required double value = 2;
        optional HeaderMessage header = 3;

}

message TradeMessage {
        optional double price = 1;
        optional int64 size = 2;
        optional int64 AccumulatedVolume = 3;
        optional HeaderMessage header = 4;

}

Binary file format is
object type, object, object type object ...

 The 1st object of a record holds a header with n, the number of
 objects in that record. The next n-1 objects do not hold a header,
 since they all belong to the same record (same update time). The
 (n+1)th object belongs to the new record and holds the header for the
 next record.

[protobuf] Re: suggestions on improving the performance?

2012-01-15 Thread alok
I was actually doing that initially, but I kept getting an error,
"Maximum length for a message is reached" (I don't have the exact
error string at the moment). This was because my input binary file is
large and it reaches the limit for coded input very fast.

I saw a post on the forum (or maybe on Stack Exchange) which suggested
that I should create a new coded_input object for each message. I have
to reset the limits for the coded input object. A user on that thread
suggested that it's easy to create and destroy the coded_input object.
These objects are not big.

Anyway, I will try it again by resetting the limits on this object.
But then, would this be causing the slowness? I will try and let you
know the results.

Regards,
Alok

On Jan 16, 9:46 am, Daniel Wright dwri...@google.com wrote:
 You're making a new CodedInputStream for each message -- I think that gives
 very poor buffering behavior.  You should just pass coded_input to
 ReadAllMessages and keep reusing it.

 Cheers
 Daniel

 On Sun, Jan 15, 2012 at 4:41 PM, alok alok.jad...@gmail.com wrote:
  Daniel,

  i am hoping that my code is incorrect but i am not sure what is wrong
  or what is really causing this slowness.

  @ Henner Zeller, sorry i forgot to include the object length in above
  example. I do store object length for each object. I dont have issues
  in reading all the objects. Code is working fine. I just want to make
  sure to be able to make the code run faster now.

  attaching my code here...

  File format is

  File header
  Record1, Record2, Record3

  Each record contains n objects of the types defined in the proto
  file. The 1st object carries a header which contains the number of
  objects in the record.

  code
  proto file

  message HeaderMessage {
         required double timestamp = 1;
   required string ric_code = 2;
   required int32 count = 3;
   required int32 total_message_size = 4;
  }

  message QuoteMessage {
         enum Side {
     ASK = 0;
     BID = 1;
   }
   required Side type = 1;
         required int32 level = 2;
         optional double price = 3;
         optional int64 size = 4;
         optional int32 count = 5;
         optional HeaderMessage header = 6;
  }

  message CustomMessage {
         required string field_name = 1;
         required double value = 2;
         optional HeaderMessage header = 3;
  }

  message TradeMessage {
         optional double price = 1;
         optional int64 size = 2;
         optional int64 AccumulatedVolume = 3;
         optional HeaderMessage header = 4;
  }

  message AlphaMessage {
         required int32 level = 1;
         required double alpha = 2;
         optional double stddev = 3;
          optional HeaderMessage header = 4;
  }

  /code

  code
  Reading records from binary file

  bool ReadNextRecord(CodedInputStream *coded_input,
  stdext::hash_set<std::string> instruments)
  {
         uint32 count, objtype, objlen;
         int i;

         int objectsread = 0;
         HeaderMessage *hMsg = NULL;
         TradeMessage tMsg;
         QuoteMessage qMsg;
         CustomMessage cMsg;
         AlphaMessage aMsg;

         while(1)
         {
                 if(!coded_input->ReadLittleEndian32(&objtype)) {
                         return false;
                 }
                 if(!coded_input->ReadLittleEndian32(&objlen)) {
                         return false;
                 }
                 CodedInputStream::Limit lim =
  coded_input->PushLimit(objlen);

                 switch(objtype)
                 {
                 case 2:
                         qMsg.ParseFromCodedStream(coded_input);
                         if(qMsg.has_header())
                         {
                                 //hMsg =
                                 hMsg = new HeaderMessage();
                                 hMsg->Clear();
                                 hMsg->Swap(qMsg.mutable_header());
                         }
                         objectsread++;
                         break;

                 case 3:
                         tMsg.ParseFromCodedStream(coded_input);
                         if(tMsg.has_header())
                         {
                                 //hMsg = tMsg.mutable_header();
                                 hMsg = new HeaderMessage();
                                 hMsg->Clear();
                                 hMsg->Swap(tMsg.mutable_header());
                         }
                         objectsread++;
                         break;

                 case 4:
                         aMsg.ParseFromCodedStream(coded_input);
                         if(aMsg.has_header())
                         {
                                 //hMsg = aMsg.mutable_header();
                                 hMsg = new HeaderMessage();
                                 hMsg->Clear();
                                 hMsg->Swap(aMsg.mutable_header());
                         }
                         objectsread++;
                         break;

[protobuf] Re: suggestions on improving the performance?

2012-01-15 Thread alok
Here is the link to a forum post which states why I have to set the limit.

http://markmail.org/message/km7mlmj46jgfs3rx#query:+page:1+mid:5f7q3wj2htwajjof+state:results

excerpt from the link

The problem is that CodedInputStream has internal counter of how many
bytes are read so far with the same object.

In my case, there are a lot of small messages saved in the same file.
I do not read them at once and therefore do not care about large
messages, limits. I am safe.

So, the problem can be easily solved by calling:

CodedInputStream input_stream(...);
input_stream.SetTotalBytesLimit(1e9, 9e8);

My use-case is really about storing extremely large number (up to 1e9)
of small messages ~ 10K each. 


My problem is the same as above, so I will have to set the limits on
the coded input object.

Regards,
Alok
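
The limit-reset pattern described above can be sketched as follows (a sketch only, untested; it uses the protobuf 2.x two-argument SetTotalBytesLimit signature, and the record-parsing logic from the thread is elided):

```cpp
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>

using namespace google::protobuf::io;

// A CodedInputStream is cheap to construct, so its cumulative byte counter
// can be reset by making a short-lived instance per batch of records on top
// of the long-lived, buffered ZeroCopyInputStream.
bool ReadOneBatch(ZeroCopyInputStream *raw_input) {
  CodedInputStream coded_input(raw_input);
  // Raise the hard limit to 1 GB, warning threshold 900 MB, instead of the
  // 64 MB default that triggers "message is too large" failures.
  coded_input.SetTotalBytesLimit(1000000000, 900000000);
  // ... read length-prefixed records with coded_input here ...
  return true;  // placeholder: report whether more data remains
}
```

The underlying FileInputStream keeps its buffer and position across these short-lived instances, which avoids the per-message construction Daniel warned about while still sidestepping the default total-bytes limit.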


On Jan 16, 10:26 am, alok alok.jad...@gmail.com wrote:
 I was actually doing that initially, but I kept getting an error,
 "Maximum length for a message is reached" (I don't have the exact
 error string at the moment). This was because my input binary file is
 large and it reaches the limit for coded input very fast.

 I saw a post on the forum (or maybe on Stack Exchange) which suggested
 that I should create a new coded_input object for each message. I have
 to reset the limits for the coded input object. A user on that thread
 suggested that it's easy to create and destroy the coded_input object.
 These objects are not big.

 Anyway, I will try it again by resetting the limits on this object.
 But then, would this be causing the slowness? I will try and let you
 know the results.

 Regards,
 Alok

 On Jan 16, 9:46 am, Daniel Wright dwri...@google.com wrote:

  You're making a new CodedInputStream for each message -- I think that gives
  very poor buffering behavior.  You should just pass coded_input to
  ReadAllMessages and keep reusing it.

  Cheers
  Daniel

  On Sun, Jan 15, 2012 at 4:41 PM, alok alok.jad...@gmail.com wrote:
   Daniel,

    I am hoping that my code is incorrect, but I am not sure what is
    wrong or what is really causing this slowness.

    @Henner Zeller: sorry, I forgot to include the object length in the
    above example. I do store the object length for each object. I
    don't have issues reading all the objects; the code is working
    fine. I just want to make the code run faster now.

   attaching my code here...

   File format is

   File header
   Record1, Record2, Record3

    Each record contains n objects of the types defined in the proto
    file. The 1st object carries a header which contains the number of
    objects in the record.

   code
   proto file

   message HeaderMessage {
          required double timestamp = 1;
    required string ric_code = 2;
    required int32 count = 3;
    required int32 total_message_size = 4;
   }

   message QuoteMessage {
          enum Side {
      ASK = 0;
      BID = 1;
    }
    required Side type = 1;
          required int32 level = 2;
          optional double price = 3;
          optional int64 size = 4;
          optional int32 count = 5;
          optional HeaderMessage header = 6;
   }

   message CustomMessage {
          required string field_name = 1;
          required double value = 2;
          optional HeaderMessage header = 3;
   }

   message TradeMessage {
          optional double price = 1;
          optional int64 size = 2;
          optional int64 AccumulatedVolume = 3;
          optional HeaderMessage header = 4;
   }

   message AlphaMessage {
          required int32 level = 1;
          required double alpha = 2;
          optional double stddev = 3;
           optional HeaderMessage header = 4;
   }

   /code

   code
   Reading records from binary file

    bool ReadNextRecord(CodedInputStream *coded_input,
    stdext::hash_set<std::string> instruments)
    {
           uint32 count, objtype, objlen;
           int i;

           int objectsread = 0;
           HeaderMessage *hMsg = NULL;
           TradeMessage tMsg;
           QuoteMessage qMsg;
           CustomMessage cMsg;
           AlphaMessage aMsg;

           while(1)
           {
                   if(!coded_input->ReadLittleEndian32(&objtype)) {
                           return false;
                   }
                   if(!coded_input->ReadLittleEndian32(&objlen)) {
                           return false;
                   }
                   CodedInputStream::Limit lim =
    coded_input->PushLimit(objlen);

                   switch(objtype)
                   {
                   case 2:
                           qMsg.ParseFromCodedStream(coded_input);
                           if(qMsg.has_header())
                           {
                                   //hMsg =
                                   hMsg = new HeaderMessage();
                                   hMsg->Clear();
                                   hMsg->Swap(qMsg.mutable_header());
                           }
                           objectsread++;
                           break;

                   case 3:

[protobuf] Re: suggestions on improving the performance?

2012-01-15 Thread alok
google groups link
http://groups.google.com/group/protobuf/browse_thread/thread/64a07911e3c90cd5

I tested the code with reusing the coded input object. Not much change
in the speed performance.

void ReadAllMessages(ZeroCopyInputStream *raw_input,
stdext::hash_set<std::string> instruments)
{
        int item_count = 0;

        CodedInputStream* in = new CodedInputStream(raw_input);
        in->SetTotalBytesLimit(1e9, 9e8);
        while(1)
        {
                if(item_count % 20 == 0){
                        delete in;
                        in = new CodedInputStream(raw_input);
                        in->SetTotalBytesLimit(1e9, 9e8);
                }
                if(!ReadNextRecord(in, instruments))
                        break;
                item_count++;
        }
        cout << "Finished reading file. Total " << item_count
             << " items read." << endl;
}

I recreate the coded input object after every 200k objects; there are
around 650k objects in the file in total.

I have a feeling this slowness is because of my binary file format. Is
there anything I can change so that I can read it faster? For example,
removing optional fields and keeping the format as raw as possible.

regards,
Alok

On Jan 16, 10:40 am, alok alok.jad...@gmail.com wrote:
 Here is the link to a forum post which states why I have to set the limit.

 http://markmail.org/message/km7mlmj46jgfs3rx#query:+page:1+mid:5f7q3w...

 excerpt from the link

 The problem is that CodedInputStream has internal counter of how many
 bytes are read so far with the same object.

 In my case, there are a lot of small messages saved in the same file.
 I do not read them at once and therefore do not care about large
 messages, limits. I am safe.

 So, the problem can be easily solved by calling:

 CodedInputStream input_stream(...);
 input_stream.SetTotalBytesLimit(1e9, 9e8);

 My use-case is really about storing extremely large number (up to 1e9)
 of small messages ~ 10K each. 

 My problem is the same as above, so I will have to set the limits on
 the coded input object.

 Regards,
 Alok

 On Jan 16, 10:26 am, alok alok.jad...@gmail.com wrote:

  I was actually doing that initially, but I kept getting an error,
  "Maximum length for a message is reached" (I don't have the exact
  error string at the moment). This was because my input binary file is
  large and it reaches the limit for coded input very fast.

  I saw a post on the forum (or maybe on Stack Exchange) which
  suggested that I should create a new coded_input object for each
  message. I have to reset the limits for the coded input object. A
  user on that thread suggested that it's easy to create and destroy
  the coded_input object. These objects are not big.

  Anyway, I will try it again by resetting the limits on this object.
  But then, would this be causing the slowness? I will try and let you
  know the results.

  Regards,
  Alok

  On Jan 16, 9:46 am, Daniel Wright dwri...@google.com wrote:

   You're making a new CodedInputStream for each message -- I think that 
   gives
   very poor buffering behavior.  You should just pass coded_input to
   ReadAllMessages and keep reusing it.

   Cheers
   Daniel

   On Sun, Jan 15, 2012 at 4:41 PM, alok alok.jad...@gmail.com wrote:
Daniel,

I am hoping that my code is incorrect, but I am not sure what is wrong
or what is really causing this slowness.

@Henner Zeller: sorry, I forgot to include the object length in the
above example. I do store the object length for each object. I don't
have issues reading all the objects; the code is working fine. I just
want to make the code run faster now.

attaching my code here...

File format is

File header
Record1, Record2, Record3

Each record contains n objects of the types defined in the proto file.
The 1st object carries a header which contains the number of objects
in the record.

code
proto file

message HeaderMessage {
       required double timestamp = 1;
 required string ric_code = 2;
 required int32 count = 3;
 required int32 total_message_size = 4;
}

message QuoteMessage {
       enum Side {
   ASK = 0;
   BID = 1;
 }
 required Side type = 1;
       required int32 level = 2;
       optional double price = 3;
       optional int64 size = 4;
       optional int32 count = 5;
       optional HeaderMessage header = 6;
}

message CustomMessage {
       required string field_name = 1;
       required double value = 2;
       optional HeaderMessage header = 3;
}

message TradeMessage {
       optional double price = 1;
       optional int64 size = 2;
       optional int64 AccumulatedVolume = 3;
       optional HeaderMessage header = 4;
}

message AlphaMessage {
       required int32 level = 1;
       required double alpha = 2;
       optional double stddev = 3;
        optional HeaderMessage header = 4;
}

/code

code

[protobuf] Re: suggestions on improving the performance?

2012-01-13 Thread alok
Any suggestions? Experiences?

regards,
Alok

On Jan 11, 1:16 pm, alok alok.jad...@gmail.com wrote:
 My point is: should I have one message, something like

 Message Record{
   required HeaderMessage header;
   optional TradeMessage trade;
   repeated QuoteMessage quotes; // 0 or more
   repeated CustomMessage customs; // 0 or more
 }

 or rather should I keep my file plain, as
 object type, object, object type, object
 without worrying about the concept of a record?

 Each message in the file is usually a header + any 1 type of message
 (trade, quote or custom), and mostly only 1 quote or custom message,
 not more.

 What would be faster to decode?

 Regards,
 Alok

 On Jan 11, 12:41 pm, alok alok.jad...@gmail.com wrote:

  Hi everyone,

  My program is taking more time to read binary files than text
  files. I think the issue is with the structure of the binary files
  that I have designed. (Or could it be that binary decoding is
  slower than text file parsing?)

  The data file is a large text file with 1 record per row, up to
  1.2 GB. The binary file is around 900 MB.

  **
   - Text file reading takes 3 minutes to read the file.
   - Binary file reading takes 5 minutes.

  I saw a very strange behavior.
   - Just to see how long it takes to skim through the binary file, I
  started reading the header on each message, which holds the length
  of the message, and then skipped that many bytes using the Skip()
  function of the coded_input object. After making this change, I was
  expecting that reading through the file would take less time, but it
  took more than 10 minutes. Is skipping not the same as advancing the
  file pointer by n bytes? Is it slower to skip an object than to read
  it?

  Are there any guidelines on how the structure should be designed to
  get the best performance?

  My current structure looks as below

  message HeaderMessage {
    required double timestamp = 1;
    required string ric_code = 2;
    required int32 count = 3;
    required int32 total_message_size = 4;

  }

  message QuoteMessage {
          enum Side {
      ASK = 0;
      BID = 1;
    }
    required Side type = 1;
          required int32 level = 2;
          optional double price = 3;
          optional int64 size = 4;
          optional int32 count = 5;
          optional HeaderMessage header = 6;

  }

  message CustomMessage {
          required string field_name = 1;
          required double value = 2;
          optional HeaderMessage header = 3;

  }

  message TradeMessage {
          optional double price = 1;
          optional int64 size = 2;
          optional int64 AccumulatedVolume = 3;
          optional HeaderMessage header = 4;

  }

  Binary file format is
  object type, object, object type object ...

  The 1st object of a record holds a header with n, the number of
  objects in that record. The next n-1 objects do not hold a header,
  since they all belong to the same record (same update time). The
  (n+1)th object belongs to the new record and holds the header for
  the next record.

  Any advice?

  Regards,
  Alok




[protobuf] suggestions on improving the performance?

2012-01-10 Thread alok
Hi everyone,

My program is taking more time to read binary files than text
files. I think the issue is with the structure of the binary files
that I have designed. (Or could it be that binary decoding is
slower than text file parsing?)

The data file is a large text file with 1 record per row, up to 1.2 GB.
The binary file is around 900 MB.

**
 - Text file reading takes 3 minutes to read the file.
 - Binary file reading takes 5 minutes.

I saw a very strange behavior.
 - Just to see how long it takes to skim through the binary file, I
started reading the header on each message, which holds the length of
the message, and then skipped that many bytes using the Skip() function
of the coded_input object. After making this change, I was expecting
that reading through the file would take less time, but it took more
than 10 minutes. Is skipping not the same as advancing the file pointer
by n bytes? Is it slower to skip an object than to read it?

Are there any guidelines on how the structure should be designed to
get the best performance?
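
For comparison, the skim loop described above would typically be written with a single long-lived CodedInputStream, using the stored length to Skip() each payload (a sketch, untested; within the buffer, Skip() largely just advances the buffer pointer, so if even this is slow the cost likely lies in stream construction or buffering rather than in Skip() itself):

```cpp
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>

using namespace google::protobuf::io;

// Count objects by reading only each little-endian type+length header and
// skipping the payload, without parsing any message.
int CountObjects(ZeroCopyInputStream *raw_input) {
  CodedInputStream coded_input(raw_input);
  coded_input.SetTotalBytesLimit(1000000000, 900000000);
  google::protobuf::uint32 objtype = 0, objlen = 0;
  int n = 0;
  while (coded_input.ReadLittleEndian32(&objtype) &&
         coded_input.ReadLittleEndian32(&objlen)) {
    if (!coded_input.Skip(objlen))  // advance past the payload
      break;                        // truncated object
    ++n;
  }
  return n;
}
```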

My current structure looks as below

message HeaderMessage {
  required double timestamp = 1;
  required string ric_code = 2;
  required int32 count = 3;
  required int32 total_message_size = 4;
}

message QuoteMessage {
  enum Side {
    ASK = 0;
    BID = 1;
  }
  required Side type = 1;
  required int32 level = 2;
  optional double price = 3;
  optional int64 size = 4;
  optional int32 count = 5;
  optional HeaderMessage header = 6;
}

message CustomMessage {
  required string field_name = 1;
  required double value = 2;
  optional HeaderMessage header = 3;
}

message TradeMessage {
  optional double price = 1;
  optional int64 size = 2;
  optional int64 AccumulatedVolume = 3;
  optional HeaderMessage header = 4;
}


Binary file format is
object type, object, object type object ...

The 1st object of a record holds a header with n, the number of
objects in that record. The next n-1 objects do not hold a header,
since they all belong to the same record (same update time). The
(n+1)th object belongs to the new record and holds the header for the
next record.

Any advice?

Regards,
Alok




[protobuf] Re: suggestions on improving the performance?

2012-01-10 Thread alok
my point is ..should i have one message something like

Message Record{
  required HeaderMessage header;
  optional TradeMessage trade;
  repeated QuoteMessage quotes; // 0 or more
  repeated CustomMessage customs; // 0 or more
}

or rather, should I keep my file plain as
object type, object, object type, object
without worrying about the concept of a record?

Each message in the file is usually a header + any 1 type of message (trade,
quote or custom) ... and mostly only 1 quote or custom message, not
more.

What would be faster to decode?

Regards,
Alok


On Jan 11, 12:41 pm, alok <alok.jad...@gmail.com> wrote:
 Hi everyone,

 My program is taking more time to read binary files than the text
 files. I think the issue is with the structure of the binary files
 that i have designed. (Or could it be possible that binary decoding is
 slower than text files parsing? ).

 Data file is a large text file with 1 record per row. upto 1.2 GB.
 Binary file is around 900 MB.

 **
  - Text file reading takes 3 minutes to read the file.
  - Binary file reading takes 5 minutes.

 I saw a very strange behavior.
  - Just to see how long it takes to skim through binary file, i
 started reading header on each message which holds the length of the
 message and then skipped that many bytes using the Skip() function of
 coded_input object. After making this change, i was expecting that
 reading through file should take less time, but it took more than 10
 minutes. Is skipping not same as adding n bytes to the file pointer?
 is it slower to skip the object than read it?

 Are there any guidelines on how the structure should be designed to
 get the best performance?

 My current structure looks as below

 message HeaderMessage {
   required double timestamp = 1;
   required string ric_code = 2;
   required int32 count = 3;
   required int32 total_message_size = 4;

 }

 message QuoteMessage {
         enum Side {
     ASK = 0;
     BID = 1;
   }
   required Side type = 1;
         required int32 level = 2;
         optional double price = 3;
         optional int64 size = 4;
         optional int32 count = 5;
         optional HeaderMessage header = 6;

 }

 message CustomMessage {
         required string field_name = 1;
         required double value = 2;
         optional HeaderMessage header = 3;

 }

 message TradeMessage {
         optional double price = 1;
         optional int64 size = 2;
         optional int64 AccumulatedVolume = 3;
         optional HeaderMessage header = 4;

 }

 Binary file format is
 object type, object, object type object ...

 1st object of a record holds header with n number of objects in that
 record. next n-1 objects will not hold header since they all belong to
 same record (same update time).
 now n+1th object belongs to the new record and it will hold header for
 next record.

 Any advice?

 Regards,
 Alok




[protobuf] nested message not read properly. (_has_bits is not set?)

2011-12-21 Thread alok

Hi,

 I have a nested message that I am trying to read using protocol
buffers. The message looks as below:

<code>

message HeaderMessage {
  required double timestamp = 1;
  required string ric_code = 2;
  required int32 count = 3;
  required int32 total_message_size = 4;
}

message CustomMessage {
  required string field_name = 1;
  required double value = 2;
  optional HeaderMessage header = 3;
}

</code>

Here, I am setting the header field in the CustomMessage and writing
the custom message to a binary output file. I know for sure that the
message is written properly to the binary file, because I am able to
retrieve it properly using the C# library (protobuf-net) to read the
binary file.

I am trying to read the same file using the C++ protocol buffers library.
But when I read the object from the coded stream,

cMsg.ParseFromCodedStream(coded_input);

I see that the header is not set in cMsg.

I looked inside the protocol buffers library. While reading the object, it
checks whether the header is set using the following function:

inline bool CustomMessage::has_header() const {
  return (_has_bits_[0] & 0x0004u) != 0;
}

The above function returns false and the header object is not read.


When I write the object to the binary file, the value of _has_bits is
0x00fb6ff8, but when I read the custom message back from the binary
file, the value of _has_bits is unchanged before and after
reading the object. This value is 0x0012fbcc.  For this value,
has_header() returns false.

So when I call the ParseFromCodedStream function, the _has_bits value is not
set properly, causing a problem in reading the header object.

What am I doing wrong in this case? How to resolve this issue?

Thanks for your help.

-Alok




[protobuf] Re: nested message not read properly. (_has_bits is not set?)

2011-12-21 Thread alok
Apologies for the incorrect values of _has_bits in my previous message.
Those were the addresses of the variable _has_bits.

But the behavior of _has_bits is the same. When I initialize the
header while writing, using cMsg.mutable_header(), it sets the 3rd bit
in _has_bits and the value of _has_bits is 7 ( = 0b0111). But when I
read cMsg in my reader program, the value of _has_bits is 3 ( =
0b0011); the 3rd bit is not set, and hence the header message is not read.

I am looking further into the code to understand why this bit is not
set.

I would appreciate it if you could tell me what I am doing wrong in this case.

Regards,
Alok





[protobuf] Re: nested message not read properly. (_has_bits is not set?)

2011-12-21 Thread alok
On further investigation, looks like the issue could be due to the
limits.

I have the following code:

<code>
lim = coded_input->PushLimit(objlen);
cMsg.ParseFromCodedStream(coded_input);
coded_input->PopLimit(lim);
cmsgsize = cMsg.ByteSize();
</code>

Above, the value of objlen is 44.
The length of cMsg should have been 44 bytes (including the header
message), but when I check cMsg.ByteSize(), it returns the value 20.
It is reading only 20 bytes instead of 44 bytes.

Inside the PushLimit() function,

buffer_end_ += buffer_size_after_limit_;

the value of buffer_size_after_limit_ is 0, and buffer_end_ ends after only
20 bytes, reading a partial message.

Please help me investigate this issue further.

Regards,
Alok




[protobuf] Re: nested message not read properly. (_has_bits is not set?)

2011-12-21 Thread alok
The issue is resolved now.

Made the same mistake which I did in the past: I forgot to open the file
in binary mode. It encountered an early EOF because of text mode. Now
it's working fine.

thanks,
Alok




[protobuf] Incorrect encoding of protocol buffer message

2011-12-15 Thread alok

Hi,

 I am facing a strange issue when I write a binary file using the protocol
buffers library. I had a hard time reading the generated binary file in
my C# program. I would find an incorrect byte and unexpectedly
encounter an end-of-file byte in the middle of the file. But the
interesting thing is that my C++ reader program was able to read the file
properly while the C# program errors out. So I am very confused.

After investigating the binary file further, we saw that it
had some incorrect bytes inserted while encoding. (Either it is
incorrect, or maybe I am missing something.)

I can share all my C++ reader / C# reader programs and the binary file
to resolve this issue. We suspect that there is a bug in the library
which is writing the data incorrectly.

Below are the findings from the investigation of the data.

* I am displaying only the information required to
understand the issue. I am printing the actual message and not the
length and other header bytes associated with it.

<code>
message TradeMessage {
  required double timestamp = 1;
  required string ric_code = 2;
  required double price = 3;
  required int64 size = 4;
  required int64 AccumulatedVolume = 5;
}
</code>

Some of the objects read from the binary file look as below

(object 1 - good)
09 06 81 95 43 c3 27 dc 40 12 07 30 30 32 35 2e 48 4b 19 00 00 00 00
00 00 20 40 20 00 28 00
(object 2 - good)
09 25 06 81 95 c3 27 dc 40 12 07 30 30 32 34 2e 48 4b 19 00 00 00 00
00 00 00 00 20 00 28 00
(object 3 - corrupt)
09 71 3d 0d 0a d7 c3 27 dc 40 12 07 30 30 32 33 2e 48 4b 19 00 00 00
00 00 00 3b 40 20 00 28

If you look at the 3 objects above, each object starts with field 1, which
is the timestamp, a double. It is encoded as 09, i.e. field 1, wire-type
1 (i.e. 64-bit), so the next 8 bytes represent the timestamp value.

If you carefully look at the first 10 bytes of each object, you will see
that object 1 and object 2 encode field 1 properly, but for object 3,
the actual payload of field 1 starts from byte #3. The header for this field
is (09 71), 2 bytes (either 2 bytes of header or 9 bytes of payload).
But in either case, one extra byte is written to my binary file.

Why is this happening? How does the C++ reader understand how to decode this
data? Is this correct, or is there a bug involved here?

Please advise.

Regards,
Alok





[protobuf] how to flush coded stream buffer?

2011-12-15 Thread alok
Hi,

 I am working on a very large file generated using protocol buffers.
At a certain point during file creation, my app raises a no
memory exception. The binary file size at this time is around 700 MB. I
suppose if I could clear out the coded stream's buffer, this issue
could be resolved.
Is my suggestion correct for solving this issue? How can we resolve
it? I tried to flush using the file descriptors but it didn't
work.

Regards,
Alok




[protobuf] How to check end of file stream?

2011-12-14 Thread alok
Hi,

I am using the following code while reading through the file stream:

while (!coded_input->ExpectAtEnd())
{
    coded_input->ReadLittleEndian32(&count);
    for (i = 0; i < count; i++)
    {
        coded_input->ReadLittleEndian32(&objtype);
        coded_input->ReadLittleEndian32(&objlen);
        cout << "Item = " << ++item_count << " type = " << objtype
             << " length = " << objlen << endl;
        coded_input->Skip(objlen);
    }
}

The above code runs in an infinite loop. My file has only 14 objects. The file
length is only 627 bytes. But ExpectAtEnd() never returns true in the
above example.

What is the right way to check end of file?

Regards,
Alok




[protobuf] Why protocol buffer c++ library not reading binary objects properly?

2011-12-08 Thread alok
I created a binary file with a C++ program using protocol buffers. I
had issues reading the binary file in my C# program, so I decided to
write a small C++ program to test the reading.

My proto file is as follows

message TradeMessage {
  required double timestamp = 1;
  required string ric_code = 2;
  required double price = 3;
  required int64 size = 4;
  required int64 AccumulatedVolume = 5;
}

When writing to the protocol buffer stream, I first write the object type,
then the object length, and then the object itself:

coded_output->WriteLittleEndian32((int) ObjectType_Trade);
coded_output->WriteLittleEndian32(trade.ByteSize());
trade.SerializeToCodedStream(coded_output);

Now, when I try to read the same file in my C++ program, I see
strange behavior.

My reading code is as follows:

coded_input->ReadLittleEndian32(&objtype);
coded_input->ReadLittleEndian32(&objlen);
tMsg.ParseFromCodedStream(coded_input);
cout << "Expected Size = " << objlen << endl;
cout << "Trade message received for: " << tMsg.ric_code() << endl;
cout << "TradeMessage Size = " << tMsg.ByteSize() << endl;

In this case, I get the following output:

Expected Size = 33
Trade message received for: .CSAP0104
TradeMessage Size = 42

When I write to the file, I write trade.ByteSize() as 33 bytes, but when I
read the same object back, the object's ByteSize() is 42 bytes, i.e. it is
trying to read 42 bytes when it should be trying to read only 33 bytes.
This affects the rest of the file reading. I am not sure what is wrong
here. Please advise.

Just to double check, I compared the protocol-buffer-generated files
in my reader and writer projects. The generated files are identical.
So I guess the file encoding is different for some reason; I do not
understand why.


Regards,
Alok





Re: [protobuf] Maps in protobuf / 2

2011-06-16 Thread Alok Singh
Haven't found a direct way to create a map, but we use the following to 
serialize map-like data structures.

-
message KeyValue {
  required string key = 1;
  required string value = 2;
}

message Map {
  repeated KeyValue items = 1;
}

message Foo {
  required string id = 1;
  optional Map map = 2;
}


Alok

On 06/16/2011 01:26 PM, Marco Mistroni wrote:

HI all
 sorry i hijacked a previous thread ..
Is it possible to define Maps in protobuf?

i have some serverside code which returns a Map<String, Double>, and i 
was wondering if there was a way in protobuf

to define a Map

could anyone help ?

w/kindest regards
 marco

