Re: Protocol buffers and EBCDIC

2009-05-22 Thread David Crayford

Thank you for such an exhaustive reply. I compile on an EBCDIC platform, 
a z/OS mainframe. The IBM C++ compiler has an ASCII option to support 
ASCII (UTF-8?). My use case is to write a server that creates a protocol 
buffer in EBCDIC, translates it to ASCII, and then sends it to a Java 
client on a PC to process. We use XML right now and I want to trim the fat!

Kenton Varda wrote:
> On what platform do you *compile* code?  I'm hoping you compile on an 
> ASCII platform and only need the code to run under EBCDIC.  If that's 
> the case, then you do not need to make protoc itself run on EBCDIC, 
> only the code it generates.
>
> tokenizer.cc is only used by protoc (to parse proto files) and 
> TextFormat.  If you build on an ASCII machine and don't need to parse 
> TextFormat then you can ignore tokenizer.cc.  This makes the problem a 
> lot simpler.  I think all you have to do is fix this code:
>
> http://code.google.com/p/protobuf/source/browse/tags/2.1.0/src/google/protobuf/compiler/cpp/cpp_file.cc#445
>
> The code takes the FileDescriptorProto for the proto file, encodes it, 
> and embeds the bytes directly into the generated code as a giant 
> string literal.  Since these bytes are actually binary data, you need 
> to make sure no character set conversion occurs on them.  I'm guessing 
> you could do that by escaping every character as an octal sequence, 
> rather than calling CEscape(), which only escapes unprintable characters 
> and quotes.
>
> Unfortunately, without TextFormat, many of the unit tests won't work.
>
> ** alternative strategy **
>
> Note that tokenizer.cc does not use ctype.h because its behavior 
> differs in different locales.  If we used ctype.h, then text which is 
> parseable when the locale is English might not be parseable when the 
> locale is, say, Russian.  This is confusing, because in both cases the 
> text being parsed is its own language and should have nothing to do 
> with the user's spoken language.
>
> So what if we extend the same reasoning to character sets?  Let's say 
> that .proto files and TextFormat are ASCII formats and are not defined 
> for any other character set.  If you want to parse text that is in 
> some other character set, you must convert it to ASCII first, then 
> feed it to the tokenizer.  One could easily write a 
> ZeroCopyInputStream implementation which accomplishes this conversion.
>
> So now the problem is -- conceptually, at least -- much easier.  What 
> you want to do is set your C++ compiler so that all string and 
> character literals in the protobuf code are kept in ASCII format, not 
> converted to EBCDIC.  Thus, all the code operates on ASCII, even when 
> running on a system where EBCDIC is standard, so code like ('a' <= c 
> && c <= 'z') will still work, so long as c is an ASCII character. 
>  Note that you would only have to compile the protobuf library and 
> generated .pb.cc files this way -- you can compile your own code using 
> normal settings, since the protobuf headers do not contain any string 
> or character literals (I think).
>
> Now all you have to do is transform text on its way into or out of the 
> protobuf library.  Fortunately you will use the library primarily on 
> binary input and output, so not much transformation will be needed. 
>  The only cases I can think of are when using TextFormat (which is an 
> entirely optional feature, mostly for debugging) and error messages. 
>  You can use SetLogHandler() to catch error messages and convert them:
>
> http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.common.html#SetLogHandler.details
>
> With this strategy you should be able to run all the unit tests -- so 
> long as you also compile them with ASCII literals.
>
> ==
>
> Note that regardless of which solution you choose above, the "string" 
> field type in the .proto language is only allowed to contain UTF-8 
> text.  If you intend to store EBCDIC text in your protocol messages, 
> you need to use the "bytes" type instead.  Or, alternatively, you can 
> transform character sets whenever you read/write a "string" field -- 
> this approach would be more tedious but would make it easier for ASCII 
> systems to read/write the same message format.
>
> On Tue, May 19, 2009 at 8:46 PM, David Crayford wrote:
>
> David Crayford wrote:
>
> http://en.wikipedia.org/wiki/EBCDIC_1047
>
> Note the non-contiguous letter sequences. EBCDIC sucks, but it
> is used on enterprise machines and still holds the majority of
> the world's production data. Protocol Buffers seem like a
> fantastic alternative to XML, which is proving to be a huge
> bottleneck for mainframe applications trying to web-enable
> legacy applications.
>
> What modules/classes contain the ASCII specific code?
>
>
> ok, it looks like the problem is tokenizer.cc - the character
> classes ain't gonna cut it for EBCDIC. I think I could patch that
> by using cctype, but there are some comments in there warning
> against its use?

Re: Protocol buffers and EBCDIC

2009-05-20 Thread Kenton Varda
On what platform do you *compile* code?  I'm hoping you compile on an ASCII
platform and only need the code to run under EBCDIC.  If that's the case,
then you do not need to make protoc itself run on EBCDIC, only the code it
generates.

tokenizer.cc is only used by protoc (to parse proto files) and TextFormat.
 If you build on an ASCII machine and don't need to parse TextFormat then
you can ignore tokenizer.cc.  This makes the problem a lot simpler.  I think
all you have to do is fix this code:

http://code.google.com/p/protobuf/source/browse/tags/2.1.0/src/google/protobuf/compiler/cpp/cpp_file.cc#445

The code takes the FileDescriptorProto for the proto file, encodes it, and
embeds the bytes directly into the generated code as a giant string literal.
 Since these bytes are actually binary data, you need to make sure no
character set conversion occurs on them.  I'm guessing you could do that by
escaping every character as an octal sequence, rather than calling CEscape(),
which only escapes unprintable characters and quotes.
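
Something like this could work (a sketch only; the helper name is made
up, not actual protobuf source):

  #include <string>

  // Escape *every* byte as a three-digit octal sequence.  Unlike
  // CEscape(), nothing survives verbatim, so no execution-charset
  // conversion can corrupt the embedded descriptor bytes.
  std::string EscapeAllBytesAsOctal(const std::string& data) {
    std::string result;
    result.reserve(data.size() * 4);  // each byte becomes "\nnn"
    for (std::string::size_type i = 0; i < data.size(); ++i) {
      unsigned char c = static_cast<unsigned char>(data[i]);
      result += '\\';
      result += static_cast<char>('0' + ((c >> 6) & 7));
      result += static_cast<char>('0' + ((c >> 3) & 7));
      result += static_cast<char>('0' + (c & 7));
    }
    return result;
  }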

Unfortunately, without TextFormat, many of the unit tests won't work.

** alternative strategy **

Note that tokenizer.cc does not use ctype.h because its behavior differs in
different locales.  If we used ctype.h, then text which is parseable when
the locale is English might not be parseable when the locale is, say,
Russian.  This is confusing, because in both cases the text being parsed is
its own language and should have nothing to do with the user's spoken
language.
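
For illustration, the locale-independent classes look roughly like this
(a sketch in the spirit of tokenizer.cc, not the actual source; assumes
ASCII-valued literals):

  // True only for ASCII letters, independent of the user's locale --
  // unlike isalpha(), whose answer changes with setlocale().
  inline bool IsAsciiLetter(char c) {
    return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z');
  }
  inline bool IsAsciiDigit(char c) {
    return '0' <= c && c <= '9';
  }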

So what if we extend the same reasoning to character sets?  Let's say that
.proto files and TextFormat are ASCII formats and are not defined for any
other character set.  If you want to parse text that is in some other
character set, you must convert it to ASCII first, then feed it to the
tokenizer.  One could easily write a ZeroCopyInputStream implementation
which accomplishes this conversion.
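
A minimal sketch of such a stream (kEbcdicToAscii is an assumed
256-entry IBM-1047 -> ASCII table you would supply; this class is not
part of protobuf):

  #include <google/protobuf/io/zero_copy_stream.h>

  extern const unsigned char kEbcdicToAscii[256];

  // Wraps another ZeroCopyInputStream and converts each chunk from
  // EBCDIC to ASCII via a table lookup before the tokenizer sees it.
  class EbcdicToAsciiInputStream
      : public google::protobuf::io::ZeroCopyInputStream {
   public:
    explicit EbcdicToAsciiInputStream(
        google::protobuf::io::ZeroCopyInputStream* input)
        : input_(input), buffer_(NULL), buffer_size_(0) {}
    virtual ~EbcdicToAsciiInputStream() { delete[] buffer_; }

    virtual bool Next(const void** data, int* size) {
      const void* raw;
      if (!input_->Next(&raw, size)) return false;
      if (*size > buffer_size_) {  // grow our private buffer as needed
        delete[] buffer_;
        buffer_ = new unsigned char[*size];
        buffer_size_ = *size;
      }
      const unsigned char* src = static_cast<const unsigned char*>(raw);
      for (int i = 0; i < *size; i++) buffer_[i] = kEbcdicToAscii[src[i]];
      *data = buffer_;
      return true;
    }
    // Our chunks map 1:1 onto the wrapped stream's, so just delegate.
    virtual void BackUp(int count) { input_->BackUp(count); }
    virtual bool Skip(int count)   { return input_->Skip(count); }
    virtual google::protobuf::int64 ByteCount() const {
      return input_->ByteCount();
    }

   private:
    google::protobuf::io::ZeroCopyInputStream* input_;
    unsigned char* buffer_;
    int buffer_size_;
  };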

So now the problem is -- conceptually, at least -- much easier.  What you
want to do is set your C++ compiler so that all string and character
literals in the protobuf code are kept in ASCII format, not converted to
EBCDIC.  Thus, all the code operates on ASCII, even when running on a system
where EBCDIC is standard, so code like ('a' <= c && c <= 'z') will still
work, so long as c is an ASCII character.  Note that you would only have to
compile the protobuf library and generated .pb.cc files this way -- you can
compile your own code using normal settings, since the protobuf headers do
not contain any string or character literals (I think).
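
On z/OS I believe the relevant knob is the XL C/C++ CONVLIT compile
option, or per-file control with #pragma convert -- a hedged sketch,
assuming the compiler supports it:

  /* Assumption: building with CONVLIT(ISO8859-1), or wrapping a file
     like this, keeps string and character literals in an ASCII code
     page instead of converting them to EBCDIC. */
  #pragma convert("ISO8859-1")
  static const char kLower[] = "abcdefghijklmnopqrstuvwxyz";  /* ASCII bytes */
  #pragma convert(pop)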

Now all you have to do is transform text on its way into or out of the
protobuf library.  Fortunately you will use the library primarily on binary
input and output, so not much transformation will be needed.  The only cases
I can think of are when using TextFormat (which is an entirely optional
feature, mostly for debugging) and error messages.  You can use
SetLogHandler() to catch error messages and convert them:

http://code.google.com/apis/protocolbuffers/docs/reference/cpp/google.protobuf.common.html#SetLogHandler.details
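
For example (AsciiToEbcdic is a hypothetical helper you would provide,
e.g. a 256-entry table walk; sketch only):

  #include <stdio.h>
  #include <string>
  #include <google/protobuf/stubs/common.h>

  std::string AsciiToEbcdic(const std::string& ascii);  // your helper

  // Matches protobuf's LogHandler typedef; converts the ASCII message
  // text to EBCDIC before printing it.
  void EbcdicLogHandler(google::protobuf::LogLevel level,
                        const char* filename, int line,
                        const std::string& message) {
    (void)level;  // severity ignored in this sketch
    fprintf(stderr, "[libprotobuf %s:%d] %s\n",
            AsciiToEbcdic(filename).c_str(), line,
            AsciiToEbcdic(message).c_str());
  }

  // Install once at startup:
  //   google::protobuf::SetLogHandler(&EbcdicLogHandler);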

With this strategy you should be able to run all the unit tests -- so long
as you also compile them with ASCII literals.

==

Note that regardless of which solution you choose above, the "string" field
type in the .proto language is only allowed to contain UTF-8 text.  If you
intend to store EBCDIC text in your protocol messages, you need to use the
"bytes" type instead.  Or, alternatively, you can transform character sets
whenever you read/write a "string" field -- this approach would be more
tedious but would make it easier for ASCII systems to read/write the same
message format.
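
For instance (illustrative message and field names, proto2 syntax):

  // EBCDIC text has to travel as "bytes"; "string" is reserved for UTF-8.
  message CustomerRecord {
    optional bytes  ebcdic_name = 1;  // raw EBCDIC; applications convert
    optional string utf8_name   = 2;  // must already be UTF-8 text
  }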

On Tue, May 19, 2009 at 8:46 PM, David Crayford wrote:

> David Crayford wrote:
>
>> http://en.wikipedia.org/wiki/EBCDIC_1047
>>
>> Note the non-contiguous letter sequences. EBCDIC sucks, but it is used on
>> enterprise machines and still holds the majority of the world's
>> production data. Protocol Buffers seem like a fantastic
>> alternative to XML, which is proving to be a huge bottleneck for mainframe
>> applications trying to web-enable legacy applications.
>>
>> What modules/classes contain the ASCII specific code?
>>
>>
> ok, it looks like the problem is tokenizer.cc - the character classes ain't
> gonna cut it for EBCDIC. I think I could patch that by using cctype, but
> there are some comments in there warning against its use?
>
>
>  Kenton Varda wrote:
>>
>>> If your compiler transforms string literals to a non-ASCII character set,
>>> then the code generated by the protocol compiler won't work.  We could
>>> perhaps fix this by escaping every character in the embedded descriptor, but
>>> other problems might come up.  I don't have enough experience with EBCDIC to
>>> know what to expect.
>>>
>>> On Tue, May 19, 2009 at 4:03 AM, daveyc <dcrayf...@gmail.com> wrote:
>>>
>>>
>>> I built Protocol Buffers on a z/OS mainframe. Build was fine but the
>>> unit test choked:
>>>
>>> libprotobuf FATAL google/protobuf/descriptor.cc:1959] CHECK failed:
>>> proto.ParseFromArray(data, size):
>>>
>>> Without digging too deep in the code, is the parser capable of
>>> handling EBCDIC?

Re: Protocol buffers and EBCDIC

2009-05-19 Thread Monty Taylor

David Crayford wrote:
> Monty Taylor wrote:
>> Yeah... ctype works globally. There's also the char_traits stuff. The
>> thing that continues to amaze me is that there is not a good, usable,
>> performant and reentrant charset handling lib for c++. Lemme know if
>> you find a good one...
>>
>>   
> 
> Dunno if it meets your requirements, but ICU from IBM comes to mind. I
> know the xerces XML framework from Apache uses it:
> http://site.icu-project.org/.

ICU would be great if it didn't store all strings internally as UCS-2.
That's a big overhead whenever you're dealing primarily in utf8 or
latin1. What would be great is a lib that uses the ICU interfaces but is
implemented in utf8 instead...

>> David Crayford wrote:
>>
>>> David Crayford wrote:
>>>
>>>> http://en.wikipedia.org/wiki/EBCDIC_1047
>>>>
>>>> Note the non-contiguous letter sequences. EBCDIC sucks, but it is
>>>> used on enterprise machines and still holds the majority of
>>>> the world's production data. Protocol Buffers seem like a fantastic
>>>> alternative to XML, which is proving to be a huge bottleneck for
>>>> mainframe applications trying to web-enable legacy applications.
>>>>
>>>> What modules/classes contain the ASCII specific code?
>>>>
>>> ok, it looks like the problem is tokenizer.cc - the character classes
>>> ain't gonna cut it for EBCDIC. I think I could patch that by using
>>> cctype, but there are some comments in there warning against its use?
>>>
>>>> Kenton Varda wrote:
>>>>
>>>>> If your compiler transforms string literals to a non-ASCII
>>>>> character set, then the code generated by the protocol compiler
>>>>> won't work.  We could perhaps fix this by escaping every character
>>>>> in the embedded descriptor, but other problems might come up.  I
>>>>> don't have enough experience with EBCDIC to know what to expect.
>>>>>
>>>>> On Tue, May 19, 2009 at 4:03 AM, daveyc <dcrayf...@gmail.com> wrote:
>>>>>
>>>>> I built Protocol Buffers on a z/OS mainframe. Build was fine
>>>>> but the unit test choked:
>>>>>
>>>>> libprotobuf FATAL google/protobuf/descriptor.cc:1959] CHECK failed:
>>>>> proto.ParseFromArray(data, size):
>>>>>
>>>>> Without digging too deep in the code, is the parser capable of
>>>>> handling EBCDIC?





Re: Protocol buffers and EBCDIC

2009-05-19 Thread David Crayford

Monty Taylor wrote:
> Yeah... ctype works globally. There's also the char_traits stuff. The thing 
> that continues to amaze me is that there is not a good, usable, performant 
> and reentrant charset handling lib for c++. Lemme know if you find a good 
> one...
>
>   

Dunno if it meets your requirements, but ICU from IBM comes to mind. I 
know the xerces XML framework from Apache uses it: 
http://site.icu-project.org/.

> David Crayford wrote:
>
>> David Crayford wrote:
>>
>>> http://en.wikipedia.org/wiki/EBCDIC_1047
>>>
>>> Note the non-contiguous letter sequences. EBCDIC sucks, but it is used 
>>> on enterprise machines and still holds the majority of the 
>>> world's production data. Protocol Buffers seem like a fantastic
>>> alternative to XML, which is proving to be a huge bottleneck for 
>>> mainframe applications trying to web-enable legacy applications.
>>>
>>> What modules/classes contain the ASCII specific code?
>>>
>> ok, it looks like the problem is tokenizer.cc - the character classes 
>> ain't gonna cut it for EBCDIC. I think I could patch that by using 
>> cctype, but there are some comments in there warning against its use?
>>
>>> Kenton Varda wrote:
>>>
>>>> If your compiler transforms string literals to a non-ASCII character 
>>>> set, then the code generated by the protocol compiler won't work.  We 
>>>> could perhaps fix this by escaping every character in the embedded 
>>>> descriptor, but other problems might come up.  I don't have enough 
>>>> experience with EBCDIC to know what to expect.
>>>>
>>>> On Tue, May 19, 2009 at 4:03 AM, daveyc <dcrayf...@gmail.com> wrote:
>>>>
>>>> I built Protocol Buffers on a z/OS mainframe. Build was fine but the
>>>> unit test choked:
>>>>
>>>> libprotobuf FATAL google/protobuf/descriptor.cc:1959] CHECK failed:
>>>> proto.ParseFromArray(data, size):
>>>>
>>>> Without digging too deep in the code, is the parser capable of
>>>> handling EBCDIC?





Re: Protocol buffers and EBCDIC

2009-05-19 Thread David Crayford

David Crayford wrote:
> http://en.wikipedia.org/wiki/EBCDIC_1047
>
> Note the non-contiguous letter sequences. EBCDIC sucks, but it is used 
> on enterprise machines and still holds the majority of the 
> world's production data. Protocol Buffers seem like a fantastic
> alternative to XML, which is proving to be a huge bottleneck for 
> mainframe applications trying to web-enable legacy applications.
>
> What modules/classes contain the ASCII specific code?
>

ok, it looks like the problem is tokenizer.cc - the character classes 
ain't gonna cut it for EBCDIC. I think I could patch that by using 
cctype, but there are some comments in there warning against its use?

> Kenton Varda wrote:
>> If your compiler transforms string literals to a non-ASCII character 
>> set, then the code generated by the protocol compiler won't work.  We 
>> could perhaps fix this by escaping every character in the embedded 
>> descriptor, but other problems might come up.  I don't have enough 
>> experience with EBCDIC to know what to expect.
>>
>> On Tue, May 19, 2009 at 4:03 AM, daveyc <dcrayf...@gmail.com> wrote:
>>
>> I built Protocol Buffers on a z/OS mainframe. Build was fine but the
>> unit test choked:
>>
>> libprotobuf FATAL google/protobuf/descriptor.cc:1959] CHECK failed:
>> proto.ParseFromArray(data, size):
>>
>> Without digging too deep in the code, is the parser capable of
>> handling EBCDIC?
>>
>
>





Re: Protocol buffers and EBCDIC

2009-05-19 Thread David Crayford

http://en.wikipedia.org/wiki/EBCDIC_1047

Note the non-contiguous letter sequences. EBCDIC sucks, but it is used 
on enterprise machines and still holds the majority of the world's 
production data. Protocol Buffers seem like a fantastic
alternative to XML, which is proving to be a huge bottleneck for 
mainframe applications trying to web-enable legacy applications.

What modules/classes contain the ASCII specific code?

Kenton Varda wrote:
> If your compiler transforms string literals to a non-ASCII character 
> set, then the code generated by the protocol compiler won't work.  We 
> could perhaps fix this by escaping every character in the embedded 
> descriptor, but other problems might come up.  I don't have enough 
> experience with EBCDIC to know what to expect.
>
> On Tue, May 19, 2009 at 4:03 AM, daveyc <dcrayf...@gmail.com> wrote:
>
>
> I built Protocol Buffers on a z/OS mainframe. Build was fine but the
> unit test choked:
>
> libprotobuf FATAL google/protobuf/descriptor.cc:1959] CHECK failed:
> proto.ParseFromArray(data, size):
>
> Without digging too deep in the code, is the parser capable of
> handling EBCDIC?
>





Re: Protocol buffers and EBCDIC

2009-05-19 Thread Kenton Varda
If your compiler transforms string literals to a non-ASCII character set,
then the code generated by the protocol compiler won't work.  We could
perhaps fix this by escaping every character in the embedded descriptor, but
other problems might come up.  I don't have enough experience with EBCDIC to
know what to expect.

On Tue, May 19, 2009 at 4:03 AM, daveyc <dcrayf...@gmail.com> wrote:

>
> I built Protocol Buffers on a z/OS mainframe. Build was fine but the
> unit test choked:
>
> libprotobuf FATAL google/protobuf/descriptor.cc:1959] CHECK failed:
> proto.ParseFromArray(data, size):
>
> Without digging too deep in the code, is the parser capable of
> handling EBCDIC?
>
