[jira] Commented: (THRIFT-395) Python library + compiler does not support unicode strings

Chad Walters (JIRA) Wed, 01 Apr 2009 10:02:35 -0700

    [ 
https://issues.apache.org/jira/browse/THRIFT-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694618#action_12694618
 ]


Chad Walters commented on THRIFT-395:
-------------------------------------

There is a tension in Thrift between allowing types to have their natural 
meaning in each language and having complete interoperability across the full 
suite of languages. This is just one example.

Should I limit the expressiveness of Thrift structures I can use in one 
language (C++) because of the strictures of some other language amongst those 
supported by Thrift that I may or may not be using? Thrift has not always made 
a consistent choice on this.

On the full interoperability side:
-- unsigned integers are not supported because a number of languages don't 
support them natively.

On the side of "the IDL writer and/or the applications are responsible for 
guaranteeing interoperability":
-- the application is responsible for interoperability of strings between, say, 
Java and C++ -- that is, if you want to interoperate, you need to make sure 
that C++ is sending only UTF-8-encoded data in strings.
-- the IDL writer is responsible for determining if map keys should only be 
primitive types (required by some languages, and by the JSON protocol) or if 
they can be structures or containers as well (limiting interoperability).
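
To make the string point concrete, here is a minimal Python sketch of the kind 
of check an application would have to do itself before putting bytes into a 
Thrift string. It is not Thrift code; the helper name is_valid_utf8 and the 
sample values are assumptions made up for illustration.

```python
def is_valid_utf8(payload: bytes) -> bool:
    """Return True if the payload decodes cleanly as UTF-8."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text encoded two ways: only the first is safe to hand to a
# peer that assumes strings are UTF-8 (for example, a Java client).
utf8_bytes = "na\u00efve".encode("utf-8")
latin1_bytes = "na\u00efve".encode("latin-1")

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

The Latin-1 bytes are a perfectly good C++ string, which is why nothing stops 
them from going on the wire today.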

I personally have always leaned toward more interoperability and 
correcting some of the early architectural warts. If it had been up to me, we 
would have accepted the backwards-compatibility break a while back, changed 
string to only be UTF-8, made restrictions on the types for map keys, made 
binary its own standalone type, etc. However, I can understand the concerns of 
those with a big investment in persisted Thrift data who have pushed back 
against non-backwards-compatible changes.

I'll repeat my suggestion and expand on it a little further: Thrift could 
operate in two modes: "more flexibility" or "more compatibility". Under "more 
flexibility", it would operate more or less as things are today (e.g., I 
wouldn't re-examine the signed vs. unsigned decision -- too many opportunities 
for foot shooting there). Under "more compatibility", strings would be required 
to be UTF-8, map keys would be required to be primitives, etc. I would expect 
that most people adopting Thrift now would select "more compatibility", but 
those with specific needs could use the "more flexibility" mode.
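
As a rough sketch of what the two modes might mean in practice, the Python 
below applies the two rules named above only when "more compatibility" is 
selected. Nothing like this exists in Thrift; the Mode enum, the function 
names, and the choice of which Python types count as "primitive" are all 
assumptions for illustration.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical switch between the two proposed modes."""
    MORE_FLEXIBILITY = "more flexibility"
    MORE_COMPATIBILITY = "more compatibility"


# Assumption: these Python types stand in for Thrift's primitive types.
PRIMITIVE_KEY_TYPES = (bool, int, float, str)


def check_string(value: bytes, mode: Mode) -> None:
    """Reject non-UTF-8 string payloads, but only in compatibility mode."""
    if mode is Mode.MORE_COMPATIBILITY:
        value.decode("utf-8")  # raises UnicodeDecodeError on bad input


def check_map_key(key: object, mode: Mode) -> None:
    """Reject struct or container map keys, but only in compatibility mode."""
    if mode is Mode.MORE_COMPATIBILITY and not isinstance(key, PRIMITIVE_KEY_TYPES):
        raise TypeError(f"map key of type {type(key).__name__} is not a primitive")
```

Under Mode.MORE_FLEXIBILITY both checks are no-ops, which matches "more or less 
as things are today".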

Adding a new type for Unicode strings seems fine to me as well. In fact, there 
are already suitable unused type constants in TProtocol.h (UTF8, along with 
UTF7 and UTF16); see 
http://svn.apache.org/viewvc/incubator/thrift/trunk/lib/cpp/src/protocol/TProtocol.h?view=markup
(Side note: Why are those there? I always figured it was leftover cruft from 
something that was worked on at FB and then abandoned. Should they be cleaned 
up?) The only real issue I can see is that it entails changes to all the 
protocols and code generators across all the languages -- it's just work, but 
it's not hard work.

bq. a. as I have said several times (to no contradictions), simply recompiling 
the IDL after changing to binary is a virtually painless way to get back the 
old behavior of treating all data as binary no matter what it was declared as

I hate to say this, since doing what you said would also smooth the way for 
promoting binary to a full-fledged type, but I am not sure that it's as easy as 
you suggest. I think this will break code, since the return type of 
readBinary() is not the same as the return type of readString().
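
A small Python sketch of the kind of breakage I mean. SketchProtocol and its 
read_string()/read_binary() methods are hypothetical stand-ins, not Thrift's 
actual generated code or protocol API; the only point is that a caller written 
against the string-returning method stops working when the field is switched 
to binary.

```python
class SketchProtocol:
    """Hypothetical reader with one method per declared field type."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    def read_string(self) -> str:
        # A field declared as string comes back as decoded text.
        return self._payload.decode("utf-8")

    def read_binary(self) -> bytes:
        # A field declared as binary comes back as raw bytes.
        return self._payload


def greet(name: str) -> str:
    """Existing application code that assumes it was given text."""
    return "Hello, " + name


proto = SketchProtocol(b"Chad")
print(greet(proto.read_string()))

try:
    # After the IDL change, the same call site receives bytes instead.
    greet(proto.read_binary())
except TypeError as exc:
    print("broken caller:", exc)
```

In a statically typed language the same mismatch would show up as a compile 
error rather than at run time, but either way the calling code has to change.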


> Python library + compiler does not support unicode strings
> ----------------------------------------------------------
>
>                 Key: THRIFT-395
>                 URL: https://issues.apache.org/jira/browse/THRIFT-395
>             Project: Thrift
>          Issue Type: Bug
>          Components: Compiler (Python), Library (Python)
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>            Priority: Blocker
>             Fix For: 0.1
>
>         Attachments: 
> 0001-python-Minor-cleanup-of-protocols-don-t-use-str.patch, 
> 0002-THRIFT-395.-python-Phase-One-of-support-for-unicode.patch, 
> 0003-THRIFT-395.-python-Phase-Two-of-support-for-unicode.patch, 
> 0004-python-Remove-ridiculous-semicolons-from-gen-code.patch, 
> python-utf8-v2.patch, python-utf8.patch
>
>
> Effectively, all strings in the python bindings are treated as binary strings 
> -- no encoding/decoding to UTF-8 is done.  So if a unicode object is passed 
> to a (regular, non-binary) string, an exception is raised.

