[ 
https://issues.apache.org/jira/browse/THRIFT-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694376#action_12694376
 ] 

Chad Walters commented on THRIFT-395:
-------------------------------------

Jonathan, you have stumbled on an old ugly problem in Thrift. The 'string' type 
was originally the only way to pass arbitrary binary data around but this 
didn't actually work properly in Java because of its requirement that Strings 
carry an encoding. The 'binary' subtype was introduced to fix this. There was 
no agreement that string should enforce UTF-8 encoding, even though this meant 
an inability to enforce interoperability with Java, probably driven in large 
part by pre-existing data at Facebook (and other places?) where strings were 
already used for binary data in C++ (at the time, Java was somewhat of a 
second-class citizen for Thrift -- IIRC Facebook's emphasis was on C++, Python, 
PHP). Somehow I imagine that the backwards compatibility issue is not going to 
be taken off the table.

I may not fully understand the issues with Python so forgive me if this 
suggestion is naive: Can we split the difference and have some kind of 
configuration option to "enforce UTF-8" for Python (but make it off by default)?

The policy would then be: use non-UTF8 encoding in strings if you wish, but 
realize that you will not interoperate correctly with Java and C# all the time 
or with Python when "enforce UTF-8" mode is on.
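
To make the suggestion concrete, here is a rough sketch of what such an 
opt-in switch might look like. It is written for Python 3, and the names 
encode_string, decode_string and enforce_utf8 are made up for illustration; 
none of them are part of the Thrift Python library or the attached patches.

```python
# Hypothetical helpers illustrating an opt-in "enforce UTF-8" mode.
# These names are assumptions for this sketch, not Thrift APIs.

def encode_string(value, enforce_utf8=False):
    """Return the bytes to put on the wire for a 'string' field."""
    if isinstance(value, str):
        # Text has to become bytes somehow; this sketch assumes UTF-8.
        return value.encode("utf-8")
    if enforce_utf8:
        # Strict mode: reject byte strings that are not valid UTF-8.
        value.decode("utf-8")  # raises UnicodeDecodeError on bad input
    return bytes(value)


def decode_string(raw, enforce_utf8=False):
    """Turn wire bytes back into a Python value."""
    if enforce_utf8:
        # Strict mode: hand back text, failing loudly on invalid UTF-8.
        return raw.decode("utf-8")
    # Default mode: pass the bytes through untouched (legacy behavior).
    return raw
```

With the flag off, arbitrary bytes round-trip unchanged, which keeps existing 
binary-in-string data working. With the flag on, invalid UTF-8 is rejected at 
the boundary instead of surfacing later as an interoperability problem.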


> Python library + compiler does not support unicode strings
> ----------------------------------------------------------
>
>                 Key: THRIFT-395
>                 URL: https://issues.apache.org/jira/browse/THRIFT-395
>             Project: Thrift
>          Issue Type: Bug
>          Components: Compiler (Python), Library (Python)
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>            Priority: Blocker
>             Fix For: 0.1
>
>         Attachments: 
> 0001-python-Minor-cleanup-of-protocols-don-t-use-str.patch, 
> 0002-THRIFT-395.-python-Phase-One-of-support-for-unicode.patch, 
> 0003-THRIFT-395.-python-Phase-Two-of-support-for-unicode.patch, 
> 0004-python-Remove-ridiculous-semicolons-from-gen-code.patch, 
> python-utf8-v2.patch, python-utf8.patch
>
>
> Effectively, all strings in the python bindings are treated as binary strings 
> -- no encoding/decoding to UTF-8 is done.  So if a unicode object is passed 
> to a (regular, non-binary) string, an exception is raised.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
