[jira] Commented: (THRIFT-395) Python library + compiler does not support unicode strings

David Reiss (JIRA) Wed, 01 Apr 2009 13:35:37 -0700

    [ 
https://issues.apache.org/jira/browse/THRIFT-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694722#action_12694722
 ]


David Reiss commented on THRIFT-395:
------------------------------------

Man, I sleep in one morning and miss the whole party.

Since we've strayed a bit from the original topic, let me ask a quick question 
to make sure we're at least aware of the scope of the discussion: What concrete 
changes would you like to see in Thrift (in the area of encodings) *other than* 
the return type of readString in Python 2?

Now, let me just respond to a bunch of stuff in chronological order...

bq. Can we split the difference and have some kind of configuration option to 
"enforce UTF-8" for Python (but make it off by default)?
I'd be fine with this, though the change to the extension module is more 
complicated than the change to the pure-Python stuff.
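
To illustrate the kind of opt-in check being discussed, here is a minimal sketch in Python 3 syntax. The helper name read_string and the enforce_utf8 flag are hypothetical and are not part of the Thrift library; the sketch only shows that validation can be off by default and leave the bytes untouched.

```python
def read_string(raw: bytes, enforce_utf8: bool = False) -> bytes:
    """Return the bytes of a string field, optionally validating them.

    Hypothetical helper for illustration only; not a Thrift API.
    """
    if enforce_utf8:
        # bytes.decode raises UnicodeDecodeError if raw is not valid UTF-8.
        raw.decode("utf-8")
    # The payload itself is returned unchanged in both modes.
    return raw
```

With the flag off, arbitrary bytes pass through; with it on, invalid UTF-8 is rejected at read time.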

bq. what do you think of adding a new annotation (e.g. string.encoding) for 
specifying the actual string encoding?
I'd also be fine with that.  See THRIFT-414 for my planned approach.

bq. I'd deprecate str strings and, at some point in the future, support unicode 
strings only
If you're talking about Python, I think we should definitely do this for Python 
3, but never do it for Python 2.  If you're talking about all languages, I 
think it is unrealistic because C++, PHP, Perl, and Erlang are not going to 
have robust native Unicode support any time in the foreseeable future.

bq. is utf8 strings the right design decision, absent backwards-compatibility 
concerns [...] I think some people are reluctant to admit 1. because they are 
afraid of 2.
I think that it is not.  Requiring UTF-8 might seem sensible in a 
mostly-English environment, but having support for UTF-16 or a Chinese-oriented 
encoding (for example) can be very useful.  I'm fine saying that Thrift strings 
should be UTF-8 encoded unless otherwise specified (like, by an annotation), 
but enforcing it in environments that could benefit from a non-UTF-8 encoding 
is harmful.
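
As a rough illustration of why a non-UTF-8 encoding can be attractive for some text, the sketch below (Python 3 syntax, standard codecs only) compares encoded sizes for a two-character Chinese string. The function name encoded_sizes is made up for this example.

```python
def encoded_sizes(text, encodings):
    """Map each encoding name to the number of bytes text occupies in it."""
    return {name: len(text.encode(name)) for name in encodings}

# U+4E2D U+6587 ("Chinese language"), written with escapes to keep this ASCII.
sample = "\u4e2d\u6587"
sizes = encoded_sizes(sample, ["utf-8", "utf-16-le", "gb18030"])
# Each of these characters takes 3 bytes in UTF-8 but 2 bytes in
# UTF-16 and in GB18030.
```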

bq. I think adding user-specified encodings adds more complexity than it's worth
bq. I think allowing the user to specify string encoding just adds complexity
I disagree.  I think that if we say that strings should default to UTF-8 unless 
otherwise annotated, it is not a big deal.  I think that removing the ability 
to support other encodings is a big deal.

bq. made restrictions on the types for map keys
I haven't ruled this out, if you want to talk about it.  But it should be a 
separate issue.  And if you are serious, we should do it before the release.

bq. made binary its own standalone type
This is effectively the case already.  The only possible problems arise when 
you change a field from string to binary without changing the field id (which 
is what Jonathan is suggesting, btw), and even then, I think only in the JSON 
protocol.

bq. If you are using the current code and sending binary data as a "string" 
then you are probably using Python on both client and server
C++, Ruby, Perl, PHP, and Erlang also do this.

bq. if I am understanding correctly, Python3 is now in the same camp as Java 
and C# - is that correct?
Exactly.  The "str" type in Python 3 is effectively the same as the "unicode" 
object in Python 2.  It is a string of Unicode code points that cannot be used 
in a context where bytes are expected.
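
A small Python 3 sketch of that distinction, using only built-in types (the function name concat is just for illustration):

```python
text = "caf\u00e9"           # str: four Unicode code points
data = text.encode("utf-8")  # bytes: five bytes, since U+00E9 takes two in UTF-8

def concat(raw: bytes, s: str) -> bytes:
    # Python 3 refuses to mix bytes and str; this raises TypeError.
    return raw + s
```

Calling concat(data, text) fails with a TypeError, whereas in Python 2 the equivalent str + unicode expression would attempt an implicit ASCII decode of the byte string instead.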

bq. If so, maybe we want to treat Python3 as a different target language from 
Python2
Definitely.

bq. I detect a little bit of pro-Python2 on David's part
That is not my intention.  I actually think the Java/Python3 data model makes 
more sense in most contexts.  But I think that we should treat Python 2 as 
Python 2 (AFAIK, Thrift doesn't work in Python 3), which means that strings are 
strs.  A few examples of this: repr returns a str. Exception messages are strs. 
"" is a str.  Data read from files (even those not opened in binary mode) are strs.

bq. Right now most thrift implementations cannot talk to my Java server and 
that is broken.
bq. We interoperate via Thrift across C++, Ruby, Java, Python2 and Erlang here 
and everything works just fine. We just make limited use of the 'string' type - 
and make sure that applications only send UTF-8 data via 'string'
bq. In other words, you are sending binary data that happens to be an encoded 
string and calling that a string, which it is not. It is binary data. That's 
working around one bug with another in my book.
Chad is right.  As in all C++, Ruby, PHP, Perl, and Erlang programs, it is 
simply the application's responsibility to ensure that the string is properly 
UTF-8 encoded on writing and to interpret the string as UTF-8 on reading.  I think 
you are assuming that the "string" is a "Unicode string" or a "string of 
Unicode code points".  In Thrift, this is not the case.  It is a string of 
bytes (that are presumably representing text), and it is up to the application 
to ensure that the bytes make sense.  Now, if we want to establish a 
*convention* that the bytes should be a UTF-8-encoded Unicode string unless 
otherwise annotated, that's fine with me.  But I think that mandating UTF-8 is 
a harmful restriction; mandating Unicode, while probably fine, is *not* without 
downsides; and forcing applications to use special types for strings is pretty 
much out of the question.
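
To make the "application's responsibility" point concrete, here is a minimal sketch in Python 3 syntax. The names to_wire, from_wire, and DEFAULT_ENCODING are hypothetical and stand in for whatever the application does around a Thrift string field; the encoding argument plays the role of an annotation.

```python
DEFAULT_ENCODING = "utf-8"  # the proposed convention, not a mandate

def to_wire(text, encoding=DEFAULT_ENCODING):
    """Encode text into the byte string that goes into the field."""
    return text.encode(encoding)

def from_wire(raw, encoding=DEFAULT_ENCODING):
    """Interpret the bytes of the field using the agreed encoding."""
    return raw.decode(encoding)

payload = to_wire("caf\u00e9")             # b"caf\xc3\xa9"
same = from_wire(payload)                  # round-trips when both sides agree
garbled = from_wire(payload, "latin-1")    # wrong assumption: no error, wrong text
```

Nothing at the byte level flags the mismatch in the last line, which is why the writer and reader have to agree on the encoding by convention or annotation.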

bq. In 2009 a language that doesn't support unicode is barely usable, and will 
almost certainly support unicode soon.
bq. AFAIK all the thrift languages do support unicode already but I could be 
wrong on one or two.
There is a difference between supporting Unicode and having native-feeling 
support for Unicode.  If you mean native-feeling support, then most languages 
do *not* have it.
 * C++ has wstring, which can be used for Unicode strings, but it is rarely 
used and there is no support for encoding and decoding.  The native-feeling way 
to write C++ is to use string-of-bytes std::string.
 * Ruby and PHP's string type is a string of bytes.  They have special 
functions for treating them as pre-encoded Unicode strings.  Believe it or not, 
it seems like PHP's support here might actually be better than Ruby's.
 * Erlang is completely Unicode-oblivious.

The only reason that this discussion is coming up here is that Python is the 
only Thrift language (AFAIK) that is on the fence between strings as bytes and 
strings as code points.

> Python library + compiler does not support unicode strings
> ----------------------------------------------------------
>
>                 Key: THRIFT-395
>                 URL: https://issues.apache.org/jira/browse/THRIFT-395
>             Project: Thrift
>          Issue Type: Bug
>          Components: Compiler (Python), Library (Python)
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>            Priority: Blocker
>             Fix For: 0.1
>
>         Attachments: 
> 0001-python-Minor-cleanup-of-protocols-don-t-use-str.patch, 
> 0002-THRIFT-395.-python-Phase-One-of-support-for-unicode.patch, 
> 0003-THRIFT-395.-python-Phase-Two-of-support-for-unicode.patch, 
> 0004-python-Remove-ridiculous-semicolons-from-gen-code.patch, 
> python-utf8-v2.patch, python-utf8.patch
>
>
> Effectively, all strings in the python bindings are treated as binary strings 
> -- no encoding/decoding to UTF-8 is done.  So if a unicode object is passed 
> to a (regular, non-binary) string, an exception is raised.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
