[ https://issues.apache.org/jira/browse/THRIFT-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694521#action_12694521 ]
Jonathan Ellis commented on THRIFT-395:
---------------------------------------

@Chad:

So there are two questions here:

1. Are UTF-8 strings the right design decision, absent backwards-compatibility concerns?
2. Is it worth breaking back-compat for?

I think some people are reluctant to admit 1. because they are afraid of 2.

I think the case for 2. can be made as follows:

a. As I have said several times (to no contradictions), simply recompiling the IDL after changing the type to binary is a virtually painless way to get back the old behavior of treating all data as binary no matter what it was declared as.
b. Nobody is forcing you to upgrade. If svn 700000-whatever of thrift works for you, keep using it. If necessary, backporting fixes to a branch is not an unheard-of strategy either.
c. You have to weigh the pain to current users from changing behavior against the pain to future users from NOT changing. Hopefully the current user base is a fraction of what it will be in a few years. Unicode is not going to get less popular in that time.
d. If you can't change broken behavior before there is an official release, when CAN you change it?

Now, perhaps it's worth the time to briefly explain my use case. I work on the Cassandra distributed database, where we use thrift to let clients in any supported language talk to the Java server. Keys are `string`s, so they absolutely have to be compatible cross-platform. Currently they are not.

Telling users that "thrift doesn't really support unicode, so in Python you have to set this flag, and in other languages it doesn't work at all and it will be an uphill battle to get a patch accepted" is a non-starter. Cassandra has a high enough barrier to entry as it is; adding to it unnecessarily would be foolish. Without real unicode support we'll have to switch to binary to get behavior that is at least consistent cross-platform.
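To make the `string` versus binary distinction concrete, here is a minimal sketch of the behavior being argued for. It is not the code from the attached patches and not thrift's actual Python API: the function names are hypothetical, the 4-byte length prefix is an assumption made for illustration, and it is written for Python 3, where text and bytes are already separate types. The point is only that a `string` field is encoded to UTF-8 on write and decoded on read, while a binary field passes bytes through untouched.

```python
import struct


def write_string(text):
    """Hypothetical helper: encode text as UTF-8 behind a 4-byte length prefix."""
    data = text.encode("utf-8")
    return struct.pack("!i", len(data)) + data


def read_string(buf):
    """Hypothetical helper: read a length-prefixed field and decode it as UTF-8."""
    (length,) = struct.unpack("!i", buf[:4])
    return buf[4:4 + length].decode("utf-8")


def write_binary(data):
    """Hypothetical helper: write raw bytes with no encoding step."""
    return struct.pack("!i", len(data)) + data


def read_binary(buf):
    """Hypothetical helper: return the raw bytes exactly as they were written."""
    (length,) = struct.unpack("!i", buf[:4])
    return buf[4:4 + length]


# A non-ASCII key survives the round trip as text.
key = "caf\u00e9"
assert read_string(write_string(key)) == key

# Arbitrary bytes that are not valid UTF-8 still round-trip as binary.
blob = b"\xff\x00"
assert read_binary(write_binary(blob)) == blob
```

With this split, any client that decodes `string` fields as UTF-8 sees the same key, and data that is not text is declared as binary instead.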
> Python library + compiler does not support unicode strings
> ----------------------------------------------------------
>
>                 Key: THRIFT-395
>                 URL: https://issues.apache.org/jira/browse/THRIFT-395
>             Project: Thrift
>          Issue Type: Bug
>          Components: Compiler (Python), Library (Python)
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>            Priority: Blocker
>             Fix For: 0.1
>
>         Attachments:
>             0001-python-Minor-cleanup-of-protocols-don-t-use-str.patch,
>             0002-THRIFT-395.-python-Phase-One-of-support-for-unicode.patch,
>             0003-THRIFT-395.-python-Phase-Two-of-support-for-unicode.patch,
>             0004-python-Remove-ridiculous-semicolons-from-gen-code.patch,
>             python-utf8-v2.patch, python-utf8.patch
>
>
> Effectively, all strings in the python bindings are treated as binary strings
> -- no encoding/decoding to UTF-8 is done. So if a unicode object is passed
> to a (regular, non-binary) string, an exception is raised.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.