[ https://issues.apache.org/jira/browse/THRIFT-395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694618#action_12694618 ]
Chad Walters commented on THRIFT-395:
-------------------------------------

There is a tension in Thrift between allowing types to have their natural meaning in each language and having complete interoperability across the full suite of languages. This is just one example. Should I limit the expressiveness of Thrift structures I can use in one language (C++) because of the strictures of some other language amongst those supported by Thrift that I may or may not be using?

Thrift has not always made a consistent choice on this.

On the full interoperability side:
-- unsigned integers are not supported because a number of languages don't support them natively.

On the side of "the IDL writer and/or the applications are responsible for guaranteeing interoperability":
-- the application is responsible for interoperability of strings between, say, Java and C++ -- that is, if you want to interoperate, you need to make sure that C++ is sending only UTF-8 encoded data in strings
-- the IDL writer is responsible for determining if map keys should only be primitive types (required by some languages -- and the JSON protocol btw) or if they can be structures or containers as well (limiting interoperability)

I personally have always leaned on the side of more interoperability and correcting some of the early architectural warts. If it had been up to me, we would have bitten off the backwards compatibility break a while back, changed string to only be UTF-8, made restrictions on the types for map keys, made binary its own standalone type, etc. However, I can understand the concerns of those with a big investment in persisted Thrift data who have pushed back against non-backwards compatible changes.

I'll repeat my suggestion and expand on it a little further: Thrift could operate in 2 modes: "more flexibility" or "more compatibility".
Under "more flexibility", it would operate more or less as things are today (e.g., I wouldn't re-examine the signed vs unsigned decision -- too many opportunities for foot shooting there). Under "more compatibility", strings would be required to be UTF-8, map keys would be required to be primitives, etc. I would expect that most people adopting Thrift now would select "more compatibility", but those with specific needs could use the "more flexibility" mode.

Adding a new type for Unicode strings seems fine to me as well -- in fact there are already suitable unused type constants in TProtocol.h (UTF8, along with UTF7 and UTF16) (see http://svn.apache.org/viewvc/incubator/thrift/trunk/lib/cpp/src/protocol/TProtocol.h?view=markup). (Side note: Why are those there? I always figured it was leftover cruft from something that was worked on at FB and then abandoned. Should they be cleaned up?) The only real issue I can see is that it entails changes to all the protocols and code generators across all the languages -- it's just work, but it's not hard work.

bq. a. as I have said several times (to no contradictions), simply recompiling the IDL after changing to binary is a virtually painless way to get back the old behavior of treating all data as binary no matter what it was declared as

I hate to say this, since doing what you said would also smooth the way for promoting binary to a full-fledged type, but I am not sure that it's as easy as you suggest. I think this will break code, since the return type of readBinary() is not the same as the return type of readString().
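
To make the return-type concern concrete, here is a minimal Python sketch. It is not Thrift's generated code or library API: `SketchReader`, `read_binary`, `read_string`, `encode_field`, and `greet` are hypothetical names, and the 4-byte length prefix is only an assumed wire layout for illustration. It shows calling code that was written against a text-returning accessor failing once the same field is read through a bytes-returning one.

```python
import struct


class SketchReader:
    """Hypothetical reader for illustration only; not the Thrift API."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def _read_raw(self) -> bytes:
        # Assumed layout: 4-byte big-endian length, then the payload bytes.
        (length,) = struct.unpack_from("!i", self._data, self._pos)
        self._pos += 4
        raw = self._data[self._pos:self._pos + length]
        self._pos += length
        return raw

    def read_binary(self) -> bytes:
        # Returns the payload untouched.
        return self._read_raw()

    def read_string(self) -> str:
        # Returns decoded text, so the return type differs from read_binary.
        return self._read_raw().decode("utf-8")


def encode_field(text: str) -> bytes:
    # Encode text as UTF-8 and prepend the assumed length prefix.
    raw = text.encode("utf-8")
    return struct.pack("!i", len(raw)) + raw


def greet(name: str) -> str:
    # Application code written against the text-returning accessor.
    return "hello, " + name


wire = encode_field("café")
as_string = SketchReader(wire).read_string()
as_binary = SketchReader(wire).read_binary()

greeting = greet(as_string)

# The same caller breaks when handed the bytes-returning accessor's result.
try:
    greet(as_binary)
    binary_breaks_caller = False
except TypeError:
    binary_breaks_caller = True
```

The bytes on the wire are identical in both cases; only the type handed to the application changes, which is why a recompile alone may not be enough for existing callers.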
> Python library + compiler does not support unicode strings
> ----------------------------------------------------------
>
>                 Key: THRIFT-395
>                 URL: https://issues.apache.org/jira/browse/THRIFT-395
>             Project: Thrift
>          Issue Type: Bug
>          Components: Compiler (Python), Library (Python)
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>            Priority: Blocker
>             Fix For: 0.1
>
>         Attachments: 0001-python-Minor-cleanup-of-protocols-don-t-use-str.patch, 0002-THRIFT-395.-python-Phase-One-of-support-for-unicode.patch, 0003-THRIFT-395.-python-Phase-Two-of-support-for-unicode.patch, 0004-python-Remove-ridiculous-semicolons-from-gen-code.patch, python-utf8-v2.patch, python-utf8.patch
>
>
> Effectively, all strings in the python bindings are treated as binary strings -- no encoding/decoding to UTF-8 is done. So if a unicode object is passed to a (regular, non-binary) string, an exception is raised.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.