[ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12862360#action_12862360 ]
Yonik Seeley commented on THRIFT-765:
-------------------------------------

Doug, I think you were right the first time. I took a quick glance at the code, and it doesn't handle surrogate pairs. The resulting output will be CESU-8, not UTF-8. Taking Ken's example, that code point encoded in UTF-8 is 4 bytes, while the encoding this patch implements will yield 6 bytes of CESU-8.
A round-trip test is not sufficient to test for correct UTF-8 (they will both decode to the same Java String).

> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.3
>
>         Attachments: thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and deserialization is string encoding. For some inscrutable reason, String.getBytes("UTF-8") is slow.
> However, it's recently been brought to my attention that DataOutputStream's writeUTF method has a faster implementation of UTF-8 encoding and decoding. We should use this style of encoding.
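
To make the surrogate-pair point in the comment concrete, here is a rough sketch comparing the two encodings for a single supplementary code point. Some assumptions: U+10400 is picked arbitrarily for illustration (it is not necessarily Ken's example), the class and method names are made up, and DataOutputStream.writeUTF stands in for the patch's encoder since the patch itself is not reproduced here. writeUTF emits Java's "modified UTF-8", which for surrogate pairs produces the same bytes as CESU-8 (it differs from CESU-8 only in how it encodes U+0000).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Thrift or of thrift-765.patch.
class SurrogatePairEncoding {

    // Standard UTF-8, as produced by the JDK charset encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes written by DataOutputStream.writeUTF, minus its 2-byte length prefix.
    static byte[] writeUtfPayload(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream and a short string.
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One code point outside the BMP; in a Java String it is two chars
        // (the surrogate pair D801 DC00).
        String s = new String(Character.toChars(0x10400));

        System.out.println(hex(standardUtf8(s)));    // F0 90 90 80 (4 bytes)
        System.out.println(hex(writeUtfPayload(s))); // ED A0 81 ED B0 80 (6 bytes)
    }
}
```

Both byte sequences decode back to the same Java String with their respective decoders, which is why a round-trip test passes either way. A test that compares the encoded bytes against String.getBytes("UTF-8") for a string containing a surrogate pair would catch the difference.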