[ https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863870#action_12863870 ]
Bryan Duxbury commented on THRIFT-765:
--------------------------------------

I've spent a bunch of time playing around with this. I discovered that the "bugs" with the new encoder/decoder only occur when decoding invalid UTF-8 strings. In particular, the Java standard decoder seems to take a lot more care to validate every byte it examines before combining it into the codepoint and converting it to UTF-16. While the decoding function I have right now is still about 2x the speed of Java's, adding all the checks required to reach functional parity will almost certainly wipe out the performance benefits.

Even worse, the performance benefits of encoding seem to have disappeared somewhere in the process of implementing surrogate pair support. (This one confuses me - my benchmark doesn't even cover a string that contains surrogate pairs. Maybe my original tests were flawed?)

The bottom line is that this looks like a dead end, unless we are willing to sacrifice "correctness" when decoding invalid UTF-8 encoded strings. You could argue that if the encoding is already bad, it might be best to detect that and throw rather than silently convert like Java does, but that's a debate for another time.

The only way reviving this issue would make sense is if we are willing to support a separate encoding mechanism for the purpose of avoiding buffer allocation and copies during write. At the moment, we're not equipped to benefit from that, so maybe I'll reevaluate later.
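To make the "silently convert" versus "detect and throw" distinction above concrete, here is a minimal sketch using only the JDK's own decoders. It is not the code from the attached patches. The class and method names (Utf8DecodeModes, decodeLenient, decodeStrict) are made up for illustration, and it assumes Java 7 or later for StandardCharsets, which is newer than the JDKs in use when this comment was written.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8DecodeModes {
    // Lenient: the String constructor replaces malformed input with U+FFFD
    // and never throws.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict: a decoder configured with REPORT throws on malformed input
    // instead of substituting a replacement character.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // 0xC3 opens a two-byte sequence, but 'b' is not a continuation byte.
        byte[] invalid = { 'a', (byte) 0xC3, 'b' };

        System.out.println("lenient length: " + decodeLenient(invalid).length());
        try {
            decodeStrict(invalid);
            System.out.println("strict: decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("strict: rejected malformed input");
        }
    }
}
```

Either mode has to inspect every continuation byte, which is the validation cost described above. A hand-rolled decoder that skips those checks can be faster, but it will produce different output from both of these on invalid input.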
> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and
> deserialization is string encoding. For some inscrutable reason,
> String.getBytes("UTF-8") is slow.
> However, it's recently been brought to my attention that DataOutputStream's
> writeUTF method has a faster implementation of UTF-8 encoding and decoding.
> We should use this style of encoding.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
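One caveat on the writeUTF suggestion in the quoted description: DataOutputStream.writeUTF does not emit plain UTF-8. It writes a two-byte length prefix followed by "modified UTF-8", in which U+0000 takes two bytes and each half of a surrogate pair is encoded separately as three bytes. The sketch below only compares the two JDK outputs. It is not the encoder from the patches, and the names WriteUtfComparison, standardUtf8 and writeUtfBytes are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class WriteUtfComparison {
    // Standard UTF-8, as produced by String.getBytes.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Everything writeUTF emits: 2-byte length prefix plus modified UTF-8.
    static byte[] writeUtfBytes(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Also reached if the encoded form exceeds 65535 bytes.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String ascii = "thrift";
        // One code point outside the BMP, stored as a surrogate pair in UTF-16.
        String supplementary = new String(Character.toChars(0x1F600));

        System.out.println("ascii: " + standardUtf8(ascii).length
                + " vs " + writeUtfBytes(ascii).length);
        System.out.println("supplementary: " + standardUtf8(supplementary).length
                + " vs " + writeUtfBytes(supplementary).length);
    }
}
```

For ASCII the two differ only by the length prefix. For supplementary characters the payloads themselves differ, so an encoder written "in the style of" writeUTF needs explicit surrogate pair handling before its output is valid UTF-8 on the wire.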