[ 
https://issues.apache.org/jira/browse/THRIFT-765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863870#action_12863870
 ] 

Bryan Duxbury commented on THRIFT-765:
--------------------------------------

I've spent a bunch of time playing around with this. I discovered that the 
"bugs" with the new encoder/decoder only occur when decoding invalid UTF-8 
strings. In particular, the Java standard decoder seems to take a lot more care 
with validating every byte it examines before combining it into the codepoint 
and converting it to UTF-16. 
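
For anyone who wants to see the lenient behavior for themselves, here is a minimal sketch. It assumes a JDK with java.nio.charset.StandardCharsets (Java 7 or later); the class and method names are made up for illustration and are not part of any patch on this issue.

```java
import java.nio.charset.StandardCharsets;

public class LenientDecode {
    // Decode with the JDK's default policy: malformed input is replaced
    // with U+FFFD rather than rejected.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xC3 opens a two-byte sequence, but 0x28 '(' is not a continuation byte.
        byte[] invalid = {(byte) 0xC3, (byte) 0x28};
        String s = decode(invalid);
        System.out.println("length = " + s.length());
        System.out.println("first char is U+FFFD: " + (s.charAt(0) == '\uFFFD'));
    }
}
```

A decoder that skips these per-byte checks will produce different output for inputs like this one, which is where the "bugs" come from.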

While the decoding function I have right now is still about 2x the speed of 
Java's, adding in all the checks required to get us to parity of functionality 
will almost certainly evaporate the performance benefits. Even worse, it seems 
like the performance benefits of encoding have somehow disappeared during the 
process of implementing surrogate pair support. (This one confuses me - my 
benchmark doesn't even cover a string that contains surrogate pairs. Maybe my 
original tests were flawed?)

The bottom line is that it looks like this is a dead end, unless we are willing 
to sacrifice "correctness" when decoding invalid UTF-8 encoded strings. You 
could argue that if it's a bad encoding already, it might be best to detect 
that and throw rather than silently convert like Java does, but that's a debate 
for another time.
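
For reference, the "detect and throw" alternative is already available in the JDK through CharsetDecoder. The following is only a sketch of that option, not something the patches here do; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Decode UTF-8 and fail on malformed input instead of substituting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] invalid = {(byte) 0xC3, (byte) 0x28};
        try {
            decodeStrict(invalid);
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

Note that this goes through the standard decoder, so it illustrates the semantics only and says nothing about speed.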

The only way that reviving this issue would make sense is if we are willing to 
support a separate encoding mechanism for the purpose of avoiding buffer 
allocation and copies during write. At the moment, we're not equipped to 
benefit from that, so maybe I'll reevaluate later.
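
To make the "avoid allocation and copies" idea concrete, here is a rough sketch of an encoder that writes straight into a caller-supplied byte array, including surrogate pair handling. This is not the code from the attached patches; the names are hypothetical, and writing '?' for an unpaired surrogate is my assumption about how to mimic String.getBytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BufferEncode {
    // Encodes s as UTF-8 into buf starting at off and returns the new offset.
    // The caller must guarantee at least 3 * s.length() free bytes in buf.
    static int encodeUtf8(String s, byte[] buf, int off) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int cp = c;
            // Combine a valid surrogate pair into a single code point.
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                cp = Character.toCodePoint(c, s.charAt(++i));
            }
            if (cp < 0x80) {
                buf[off++] = (byte) cp;
            } else if (cp < 0x800) {
                buf[off++] = (byte) (0xC0 | (cp >> 6));
                buf[off++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                // Unpaired surrogate: assumption, substitute '?'.
                buf[off++] = (byte) '?';
            } else if (cp < 0x10000) {
                buf[off++] = (byte) (0xE0 | (cp >> 12));
                buf[off++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                buf[off++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                buf[off++] = (byte) (0xF0 | (cp >> 18));
                buf[off++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                buf[off++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                buf[off++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
        return off;
    }

    public static void main(String[] args) {
        String s = "a\u00e9\u20ac\uD83D\uDE00"; // 1-, 2-, 3- and 4-byte cases
        byte[] buf = new byte[s.length() * 3];
        int n = encodeUtf8(s, buf, 0);
        byte[] expected = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(Arrays.copyOf(buf, n), expected));
    }
}
```

The point is that a transport able to hand out its internal buffer could be written to directly, where String.getBytes always allocates a fresh array that then has to be copied.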


> Improved string encoding and decoding performance
> -------------------------------------------------
>
>                 Key: THRIFT-765
>                 URL: https://issues.apache.org/jira/browse/THRIFT-765
>             Project: Thrift
>          Issue Type: Improvement
>          Components: Library (Java)
>    Affects Versions: 0.2
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>             Fix For: 0.4
>
>         Attachments: thrift-765-redux-v2.patch, thrift-765-redux.patch, 
> thrift-765.patch
>
>
> One of the most consistent time-consuming spots of Thrift serialization and 
> deserialization is string encoding. For some inscrutable reason, 
> String.getBytes("UTF-8") is slow. 
> However, it's recently been brought to my attention that DataOutputStream's 
> writeUTF method has a faster implementation of UTF-8 encoding and decoding. 
> We should use this style of encoding.
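
One caveat on the description above: writeUTF emits the JDK's "modified UTF-8" with a two-byte length prefix, so its output is not byte-compatible with String.getBytes("UTF-8"). The sketch below only compares output sizes to show the difference; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class WriteUtfCompare {
    // Bytes produced by DataOutputStream.writeUTF: 2-byte length + modified UTF-8.
    static byte[] viaWriteUTF(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeUTF(s);
        out.flush();
        return bos.toByteArray();
    }

    // Bytes produced by the standard UTF-8 encoder.
    static byte[] viaGetBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String[] samples = {"hello", "\u0000", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(viaWriteUTF(s).length + " vs " + viaGetBytes(s).length);
        }
    }
}
```

So it is the style of the hand-written loop in writeUTF that is of interest here, not its exact wire format.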

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
