Re: Lower overhead String encoding/decoding
Hi, I'm still planning to be the sponsor for this RFE and get this one into JDK 9. We are currently working on JEP 254 [1][2][3], in which the internal storage mechanism is changed from char[] to byte[]. If this JEP gets approved and is scheduled to target JDK 9, the decoding/encoding pieces will also need to be updated/changed accordingly. Given that, I would prefer to delay this RFE a little longer till all the pieces are settled. Do we actually need to wait for Compact Strings to land? The failure scenario would be that Compact Strings do not make it into JDK 9, and this code lingers in history. I mean it's fine to see if there is a clash between the two, but I don't think Richard's change should be blocked by Compact Strings, given it is ready for integration (?), and Compact Strings is not yet. Thanks for your replies gents - most appreciated. IMO the changes are orthogonal at the moment. If compact strings happen then this patch will probably need to be extended, but I imagine at least the implemented code covers the non-compact case and will remain - as would the API decisions and Javadoc. Of course that's just my opinion and I'm not as informed on the compact strings work as you are ;) regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
Re: Lower overhead String encoding/decoding
Hi gents, My apologies for the delay; things are a little slow during this period of the year :-) I will sponsor the RFE and prepare for the CCC (internally). We may need to go through the API and the implementation one more time. I was just wondering if there was any news on the CCC for this patch? Just tracking back on this front as I've not had a reply for over 3 months. It would be good to know the status of this patch. regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
Re: Lower overhead String encoding/decoding
Richard, I'm still planning to be the sponsor for this RFE and get this one into JDK 9. We are currently working on JEP 254 [1][2][3], in which the internal storage mechanism is changed from char[] to byte[]. If this JEP gets approved and is scheduled to target JDK 9, the decoding/encoding pieces will also need to be updated/changed accordingly. Given that, I would prefer to delay this RFE a little longer till all the pieces are settled. Regards, Sherman [1] http://openjdk.java.net/jeps/254 [2] https://bugs.openjdk.java.net/browse/JDK-8054307 [3] http://hg.openjdk.java.net/jdk9/sandbox/ On 6/3/15 9:14 AM, Richard Warburton wrote: Hi gents, My apologies for the delay; things are a little slow during this period of the year :-) I will sponsor the RFE and prepare for the CCC (internally). We may need to go through the API and the implementation one more time. I was just wondering if there was any news on the CCC for this patch? Just tracking back on this front as I've not had a reply for over 3 months. It would be good to know the status of this patch. regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto https://bugs.openjdk.java.net/browse/JDK-8072392
Re: Lower overhead String encoding/decoding
On 06/03/2015 07:34 PM, Xueming Shen wrote: I'm still planning to be the sponsor for this RFE and get this one into JDK 9. We are currently working on JEP 254 [1][2][3], in which the internal storage mechanism is changed from char[] to byte[]. If this JEP gets approved and is scheduled to target JDK 9, the decoding/encoding pieces will also need to be updated/changed accordingly. Given that, I would prefer to delay this RFE a little longer till all the pieces are settled. Do we actually need to wait for Compact Strings to land? The failure scenario would be that Compact Strings do not make it into JDK 9, and this code lingers in history. I mean it's fine to see if there is a clash between the two, but I don't think Richard's change should be blocked by Compact Strings, given it is ready for integration (?), and Compact Strings is not yet. Thanks, -Aleksey
Re: Lower overhead String encoding/decoding
Hi Xueming, My apologies for the delay; things are a little slow during this period of the year :-) I will sponsor the RFE and prepare for the CCC (internally). We may need to go through the API and the implementation one more time. I was just wondering if there was any news on the CCC for this patch? regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
Re: Lower overhead String encoding/decoding
On 12/30/14 3:25 PM, Richard Warburton wrote: Hi, Thanks for looking at this patch and agreeing to sponsor - I've pushed fixes for these issues at: http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-10/ Hope everyone has had a happy holiday period. Just following up on this issue to make sure that you have all the information needed and I've not been blocking the RFE process. Got a bit concerned due to radio silence. regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto Hi Richard, My apologies for the delay; things are a little slow during this period of the year :-) I will sponsor the RFE and prepare for the CCC (internally). We may need to go through the API and the implementation one more time. Regards, -Sherman
Re: Lower overhead String encoding/decoding
Hi, Thanks for looking at this patch and agreeing to sponsor - I've pushed fixes for these issues at: http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-10/ Hope everyone has had a happy holiday period. Just following up on this issue to make sure that you have all the information needed and I've not been blocking the RFE process. Got a bit concerned due to radio silence. regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
Re: Lower overhead String encoding/decoding
Hi gents, Here are some comments regarding the updated webrev. (1) String(ByteBuffer, String) needs @throws UEE. (2) It should be the default replacement byte array, not the replacement string, for all getBytes() methods when input is malformed/unmappable. (3) For decoding (new String) from a ByteBuffer, since it is guaranteed that all bytes in the input ByteBuffer will be decoded, it might be desirable to clearly specify that the position of the buffer will be advanced to its limit? (4) ln#1137 has an extra *. (5) StringCoding.decode(cs, bb): why do you want to allocate a direct buffer here for the untrusted case? Basically the output buffer ca will always be a wrapper of a char[]; all our charset implementations will have better performance if both input and output are array based (decode on the array directly, instead of via the slow ByteBuffer). Overall I think this is looking quite good and I think Sherman has captured the main issues. On #3, wording such as .. the position will be updated isn't clear enough to allow the method to be tested; it needs to make it clear that the position is advanced by the number of bytes that were decoded. Thanks for looking at this patch and agreeing to sponsor - I've pushed fixes for these issues at: http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-10/ regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
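Sherman's point (2) asks that the new methods fall back to the charset's default replacement rather than failing, which is how the existing byte[]-based String constructor already behaves. A minimal sketch with today's API (the method name here is ours, purely for illustration):

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Decode possibly-malformed bytes the way the existing String(byte[],
    // Charset) constructor does: malformed input is substituted with the
    // charset's default replacement, it is never thrown as an exception.
    static String decodeReplacing(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bad = { (byte) 0x80 }; // a lone UTF-8 continuation byte
        // The malformed byte becomes U+FFFD, the UTF-8 default replacement.
        System.out.println(decodeReplacing(bad).equals("\ufffd")); // true
    }
}
```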
Re: Lower overhead String encoding/decoding
Hi Richard, Here are some comments regarding the updated webrev. (1) String(ByteBuffer, String) needs @throws UEE. (2) It should be the default replacement byte array, not the replacement string, for all getBytes() methods when input is malformed/unmappable. (3) For decoding (new String) from a ByteBuffer, since it is guaranteed that all bytes in the input ByteBuffer will be decoded, it might be desirable to clearly specify that the position of the buffer will be advanced to its limit? (4) ln#1137 has an extra *. (5) StringCoding.decode(cs, bb): why do you want to allocate a direct buffer here for the untrusted case? Basically the output buffer ca will always be a wrapper of a char[]; all our charset implementations will have better performance if both input and output are array based (decode on the array directly, instead of via the slow ByteBuffer). -Sherman On 11/24/2014 08:45 AM, Richard Warburton wrote: Hi all, I'm sure everyone is very busy, but is there any chance that someone could take a look at the latest iteration of this patch? Thanks for taking the time to look at this - most appreciated. I've pushed the latest iteration to http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-8/. I think this is looking good. Thanks - I've pushed a new iteration to http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-9. For the constructor then the words decoding the specified byte buffer, it might be a bit clearer as decoding the remaining bytes in the For getBytes(ByteBuffer, Charset) then the position is advanced by the bytes written, no need to mention the number of chars read here. In the constructor then you make it clear that malformed/unmappable sequences use the default replacement. This is important to state in the getBytes methods too because the encoding can fail. I've made all these changes.
Hi Richard, hi all, Two comments. You replace the null check in getBytes() by a requireNonNull and at the same time you don't use requireNonNull in String(ByteBuffer, Charset); not very logical, is it? Thanks for spotting this Remi - I've fixed this issue in my latest iteration. I think you need a supplementary constructor that takes a ByteBuffer and a charset name as a String; I may be wrong, but it seems that every constructor of String that takes a Charset has a dual constructor that takes the charset as a String. As far as I remember, a constructor that takes the charset as a String may be faster because you can share the character decoder instead of creating a new one. A good observation. I've added variants which take a String for the charset name as well as the Charset variants. Having said that - wouldn't it also be a good idea to replicate the caching on the Charset versions as well as the charset name? I don't see any obvious reason why this isn't possible but perhaps there's something I'm missing here. Probably cleaner as a separate patch either way. regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
Re: Lower overhead String encoding/decoding
On 24/11/2014 18:41, Xueming Shen wrote: Hi Richard, Here are some comments regarding the updated webrev. (1) String(ByteBuffer, String) needs @throws UEE. (2) It should be the default replacement byte array, not the replacement string, for all getBytes() methods when input is malformed/unmappable. (3) For decoding (new String) from a ByteBuffer, since it is guaranteed that all bytes in the input ByteBuffer will be decoded, it might be desirable to clearly specify that the position of the buffer will be advanced to its limit? (4) ln#1137 has an extra *. (5) StringCoding.decode(cs, bb): why do you want to allocate a direct buffer here for the untrusted case? Basically the output buffer ca will always be a wrapper of a char[]; all our charset implementations will have better performance if both input and output are array based (decode on the array directly, instead of via the slow ByteBuffer). Overall I think this is looking quite good and I think Sherman has captured the main issues. On #3, wording such as .. the position will be updated isn't clear enough to allow the method to be tested; it needs to make it clear that the position is advanced by the number of bytes that were decoded. Sherman - are you going to sponsor this for Richard? -Alan
Re: Lower overhead String encoding/decoding
On 11/24/2014 01:21 PM, Alan Bateman wrote: On 24/11/2014 18:41, Xueming Shen wrote: Hi Richard, Here are some comments regarding the updated webrev. (1) String(ByteBuffer, String) needs @throws UEE. (2) It should be the default replacement byte array, not the replacement string, for all getBytes() methods when input is malformed/unmappable. (3) For decoding (new String) from a ByteBuffer, since it is guaranteed that all bytes in the input ByteBuffer will be decoded, it might be desirable to clearly specify that the position of the buffer will be advanced to its limit? (4) ln#1137 has an extra *. (5) StringCoding.decode(cs, bb): why do you want to allocate a direct buffer here for the untrusted case? Basically the output buffer ca will always be a wrapper of a char[]; all our charset implementations will have better performance if both input and output are array based (decode on the array directly, instead of via the slow ByteBuffer). Overall I think this is looking quite good and I think Sherman has captured the main issues. On #3, wording such as .. the position will be updated isn't clear enough to allow the method to be tested; it needs to make it clear that the position is advanced by the number of bytes that were decoded. Sherman - are you going to sponsor this for Richard? Yes, I will sponsor this RFE. -Sherman
Re: Lower overhead String encoding/decoding
Hi all, Thanks for taking the time to look at this - most appreciated. I've pushed the latest iteration to http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-8/. I think this is looking good. Thanks - I've pushed a new iteration to http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-9. For the constructor then the words decoding the specified byte buffer, it might be a bit clearer as decoding the remaining bytes in the For getBytes(ByteBuffer, Charset) then the position is advanced by the bytes written, no need to mention the number of chars read here. In the constructor then you make it clear that malformed/unmappable sequences use the default replacement. This is important to state in the getBytes methods too because the encoding can fail. I've made all these changes. Hi Richard, hi all, Two comments. You replace the null check in getBytes() by a requireNonNull and at the same time you don't use requireNonNull in String(ByteBuffer, Charset); not very logical, is it? Thanks for spotting this Remi - I've fixed this issue in my latest iteration. I think you need a supplementary constructor that takes a ByteBuffer and a charset name as a String; I may be wrong, but it seems that every constructor of String that takes a Charset has a dual constructor that takes the charset as a String. As far as I remember, a constructor that takes the charset as a String may be faster because you can share the character decoder instead of creating a new one. A good observation. I've added variants which take a String for the charset name as well as the Charset variants. Having said that - wouldn't it also be a good idea to replicate the caching on the Charset versions as well as the charset name? I don't see any obvious reason why this isn't possible but perhaps there's something I'm missing here. Probably cleaner as a separate patch either way.
regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
Re: Lower overhead String encoding/decoding
On 11/15/2014 07:07 PM, Richard Warburton wrote: Hi all, [...] A good observation. I've added variants which take a String for the charset name as well as the Charset variants. Having said that - wouldn't it also be a good idea to replicate the caching on the Charset versions as well as the charset name? I don't see any obvious reason why this isn't possible but perhaps there's something I'm missing here. Probably cleaner as a separate patch either way. regards, I think it would create a security hazard: if a rogue charset declares itself as UTF-8 and creates a malicious charset decoder, you don't want this decoder to be shared and reused. Rémi Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
Re: Lower overhead String encoding/decoding
Hi, Having said that - wouldn't it also be a good idea to replicate the caching on the Charset versions as well as the charset name? I don't see any obvious reason why this isn't possible but perhaps there's something I'm missing here. Probably cleaner as a separate patch either way. regards, I think it would create a security hazard: if a rogue charset declares itself as UTF-8 and creates a malicious charset decoder, you don't want this decoder to be shared and reused. When you're passing the name of the charset as a string, then you look up the charset by name, but if you were using Charsets, then you could check that your cache has the same charset instance. Wouldn't that avoid this security issue? regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
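The instance-identity check Richard suggests could look something like the following sketch (class and method names are ours, not from the patch): cache one decoder per thread, and compare the cached decoder's Charset with `==` rather than by name, so a rogue Charset that merely reports the name "UTF-8" can never match a trusted instance already in the cache.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecoderCache {
    // One cached decoder per thread, keyed by Charset *instance* identity.
    private static final ThreadLocal<CharsetDecoder> CACHE = new ThreadLocal<>();

    static CharsetDecoder decoderFor(Charset cs) {
        CharsetDecoder dec = CACHE.get();
        if (dec == null || dec.charset() != cs) { // identity, not equals()
            dec = cs.newDecoder();
            CACHE.set(dec);
        }
        dec.reset(); // decoders are stateful; reset before reuse
        return dec;
    }

    public static void main(String[] args) {
        CharsetDecoder a = decoderFor(StandardCharsets.UTF_8);
        CharsetDecoder b = decoderFor(StandardCharsets.UTF_8);
        System.out.println(a == b); // true: same instance reused on a cache hit
    }
}
```

Since the cache stores the decoder of whichever trusted Charset instance was passed in, a different (possibly malicious) Charset object with the same canonical name simply misses the cache and gets its own decoder, which is never shared.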
Re: Lower overhead String encoding/decoding
On 26/10/2014 21:10, Richard Warburton wrote: : Thanks for taking the time to look at this - most appreciated. I've pushed the latest iteration to http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-8/. I think this is looking good. For the constructor then the words decoding the specified byte buffer, it might be a bit clearer as decoding the remaining bytes in the For getBytes(ByteBuffer, Charset) then the position is advanced by the bytes written, no need to mention the number of chars read here. In the constructor then you make it clear that malformed/unmappable sequences use the default replacement. This is important to state in the getBytes methods too because the encoding can fail. -Alan.
Re: Lower overhead String encoding/decoding
On 10/28/2014 03:52 PM, Alan Bateman wrote: On 26/10/2014 21:10, Richard Warburton wrote: : Thanks for taking the time to look at this - most appreciated. I've pushed the latest iteration to http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-8/. I think this is looking good. For the constructor then the words decoding the specified byte buffer, it might be a bit clearer as decoding the remaining bytes in the For getBytes(ByteBuffer, Charset) then the position is advanced by the bytes written, no need to mention the number of chars read here. In the constructor then you make it clear that malformed/unmappable sequences use the default replacement. This is important to state in the getBytes methods too because the encoding can fail. -Alan. Hi Richard, hi all, Two comments. You replace the null check in getBytes() by a requireNonNull and at the same time you don't use requireNonNull in String(ByteBuffer, Charset); not very logical, is it? I think you need a supplementary constructor that takes a ByteBuffer and a charset name as a String; I may be wrong, but it seems that every constructor of String that takes a Charset has a dual constructor that takes the charset as a String. As far as I remember, a constructor that takes the charset as a String may be faster because you can share the character decoder instead of creating a new one. cheers, Rémi
Re: Lower overhead String encoding/decoding
On 10/26/2014 02:10 PM, Richard Warburton wrote: The getBytes(ByteBuffer, Charset) method needs a bit more javadoc. You should be able to borrow text from the CharsetEncoder#encode methods to help with that. I've updated the Javadoc with more information about the encoding and made these changes. I'm not sure if there's anything else that's missing in this case. Hi Richard, You have ...reflect the characters read and the byte written... for getBytes(ByteBuffer); the characters read and should be deleted? I'm not that sure it is necessary to document clearly that there would be no dangling bytes in the case of encoding/getBytes() when there is not enough room/space to encode a full char... -Sherman
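The "no dangling bytes" behaviour Sherman mentions can be observed with the existing CharsetEncoder API: when the destination buffer runs short mid-character, the encoder stops at the previous char boundary instead of writing a partial byte sequence. A small sketch (method name is ours, for illustration only):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class PartialEncode {
    // Encode as much of s as fits into dst, returning how many chars were
    // consumed -- the "direct link" to the underlying chars that Sherman
    // says a plain getBytes(byte[], offset, charset) return value lacks.
    static int encodeSome(String s, ByteBuffer dst) {
        CharBuffer src = CharBuffer.wrap(s);
        CoderResult r = StandardCharsets.UTF_8.newEncoder().encode(src, dst, true);
        // On overflow (r.isOverflow()) the encoder stops at a char boundary:
        // no dangling partial byte sequence is ever written to dst.
        return src.position();
    }

    public static void main(String[] args) {
        // Two euro signs need 3 UTF-8 bytes each; only one fits in 4 bytes.
        ByteBuffer dst = ByteBuffer.allocate(4);
        int consumed = encodeSome("\u20ac\u20ac", dst);
        System.out.println(consumed);       // 1: one full char consumed
        System.out.println(dst.position()); // 3: three bytes written, none dangling
    }
}
```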
Re: Lower overhead String encoding/decoding
Hi Alan, I think this looks better but I have a few comments on the API. Thanks for taking the time to look at this - most appreciated. I've pushed the latest iteration to http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-8/. For String(ByteBuffer, Charset) then it's still inconsistent to read from the buffer position but not advance it. I think the constructor needs to work like a regular decode in that respect. Related to this is the underflow case where there are insufficient remaining bytes to complete. If you don't advance the position then there is no way to detect this. I've modified the Javadoc and tests. The implementation was actually behaving this way already. The statement about the length of the String .. may not be equal to the length of the subarray might be there from a previous iteration. There isn't any array in the method signature so I think this statement needs to be made a bit clearer. I've rephrased this to refer to the ByteBuffer. For getBytes(byte[], int, int, Charset) then I think the javadoc could say a bit more. It will encode to a maximum of destBuffer.length - destOffset, for example. The @return should probably say that it's the number of bytes written to the array rather than copied. Minor nits: you probably don't want the starting p. Also the finals in the signature seem noisy/not needed. The getBytes(ByteBuffer, Charset) method needs a bit more javadoc. You should be able to borrow text from the CharsetEncoder#encode methods to help with that. I've updated the Javadoc with more information about the encoding and made these changes. I'm not sure if there's anything else that's missing in this case. regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
Re: Lower overhead String encoding/decoding
On 19/10/2014 23:13, Richard Warburton wrote: Hi, Hi Richard, couple of comments after a quick scan. Thanks for your comments - only just had a post-JavaOne chance to look at this stuff. I've pushed an update to: http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-7 I think this looks better but I have a few comments on the API. For String(ByteBuffer, Charset) then it's still inconsistent to read from the buffer position but not advance it. I think the constructor needs to work like a regular decode in that respect. Related to this is the underflow case where there are insufficient remaining bytes to complete. If you don't advance the position then there is no way to detect this. The statement about the length of the String .. may not be equal to the length of the subarray might be there from a previous iteration. There isn't any array in the method signature so I think this statement needs to be made a bit clearer. For getBytes(byte[], int, int, Charset) then I think the javadoc could say a bit more. It will encode to a maximum of destBuffer.length - destOffset, for example. The @return should probably say that it's the number of bytes written to the array rather than copied. Minor nits: you probably don't want the starting p. Also the finals in the signature seem noisy/not needed. The getBytes(ByteBuffer, Charset) method needs a bit more javadoc. You should be able to borrow text from the CharsetEncoder#encode methods to help with that. -Alan.
Re: Lower overhead String encoding/decoding
Hi, Hi Richard, couple of comments after a quick scan. Thanks for your comments - only just had a post-JavaOne chance to look at this stuff. I've pushed an update to: http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-7 (1) #474, the IOBE probably is no longer needed in the case of the ByteBuffer parameter. Fixed in new push. (2) For methods that have the ByteBuffer as input, it would be desirable to specify clearly that the bytes are read from position to limit, and whether or not the position will be advanced after decoding/encoding (important). Added Javadoc in new push. (3) getBytes(byte[], offset, charset): while I understand it might be useful to have such a method in certain circumstances, it is usually complicated to make it right/easy for the user to actually use. Consider the fact that the returned number of bytes encoded has no straightforward connection to how many underlying chars have been encoded (so the user of this method really has no idea how many underlying chars have been encoded into the dest buffer; is that enough? how big does the buffer need to be to encode the whole string?), especially the possibility that the last couple bytes of space might fall short of encoding a full char. Unlike getChars(), which has an easy, clear and direct link between the outgoing chars and the underlying chars. I would suggest it might be better to leave it out. I think you raise some good points here. In my mind this is how it works, let me know what you think. The client code of this API call should be able to infer the state of the byte[] - the offset parameter is in terms of bytes and refers to the destination buffer. If you want to know how many bytes have been used by the encoding, the method returns an int that gives the length. You can query the length of the String, which gives you the number of outbound chars. (4) StringCoding.decode() #239: remaining() should be used to return limit - position?
Fixed in new push. (5) In the case of untrusted, it might be more straightforward to get all the bytes out of the buffer first (you are allocating a byte buffer here anyway; I don't see an obvious benefit to getting a direct buffer, btw) and then pass them to the original/existing byte[]-to-char[] decoding implementation. We probably will take a deep look at the implementation later when the public API is settled. Yes - I was hoping to remove the additional ByteBuffer allocation within the method. I'm still not sure of the security implications around removing this code though. regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
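Sherman's point (5) — drain the buffer into a byte[] once, then reuse the existing array-based decode path — amounts to something like this sketch (the method name is ours and the charset is fixed to UTF-8 for the example; the real StringCoding path dispatches on the caller's charset):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class UntrustedDecode {
    // Copy the buffer's remaining bytes into a byte[] once, then hand them
    // to the existing array-based String construction, which is the fast
    // path for every charset implementation.
    static String decodeUntrusted(ByteBuffer bb) {
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes); // advances position to limit, matching the proposed spec
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.wrap("abc".getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeUntrusted(bb)); // abc
        System.out.println(bb.remaining());      // 0
    }
}
```

The single bulk get() costs one copy, but it avoids the slow ByteBuffer-based decode loop and works identically for heap and direct buffers.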
Re: Lower overhead String encoding/decoding
Hi, On 09/22/2014 01:25 PM, Richard Warburton wrote: Hi all, A long-standing issue with Strings in Java is the ease and performance of creating a String from a ByteBuffer. People who are using nio to bring in data off the network will be receiving that data in the form of ByteBuffers and converting it to some form of String. For example restful systems receiving XML or Json messages. The current workaround is to create a byte[] from the ByteBuffer - a copying action for any direct ByteBuffer - and then pass that to the String. An alternative is to use CharsetDecoder to program a decoding operation on input ByteBuffer(s), writing the result to CharBuffer(s). If the resulting CharBuffer is a single object (big enough), it can be converted to String via simple CharBuffer.toString(). Which is a copying operation. In situations where the number of resulting characters can be anticipated in advance (like when we know in advance the number of bytes to be decoded and the charset used has a fixed number of bytes per char, or a nearly fixed one like UTF-8), a simple static utility method somewhere in the java.nio package could be used to optimize this operation: public static String decodeString(CharsetDecoder dec, ByteBuffer in) throws CharacterCodingException { CharBuffer cb = dec.decode(in); if (cb.length() == cb.hb.length) { // optimized no-copy String construction return SharedSecrets.getJavaLangAccess().newStringUnsafe(cb.hb); } else { return cb.toString(); } } I'd like to propose that we add an additional constructor to the String class that takes a ByteBuffer as an argument, and directly create the char[] value inside the String from the ByteBuffer. Similarly if you have a String that you want to encode onto the wire then you need to call String.getBytes(), then write your byte[] into a ByteBuffer or send it over the network.
Again, an alternative is to use CharBuffer.wrap(CharSequence cs, int start, int end) to wrap a String with a CharBuffer facade and then use CharsetEncoder to encode it directly into a resulting ByteBuffer. No additional copying needed. Regards, Peter This ends up allocating a byte[] to do the copy and also trimming the byte[] back down again, usually allocating another byte[]. To address this problem I've added a couple of getBytes() overloads that take byte[] and ByteBuffer arguments and write directly to those buffers. I've put together a patch that implements this to demonstrate the overall direction. http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-5/ I'm happy to take any feedback on the direction or the patch itself or the overall idea/approach. I think there are a number of similar API situations in other places as well, for example StringBuffer/StringBuilder instances could have similar appends directly from ByteBuffer instances instead of byte[] instances. I'll also be at JavaOne next week, so if you want to talk about this, just let me know. regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto PS: I appreciate that I'm adding a method to the public API, which consequently requires standardisation, but I think this could get incorporated into the Java 9 umbrella JSR.
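The two workarounds Peter describes can be sketched with only the existing public API (class and method names are ours; UTF-8 is assumed, and checked exceptions are wrapped for brevity):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class CodecWorkarounds {
    // Decode all remaining bytes of a ByteBuffer to a String; the only
    // copy left is the one inside CharBuffer.toString().
    static String decode(ByteBuffer in) {
        try {
            return StandardCharsets.UTF_8.newDecoder().decode(in).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Encode a CharSequence directly into a freshly allocated ByteBuffer
    // by wrapping it with a CharBuffer facade -- no intermediate byte[].
    static ByteBuffer encode(CharSequence s) {
        try {
            return StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer in = ByteBuffer.wrap("h\u00e9llo".getBytes(StandardCharsets.UTF_8));
        System.out.println(decode(in));              // héllo
        System.out.println(in.remaining());          // 0: position advanced to limit
        System.out.println(encode("h\u00e9llo").remaining()); // 6 UTF-8 bytes
    }
}
```

Note both helpers create a fresh coder per call; the proposed String methods would additionally reuse cached coders, which is part of their performance argument.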
Re: Lower overhead String encoding/decoding
Hi Alan, Thanks for the feedback. The direction seems reasonable but I wonder about the offset/length (and destOffset) parameters as this isn't consistent with how ByteBuffers were originally intended to be used. That is, when you read the bytes from the wire into a ByteBuffer and flip it then the position and limit will delimit the bytes to be decoded. If the constructor is changed to String(ByteBuffer in, Charset cs) and decodes the remaining bytes in the buffer to a String using the specified Charset then I think it would be more consistent. Also I think this would give you a solution to the underflow case. I've updated the webrev to reflect this, removing the offset and length parameters and using position() and limit() instead. http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-6/ Similarly if getBytes is replaced with a getBytes or encode(ByteBuffer, Charset cs) then it would encode as many characters as possible into the output buffer, and I think it would be more consistent and also help with the overflow case. I've also applied this to the getBytes() method. I chose the getBytes() method name for consistency with the existing getBytes() method that returns a byte[]. To my mind encode() is a more natural name for the method, which you mention in your email; do people have a preference here? regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
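The read-then-flip convention Alan refers to, with the position..limit window delimiting the bytes to decode, looks like this with the existing CharsetDecoder API (names are ours; the proposed String(ByteBuffer, Charset) constructor would replace the helper's body):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class FlipDecode {
    // Decode whatever the buffer's position..limit window delimits --
    // the semantics Alan suggests for String(ByteBuffer, Charset).
    static String decodeRemaining(ByteBuffer buf) {
        try {
            return StandardCharsets.UTF_8.newDecoder().decode(buf).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Simulate a network read: write bytes into the buffer, then flip()
        // so that position and limit delimit exactly the bytes received.
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.put("json payload".getBytes(StandardCharsets.UTF_8));
        buf.flip();
        System.out.println(decodeRemaining(buf)); // json payload
        System.out.println(buf.hasRemaining());   // false: position advanced to limit
    }
}
```

Because decoding consumes the window, a caller can detect the underflow case Alan mentions by checking whether the position actually reached the limit.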
Re: Lower overhead String encoding/decoding
Hi Richard, couple of comments after a quick scan. (1) #474, the IOBE probably is no longer needed in the case of the ByteBuffer parameter. (2) For methods that have the ByteBuffer as input, it would be desirable to specify clearly that the bytes are read from position to limit, and whether or not the position will be advanced after decoding/encoding (important). (3) getBytes(byte[], offset, charset): while I understand it might be useful to have such a method in certain circumstances, it is usually complicated to make it right/easy for the user to actually use. Consider the fact that the returned number of bytes encoded has no straightforward connection to how many underlying chars have been encoded (so the user of this method really has no idea how many underlying chars have been encoded into the dest buffer; is that enough? how big does the buffer need to be to encode the whole string?), especially the possibility that the last couple bytes of space might fall short of encoding a full char. Unlike getChars(), which has an easy, clear and direct link between the outgoing chars and the underlying chars. I would suggest it might be better to leave it out. (4) StringCoding.decode() #239: remaining() should be used to return limit - position? (5) In the case of untrusted, it might be more straightforward to get all the bytes out of the buffer first (you are allocating a byte buffer here anyway; I don't see an obvious benefit to getting a direct buffer, btw) and then pass them to the original/existing byte[]-to-char[] decoding implementation. We probably will take a deep look at the implementation later when the public API is settled. -Sherman Richard Warburton wrote: Hi Alan, Thanks for the feedback. The direction seems reasonable but I wonder about the offset/length (and destOffset) parameters as this isn't consistent with how ByteBuffers were originally intended to be used. That is, when you read the bytes from the wire into a ByteBuffer and flip it then the position and limit will delimit the bytes to be decoded.
If the constructor is changed to String(ByteBuffer in, Charset cs) and decodes the remaining bytes in the buffer to a String using the specified Charset then I think it would be more consistent. Also I think this would give you a solution to the underflow case. I've updated the webrev to reflect this, removing the offset and length parameters and using position() and limit() instead. http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-6/ Similarly, if getBytes is replaced with a getBytes or encode(ByteBuffer, Charset cs) then it would encode as many characters as possible into the output buffer, which I think would be more consistent and would also help with the overflow case. I've also applied this to the getBytes() method. I chose the getBytes() method name for consistency with the existing getBytes() method that returns a byte[]. To my mind encode() is a more natural name for the method, which you mention in your email; do people have a preference here? regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto
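Sherman's point (5) above — copy the untrusted buffer's remaining bytes out first and reuse the existing byte[]-based decoding path — can be sketched with today's API. The class and method names here are illustrative, and UTF-8 is an assumed charset; this is not the patch's actual code:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class UntrustedBufferDecode {
    // Hypothetical helper: drain the untrusted buffer's remaining bytes into
    // a private byte[] first, then reuse the existing byte[] -> String
    // decoding path. duplicate() leaves the caller's position untouched.
    static String decode(ByteBuffer untrusted) {
        byte[] copy = new byte[untrusted.remaining()];
        untrusted.duplicate().get(copy);
        return new String(copy, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        direct.put("hello".getBytes(StandardCharsets.UTF_8)).flip();
        System.out.println(decode(direct));      // prints: hello
        System.out.println(direct.remaining());  // prints: 5 (position unchanged)
    }
}
```

Working on a duplicate also sidesteps the question of whether the position is advanced, one of the specification points raised above.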
Re: Lower overhead String encoding/decoding
On 9/25/14 9:24 AM, Xueming Shen wrote: Hi Richard, a couple of comments after a quick scan. (1) #474, the IOBE is probably no longer needed in the case of the ByteBuffer parameter. (2) For methods that take a ByteBuffer as input, it would be desirable to specify clearly that the bytes are read from position to limit, and whether or not the position will be advanced after decoding/encoding (important). (3) getBytes(byte[], offset, charset): while I understand it might be useful to have such a method in certain circumstances, it is usually complicated to make it right/easy for users to actually use. Consider the fact that the returned number of bytes encoded has no straightforward connection to how many underlying chars have been encoded (so the user of this method really has no idea how many underlying chars have been encoded into the dest buffer; is that enough? how big does the buffer need to be to encode the whole string?), especially the possibility that the last couple of bytes of space might be too short to encode a char. Unlike getChars(), which has an easy, clear and direct link between the outgoing chars and the underlying chars. I would suggest it might be better to leave it out. I would assume this is true for the ByteBuffer version as well... I just doubt the usefulness of a method that might only give you the bytes of part of the target string object. Given that the getBytes(ByteBuffer) version has the additional position/limit info from the buffer itself, a workaround is to return the number of underlying chars copied... then you need one more parameter to let the user copy from an index on the next round. -sherman (4) StringCoding.decode() #239: shouldn't remaining() be used instead of limit - position? (5) In the case of an untrusted buffer, it might be more straightforward to get all the bytes out of the buffer first (you are allocating a byte buffer here anyway; I don't see an obvious benefit to getting a direct buffer, btw) and then pass them to the original/existing byte[]-to-char[] decoding implementation.
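Sherman's worry that the last couple of bytes of space might be too short to encode a char is observable with the existing CharsetEncoder API, which also shows how the buffers' positions report how many chars were consumed and how many bytes were written — the resume information a bare return count cannot carry. A minimal sketch (the input string and buffer size are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class PartialEncode {
    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer in = CharBuffer.wrap("h\u00e9llo");   // 'é' needs 2 bytes in UTF-8
        ByteBuffer out = ByteBuffer.allocate(2);         // deliberately too small

        // Encoding stops with OVERFLOW: 'h' fits, but the remaining single
        // byte of space is too short for 'é', so it is left unused.
        CoderResult result = enc.encode(in, out, false);
        System.out.println(result.isOverflow());  // prints: true
        System.out.println(in.position());        // prints: 1 (chars consumed so far)
        System.out.println(out.position());       // prints: 1 (one byte of space wasted)
    }
}
```

With a plain byte[] destination and a single int return value, none of this resume state would be visible to the caller, which is the crux of point (3).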
Lower overhead String encoding/decoding
Hi all, A long-standing issue with Strings in Java is the ease and performance of creating a String from a ByteBuffer. People who are using nio to bring in data off the network will be receiving that data in the form of ByteBuffers and converting it to some form of String. For example, RESTful systems receiving XML or JSON messages. The current workaround is to create a byte[] from the ByteBuffer - a copying action for any direct ByteBuffer - and then pass that to the String. I'd like to propose that we add an additional constructor to the String class that takes a ByteBuffer as an argument, and directly create the char[] value inside the String from the ByteBuffer. Similarly, if you have a String that you want to encode onto the wire then you need to call String.getBytes(), then write your byte[] into a ByteBuffer or send it over the network. This ends up allocating a byte[] to do the copy and also trimming the byte[] back down again, usually allocating another byte[]. To address this problem I've added a couple of getBytes() overloads that take byte[] and ByteBuffer arguments and write directly to those buffers. I've put together a patch that implements this to demonstrate the overall direction. http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-5/ I'm happy to take any feedback on the direction or the patch itself or the overall idea/approach. I think there are a number of similar API situations in other places as well, for example StringBuffer/StringBuilder instances, which could have similar appends directly from ByteBuffer instances instead of byte[] instances. I'll also be at JavaOne next week, so if you want to talk about this, just let me know. regards, Richard Warburton http://insightfullogic.com @RichardWarburto http://twitter.com/richardwarburto PS: I appreciate that since I'm adding a method to the public API this consequently requires standardisation, but I think that this could get incorporated into the Java 9 umbrella JSR.
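The workaround described above — copying the ByteBuffer into a byte[] and handing that to String — looks roughly like this sketch (the method name and the charset choice are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CurrentWorkaround {
    // For a direct buffer, get() copies every byte onto the heap before
    // String even starts decoding; new String() then copies again while
    // decoding into its internal char[] value.
    static String byteBufferToString(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer wire = ByteBuffer.allocateDirect(64);
        wire.put("{\"msg\":\"hi\"}".getBytes(StandardCharsets.UTF_8)).flip();
        System.out.println(byteBufferToString(wire));  // prints: {"msg":"hi"}
    }
}
```

The proposed String(ByteBuffer, Charset) constructor would remove the intermediate byte[] allocation and copy entirely.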
Re: Lower overhead String encoding/decoding
On 22/09/2014 12:25, Richard Warburton wrote: : I've put together a patch that implements this to demonstrate the overall direction. http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-5/ I'm happy to take any feedback on the direction or the patch itself or the overall idea/approach. I think there are a number of similar API situations in other places as well, for example StringBuffer/StringBuilder instances, which could have similar appends directly from ByteBuffer instances instead of byte[] instances. The direction seems reasonable but I wonder about the offset/length (and destOffset) parameters as this isn't consistent with how ByteBuffers were originally intended to be used. That is, when you read the bytes from the wire into a ByteBuffer and flip it then the position and limit will delimit the bytes to be decoded. If the constructor is changed to String(ByteBuffer in, Charset cs) and decodes the remaining bytes in the buffer to a String using the specified Charset then I think it would be more consistent. Also I think this would give you a solution to the underflow case. Similarly, if getBytes is replaced with a getBytes or encode(ByteBuffer, Charset cs) then it would encode as many characters as possible into the output buffer, which I think would be more consistent and would also help with the overflow case. -Alan.
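The read-flip-decode convention Alan describes is already what Charset.decode implements in the existing API: it consumes the bytes from position to limit and advances the position. A small illustrative sketch:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class FlipAndDecode {
    public static void main(String[] args) {
        // Simulate a read off the wire into a buffer, then flip:
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.put("hello".getBytes(StandardCharsets.US_ASCII));
        buf.flip();   // position=0, limit=5: exactly the bytes to decode

        // Charset.decode consumes from position to limit and advances
        // the position, the semantics Alan suggests for the constructor.
        CharBuffer chars = StandardCharsets.US_ASCII.decode(buf);
        System.out.println(chars);            // prints: hello
        System.out.println(buf.remaining());  // prints: 0 (position at limit)
    }
}
```

A String(ByteBuffer, Charset) constructor following the same convention would behave like this but build the String's value directly, without the intermediate CharBuffer.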
Re: Lower overhead String encoding/decoding
Compare with https://bugs.openjdk.java.net/browse/JDK-6695386 Maybe that would help a little. -Ulf On 22.09.2014 at 13:25, Richard Warburton wrote: : Hi all, A long-standing issue with Strings in Java is the ease and performance of creating a String from a ByteBuffer. [...]