Re: Lower overhead String encoding/decoding

2015-06-04 Thread Richard Warburton
Hi,

 I'm still planning to be the sponsor for this RFE and get this one
 into JDK 9. We are currently working on JEP 254 [1][2][3], in which
 the internal storage mechanism is changed from char[] to byte[]. If
 this JEP gets approved and is scheduled to target JDK 9, the
 decoding/encoding pieces will also need to be updated/changed
 accordingly. Given that, I would prefer to delay this RFE a little
 longer until all the pieces are settled.

 Do we actually need to wait for Compact Strings to land? The failure
 scenario would be that Compact Strings do not make it into JDK 9, and
 this code lingers in history.

 I mean it's fine to see if there is a clash between the two, but I don't
 think Richard's change should be blocked by Compact Strings, given it is
 ready for integration (?), and Compact Strings are not yet.


Thanks for your replies gents - most appreciated. IMO the changes are
orthogonal at the moment. If Compact Strings happens then this patch will
probably need to be extended, but I imagine the implemented code at least
covers the non-compact case and will remain - as would the API decisions and
Javadoc. Of course that's just my opinion and I'm not as informed on the
Compact Strings work as you are ;)

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto


Re: Lower overhead String encoding/decoding

2015-06-03 Thread Richard Warburton
Hi gents,

My apologies for the delay; things are a little slow during this period of
 the year :-)
 I will sponsor the RFE and prepare for the CCC (internally). We may need to
 go through the API and the implementation one more time.


 I was just wondering if there was any news on the CCC for this patch?


Just tracking back on this front as I've not had a reply for over 3 months.
It would be good to know the status of this patch.

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto


Re: Lower overhead String encoding/decoding

2015-06-03 Thread Xueming Shen

Richard,

I'm still planning to be the sponsor for this RFE and get this one into
JDK 9. We are currently working on JEP 254 [1][2][3], in which the internal
storage mechanism is changed from char[] to byte[]. If this JEP gets
approved and is scheduled to target JDK 9, the decoding/encoding pieces
will also need to be updated/changed accordingly. Given that, I would
prefer to delay this RFE a little longer until all the pieces are settled.


Regards,
Sherman

[1] http://openjdk.java.net/jeps/254
[2] https://bugs.openjdk.java.net/browse/JDK-8054307
[3] http://hg.openjdk.java.net/jdk9/sandbox/

On 6/3/15 9:14 AM, Richard Warburton wrote:

Hi gents,

My apologies for the delay; things are a little slow during this
period of the year :-)
I will sponsor the RFE and prepare for the CCC (internally).
We may need to go through the API and the implementation one more time.


I was just wondering if there was any news on the CCC for this patch?


Just tracking back on this front as I've not had a reply for over 3 
months. It would be good to know the status of this patch.


regards,

  Richard Warburton

http://insightfullogic.com
@RichardWarburto http://twitter.com/richardwarburto

https://bugs.openjdk.java.net/browse/JDK-8072392


Re: Lower overhead String encoding/decoding

2015-06-03 Thread Aleksey Shipilev
On 06/03/2015 07:34 PM, Xueming Shen wrote:
 I'm still planning to be the sponsor for this RFE and get this one
 into JDK 9. We are currently working on JEP 254 [1][2][3], in which
 the internal storage mechanism is changed from char[] to byte[]. If
 this JEP gets approved and is scheduled to target JDK 9, the
 decoding/encoding pieces will also need to be updated/changed
 accordingly. Given that, I would prefer to delay this RFE a little
 longer until all the pieces are settled.

Do we actually need to wait for Compact Strings to land? The failure
scenario would be that Compact Strings do not make it into JDK 9, and
this code lingers in history.

I mean it's fine to see if there is a clash between the two, but I don't
think Richard's change should be blocked by Compact Strings, given it is
ready for integration (?), and Compact Strings are not yet.

Thanks,
-Aleksey



Re: Lower overhead String encoding/decoding

2015-02-21 Thread Richard Warburton
Hi Xueming,

My apologies for the delay; things are a little slow during this period of
 the year :-)
 I will sponsor the RFE and prepare for the CCC (internally). We may need to
 go through the API and the implementation one more time.


I was just wondering if there was any news on the CCC for this patch?

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto


Re: Lower overhead String encoding/decoding

2014-12-31 Thread Xueming Shen

On 12/30/14 3:25 PM, Richard Warburton wrote:

Hi,

Thanks for looking at this patch and agreeing to sponsor - I've
pushed fixes for these issues at:

http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-10/


Hope everyone has had a happy holiday period. Just following up on 
this issue to make sure that you have all the information needed and 
I've not been blocking the RFE process. Got a bit concerned due to 
radio silence.


regards,

  Richard Warburton

http://insightfullogic.com
@RichardWarburto http://twitter.com/richardwarburto

Hi Richard,

My apologies for the delay; things are a little slow during this period of
the year :-)
I will sponsor the RFE and prepare for the CCC (internally). We may need to
go through the API and the implementation one more time.

Regards,
-Sherman


Re: Lower overhead String encoding/decoding

2014-12-30 Thread Richard Warburton
Hi,

Thanks for looking at this patch and agreeing to sponsor - I've pushed
 fixes for these issues at:

 http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-10/


Hope everyone has had a happy holiday period. Just following up on this
issue to make sure that you have all the information needed and I've not
been blocking the RFE process. Got a bit concerned due to radio silence.

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto


Re: Lower overhead String encoding/decoding

2014-11-26 Thread Richard Warburton
Hi gents,

Here are some comments regarding the updated webrev.

 (1) String(ByteBuffer, String) needs @throws UEE.
 (2) It should be the default replacement byte array, not the replacement
 string, for all getBytes() methods when input is malformed/unmappable.
 (3) For decoding (new String) from a ByteBuffer, since it is guaranteed
 that all bytes in the input ByteBuffer will be decoded, it might be
 desirable to clearly specify that the position of the buffer will be
 advanced to its limit?
 (4) ln#1137 has an extra *
 (5) StringCoding.decode(cs, bb): why do you want to allocate a direct
 buffer here for untrusted? Basically the output buffer ca will always be a
 wrapper of a char[]; all our charset implementations will have better
 performance if both input and output are array-based (decode on the array
 directly, instead of via the slow ByteBuffer).

 Overall I think this is looking quite good and I think Sherman has
 captured the main issues. On #3, wording such as "the position will be
 updated" isn't clear enough to allow the method to be tested; it needs to
 make it clear that the position is advanced by the number of bytes that
 were decoded.


Thanks for looking at this patch and agreeing to sponsor - I've pushed
fixes for these issues at:

http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-10/

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto


Re: Lower overhead String encoding/decoding

2014-11-24 Thread Xueming Shen

Hi Richard,

Here are some comments regarding the updated webrev.

(1) String(ByteBuffer, String) needs @throws UEE.
(2) It should be the default replacement byte array, not the replacement
string, for all getBytes() methods when input is malformed/unmappable.
(3) For decoding (new String) from a ByteBuffer, since it is guaranteed
that all bytes in the input ByteBuffer will be decoded, it might be
desirable to clearly specify that the position of the buffer will be
advanced to its limit?
(4) ln#1137 has an extra *
(5) StringCoding.decode(cs, bb): why do you want to allocate a direct
buffer here for untrusted? Basically the output buffer ca will always be a
wrapper of a char[]; all our charset implementations will have better
performance if both input and output are array-based (decode on the array
directly, instead of via the slow ByteBuffer).

-Sherman

On 11/24/2014 08:45 AM, Richard Warburton wrote:

Hi all,

I'm sure everyone is very busy, but is there any chance that someone could
take a look at the latest iteration of this patch?

Thanks for taking the time to look at this - most appreciated. I've pushed
the latest iteration to
http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-8/.

  I think this is looking good.

Thanks - I've pushed a new iteration to
http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-9.

For the constructor, the words "decoding the specified byte buffer"
might be a bit clearer as "decoding the remaining bytes in the ..."

For getBytes(ByteBuffer, Charset), the position is advanced by the bytes
written; no need to mention the number of chars read here.

In the constructor you make it clear that malformed/unmappable sequences
use the default replacement. This is important to state in the getBytes
methods too, because the encoding can fail.



I've made all these changes.

Hi Richard, hi all,

Two comments,
You replace the null check in getBytes() with a requireNonNull and at the
same time you don't use requireNonNull in String(ByteBuffer, Charset) -
not very logical, is it?


Thanks for spotting this Remi - I've fixed this issue in my latest
iteration.

I think you need a supplementary constructor that takes a ByteBuffer and a
charset name as a String.
I may be wrong, but it seems that every constructor of String that takes
a Charset has a dual constructor that takes the charset as a String.
As far as I remember, a constructor that takes the charset as a String may
be faster because you can share the character decoder
instead of creating a new one.


A good observation. I've added variants which take a String for the
charset name as well as the Charset variants.

Having said that - wouldn't it also be a good idea to replicate the
caching for the Charset variants as well as the charset-name variants? I
don't see any obvious reason why this isn't possible, but perhaps there's
something I'm missing here. Probably cleaner as a separate patch either way.


regards,

   Richard Warburton

   http://insightfullogic.com
   @RichardWarburto http://twitter.com/richardwarburto




Re: Lower overhead String encoding/decoding

2014-11-24 Thread Alan Bateman

On 24/11/2014 18:41, Xueming Shen wrote:

Hi Richard,

Here are some comments regarding the updated webrev.

(1) String(ByteBuffer, String) needs @throws UEE.
(2) It should be the default replacement byte array, not the replacement
string, for all getBytes() methods when input is malformed/unmappable.
(3) For decoding (new String) from a ByteBuffer, since it is guaranteed
that all bytes in the input ByteBuffer will be decoded, it might be
desirable to clearly specify that the position of the buffer will be
advanced to its limit?
(4) ln#1137 has an extra *
(5) StringCoding.decode(cs, bb): why do you want to allocate a direct
buffer here for untrusted? Basically the output buffer ca will always be a
wrapper of a char[]; all our charset implementations will have better
performance if both input and output are array-based (decode on the array
directly, instead of via the slow ByteBuffer).
Overall I think this is looking quite good and I think Sherman has
captured the main issues. On #3, wording such as "the position will be
updated" isn't clear enough to allow the method to be tested; it needs
to make it clear that the position is advanced by the number of bytes
that were decoded.
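As a point of reference, the existing Charset.decode(ByteBuffer) already has the semantics being asked for here: it consumes the buffer from position to limit and leaves the position advanced. A small sketch using only standard APIs (not the patch code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class PositionDemo {
    public static void main(String[] args) {
        ByteBuffer in = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));
        // Charset.decode reads from position to limit; afterwards the
        // buffer's position has advanced by the number of bytes decoded.
        String s = StandardCharsets.UTF_8.decode(in).toString();
        System.out.println(s);
        System.out.println(in.remaining()); // 0: position advanced to limit
    }
}
```

Specifying the constructor in those terms makes it directly testable: decode, then assert remaining() is zero.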


Sherman - are you going to sponsor this for Richard?

-Alan


Re: Lower overhead String encoding/decoding

2014-11-24 Thread Xueming Shen

On 11/24/2014 01:21 PM, Alan Bateman wrote:

On 24/11/2014 18:41, Xueming Shen wrote:

Hi Richard,

Here are some comments regarding the updated webrev.

(1) String(ByteBuffer, String) needs @throws UEE.
(2) It should be the default replacement byte array, not the replacement
string, for all getBytes() methods when input is malformed/unmappable.
(3) For decoding (new String) from a ByteBuffer, since it is guaranteed
that all bytes in the input ByteBuffer will be decoded, it might be
desirable to clearly specify that the position of the buffer will be
advanced to its limit?
(4) ln#1137 has an extra *
(5) StringCoding.decode(cs, bb): why do you want to allocate a direct
buffer here for untrusted? Basically the output buffer ca will always be a
wrapper of a char[]; all our charset implementations will have better
performance if both input and output are array-based (decode on the array
directly, instead of via the slow ByteBuffer).

Overall I think this is looking quite good and I think Sherman has
captured the main issues. On #3, wording such as "the position will be
updated" isn't clear enough to allow the method to be tested; it needs to
make it clear that the position is advanced by the number of bytes that
were decoded.

Sherman - are you going to sponsor this for Richard?



Yes, I will sponsor this RFE.

-Sherman


Re: Lower overhead String encoding/decoding

2014-11-15 Thread Richard Warburton
Hi all,

Thanks for taking the time to look at this - most appreciated. I've pushed
 the latest iteration to
 http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-8/.

  I think this is looking good.


Thanks - I've pushed a new iteration to
http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-9.

For the constructor, the words "decoding the specified byte buffer"
 might be a bit clearer as "decoding the remaining bytes in the ..."

 For getBytes(ByteBuffer, Charset), the position is advanced by the bytes
 written; no need to mention the number of chars read here.

 In the constructor you make it clear that malformed/unmappable sequences
 use the default replacement. This is important to state in the getBytes
 methods too, because the encoding can fail.


I've made all these changes.

Hi Richard, hi all,
 Two comments,
 You replace the null check in getBytes() with a requireNonNull and at the
 same time you don't use requireNonNull in String(ByteBuffer, Charset) -
 not very logical, is it?


Thanks for spotting this Remi - I've fixed this issue in my latest
iteration.

I think you need a supplementary constructor that takes a ByteBuffer and a
 charset name as a String.
 I may be wrong, but it seems that every constructor of String that takes a
 Charset has a dual constructor that takes the charset as a String.
 As far as I remember, a constructor that takes the charset as a String may
 be faster because you can share the character decoder
 instead of creating a new one.


A good observation. I've added variants which take a String for the charset
name as well as the Charset variants.

Having said that - wouldn't it also be a good idea to replicate the caching
for the Charset variants as well as the charset-name variants? I don't see
any obvious reason why this isn't possible, but perhaps there's something
I'm missing here. Probably cleaner as a separate patch either way.

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto


Re: Lower overhead String encoding/decoding

2014-11-15 Thread Remi Forax


On 11/15/2014 07:07 PM, Richard Warburton wrote:

Hi all,



[...]



A good observation. I've added variants which take a String for the charset
name as well the charset variants.

Having said that - wouldn't it also be a good idea to replicate the caching
on the charset versions as well as the charset name? I don't see any
obvious reason why this isn't possible but perhaps there's something I'm
missing here. Probably cleaner as a separate patch either way.

regards,


I think it will create a security hazard: if a rogue charset declares
itself as UTF-8 and creates a malicious charset decoder, you don't want
this decoder to be shared and reused.


Rémi



   Richard Warburton

   http://insightfullogic.com
   @RichardWarburto http://twitter.com/richardwarburto




Re: Lower overhead String encoding/decoding

2014-11-15 Thread Richard Warburton
Hi,

Having said that - wouldn't it also be a good idea to replicate the caching
 for the Charset variants as well as the charset-name variants? I don't see
 any obvious reason why this isn't possible, but perhaps there's something
 I'm missing here. Probably cleaner as a separate patch either way.

 regards,


 I think it will create a security hazard: if a rogue charset declares
 itself as UTF-8 and creates a malicious charset decoder, you don't want
 this decoder to be shared and reused.


When you're passing the name of the charset as a String, you look up
the charset by name, but if you were using Charsets then you could check
that your cache has the same Charset instance. Wouldn't that avoid this
security issue?
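A minimal sketch of what such identity-keyed caching might look like (the cache shape and names here are hypothetical illustrations, not code from the patch):

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class DecoderCache {
    // Hypothetical identity-keyed cache: a rogue Charset that merely *names*
    // itself "UTF-8" is still a different object, so it can never hit the
    // cached trusted decoder.
    private static final Charset TRUSTED = StandardCharsets.UTF_8;
    private static final ThreadLocal<CharsetDecoder> CACHED = new ThreadLocal<>();

    static CharsetDecoder decoderFor(Charset cs) {
        if (cs == TRUSTED) { // identity comparison, not a name lookup
            CharsetDecoder d = CACHED.get();
            if (d == null) {
                d = cs.newDecoder();
                CACHED.set(d);
            }
            return d.reset();
        }
        return cs.newDecoder(); // anything else gets a fresh decoder
    }

    public static void main(String[] args) {
        // Same trusted Charset instance -> the cached decoder is reused.
        System.out.println(decoderFor(StandardCharsets.UTF_8)
                == decoderFor(StandardCharsets.UTF_8));
    }
}
```

The key design point is comparing Charset instances with ==, so a lookup by name never substitutes an untrusted implementation.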

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto


Re: Lower overhead String encoding/decoding

2014-10-28 Thread Alan Bateman

On 26/10/2014 21:10, Richard Warburton wrote:

:

Thanks for taking the time to look at this - most appreciated. I've
pushed the latest iteration to
http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-8/.



I think this is looking good.

For the constructor, the words "decoding the specified byte buffer"
might be a bit clearer as "decoding the remaining bytes in the ..."

For getBytes(ByteBuffer, Charset), the position is advanced by the bytes
written; no need to mention the number of chars read here.

In the constructor you make it clear that malformed/unmappable sequences
use the default replacement. This is important to state in the getBytes
methods too, because the encoding can fail.


-Alan.




Re: Lower overhead String encoding/decoding

2014-10-28 Thread Remi Forax


On 10/28/2014 03:52 PM, Alan Bateman wrote:

On 26/10/2014 21:10, Richard Warburton wrote:

:

Thanks for taking the time to look at this - most appreciated. I've
pushed the latest iteration to
http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-8/.



I think this is looking good.

For the constructor, the words "decoding the specified byte buffer"
might be a bit clearer as "decoding the remaining bytes in the ..."

For getBytes(ByteBuffer, Charset), the position is advanced by the bytes
written; no need to mention the number of chars read here.

In the constructor you make it clear that malformed/unmappable sequences
use the default replacement. This is important to state in the getBytes
methods too, because the encoding can fail.


-Alan.




Hi Richard, hi all,
Two comments,
You replace the null check in getBytes() with a requireNonNull and at the
same time you don't use requireNonNull in String(ByteBuffer, Charset) -
not very logical, is it?
I think you need a supplementary constructor that takes a ByteBuffer and
a charset name as a String.
I may be wrong, but it seems that every constructor of String that takes
a Charset has a dual constructor that takes the charset as a String.
As far as I remember, a constructor that takes the charset as a String may
be faster because you can share the character decoder
instead of creating a new one.

cheers,
Rémi



Re: Lower overhead String encoding/decoding

2014-10-27 Thread Xueming Shen

On 10/26/2014 02:10 PM, Richard Warburton wrote:




The getBytes(ByteBuffer, Charset) method needs a bit more javadoc. You
should be able to borrow some text from the CharsetEncoder#encode methods
to help with that.


I've updated the Javadoc with more information about the encoding and made
these changes. I'm not sure if there's anything else that's missing in this
case.



Hi Richard,

You have "...reflect the characters read and the byte written..." for
getBytes(ByteBuffer); "the characters read and" should be deleted?

I'm not that sure it is necessary to document clearly that there would be
no dangling bytes in the case of encoding/getBytes() when there is not
enough room/space to encode a full char...

-Sherman



Re: Lower overhead String encoding/decoding

2014-10-26 Thread Richard Warburton
Hi Alan,

I think this looks better but I have a few comments on the API.


Thanks for taking the time to look at this - most appreciated. I've pushed
the latest iteration to
http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-8/.

For String(ByteBuffer, Charset) then it's still inconsistent to read from
 the buffer position but not advance it. I think the constructor needs to
 work like a regular decode in that respect. Related to this is the
 underflow case where there are insufficient remaining bytes to complete. If
 you don't advance the position then there is no way to detect this.


I've modified the Javadoc and tests. The implementation was actually
behaving this way already.

The statement about the length of the String ".. may not be equal to the
 length of the subarray" might be there from a previous iteration. There
 isn't any array in the method signature, so I think this statement needs to
 be made a bit clearer.


I've rephrased this to refer to the ByteBuffer.

For getBytes(byte[], int, int, Charset), I think the javadoc could say
 a bit more. It will encode to a maximum of destBuffer.length - destOffset,
 for example. The @return should probably say that it's the number of bytes
 written to the array rather than copied. A minor nit is that you probably
 don't want the starting p. Also the finals in the signature seem
 noisy/not needed.


 The getBytes(ByteBuffer, Charset) method needs a bit more javadoc. You
 should be able to borrow from text from the CharsetEncoder#encode methods
 to help with that.


I've updated the Javadoc with more information about the encoding and made
these changes. I'm not sure if there's anything else that's missing in this
case.

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto


Re: Lower overhead String encoding/decoding

2014-10-20 Thread Alan Bateman

On 19/10/2014 23:13, Richard Warburton wrote:

Hi,

Hi Richard, couple comments after a quick scan.
Thanks for your comments - only just had a post-JavaOne chance to look at
this stuff. I've pushed an update to:

http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-7


I think this looks better but I have a few comments on the API.

For String(ByteBuffer, Charset) then it's still inconsistent to read 
from the buffer position but not advance it. I think the constructor 
needs to work like a regular decode in that respect. Related to this is 
the underflow case where there are insufficient remaining bytes to 
complete. If you don't advance the position then there is no way to 
detect this.


The statement about the length of the String ".. may not be equal to the
length of the subarray" might be there from a previous iteration. There
isn't any array in the method signature, so I think this statement needs
to be made a bit clearer.


For getBytes(byte[], int, int, Charset), I think the javadoc could
say a bit more. It will encode to a maximum of destBuffer.length -
destOffset, for example. The @return should probably say that it's the
number of bytes written to the array rather than copied. A minor nit
is that you probably don't want the starting p. Also the finals in the
signature seem noisy/not needed.


The getBytes(ByteBuffer, Charset) method needs a bit more javadoc. You
should be able to borrow some text from the CharsetEncoder#encode
methods to help with that.


-Alan.



Re: Lower overhead String encoding/decoding

2014-10-19 Thread Richard Warburton
Hi,

Hi Richard, couple comments after a quick scan.


Thanks for your comments - only just had a post-JavaOne chance to look at
this stuff. I've pushed an update to:

http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-7

(1) #474: the IOBE probably is no longer needed in the case of a
 ByteBuffer parameter.


Fixed in new push.

(2) For methods that have a ByteBuffer as input, it would be desirable to
 specify clearly that the bytes are read from position to limit, and
 whether or not the position will be advanced after decoding/encoding
 (important).


Added Javadoc in new push.

(3) getBytes(byte[], offset, charset): while I understand it might be
 useful to have such a method in certain circumstances, it is usually
 complicated to make it right/easy for the user to actually use. Consider
 the fact that the returned number of bytes encoded has no straightforward
 connection to how many underlying chars have been encoded (so the user of
 this method really has no idea how many underlying chars have been
 encoded into the dest buffer - is that enough? how big does the buffer
 need to be to encode the whole string?), especially given the possibility
 that the last couple of bytes of space might fall short of encoding a
 full char. Unlike getChars(), which has an easy, clear and direct link
 between the outgoing chars and the underlying chars. I would suggest it
 might be better to leave it out.


I think you raise some good points here. In my mind this is how it works,
let me know what you think.

The client code of this API call should be able to infer the state of the
byte[] - the offset parameter is in terms of bytes and refers to the
destination buffer. If you want to know how many bytes have been used by
the encoding, the method returns an int that gives that length.

You can query the length of the String, which gives you the number of
outbound chars.
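To illustrate the intended contract, here is a rough emulation of the proposed method using a plain CharsetEncoder; encodeInto and its exact shape are assumptions for illustration, not the actual patch code:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesSketch {
    // Hypothetical shape of the proposed String.getBytes(byte[], int, Charset):
    // encode into dest starting at destOffset and return the number of bytes
    // written (emulated here with a CharsetEncoder, not the real patch code).
    static int encodeInto(String s, byte[] dest, int destOffset, Charset cs) {
        ByteBuffer out = ByteBuffer.wrap(dest, destOffset, dest.length - destOffset);
        cs.newEncoder().encode(CharBuffer.wrap(s), out, true);
        return out.position() - destOffset; // bytes written: the return value
    }

    public static void main(String[] args) {
        byte[] dest = new byte[16];
        int n = encodeInto("hello", dest, 2, StandardCharsets.UTF_8);
        System.out.println(n);
    }
}
```

The returned count tells the caller how far the destination was filled, which is Sherman's point: it says nothing about how many chars were consumed when the buffer is too small.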

(4) StringCoding.decode() #239  remaining() should be used to return
 limit - position?


Fixed in new push.

(5) In the case of untrusted, it might be more straightforward to get all
 bytes out of the buffer first (you are allocating a byte buffer here
 anyway; I don't see an obvious benefit to getting a direct buffer, btw)
 and then pass it to the original/existing byte[]->char[] decoding
 implementation. We probably will take a deep look at the implementation
 later when the public API is settled.


Yes - I was hoping to remove the additional bytebuffer allocation within
the method. I'm still not sure of the security implications around removing
this code though.

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto


Re: Lower overhead String encoding/decoding

2014-09-29 Thread Peter Levart

Hi,

On 09/22/2014 01:25 PM, Richard Warburton wrote:

Hi all,

A long-standing issue with Strings in Java is the ease and performance of
creating a String from a ByteBuffer. People who are using NIO to bring in
data off the network will be receiving that data in the form of ByteBuffers
and converting it to some form of String - for example, RESTful systems
receiving XML or JSON messages.

The current workaround is to create a byte[] from the ByteBuffer - a
copying action for any direct ByteBuffer - and then pass that to the
String.
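The workaround described above can be sketched as follows (a simplified example, not code from the patch):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CopyWorkaround {
    // Today's workaround: copy the (possibly direct) ByteBuffer's remaining
    // bytes into a fresh byte[] before handing them to the String constructor.
    static String toStringViaCopy(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes); // the extra copy this thread is trying to avoid
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        direct.put("hello".getBytes(StandardCharsets.UTF_8));
        direct.flip();
        System.out.println(toStringViaCopy(direct));
    }
}
```

For a direct buffer both the byte[] allocation and the bulk get() are pure overhead, which is what motivates the proposed String(ByteBuffer, Charset) constructor.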


An alternative is to use CharsetDecoder to program a decoding
operation on input ByteBuffer(s), writing the result to CharBuffer(s).
If the resulting CharBuffer is a single object (big enough), it can be
converted to a String via a simple CharBuffer.toString() - which is a
copying operation. In situations where the number of resulting
characters can be anticipated in advance (like when we know in advance
the number of bytes to be decoded and the charset used has a fixed, or
nearly fixed, number of bytes per char, as with UTF-8), a simple static
utility method somewhere in the java.lang.nio package could be used to
optimize this operation:


    public static String decodeString(CharsetDecoder dec, ByteBuffer in)
            throws CharacterCodingException {
        CharBuffer cb = dec.decode(in);
        if (cb.length() == cb.hb.length) {
            // optimized no-copy String construction
            return SharedSecrets.getJavaLangAccess().newStringUnsafe(cb.hb);
        } else {
            return cb.toString();
        }
    }



  I'd like to propose that we add an additional constructor to the
String class that takes a ByteBuffer as an argument, and directly create
the char[] value inside the String from the ByteBuffer.

Similarly if you have a String that you want to encode onto the wire then
you need to call String.getBytes(), then write your byte[] into a
ByteBuffer or send it over the network.


Again, an alternative is to use CharBuffer.wrap(CharSequence cs, int
start, int end) to wrap a String with a CharBuffer facade and then use a
CharsetEncoder to encode it directly into a resulting ByteBuffer. No
additional copying needed.
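A minimal sketch of that encoding path (simplified - a production loop would also check each CoderResult and handle overflow):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class WrapEncode {
    public static void main(String[] args) throws CharacterCodingException {
        String s = "hello";
        // CharBuffer.wrap(CharSequence) creates a read-only facade over the
        // String - no char[] copy - and the encoder writes straight into out.
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(16);
        enc.encode(CharBuffer.wrap(s), out, true);
        enc.flush(out);
        out.flip();
        System.out.println(out.remaining()); // number of encoded bytes
    }
}
```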


Regards, Peter


This ends up allocating a byte[] to
do the copy and also trimming the byte[] back down again, usually
allocating another byte[]. To address this problem I've added a couple of
getBytes() overloads that take byte[] and ByteBuffer arguments and write
directly to those buffers.

I've put together a patch that implements this to demonstrate the overall
direction.

http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-5/

I'm happy to take any feedback on direction or the patch itself or the
overall idea/approach. I think there are a number of similar API situations
in other places as well, for example StringBuffer/StringBuilder instances
which could have similar appends directly from ByteBuffer instances instead
of byte[] instances.

I'll also be at Javaone next week, so if you want to talk about this, just
let me know.

regards,

   Richard Warburton

   http://insightfullogic.com
   @RichardWarburto http://twitter.com/richardwarburto

PS: I appreciate that I'm adding a method to the public API, which
consequently requires standardisation, but I think that this could get
incorporated into the Java 9 umbrella JSR.




Re: Lower overhead String encoding/decoding

2014-09-25 Thread Richard Warburton
Hi Alan,

Thanks for the feedback.

The direction seems reasonable but I wonder about the offset/length (and
 destOffet) parameters as this isn't consistent with how ByteBuffers were
 originally intended to be used. That is, when you read the bytes from the
 wire into a ByteBuffer and flip it then the position and limit will delimit
 the bytes to be decoded.

 If the constructor is changed to String(ByteBuffer in, Charset cs) and
 decodes the remaining bytes in the buffer to a String using the specified
 Charset then I think would be more consistent. Also I think this would give
 you a solution to the underflow case.


I've updated the webrev to reflect this, removing the offset and length
parameters and using position() and limit() instead.

http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-6/

Similarly, if getBytes is replaced with a getBytes or
 encode(ByteBuffer, Charset cs) then it would encode as many characters
 as possible into the output buffer, and I think would be more consistent
 and also help with the overflow case.


I've also applied this to the getBytes() method. I chose the getBytes()
method name for consistency with the existing getBytes() method that
returns a byte[]. To my mind encode() is a more natural name for the
method, which you mention in your email - do people have a preference here?

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto


Re: Lower overhead String encoding/decoding

2014-09-25 Thread Xueming Shen

Hi Richard, a couple of comments after a quick scan.

(1) #474: the IOBE (IndexOutOfBoundsException) is probably no longer needed
in the ByteBuffer case.

(2) For the methods that take a ByteBuffer as input, it would be desirable to
specify clearly that the bytes are read from position to limit, and whether
or not the position is advanced after decoding/encoding (important).

(3) getBytes(byte[], offset, charset): while I understand it might be useful
to have such a method in certain circumstances, it is usually complicated to
make it right/easy for users to actually use. Consider that the returned
number of bytes encoded has no straightforward connection to how many
underlying chars have been encoded (so the caller really has no idea how many
underlying chars were encoded into the dest buffer: is that enough? how big
does the buffer need to be to encode the whole string?), especially given the
possibility that the last couple of bytes of space may fall short of encoding
a char. Unlike getChars(), which has an easy, clear and direct link between
the outgoing chars and the underlying chars. I would suggest it might be
better to leave it out.

(4) StringCoding.decode() #239: remaining() should be used to return
limit - position?

(5) In the untrusted case, it might be more straightforward to get all the
bytes out of the buffer first (you are allocating a byte buffer here anyway;
I don't see an obvious benefit to getting a direct buffer, btw) and then pass
them to the original/existing byte[]-to-char[] decoding implementation. We
will probably take a deep look at the implementation later, once the public
API is settled.

-Sherman
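
Point (3) is easy to reproduce with today's CharsetEncoder: the count of
bytes written does not, on its own, tell the caller how many chars were
consumed. A small sketch (the three-euro-sign string and 4-byte buffer are
chosen purely to force the condition):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class PartialEncode {
    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer src = CharBuffer.wrap("\u20ac\u20ac\u20ac"); // each euro sign is 3 bytes in UTF-8
        ByteBuffer dst = ByteBuffer.allocate(4);                // room for one char plus one stray byte

        CoderResult r = enc.encode(src, dst, true);
        System.out.println(r.isOverflow());  // true: dst filled before src was consumed
        System.out.println(dst.position());  // 3 bytes written, never a partial char
        System.out.println(src.position());  // 1 char consumed -- visible only via the buffer
    }
}
```

A getBytes(byte[], offset, charset) that returned only the byte count would
leave the caller guessing at exactly this consumed-char figure, which is the
gap being pointed at here.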






Re: Lower overhead String encoding/decoding

2014-09-25 Thread Xueming Shen

On 9/25/14 9:24 AM, Xueming Shen wrote:

Hi Richard, couple comments after a quick scan.


(3) getBytes(byte[], offset, charset): while I understand it might be useful
to have such a method in certain circumstances, it is usually complicated to
make it right/easy for users to actually use. Consider that the returned
number of bytes encoded has no straightforward connection to how many
underlying chars have been encoded (so the caller really has no idea how many
underlying chars were encoded into the dest buffer: is that enough? how big
does the buffer need to be to encode the whole string?), especially given the
possibility that the last couple of bytes of space may fall short of encoding
a char. Unlike getChars(), which has an easy, clear and direct link between
the outgoing chars and the underlying chars. I would suggest it might be
better to leave it out.


I would assume this is true for the ByteBuffer version as well... I just
doubt the usefulness of a method that might only give you the bytes of part
of the target String object. Given that the getBytes(ByteBuffer) version has
the additional position/limit info from the buffer itself, a workaround is to
return the number of underlying chars copied; then you need one more
parameter to let the user copy from that index for the next round.

-sherman
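
The resume-from-index workaround sketched above is essentially what the
standard CharsetEncoder loop already provides through src.position(). A
hedged sketch of that round-by-round pattern (the string and the 5-byte chunk
size are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class ChunkedEncode {
    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        CharBuffer src = CharBuffer.wrap("hello, wide world"); // 17 chars, all ASCII
        ByteBuffer chunk = ByteBuffer.allocate(5);             // deliberately tiny output buffer

        int rounds = 0, total = 0;
        while (src.hasRemaining()) {
            enc.encode(src, chunk, false); // src.position() is the resume index
            chunk.flip();
            total += chunk.remaining();    // ship these bytes, then reuse the buffer
            chunk.clear();
            rounds++;
        }
        enc.encode(src, chunk, true);      // signal end of input
        enc.flush(chunk);                  // UTF-8 keeps no trailing state, so this is a no-op
        System.out.println(rounds + " rounds, " + total + " bytes"); // 4 rounds, 17 bytes
    }
}
```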











Lower overhead String encoding/decoding

2014-09-22 Thread Richard Warburton
Hi all,

A long-standing issue with Strings in Java is the ease and performance of
creating a String from a ByteBuffer. People who are using NIO to bring in
data off the network will be receiving that data in the form of ByteBuffers
and converting it to some form of String: for example, RESTful systems
receiving XML or JSON messages.

The current workaround is to create a byte[] from the ByteBuffer - a
copying action for any direct ByteBuffer - and then pass that to the
String constructor. I'd like to propose that we add an additional
constructor to the String class that takes a ByteBuffer as an argument
and directly creates the char[] value inside the String from the ByteBuffer.
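
For concreteness, the workaround described above looks like this today (the
JSON payload and buffer size are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class CopyWorkaround {
    public static void main(String[] args) {
        // Simulate a (possibly direct) buffer read off the wire.
        ByteBuffer wire = ByteBuffer.allocateDirect(32);
        wire.put("{\"ok\":true}".getBytes(StandardCharsets.UTF_8));
        wire.flip();

        // Today: copy out to a byte[] first, then construct the String.
        byte[] copy = new byte[wire.remaining()];
        wire.get(copy);                          // the extra copy the proposal wants to avoid
        String json = new String(copy, StandardCharsets.UTF_8);
        System.out.println(json);                // {"ok":true}
    }
}
```

The proposed String(ByteBuffer, Charset) constructor would remove the
intermediate byte[] and, for direct buffers, the extra copy.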

Similarly, if you have a String that you want to encode onto the wire, you
need to call String.getBytes() and then write the resulting byte[] into a
ByteBuffer or send it over the network. This ends up allocating a byte[] to
do the copy and also trimming the byte[] back down again, usually
allocating another byte[]. To address this problem I've added a couple of
getBytes() overloads that take byte[] and ByteBuffer arguments and write
directly to those buffers.
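
For comparison, the closest today's API comes to the proposed overloads is
driving a CharsetEncoder straight into a caller-owned buffer (the request
line and buffer size are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class EncodeInto {
    public static void main(String[] args) {
        String s = "PUT /resource HTTP/1.1";
        ByteBuffer out = ByteBuffer.allocate(64); // caller-owned, reusable buffer

        // Encode straight into the caller's buffer instead of going via
        // String.getBytes() and its intermediate byte[] allocations.
        StandardCharsets.US_ASCII.newEncoder().encode(CharBuffer.wrap(s), out, true);
        out.flip();
        System.out.println(out.remaining());      // 22 bytes, no intermediate byte[]
    }
}
```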

I've put together a patch that implements this to demonstrate the overall
direction.

http://cr.openjdk.java.net/~rwarburton/string-patch-webrev-5/

I'm happy to take any feedback on direction or the patch itself or the
overall idea/approach. I think there are a number of similar API situations
in other places as well, for example StringBuffer/StringBuilder instances
which could have similar appends directly from ByteBuffer instances instead
of byte[] instances.

I'll also be at JavaOne next week, so if you want to talk about this, just
let me know.

regards,

  Richard Warburton

  http://insightfullogic.com
  @RichardWarburto http://twitter.com/richardwarburto

PS: I appreciate that I'm adding a method to the public API, which
consequently requires standardisation, but I think that this could get
incorporated into the Java 9 umbrella JSR.


Re: Lower overhead String encoding/decoding

2014-09-22 Thread Alan Bateman

On 22/09/2014 12:25, Richard Warburton wrote:

:

The direction seems reasonable but I wonder about the offset/length (and 
destOffset) parameters as this isn't consistent with how ByteBuffers were 
originally intended to be used. That is, when you read the bytes from 
the wire into a ByteBuffer and flip it then the position and limit will 
delimit the bytes to be decoded.
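
The read-then-flip convention described here can be sketched with an
in-memory channel standing in for a socket (a real socket may of course need
multiple reads):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;

public class ReadFlipDecode {
    public static void main(String[] args) throws IOException {
        // Stand-in for a socket channel.
        ReadableByteChannel ch = Channels.newChannel(
                new ByteArrayInputStream("status=ready".getBytes(StandardCharsets.UTF_8)));

        ByteBuffer buf = ByteBuffer.allocate(64);
        ch.read(buf); // fill: position advances past the bytes read
        buf.flip();   // limit = old position, position = 0: remaining() is the payload

        // Position and limit now delimit exactly the bytes to decode.
        String msg = StandardCharsets.UTF_8.decode(buf).toString();
        System.out.println(msg); // status=ready
    }
}
```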


If the constructor is changed to String(ByteBuffer in, Charset cs) and 
decodes the remaining bytes in the buffer to a String using the 
specified Charset then I think it would be more consistent. Also I think 
this would give you a solution to the underflow case.


Similarly if getBytes is replaced with a getBytes or 
encode(ByteBuffer, Charset cs) then it would encode as many 
characters as possible into the output buffer and I think it would be more 
consistent and also help with the overflow case.


-Alan.


Re: Lower overhead String encoding/decoding

2014-09-22 Thread Ulf Zibis

Compare with https://bugs.openjdk.java.net/browse/JDK-6695386
Maybe that would help a little.

-Ulf


On 22.09.2014 at 13:25, Richard Warburton wrote:
