RFR: 8338257: UTF8 lengths should be size_t not int

David Holmes Mon, 12 Aug 2024 22:50:14 -0700

This work has been split out from JDK-8328877: [JNI] The JNI Specification 
needs to address the limitations of integer UTF-8 String lengths


The modified UTF-8 format used by the VM can require up to six bytes to 
represent one Unicode character, but six-byte characters are stored as UTF-16 
surrogate pairs. Hence each UTF-16 code unit (Java `char`) needs at most 3 
bytes, and so the maximum length is 3*`Integer.MAX_VALUE`, though with compact 
strings this reduces to 2*`Integer.MAX_VALUE`. The low-level UTF8/UNICODE API 
should therefore define UTF8 lengths as `size_t` to accommodate all possible 
representations. Higher-level APIs can still use `int` if they know the 
strings (e.g. symbols) are sufficiently constrained in length. See the 
comments in utf8.hpp that explain Strings, compact strings and the encoding.

As the existing JNI `GetStringUTFLength` still requires the current truncating 
behaviour of `UNICODE::utf8_length`, we add back `UNICODE::utf8_length_as_int` 
for it to use.
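
As an illustration only, one plausible reading of "truncating" is that counting stops at the last whole character that keeps the total within `INT_MAX`. The sketch below assumes exactly that; the real `UNICODE::utf8_length_as_int` may differ in detail, and `unit_size`, `utf8_length_capped` and `utf8_length_as_int_sketch` are hypothetical names. The limit is a parameter so the behaviour can be exercised with small inputs.

```cpp
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: modified UTF-8 bytes for one UTF-16 code unit.
static inline size_t unit_size(uint16_t c) {
  if (c >= 0x0001 && c <= 0x007F) return 1;
  if (c <= 0x07FF) return 2;
  return 3;
}

// Sum the encoded sizes, stopping before the total would exceed `limit`.
// Invariant: total <= limit, so `limit - total` cannot underflow.
size_t utf8_length_capped(const uint16_t* chars, int char_count, size_t limit) {
  size_t total = 0;
  for (int i = 0; i < char_count; i++) {
    size_t n = unit_size(chars[i]);
    if (n > limit - total) break;  // next character would not fit
    total += n;
  }
  return total;
}

// Assumed int-returning variant: the cap guarantees the cast is lossless.
int utf8_length_as_int_sketch(const uint16_t* chars, int char_count) {
  return static_cast<int>(
      utf8_length_capped(chars, char_count, static_cast<size_t>(INT_MAX)));
}
```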

Note that some APIs, like `UNICODE::as_utf8(const T* base, size_t& length)`, 
use `length` as an IN/OUT parameter: it is the incoming (int) length of the 
jbyte/jchar array, and the outgoing (size_t) length of the UTF8 sequence. This 
makes some of the call sites a little messy with casts.
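
The IN/OUT pattern and the resulting cast can be sketched as follows. This is a stand-in, not the HotSpot function: `as_utf8_sketch` and `utf8_bytes_for` are hypothetical names, and the sketch returns a `std::string` purely to stay self-contained.

```cpp
#include <stddef.h>
#include <stdint.h>
#include <string>

// Hypothetical stand-in for an IN/OUT length API.
// IN:  length = number of UTF-16 code units in base.
// OUT: length = number of modified UTF-8 bytes produced.
std::string as_utf8_sketch(const uint16_t* base, size_t& length) {
  std::string out;
  for (size_t i = 0; i < length; i++) {
    uint16_t c = base[i];
    if (c >= 0x0001 && c <= 0x007F) {
      out.push_back(static_cast<char>(c));
    } else if (c <= 0x07FF) {  // includes NUL, encoded as 0xC0 0x80
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    } else {                   // includes each half of a surrogate pair
      out.push_back(static_cast<char>(0xE0 | (c >> 12)));
      out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  length = out.size();  // overwrite the IN value with the byte count
  return out;
}

// Call site: the array length arrives as an int, so it is cast on the way in.
size_t utf8_bytes_for(const uint16_t* chars, int char_count) {
  size_t length = static_cast<size_t>(char_count);  // IN: code units
  (void)as_utf8_sketch(chars, length);              // OUT: bytes
  return length;
}
```

Because one variable carries two different units, a caller that needs both values has to copy the incoming count before making the call.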

Testing:
 - tiers 1-4
 - GHA

-------------

Commit messages:
 - unnecessary cast
 - Fix comments
 - Fix off-by-one error
 - Rollback the GetLargeStringUTFLength addition.
 - Rollback the GetLargeStringUTFLength addition.
 - Initial commit before splitting out UTF8 changes

Changes: https://git.openjdk.org/jdk/pull/20560/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20560&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8338257
  Stats: 243 lines in 16 files changed: 116 ins; 5 del; 122 mod
  Patch: https://git.openjdk.org/jdk/pull/20560.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/20560/head:pull/20560

PR: https://git.openjdk.org/jdk/pull/20560
