Julian Reschke created OAK-7279:
-----------------------------------

             Summary: segment-tar update from java 7 to java 8 may break 
persisted names using invalid characters
                 Key: OAK-7279
                 URL: https://issues.apache.org/jira/browse/OAK-7279
             Project: Jackrabbit Oak
          Issue Type: Bug
          Components: segment-tar
            Reporter: Julian Reschke


segment-tar relies on {{String.getBytes()}} when persisting strings such as 
item names.

The problem is that the behavior of this method changed in Java 8 with 
respect to invalid strings (here: null characters and unpaired surrogates).

In Java 7, these would roundtrip, as Java was using the so-called "modified 
UTF-8" encoding (see 
https://docs.oracle.com/javase/6/docs/api/java/io/DataInput.html#modified-utf-8).
This produces byte sequences that are *not* valid UTF-8.

Java 7 will read them back, but Java 8 will map the non-conforming byte 
sequences to the Unicode replacement character. Note that in particular, 
multiple child entries might get identical names as a consequence.
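
To illustrate the collision, here is a minimal sketch. It is not the segment-tar code path: it assumes {{DataOutputStream.writeUTF()}} as a stand-in for producing modified UTF-8 bytes, and the class and method names ({{ModifiedUtf8Demo}}, {{modifiedUtf8}}, {{strictDecode}}) are made up for this example. On a Java 8 or later runtime, the two distinct names below should no longer roundtrip and should decode to the same string.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Oak.
class ModifiedUtf8Demo {

    // Encodes a string as modified UTF-8. writeUTF() prepends a 2-byte
    // length field, which is dropped here so only the payload remains.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes bytes with the standard UTF-8 decoder, which substitutes
    // U+FFFD for malformed input instead of failing.
    static String strictDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String a = "child\uD800"; // ends with an unpaired high surrogate
        String b = "child\uD801"; // a different unpaired high surrogate

        String ra = strictDecode(modifiedUtf8(a));
        String rb = strictDecode(modifiedUtf8(b));

        System.out.println("a roundtrips:      " + a.equals(ra));
        System.out.println("a and b collide:   " + ra.equals(rb));
    }
}
```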

I'm not sure about the severity of this, and whether something needs to be done 
about it. AFAIC, this is another good reason to reject invalid strings as early 
as possible in the stack.
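
As a rough idea of what such an early check could look like, here is a sketch. The class and method names ({{NameValidator}}, {{isValidName}}) are hypothetical and not an existing Oak API, and where in the stack such a check would belong is left open.

```java
// Hypothetical helper, not part of Oak.
class NameValidator {

    // Returns false if the name contains a null character or a surrogate
    // that is not part of a well-formed high/low surrogate pair.
    static boolean isValidName(String name) {
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '\u0000') {
                return false; // null character
            }
            if (Character.isHighSurrogate(c)) {
                boolean paired = i + 1 < name.length()
                        && Character.isLowSurrogate(name.charAt(i + 1));
                if (!paired) {
                    return false; // high surrogate without a low surrogate
                }
                i++; // skip the low surrogate of the valid pair
            } else if (Character.isLowSurrogate(c)) {
                return false; // low surrogate without a preceding high one
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidName("child"));
        System.out.println(isValidName("child\uD800"));
    }
}
```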

cc [~mduerig]




