[jira] [Comment Edited] (OAK-5506) Segment store apparently doesn't round trip node names with unpaired surrogates
[ https://issues.apache.org/jira/browse/OAK-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837784#comment-15837784 ]

Julian Reschke edited comment on OAK-5506 at 1/22/18 3:23 PM:
--

{{o.a.j.o.segment.SegmentWriter.SegmentWriteOperation#writeString}} doesn't seem to be involved (set a breakpoint, didn't get there).

FWIW; this (or something like this) is the code that would need to be added:

{noformat}
private static void checkValidString(String s) throws IOException {
    for (int i = 0; i < s.length(); i++) {
        char c1 = s.charAt(i);
        if (Character.isSurrogate(c1)) {
            try {
                char c2 = s.charAt(i + 1);
                if (Character.isSurrogatePair(c1, c2)) {
                    // proceed
                    i += 1;
                } else {
                    throw new IOException("Invalid surrogate pair sequence: " + (int) c1 + " " + (int) c2);
                }
            } catch (IndexOutOfBoundsException ex) {
                throw new IOException("String ends in unpaired surrogate character.", ex);
            }
        }
    }
}
{noformat}

So, in general a single pass checking every char in the string.

[~mduerig]: agreed, but if we want to reject these values, then we'll have to detect them, right? Thinking of it, the cost would be smaller if we did it in a place where we have to parse the name already (that is, in the JCR layer).

was (Author: reschke):
{{o.a.j.o.segment.SegmentWriter.SegmentWriteOperation#writeString}} doesn't seem to be involved (set a breakpoint, didn't get there).

FWIW; this (or something like this) is the code that would need to be added:

{noformat}
private static void checkValidString(String s) throws IOException {
    for (int i = 0; i < s.length(); i++) {
        char c1 = s.charAt(i);
        if (Character.isSurrogate(c1)) {
            try {
                char c2 = s.charAt(i + 1);
                if (Character.isSurrogatePair(c1, c2)) {
                    // proceed
                    i += 1;
                } else {
                    throw new IOException("Invalid surrogate pair sequence: " + (int) c1 + " " + (int) c2);
                }
            } catch (IndexOutOfBoundsException ex) {
                throw new IOException("String ends in unpaired surrogate character.", ex);
            }
        }
    }
}
{noformat}

So, in general a single pass checking every char in the string.

[~mduerig]: agreed, but if we want to reject these values, then we'll have to detect them, right? Thinking of it, the cost would be smaller if we did it in a place where we have to parse the name already (that is, in the jcr layer).

> Segment store apparently doesn't round trip node names with unpaired surrogates
> ---
>
>                 Key: OAK-5506
>                 URL: https://issues.apache.org/jira/browse/OAK-5506
>             Project: Jackrabbit Oak
>          Issue Type: Wish
>          Components: segment-tar
>    Affects Versions: 1.5.18
>            Reporter: Julian Reschke
>            Assignee: Francesco Mari
>            Priority: Minor
>         Attachments: OAK-5506-01.patch, OAK-5506-02.patch, OAK-5506-name-conversion.diff, ValidNamesTest.java
>
>
> Apparently, the following node name is accepted:
>    {{"foo\ud800"}}
> but a subsequent {{getPath()}} call fails:
> {noformat}
> javax.jcr.InvalidItemStateException: This item [/test_node/foo?] does not exist anymore
>     at org.apache.jackrabbit.oak.jcr.delegate.ItemDelegate.checkAlive(ItemDelegate.java:86)
>     at org.apache.jackrabbit.oak.jcr.session.operation.ItemOperation.checkPreconditions(ItemOperation.java:34)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.prePerform(SessionDelegate.java:615)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:205)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:112)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.getPath(ItemImpl.java:140)
>     at org.apache.jackrabbit.oak.jcr.session.NodeImpl.getPath(NodeImpl.java:106)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.nameTest(ValidNamesTest.java:271)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.testUnpairedSurrogate(ValidNamesTest.java:259)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> {noformat}
> (test case follows)

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
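
For readers following the thread: the single-pass surrogate check discussed in the comment above can also be written without relying on {{IndexOutOfBoundsException}} for the end-of-string case. The following is only a sketch under that assumption, not code from any of the attached patches; the class name {{SurrogateCheck}} and the helper {{isValid}} are made up for illustration.

```java
import java.io.IOException;

final class SurrogateCheck {

    // Single pass over the chars: every high surrogate must be followed
    // directly by a low surrogate, and no low surrogate may stand alone.
    static void checkValidString(String s) throws IOException {
        int length = s.length();
        for (int i = 0; i < length; i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < length && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair, skip the low half
                } else {
                    throw new IOException("Unpaired high surrogate at index " + i + ": " + (int) c);
                }
            } else if (Character.isLowSurrogate(c)) {
                throw new IOException("Unpaired low surrogate at index " + i + ": " + (int) c);
            }
        }
    }

    // Hypothetical convenience wrapper for callers that prefer a boolean.
    static boolean isValid(String s) {
        try {
            checkValidString(s);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

With this sketch, {{SurrogateCheck.isValid("foo\ud800")}} is false while a well-formed pair such as {{"\ud83d\ude00"}} passes. The cost is still one pass over the string, so the point about doing the check where the name is parsed anyway is unaffected.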
[jira] [Comment Edited] (OAK-5506) Segment store apparently doesn't round trip node names with unpaired surrogates
[ https://issues.apache.org/jira/browse/OAK-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15846844#comment-15846844 ]

Julian Reschke edited comment on OAK-5506 at 2/6/17 2:09 PM:
-

Also, the current code uses {{String.getBytes("UTF-8")}}. This will map broken Unicode characters silently to the "replacement character" -- that is, the segment store persists a string that does not represent the input. It might be a good idea to use an API that actually flags these strings while getting the UTF-8 representation, see http://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html.

was (Author: reschke):
Also, the current code uses {String.getBytes("UTF-8")}. This will map broken Unicode characters silently to the "replacement character" -- that is, the segment store persists a string that does not represent the input. It might be a good idea to use an API that actually flags these strings while getting the UTF-8 representation, see http://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html.

> Segment store apparently doesn't round trip node names with unpaired surrogates
> ---
>
>                 Key: OAK-5506
>                 URL: https://issues.apache.org/jira/browse/OAK-5506
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-tar
>    Affects Versions: 1.5.18
>            Reporter: Julian Reschke
>            Assignee: Francesco Mari
>             Fix For: 1.8
>
>         Attachments: OAK-5506-01.patch, OAK-5506-02.patch, ValidNamesTest.java
>
>
> Apparently, the following node name is accepted:
>    {{"foo\ud800"}}
> but a subsequent {{getPath()}} call fails:
> {noformat}
> javax.jcr.InvalidItemStateException: This item [/test_node/foo?] does not exist anymore
>     at org.apache.jackrabbit.oak.jcr.delegate.ItemDelegate.checkAlive(ItemDelegate.java:86)
>     at org.apache.jackrabbit.oak.jcr.session.operation.ItemOperation.checkPreconditions(ItemOperation.java:34)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.prePerform(SessionDelegate.java:615)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:205)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:112)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.getPath(ItemImpl.java:140)
>     at org.apache.jackrabbit.oak.jcr.session.NodeImpl.getPath(NodeImpl.java:106)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.nameTest(ValidNamesTest.java:271)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.testUnpairedSurrogate(ValidNamesTest.java:259)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> {noformat}
> (test case follows)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
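
To illustrate the {{CharsetEncoder}} suggestion from the comment above: the sketch below contrasts the silent substitution done by {{String.getBytes}} with an encoder configured to report malformed input. It is a minimal example using only standard JDK classes, not the change made in Oak; the class name {{StrictUtf8}} and both method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {

    // Encodes to UTF-8 and throws instead of substituting a replacement
    // when the input contains an unpaired surrogate.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    // True if encoding with String.getBytes and decoding again yields the
    // original string, i.e. nothing was silently replaced on the way.
    static boolean roundTrips(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8).equals(s);
    }
}
```

For the name from the issue description, {{StrictUtf8.roundTrips("foo\ud800")}} is false because {{getBytes}} substitutes the lone surrogate, whereas {{StrictUtf8.encodeStrict("foo\ud800")}} throws a {{CharacterCodingException}}, so the caller can reject the value before anything is persisted.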
[jira] [Comment Edited] (OAK-5506) Segment store apparently doesn't round trip node names with unpaired surrogates
[ https://issues.apache.org/jira/browse/OAK-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837784#comment-15837784 ]

Julian Reschke edited comment on OAK-5506 at 1/25/17 2:16 PM:
--

{{o.a.j.o.segment.SegmentWriter.SegmentWriteOperation#writeString}} doesn't seem to be involved (set a breakpoint, didn't get there).

FWIW; this (or something like this) is the code that would need to be added:

{noformat}
private static void checkValidString(String s) throws IOException {
    for (int i = 0; i < s.length(); i++) {
        char c1 = s.charAt(i);
        if (Character.isSurrogate(c1)) {
            try {
                char c2 = s.charAt(i + 1);
                if (Character.isSurrogatePair(c1, c2)) {
                    // proceed
                    i += 1;
                } else {
                    throw new IOException("Invalid surrogate pair sequence: " + (int) c1 + " " + (int) c2);
                }
            } catch (IndexOutOfBoundsException ex) {
                throw new IOException("String ends in unpaired surrogate character.", ex);
            }
        }
    }
}
{noformat}

So, in general a single pass checking every char in the string.

[~mduerig]: agreed, but if we want to reject these values, then we'll have to detect them, right? Thinking of it, the cost would be smaller if we did it in a place where we have to parse the name already (that is, in the jcr layer).

was (Author: reschke):
{{o.a.j.o.segment.SegmentWriter.SegmentWriteOperation#writeString}} doesn't seem to be involved (set a breakpoint, didn't get there).

FWIW; this (or something like this) is the code that would need to be added:

{noformat}
private static void checkValidString(String s) throws IOException {
    for (int i = 0; i < s.length(); i++) {
        char c1 = s.charAt(i);
        if (Character.isSurrogate(c1)) {
            try {
                char c2 = s.charAt(i + 1);
                if (Character.isSurrogatePair(c1, c2)) {
                    // proceed
                    i += 1;
                } else {
                    throw new IOException("Invalid surrogate pair sequence: " + (int) c1 + " " + (int) c2);
                }
            } catch (IndexOutOfBoundsException ex) {
                throw new IOException("String ends in unpaired surrogate character.", ex);
            }
        }
    }
}
{noformat}

So, in general a single pass checking every char in the string.

[~mduerig]: agreed, but if we want to reject these values, then we'll have to detect them, right?

> Segment store apparently doesn't round trip node names with unpaired surrogates
> ---
>
>                 Key: OAK-5506
>                 URL: https://issues.apache.org/jira/browse/OAK-5506
>             Project: Jackrabbit Oak
>          Issue Type: Bug
>          Components: segment-tar
>    Affects Versions: 1.5.18
>            Reporter: Julian Reschke
>            Assignee: Francesco Mari
>         Attachments: ValidNamesTest.java
>
>
> Apparently, the following node name is accepted:
>    {{"foo\ud800"}}
> but a subsequent {{getPath()}} call fails:
> {noformat}
> javax.jcr.InvalidItemStateException: This item [/test_node/foo?] does not exist anymore
>     at org.apache.jackrabbit.oak.jcr.delegate.ItemDelegate.checkAlive(ItemDelegate.java:86)
>     at org.apache.jackrabbit.oak.jcr.session.operation.ItemOperation.checkPreconditions(ItemOperation.java:34)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.prePerform(SessionDelegate.java:615)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:205)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:112)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.getPath(ItemImpl.java:140)
>     at org.apache.jackrabbit.oak.jcr.session.NodeImpl.getPath(NodeImpl.java:106)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.nameTest(ValidNamesTest.java:271)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.testUnpairedSurrogate(ValidNamesTest.java:259)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> {noformat}
> (test case follows)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)