[jira] [Comment Edited] (OAK-5506) Segment store apparently doesn't round trip node names with unpaired surrogates

2018-01-22 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837784#comment-15837784
 ] 

Julian Reschke edited comment on OAK-5506 at 1/22/18 3:23 PM:
--

{{o.a.j.o.segment.SegmentWriter.SegmentWriteOperation#writeString}} doesn't 
seem to be involved (set a breakpoint, didn't get there).

FWIW, this (or something like this) is the code that would need to be added:
{noformat}
private static void checkValidString(String s) throws IOException {
    for (int i = 0; i < s.length(); i++) {
        char c1 = s.charAt(i);
        if (Character.isSurrogate(c1)) {
            try {
                char c2 = s.charAt(i + 1);
                if (Character.isSurrogatePair(c1, c2)) {
                    // proceed
                    i += 1;
                } else {
                    throw new IOException("Invalid surrogate pair sequence: " + (int) c1 + " " + (int) c2);
                }
            } catch (IndexOutOfBoundsException ex) {
                throw new IOException("String ends in unpaired surrogate character.", ex);
            }
        }
    }
}
{noformat}
So, in general, this is a single pass that checks every char in the string.

[~mduerig]: agreed, but if we want to reject these values, then we'll have to 
detect them, right? Thinking of it, the cost would be smaller if we did it in a 
place where we have to parse the name already (that is, in the JCR layer).
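
A possible variant of the check above that avoids catching {{IndexOutOfBoundsException}} and reports the offending index instead of throwing. This is only a minimal sketch, not the code from any of the attached patches; the names {{SurrogateCheck}} and {{firstUnpairedSurrogate}} are made up for illustration.

```java
final class SurrogateCheck {

    private SurrogateCheck() {
    }

    /**
     * Returns the index of the first unpaired surrogate in s,
     * or -1 if every surrogate is part of a valid high/low pair.
     * Hypothetical helper, single pass over the chars.
     */
    static int firstUnpairedSurrogate(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip the low surrogate
                } else {
                    return i; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate without a preceding high surrogate
            }
        }
        return -1;
    }
}
```

A caller in the name-parsing code could then reject the name whenever the result is not -1, and include the index in the error message.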


was (Author: reschke):
{{o.a.j.o.segment.SegmentWriter.SegmentWriteOperation#writeString}} doesn't 
seem to be involved (set a breakpoint, didn't get there).

FWIW; this (or something like this) is the code that would need to be added:
{noformat}
private static void checkValidString(String s) throws IOException {
    for (int i = 0; i < s.length(); i++) {
        char c1 = s.charAt(i);
        if (Character.isSurrogate(c1)) {
            try {
                char c2 = s.charAt(i + 1);
                if (Character.isSurrogatePair(c1, c2)) {
                    // proceed
                    i += 1;
                } else {
                    throw new IOException("Invalid surrogate pair sequence: " + (int) c1 + " " + (int) c2);
                }
            } catch (IndexOutOfBoundsException ex) {
                throw new IOException("String ends in unpaired surrogate character.", ex);
            }
        }
    }
}
{noformat}

So, in general a single pass checking every char in the string.

[~mduerig]: agreed, but if we want to reject these values, then we'll have to 
detect them, right? Thinking of it, the cost would be smaller if we did it in a 
place where we have to parse the name already (tha is, in the jcr layer). 

> Segment store apparently doesn't round trip node names with unpaired 
> surrogates
> ---
>
> Key: OAK-5506
> URL: https://issues.apache.org/jira/browse/OAK-5506
> Project: Jackrabbit Oak
> Issue Type: Wish
> Components: segment-tar
> Affects Versions: 1.5.18
> Reporter: Julian Reschke
> Assignee: Francesco Mari
> Priority: Minor
> Attachments: OAK-5506-01.patch, OAK-5506-02.patch, OAK-5506-name-conversion.diff, ValidNamesTest.java
>
>
> Apparently, the following node name is accepted:
>{{"foo\ud800"}}
> but a subsequent {{getPath()}} call fails:
> {noformat}
> javax.jcr.InvalidItemStateException: This item [/test_node/foo?] does not exist anymore
>     at org.apache.jackrabbit.oak.jcr.delegate.ItemDelegate.checkAlive(ItemDelegate.java:86)
>     at org.apache.jackrabbit.oak.jcr.session.operation.ItemOperation.checkPreconditions(ItemOperation.java:34)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.prePerform(SessionDelegate.java:615)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:205)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:112)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.getPath(ItemImpl.java:140)
>     at org.apache.jackrabbit.oak.jcr.session.NodeImpl.getPath(NodeImpl.java:106)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.nameTest(ValidNamesTest.java:271)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.testUnpairedSurrogate(ValidNamesTest.java:259)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> {noformat}
> (test case follows)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (OAK-5506) Segment store apparently doesn't round trip node names with unpaired surrogates

2017-02-06 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15846844#comment-15846844
 ] 

Julian Reschke edited comment on OAK-5506 at 2/6/17 2:09 PM:
-

Also, the current code uses {{String.getBytes("UTF-8")}}. This will map broken 
Unicode characters silently to the "replacement character" -- that is, the 
segment store persists a string that does not represent the input.

It might be a good idea to use an API that actually flags these strings 
while getting the UTF-8 representation, see 
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html.
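
To illustrate the difference, here is a minimal sketch of encoding through a {{CharsetEncoder}} configured with {{CodingErrorAction.REPORT}}, next to the lossy {{getBytes}} path. It is not the segment store code; the class {{StrictUtf8}} and the method {{encodeStrict}} are hypothetical names used only for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {

    private StrictUtf8() {
    }

    /**
     * Encodes s as UTF-8 and throws CharacterCodingException if s contains
     * an unpaired surrogate, instead of silently substituting it.
     */
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        String broken = "foo\ud800";

        // Lossy path: the unpaired surrogate is replaced during encoding,
        // so decoding the bytes does not give back the original string.
        byte[] lossy = broken.getBytes(StandardCharsets.UTF_8);
        String roundTripped = new String(lossy, StandardCharsets.UTF_8);
        System.out.println("round trip equal: " + roundTripped.equals(broken));

        // Strict path: the same input is rejected.
        try {
            encodeStrict(broken);
            System.out.println("accepted");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

With this approach the detection comes as a side effect of the encoding that has to happen anyway, so no separate pass over the string is needed at this level.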


was (Author: reschke):
Also,

the current code uses {String.getBytes("UTF-8")}. This will map broken Unicode 
characters silently to the "replacement character" -- that is, the segment 
store persists a string that does not represent the input.

It might be a good idea to use an API that will actually flags these strings 
while getting the UTF-8 representation, see 
http://docs.oracle.com/javase/7/docs/api/java/nio/charset/CharsetEncoder.html.

> Segment store apparently doesn't round trip node names with unpaired 
> surrogates
> ---
>
> Key: OAK-5506
> URL: https://issues.apache.org/jira/browse/OAK-5506
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: segment-tar
> Affects Versions: 1.5.18
> Reporter: Julian Reschke
> Assignee: Francesco Mari
> Fix For: 1.8
>
> Attachments: OAK-5506-01.patch, OAK-5506-02.patch, ValidNamesTest.java
>
>
> Apparently, the following node name is accepted:
>{{"foo\ud800"}}
> but a subsequent {{getPath()}} call fails:
> {noformat}
> javax.jcr.InvalidItemStateException: This item [/test_node/foo?] does not exist anymore
>     at org.apache.jackrabbit.oak.jcr.delegate.ItemDelegate.checkAlive(ItemDelegate.java:86)
>     at org.apache.jackrabbit.oak.jcr.session.operation.ItemOperation.checkPreconditions(ItemOperation.java:34)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.prePerform(SessionDelegate.java:615)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:205)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:112)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.getPath(ItemImpl.java:140)
>     at org.apache.jackrabbit.oak.jcr.session.NodeImpl.getPath(NodeImpl.java:106)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.nameTest(ValidNamesTest.java:271)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.testUnpairedSurrogate(ValidNamesTest.java:259)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> {noformat}
> (test case follows)





[jira] [Comment Edited] (OAK-5506) Segment store apparently doesn't round trip node names with unpaired surrogates

2017-01-25 Thread Julian Reschke (JIRA)

[ 
https://issues.apache.org/jira/browse/OAK-5506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837784#comment-15837784
 ] 

Julian Reschke edited comment on OAK-5506 at 1/25/17 2:16 PM:
--

{{o.a.j.o.segment.SegmentWriter.SegmentWriteOperation#writeString}} doesn't 
seem to be involved (set a breakpoint, didn't get there).

FWIW, this (or something like this) is the code that would need to be added:
{noformat}
private static void checkValidString(String s) throws IOException {
    for (int i = 0; i < s.length(); i++) {
        char c1 = s.charAt(i);
        if (Character.isSurrogate(c1)) {
            try {
                char c2 = s.charAt(i + 1);
                if (Character.isSurrogatePair(c1, c2)) {
                    // proceed
                    i += 1;
                } else {
                    throw new IOException("Invalid surrogate pair sequence: " + (int) c1 + " " + (int) c2);
                }
            } catch (IndexOutOfBoundsException ex) {
                throw new IOException("String ends in unpaired surrogate character.", ex);
            }
        }
    }
}
{noformat}

So, in general, this is a single pass that checks every char in the string.

[~mduerig]: agreed, but if we want to reject these values, then we'll have to 
detect them, right? Thinking of it, the cost would be smaller if we did it in a 
place where we have to parse the name already (that is, in the JCR layer).


was (Author: reschke):
{{o.a.j.o.segment.SegmentWriter.SegmentWriteOperation#writeString}} doesn't 
seem to be involved (set a breakpoint, didn't get there).

FWIW; this (or something like this) is the code that would need to be added:
{noformat}
private static void checkValidString(String s) throws IOException {
    for (int i = 0; i < s.length(); i++) {
        char c1 = s.charAt(i);
        if (Character.isSurrogate(c1)) {
            try {
                char c2 = s.charAt(i + 1);
                if (Character.isSurrogatePair(c1, c2)) {
                    // proceed
                    i += 1;
                } else {
                    throw new IOException("Invalid surrogate pair sequence: " + (int) c1 + " " + (int) c2);
                }
            } catch (IndexOutOfBoundsException ex) {
                throw new IOException("String ends in unpaired surrogate character.", ex);
            }
        }
    }
}
{noformat}

So, in general a single pass checking every char in the string.

[~mduerig]: agreed, but if we want to reject these values, then we'll have to 
detect them, right?

> Segment store apparently doesn't round trip node names with unpaired 
> surrogates
> ---
>
> Key: OAK-5506
> URL: https://issues.apache.org/jira/browse/OAK-5506
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: segment-tar
> Affects Versions: 1.5.18
> Reporter: Julian Reschke
> Assignee: Francesco Mari
> Attachments: ValidNamesTest.java
>
>
> Apparently, the following node name is accepted:
>{{"foo\ud800"}}
> but a subsequent {{getPath()}} call fails:
> {noformat}
> javax.jcr.InvalidItemStateException: This item [/test_node/foo?] does not exist anymore
>     at org.apache.jackrabbit.oak.jcr.delegate.ItemDelegate.checkAlive(ItemDelegate.java:86)
>     at org.apache.jackrabbit.oak.jcr.session.operation.ItemOperation.checkPreconditions(ItemOperation.java:34)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.prePerform(SessionDelegate.java:615)
>     at org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate.perform(SessionDelegate.java:205)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.perform(ItemImpl.java:112)
>     at org.apache.jackrabbit.oak.jcr.session.ItemImpl.getPath(ItemImpl.java:140)
>     at org.apache.jackrabbit.oak.jcr.session.NodeImpl.getPath(NodeImpl.java:106)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.nameTest(ValidNamesTest.java:271)
>     at org.apache.jackrabbit.oak.jcr.ValidNamesTest.testUnpairedSurrogate(ValidNamesTest.java:259)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> {noformat}
> (test case follows)


