Re: consecutive numeric character references

Bernd Fondermann Mon, 03 Jun 2013 11:21:56 -0700

Hi Gene,

thanks for the patch. It's been applied to master.


  Bernd


On Sun, Jun 2, 2013 at 4:12 PM, Gene McCulley <[email protected]>wrote:

> There is a bug in XMLParser when parsing consecutive numeric character
> references because UNESCAPE_UNICODE_PATTERN greedily matches.
>
> We found this when exchanging JSON snippets over chat. The chat client was
> turning the preceding whitespace in a JSON snippet into
> &#160;&#160;&#160;&#160; and this causes XMLParser to throw an exception
> trying to parse it as one character instead of as four characters.
>
> Here is a patch. (I'm not sure if this is the best approach. Maybe it
> would be better to make the whole expression lazy?):
>
> diff --git
> a/nbxml/src/main/java/org/apache/vysper/xml/sax/impl/XMLParser.java
> b/nbxml/src/main/java/org/apache/vysper/xml/sax/impl/XML
> index 96ca040..6be4bb4 100644
> --- a/nbxml/src/main/java/org/apache/vysper/xml/sax/impl/XMLParser.java
> +++ b/nbxml/src/main/java/org/apache/vysper/xml/sax/impl/XMLParser.java
> @@ -54,7 +54,7 @@ public class XMLParser implements TokenListener {
>
>      public static final Pattern NAME_PREFIX_PATTERN =
> Pattern.compile("^xml", Pattern.CASE_INSENSITIVE);
>
> -    public static final Pattern UNESCAPE_UNICODE_PATTERN =
> Pattern.compile("\\&\\#(x?)(.+);");
> +    public static final Pattern UNESCAPE_UNICODE_PATTERN =
> Pattern.compile("\\&\\#(x?)([0-9a-fA-F]++);");
>
>      private ContentHandler contentHandler;
>
> diff --git
> a/nbxml/src/test/java/org/apache/vysper/xml/sax/impl/ParseTextTestCase.java
> b/nbxml/src/test/java/org/apache/vysper/xml/sax/
> index 5305a9b..2f3112e 100644
> ---
> a/nbxml/src/test/java/org/apache/vysper/xml/sax/impl/ParseTextTestCase.java
> +++
> b/nbxml/src/test/java/org/apache/vysper/xml/sax/impl/ParseTextTestCase.java
> @@ -88,6 +88,18 @@ public class ParseTextTestCase extends
> AbstractAsyncXMLReaderTestCase {
>          assertFalse(events.hasNext());
>      }
>
> +    public void testConsecutiveUnicodeEscape() throws Exception {
> +        Iterator<TestEvent> events =
> parse("<root>&#160;&#160;&#160;&#160;</root>").iterator();
> +
> +        assertStartDocument(events.next());
> +        assertStartElement("", "root", "root", events.next());
> +        assertText("\u00A0\u00A0\u00A0\u00A0", events.next());
> +        assertEndElement("", "root", "root", events.next());
> +        assertEndDocument(events.next());
> +
> +        assertFalse(events.hasNext());
> +    }
> +
>      public void testTextOnly() throws Exception {
>          Iterator<TestEvent> events = parse("text</root>").iterator();
>
>
>

Re: consecutive numeric character references

Reply via email to