Re: XMLChar.isNameStart error?

2018-02-14 Thread Andy Seaborne

This is about "editions" of XML 1.0.

On 14/02/18 10:52, Claude Warren wrote:

> My error.  I should have specified XML 1.0, as that is the spec that I drew
> the test code from:  https://www.w3.org/TR/xml/#NT-NameStartChar


Is the XMLChar in the JDK correct? I don't know what edition the
built-in Java XML parser supports.




> So Xerces fails to meet the XML 1.0 naming spec.  I have
> opened a defect with Xerces
> (https://issues.apache.org/jira/browse/XERCESJ-1690) but I don't expect
> much movement there.


Apache Xerces claims support for "XML 1.0 (4th Edition)", not edition 5.

I looked at XML 1.0 edition 4 and it looks different:
"| [#x0100-#x0131] | [#x0134-#x013E] |"
There is no x132.
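
To make the edition difference concrete, here is a minimal sketch that tests a few code points around U+0132 against the 4th edition fragment quoted above and against the single 5th edition NameStartChar range [#xF8-#x2FF] that covers the same area. The class and method names are made up for illustration, and only these fragments are modelled, not the full productions of either edition.

```java
class EditionCheck {
    // Fragment of the XML 1.0 4th edition ranges quoted in this message.
    static final int[][] FOURTH_ED_FRAGMENT = { { 0x0100, 0x0131 }, { 0x0134, 0x013E } };

    // The 5th edition NameStartChar range that spans the same code points.
    static final int[][] FIFTH_ED_FRAGMENT = { { 0xF8, 0x2FF } };

    // True if cp falls inside any of the inclusive [low, high] ranges.
    static boolean inRanges(int cp, int[][] ranges) {
        for (int[] r : ranges) {
            if (cp >= r[0] && cp <= r[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Walk across the gap in the 4th edition fragment (U+0132 and U+0133).
        for (int cp = 0x130; cp <= 0x135; cp++) {
            System.out.printf("U+%04X  4th edition: %-5b  5th edition: %b%n", cp,
                    inRanges(cp, FOURTH_ED_FRAGMENT), inRanges(cp, FIFTH_ED_FRAGMENT));
        }
    }
}
```

A table built from the 4th edition ranges would reject U+0132 while the 5th edition production accepts it, which matches where the reported miscategorization starts.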

Xerces was going to release 2.12 last year but I think that ran out of
energy.  Not sure what edition is targeted.




Jena is not so heavily tied to Xerces.  There are only a couple of files
that import org.apache.xerces datatype code.


We could extract the datatype source and adopt it, then use the Java
built-in parser or any other, because we would then not depend on or ship Xerces.


Xerces gets to be the XML parser by ServiceLoading.
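
One way to see which parser the service lookup actually selected is to ask the standard JAXP factories for their implementation class. This is a small sketch using only javax.xml.parsers; the class name WhichParser is made up for illustration, and the names printed depend on what is on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

class WhichParser {
    // Implementation class chosen by the JAXP lookup for SAX parsing.
    static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    // Implementation class chosen by the JAXP lookup for DOM parsing.
    static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("SAX factory: " + saxFactoryName());
        System.out.println("DOM factory: " + domFactoryName());
    }
}
```

Running this with and without the Xerces jar on the classpath should show whether the JDK's built-in implementation or org.apache.xerces is being picked up.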



>> it will not split the URL correctly.

A "feature" of RDF/XML

Actually, there isn't a "correct split", though we all expect a split at
"/" or "#".

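The split that most people expect can be sketched as cutting just after the last "/" or "#". This is only an illustration of that expectation: the helper name splitPoint and the example URI are made up, and this is not what the RDF/XML rules or Jena's code do.

```java
class NaiveSplit {
    // Index just after the last '/' or '#', or 0 if the URI contains neither.
    static int splitPoint(String uri) {
        int i = Math.max(uri.lastIndexOf('/'), uri.lastIndexOf('#'));
        return i + 1;
    }

    public static void main(String[] args) {
        String uri = "http://example.org/ns#name";
        int i = splitPoint(uri);
        System.out.println("namespace: " + uri.substring(0, i));
        System.out.println("local name: " + uri.substring(i));
    }
}
```

An XML-driven split instead depends on which characters count as name characters, so it can land somewhere other than this position.
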

Andy





Re: XMLChar.isNameStart error?

2018-02-14 Thread Claude Warren
My error.  I should have specified XML 1.0, as that is the spec that I drew
the test code from:  https://www.w3.org/TR/xml/#NT-NameStartChar

So Xerces fails to meet the XML 1.0 naming spec.  I have
opened a defect with Xerces
(https://issues.apache.org/jira/browse/XERCESJ-1690) but I don't expect
much movement there.

Claude



Re: XMLChar.isNameStart error?

2018-02-14 Thread Rob Vesse
If memory serves, this is mostly historical: once upon a time RDF/XML was the
only serialisation available, and so everything had to be XML compliant.
Obviously things have evolved over time but the implementation is conservative
in this regard.

Also, I think XML 1.1 post-dates RDF/XML and various other specifications, all of
which are defined in terms of XML 1.0. For maximum compatibility it is better
for us to be conservative because most of the ecosystem has not adopted XML 1.1
yet.

Rob


Re: XMLChar.isNameStart error?

2018-02-14 Thread Claude Warren
The issue is that predicate namespaces are parsed with XMLChar.  So if I
have one that is correctly formed based on the XML 1.1 spec but the XMLChar
code does not recognize the first character of the local name, it will not
split the URL correctly.  All code that depends upon
Resource.getNamespace() and Resource.getLocalName() will be incorrect.  It
seems to me this is a low-level problem.
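
To illustrate how the character table moves the split point, here is a simplified model of XML-style splitting: take the longest suffix made of name characters, then trim it so it begins with a name-start character. This is a sketch under stated assumptions, not Jena's or Xerces's actual code. The class, the predicates and the example URI are all hypothetical, and it only looks at BMP characters via charAt.

```java
import java.util.function.IntPredicate;

class XmlSplitSketch {
    // ASCII-only predicates standing in for a narrower character table.
    static final IntPredicate NARROW_START =
            c -> (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    static final IntPredicate NARROW_CHAR =
            c -> NARROW_START.test(c) || (c >= '0' && c <= '9') || c == '-' || c == '.';

    // Wider predicates that also accept U+00F8..U+02FF (one 5th edition range).
    static final IntPredicate WIDE_START =
            c -> NARROW_START.test(c) || (c >= 0xF8 && c <= 0x2FF);
    static final IntPredicate WIDE_CHAR =
            c -> NARROW_CHAR.test(c) || (c >= 0xF8 && c <= 0x2FF);

    // Returns the index where the local name starts under the given predicates.
    static int splitPoint(String uri, IntPredicate isNameStart, IntPredicate isNameChar) {
        int i = uri.length();
        // Walk back over the trailing run of name characters.
        while (i > 0 && isNameChar.test(uri.charAt(i - 1))) {
            i--;
        }
        // Skip forward until the local name begins with a name-start character.
        while (i < uri.length() && !isNameStart.test(uri.charAt(i))) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String uri = "http://example.org/ns#\u0132sland";
        System.out.println("narrow table: " + uri.substring(splitPoint(uri, NARROW_START, NARROW_CHAR)));
        System.out.println("wide table:   " + uri.substring(splitPoint(uri, WIDE_START, WIDE_CHAR)));
    }
}
```

With the narrow table the U+0132 character ends up in the namespace part and the local name loses its first letter; with the wider table the whole word after "#" becomes the local name.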

While it should be easy to fix the parsing problem, I am not certain what
effect that will have on any other code that is dependent upon the Xerces
code (where XMLChar originates).

Claude


-- 
I like: Like Like - The likeliest place on the web

LinkedIn: http://www.linkedin.com/in/claudewarren


Re: XMLChar.isNameStart error?

2018-02-13 Thread Andy Seaborne

Maybe SplitIRI will help?

It does Turtle splitting as well as XML.

Andy


Re: XMLChar.isNameStart error?

2018-02-13 Thread Claude Warren
It is used in org.apache.jena.rdf.model.impl.Util namespace splitting code.





Re: XMLChar.isNameStart error?

2018-02-13 Thread Andy Seaborne

Where is XMLChar.isNameStart being used?





XMLChar.isNameStart error?

2018-02-13 Thread Claude Warren
Is there a reason that Jena does not support the full range of XML name
start characters?

see https://www.w3.org/TR/xml/#NT-NameStartChar

I wrote a quick test and found that there were a number of characters that
Jena does not support.
Miscategorization appears to start at 0x132.  There are 936990
miscategorized characters.
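
For scale, the total number of code points covered by the NameStartChar production can be counted directly. This sketch assumes the last range is the spec's [#x10000-#xEFFFF] and includes [a-z]; the class name is made up for illustration.

```java
class NameStartCount {
    // Inclusive ranges of the XML 1.0 (5th edition) NameStartChar production.
    static final int[][] RANGES = { { ':', ':' }, { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
            { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF }, { 0x370, 0x37D },
            { 0x37F, 0x1FFF }, { 0x200C, 0x200D }, { 0x2070, 0x218F },
            { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF }, { 0xF900, 0xFDCF },
            { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF } };

    // Number of code points covered by a list of inclusive ranges.
    static long count(int[][] ranges) {
        long n = 0;
        for (int[] r : ranges) {
            n += r[1] - r[0] + 1;
        }
        return n;
    }

    public static void main(String[] args) {
        long supplementary = count(new int[][] { { 0x10000, 0xEFFFF } });
        System.out.println("total NameStartChar code points: " + count(RANGES));
        System.out.println("of which in [#x10000-#xEFFFF]:    " + supplementary);
    }
}
```

The supplementary range alone contributes 917504 code points, so most of the reported count would come from there if the class under test rejects everything above the BMP.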

The issue is actually in the Xerces util class XMLChar.

Is this because of the version of Xerces we are stuck with?  Is there a way
around this issue?

Claude

p.s. Since I can't attach a file, here is the test code I wrote.

import static org.junit.Assert.assertTrue;

import org.apache.xerces.util.XMLChar;
import org.junit.Test;

public class NameTest {
    /*
     * NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] |
     * [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] |
     * [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] |
     * [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
     */
    int[][] ranges = { { ':', ':' }, { 'A', 'Z' }, { '_', '_' }, { 'a', 'z' },
            { 0xC0, 0xD6 }, { 0xD8, 0xF6 }, { 0xF8, 0x2FF }, { 0x370, 0x37D },
            { 0x37F, 0x1FFF }, { 0x200C, 0x200D }, { 0x2070, 0x218F },
            { 0x2C00, 0x2FEF }, { 0x3001, 0xD7FF }, { 0xF900, 0xFDCF },
            { 0xFDF0, 0xFFFD }, { 0x10000, 0xEFFFF } };

    // Fails on the first code point in the ranges that XMLChar rejects.
    @Test
    public void testNameStart() {
        for (int[] range : ranges) {
            for (int c = range[0]; c <= range[1]; c++) {
                assertTrue( String.format( "character %s 0x%s", c, Integer.toHexString( c ) ),
                        XMLChar.isNameStart( c ) );
            }
        }
    }

    // Prints every rejected code point, 25 per line, followed by the total.
    @Test
    public void listNameStartErr() {
        int cnt = 0;
        for (int[] range : ranges) {
            for (int c = range[0]; c <= range[1]; c++) {
                if (!XMLChar.isNameStart( c )) {
                    System.out.print( String.format( "0x%s ", Integer.toHexString( c ) ) );
                    cnt++;
                    if (cnt % 25 == 0) {
                        System.out.println();
                    }
                }
            }
        }
        System.out.println();
        System.out.println( cnt + " characters miscategorized" );
    }
}

