[jira] [Commented] (XALANJ-2540) Very inefficient default behaviour for looking up DTMManager

2018-04-16 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439952#comment-16439952
 ] 

Gary Gregory commented on XALANJ-2540:
--

Do keep in mind that it has been a long time since a release and if there is a 
2.7.3 it will likely be from the branch.

> Very inefficient default behaviour for looking up DTMManager
> 
>
> Key: XALANJ-2540
> URL: https://issues.apache.org/jira/browse/XALANJ-2540
> Project: XalanJ2
>  Issue Type: Improvement
>  Security Level: No security risk; visible to anyone(Ordinary problems in 
> Xalan projects.  Anybody can view the issue.) 
>  Components: DTM, XPath
>Affects Versions: 2.7.1, 2.7
>Reporter: Lukas Eder
>Priority: Major
>
> I have analysed an issue that has been bothering me for some time. When 
> executing XPath evaluations, it looks like a very significant amount of time 
> is spent in the initialisation of the XPathContext. I have asked this 
> question on Stack Overflow and answered it myself:
> http://stackoverflow.com/questions/6340802/java-xpath-apache-jaxp-implementation-performance
> I think the default behaviour of 
> org.apache.xml.dtm.ObjectFactory.lookUpFactoryClassName() is quite 
> sub-optimal and should be improved, statically. I imagine it is unlikely 
> that this configuration is going to change once classes have been loaded. 
> Hence, the fallback lookup of META-INF/services/org.apache.xml.dtm.DTMManager 
> should only be done once.
> For reference, here's the question and answer again in JIRA:
> 
> I have come to an astonishing conclusion that this:
> Element e = (Element) document.getElementsByTagName("SomeElementName").item(0);
> String result = e.getTextContent();
> Seems to be an incredible 100x faster than this:
> // Accounts for 30%, can be cached
> XPathFactory factory = XPathFactory.newInstance();
> // Negligible
> XPath xpath = factory.newXPath();
> // Accounts for 70% (caching a compiled expression doesn't change much...)
> String result = (String) xpath.evaluate(
>   "//SomeElementName", document, XPathConstants.STRING);
> I'm using the JVM's default implementation of JAXP:
> org.apache.xpath.jaxp.XPathFactoryImpl
> org.apache.xpath.jaxp.XPathImpl
> I'm really confused, because it's easy to see how JAXP could optimise the 
> above XPath query to actually execute a simple getElementsByTagName() 
> instead. But it doesn't seem to do that. This problem is limited to around 
> 5-6 frequently used XPath calls, that are abstracted and hidden by an API. 
> Those queries involve simple paths (e.g. /a/b/c, no variables, conditions) 
> against an always available DOM Document only. So, if an optimisation can be 
> done, it will be quite easy to achieve.
> 
> I have debugged and profiled my test case and Xalan/JAXP in general. I 
> managed to identify the major problem in
> org.apache.xml.dtm.ObjectFactory.lookUpFactoryClassName()
> It can be seen that every one of the 10k test XPath evaluations led to the 
> classloader trying to look up the DTMManager instance in some sort of default 
> configuration. This configuration is not loaded into memory but accessed 
> every time. Furthermore, this access seems to be protected by a lock on the 
> ObjectFactory.class itself. When the access fails (which it does by default), 
> the configuration is loaded from the xalan.jar file's
> META-INF/services/org.apache.xml.dtm.DTMManager
> configuration file. Every time!
> Fortunately, this behaviour can be overridden by specifying a JVM parameter 
> like this:
> -Dorg.apache.xml.dtm.DTMManager=org.apache.xml.dtm.ref.DTMManagerDefault
> or
> -Dcom.sun.org.apache.xml.internal.dtm.DTMManager=com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault
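For readers who cannot change the JVM command line, the same idea can be sketched in code. This is only an illustration under assumptions: the class and method names (XPathReuseSketch, pinDtmManager, compile, parse) are made up for this example, the property has to be set before the first XPath evaluation, and it is simply ignored when the Xalan classes are not the implementation in use. The sketch also reuses one compiled XPathExpression, as in the measurements quoted in this issue.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathReuseSketch {

    // Programmatic equivalent of the -D flag quoted above (assumption: this
    // runs before any XPath evaluation has triggered the lookup).
    public static void pinDtmManager() {
        System.setProperty("org.apache.xml.dtm.DTMManager",
                "org.apache.xml.dtm.ref.DTMManagerDefault");
    }

    // Build the factory and XPath object once and keep only the compiled expression.
    public static XPathExpression compile(String expression) throws XPathExpressionException {
        return XPathFactory.newInstance().newXPath().compile(expression);
    }

    // Helper for the demo: parse an XML string into a DOM Document.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        pinDtmManager();
        Document document = parse("<root><SomeElementName>hello</SomeElementName></root>");
        XPathExpression expr = compile("//SomeElementName");
        // The compiled expression is reused for every evaluation.
        for (int i = 0; i < 3; i++) {
            System.out.println(expr.evaluate(document));
        }
    }
}
```

Whether this helps on a given JVM depends on which XPath implementation is actually loaded, so it is worth measuring before and after.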
> So here's a performance improvement overview for 10k consecutive XPath 
> evaluations of //SomeNodeName against a 90k XML file (measured with 
> System.nanoTime()):
>
> measured library        | Xalan 2.7.0 | Xalan 2.7.1 | Saxon-HE 9.3 | jaxen 1.1.3
> ------------------------+-------------+-------------+--------------+------------
> without optimisation    |     10400ms |      4717ms |              |     25500ms
> reusing XPathFactory    |      5995ms |      2829ms |              |
> reusing XPath           |      5900ms |      2890ms |              |
> reusing XPathExpression |      5800ms |      2915ms |      16000ms |     25000ms
> adding the JVM param    |      1163ms |       761ms |          n/a |
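
The proposal in the description, doing the fallback lookup only once, could look roughly like the sketch below. It is not Xalan code: CachedLookupSketch and memoize are hypothetical names, and the lambda merely stands in for the expensive system-property and jar-resource search. The point is only that the result is cached after the first call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class CachedLookupSketch {

    // Wraps an expensive lookup so that it runs at most once for a non-null
    // result; later calls return the cached value without taking the lock.
    // (A lookup that returns null is retried on the next call.)
    public static <T> Supplier<T> memoize(Supplier<T> expensiveLookup) {
        return new Supplier<T>() {
            private volatile T cached;

            @Override
            public T get() {
                T value = cached;
                if (value == null) {
                    synchronized (this) {
                        value = cached;
                        if (value == null) {
                            value = expensiveLookup.get();
                            cached = value;
                        }
                    }
                }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        AtomicInteger lookups = new AtomicInteger();
        Supplier<String> factoryClassName = memoize(() -> {
            lookups.incrementAndGet(); // stand-in for the real resource search
            return "org.apache.xml.dtm.ref.DTMManagerDefault";
        });
        for (int i = 0; i < 10_000; i++) {
            factoryClassName.get();
        }
        // prints "org.apache.xml.dtm.ref.DTMManagerDefault resolved with 1 lookup(s)"
        System.out.println(factoryClassName.get() + " resolved with " + lookups.get() + " lookup(s)");
    }
}
```

The trade-off is the one the reporter already names: a cached result will not notice configuration changes made after the first lookup.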



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org
For additional commands, e-mail: dev-h...@xalan.apache.org



[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439454#comment-16439454
 ] 

Uwe Schindler edited comment on XALANJ-2419 at 4/16/18 1:38 PM:


Fix works for me. +1 to start a release of XALANJ (and maybe also XERCES).


was (Author: thetaphi):
Fix works for me.

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("
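
To make the reported symptom concrete: an astral character such as U+1D11E is stored in Java as the surrogate pair D834 DD1E, and escaping each char separately gives two references to surrogate values (55348 and 56606 in decimal) instead of one reference to the real code point. The sketch below is not the attached patch; AstralNcrSketch and escapeNonAscii are hypothetical names, and it only illustrates combining the pair before writing a numeric character reference.

```java
public class AstralNcrSketch {

    // Escapes every non-ASCII code point as ONE hexadecimal numeric character
    // reference, so a surrogate pair yields a single NCR.
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // codePointAt combines a valid surrogate pair into one scalar value.
            int codePoint = text.codePointAt(i);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(codePoint).toUpperCase())
                   .append(';');
            }
            // Advance by 2 chars for an astral character, otherwise by 1.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF
        System.out.println(escapeNonAscii("a" + clef)); // prints a&#x1D11E;
    }
}
```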

[jira] [Commented] (XALANJ-2540) Very inefficient default behaviour for looking up DTMManager

2018-04-16 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439437#comment-16439437
 ] 

Gary Gregory commented on XALANJ-2540:
--

It's been a long time but I am pretty sure I released 2.7.1 out of 
[https://svn.apache.org/repos/asf/xalan/java/branches/xalan-j_2_7_1_maint/]

Gary




[jira] [Commented] (XALANJ-2540) Very inefficient default behaviour for looking up DTMManager

2018-04-16 Thread Matthew Broadhead (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439427#comment-16439427
 ] 

Matthew Broadhead commented on XALANJ-2540:
---

If I go to the Xalan front page [https://xalan.apache.org/], it says the code can 
be found at [http://svn.apache.org/repos/asf/xalan/xalan-j/trunk/], which just 
says "Not found". I am trying to find the 
org.apache.xml.dtm.ObjectFactory.lookUpFactoryClassName() function mentioned in 
the original bug report. Can anyone help?



[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439386#comment-16439386
 ] 

Uwe Schindler commented on XALANJ-2419:
---

bq. (Or are you using a jarjar'ed build inside Solr or Lucene?)

Hah, you noticed it. Yes, I am from Lucene/Solr. But this issue is more about XML 
processing in a local project of mine. I know that Solr is affected by this...

I also have a workaround for Apache Axis 1.4 (which was easy to fix without 
patching).


[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439379#comment-16439379
 ] 

Uwe Schindler edited comment on XALANJ-2419 at 4/16/18 12:48 PM:
-

Thanks for the fix, I will test it in a moment.

About a release: I am an Apache member and committer, so I might start a thread to 
push a release. As these bugs are horrible and break almost any XML handling of 
things like emojis, we should maybe do a bugfix release of serializer.jar. 
Keep in mind that this would also require a Xerces release, as 
Xerces and Xalan share serializer.jar (I think they depend on each other on 
Maven Central).

I would try to help with a release. The fix is indeed simple. 
Somebody should just commit it (I could theoretically do it, but that should be 
done by non-project members only as a last resort), and somebody else would 
press the button for the release.

BTW, my own projects such as Apache Solr are also affected by this bug (for people 
who still use XML instead of JSON with Solr).


was (Author: thetaphi):
Thanks for the fix, I will test it in a moment.

About a release: I am an Apache member and committer, so I might start a thread to 
push a release. As these bugs are horrible and break almost any XML handling of 
things like emojis, we should maybe do a bugfix release of serializer.jar. 
Keep in mind that this would also require a Xerces release, as 
Xerces and Xalan share serializer.jar (I think they depend on each other on 
Maven Central).

I would try to help with a release. The fix is indeed simple. 
Somebody should just commit it (I could theoretically do it, but that should be 
done by non-project members only as a last resort), and somebody else would 
press the button for the release.


[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439379#comment-16439379
 ] 

Uwe Schindler commented on XALANJ-2419:
---

Thanks for the fix, I will test it in a moment.

About a release: I am an Apache member and committer, so I might start a thread to 
push a release. As these bugs are horrible and break almost any XML handling of 
things like emojis, we should maybe do a bugfix release of serializer.jar. 
Keep in mind that this would also require a Xerces release, as 
Xerces and Xalan share serializer.jar (I think they depend on each other on 
Maven Central).

I would try to help with a release. The fix is indeed simple. 
Somebody should just commit it (I could theoretically do it, but that should be 
done by non-project members only as a last resort), and somebody else would 
press the button for the release.


[jira] [Updated] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesper Steen Møller updated XALANJ-2419:

Attachment: (was: XALANJ-2419-fix-v2.txt)


[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439335#comment-16439335
 ] 

Jesper Steen Møller commented on XALANJ-2419:
-

Hi [~thetaphi] - Version 3 adds the fix for normal HTML attribute content as 
well as URL attributes encoded without URL escapes.

A ToHTMLStream test is added, which also tests these corner cases (the 
UTF-8+URL-escaped byte sequences were already as expected, but now have a test).

But how do we get anybody to cut a new release?

(Or are you using a jarjar'ed build inside Solr or Lucene?)


[jira] [Updated] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesper Steen Møller updated XALANJ-2419:

Attachment: XALANJ-2419-tests-v3.txt

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v2.txt, XALANJ-2419-fix-v3.txt, 
> XALANJ-2419-tests-v2.txt, XALANJ-2419-tests-v3.txt

[jira] [Updated] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesper Steen Møller updated XALANJ-2419:

Attachment: XALANJ-2419-fix-v3.txt

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v2.txt, XALANJ-2419-fix-v3.txt, 
> XALANJ-2419-tests-v2.txt, XALANJ-2419-tests-v3.txt

[jira] [Commented] (XALANJ-2540) Very inefficient default behaviour for looking up DTMManager

2018-04-16 Thread Matthew Broadhead (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439216#comment-16439216
 ] 

Matthew Broadhead commented on XALANJ-2540:
---

?




[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439023#comment-16439023
 ] 

Uwe Schindler commented on XALANJ-2419:
---

Hi Jesper,
thanks! I applied the same patch as yours to my local checkout yesterday and 
I can confirm it fixes the XML case.

But it does not work for my HTML example above. The only workaround for the 
HTML output is the one already mentioned here (pass an encoding of UTF-16 and use a 
writer to write it to a UTF-8 file, provided you don't have a header with a charset 
in the HTML serialization).

The issue in ToHTMLStream seems to be a counting problem (it looks like it 
prints the whole surrogate pair correctly but forgets to increment the counter, so 
it also prints a hex escape of the second part):
- I had no HREF attributes in my test, so I was not affected by a URL encoding 
corner case.
- Normal attributes seem to have the above input character counting problem: 
the astral character is written correctly, but the low surrogate is then printed as 
an escape.
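
The counting problem described above can be illustrated with a small loop. This is a hypothetical sketch, not the ToHTMLStream code: SurrogateCounterSketch and writeAttributeValue are made-up names, and the hex escape for a lone surrogate is an assumption chosen to mirror the observed output. The line to look at is the extra increment after a pair has been written. Without it, the low surrogate is visited again on the next iteration and falls into the escape branch.

```java
public class SurrogateCounterSketch {

    // Copies characters through unchanged; a complete surrogate pair is
    // written as-is, and only an unpaired surrogate is escaped.
    public static String writeAttributeValue(char[] chars) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            char ch = chars[i];
            if (Character.isHighSurrogate(ch)
                    && i + 1 < chars.length
                    && Character.isLowSurrogate(chars[i + 1])) {
                out.append(ch).append(chars[i + 1]);
                i++; // consume the low surrogate too, so it is not escaped afterwards
            } else if (Character.isSurrogate(ch)) {
                out.append("&#x")
                   .append(Integer.toHexString(ch).toUpperCase())
                   .append(';');
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600 as a surrogate pair
        String written = writeAttributeValue(("x" + emoji + "y").toCharArray());
        System.out.println(written.equals("x" + emoji + "y")); // prints true
    }
}
```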

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v2.txt, XALANJ-2419-tests-v2.txt