[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions

2015-02-03 Thread Stephan Siano (JIRA)

[ 
https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303335#comment-14303335
 ] 

Stephan Siano commented on CAMEL-8273:
--

OK, I think I have a better grasp of the problem now:

A good solution for a single XPath evaluation from some byte/stream-like source 
using Saxon is to convert the data to SAXSource and then feed it into the XPath 
evaluation (the JDK parser only supports Node and InputSource and the latter 
allows XXE injection).
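As a rough illustration of the conversion described above, here is a minimal
sketch that wraps raw bytes in a SAXSource whose XMLReader refuses DOCTYPE
declarations. It uses only JDK classes; the class and method names are
hypothetical and are not taken from the patch, and the Saxon evaluation step
itself is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class SecureSaxSourceSketch {

    /** Wraps raw XML bytes in a SAXSource whose reader rejects DOCTYPE declarations. */
    public static SAXSource toSecureSaxSource(byte[] xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refusing DOCTYPEs outright also rules out external entity declarations.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
        factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        return new SAXSource(reader, new InputSource(new ByteArrayInputStream(xml)));
    }

    /** Returns true if the source's own reader parses its input without a SAX error. */
    public static boolean parses(SAXSource source) throws IOException {
        try {
            source.getXMLReader().parse(source.getInputSource());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] plain = "<order><id>1</id></order>".getBytes(StandardCharsets.UTF_8);
        byte[] xxe = ("<!DOCTYPE order [<!ENTITY x SYSTEM \"file:///etc/passwd\">]>"
                + "<order>&x;</order>").getBytes(StandardCharsets.UTF_8);
        System.out.println("plain document parses: " + parses(toSecureSaxSource(plain)));
        System.out.println("DOCTYPE document parses: " + parses(toSecureSaxSource(xxe)));
    }
}
```

An XPath engine that accepts a SAXSource would then pull events from this
reader and build its own tree, so no intermediate DOM is created.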

So far so good. However, if we have two consecutive XPath evaluations, the 
SAXSource way is not so good.
Let's assume we have an InputStream and are doing a setBody with an XPath 
expression that has SAXSource as documentType and returns a Document (which 
is a DocumentOverNodeInfo). If we now do another XPath evaluation (with 
SAXSource documentType) on the same data, the Document will be converted to a 
String in order to be wrapped into a SAXSource. Saxon's XPath evaluator will 
then build another TinyTree from that data, so we end up with two TinyTrees and 
a String in memory. Let's assume each TinyTree consumes four times the binary 
document size in memory and the String consumes twice the binary document 
size: we end up with 4 + 4 + 2 = 10 times the binary document size, which is 
about as much as a DOM document consumes in the first place (and we are 
parsing an XML document again that was already parsed).

What do you think about the following approach?
We consider Node input a first-class citizen and do not do any type 
conversion on it (as we can safely assume that every XPath implementation will 
be able to handle Nodes), no matter which value documentType has.
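The proposed rule could look roughly like the following sketch. The method
name and the converter argument are hypothetical; the converter is a plain
function standing in for a type-converter lookup and is not Camel's
TypeConverter API.

```java
import java.util.function.BiFunction;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class NodeFirstSketch {

    /**
     * Picks the object handed to the XPath engine. The converter argument is a
     * hypothetical stand-in for a type-converter lookup, not Camel's API.
     */
    public static Object documentFor(Object body, Class<?> documentType,
                                     BiFunction<Class<?>, Object, Object> converter) {
        if (body instanceof Node) {
            // Already parsed: pass through untouched, whatever documentType says.
            return body;
        }
        // Everything else still goes through conversion; Document stays the default.
        Class<?> target = documentType != null ? documentType : Document.class;
        return converter.apply(target, body);
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        // A converter that fails loudly shows that Node input never reaches it.
        Object result = documentFor(doc, SAXSource.class, (type, body) -> {
            throw new IllegalStateException("conversion attempted for " + type);
        });
        System.out.println("same instance returned: " + (result == doc));
    }
}
```

With this ordering, the result of a first XPath evaluation can be fed into a
second one without the Document-to-String-to-SAXSource round trip described
above.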


 More flexible selection of default documentType in XPath expressions
 

 Key: CAMEL-8273
 URL: https://issues.apache.org/jira/browse/CAMEL-8273
 Project: Camel
  Issue Type: Improvement
  Components: camel-core
Reporter: Stephan Siano
Assignee: Claus Ibsen
 Fix For: Future

 Attachments: 
 0001-CAMEL-8273-More-flexible-selection-of-default-docume.patch


 In the current implementation of XPath, if no documentType is defined (likely 
 in most cases), the document used for XPath evaluation is parsed into a (DOM) 
 Document using the JDK XML parser before the XPath expression is applied to it.
 For large documents this might be resource intensive, especially if the XPath 
 is evaluated using a more efficient parser like Saxon.
 With the current implementation it is possible to work around this by setting 
 a documentType attribute on the XPath expression, but doing this efficiently 
 requires some internal knowledge about the previous component in the Camel 
 route (which type it creates) and the qualities of the XML parser used (e.g. 
 the JDK parser accepts only InputSource and Node as input types for XPath 
 evaluation, whereas Saxon also supports other types like SAXSource).
 The attached patch will make the data type used by default for XPath 
 evaluation more flexible (depending on the type of the input).
 There are two cases to differentiate:

 documentType is set on the XPath expression:
   current implementation:
     1. try to convert to the documentType
     2. if that fails, do some extra conversions for some additional data 
        types (WrappedFile, BeanInvocation, String)
     3. if that fails, throw an exception
   new implementation:
     1. try to convert to the documentType
     2. if that fails, use the message if it is of type Node, InputSource or 
        DOMSource, or do some type conversions for specific data types 
        (WrappedFile, BeanInvocation, String, InputStream, Reader, byte[]...)
     3. if that fails, throw an exception

 documentType is not set on the XPath expression:
   current implementation:
     this is actually the same as if documentType was set to Document
   new implementation:
     1. use the message if it is of type Node, InputSource or DOMSource, or 
        do some type conversions (to InputSource) for specific data types 
        (WrappedFile, BeanInvocation, String, InputStream, Reader, byte[]...)
     2. if the message is not of one of the types above, convert to DOM 
        Document
     3. if this fails, throw an exception
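 The selection for the case where documentType is not set can be sketched as 
 follows. This is an illustration under assumptions, not the code from the 
 attached patch: the class and method names are hypothetical, and the 
 Camel-specific types (WrappedFile, BeanInvocation) are left out so that the 
 example needs only the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DefaultDocumentSelectionSketch {

    /**
     * Returns the object to evaluate the XPath against when no documentType is
     * set, or null when the caller should fall back to a DOM Document conversion.
     */
    public static Object select(Object body) {
        // Step 1a: types the XPath engine can consume directly are passed through.
        if (body instanceof Node || body instanceof InputSource || body instanceof DOMSource) {
            return body;
        }
        // Step 1b: unparsed data is only wrapped, so parsing is left to the engine.
        if (body instanceof String) {
            return new InputSource(new StringReader((String) body));
        }
        if (body instanceof InputStream) {
            return new InputSource((InputStream) body);
        }
        if (body instanceof Reader) {
            return new InputSource((Reader) body);
        }
        if (body instanceof byte[]) {
            return new InputSource(new ByteArrayInputStream((byte[]) body));
        }
        // Step 2 (convert to DOM Document) is the caller's job in this sketch.
        return null;
    }

    public static void main(String[] args) {
        System.out.println(select("<a/>").getClass().getSimpleName());
        System.out.println(select(new byte[] {'<', 'a', '/', '>'}).getClass().getSimpleName());
        System.out.println(select(42));
    }
}
```

 Note that wrapping into an InputSource leaves the parsing to the XPath engine, 
 which is where the XXE concern raised in the comments comes from.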



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions

2015-02-01 Thread Stephan Siano (JIRA)

[ 
https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300929#comment-14300929
 ] 

Stephan Siano commented on CAMEL-8273:
--

Saxon cannot do XPath in streaming mode (I actually don't think it is even 
possible to have a full XPath implementation with streaming), but it supports 
XPath with TinyTree (which is much smaller than the Xerces DOM). If the XML 
parsing is done during the XPath evaluation (the document is provided not as a 
DOM tree but as something else, like InputSource), Saxon will parse into that 
TinyTree, which was actually the purpose of my patch. Unfortunately I 
overlooked the XXE thing.

I think I will check two things now:
1. Whether Saxon will also allow XXE attacks if some non-parsed type (like 
InputSource) is used for the conversion.
2. If that is the case, convert to NodeInfo (the Saxon interface for DOM-like 
nodes; the TinyTree is an implementation of it) and do the XPath evaluation 
with that.

Both ways require setting the documentType parameter to something other than 
Document. Unfortunately I don't see a way to do that automatically in case 
Saxon is used...



[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions

2015-01-27 Thread Claus Ibsen (JIRA)

[ 
https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294756#comment-14294756
 ] 

Claus Ibsen commented on CAMEL-8273:


Yeah, AFAIR the JDK XPath engine relies on working with DOM structures. 

I have heard about custom XPath-like engines that work with streams (not 
supporting all of XPath, but covering most use cases anyway), but they were 
custom made. 

There is vtd-xml for example
http://camel.apache.org/vtd-xml

And the WSO2 guys seem to have a custom streaming XPath engine in their 
commercial ESB 
https://github.com/apache/synapse/tree/trunk/java/modules/core/src/main/java/org/apache/synapse/util

But not in the open source Apache Synapse project
https://github.com/apache/synapse/tree/trunk/java/modules/core/src/main/java/org/apache/synapse/util

Notice how the streaming XPath compiler is missing at Apache. 

Also not sure how far Saxon can go with streams only.

Some old SO questions
http://stackoverflow.com/questions/996103/streaming-xpath-evaluation



[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions

2015-01-26 Thread Stephan Siano (JIRA)

[ 
https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291819#comment-14291819
 ] 

Stephan Siano commented on CAMEL-8273:
--

Crap, you are right, I only ran the unit tests in org.apache.camel.language, 
not the ones in org.apache.camel.builder.xml.

I am not 100% sure what the failed XPathTest.testXPathSplitConcurrent() means 
(it evaluates an XPath and then concurrently tries to create a Document from 
the Nodes with a TypeConverter in 100 threads, and I actually don't understand 
why that behavior should change depending on the DocumentFactoryImpl being 
instantiated by Camel or by the JDK within the XPath.eval method), but the 
failed XPathFeatureTest.testXPathResult() looks like a showstopper for the 
whole approach to me. If the XPath implementation from the JDK gets an 
InputSource as the source of the evaluation, it will instantiate a DOM parser 
with default settings (that allow XXE), and I see no way around that.

I will do some further analysis on that, but it might really be necessary to 
do the DOM conversion before the XPath (as in the current coding).
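
For illustration, doing the DOM conversion before the XPath with a hardened 
parser could look like the sketch below. It uses only JDK APIs; the class and 
method names are hypothetical, and the feature settings are one common 
hardening choice rather than the exact configuration Camel uses.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class DomBeforeXPathSketch {

    /** Parses with a factory that rejects DOCTYPEs, so no entity can be declared. */
    public static Document parseSecurely(byte[] xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        dbf.setXIncludeAware(false);
        dbf.setExpandEntityReferences(false);
        return dbf.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
    }

    /** Evaluates the expression against an already parsed Node, never an InputSource. */
    public static String evaluate(String expression, byte[] xml) throws Exception {
        Document doc = parseSecurely(xml);
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        byte[] plain = "<order><id>42</id></order>".getBytes(StandardCharsets.UTF_8);
        System.out.println(evaluate("/order/id", plain));

        byte[] xxe = ("<!DOCTYPE order [<!ENTITY x SYSTEM \"file:///etc/passwd\">]>"
                + "<order><id>&x;</id></order>").getBytes(StandardCharsets.UTF_8);
        try {
            evaluate("/order/id", xxe);
            System.out.println("DOCTYPE accepted");
        } catch (SAXException e) {
            System.out.println("DOCTYPE rejected");
        }
    }
}
```

Because the evaluator only ever sees a Node here, it never creates a parser of 
its own, which is the property the InputSource route loses.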

 More flexible selection of default documentType in XPath expressions
 

 Key: CAMEL-8273
 URL: https://issues.apache.org/jira/browse/CAMEL-8273
 Project: Camel
  Issue Type: Improvement
  Components: camel-core
Reporter: Stephan Siano
Assignee: Claus Ibsen
 Fix For: 2.15.0

 Attachments: 
 0001-CAMEL-8273-More-flexible-selection-of-default-docume.patch




[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions

2015-01-26 Thread Stephan Siano (JIRA)

[ 
https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293031#comment-14293031
 ] 

Stephan Siano commented on CAMEL-8273:
--

It seems as if the issues with this solution are unresolvable. I have to 
withdraw this patch. Sorry for the hassle.



[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions

2015-01-26 Thread Claus Ibsen (JIRA)

[ 
https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291609#comment-14291609
 ] 

Claus Ibsen commented on CAMEL-8273:


I get some unit test failures in camel-core such as

 XPathTest>TestSupport.runBare:58->testXPathSplitConcurrent:381 null
