[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions
[ https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303335#comment-14303335 ]

Stephan Siano commented on CAMEL-8273:

OK, I think I have a better grasp of the problem now:

A good solution for a single XPath evaluation from some byte/stream-like source using Saxon is to convert the data to SAXSource and then feed it into the XPath evaluation (the JDK parser only supports Node and InputSource, and the latter allows XXE injection). So far so good.

However, if we have two consecutive XPath evaluations, the SAXSource way is not so good: let's assume we have an InputStream and are doing a setBody with an XPath expression that has SAXSource as documentType and returns a Document (which is a DocumentOverNodeInfo). If we now do another XPath evaluation (with SAXSource documentType) on the same data, the Document will be converted to a String in order to be wrapped into a SAXSource. Saxon's XPath evaluator will then build another TinyTree from that data, so we end up with two TinyTrees and a String in memory. Let's assume each TinyTree consumes four times the binary document size in memory and the String consumes two times the binary document size; we then end up with ten times the binary document size, which is about as much as a DOM document consumes in the first place (and we are parsing an XML document again that was already parsed).

What do you think about the following approach? We consider Node input a first-class citizen and do not do any type conversion on it (as we can safely assume that each XPath implementation will be able to handle it), no matter which value documentType has.
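The memory estimate in the comment above can be written out as a small calculation. This is only a sketch of that arithmetic: the factors 4 and 2 are the assumptions stated in the comment, not measurements, and the class name, method names and the 50 MiB document size are made up for illustration.

```java
// Back-of-the-envelope estimate for two consecutive XPath evaluations via SAXSource.
// All factors are the assumptions from the comment above, not measured values.
final class XPathMemoryEstimate {

    static final int TINY_TREE_FACTOR = 4; // assumed: one TinyTree ~ 4x the binary document
    static final int STRING_FACTOR = 2;    // assumed: the String copy ~ 2x the binary document

    /** Two TinyTrees plus the intermediate String, as a multiple of the document size. */
    static int factorForTwoEvaluations() {
        return 2 * TINY_TREE_FACTOR + STRING_FACTOR;
    }

    /** Estimated bytes held for a document of the given binary size. */
    static long estimatedBytes(long documentBytes) {
        return documentBytes * factorForTwoEvaluations();
    }

    public static void main(String[] args) {
        long documentBytes = 50L * 1024 * 1024; // hypothetical 50 MiB input
        System.out.println("factor: " + factorForTwoEvaluations());
        System.out.println("bytes:  " + estimatedBytes(documentBytes));
    }
}
```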
More flexible selection of default documentType in XPath expressions

Key: CAMEL-8273
URL: https://issues.apache.org/jira/browse/CAMEL-8273
Project: Camel
Issue Type: Improvement
Components: camel-core
Reporter: Stephan Siano
Assignee: Claus Ibsen
Fix For: Future
Attachments: 0001-CAMEL-8273-More-flexible-selection-of-default-docume.patch

In the current implementation of XPath, if no documentType is defined (likely in most cases), the document used for XPath evaluation is parsed into a (DOM) Document using the JDK XML parser before the XPath expression is applied to it. For large documents this might be resource intensive, especially if the XPath is evaluated using a more efficient parser like Saxon.

With the current implementation it is possible to work around this by setting a documentType attribute on the XPath expression, but doing this efficiently requires some internal knowledge about the previous component in the Camel route (which type it creates) and the qualities of the XML parser used (e.g. the JDK parser accepts only InputSource and Node as input types for XPath evaluation, whereas Saxon also supports other types like SAXSource).

The attached patch will make the data type used by default for XPath evaluation more flexible (depending on the type of the input). There are two cases to differentiate:

documentType is set on the XPath expression:

current implementation:
1. try to convert to the documentType
2. if that fails, do some extra conversions for some additional data types (WrappedFile, BeanInvocation, String)
3. if that fails, throw an exception

new implementation:
1. try to convert to the documentType
2. if that fails, use the message if it is of type Node, InputSource or DOMSource, or do some type conversions for specific data types (WrappedFile, BeanInvocation, String, InputStream, Reader, byte[]...)
3. if that fails, throw an exception

documentType is not set on the XPath expression:

old implementation:
this is actually the same as if documentType was set to Document

new implementation:
1. use the message if it is of type Node, InputSource or DOMSource, or do some type conversions for specific data types (WrappedFile, BeanInvocation, String, InputStream, Reader, byte[]...) (to InputSource)
2. if the old message is not of one of the types above, convert to DOM Document
3. if this fails, throw an exception

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
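The three steps listed for the "documentType is not set" case of the new implementation could look roughly like the following. This is a sketch using only JDK types, not the code from the attached patch: the class XPathInputSelector and the domFallback parameter are hypothetical, the fallback stands in for Camel's conversion to a DOM Document, and the Camel-specific types WrappedFile and BeanInvocation are left out.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Function;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: picks the object handed to the XPath engine
// when no documentType is configured.
final class XPathInputSelector {

    static Object select(Object body, Function<Object, Document> domFallback) {
        if (body == null) {
            throw new IllegalArgumentException("No message body to evaluate");
        }
        // Step 1a: types that are passed on unchanged.
        if (body instanceof Node || body instanceof InputSource || body instanceof DOMSource) {
            return body;
        }
        // Step 1b: stream-like types are wrapped into an InputSource without parsing.
        if (body instanceof String) {
            return new InputSource(new StringReader((String) body));
        }
        if (body instanceof InputStream) {
            return new InputSource((InputStream) body);
        }
        if (body instanceof Reader) {
            return new InputSource((Reader) body);
        }
        if (body instanceof byte[]) {
            return new InputSource(new ByteArrayInputStream((byte[]) body));
        }
        // Step 2: anything else goes through the (assumed) DOM conversion.
        Document document = domFallback.apply(body);
        // Step 3: give up if that conversion produced nothing.
        if (document == null) {
            throw new IllegalArgumentException(
                    "Cannot convert " + body.getClass().getName() + " to an XPath input");
        }
        return document;
    }
}
```

Because a Document is itself a Node, an already parsed body is returned as-is in step 1a and is never parsed a second time.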
[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions
[ https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14300929#comment-14300929 ]

Stephan Siano commented on CAMEL-8273:

Saxon cannot do XPath in streaming mode (I actually don't think it is even possible to have a full XPath implementation with streaming), but it supports XPath with TinyTree (which is much smaller than the Xerces DOM). If the XML parsing is done during the XPath evaluation (the document is provided not as a DOM tree but as something else, like InputSource), Saxon will parse into that TinyTree, which was actually the purpose of my patch. Unfortunately I overlooked the XXE thing.

I think I will check two things now:
1. whether Saxon will also allow XXE attacks if some non-parsed type (like InputSource) is used for the conversion
2. if that is the case, convert to NodeInfo (which is the Saxon interface for DOM-like nodes; the TinyTree is an implementation of it) and do the XPath evaluation with that

Both ways require setting the documentType parameter to something other than Document. Unfortunately I don't see a way to do that automatically in case Saxon is used...
[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions
[ https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294756#comment-14294756 ]

Claus Ibsen commented on CAMEL-8273:

Yeah, AFAIR the JDK XPath engine relies on working with DOM structures.

I have heard about custom XPath-like engines that work with streams (not supporting all of XPath, but covering most use cases anyway), but they were custom made. There is vtd-xml, for example:
http://camel.apache.org/vtd-xml

And the wso2 guys seem to have a custom streaming XPath engine in their commercial ESB:
https://github.com/apache/synapse/tree/trunk/java/modules/core/src/main/java/org/apache/synapse/util

But not in the open source Apache Synapse project:
https://github.com/apache/synapse/tree/trunk/java/modules/core/src/main/java/org/apache/synapse/util

Notice how the streaming XPath compiler is missing at Apache.

Also not sure how far Saxon can go with streams only.

Some old SO questions:
http://stackoverflow.com/questions/996103/streaming-xpath-evaluation
[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions
[ https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291819#comment-14291819 ]

Stephan Siano commented on CAMEL-8273:

Crap, you are right, I only ran the unit tests in org.apache.camel.language, not the ones in org.apache.camel.builder.xml.

I am not 100% sure what the failed XPathTest.testXPathSplitConcurrent() means (it evaluates an XPath and then concurrently tries to create a Document from the Nodes with a TypeConverter in 100 threads, and I actually don't understand why that behavior should change depending on whether the DocumentFactoryImpl is instantiated by Camel or by the JDK within the XPath.eval method), but the failed XPathFeatureTest.testXPathResult() looks like a showstopper for the whole approach to me. If the XPath implementation from the JDK gets an InputSource as source for the evaluation, it will instantiate a DOM parser with default settings (which allow XXE), and I see no way around that.

I will do some further analysis on that, but it might really be necessary to do the DOM conversion before the XPath evaluation (as in the current code).

Key: CAMEL-8273
URL: https://issues.apache.org/jira/browse/CAMEL-8273
Project: Camel
Issue Type: Improvement
Components: camel-core
Reporter: Stephan Siano
Assignee: Claus Ibsen
Fix For: 2.15.0
Attachments: 0001-CAMEL-8273-More-flexible-selection-of-default-docume.patch
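The fallback mentioned in the comment above, doing the DOM conversion before the XPath evaluation, can be sketched with plain JDK APIs. This is not Camel's code: the class name SafeXPathEvaluation is made up, and the parser settings shown (secure processing plus the Xerces disallow-doctype-decl feature supported by the JDK's built-in parser) are one common way to block XXE, chosen here as an assumption.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: parse with a locked-down DOM parser first,
// then hand the resulting Node to the JDK XPath engine.
final class SafeXPathEvaluation {

    static Document parseWithoutDoctype(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Reject any DOCTYPE, so no external (or internal) entities can be declared.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setExpandEntityReferences(false);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Evaluates the expression against the pre-parsed document and returns its string value. */
    static String evaluate(String expression, String xml) {
        try {
            Document document = parseWithoutDoctype(xml);
            XPath xpath = XPathFactory.newInstance().newXPath();
            // A Node is passed in, so the XPath engine does not create its own parser.
            return xpath.evaluate(expression, document);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("/order/id", "<order><id>42</id></order>"));
    }
}
```

The trade-off is the one discussed in this thread: the input is always materialized as a DOM tree, which costs memory but keeps the parser configuration under the caller's control.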
[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions
[ https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14293031#comment-14293031 ]

Stephan Siano commented on CAMEL-8273:

It seems as if the issues with this solution are unresolvable. I have to withdraw this patch. Sorry for the hassle.
[jira] [Commented] (CAMEL-8273) More flexible selection of default documentType in XPath expressions
[ https://issues.apache.org/jira/browse/CAMEL-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291609#comment-14291609 ]

Claus Ibsen commented on CAMEL-8273:

I get some unit test failures in camel-core, such as:

XPathTest>TestSupport.runBare:58->testXPathSplitConcurrent:381 null