[jira] Commented: (TIKA-185) XML files with (unsatisfied) SYSTEM entities can not be extracted

Uwe Schindler (JIRA) Fri, 09 Jan 2009 03:35:27 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662333#action_12662333
 ]


Uwe Schindler commented on TIKA-185:
------------------------------------

Andrzej: The code must not go into AutoDetectParser; it belongs in 
XMLParser. In principle, to provide a general "external entity no-op resolver", 
the code from OpenOffice could be copied into the DefaultHandler that Tika's 
XMLParser passes to the SAX parser.
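
To illustrate the "no-op resolver" idea, here is a minimal sketch of a 
DefaultHandler subclass whose resolveEntity() always hands back an empty 
document. The class name NoOpEntityHandler is hypothetical; this is neither 
the actual OpenOffice parser code nor Tika's handler, only the general shape 
using the standard SAX API.

```java
import java.io.StringReader;

import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: every external entity resolves to an empty document,
// so the parser never tries to open the system ID itself.
class NoOpEntityHandler extends DefaultHandler {

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        // Returning a non-null InputSource tells the SAX parser to read the
        // entity from here instead of opening systemId as a URL.
        return new InputSource(new StringReader(""));
    }
}
```

Since SAXParser.parse(InputStream, DefaultHandler) registers the handler as 
the entity resolver, passing such a handler should be enough to keep an 
unknown protocol like chrome:// from reaching java.net.URL.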

The problem with this is: sometimes external entities can be resolved, or must 
be resolved. For example, we have some documents that contain entities like 
"&includeSomething;" that are defined externally in a file linked by URL. 
Without these entities a lot of information gets lost. In my opinion, the 
entity resolver in TIKA should work like this:

- Try to resolve the entity as a URL (new URL(...)); if the URL is malformed, 
return an empty StringReader. If it works, return the URL's InputStream. If 
some other exception occurs (FileNotFound etc.), return an empty StringReader.
- If resolving as a URL does not work, try to return a FileInputStream() on the 
system ID. If that also fails, return an empty StringReader.

This resolving mechanism tries to get as much as possible from the external 
entity. If returning the empty StringReader makes the XML invalid, then the 
parser stops later (so it breaks no more than before).
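
The steps above could be sketched roughly as follows. This is only an 
illustration, not the proposed patch: the class name LenientEntityResolver and 
its helper methods are hypothetical, and it assumes one reading of the two 
steps, namely that any URL failure (malformed or unreachable) falls through to 
the file attempt before the empty StringReader is returned.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.net.URL;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: URL first, then local file, then an empty document.
class LenientEntityResolver implements EntityResolver {

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        if (systemId != null) {
            InputStream stream = openAsUrl(systemId);
            if (stream == null) {
                stream = openAsFile(systemId);
            }
            if (stream != null) {
                InputSource source = new InputSource(stream);
                // Keep the system ID so nested relative references still work.
                source.setSystemId(systemId);
                return source;
            }
        }
        // Nothing could be opened: hand the parser an empty entity.
        return new InputSource(new StringReader(""));
    }

    private static InputStream openAsUrl(String systemId) {
        try {
            return new URL(systemId).openStream();
        } catch (IOException e) {
            // Covers MalformedURLException (e.g. "unknown protocol: chrome")
            // as well as FileNotFoundException and other I/O failures.
            return null;
        }
    }

    private static InputStream openAsFile(String systemId) {
        try {
            return new FileInputStream(systemId);
        } catch (IOException | SecurityException e) {
            return null;
        }
    }
}
```

Such a resolver could be registered with XMLReader.setEntityResolver(), or 
its logic placed in the resolveEntity() method of the DefaultHandler.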

I may create a patch for the XML parser. If we include this code directly in 
the XML parser, the resolving mechanism of OpenDocument could be removed 
(because the one from the XML reader could be reused).

Uwe

> XML files with (unsatisfied) SYSTEM entities can not be extracted
> -----------------------------------------------------------------
>
>                 Key: TIKA-185
>                 URL: https://issues.apache.org/jira/browse/TIKA-185
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.2
>            Reporter: Andrzej Rusin
>            Priority: Minor
>         Attachments: xmlTest.xml, xmlTest2.xml
>
>
> When trying to extract an XPI file (Firefox extension, which is probably not 
> the best candidate for extraction) I got the exception below.
> It was caused by SYSTEM entities referring to the chrome:// protocol.
> However, obviously any XML file that contains SYSTEM entities which cannot 
> be accessed at the time of extraction will not be extracted properly.
> Here is the stack trace:
> java.net.MalformedURLException: unknown protocol: chrome
>    at java.net.URL.<init>(URL.java:574)
>    at java.net.URL.<init>(URL.java:464)
>    at java.net.URL.<init>(URL.java:413)
>    at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.startPE(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.skipSeparator(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
>    at org.apache.xerces.impl.XMLDTDScannerImpl.scanDTDInternalSubset(Unknown Source)
>    at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
>    at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>    at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>    at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:57)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
>    at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:93)
>    at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
