[ https://issues.apache.org/jira/browse/TIKA-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrzej Rusin updated TIKA-185:
-------------------------------

    Attachment: xmlTest2.xml
                xmlTest.xml

Minimal test case for the issue.

xmlTest.xml: when blah.xml is present at the time of extraction (with any content, e.g. "blah"), everything works OK; otherwise extraction fails.
xmlTest2.xml: always fails because the chrome protocol is not defined. However, XML files using this protocol do exist, e.g. in Firefox plugins.

> XML files with (unsatisfied) SYSTEM entities can not be indexed
> ---------------------------------------------------------------
>
>                 Key: TIKA-185
>                 URL: https://issues.apache.org/jira/browse/TIKA-185
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.2
>            Reporter: Andrzej Rusin
>            Priority: Minor
>         Attachments: xmlTest.xml, xmlTest2.xml
>
>
> When trying to extract an XPI file (a Firefox extension, which is probably not
> the best candidate for extraction), I got the exception below.
> It was caused by SYSTEM entities referring to the chrome:// protocol.
> However, any XML file that contains SYSTEM entities which cannot
> be accessed at the time of extraction will not be extracted properly.
> Here is the stack trace:
> java.net.MalformedURLException: unknown protocol: chrome
>       at java.net.URL.<init>(URL.java:574)
>       at java.net.URL.<init>(URL.java:464)
>       at java.net.URL.<init>(URL.java:413)
>       at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
>       at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>       at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>       at org.apache.xerces.impl.XMLDTDScannerImpl.startPE(Unknown Source)
>       at org.apache.xerces.impl.XMLDTDScannerImpl.skipSeparator(Unknown Source)
>       at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
>       at org.apache.xerces.impl.XMLDTDScannerImpl.scanDTDInternalSubset(Unknown Source)
>       at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
>       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>       at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>       at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>       at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>       at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:57)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
>       at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:93)
>       at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
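
The failure described in the issue can be illustrated, and one possible workaround sketched, with a plain JAXP SAX parser. The code below is not Tika's code and is not a proposed patch: the class name OfflineEntityDemo, the extractText helper, and the sample document (including its chrome://sample/... URL) are hypothetical, and the attached xmlTest2.xml may look different. The sketch assumes that treating unresolvable SYSTEM entities as empty is acceptable for text extraction.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class OfflineEntityDemo {

    // Hypothetical sample document; NOT the content of the attached xmlTest2.xml.
    static final String XML =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE window [\n"
        + "  <!ENTITY % sampleDTD SYSTEM \"chrome://sample/locale/sample.dtd\">\n"
        + "  %sampleDTD;\n"
        + "]>\n"
        + "<window>some text</window>";

    /**
     * Parses XML and returns its character content.
     * If offline is true, every external entity is resolved to empty content
     * instead of being fetched from its system identifier.
     */
    static String extractText(boolean offline) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId)
                    throws IOException, SAXException {
                if (!offline) {
                    // Returns null, so the parser opens the system identifier itself.
                    return super.resolveEntity(publicId, systemId);
                }
                // Assumption: an empty entity is an acceptable substitute.
                return new InputSource(new StringReader(""));
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(
            new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        try {
            System.out.println("default resolution: " + extractText(false));
        } catch (Exception e) {
            System.out.println("default resolution failed: " + e);
        }
        System.out.println("offline resolution: " + extractText(true));
    }
}
```

With offline set to false the parser tries to open the chrome:// URL itself, so on a typical JDK the parse is expected to fail in the way the stack trace shows. With offline set to true the entity is replaced by empty content and the element text is still extracted. The trade-off is that any text or declarations defined in the external entity are silently lost.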