[ https://issues.apache.org/jira/browse/TIKA-185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662187#action_12662187 ]
Jukka Zitting commented on TIKA-185:
------------------------------------

We should probably prevent Tika and the parsers it invokes from accessing any external entities when parsing XML (or any other file format, for that matter). The only input for content extraction should be the input stream and input metadata passed to the parse() method.

> XML files with (unsatisfied) SYSTEM entities can not be indexed
> ---------------------------------------------------------------
>
>                 Key: TIKA-185
>                 URL: https://issues.apache.org/jira/browse/TIKA-185
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.2
>            Reporter: Andrzej Rusin
>            Priority: Minor
>         Attachments: xmlTest.xml, xmlTest2.xml
>
>
> When trying to extract an XPI file (a Firefox extension, which is probably
> not the best candidate for extraction), I got the exception below.
> It was caused by SYSTEM entities referring to the chrome:// protocol.
> More generally, any XML file that contains SYSTEM entities which cannot
> be accessed at the time of extraction will not be extracted properly.
> Here is the stack trace:
> java.net.MalformedURLException: unknown protocol: chrome
>       at java.net.URL.<init>(URL.java:574)
>       at java.net.URL.<init>(URL.java:464)
>       at java.net.URL.<init>(URL.java:413)
>       at org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(Unknown Source)
>       at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>       at org.apache.xerces.impl.XMLEntityManager.startEntity(Unknown Source)
>       at org.apache.xerces.impl.XMLDTDScannerImpl.startPE(Unknown Source)
>       at org.apache.xerces.impl.XMLDTDScannerImpl.skipSeparator(Unknown Source)
>       at org.apache.xerces.impl.XMLDTDScannerImpl.scanDecls(Unknown Source)
>       at org.apache.xerces.impl.XMLDTDScannerImpl.scanDTDInternalSubset(Unknown Source)
>       at org.apache.xerces.impl.XMLDocumentScannerImpl$DTDDispatcher.dispatch(Unknown Source)
>       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>       at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>       at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>       at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>       at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:57)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
>       at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:93)
>       at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:80)
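
To illustrate the suggestion in the comment (no external entity access during extraction), here is a minimal sketch using only the standard JAXP/SAX API. It is one possible approach, not necessarily what Tika does or will do: the handler answers every entity lookup with an empty in-memory source, so the parser never tries to open the SYSTEM identifier. The class name OfflineXmlText, the method extract() and the chrome://example/... identifier are hypothetical names made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): collects the character data of an
// XML stream without letting the parser fetch any external entity.
final class OfflineXmlText {

    static String extract(InputStream stream) throws IOException, SAXException {
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Return an empty entity instead of letting the parser open
                // systemId (which may be chrome://, http://, file://, ...).
                return new InputSource(new StringReader(""));
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            // SAXParser.parse() registers the handler as the EntityResolver too.
            parser.parse(stream, handler);
        } catch (ParserConfigurationException e) {
            throw new IOException("Unable to create a SAX parser", e);
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample input with an unreachable SYSTEM parameter entity, similar in
        // spirit to the XPI case from the report (identifier is made up).
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE doc [\n"
                + "  <!ENTITY % ext SYSTEM \"chrome://example/locale/example.dtd\">\n"
                + "  %ext;\n"
                + "]>\n"
                + "<doc>hello</doc>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(extract(in));
    }
}
```

One consequence of this design is that anything declared only in the skipped external entity is unavailable, so documents that rely on entities defined there will lose that text. For content extraction that trade-off seems acceptable, since the input stream stays the only source of data.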