[ https://issues.apache.org/jira/browse/TIKA-236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-236.
--------------------------------

    Resolution: Duplicate
      Assignee: Jukka Zitting

This problem is caused by the package containing a malformed XML file that the XMLParser fails to process. Such a failure should cause a TikaException, which the package parser would normally just ignore before proceeding to the next package entry, but due to TIKA-237 the XMLParser is incorrectly throwing a SAXException in that case.

Now with TIKA-237 fixed this is no longer the case, and the problem described here no longer occurs. Thus I'm resolving this as a Duplicate of TIKA-237.

> Premature end of file Exception
> -------------------------------
>
>                 Key: TIKA-236
>                 URL: https://issues.apache.org/jira/browse/TIKA-236
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 0.3
>         Environment: Windows / Unix
>            Reporter: Karl Heinz Marbaise
>            Assignee: Jukka Zitting
>            Priority: Critical
>
> I have reduced the problem down to the following:
>     @Test
>     public void testZipFile() throws IOException, SAXException, TikaException {
>         String fileName = "lucene-2.2.0-src.zip";
>         FileInputStream fis = new FileInputStream(fileName);
>         Metadata metadata = new Metadata();
>         metadata.set(Metadata.RESOURCE_NAME_KEY, fileName);
>         AutoDetectParser parser = new AutoDetectParser();
>         DefaultHandler handler = new BodyContentHandler();
>         parser.parse(fis, handler, metadata);
>         System.out.println("Handler:" + handler.toString());
>     }
> and the result of the above is the following:
> FAILED: testZipFile
> org.xml.sax.SAXParseException: Premature end of file.
>     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>     at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:176)
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:59)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:78)
>     at org.apache.tika.parser.pkg.PackageParser.parseEntry(PackageParser.java:93)
>     at org.apache.tika.parser.pkg.ZipParser.parse(ZipParser.java:56)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:108)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:78)
>     at com.soebes.supose.scan.ScanZIPDocumentTest.testZipFile(ScanZIPDocumentTest.java:30)
>     ... Removed 22 stack frames
> I have tested the ZIP file with 7-zip and with unzip on the command line to
> see if it has any errors in there... but there seemed to be none. If you need
> this file I can attach it, but it's about 7 MB in size...

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
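
For readers who want to see the behaviour described in the resolution note in isolation, here is a minimal sketch using only the JDK. It is not Tika's code: EntryParseException is a hypothetical stand-in for TikaException, and parseXml, parseArchive, readEntry and sampleZip are illustrative names. The idea is that a SAX failure on one entry gets wrapped in the parser-level exception, so the loop over package entries can skip the broken entry instead of aborting the whole archive.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SkipMalformedEntries {

    /** Hypothetical stand-in for TikaException: "this document could not be parsed". */
    static class EntryParseException extends Exception {
        EntryParseException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Parses one entry as XML and wraps SAX failures instead of letting them escape. */
    static void parseXml(byte[] content) throws IOException, EntryParseException {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(content), new DefaultHandler());
        } catch (SAXException | ParserConfigurationException e) {
            throw new EntryParseException("XML parse error", e);
        }
    }

    /** Returns the names of the entries that parsed; malformed entries are skipped. */
    static List<String> parseArchive(InputStream archive) throws IOException {
        List<String> parsed = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(archive)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                try {
                    // Buffer the entry first so the XML parser never touches the zip stream.
                    parseXml(readEntry(zip));
                    parsed.add(entry.getName());
                } catch (EntryParseException e) {
                    // Ignore the broken entry and proceed to the next one.
                }
            }
        }
        return parsed;
    }

    /** Reads the remaining bytes of the current zip entry. */
    static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    /** Builds a small in-memory ZIP: one empty (malformed) XML entry, one well-formed one. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("broken.xml"));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("good.xml"));
            zip.write("<root/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The empty entry is skipped; only the well-formed entry is reported.
        System.out.println(parseArchive(new ByteArrayInputStream(sampleZip())));
    }
}
```

If parseXml let the SAXException propagate unwrapped, the catch block in parseArchive would not match it and the first malformed entry would abort the whole archive, which is the symptom reported in this issue.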