Hi, Tilman
I have encountered another problem.
t1.xml is a simple plain text file, not a standard XML file.
When I use Tika Server 2.7.0 to extract file content, the results are as follows:

curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.xml"
Result: fail (empty)

curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain"
curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.txt"
curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.docx"
Result: success

The file name information affects the extraction result.

[email protected]

From: Tilman Hausherr
Date: 2023-04-20 11:09
To: user
Subject: Re: Tika server extraction failed

Yes, the second file brings this on the console log:

Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 100,000,015, but the maximum length for this record type is 100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
    at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:599) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:276) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:230) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:203) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:82) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:319) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:127) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
    ... 41 more

So I googled for the error message and found this:
https://stackoverflow.com/a/64221068/535646

I then included this into the config.xml file from
https://cwiki.apache.org/confluence/display/TIKA/TikaServer+in+Tika+2.x
and then it works, although the meta output now came as xml instead of as text, maybe that default config file does change something instead of keeping defaults, but that's another story.

Tilman
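
[Editorial sketch] For readers following the thread, this is roughly what such an override might look like in a Tika 2.x config file. It is an assumption-laden illustration, not the exact snippet used in the reply above: it assumes the byteArrayMaxOverride parameter is applied to the OOXMLParser named in the stack trace, and the value 200000000 is an arbitrary example chosen only to exceed the 100,000,015 bytes reported in the error.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: parser choice and limit value are assumptions. -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but exclude OOXMLParser so it can be
         configured separately below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- Raise the POI byte array limit for OOXML files (illustrative value). -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">200000000</param>
      </params>
    </parser>
  </parsers>
</properties>
```

If this fragment is merged into a larger server config file, the other sections of that file may change defaults of their own, which could be related to the XML-instead-of-text output mentioned above.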
t1.xml
Description: Binary data
