Hi, Tilman

    I have encountered another problem.
    
    t1.xml is a simple plain text file, not a standard XML file.
    When I use Tika Server 2.7.0 to extract the file content, the results are as follows:

curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.xml"
Result: fail (empty)

curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain"
curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.txt"
curl -T t1.xml http://127.0.0.1:12000/tika --header "Accept: text/plain" -H "Content-Disposition: attachment; filename=t1.docx"
Result: success

    So the file name passed in the Content-Disposition header affects the extraction result.
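
    To make the comparison easy to repeat, here is a minimal sketch that sends the same file with several filename hints and prints how many bytes of text come back for each. The URL and headers are the ones from the commands above; the helper names disposition_header and extracted_bytes are hypothetical, and the script assumes the server is reachable at that address.

```shell
#!/bin/sh
# Sketch: repeat the same upload with different filename hints and compare
# how much text comes back. The function names are hypothetical helpers.

# Endpoint taken from the curl commands in this mail; override via environment.
TIKA_URL="${TIKA_URL:-http://127.0.0.1:12000/tika}"

# Build the Content-Disposition header for a given filename hint.
disposition_header() {
    printf 'Content-Disposition: attachment; filename=%s' "$1"
}

# Upload file $1 with filename hint $2 and print the size of the
# extracted text in bytes (0 means the response body was empty).
extracted_bytes() {
    curl -s -T "$1" "$TIKA_URL" \
        -H "Accept: text/plain" \
        -H "$(disposition_header "$2")" | wc -c | tr -d ' '
}

# Only contact the server when a file is given as the first argument.
if [ -n "${1:-}" ]; then
    name=$(basename "$1")
    base="${name%.*}"
    for ext in xml txt docx; do
        hint="$base.$ext"
        printf '%s\t%s bytes\n' "$hint" "$(extracted_bytes "$1" "$hint")"
    done
fi
```

    Saved under any name and run with t1.xml as its only argument, it prints one line per filename hint, so an empty result for a single hint stands out.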



 
From: Tilman Hausherr
Date: 2023-04-20 11:09
To: user
Subject: Re: Tika server extraction failed
Yes, the second file produces this in the console log:
 
Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate an array of length 100,000,015, but the maximum length for this record type is 100,000,000.
If the file is not corrupt and not large, please open an issue on bugzilla to request
increasing the maximum allowable size for this record type.
You can set a higher override value with IOUtils.setByteArrayMaxOverride()
        at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:599) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        at org.apache.poi.util.IOUtils.checkLength(IOUtils.java:276) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:230) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:203) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:82) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:98) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:132) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:319) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:127) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:115) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298) ~[tika-server-standard-2.7.1-SNAPSHOT.jar:2.7.1-SNAPSHOT]
        ... 41 more
 
So I googled for the error message and found this:
 
https://stackoverflow.com/a/64221068/535646
 
I then added this to the config.xml file from
https://cwiki.apache.org/confluence/display/TIKA/TikaServer+in+Tika+2.x
and now it works. The meta output now comes back as XML instead of
text, though. Maybe that sample config file changes something rather than
keeping the defaults, but that's another story.
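
For illustration only, here is a sketch of what such a config addition might look like. The linked answer is not quoted in this mail, so everything below is an assumption: that the relevant setting is the byteArrayMaxOverride parameter on the OOXML parser (which should forward to the IOUtils.setByteArrayMaxOverride() call named in the error message), that 200000000 is a suitable value, and that a standalone file called tika-config-override.xml is used rather than the wiki's config.xml.

```shell
#!/bin/sh
# Sketch: write a Tika config that raises POI's byte-array limit.
# ASSUMPTIONS: the output file name, the parameter name byteArrayMaxOverride
# and the value 200000000 are illustrative, not quoted from this thread.

CONFIG_FILE="${CONFIG_FILE:-tika-config-override.xml}"

cat > "$CONFIG_FILE" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep every default parser except the one configured below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
    </parser>
    <!-- Assumed parameter: raises the limit reported in the stack trace. -->
    <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
      <params>
        <param name="byteArrayMaxOverride" type="int">200000000</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF

echo "wrote $CONFIG_FILE"
```

The server would then need to be restarted with this file as its config (the wiki page above describes how to pass one). Because this file only touches the one parser, it may also be a way to check whether the XML-instead-of-text change comes from the other settings in the sample config.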
 
Tilman
 

Attachment: t1.xml
Description: Binary data