Thank you! The error message gives a hint on how to fix this. You can configure Tika's OfficeParser to override this maximum record length: https://tika.apache.org/2.6.0/api/org/apache/tika/parser/microsoft/AbstractOfficeParser.html#setByteArrayMaxOverride-int-
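To make the limit concrete, here is a minimal sketch of the kind of check that produces this error, and of what an override changes. It is not POI's actual code: the class GuardedRecordReader, its methods, and the "no override" convention are all hypothetical, and only the 1,000,000 default and the message wording are taken from the log quoted below.

```java
import java.io.DataInputStream;
import java.io.IOException;

// Hypothetical stand-in for a length-checked record read; not POI's implementation.
class GuardedRecordReader {
    // Default limit, mirroring the 1,000,000 reported in the error message.
    static final int DEFAULT_MAX = 1_000_000;

    // Assumed convention for this sketch: a value <= 0 means "no override set".
    private static int maxOverride = -1;

    static void setMaxOverride(int max) {
        maxOverride = max;
    }

    // Reads a 4-byte length, validates it, and only then allocates the array.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int length = in.readInt(); // length claimed by the file, not yet trusted
        int limit = maxOverride > 0 ? maxOverride : DEFAULT_MAX;
        if (length < 0 || length > limit) {
            throw new IllegalStateException("Tried to allocate an array of length "
                    + length + ", but " + limit
                    + " is the maximum for this record type.");
        }
        byte[] data = new byte[length]; // safe: bounded by the limit
        in.readFully(data);             // fails with EOFException if the file is truncated
        return data;
    }
}
```

With the default limit, a record claiming 1835606 bytes is rejected before anything is allocated; after setMaxOverride(2_000_000) the same record is read normally.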
I can send a link on how to do this via tika-config.xml if this is the path you'd like to pursue.

I was responsible for adding this code in POI because throughout the older MS Office formats (doc, ppt, xls) there's a common pattern: read a record length, allocate that length in memory, then read the stream into the byte array. The problem is that a very small file can be carefully created or modified so that it triggers a 2GB allocation. This limit is a protection against that behavior.

On Wed, Feb 1, 2023 at 11:03 PM Tilman Hausherr <[email protected]> wrote:
>
> Hi,
>
> A complete stack trace would be useful, if it isn't in the log, then using
> tika-app would be helpful. At this time the only thing we know is that it's
> an office file, which may or may not be corrupt.
>
> The exception happens as part of a call to IOUtils.toByteArray()
>
> A google search for that error finds several pages that answers your original
> question:
>
> https://stackoverflow.com/questions/64221010/apache-tika-tried-to-allocate-an-array-of-length-1835606-but-1000000-is-the-ma
> https://bz.apache.org/bugzilla/show_bug.cgi?id=65639
> https://www.ibm.com/support/pages/converter-dropped-some-document-tried-allocate-array-length-xxxx-1000000-maximum-record-type-message
>
> Tilman
>
> On 01.02.2023 23:17, שי ברק wrote:
>
> The logs I got:
>
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate
> an array of length 1835606, but 1000000 is the maximum for this record type.
> If the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type.
> As a temporary workaround, consider setting a higher override value with
> IOUtils.setByteArrayMaxOverride()
>
>
> On Wed, 1 Feb 2023 at 23:49 Tim Allison <[email protected]> wrote:
>>
>> As Tilman said, I don't think the issue is on the Tika side, but I
>> can't tell without testing. What happens when you curl the file to
>> the server? You might have to use multipart/form-data?
>>
>> Again, as Tilman said, it would be useful to see what the logs are.
>> Try sending the file to the /rmeta endpoint to get the stacktrace if
>> you can't otherwise see the logs.
>>
>>
>> On Wed, Feb 1, 2023 at 12:04 PM Tilman Hausherr <[email protected]>
>> wrote:
>> >
>> > How would you know that it is size related? Try what I mentioned, or look
>> > at the server logs, or share the file.
>> >
>> > Tilman
>> >
>> > On 01.02.2023 17:08, שי ברק wrote:
>> >
>> > I work on C# project that uses Tika Server with http request, so I’m
>> > wondering if there’s something I can do with the config file of the
>> > server…maybe there’s a way to modify the size limit
>> >
>> > On Wed, 1 Feb 2023 at 17:51 Tilman Hausherr <[email protected]> wrote:
>> >>
>> >> On 01.02.2023 09:40, שי ברק wrote:
>> >> > I have a 150 MB power point office document and when send request to
>> >> > Tika server I get 422 response back, says it’s unprocessable entity.
>> >> > Is there size limitation in Tika or the issue is with my specific
>> >> > document?
>> >>
>> >> What happens if you do the same with tika-app from the command line?
>> >>
>> >> Tilman
>> >>
