Thank you! The error message gives a hint on how to fix this.  You can
configure Tika's OfficeParser to override this maximum record
length:
https://tika.apache.org/2.6.0/api/org/apache/tika/parser/microsoft/AbstractOfficeParser.html#setByteArrayMaxOverride-int-

I can send a link on how to do this via tika-config.xml if this is the
path you'd like to pursue.

I was responsible for adding this check to POI. Throughout the parsers
for the older MSOffice formats (doc, ppt, xls), there's a common
pattern: read a record length, allocate a byte array of that length,
then read the stream into the array.  The problem is that a very small
file can be carefully crafted or modified to make the parser allocate
2GB.  The limit is a protection against that behavior.
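
A minimal sketch of that guarded pattern, in Java. This is an
illustration only, not POI's actual code: the class name, method name,
and constant are hypothetical, and it reads a big-endian 4-byte length
for simplicity, which may not match how any particular Office record
is laid out.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of a length-checked record read (not POI code).
class BoundedRecordReader {
    // Assumed cap, chosen to match the 1000000 quoted in the error message.
    static final int MAX_RECORD_LENGTH = 1_000_000;

    // Reads a 4-byte length, then that many payload bytes, but refuses
    // to allocate if the declared length is negative or above maxLength.
    static byte[] readRecord(InputStream in, int maxLength) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt(); // declared by the file, so untrusted
        if (length < 0 || length > maxLength) {
            throw new IOException("Declared record length " + length
                    + " exceeds the maximum of " + maxLength);
        }
        byte[] payload = new byte[length]; // safe: bounded by maxLength
        data.readFully(payload);           // throws EOFException if truncated
        return payload;
    }

    public static void main(String[] args) {
        // An 8-byte input whose header claims a record of about 2GB.
        byte[] crafted = {0x7F, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, 1, 2, 3, 4};
        try {
            readRecord(new ByteArrayInputStream(crafted), MAX_RECORD_LENGTH);
        } catch (IOException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Without the length check, the 8-byte input above would trigger an
allocation of roughly 2GB before any payload is read. Raising the
override trades that protection for the ability to read legitimately
large records.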

On Wed, Feb 1, 2023 at 11:03 PM Tilman Hausherr <[email protected]> wrote:
>
> Hi,
>
> A complete stack trace would be useful, if it isn't in the log, then using 
> tika-app would be helpful. At this time the only thing we know is that it's 
> an office file, which may or may not be corrupt.
>
> The exception happens as part of a call to  IOUtils.toByteArray()
>
> A Google search for that error finds several pages that answer your original 
> question:
>
> https://stackoverflow.com/questions/64221010/apache-tika-tried-to-allocate-an-array-of-length-1835606-but-1000000-is-the-ma
> https://bz.apache.org/bugzilla/show_bug.cgi?id=65639
> https://www.ibm.com/support/pages/converter-dropped-some-document-tried-allocate-array-length-xxxx-1000000-maximum-record-type-message
>
> Tilman
>
> On 01.02.2023 23:17, שי ברק wrote:
>
> The logs I got:
>
> Caused by: org.apache.poi.util.RecordFormatException: Tried to allocate
> an array of length 1835606, but 1000000 is the maximum for this record type.
> If the file is not corrupt, please open an issue on bugzilla to request
> increasing the maximum allowable size for this record type.
> As a temporary workaround, consider setting a higher override value with
> IOUtils.setByteArrayMaxOverride()
>
>
> On Wed, 1 Feb 2023 at 23:49 Tim Allison <[email protected]> wrote:
>>
>> As Tilman said, I don't think the issue is on the Tika side, but I
>> can't tell without testing.  What happens when you curl the file to
>> the server?  You might have to use multipart/form-data?
>>
>> Again, as Tilman said, it would be useful to see what the logs are.
>> Try sending the file to the /rmeta endpoint to get the stacktrace if
>> you can't otherwise see the logs.
>>
>>
>> On Wed, Feb 1, 2023 at 12:04 PM Tilman Hausherr <[email protected]> wrote:
>> >
>> > How would you know that it is size related? Try what I mentioned, or look 
>> > at the server logs, or share the file.
>> >
>> > Tilman
>> >
>> > On 01.02.2023 17:08, שי ברק wrote:
>> >
>> > I work on a C# project that talks to Tika Server over HTTP, so I’m 
>> > wondering if there’s something I can do with the config file of the 
>> > server… maybe there’s a way to modify the size limit.
>> >
>> > On Wed, 1 Feb 2023 at 17:51 Tilman Hausherr <[email protected]> wrote:
>> >>
>> >> On 01.02.2023 09:40, שי ברק wrote:
>> >> > I have a 150 MB PowerPoint document, and when I send a request to
>> >> > Tika server I get a 422 response back, saying it’s an unprocessable
>> >> > entity. Is there a size limit in Tika, or is the issue with my
>> >> > specific document?
>> >>
>> >> What happens if you do the same with tika-app from the command line?
>> >>
>> >> Tilman
>> >>
>> >>
>> >>
>> >
>
>
