I confirmed that this will require the next version of POI due to a bug
that is my fault: https://bz.apache.org/bugzilla/show_bug.cgi?id=63569

Many thanks to Dominik Stadler for fixing this.

If you are able to build POI-4.1.2-SNAPSHOT, the configuration file quoted
below will work.  The next version of POI should be out fairly soon, though
I'm not certain; I've asked on POI's dev list.
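
In case it helps on the Python side, here is a minimal sketch (standard
library only, not part of Tika or POI) that generates the config file quoted
below and checks that the override is larger than the 1186960 bytes the
warning reports. The helper names build_tika_config and
override_is_large_enough are made up for illustration, and tika-server still
has to be started with the resulting file as its config.

```python
import xml.etree.ElementTree as ET

# Sizes taken from the warning message in this thread.
REQUESTED_BYTES = 1186960  # array length POI tried to allocate
OVERRIDE_BYTES = 2000000   # value used in the config quoted below


def build_tika_config(max_override: int) -> str:
    """Return tika-config.xml text (hypothetical helper, not a Tika API)."""
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")
    ET.SubElement(
        parsers, "parser", {"class": "org.apache.tika.parser.DefaultParser"}
    )
    office = ET.SubElement(
        parsers,
        "parser",
        {"class": "org.apache.tika.parser.microsoft.OfficeParser"},
    )
    params = ET.SubElement(office, "params")
    param = ET.SubElement(
        params, "param", {"name": "byteArrayMaxOverride", "type": "int"}
    )
    param.text = str(max_override)
    body = ET.tostring(properties, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def override_is_large_enough(config_xml: str, requested: int) -> bool:
    """Check that byteArrayMaxOverride in the config exceeds `requested`."""
    root = ET.fromstring(config_xml.encode("utf-8"))
    param = root.find(".//param[@name='byteArrayMaxOverride']")
    if param is None or param.text is None:
        return False
    return int(param.text) > requested


if __name__ == "__main__":
    config = build_tika_config(OVERRIDE_BYTES)
    print(config)
    print(override_is_large_enough(config, REQUESTED_BYTES))
```

This only sanity-checks the file; whether the override actually takes effect
still depends on the POI fix mentioned above.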

On Thu, Jan 23, 2020 at 9:51 AM Tim Allison <[email protected]> wrote:

> Hans,
>   I'm sorry for my delay.  There was a bug found in setting the global max
> in POI, which may require us to wait for the next release, but I _think_
> you should be ok with this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <parser class="org.apache.tika.parser.DefaultParser"/>
>         <parser class="org.apache.tika.parser.microsoft.OfficeParser">
>             <params>
>                 <param name="byteArrayMaxOverride" type="int">2000000</param>
>             </params>
>         </parser>
>     </parsers>
> </properties>
>
> On Tue, Jan 21, 2020 at 3:44 PM <[email protected]> wrote:
>
>> Hi,
>>
>> I'm still stuck on this issue and am taking it up again to see if Tika can
>> be an option.
>>
>> I still get the error message even though I have tika-server 1.23 and
>> Python tika 1.23.
>>
>> The call to Tika in the Python code is parser.from_file(filename).
>>
>> I have tried setting ByteArrayMaxOverride using a Tika config file:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <properties>
>>     <parsers>
>>         <parser class="org.apache.tika.parser.microsoft.OfficeParserConfig">
>>             <params>
>>                 <param name="ByteArrayMaxOverride" type="int">2048000</param>
>>             </params>
>>         </parser>
>>     </parsers>
>> </properties>
>>
>> But no luck: the error message is still there. All the content seems to
>> be parsed, but I would prefer not to get the warning message:
>>
>> WARN  Ignoring unexpected exception while parsing summary entry
>> DocumentSummaryInformation
>> org.apache.poi.util.RecordFormatException: Tried to allocate an array of
>> length 1186960, but 100000 is the maximum for this record type.
>> If the file is not corrupt, please open an issue on bugzilla to request
>> increasing the maximum allowable size for this record type.
>> As a temporary workaround, consider setting a higher override value with
>> IOUtils.setByteArrayMaxOverride()
>>         at org.apache.poi.util.IOUtils.throwRFE(IOUtils.java:591)
>>
>> Any hints on how to get rid of it?
>> Everything is version 1.23 and I am using the Python library.
>>
>> I'd really appreciate any hints!
>>
>> Kind regards,
>> Hans
>>
>> *From:* Tim Allison <[email protected]>
>> *Sent:* 18 December 2019 14:52
>> *To:* [email protected]
>> *Cc:* [email protected]
>> *Subject:* Re: 100000 is the maximum for this record type
>>
>> SummaryInformation parsing can be buggy so we catch pretty much
>> everything there and parse the rest of the document.
>>
>> As of Tika 1.23, you can bump the global ByteArrayMaxOverride via the
>> OfficeParserConfig if you're calling Tika programmatically or via
>> tika-config.xml.
>>
>> On Wed, Dec 18, 2019 at 8:39 AM Hans Meijer <[email protected]>
>> wrote:
>>
>> Tika version 1.23:
>> When trying to parse a larger Excel file (size in bytes: 10038272), this
>> error occurs:
>> WARN  Ignoring unexpected exception while parsing summary entry
>> DocumentSummaryInformation
>> org.apache.poi.util.RecordFormatException: Tried to allocate an array of
>> length 1186960, but 100000 is the maximum for this record type.
>> If the file is not corrupt, please open an issue on bugzilla to request
>> increasing the maximum allowable size for this record type.
>> As a temporary workaround, consider setting a higher override value with
>> IOUtils.setByteArrayMaxOverride()
>>
>> However, it seems that all the text gets extracted, yet I still get the
>> warning message.
>>
>> Is there any way to analyze further why the warning still appears when
>> the content is extracted from the Excel spreadsheet?
