So, we have two problems:

1. We don't know how to tell ahead of time whether we have enough heap
space to process a file, so that we can skip the file when we don't
(see the first sketch below). Allowing out-of-memory errors to take
down our process is not acceptable.

2. When we use 1024 MB of heap and try to parse a large PDF file, at
some point it starts printing the following error non-stop. In fact I
forgot to kill my process and it ran overnight, printing this every
second or so (see the second sketch below):
May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
SEVERE: Stop reading corrupt stream
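
For problem 1, the closest we have to a plan is a rough pre-check on
available heap plus a last-resort catch around the parse. A minimal
sketch of the idea (the 10x heap factor is a pure guess at how much
memory the parsers need per byte of input, and catching
OutOfMemoryError is known to be unreliable, so treat this as a
starting point rather than a fix):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class GuardedExtractor {

        // Guess: the DOM-style parsers may need several times the file
        // size in heap. This factor would need tuning per format.
        private static final long HEAP_FACTOR = 10;

        public static String extractOrSkip(File file) throws Exception {
            Runtime rt = Runtime.getRuntime();
            long usedHeap = rt.totalMemory() - rt.freeMemory();
            long availableHeap = rt.maxMemory() - usedHeap;

            if (file.length() * HEAP_FACTOR > availableHeap) {
                System.err.println("Skipping " + file
                        + ": probably too large for current heap");
                return null;
            }

            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
            Metadata metadata = new Metadata();
            try (InputStream in = new FileInputStream(file)) {
                parser.parse(in, handler, metadata, new ParseContext());
                return handler.toString();
            } catch (OutOfMemoryError e) {
                // Last resort: after an OOME the JVM may be in a bad state,
                // so the goal is only to skip the file, not to fully recover.
                System.err.println("Skipping " + file
                        + ": ran out of heap while parsing");
                return null;
            }
        }
    }

The pre-check compares the file size against maxMemory() minus the
currently used heap, which is only approximate since GC timing and
other threads shift those numbers.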
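For problem 2, the log format looks like PDFBox going through
java.util.logging, so at minimum we can turn that one logger off and
put a timeout around the parse so a stuck file doesn't run all night.
Again only a sketch; the logger name is taken straight from the
message above, the timeout wrapper is our own idea, and cancel(true)
only helps if the parsing thread actually checks for interrupts:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    public class PdfParseGuard {

        // Keep a strong reference so the level setting is not lost when
        // java.util.logging garbage-collects the logger.
        private static final Logger FLATE_LOGGER =
                Logger.getLogger("org.apache.pdfbox.filter.FlateFilter");

        static {
            // Silences the "Stop reading corrupt stream" spam only;
            // PDFBox is presumably still grinding on the stream underneath.
            FLATE_LOGGER.setLevel(Level.OFF);
        }

        private static final ExecutorService POOL =
                Executors.newSingleThreadExecutor();

        /** Runs a parse task and gives up after the timeout. */
        public static String parseWithTimeout(Callable<String> parseTask,
                                              long timeout, TimeUnit unit)
                throws Exception {
            Future<String> future = POOL.submit(parseTask);
            try {
                return future.get(timeout, unit);
            } catch (TimeoutException e) {
                future.cancel(true); // best effort
                return null;
            }
        }
    }
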

Thanks,

Alec

On Thu, May 17, 2012 at 2:54 AM, Alex Ott <[email protected]> wrote:
> processing PPT & DOC files could be implemented in almost constant
> space (if we don't store the whole text in memory, but pass chunks of
> text to the handler)...
>
> P.S. I'm sorry that I can't go into more detail about it
>
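
If that means handing Tika a handler that streams the text out as it
arrives instead of buffering it all in a String, something like this
is what we'd try (untested sketch; writing the text to a file on disk
is just an example destination):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class StreamingExtract {
        public static void main(String[] args) throws Exception {
            try (InputStream in = new FileInputStream(args[0]);
                 OutputStream out = new FileOutputStream(args[1])) {
                // BodyContentHandler wrapping an OutputStream writes text
                // as it is received instead of collecting it in memory.
                BodyContentHandler handler = new BodyContentHandler(out);
                new AutoDetectParser().parse(in, handler, new Metadata(),
                        new ParseContext());
            }
        }
    }

That keeps the handler side close to constant space, but if the parser
itself has to build the whole document in memory (as Nick describes
below), it presumably won't help with these formats on its own.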
> On Wed, May 16, 2012 at 11:45 PM, Nick Burch <[email protected]> wrote:
>> On Wed, 16 May 2012, Alec Swan wrote:
>>>
>>> Our tests indicate that while Tika can extract text from average-sized
>>> files, it fails to extract text from large files of certain types. In
>>> our tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX and
>>> 113 MB PDF files. However, it extracted the correct text from a 94 MB
>>> TXT file.
>>
>>
>> Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
>> which can only be parsed by building a DOM-like structure in memory, so they
>> need more memory available to them. XLS/XLSX, amongst a few others, can be
>> done in a largely streaming manner, so have a lower footprint. (It all
>> depends on how the file format is laid out internally)
>>
>> Nick
>
>
>
> --
> With best wishes,                    Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)
> Skype: alex.ott
