Re: Tika fails to extract text from very large files

Alec Swan Thu, 17 May 2012 09:17:12 -0700

Could you please clarify the "fork parser" and "Tika server" concepts? Do
both of them require spawning and managing external processes that
perform the actual file parsing?


On Thu, May 17, 2012 at 10:12 AM, Nick Burch <[email protected]> wrote:
> On Thu, 17 May 2012, Alec Swan wrote:
>>
>> 1. We don't know how to tell whether we have enough heap space to
>> process the file, so that we can skip the file if we don't. Allowing
>> out-of-memory errors to take down our process is not acceptable.
>
>
> In that kind of situation, you should be looking at using something like
> the fork parser or the Tika server.
>
>
>> 2. When we use 1024MB of heap and try to parse a large PDF file, at
>> some point it starts printing the following error non-stop. In fact, I
>> forgot to kill my process and it ran overnight, printing this every
>> second or so:
>> May 16, 2012 8:00:58 PM org.apache.pdfbox.filter.FlateFilter decode
>> SEVERE: Stop reading corrupt stream
>
>
> That looks like a PDFBox bug; you should try reporting that upstream.
>
> Nick
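
For anyone following the thread, the idea behind both suggestions is to do the parsing outside the main JVM, so that an out-of-memory error or a runaway parse only affects a disposable process. The sketch below is not Tika's fork parser or Tika server API; it is a plain-JDK illustration of that isolation, assuming a hypothetical extractor entry point (for example `com.example.ExtractMain`) that takes a file path and writes the extracted text to standard output. The class name, method names, heap size, and timeout are all assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Sketch: run one extraction per child JVM so a crash stays in the child. */
public class IsolatedExtraction {

    public enum Outcome { OK, FAILED, TIMED_OUT }

    /**
     * Builds the child JVM command line. mainClass is a hypothetical
     * extractor entry point, not a class shipped with Tika.
     */
    public static List<String> buildCommand(int maxHeapMb, String classpath,
                                            String mainClass, Path input) {
        // Reuse the java binary of the JVM we are currently running in.
        String javaBin = Path.of(System.getProperty("java.home"), "bin", "java").toString();
        return List.of(javaBin, "-Xmx" + maxHeapMb + "m", "-cp", classpath,
                       mainClass, input.toString());
    }

    /**
     * Runs the command, sending the child's stdout and stderr to outputFile.
     * A child that dies (for example from an uncaught OutOfMemoryError)
     * normally exits non-zero and is reported as FAILED; a child that
     * exceeds the timeout is killed and reported as TIMED_OUT.
     */
    public static Outcome run(List<String> command, File outputFile, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);   // merge stderr into stdout
        builder.redirectOutput(outputFile);  // a file avoids blocking on a full pipe
        Process child = builder.start();
        if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            child.destroyForcibly();
            child.waitFor();                 // reap the killed process
            return Outcome.TIMED_OUT;
        }
        return child.exitValue() == 0 ? Outcome.OK : Outcome.FAILED;
    }
}
```

With something like this, the parent can skip any file whose outcome is `FAILED` or `TIMED_OUT` and move on. The timeout would also stop a parse that keeps logging the same error indefinitely, as described in point 2 above.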
