Re: Tika fails to extract text from very large files

Alex Ott Thu, 17 May 2012 01:55:27 -0700

Processing PPT & DOC files could be implemented in almost constant
space (if we don't store the whole text in memory, but instead pass
chunks of text to the handler as they are read)...
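
To illustrate the idea of passing chunks to the handler, here is a minimal
sketch using only the JDK's SAX classes. It is not how Tika or POI implement
these parsers; the names StreamingTextSketch, ChunkForwardingHandler and
emitInChunks are hypothetical, and emitInChunks is just a stand-in for a
parser that produces text piece by piece. The handler writes each chunk
straight to a Writer and keeps only a running count, so its own memory use
does not grow with the size of the document.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingTextSketch {

    /** Forwards each chunk of text to a Writer as it arrives; stores no text itself. */
    static class ChunkForwardingHandler extends DefaultHandler {
        private final Writer out;
        private long charCount = 0;

        ChunkForwardingHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                out.write(ch, start, length); // hand the chunk on immediately
                charCount += length;          // keep only a counter, not the text
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }

        long getCharCount() {
            return charCount;
        }
    }

    /**
     * Hypothetical stand-in for a parser: emits text in fixed-size chunks
     * through one reusable buffer. A real parser would read from a stream;
     * the String argument is only for the demo.
     */
    static void emitInChunks(String text, int chunkSize, DefaultHandler handler)
            throws SAXException {
        char[] buffer = new char[chunkSize];
        for (int pos = 0; pos < text.length(); pos += chunkSize) {
            int len = Math.min(chunkSize, text.length() - pos);
            text.getChars(pos, pos + len, buffer, 0);
            handler.characters(buffer, 0, len);
        }
    }

    public static void main(String[] args) throws Exception {
        // In practice the sink would be a file or network writer, not a StringWriter.
        StringWriter sink = new StringWriter();
        ChunkForwardingHandler handler = new ChunkForwardingHandler(sink);
        emitInChunks("slide one text, slide two text", 8, handler);
        System.out.println(handler.getCharCount() + " chars forwarded");
    }
}
```

Whether this helps in practice depends on the parser: if it still builds the
whole document model before emitting any text, the handler alone will not
reduce the memory footprint.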


P.S. I'm sorry that I can't give more details about it.

On Wed, May 16, 2012 at 11:45 PM, Nick Burch <[email protected]> wrote:
> On Wed, 16 May 2012, Alec Swan wrote:
>>
>> Our tests indicate that while Tika can extract text from average files
>> it fails to extract text from large files of certain types. In our
>> tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX and 113
>> MB PDF files. However, it extracted the right text from 94MB TXT file.
>
>
> Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
> which can only be parsed by building a DOM-like structure in memory, so they
> need more memory available to them. XLS/XLSX, amongst a few others, can be
> done in a largely streaming manner, so have a lower footprint. (It all
> depends on how the file format is laid out internally)
>
> Nick



-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)
Skype: alex.ott
