Re: Tika fails to extract text from very large files

Nick Burch Wed, 16 May 2012 14:45:50 -0700

On Wed, 16 May 2012, Alec Swan wrote:

> Our tests indicate that while Tika can extract text from average-sized
> files, it fails to extract text from large files of certain types. In
> our tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX and
> 113 MB PDF files. However, it extracted the right text from a 94 MB
> TXT file.

Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
which can only be parsed by building a DOM-like structure in memory, so
they need more memory available to them. XLS/XLSX, amongst a few others,
can be done in a largely streaming manner, so have a lower footprint. (It
all depends on how the file format is laid out internally.)
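
As a rough illustration of the DOM-versus-streaming difference, here is a
minimal sketch that uses only the JDK's built-in XML parsers. It is not
Tika's parser code, and the class and method names (DomVsStreaming,
extractWithDom, extractWithSax) are made up for the example. The DOM
variant holds the whole document tree in memory before any text is read,
while the SAX variant only ever sees one chunk of text at a time. The
last line prints the JVM's maximum heap, which is the limit you would
raise with the standard -Xmx option if memory does turn out to be the
problem.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: contrasts tree-building and streaming extraction.
public class DomVsStreaming {

    // DOM style: the parser builds the complete tree in memory first,
    // so memory use grows with the size of the whole document.
    static String extractWithDom(byte[] xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml));
        return doc.getDocumentElement().getTextContent();
    }

    // Streaming style: text is handed over chunk by chunk as it is read,
    // and no document tree is retained by the parser.
    static String extractWithSax(byte[] xml) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<doc><p>Hello</p><p>World</p></doc>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(extractWithDom(xml));
        System.out.println(extractWithSax(xml));
        // Upper bound on the heap this JVM will use, in megabytes.
        System.out.println("Max heap (MB): "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```

Both methods return the same text for the small sample; the difference
only shows up in memory use as the input grows.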


Nick
