Memory consumption stays under 90 MB, which is less than the max heap size (128M). No out-of-memory errors are thrown during the test.
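For reference, a minimal sketch of how heap figures like these could be read from inside the JVM, using only the standard `Runtime` API. The class name `HeapCheck` and the `report` helper are hypothetical and not part of Tika, and the extraction call itself is left as a placeholder.

```java
// Hypothetical helper for logging heap usage around an extraction run.
// Uses only java.lang.Runtime; nothing here is Tika API.
final class HeapCheck {
    private static final long MB = 1024L * 1024L;

    /** Bytes of heap currently in use: allocated heap minus free space in it. */
    public static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Maximum heap the JVM will attempt to use (controlled by -Xmx). */
    public static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** One-line summary suitable for a test log. */
    public static String report(String label) {
        return String.format("%s: used %d MB of max %d MB",
                label, usedHeapBytes() / MB, maxHeapBytes() / MB);
    }

    public static void main(String[] args) {
        System.out.println(report("before extraction"));
        // Placeholder: run the text extraction under test here.
        System.out.println(report("after extraction"));
    }
}
```

Run with `java -Xmx128m HeapCheck` to match the heap limit used in the test. These are point-in-time snapshots, so they can miss a short-lived peak between the two calls.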
On Wed, May 16, 2012 at 3:45 PM, Nick Burch <[email protected]> wrote:
> On Wed, 16 May 2012, Alec Swan wrote:
>>
>> Our tests indicate that while Tika can extract text from average files
>> it fails to extract text from large files of certain types. In our
>> tests Tika extracted 0 characters from 100 MB PPTX, 60 MB DOCX and 113
>> MB PDF files. However, it extracted the right text from 94MB TXT file.
>
>
> Are you running out of memory? PPT/PPTX, DOC/DOCX and PDF are all formats
> which can only be parsed by building a DOM-like structure in memory, so they
> need more memory available to them. XLS/XLSX, amongst a few others, can be
> done in a largely streaming manner, so have a lower footprint. (It all
> depends on how the file format is laid out internally)
>
> Nick
