I have observed memory leaks with some of the parsers used by Tika, for
example when parsing a large number of files (about 200,000) with my
application. Why is there a memory leak? Because some closeable objects,
such as streams, are not closed when a parsing exception occurs.
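For me, the best way to isolate Tika parsing is to run it in a separate
operating system process (not a Java thread) and to restart that process
periodically, so that whatever it has leaked is released when it exits
and the parsing work can continue in a fresh process.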
For example, with this trick I can parse the content and metadata of
about one million files without any stability problems in the main
application.
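
To illustrate the idea, here is a minimal sketch in plain Java, not the
actual USB-Search code. It assumes a hypothetical worker class named
TikaWorker on the same classpath that parses the file paths it receives
as arguments; the batch size, heap limit and timeout are arbitrary
values chosen for the example. Each batch runs in a fresh child JVM, so
any memory leaked while parsing disappears when that process exits.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

class ParserSupervisor {

    /** Splits the file list into batches; each batch gets its own child process. */
    static List<List<String>> batches(List<String> files, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < files.size(); i += batchSize) {
            int end = Math.min(i + batchSize, files.size());
            out.add(new ArrayList<>(files.subList(i, end)));
        }
        return out;
    }

    /** Runs one batch in a separate OS process; returns its exit code, or -1 on timeout. */
    static int runBatch(List<String> workerCommand, List<String> batch, long timeoutSeconds)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>(workerCommand);
        cmd.addAll(batch);
        // inheritIO() forwards the child's stdout/stderr to the parent's.
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // A hung parser must not block the main application: kill it.
            p.destroyForcibly();
            p.waitFor();
            return -1;
        }
        return p.exitValue();
    }

    public static void main(String[] args) throws Exception {
        String javaBin = System.getProperty("java.home") + "/bin/java";
        // "TikaWorker" is a hypothetical class name used only for this sketch.
        List<String> worker = Arrays.asList(javaBin, "-Xmx512m",
                "-cp", System.getProperty("java.class.path"), "TikaWorker");
        for (List<String> batch : batches(Arrays.asList(args), 500)) {
            int exit = runBatch(worker, batch, 600);
            if (exit != 0) {
                System.err.println("Worker failed (exit " + exit
                        + ") on a batch of " + batch.size() + " files");
            }
        }
    }
}
```

Passing the paths as command-line arguments keeps the sketch short, but
the operating system limits the length of a command line, so for large
batches it is probably safer to send the paths to the worker through its
standard input or a temporary file.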
I have published my work freely on the Web (Apache License 2.0) as part
of my application USB-Search:
http://sourceforge.net/projects/usb-search/
Feel free to download the sources and look at some ideas for isolating
the Tika parsing process.
BR
Raphaël GUYOT,
Software engineer at French
Signal school ETRS,
Rennes, France.