This is great, thank you very much! Short-term, I will have a parser per file, but long-term I think I will pursue the path of a separate Tika process. The number of files is not in the millions at once, but it accumulates to that over a short period of time, so that option is quite appealing. If I find anything new, I'll be sure to share it.
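
For reference, a rough sketch of what the separate-process route could look like from Java. It only wraps the tika-batch command quoted below (java -jar tika-app.jar -i <input_dir> -o <output_dir>) in a child JVM, so a crash or hang in parsing does not take the calling application down with it. The class and method names (TikaBatchLauncher, buildCommand, run) are made up for illustration and are not part of Tika, and it assumes `java` is on the PATH and that the caller supplies the path to the tika-app jar.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical helper (not part of Tika): runs tika-app's batch mode in a child JVM. */
final class TikaBatchLauncher {

    /** Builds the command from the thread: java -jar tika-app.jar -i <input_dir> -o <output_dir> */
    static List<String> buildCommand(String tikaAppJar, String inputDir, String outputDir) {
        return List.of("java", "-jar", tikaAppJar, "-i", inputDir, "-o", outputDir);
    }

    /** Starts the child JVM, waits for it to finish, and returns its exit code. */
    static int run(String tikaAppJar, String inputDir, String outputDir)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tikaAppJar, inputDir, outputDir))
                .inheritIO() // forward the child's stdout/stderr to this process
                .start();
        return process.waitFor();
    }
}
```

A caller would invoke something like TikaBatchLauncher.run("tika-app.jar", "in", "out") and decide for itself what a non-zero exit code means. Timeouts and restarts are left out of this sketch.
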
Thanks again!

On Fri, Sep 30, 2016 at 5:16 PM Allison, Timothy B. <[email protected]> wrote:
> In an earlier version of tika-batch, we had a single AutoDetectParser per
> thread, and we had no problems. I experimented with a single
> AutoDetectParser across the threads, and we didn’t have problems.
>
> Because of configuration issues, tika-batch is now creating a new parser
> for each file.
>
> In our unit test suite, last I experimented with this, the first
> initialization did take a while, but then there was no measurable extra
> cost to instantiating a new parser. In short, we didn’t save anything by
> using a static AutoDetectParser instead of just instantiating a new one for
> each unit test.
>
> If you are going from file system to file system, you might want to
> consider tika-batch.
>
> java -jar tika-app.jar -i <input_dir> -o <output_dir>
>
> If you have a whole lot of files (millions), try to isolate Tika in its
> own jvm or server or data center; bad things can happen. See slide 17:
> http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
>
> And:
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
>
> *From:* Haris Osmanagic [mailto:[email protected]]
> *Sent:* Friday, September 30, 2016 10:54 AM
> *To:* [email protected]
> *Subject:* Re: Is creating new AutoDetectParsers expensive?
>
> I read the first sentence and thought: "Yes! I can save ourselves a bunch
> of memory!"
>
> Then I read the second: "Oh, oh, do I dare try it out?" : )
>
> Thank you very much for the super-speedy response!
>
> On Fri, Sep 30, 2016 at 4:46 PM Allison, Timothy B. <[email protected]>
> wrote:
>
> You can reuse AutoDetectParser in a multithreaded environment. You
> shouldn’t have problems with performance or thread safety.
>
> If you find otherwise, please let us know! :)
>
> *From:* Haris Osmanagic [mailto:[email protected]]
> *Sent:* Friday, September 30, 2016 10:36 AM
> *To:* [email protected]
> *Subject:* Is creating new AutoDetectParsers expensive?
>
> Hi all!
>
> Let's assume there are really many files to be parsed, and the operation
> is repeated a relatively large number of times each day.
>
> Is it, in that case, too expensive to create new AutoDetectParsers for
> every file? Or, in other words, if I were to reuse an AutoDetectParser for a
> large number of files, would I:
>
> * Have problems with thread-safety?
>
> * Have problems with performance?
>
> Thank you very much!
>
> Haris Osmanagić
