Re: Is creating new AutoDetectParsers expensive?

Haris Osmanagic Fri, 30 Sep 2016 08:57:59 -0700

This is great, thank you very much!

Short-term, I will then have a parser per file, but long term I think I
will pursue the path of a separate Tika process. The number of files is
not millions at once, but accumulates to that over a short period of time,
so it's quite appealing. If I find something new, I'll be happy to share it.
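
For comparison, the other option discussed in this thread, one parser
reused across threads, could be sketched roughly as follows. This is only
an illustration under assumptions: StandInParser is a hypothetical
placeholder so the code compiles without Tika on the classpath, and
SharedParserSketch and parseAll are names made up for this example. In
real code the shared object would be an
org.apache.tika.parser.AutoDetectParser, and per-file objects such as the
content handler and Metadata would presumably still be created for each
call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for org.apache.tika.parser.AutoDetectParser.
// It keeps no per-call state, which is what makes sharing it safe here.
class StandInParser {
    // Counts how many parser instances have been constructed.
    static final AtomicInteger INSTANCES = new AtomicInteger();

    StandInParser() {
        INSTANCES.incrementAndGet();
    }

    // Pretend "parse": simply upper-cases the input.
    String parse(String content) {
        return content.toUpperCase(Locale.ROOT);
    }
}

class SharedParserSketch {
    // Parses every document on a fixed thread pool using ONE shared parser.
    static List<String> parseAll(List<String> docs, int threads) throws Exception {
        final StandInParser shared = new StandInParser();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (final String doc : docs) {
                Callable<String> task = () -> shared.parse(doc);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // results come back in submission order
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> out = parseAll(Arrays.asList("a.pdf", "b.docx", "c.txt"), 2);
        System.out.println(out + " parsed with "
                + StandInParser.INSTANCES.get() + " parser instance(s)");
    }
}
```

The parser-per-file variant would simply move the constructor call inside
the task; according to the measurements quoted below, that costs little
once the first initialization has happened.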


Thanks again!



On Fri, Sep 30, 2016 at 5:16 PM Allison, Timothy B. <[email protected]>
wrote:

> In an earlier version of tika-batch, we had a single AutoDetectParser per
> thread, and we had no problems.  I experimented with a single
> AutoDetectParser across the threads, and we didn’t have problems.
>
>
>
> Because of configuration issues, tika-batch is now creating a new parser
> for each file.
>
>
>
> In our unit test suite, last I experimented with this, the first
> initialization did take a while, but then there was no measurable extra
> cost to instantiating a new parser.   In short, we didn’t save anything by
> using a static AutoDetectParser instead of just instantiating a new one for
> each unit test.
>
>
>
> If you are going from file system to file system, you might want to
> consider tika-batch.
>
>
>
> java -jar tika-app.jar -i <input_dir> -o <output_dir>
>
>
>
> If you have a whole lot of files (millions), try to isolate Tika in its
> own JVM or server or data center; bad things can happen.  See slide 17:
> http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
>
>
>
> And:
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
>
>
>
> *From:* Haris Osmanagic [mailto:[email protected]]
> *Sent:* Friday, September 30, 2016 10:54 AM
> *To:* [email protected]
> *Subject:* Re: Is creating new AutoDetectParsers expensive?
>
>
>
> I read the first sentence and thought: "Yes! I can save ourselves a bunch
> of memory!"
>
> Then I read the second: "Oh, oh, do I dare try it out?" : )
>
> Thank you very much for the super-speedy response!
>
>
>
> On Fri, Sep 30, 2016 at 4:46 PM Allison, Timothy B. <[email protected]>
> wrote:
>
> You can reuse AutoDetectParser in a multithreaded environment.  You
> shouldn’t have problems with performance or thread safety.
>
>
>
> If you find otherwise, please let us know! :)
>
>
>
> *From:* Haris Osmanagic [mailto:[email protected]]
> *Sent:* Friday, September 30, 2016 10:36 AM
> *To:* [email protected]
> *Subject:* Is creating new AutoDetectParsers expensive?
>
>
>
> Hi all!
>
> Let's assume there are really many files to be parsed, and the operation
> is repeated a relatively large number of times each day.
>
> Is it, in that case, too expensive to create new AutoDetectParsers for
> every file? Or, in other words, if I were to reuse an AutoDetectParser for a
> large number of files, would I:
>
> * Have problems with thread-safety?
>
> * Have problems with performance?
>
> Thank you very much!
>
> Haris Osmanagić
>
>
