Y, and you can't actually kill a thread. You can ask nicely via
Thread.interrupt(), but some of our dependencies don't bother to listen for
that. So, you're pretty much left with a separate process as the only robust
solution.
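To see why interruption alone isn't enough, here is a minimal sketch (mine, not from the thread): Future.cancel(true) delivers Thread.interrupt(), which only stops a task that cooperates by checking the interrupt flag. A parser spinning in a tight loop that never checks it will run forever.

```java
import java.util.concurrent.*;

public class InterruptDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // A cooperative task: it polls the interrupt flag, so cancel(true) stops it.
        Future<?> cooperative = pool.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                // simulate work
            }
            System.out.println("cooperative task observed interrupt");
        });
        Thread.sleep(100);
        cooperative.cancel(true);   // delivers Thread.interrupt()

        // An uncooperative task -- e.g. a dependency that never checks the
        // flag -- would spin in `while (true) { ... }` here and ignore
        // cancel(true) entirely; the only robust remedy is a separate process.

        pool.shutdownNow();
        pool.awaitTermination(2, TimeUnit.SECONDS);
        System.out.println("done");
    }
}
```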
So, we did the parent-child process thing for directory -> directory
processing in tika-app via tika-batch.
The next step is to harden tika-server and to kick that off in a child process
in a similar way.
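The shape of that watchdog could be sketched like this (a hypothetical simplification of mine; the real tika-batch parent/child protocol is more involved, with timeouts and scheduled restarts): the parent starts the child, waits on it, and restarts it on any abnormal exit.

```java
public class Watchdog {
    // Restart the child command on abnormal exit, up to maxRestarts times;
    // returns how many restarts were needed. The command is a stand-in for
    // a real child JVM invocation.
    static int superviseOnce(String[] cmd, int maxRestarts) throws Exception {
        int restarts = 0;
        while (true) {
            Process child = new ProcessBuilder(cmd).inheritIO().start();
            int exit = child.waitFor();
            if (exit == 0) {
                return restarts;   // clean shutdown
            }
            // Child died: OOM, killed by the OS, or self-exit on a hung
            // thread. Restart it, unless it keeps dying.
            if (++restarts > maxRestarts) {
                throw new IllegalStateException("child kept dying");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // stand-in child that exits cleanly
        int restarts = superviseOnce(new String[]{"true"}, 3);
        System.out.println("restarts: " + restarts);
    }
}
```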
For those who want to test their Tika harnesses (whether on a single box,
Hadoop/Spark, etc.), we added a MockParser that will do whatever you tell it
when it hits an "application/mock+xml" file. Full set of options:
<mock>
  <!-- action can be "add" or "set" -->
  <metadata action="add" name="author">Nikolai Lobachevsky</metadata>
  <!-- element is the name of the SAX event to write, p=paragraph;
       if the element is not specified, the default is <p> -->
  <write element="p">some content</write>
  <!-- write something to System.out -->
  <print_out>writing to System.out</print_out>
  <!-- write something to System.err -->
  <print_err>writing to System.err</print_err>
  <!-- hang
       millis: how many milliseconds to pause; the actual hang time will
           probably be a bit longer than the value specified
       heavy: whether or not the hang should do something computationally
           expensive; if false, this just does a Thread.sleep(millis).
           This attribute is optional, with a default of heavy=false.
       pulse_millis: (required if "heavy" is true) how often to check
           whether the thread was interrupted or the total hang time
           exceeded millis
       interruptible: whether or not the parser will check to see if its
           thread has been interrupted; this attribute is optional, with a
           default of true
  -->
  <hang millis="100" heavy="true" pulse_millis="10" interruptible="true" />
  <!-- throw an exception or error; optionally include a message -->
  <throw class="java.io.IOException">not another IOException</throw>
  <!-- throw a genuine OutOfMemoryError -->
  <oom/>
</mock>
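For example, a minimal mock file for exercising a harness's timeout handling (an untested sketch I've put together from the options above) might record an author, simulate a long interruptible hang, and then throw:

```xml
<mock>
  <metadata action="add" name="author">Nikolai Lobachevsky</metadata>
  <!-- sleep 30 seconds; a watchdog should kill/interrupt this first -->
  <hang millis="30000" heavy="false" interruptible="true" />
  <!-- only reached if the harness failed to time out -->
  <throw class="java.io.IOException">harness did not time out</throw>
</mock>
```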
-----Original Message-----
From: Erick Erickson [mailto:[email protected]]
Sent: Thursday, February 11, 2016 7:46 PM
To: solr-user <[email protected]>
Subject: Re: How is Tika used with Solr
Well, I'd imagine you could spawn threads and monitor/kill them as necessary,
although that doesn't deal with OOM errors....
FWIW,
Erick
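A hedged sketch of the monitor-the-thread approach Erick describes: wrap the parse call in a Future and bound it with a wall-clock timeout. Note that cancel(true) still relies on the task honoring interrupts, and, as he says, it does nothing for OOM. The task and timeout values here are illustrative stand-ins for a real parser call.

```java
import java.util.concurrent.*;

public class TimedParse {
    // Run a task with a wall-clock bound; return "TIMED_OUT" on timeout.
    static String parseWithTimeout(Callable<String> task, long timeoutMs)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<String> future = pool.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // best effort: only works if the task checks its interrupt flag
            future.cancel(true);
            return "TIMED_OUT";
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWithTimeout(() -> "extracted text", 1000));
        System.out.println(parseWithTimeout(() -> {
            Thread.sleep(60_000);   // simulate a hang (interruptible here)
            return "never";
        }, 200));
    }
}
```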
On Thu, Feb 11, 2016 at 3:08 PM, xavi jmlucjav <[email protected]> wrote:
> For sure, if I need heavy duty text extraction again, Tika would be
> the obvious choice if it covers dealing with hangs. I never used
> tika-server myself (not sure if it existed at the time) just used tika from
> my own jvm.
>
> On Thu, Feb 11, 2016 at 8:45 PM, Allison, Timothy B.
> <[email protected]>
> wrote:
>
>> x-post to Tika user's
>>
>> Y and n. If you run tika app as:
>>
>> java -jar tika-app.jar <input_dir> <output_dir>
>>
>> It runs tika-batch under the hood (TIKA-1330 as part of TIKA-1302).
>> This creates a parent and child process; if the child process notices
>> a hung thread, it dies, and the parent restarts it. Or if your OS
>> gets upset with the child process and kills it out of self
>> preservation, the parent restarts the child, or if there's an
>> OOM...and you can configure how often the child shuts itself down
>> (with parental restarting) to mitigate memory leaks.
>>
>> So, y, if your use case allows <input_dir> <output_dir>, then we now
>> have that in Tika.
>>
>> I've been wanting to add a similar watchdog to tika-server ... any
>> interest in that?
>>
>>
>> -----Original Message-----
>> From: xavi jmlucjav [mailto:[email protected]]
>> Sent: Thursday, February 11, 2016 2:16 PM
>> To: solr-user <[email protected]>
>> Subject: Re: How is Tika used with Solr
>>
>> I have found that when you deal with large amounts of all sort of
>> files, in the end you find stuff (pdfs are typically nasty) that will hang
>> tika.
>> That is even worse than a crash or OOM.
>> We used aperture instead of tika because at the time it provided a
>> watchdog feature to kill what seemed like a hung extraction thread.
>> That feature is super important for a robust text extracting
>> pipeline. Has Tika gained such feature already?
>>
>> xavier
>>
>> On Wed, Feb 10, 2016 at 6:37 PM, Erick Erickson
>> <[email protected]>
>> wrote:
>>
>> > Timothy's points are absolutely spot-on. In production scenarios,
>> > if you use the simple "run Tika in a SolrJ program" approach you
>> > _must_ abort the program on OOM errors and the like and figure out
>> > what's going on with the offending document(s). Or record the name
>> > somewhere and skip it next time 'round. Or........
>> >
>> > How much you have to build in here really depends on your use case.
>> > For "small enough"
>> > sets of documents or one-time indexing, you can get by with dealing
>> > with errors one at a time.
>> > For robust systems where you have to have indexing available at all
>> > times and _especially_ where you don't control the document corpus,
>> > you have to build something far more tolerant as per Tim's comments.
>> >
>> > FWIW,
>> > Erick
>> >
>> > On Wed, Feb 10, 2016 at 4:27 AM, Allison, Timothy B.
>> > <[email protected]>
>> > wrote:
>> > > I completely agree on the impulse, and for the vast majority of
>> > > the time
>> > (regular catchable exceptions), that'll work. And, by vast
>> > majority, aside from oom on very large files, we aren't seeing
>> > these problems any more in our 3 million doc corpus (y, I know,
>> > small by today's
>> > standards) from
>> > govdocs1 and Common Crawl over on our Rackspace vm.
>> > >
>> > > Given my focus on Tika, I'm overly sensitive to the worst case
>> > scenarios. I find it encouraging, Erick, that you haven't seen
>> > these types of problems, that users aren't complaining too often
>> > about catastrophic failures of Tika within Solr Cell, and that this
>> > thread is not yet swamped with integrators agreeing with me. :)
>> > >
>> > > However, because oom can leave memory in a corrupted state
>> > > (right?),
>> > because you can't actually kill a thread for a permanent hang and
>> > because Tika is a kitchen sink and we can't prevent memory leaks in
>> > our dependencies, one needs to be aware that bad things can
>> > happen...if only very, very rarely. For a fellow traveler who has
>> > run into these issues on massive data sets, see also [0].
>> > >
>> > > Configuring Hadoop to work around these types of problems is not
>> > > too
>> > difficult -- it has to be done with some thought, though. On
>> > conventional single box setups, the ForkParser within Tika is one
>> > option, tika-batch is another. Hand rolling your own parent/child
>> > process is non-trivial and is not necessary for the vast majority
>> > of use
>> cases.
>> > >
>> > >
>> > > [0]
>> > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
>> > >
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Erick Erickson [mailto:[email protected]]
>> > > Sent: Tuesday, February 09, 2016 10:05 PM
>> > > To: solr-user <[email protected]>
>> > > Subject: Re: How is Tika used with Solr
>> > >
>> > > My impulse would be to _not_ run Tika in its own JVM, just catch
>> > > any
>> > exceptions in my code and "do the right thing". I'm not sure I see
>> > any real benefit in yet another JVM.
>> > >
>> > > FWIW,
>> > > Erick
>> > >
>> > > On Tue, Feb 9, 2016 at 6:22 PM, Allison, Timothy B.
>> > > <[email protected]>
>> > wrote:
>> > >> I have one answer here [0], but I'd be interested to hear what
>> > >> Solr
>> > users/devs/integrators have experienced on this topic.
>> > >>
>> > >> [0]
>> > >> http://mail-archives.apache.org/mod_mbox/tika-user/201602.mbox/%3CCY1PR09MB0795EAED947B53965BC86874C7D70%40CY1PR09MB0795.namprd09.prod.outlook.com%3E
>> > >>
>> > >> -----Original Message-----
>> > >> From: Steven White [mailto:[email protected]]
>> > >> Sent: Tuesday, February 09, 2016 6:33 PM
>> > >> To: [email protected]
>> > >> Subject: Re: How is Tika used with Solr
>> > >>
>> > >> Thank you Erick and Alex.
>> > >>
>> > >> My main question is with a long running process using Tika in
>> > >> the same
>> > JVM as my application. I'm running my file-system-crawler in its
>> > own JVM (not Solr's). On the Tika mailing list, it is suggested to run
>> > Tika's code in its own JVM and invoke it from my
>> > file-system-crawler using Runtime.getRuntime().exec().
>> > >>
>> > >> I fully understand from Alex suggestion and link provided by
>> > >> Erick to
>> > use Tika outside Solr. But what about using Tika within the same
>> > JVM as my file-system-crawler application or should I be making a
>> > system call to invoke another JAR, that runs in its own JVM to
>> > extract the raw text? Are there known issues with Tika when used
>> > in a long running
>> process?
>> > >>
>> > >> Steve
>> > >>
>> > >>
>> >
>>