Another option is to load and invoke Tika in its own classloader to keep its jars isolated from the rest of the application. We did this for a while, until we switched to Gradle and implemented the "careful exclusion" approach that Ken mentioned. The downside was the need to use reflection to invoke Tika (AutoDetectParser) and marshal properties in and out of Metadata. But it worked fine. We used this child-first classloader implementation: https://articles.qos.ch/delegation/src/java/ch/qos/ChildFirstClassLoader.java
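
For illustration only, here is a minimal sketch of the child-first idea combined with a reflective call. It is not the qos.ch implementation linked above; the names ChildFirstLoader and IsolatedInvoker are hypothetical, and the helper only handles no-argument constructors and methods.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical child-first loader: searches its own URLs before asking the parent.
class ChildFirstLoader extends URLClassLoader {
    ChildFirstLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (name.startsWith("java.")) {
                    // JDK classes always come from the normal parent chain.
                    c = super.loadClass(name, false);
                } else {
                    try {
                        // Child first: look in this loader's own jars.
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // Not in the isolated jars, so fall back to the parent.
                        c = super.loadClass(name, false);
                    }
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}

// Hypothetical helper: the caller never references the isolated class directly,
// so everything goes through reflection.
class IsolatedInvoker {
    static Object call(ClassLoader loader, String className, String methodName) throws Exception {
        Class<?> cls = Class.forName(className, true, loader);
        Object instance = cls.getDeclaredConstructor().newInstance();
        Method method = cls.getMethod(methodName);
        return method.invoke(instance);
    }
}
```

With the Tika jars passed in as the URL array, the class name to load would be org.apache.tika.parser.AutoDetectParser, and the parse call and Metadata access would need the same reflective treatment with arguments.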

Adam

On 11/25/2020 6:57 AM, Ken Krugler wrote:
When we used Tika as a library with Hadoop map-reduce workflows, we had to run it in a separate thread with a timeout, and leave the thread as a zombie if/when it hung.
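
A rough sketch of that thread-plus-timeout pattern, using only the JDK. The TimedParse class and its method are hypothetical names, and the Callable stands in for whatever code actually calls Tika. The worker is a daemon thread so that a hung parse, left behind as a zombie, does not keep the JVM alive.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedParse {
    // Runs the task on its own thread and returns the fallback if it does not
    // finish within timeoutMs. Exceptions thrown by the task propagate to the
    // caller wrapped in an ExecutionException.
    static <T> T runWithTimeout(Callable<T> task, long timeoutMs, T fallback) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "parse-worker");
            worker.setDaemon(true); // an abandoned worker must not block JVM exit
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker; a truly hung parser may ignore the interrupt,
            // in which case the thread is simply abandoned.
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller might use it as `TimedParse.runWithTimeout(() -> parseOneFile(path), 30_000, "")`, where parseOneFile is a placeholder for the real parsing code.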

As for jar hell (a very real problem), you can either do careful exclusions in your dependency specification (painful and fragile) to avoid pulling in the world and creating incompatible jar versions, or create a shaded Tika jar.

— Ken

On Nov 25, 2020, at 6:41 AM, Tucker B wrote:

Not totally on topic, but I think it is related to this thread. I'm currently exploring using Tika as a library in Apache Spark. This approach suffers from the same problems as using Tika as a library mentioned above. Has anyone used Tika as a library in a Spark job? Or would it still make sense to use something external like tika-server? That seems like it might be counter to the point of using Spark in the first place.
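
For context, calling an external tika-server from job code can be fairly small. The sketch below assumes a server reachable at a base URL such as http://localhost:9998 (the default tika-server port) and uses its PUT /tika endpoint with a plain-text response; the TikaServerClient class name and the 60-second timeout are assumptions made for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class TikaServerClient {
    // Builds a PUT /tika request that asks the server for plain extracted text.
    static HttpRequest buildRequest(URI serverBase, byte[] document) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .timeout(Duration.ofSeconds(60)) // assumed limit; tune per workload
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the document and returns the response body as a string.
    static String extractText(HttpClient client, URI serverBase, byte[] document)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(serverBase, document), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

The request timeout gives the caller a way to move on when a parse hangs, without managing threads inside the job itself.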

On Tue, Nov 24, 2020 at 10:46 AM Slava G wrote:

    We have been using Tika as a Java library for a few years now,
    parsing millions of different files each day. We're now switching
    to Tika server, because bugs in different Tika components
    (dependencies) caused issues such as JVM exits, memory problems,
    and so on. Also, Tika and its various dependencies bring in a lot
    of other dependencies, so this should simplify maintainability and
    reduce JAR hell.

    So, this is our road from Tika as a Java library to Tika as a server 😀

    Thanks

    On Tue, Nov 24, 2020, 09:28 Ralph Soika wrote:

        Hi Robert,

        In the sense of a microservice architecture it makes absolute
        sense to use Tika as a server/microservice component. As Tim
        Allison explained, this helps you separate your business
        requirements into isolated components (each running in its own JVM).

        If you don't need to link the Tika function closely to your
        code then use the server option wherever possible.


        Best regards

        Ralph


        On 23.11.20 21:36, Robert Raines wrote:
        Hi,

        I am using Tika to extract text from Word Docs and PDFs
        locally. It's great. Thank you Apache and Tika developers!

        Could someone help me understand why Tika offers a
        client-server option instead of just a code library? I am
        sure there was/is a good reason, so I am curious if anyone
        knows or if there are some resources that explain the
        history of how/why Tika also has its API architecture.

        Thanks so much,
        Robert



--
        *Imixs Software Solutions GmbH*
        *Web:* www.imixs.com <http://www.imixs.com/> *Phone:* +49
        (0)89-452136 16
        *Office:* Agnes-Pockels-Bogen 1, 80992 München
        Registergericht: Amtsgericht Muenchen, HRB 136045
        Geschaeftsführer: Gaby Heinle u. Ralph Soika

        *Imixs* is an open source company, read more: www.imixs.org
        <http://www.imixs.org/>


--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
