Another option is to load and invoke Tika in its own classloader to keep
its jars isolated from the rest of the application. We did this for a
while, until we switched to Gradle and implemented the "careful
exclusion" approach that Ken mentioned. The downside was the need to use
reflection to invoke Tika (AutoDetectParser) and marshal properties in
and out of Metadata. But it worked fine. We used this child-first
classloader implementation:
https://articles.qos.ch/delegation/src/java/ch/qos/ChildFirstClassLoader.java
Adam
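For anyone wanting to try the isolation approach, here is a minimal sketch of the idea, not the code we actually ran. The ChildFirstClassLoader below is a simplified stand-in for the linked implementation, and the class name IsolatedInvoker and its invoke helper are hypothetical. The demo loads a JDK class so it runs without any Tika jars; with Tika, the URL array would point at the Tika jars and the class name would be the parser's.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

final class IsolatedInvoker {

    /** Simplified child-first loader: searches its own URLs before asking the parent. */
    static class ChildFirstClassLoader extends URLClassLoader {
        ChildFirstClassLoader(URL[] urls, ClassLoader parent) {
            super(urls, parent);
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            synchronized (getClassLoadingLock(name)) {
                Class<?> c = findLoadedClass(name);
                if (c == null) {
                    try {
                        c = findClass(name);              // our own jars first
                    } catch (ClassNotFoundException e) {
                        c = super.loadClass(name, false); // then the usual parent delegation
                    }
                }
                if (resolve) {
                    resolveClass(c);
                }
                return c;
            }
        }
    }

    /**
     * Hypothetical helper: loads className through the given loader, creates an
     * instance with the no-arg constructor, and calls the named method reflectively.
     * Argument and return types must be visible to both sides (JDK types are safest).
     */
    static Object invoke(ClassLoader loader, String className, String methodName,
                         Class<?>[] paramTypes, Object... args)
            throws ReflectiveOperationException {
        Class<?> cls = Class.forName(className, true, loader);
        Object target = cls.getDeclaredConstructor().newInstance();
        Method method = cls.getMethod(methodName, paramTypes);
        return method.invoke(target, args);
    }

    public static void main(String[] args) throws Exception {
        // Empty URL array: every lookup falls through to the parent loader.
        try (ChildFirstClassLoader loader =
                 new ChildFirstClassLoader(new URL[0], IsolatedInvoker.class.getClassLoader())) {
            Object result = invoke(loader, "java.lang.StringBuilder", "append",
                                   new Class<?>[] {String.class}, "hello");
            System.out.println(result);
        }
    }
}
```

The marshalling cost mentioned above comes from the last point in the invoke comment: a Metadata object created inside the child loader is a different class from the one the application sees, so values have to cross the boundary as plain JDK types such as String or Map.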
On 11/25/2020 6:57 AM, Ken Krugler wrote:
When we used Tika as a library with Hadoop map-reduce workflows, we
had to run it in a separate thread with a timeout, and leave the
thread as a zombie if/when it hung.
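A rough sketch of the thread-with-timeout pattern described above, assuming the parse call is wrapped in a Callable. The names TimedParse and callWithTimeout are made up for illustration; this is not the code from the original workflow.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedParse {

    /**
     * Runs the task on a fresh daemon thread and waits up to timeoutMillis.
     * On timeout the fallback is returned and the worker is abandoned.
     */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "parse-worker");
            worker.setDaemon(true); // an abandoned worker must not keep the JVM alive
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // best-effort interrupt; a truly hung parser may ignore it
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the task", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdown(); // accept no new work; a stuck worker is left behind
        }
    }

    public static void main(String[] args) {
        String fast = callWithTimeout(() -> "parsed text", 1_000, "TIMED OUT");
        System.out.println(fast);

        // Stand-in for a parser that hangs: sleeps far longer than the timeout.
        String slow = callWithTimeout(() -> {
            Thread.sleep(5_000);
            return "late";
        }, 100, "TIMED OUT");
        System.out.println(slow);
    }
}
```

Using one executor per call keeps a hung worker from blocking later parses, at the cost of leaking a thread for each hang. That is the "zombie" trade-off: a thread stuck in a tight loop cannot be stopped safely from within the JVM.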
As for jar hell (a very real problem), you can either do careful
exclusions in your dependency specification (painful, and fragile) to
avoid pulling in the world and creating incompatibilities in jar
versions, or you could create a shaded Tika jar.
— Ken
On Nov 25, 2020, at 6:41 AM, Tucker B wrote:
Not totally on topic, but I think it is related to this thread. I'm
currently exploring using Tika as a library in Apache Spark. This
approach suffers from the same problems as using Tika as a library
mentioned above. Has anyone used Tika as a library in a Spark job? Or
would it still make sense to use something external like tika-server?
That seems like it might be counter to the point of using Spark in the
first place.
On Tue, Nov 24, 2020 at 10:46 AM Slava G wrote:
We have been using Tika as a Java library for a few years now,
parsing millions of different files each day. We're now switching
to Tika server, because bugs in various Tika components
(dependencies) caused issues such as JVM exits, memory problems,
and so on. Also, Tika and its dependencies bring in a lot of
other dependencies, so the switch should simplify maintenance and
reduce JAR hell.
So, this is our road from Tika as a Java library to Tika as a server 😀
Thanks
On Tue, Nov 24, 2020, 09:28 Ralph Soika wrote:
Hi Robert,
In the sense of a microservice architecture, it makes absolute
sense to use Tika as a server/microservice component. As Tim
Allison explained, this helps you separate your business
requirements into isolated components (each running in its own JVM).
If you don't need to link the Tika function closely to your
code, then use the server option wherever possible.
Best regards
Ralph
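To make the server option concrete, here is a small sketch of a client, assuming a tika-server instance is already running on its default port 9998 and that a PUT to the /tika endpoint with "Accept: text/plain" returns the extracted text. The class and method names (TikaServerClient, buildRequest, extractText) and the 60-second timeout are illustrative choices, and the code needs Java 11 or later for java.net.http.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class TikaServerClient {

    /** Builds a PUT request that sends the document bytes to {base}/tika. */
    static HttpRequest buildRequest(URI base, byte[] document) {
        return HttpRequest.newBuilder(base.resolve("/tika"))
                .timeout(Duration.ofSeconds(60)) // assumed limit; tune for large files
                .header("Accept", "text/plain")  // ask for plain extracted text
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Sends the document and returns the response body, failing on any non-200 status. */
    static String extractText(HttpClient client, URI base, byte[] document)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(base, document), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) {
        // Only builds and prints the request, so this runs without a server.
        HttpRequest request = buildRequest(URI.create("http://localhost:9998"), new byte[] {1, 2, 3});
        System.out.println(request.method() + " " + request.uri());
    }
}
```

With a server running, a call like extractText(HttpClient.newHttpClient(), URI.create("http://localhost:9998"), bytes) keeps parser crashes and memory problems inside the server's JVM; the client only sees a failed or timed-out HTTP request.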
On 23.11.20 21:36, Robert Raines wrote:
Hi,
I am using Tika to extract text from Word Docs and PDFs
locally. It's great. Thank you Apache and Tika developers!
Could someone help me understand why Tika offers a
client-server option instead of just a code library? I am
sure there was/is a good reason, so I am curious if anyone
knows or if there are some resources that explain the
history of how/why Tika also has its API architecture.
Thanks so much,
Robert