When we used Tika as a library with Hadoop map-reduce workflows, we had to run 
it in a separate thread with a timeout, and leave the thread as a zombie 
if/when it hung.
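
Roughly, the pattern looked like the sketch below. This is a reconstruction for illustration, not our actual code: the class name `TimedParse` and the method `parseWithTimeout` are made up, and the parse task is passed in as a plain `Callable<String>` so the snippet has no Tika dependency.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not part of Tika): runs a parse task with a hard time limit. */
final class TimedParse {

    // Daemon threads, so a parser that never returns cannot keep the JVM alive at shutdown.
    // A cached pool creates a fresh worker whenever earlier ones are still stuck.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "tika-parse");
        t.setDaemon(true);
        return t;
    });

    private TimedParse() {}

    /**
     * Runs the task on a worker thread and waits at most timeoutMillis for it.
     * Returns the task's result, or null if the time limit was exceeded.
     */
    static String parseWithTimeout(Callable<String> task, long timeoutMillis) {
        Future<String> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: a parser stuck in a tight loop ignores the interrupt,
            // and its worker thread is then simply abandoned (the "zombie").
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            // The calling thread was interrupted while waiting; restore the flag and give up.
            Thread.currentThread().interrupt();
            future.cancel(true);
            throw new IllegalStateException("interrupted while waiting for parse", e);
        } catch (ExecutionException e) {
            // The task itself threw; surface its cause to the caller.
            throw new IllegalStateException("parse failed", e.getCause());
        }
    }
}
```

With Tika on the classpath, the task could be something like `() -> new Tika().parseToString(file)`. Note that the timeout only frees the caller. It does not reclaim whatever memory or CPU the stuck thread is still holding.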

As for jar hell (a very real problem), you can either do careful exclusions 
in your dependency specification (painful and fragile) to avoid pulling in the 
world and creating incompatible jar versions, or you can create a 
shaded Tika jar.

— Ken

> On Nov 25, 2020, at 6:41 AM, Tucker B <[email protected]> wrote:
> 
> Not totally on topic, but I think it's related to this thread. I'm currently 
> exploring using Tika as a library in Apache Spark. This approach suffers from 
> the same problems as using Tika as a library mentioned above. Has anyone used 
> Tika as a library in a Spark job? Or would it still make sense to use something 
> external like tika-server? That seems like it might be counter to the point 
> of using Spark in the first place. 
> 
> On Tue, Nov 24, 2020 at 10:46 AM Slava G <[email protected]> wrote:
> We have been using Tika as a Java library for a few years now, parsing 
> millions of different files each day. We're now switching to tika-server, 
> because bugs in various Tika components (dependencies) caused issues such as 
> JVM exits and memory problems. Also, Tika and its dependencies bring in a 
> lot of other dependencies, so the switch should simplify maintenance and 
> reduce JAR hell. 
> 
> So, this is our road from Tika as a Java library to Tika as a server 😀
> 
> Thanks 
> 
> On Tue, Nov 24, 2020, 09:28 Ralph Soika <[email protected]> wrote:
> Hi Robert,
> 
> In the sense of a microservice architecture, it makes absolute sense to use 
> Tika as a server/microservice component. As Tim Allison explained, this helps 
> you separate your business requirements into isolated components (each 
> running in its own JVM). 
> If you don't need to link the Tika functionality closely to your code, then 
> use the server option wherever possible. 
> 
> Best regards
> 
> Ralph
> 
> On 23.11.20 21:36, Robert Raines wrote:
>> Hi,
>> 
>> I am using Tika to extract text from Word Docs and PDFs locally. It's great. 
>> Thank you Apache and Tika developers!  
>> 
>> Could someone help me understand why Tika offers a client-server option 
>> instead of just a code library? I am sure there was/is a good reason, so I 
>> am curious if anyone knows or if there are some resources that explain the 
>> history of how/why Tika also has its API architecture.
>> 
>> Thanks so much,
>> Robert
>> 
>> 
>> 
> -- 
> Imixs Software Solutions GmbH 
> Web: www.imixs.com <http://www.imixs.com/> Phone: +49 (0)89-452136 16 
> Office: Agnes-Pockels-Bogen 1, 80992 München
> Registergericht: Amtsgericht Muenchen, HRB 136045
> Geschaeftsführer: Gaby Heinle u. Ralph Soika
> 
> Imixs is an open source company, read more: www.imixs.org 
> <http://www.imixs.org/>

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
