Hi Tim

Thanks, this is helpful.


  1.  For tika-app, we found that the dependency on
org.apache.tika<https://mvnrepository.com/artifact/org.apache.tika> »
tika-langdetect-optimaize<https://mvnrepository.com/artifact/org.apache.tika/tika-langdetect-optimaize>
brings in some older third-party jars. Unfortunately, it appears that
com.optimaize.languagedetector<https://mvnrepository.com/artifact/com.optimaize.languagedetector> »
language-detector<https://mvnrepository.com/artifact/com.optimaize.languagedetector/language-detector>
0.6 is unmaintained, so its dependencies on vulnerable versions of guava
(18.0) cause us problems with security scans. I could be wrong, but I don’t
believe we need this component for our usage of just detect and parse?


  2.  We have a sort of microservice process (Java-based) which ingests
files parsed by Tika. It was nice that we could keep the Tika process in
its own heap space, as a separate Java process, rather than adding it to our
app, but I suppose we could work around that.
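
For what it's worth, a custom wrapper would not have to give up the separate heap: our service could still launch it as a child JVM. A rough sketch of that idea, using only the JDK. The wrapper main class name (com.example.TikaDetectMain), the classpath (tika-wrapper/*) and the 512m heap limit are placeholders I made up for illustration, not anything that exists in Tika:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

public class TikaChildProcess {

    // Hypothetical placeholders: our own wrapper main class and its classpath.
    static final String WRAPPER_MAIN = "com.example.TikaDetectMain";
    static final String WRAPPER_CLASSPATH = "tika-wrapper/*";

    /** Builds the command line for a child JVM with its own heap limit. */
    static List<String> buildCommand(String maxHeap, Path file) {
        return List.of(
                "java",
                "-Xmx" + maxHeap,
                "-cp", WRAPPER_CLASSPATH,
                WRAPPER_MAIN,
                file.toString());
    }

    /** Runs the child JVM and returns whatever it printed to stdout. */
    static String run(String maxHeap, Path file) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(maxHeap, file))
                // Let the child's stderr flow to ours so it cannot fill a pipe and block.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        // Drain stdout fully before waiting for the exit code.
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Tika child process exited with code " + exitCode);
        }
        return output.trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run("512m", Path.of(args[0])));
    }
}
```

This is the same per-file process launch we do today with tika-app.jar, so it keeps the isolation but not any efficiency gain.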


Thank you
Brian Laskey

From: Tim Allison <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, March 8, 2024 at 9:44 AM
To: "[email protected]" <[email protected]>
Subject: [EXTERNAL] Re: Replacing full tika-app.jar to directly using tiki-core 
/ and parsers

Hi Brian,
  A few thoughts:

1) tika-app is basically tika-core + tika-parsers-standard-package. Which 
components are you trying to avoid? tika-serialization and jackson? 
boilerpipecontenthandler and some of its dependencies? I ask, because we could 
factor out a tika-app-core with no parsers in Tika 3.x, which is what we do now 
with tika-server-core and tika-server-standard.

2) Unrelated, there are probably more efficient ways of running Tika than 
calling it per file on the command line. That is a robust option, at least!

If all you want is detect and text extraction, and you want to run it from the 
command line, write two classes whose main()s call (with org.apache.tika.Tika 
from tika-core and java.io.File imported):
System.out.println(new Tika().detect(new File(args[0])));

or

System.out.println(new Tika().parseToString(new File(args[0])));

On Thu, Mar 7, 2024 at 5:04 PM Brian Laskey <[email protected]> wrote:
Hello Tika community,

Our team is migrating away from usage of tika-app.jar (2.6 currently) to 
something with more minimal third party dependencies which we can control.


Is there any good documentation or pathway to describe how a team could map the 
tika-app functionality we use to the same behavior using just tika-core and 
tika-parsers-standard-package
(I assume)?

The tika-app functions we use today are:

Mime-type detection
java -jar tika-app.jar -d <file>

and
Text extraction attempts
java -jar tika-app.jar -t <file>

Is there a subset of tika parser jars we would need to include to have 
equivalent functionality if we wrote our own wrapper main class?

Thank you,
Brian Laskey
