The problem (I think) is that tika-parsers.jar contains only Tika's parser 
wrappers, not the boatload of actual parsers/dependencies (POI, PDFBox, etc.) 
that those wrappers call. If you are using jars, I'd recommend tika-app.jar, 
which includes all dependencies.
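
One way to test that theory is to check whether the underlying parser libraries are actually loadable from your application's classpath. The following is only a rough diagnostic sketch using the plain JDK: the class name ClasspathCheck and the helper isOnClasspath are made up for illustration, and the probed class names are my assumption of representative classes from Tika, PDFBox and POI, so adjust them to the versions you have.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Diagnostic sketch (hypothetical helper, not part of Tika or Solr):
// reports whether a few representative classes can be found on the classpath.
class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed representative class names; change them to match your jar versions.
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("Tika core", "org.apache.tika.parser.AutoDetectParser");
        probes.put("Tika PDF wrapper", "org.apache.tika.parser.pdf.PDFParser");
        probes.put("PDFBox", "org.apache.pdfbox.pdmodel.PDDocument");
        probes.put("POI", "org.apache.poi.poifs.filesystem.POIFSFileSystem");

        for (Map.Entry<String, String> probe : probes.entrySet()) {
            String status = isOnClasspath(probe.getValue()) ? "found" : "MISSING";
            System.out.println(probe.getKey() + " (" + probe.getValue() + "): " + status);
        }
    }
}
```

If the Tika entries report "found" while PDFBox or POI report "MISSING", that would be consistent with the wrappers being present without the libraries they delegate to.
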
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, February 02, 2016 7:01 PM
To: user@tika.apache.org
Subject: Using Tika that comes with Solr 5.2

Hi everyone,

I have written a standalone application that works with Solr 5.2.  I'm using 
the existing JARs that come with Solr to index data off a file system.  My 
application scans the file system looking for files, uses Tika to extract the 
raw text, and then sends that text to Solr, using SolrJ, for indexing.

What I'm finding is that Tika will not extract the raw text from PDF, 
PowerPoint, etc. files, but it will from plain text files.

Here is the code:

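import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public static void parseWithTika() throws Exception {
  File file = new File("C:\\temp\\test.pdf");

  FileInputStream in = new FileInputStream(file);
  AutoDetectParser parser = new AutoDetectParser();
  Metadata metadata = new Metadata();
  BodyContentHandler contentHandler = new BodyContentHandler();

  // Detect the file type and stream the extracted text into contentHandler
  parser.parse(in, contentHandler, metadata);

  // <=== 'content' is always an empty string
  String content = contentHandler.toString();

  in.close();
}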

In the above code, 'content' is always empty (the code is taken from 
https://tika.apache.org/1.8/examples.html).

Solr 5.2 comes with the following Tika JARs, all of which I have included: 
tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, tika-xmp-1.7.jar, 
vorbis-java-tika-0.6.jar, kite-morphlines-tika-core-0.12.1.jar and 
kite-morphlines-tika-decompress-0.12.1.jar.

Any idea why this isn't working?

Thanks!

Steve
