The problem (I think) is that tika-parsers.jar contains only the Tika parser
wrappers, not the boatload of actual parsers/dependencies they wrap (POI,
PDFBox, etc.). If you are using jars, I’d recommend tika-app.jar, which
includes all dependencies.
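If you want to confirm that before swapping jars, here is a rough sketch of a classpath check using only the JDK. The class name ClasspathCheck and its methods are made up for illustration, and the probed class names are my own picks as representatives of tika-core, tika-parsers, PDFBox, and POI; adjust them to whatever formats you care about. Run it with the same classpath as your application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Tika or Solr): reports which classes
// cannot be loaded from the current classpath.
class ClasspathCheck {

    // Returns true if the named class can be loaded, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Either the class itself or something it depends on is missing.
            return false;
        }
    }

    // Returns the subset of classNames that could not be loaded.
    static List<String> missing(List<String> classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isOnClasspath(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed probe classes, one per layer of the parsing stack.
        List<String> probes = Arrays.asList(
            "org.apache.tika.parser.AutoDetectParser",        // tika-core
            "org.apache.tika.parser.pdf.PDFParser",           // tika-parsers (wrapper)
            "org.apache.pdfbox.pdmodel.PDDocument",           // PDFBox (actual PDF parser)
            "org.apache.poi.poifs.filesystem.POIFSFileSystem" // POI (Office formats)
        );
        for (String name : missing(probes)) {
            System.out.println("Not on classpath: " + name);
        }
    }
}
```

If the Tika classes load but the PDFBox or POI ones are reported, that would be consistent with the wrappers being present while the underlying parsers are not.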
From: Steven White [mailto:[email protected]]
Sent: Tuesday, February 02, 2016 7:01 PM
To: [email protected]
Subject: Using Tika that comes with Solr 5.2
Hi everyone,
I have written a standalone application that works with Solr 5.2. I'm using
the existing JARs that come with Solr to index data off a file system. My
application scans the file system, looking for files, and then uses Tika to
extract the raw text and then sends the raw text to Solr, using SolrJ, for
indexing.
What I'm finding is that Tika will not extract the raw text from PDF,
PowerPoint, etc. files, but it will from plain text files.
Here is the code:
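import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public static void parseWithTika() throws Exception {
    File file = new File("C:\\temp\\test.pdf");
    FileInputStream in = new FileInputStream(file);
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    BodyContentHandler contentHandler = new BodyContentHandler();
    // Detect the file type and stream the extracted text into contentHandler.
    parser.parse(in, contentHandler, metadata);
    String content = contentHandler.toString(); // <=== 'content' is always an empty string
    in.close();
}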
In the above code, 'content' is always empty (the code is taken from
https://tika.apache.org/1.8/examples.html).
Solr 5.2 comes with the following Tika JARs, all of which I have included:
tika-core-1.7.jar, tika-java7-1.7.jar, tika-parsers-1.7.jar, tika-xmp-1.7.jar,
vorbis-java-tika-0.6.jar, kite-morphlines-tika-core-0.12.1.jar and
kite-morphlines-tika-decompress-0.12.1.jar
Any idea why this isn't working?
Thanks!
Steve