On 07/03/2018 13:29, lala wrote:
It's just confusing to me. In the DIH configuration file, the inner entity
that uses "TikaEntityProcessor" as its processor lets me point a tikaConfig
attribute at an XML file located in the core's config folder, and in that
file I should be able to override the PDFParser default properties... As in
parseContext.Config...
The thing is that I placed my tika-config.xml file in the config folder and
set the "tikaConfig" attribute to "tika-config.xml"... But Tika is still not
parsing images inside PDF files!
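
For reference, the setup described above might look roughly like the DIH
fragment below. This is only a sketch: the entity names, the baseDir path and
the field mapping are hypothetical placeholders, not taken from the original
configuration. Only the processor name and the tikaConfig attribute come from
the message.

```xml
<dataConfig>
  <!-- Binary data source so Tika receives the raw file bytes -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- Outer entity: lists the files to crawl (path is a placeholder) -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/docs" fileName=".*\.pdf"
            rootEntity="false" dataSource="null">
      <!-- Inner entity: tikaConfig names a file in the core's conf folder -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" dataSource="bin"
              tikaConfig="tika-config.xml" format="text">
        <!-- "content" is an assumed field name in the Solr schema -->
        <field column="text" name="content"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```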
Let's say this is just an experiment with Solr DIH crawling... Why it's not
This is my tika-config.xml file:
<?xml version="1.0" encoding="UTF-8"?>
I've read the code in both TikaEntityProcessor and TikaConfig... It should
read the XML file from the config folder, extract the params and override the
original PDFParser attributes. But it DOESN'T!
My reading of
indicates that your PDF parser may not run unless you explicitly exclude
PDFs, which I don't think you're doing above.
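
To illustrate what I mean by excluding PDFs, a tika-config.xml along the
following lines is what I would try. Treat it as an untested sketch, not a
known-good file: it assumes a Tika version recent enough to accept <params>
on a parser, and it assumes extractInlineImages is the property you are
after. The idea is to remove PDFParser from the default parser and then
declare it again with its own parameters.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Default parser handles everything except PDFs -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- PDFParser declared separately so its params take effect -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Assumed setting: extract images embedded in PDF pages -->
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Note that extracting inline images only hands them on as embedded documents.
Getting text out of them would additionally need an OCR parser to be
available, which is a separate question.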
I'm not an expert on Tika configuration, but I think you should first
try this xml file with standalone Tika and see if it does what you think
it should. Once you're sure, then try it with DIH or SolrJ.
Flax - Open Source Enterprise Search
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828