Using ManifoldCF as a web crawler with Tika and Solr

Farrenkopf, Sven Tue, 14 Aug 2018 02:40:50 -0700

I'm using ManifoldCF with Solr and trying to get it working as a web crawler. 
Crawling web pages (HTML, plain text) works fine. The problem is that links to 
binary documents (pdf, xlsx, docx, ...) are not indexed, even when I add a 
Tika transformation to the job. I haven't even found written confirmation 
that the web crawler connector supports binary documents, although some 
posts on the mailing lists indicate that it is possible.


The documents are apparently recognized: I put a direct link to a PDF document 
in the seeds, and it is processed when I run the job.

But no error is reported (Tika errors are not ignored!) and the document is not 
transferred to Solr. Without an error message I have nothing to work with ...
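
For reference, a lookup like the following should show whether the PDF ever 
reached the index. This is only a minimal Python sketch under assumptions: the 
host, the core name `mycore`, and the example document URL are placeholders, 
and it assumes the Solr output connector stores the document URL in the `id` 
field (the field name may differ in other setups).

```python
from urllib.parse import urlencode


def solr_lookup_url(base_url, core, doc_id, id_field="id"):
    """Build a Solr /select URL that looks up one document by its id.

    base_url, core and id_field are placeholders for illustration;
    adjust them to the actual Solr installation.
    """
    # Escape backslashes and double quotes so the id can sit inside a quoted phrase.
    escaped = doc_id.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{id_field}:"{escaped}"', "rows": 1, "wt": "json"}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical values: a local Solr, a core named "mycore", an example PDF URL.
url = solr_lookup_url(
    "http://localhost:8983/solr",
    "mycore",
    "http://example.com/docs/report.pdf",
)
print(url)
```

Opening the printed URL in a browser (or fetching it with curl) and seeing 
`"numFound":0` in the response would confirm that the document never arrived 
in Solr.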

Any ideas or hints on what to do? Does anybody know of a tutorial for setting 
up a web crawler with Solr and Tika? I haven't found any on the web, which 
makes me wonder whether I'm trying something impossible here.

Thanks in advance.

Sven
