Hi Sven,

Please have a look at the Simple History report to see what happened to the documents you are interested in. The Web Connector fetches binary documents without any problem, but it sounds like something else in your configuration is causing them to be rejected. The web connector configuration and each of the downstream pipeline connector configurations can reject documents based on mime type. The Simple History should give you the reason for that rejection. If it does not, you can turn on connector debugging to see the decisions that go into whether or not a document is indexed.
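To illustrate the idea that any stage can reject a document by mime type, here is a minimal, self-contained sketch. It is not ManifoldCF code: the `Stage` class, the stage names, and the allowed mime type sets are all hypothetical, chosen only to show how a PDF can be fetched by the first stage and still be dropped further down with a recorded reason.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

class MimeFilterSketch {
    // Hypothetical stand-in for one pipeline stage and the mime types it accepts.
    static final class Stage {
        final String name;
        final Set<String> allowedMimeTypes;

        Stage(String name, Set<String> allowedMimeTypes) {
            this.name = name;
            this.allowedMimeTypes = allowedMimeTypes;
        }
    }

    // Returns the reason for the first rejection, or empty if every stage accepts.
    static Optional<String> firstRejection(String mimeType, List<Stage> pipeline) {
        for (Stage stage : pipeline) {
            if (!stage.allowedMimeTypes.contains(mimeType)) {
                return Optional.of(stage.name + " rejected mime type " + mimeType);
            }
        }
        return Optional.empty();
    }

    // Assumed example setup: the first stage allows PDFs, the last one does not.
    static List<Stage> examplePipeline() {
        return List.of(
            new Stage("web connector", Set.of("text/html", "text/plain", "application/pdf")),
            new Stage("output connector", Set.of("text/html", "text/plain")));
    }

    public static void main(String[] args) {
        List<Stage> pipeline = examplePipeline();
        for (String mimeType : List.of("text/html", "application/pdf")) {
            String outcome = firstRejection(mimeType, pipeline).orElse("accepted by every stage");
            System.out.println(mimeType + ": " + outcome);
        }
    }
}
```

In this sketch the PDF passes the first stage but is rejected by the second, which is the kind of silent drop the Simple History report is meant to explain.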
Karl

On Tue, Aug 14, 2018 at 5:40 AM Farrenkopf, Sven <[email protected]> wrote:

> I’m using manifoldCF with solr, trying to get it working as a webcrawler.
> Crawling the websites (HTML, Text) works fine, the problem is that links to
> binary documents (pdf, xlsx, docx, …) don’t work even if I put a
> tika-Transformation in the job. I haven’t even found a written confirmation
> that the webcrawler-connector does support binary documents, although some
> posts to the mailing-lists indicate that it is possible.
>
> The documents are apparently recognized – I put a direct link to a
> pdf-document in the seeds and it is processed as I run the job.
>
> But there is no error (Tika-errors are not ignored!) and the document is
> not transferred to solr. With no error-message I have nothing to work with …
>
> Any ideas/hints what to do? Does somebody know a tutorial for setting up a
> webcrawler with solr & tika? I haven’t found any on the web, which made me
> ask myself if I’m trying sth impossible here?
>
> Thanks in advance.
>
> Sven
