Hello Christian,

>we've got a problem using Nutch: On the website that has to be crawled, there
>is a navigation on top of each page. Nutch crawls the navigation of each page
>which leads to the situation that for certain queries (that are included in
>the navigation) every page is delivered as a result.

We have always used the blacklist-whitelist plugin for this.
It lets you specify the tags, ids and classes in your HTML that should be
whitelisted or blacklisted.

http://lucene.472066.n3.nabble.com/HTML-tag-filtering-td4116686.html
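
To illustrate the idea of blacklisting, here is a rough Python sketch. It is not the plugin's code and does not use its configuration syntax; the "tag", "tag#id" and "tag.class" selector strings and all names in it are made up for this example. It collects the text of a page and skips everything inside a blacklisted element, such as a navigation block.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlacklistTextExtractor(HTMLParser):
    """Collects page text, skipping everything inside blacklisted elements."""

    def __init__(self, blacklist):
        super().__init__()
        # Hypothetical selector syntax: "tag", "tag#id" or "tag.class".
        self.blacklist = set(blacklist)
        self.skip_depth = 0  # > 0 while we are inside a blacklisted element
        self.chunks = []

    def _is_blacklisted(self, tag, attrs):
        attrs = dict(attrs)
        candidates = {tag}
        if attrs.get("id"):
            candidates.add(f"{tag}#{attrs['id']}")
        for cls in (attrs.get("class") or "").split():
            candidates.add(f"{tag}.{cls}")
        return bool(candidates & self.blacklist)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside a blacklisted element, every nested element is skipped too.
        if self.skip_depth or self._is_blacklisted(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, blacklist):
    parser = BlacklistTextExtractor(blacklist)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ('<html><body><div id="nav"><a href="/">Home</a> '
        '<a href="/p">Products</a></div>'
        '<p>Actual page content.</p></body></html>')
print(extract_text(page, ["div#nav"]))
```

With the navigation div blacklisted, only the paragraph text is left for indexing, so a query for "Products" would no longer match every page.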

Here is a version compiled for Nutch 1.12 with Java 8:

https://aarboard.oncloud7.ch/index.php/s/MfFDlsUBWMWW5ZM


André
