Re: Exclude HTML elements from Crawl

2023-09-22 Thread Sebastian Nagel
Hi Michael,

> I wonder if there is not already a built-in option to exclude HTML
> elements (like a div with a given id or class or other elements like header).

No, there isn't one so far.

> I know https://issues.apache.org/jira/browse/NUTCH-585
> I also do not understand why this little patc

Exclude HTML elements from Crawl

2023-09-21 Thread Fritsch, Michael
Hello,

I use Nutch 1.18 to crawl our documentation with the parse-html plugin. Each page has elements such as TOCs which should not be included. I know https://issues.apache.org/jira/browse/NUTCH-585 and included one of the patches. However, I wonder if there is not already a built-in option to ex