RE: Parsed segment has outlinks filtered

2019-10-18 Thread yossi.tamari
Hi Sachin, if you're using the default crawl script, I think the answer was in Sebastian's email: by default, filtering seems to happen only in the Parse step. This has changed recently, so the Fetch step now filters as well, but only if you have the latest code. Otherwise, you need to remove the

Re: Parsed segment has outlinks filtered

2019-10-18 Thread Sachin Mittal
Hi, with the property parse.filter.urls=false the outlinks are no longer filtered, and I get all the outlinks for my parsed URL, so this works as expected. However, it has had an unwanted side effect on the FetcherThread, which now seems to be fetching all the URLs (even ones which do not match

Re: Parsed segment has outlinks filtered

2019-10-18 Thread Sebastian Nagel
Hi Sachin, practically every Nutch tool (inject, generate, fetch, parse, update, index) can filter (and normalize) URLs. Because filtering and normalizing are expensive, only the steps which add new URLs (inject and parse) do this by default (see bin/crawl). For your use case you might instead