Hi Sachin,
If you're using the default crawl script, I think the answer was in Sebastian's
email: the default seems to be to filter only in the Parse step. This has
changed recently, so the Fetch step now filters as well, but only if you have
the latest code. Otherwise, you need to remove the
Hi,
With the property parse.filter.urls=false the outlinks are no longer filtered:
I get all the outlinks for my parsed URL, so this is working as expected.
However, it has had an unwanted side effect on the FetcherThread, which now
seems to be fetching all the URLs (even ones which do not match
Hi Sachin,
Practically every Nutch tool (inject, generate, fetch, parse, update, index)
can filter (and normalize) URLs. Because filtering and normalizing are expensive,
only the steps which add new URLs (inject and parse) do this by default (see
bin/crawl).
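
To make the point about defaults concrete, here is a toy Python sketch. It is
not Nutch code: the rule list, the example URLs and the functions accept,
inject, parse and fetch are all hypothetical, and the "first matching rule
wins, reject if nothing matches" behaviour is only modelled on a +/- regex
filter file. It shows why turning filtering off in the parse step lets
unwanted URLs reach a fetch step that does not filter by default.

```python
import re

# Hypothetical +/- rules, checked in order; the first matching rule decides.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$")),
    ("+", re.compile(r"^https?://example\.org/")),
]


def accept(url):
    """Return True if the first matching rule is '+'; reject if none match."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False


def inject(seeds, do_filter=True):
    # Adds new URLs, so it filters by default.
    return [u for u in seeds if not do_filter or accept(u)]


def parse(outlinks, do_filter=True):
    # Adds new URLs (outlinks), so it filters by default.
    return [u for u in outlinks if not do_filter or accept(u)]


def fetch(urls, do_filter=False):
    # Adds no new URLs, so filtering is off by default in this sketch.
    return [u for u in urls if not do_filter or accept(u)]


seeds = ["https://example.org/", "https://other.example.com/"]
outlinks = [
    "https://example.org/a.html",
    "https://example.org/logo.png",
    "https://elsewhere.test/x",
]

# Default behaviour: only inject and parse filter.
db = inject(seeds) + parse(outlinks)
fetched = fetch(db)

# Parse filtering switched off: every outlink reaches the fetch step.
unfiltered = fetch(inject(seeds) + parse(outlinks, do_filter=False))

# Filtering again in the fetch step restores the restriction.
refiltered = fetch(inject(seeds) + parse(outlinks, do_filter=False), do_filter=True)
```

In this sketch, filtering only where URLs are added keeps the expensive check
off the other steps, at the cost that disabling it in one place has to be
compensated for somewhere else.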
For your use case you might instead