Hi,

Thanks, I figured this out. Let's hope it works!

urlfilter-regex is required to filter out the URLs for the next crawl; however,
I still want to index all the outlinks for my current URL.
The reason is that I may not want Nutch to crawl these outlinks in the next
round, but I may still want some other crawler to scrape these URLs.
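
In case it helps anyone else, here is a rough sketch of the override Yossi
describes below. Only the property name and value come from his reply; the
file (conf/nutch-site.xml) and the description text are my assumptions, and
the property is assumed to sit inside the file's existing <configuration>
element:

```xml
<!-- Assumed location: conf/nutch-site.xml, inside <configuration> -->
<property>
  <name>parse.filter.urls</name>
  <value>false</value>
  <!-- Description wording is illustrative, not copied from Nutch -->
  <description>Do not apply the URL filters to outlinks at parse time.</description>
</property>
```

The idea is that urlfilter-regex stays enabled for deciding what gets crawled
next, while the ParseData in the segment keeps the unfiltered outlinks.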

Sachin


On Thu, Oct 17, 2019 at 10:01 PM <yossi.tam...@pipl.com> wrote:

> Hi Sachin,
>
> I'm not sure what you are trying to achieve: If you don't want to filter
> the outlinks, why do you enable urlfilter-regex?
> Anyway, if you set the property parse.filter.urls to false, the Parser
> will not filter outlinks at all.
>
>         Yossi.
>
> -----Original Message-----
> From: Sachin Mittal <sjmit...@gmail.com>
> Sent: Thursday, 17 October 2019 19:15
> To: user@nutch.apache.org
> Subject: Parsed segment has outlinks filtered
>
> Hi,
> I was a bit confused about the outlinks generated from a parsed URL.
> If I use the utility:
>
> bin/nutch parsechecker url
>
> The output lists all the outlinks.
>
> However, if I check the dump of the parsed segment generated by the nutch
> crawl script, using the command:
>
> bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch -nogenerate -noparse -noparsetext
>
> and review the same entry's ParseData, I see it has a lot fewer outlinks.
> Basically, it has filtered out all the outlinks which did not match the
> regexes defined in regex-urlfilter.txt.
>
> So I want to know if there is a way to avoid this and make sure the
> generated outlinks in the Nutch segments contain all the URLs and not just
> the filtered ones.
>
> Even if you can only point me to the code where this URL filtering happens
> for outlinks, I can figure out a way to circumvent it.
>
> Thanks
> Sachin
>
>
