Yes, the changes Sebastian suggested seem to be working fine.
I now see all the outlinks in the parsed document, and the subsequent crawl
of the outlinks filters out those that do not match my regex-urlfilter rules.
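For context, regex-urlfilter.txt applies its rules top to bottom and the first
matching rule wins: a leading '+' keeps the URL, a leading '-' drops it, and a
URL matching no rule is dropped. A minimal sketch of that first-match-wins
logic (this is not the actual Nutch plugin code, and the rules are made up):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Sketch of first-match-wins URL filtering, in the spirit of Nutch's
// urlfilter-regex plugin. Not the real implementation.
public class FirstMatchFilter {
    // Insertion order matters: rules are tried top to bottom.
    private final Map<Pattern, Boolean> rules = new LinkedHashMap<>();

    // accept=true corresponds to a '+' rule, accept=false to a '-' rule.
    void addRule(boolean accept, String regex) {
        rules.put(Pattern.compile(regex), accept);
    }

    // Returns the URL if the first matching rule is '+', otherwise null.
    // URLs matching no rule at all are also dropped.
    String filter(String url) {
        for (Map.Entry<Pattern, Boolean> e : rules.entrySet()) {
            if (e.getKey().matcher(url).find()) {
                return e.getValue() ? url : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        FirstMatchFilter f = new FirstMatchFilter();
        f.addRule(false, "\\.(gif|jpg|png)$");        // hypothetical '-' rule
        f.addRule(true, "^https?://example\\.com/");  // hypothetical '+' rule
        System.out.println(f.filter("https://example.com/page.html")); // kept
        System.out.println(f.filter("https://example.com/logo.png"));  // null
    }
}
```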

Thanks
Sachin


On Fri, Oct 18, 2019 at 11:51 PM <yossi.tam...@pipl.com> wrote:

> Hi Sachin,
>
> If you're using the default crawl script, I think the answer was in
> Sebastian's email: the default seems to be to filter only in the Parse
> step. This has changed recently, so the Fetch step now filters as well, but
> only if you have the latest code. Otherwise, you need to remove the
> -noFilter flag from generate_args in the crawl script. I missed that, since
> I don't use this script.
> (Generally, always treat Sebastian's answers as The Best Answers!)
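> A sketch of that edit in the crawl script (the surrounding arguments shown
> here are placeholders, not the script's actual contents; the point is only
> the removal of -noFilter):
>
> ```
> # before: the generate step skips URL filtering
> generate_args=($commonOptions -topN $sizeFetchlist -noFilter)
> # after: without -noFilter, urlfilter-regex is applied at generate time
> generate_args=($commonOptions -topN $sizeFetchlist)
> ```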
>
>         Yossi.
>
> -----Original Message-----
> From: Sachin Mittal <sjmit...@gmail.com>
> Sent: Friday, 18 October 2019 17:36
> To: user@nutch.apache.org
> Subject: Re: Parsed segment has outlinks filtered
>
> Hi,
> Setting the property parse.filter.urls=false stops the outlinks from being
> filtered; I now get all the outlinks for my parsed URL, so that part works
> as expected.
> However, it has caused an unwanted side effect in the FetcherThread: it now
> seems to fetch all the URLs, even those that do not match urlfilter-regex.
> These URLs were not fetched earlier. So it appears that when generating the
> next set of URLs, Nutch is not applying urlfilter-regex.
>
> I will play around with the -noFilter option, as Sebastian mentioned, and
> see if this works as expected.
>
> However, any idea why the next crawl cycle (generated from the previous
> cycle's outlinks) does not seem to apply the URL filters defined in
> urlfilter-regex?
>
> Thanks
> Sachin
>
>
>
> On Thu, Oct 17, 2019 at 11:53 PM Sachin Mittal <sjmit...@gmail.com> wrote:
>
> > Hi,
> >
> > Thanks, I figured this out. Let's hope it works!
> >
> > urlfilter-regex is required to filter out the URLs for the next crawl;
> > however, I still want to index all the outlinks of my current URL.
> > The reason is that I may not want Nutch to crawl these outlinks in the
> > next round, but I may still want some other crawler to scrape them.
> >
> > Sachin
> >
> >
> > On Thu, Oct 17, 2019 at 10:01 PM <yossi.tam...@pipl.com> wrote:
> >
> >> Hi Sachin,
> >>
> >> I'm not sure what you are trying to achieve: If you don't want to
> >> filter the outlinks, why do you enable urlfilter-regex?
> >> Anyway, if you set the property parse.filter.urls to false, the
> >> Parser will not filter outlinks at all.
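> >> For reference, that property goes in conf/nutch-site.xml; a sketch (the
> >> property name is the real one, the description wording is mine):
> >>
> >> ```
> >> <property>
> >>   <name>parse.filter.urls</name>
> >>   <value>false</value>
> >>   <description>If true, outlinks found at parse time are passed
> >>   through the configured URL filters.</description>
> >> </property>
> >> ```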
> >>
> >>         Yossi.
> >>
> >> -----Original Message-----
> >> From: Sachin Mittal <sjmit...@gmail.com>
> >> Sent: Thursday, 17 October 2019 19:15
> >> To: user@nutch.apache.org
> >> Subject: Parsed segment has outlinks filtered
> >>
> >> Hi,
> >> I was a bit confused about the outlinks generated from a parsed URL.
> >> If I use the utility:
> >>
> >> bin/nutch parsechecker url
> >>
> >> The generated output contains all the outlinks.
> >>
> >> However, if I check the dump of the parsed segment generated by the
> >> Nutch crawl script, using the command:
> >>
> >> bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch
> >> -nogenerate -noparse -noparsetext
> >>
> >> When I review the same entry's ParseData, I see it has far fewer outlinks.
> >> Basically, it has filtered out all the outlinks that did not match the
> >> regexes defined in regex-urlfilter.txt.
> >>
> >> So I want to know if there is a way to avoid this and make sure the
> >> generated outlinks in the Nutch segments contain all the URLs, not
> >> just the filtered ones.
> >>
> >> Even if you can just point me to the code where this URL filtering
> >> happens for outlinks, I can figure out a way to work around it.
> >>
> >> Thanks
> >> Sachin
> >>
> >>
>
>
