Yes, the changes Sebastian suggested seem to be working fine. I now see all the outlinks in the parsed document, and the subsequent crawl of the outlinks filters out those that do not match my regex-urlfilter.
Thanks,
Sachin

On Fri, Oct 18, 2019 at 11:51 PM <yossi.tam...@pipl.com> wrote:

> Hi Sachin,
>
> If you're using the default crawl script, I think the answer was in
> Sebastian's email: the default seems to be to filter only in the Parse
> step. This has changed recently, so the Fetch step now filters as well,
> but only if you have the latest code. Otherwise, you need to remove the
> -noFilter flag from generate_args in the crawl script. I missed that,
> since I don't use this script.
> (Generally, always treat Sebastian's answers as The Best Answers!)
>
> Yossi.
>
> -----Original Message-----
> From: Sachin Mittal <sjmit...@gmail.com>
> Sent: Friday, 18 October 2019 17:36
> To: user@nutch.apache.org
> Subject: Re: Parsed segment has outlinks filtered
>
> Hi,
> Setting the property parse.filter.urls=false does not filter out the
> outlinks; I get all the outlinks for my parsed URL. So this is working
> as expected.
> However, it has caused something unwarranted in the FetcherThread: it
> now seems to be fetching all the URLs, even ones which do not match
> urlfilter-regex.
> These URLs were not fetched earlier. So what it seems to be doing is
> that when generating the next set of URLs, it is not applying
> urlfilter-regex.
>
> I will play around with the noFilter option as Sebastian mentioned and
> see if this works as expected.
>
> However, any idea why the next crawl cycle (from the previous crawl
> cycle's outlinks) does not seem to be applying the URL filters defined
> in urlfilter-regex?
>
> Thanks
> Sachin
>
>
>
> On Thu, Oct 17, 2019 at 11:53 PM Sachin Mittal <sjmit...@gmail.com> wrote:
>
> > Hi,
> >
> > Thanks, I figured this out. Let's hope it works!
> >
> > urlfilter-regex is required to filter out the URLs for the next crawl;
> > however, I still want to index all the outlinks for my current URL.
> > The reason is that I may not want Nutch to crawl these outlinks in the
> > next round, but I may still want some other crawler to scrape these
> > URLs.
> >
> > Sachin
> >
> >
> > On Thu, Oct 17, 2019 at 10:01 PM <yossi.tam...@pipl.com> wrote:
> >
> >> Hi Sachin,
> >>
> >> I'm not sure what you are trying to achieve: if you don't want to
> >> filter the outlinks, why do you enable urlfilter-regex?
> >> Anyway, if you set the property parse.filter.urls to false, the
> >> Parser will not filter outlinks at all.
> >>
> >> Yossi.
> >>
> >> -----Original Message-----
> >> From: Sachin Mittal <sjmit...@gmail.com>
> >> Sent: Thursday, 17 October 2019 19:15
> >> To: user@nutch.apache.org
> >> Subject: Parsed segment has outlinks filtered
> >>
> >> Hi,
> >> I was a bit confused about the outlinks generated from a parsed URL.
> >> If I use the utility:
> >>
> >> bin/nutch parsechecker url
> >>
> >> the generated output contains all the outlinks.
> >>
> >> However, if I check the dump of the parsed segment generated by the
> >> nutch crawl script, using the command:
> >>
> >> bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch
> >> -nogenerate -noparse -noparsetext
> >>
> >> and review the same entry's ParseData, I see it has a lot fewer
> >> outlinks. Basically it has filtered out all the outlinks which did
> >> not match the regexes defined in regex-urlfilter.txt.
> >>
> >> So I want to know if there is a way to avoid this and make sure the
> >> generated outlinks in the Nutch segments contain all the URLs and not
> >> just the filtered ones.
> >>
> >> Even if you can point to the code where this URL filtering happens
> >> for outlinks, I can figure out a way to circumvent this.
> >>
> >> Thanks
> >> Sachin
> >>
> >>
> >
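The resolution discussed in this thread amounts to two changes. First, to keep all outlinks in the parsed segment's ParseData, set parse.filter.urls to false. A minimal sketch, assuming a standard Nutch 1.x layout where the property is added inside the existing <configuration> element of conf/nutch-site.xml:

```xml
<!-- conf/nutch-site.xml: keep all outlinks in ParseData, -->
<!-- even those rejected by regex-urlfilter.txt -->
<property>
  <name>parse.filter.urls</name>
  <value>false</value>
</property>
```

Second, if the next Generate step then stops applying urlfilter-regex (as Sachin observed), remove the -noFilter flag from generate_args in the bin/crawl script, per Yossi's note above; the exact line varies between Nutch versions, and recent code filters at the Fetch step as well.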