Hi,
Thanks, I figured this out. Let's hope it works!
urlfilter-regex is required to filter out the URLs for the next crawl; however, I still want to index all the outlinks for my current URL. The reason is that I may not want Nutch to crawl these outlinks in the next round, but I may still want some other crawler to scrape these URLs.

Sachin

On Thu, Oct 17, 2019 at 10:01 PM <yossi.tam...@pipl.com> wrote:

> Hi Sachin,
>
> I'm not sure what you are trying to achieve: if you don't want to filter
> the outlinks, why do you enable urlfilter-regex?
> Anyway, if you set the property parse.filter.urls to false, the Parser
> will not filter outlinks at all.
>
> Yossi.
>
> -----Original Message-----
> From: Sachin Mittal <sjmit...@gmail.com>
> Sent: Thursday, 17 October 2019 19:15
> To: email@example.com
> Subject: Parsed segment has outlinks filtered
>
> Hi,
> I was a bit confused about the outlinks generated from a parsed URL.
> If I use the utility:
>
> bin/nutch parsechecker url
>
> the generated output contains all the outlinks.
>
> However, if I check the dump of a parsed segment generated by the Nutch
> crawl script, using the command:
>
> bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch -nogenerate -noparse -noparsetext
>
> and review the same entry's ParseData, I see it has a lot fewer outlinks.
> Basically, it has filtered out all the outlinks which did not match the
> regexes defined in regex-urlfilter.txt.
>
> So I want to know if there is a way to avoid this and make sure the
> outlinks in the generated Nutch segments contain all the URLs, not just
> the filtered ones.
>
> Even if you can just point me to the code where this URL filtering happens
> for outlinks, I can figure out a way to circumvent it.
>
> Thanks
> Sachin
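[For anyone hitting this thread from the archives: the override Yossi describes would go in conf/nutch-site.xml. This is a minimal sketch, not a tested config; parse.filter.urls is the property named in the reply quoted above, and the description text is paraphrased, not copied from nutch-default.xml. With it set to false, the parser should leave outlinks unfiltered in the segment's ParseData, while urlfilter-regex can still be applied at generate/fetch time.]

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml -- overrides nutch-default.xml -->
<configuration>
  <property>
    <name>parse.filter.urls</name>
    <value>false</value>
    <!-- When false, the parse step does not run the configured URL
         filters (e.g. urlfilter-regex) over extracted outlinks, so
         all outlinks are kept in the parsed segment. -->
  </property>
</configuration>
```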