RE: Parsed segment has outlinks filtered

2019-10-17 Thread yossi.tamari
Hi Sachin,

I'm not sure what you are trying to achieve: if you don't want to filter the 
outlinks, why do you enable urlfilter-regex?
Anyway, if you set the property parse.filter.urls to false, the parser will not 
filter outlinks at all.
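
For reference, the override would look roughly like this in 
conf/nutch-site.xml. This is only a sketch: it assumes your crawl reads its 
configuration from that file, and segments that were already parsed would need 
to be parsed again before their ParseData changes.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location of your site-specific overrides) -->
<configuration>
  <!-- Stop the parser from applying the configured URL filters to outlinks -->
  <property>
    <name>parse.filter.urls</name>
    <value>false</value>
  </property>
</configuration>
```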

Yossi.

-Original Message-
From: Sachin Mittal  
Sent: Thursday, 17 October 2019 19:15
To: user@nutch.apache.org
Subject: Parsed segment has outlinks filtered

Hi,
I am a bit confused about the outlinks generated from a parsed URL.
If I use the utility:

bin/nutch parsechecker url

The output contains all the outlinks.

However, if I check the dump of the parsed segment generated by the Nutch 
crawl script, using the command:

bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch -nogenerate -noparse -noparsetext

and review the same entry's ParseData, I see far fewer outlinks.
Basically, all the outlinks that did not match the regexes defined in 
regex-urlfilter.txt have been filtered out.

So I want to know if there is a way to avoid this and make sure the outlinks 
stored in the Nutch segments contain all the URLs, not just the filtered 
ones.

Even if you can just point me to the code where this URL filtering of 
outlinks happens, I can figure out a way to circumvent it.

Thanks
Sachin



Parsed segment has outlinks filtered

2019-10-17 Thread Sachin Mittal
Hi,
I am a bit confused about the outlinks generated from a parsed URL.
If I use the utility:

bin/nutch parsechecker url

The output contains all the outlinks.

However, if I check the dump of the parsed segment generated by the Nutch
crawl script, using the command:

bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch -nogenerate -noparse -noparsetext

and review the same entry's ParseData, I see far fewer outlinks.
Basically, all the outlinks that did not match the regexes defined in
regex-urlfilter.txt have been filtered out.

So I want to know if there is a way to avoid this and make sure the outlinks
stored in the Nutch segments contain all the URLs, not just the filtered
ones.

Even if you can just point me to the code where this URL filtering of
outlinks happens, I can figure out a way to circumvent it.

Thanks
Sachin