Hi Sachin,

Practically every Nutch tool (inject, generate, fetch, parse, update, index)
can filter (and normalize) URLs. Because filtering and normalizing are expensive,
only the steps which add new URLs (inject and parse) do this by default (see
bin/crawl).

For your use case you might instead filter during the generation step:
* remove the -noFilter option of the generate command
* add -noFilter to the parse step or, alternatively, set parse.filter.urls to
  false, as Yossi mentioned
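
As a rough sketch of those two changes (not the exact lines from bin/crawl): the
crawl directory and the segment name below are placeholders I made up, and the
script only prints the two commands, so it can be checked without a Nutch
installation.

```shell
#!/bin/sh
# Dry-run sketch: builds and prints the commands, does not run Nutch.
# CRAWL_DIR and SEGMENT are hypothetical placeholders, adjust to your setup.
CRAWL_DIR="crawl"
SEGMENT="$CRAWL_DIR/segments/SEGMENT_NAME"

# Generate step: -noFilter is left out, so the URL filters are applied
# when the fetch list is built.
GENERATE_CMD="bin/nutch generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments"

# Parse step: -noFilter is added, so the outlinks are not filtered
# during parsing.
PARSE_CMD="bin/nutch parse $SEGMENT -noFilter"

echo "$GENERATE_CMD"
echo "$PARSE_CMD"
```

Any other options you already pass to generate (e.g. -topN) stay as they are;
the only point is where -noFilter appears.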

For historical reasons (this is difficult to change while ensuring backwards
compatibility), some commands have a -filter argument while others have -noFilter.
In addition, there are often configuration properties that achieve the same.
But command-line arguments always take precedence.

Best,
Sebastian

On 17.10.19 20:23, Sachin Mittal wrote:
> Hi,
> 
> Thanks, I figured this out. Let's hope it works!
> 
> urlfilter-regex is required to filter out the urls for next crawl, however
> I still want to index all the outlinks for my current url.
> The reason is that I may not want nutch to crawl these outlinks in next
> round, but I may still want some other crawler to scrape these urls.
> 
> Sachin
> 
> 
> On Thu, Oct 17, 2019 at 10:01 PM <yossi.tam...@pipl.com> wrote:
> 
>> Hi Sachin,
>>
>> I'm not sure what you are trying to achieve: If you don't want to filter
>> the outlinks, why do you enable urlfilter-regex?
>> Anyway, if you set the property parse.filter.urls to false, the Parser
>> will not filter outlinks at all.
>>
>>         Yossi.
>>
>> -----Original Message-----
>> From: Sachin Mittal <sjmit...@gmail.com>
>> Sent: Thursday, 17 October 2019 19:15
>> To: user@nutch.apache.org
>> Subject: Parsed segment has outlinks filtered
>>
>> Hi,
>> I was a bit confused about the outlinks generated from a parsed URL.
>> If I use the utility:
>>
>> bin/nutch parsechecker url
>>
>> The output contains all the outlinks.
>>
>> However, if I check the dump of the parsed segment generated by the Nutch crawl
>> script, using the command:
>>
>> bin/nutch readseg -dump /segments/<>/ /outputdir -nocontent -nofetch -nogenerate -noparse -noparsetext
>>
>> and review the same entry's ParseData, I see it has a lot fewer outlinks.
>> Basically it has filtered out all the outlinks which did not match the
>> regexes defined in regex-urlfilter.txt.
>>
>> So I want to know if there is a way to avoid this and make sure the
>> generated outlinks in the Nutch segments contain all the URLs and not just
>> the filtered ones.
>>
>> Even if you can point to the code where this url filtering happens for
>> outlinks I can figure out a way to circumvent this.
>>
>> Thanks
>> Sachin
>>
>>
> 
