Re: Crawl sites with hashtags in url

Markus Jelsma Tue, 01 May 2012 04:45:03 -0700

Hi,

URLs are passed through a series of normalizers. By default both the 
RegexNormalizer and the BasicNormalizer affect URLs with anchors (the part 
after the #); the latter removes the anchor completely and is not configurable.
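
To illustrate the effect, here is a minimal Python sketch. It is not Nutch 
code (Nutch is written in Java); it only mimics a normalizer that drops the 
anchor. The helper name strip_fragment is made up for this example, and the 
URLs are the ones from the question quoted below, with a trailing slash added 
to the landing page so the comparison is exact.

```python
from urllib.parse import urldefrag

# URLs from the quoted question (landing page written with a trailing slash).
urls = [
    "http://domain.com/",
    "http://domain.com/#/page1",
    "http://domain.com/#/page1/subpage1",
    "http://domain.com/#/page2",
    "http://domain.com/#/page2/subpage1",
]


def strip_fragment(url: str) -> str:
    """Hypothetical stand-in for a normalizer that removes the anchor."""
    return urldefrag(url).url


# After normalization every page collapses into the landing page,
# so a crawler sees a single URL and has nothing new to fetch.
normalized = {strip_fragment(u) for u in urls}
print(normalized)
```

With the anchor gone, all five URLs become identical, which is consistent 
with the crawl returning nothing beyond the start page.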


You can either hack your way around it by simply disabling the removal of the 
page reference in the code, or make the removal configurable. In the latter 
case you're welcome to attach a patch to a new issue in Jira.
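
As a rough idea of what "configurable" could mean, here is a small Python 
sketch. It is only an illustration under assumptions: the function normalize 
and its keep_fragment flag are hypothetical and do not correspond to any 
existing Nutch class or property. A real patch would be written in Java and 
would read the setting from the Nutch configuration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str, keep_fragment: bool = False) -> str:
    """Hypothetical normalizer: lower-case the host, default the path to "/",
    and drop the anchor unless keep_fragment is set."""
    parts = urlsplit(url)
    fragment = parts.fragment if keep_fragment else ""
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, fragment)
    )


# Default behaviour: the anchor is removed.
print(normalize("http://domain.com/#/page1"))
# With the flag set, the anchor survives normalization.
print(normalize("http://domain.com/#/page1", keep_fragment=True))
```

Defaulting the flag to off keeps the current behaviour for everyone else, 
which is usually what makes such a patch easy to accept.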

Cheers,


On Tuesday 01 May 2012 13:25:25 Roberto Gardenier wrote:
> Hello,
> 
> 
> 
> I'm currently trying to crawl a site which uses hashtags in its URLs. I don't
> seem to get any results and I'm hoping I'm just overlooking something.
> 
> I have created a JIRA bug report because I was not aware of the existence
> of this mailing list. It's my first time using such channels, so I hope I am
> sending this message correctly.
> 
> Link: https://issues.apache.org/jira/browse/NUTCH-1343
> 
> 
> 
> The site structure that I'm trying to index is as follows:
> 
> http://domain.com (landingpage)
> 
> http://domain.com/#/page1
> 
> http://domain.com/#/page1/subpage1
> 
> http://domain.com/#/page2
> 
> http://domain.com/#/page2/subpage1
> 
> and so on.
> 
> 
> 
> I've pointed Nutch to http://domain.com as the start URL and in my filter
> I've placed all kinds of rules.
> 
> First I thought this would be sufficient:
> 
> +http\://domain\.com\/#
> 
> But then I realised that # is used for comments, so I escaped it:
> 
> +http\://domain\.com\/\#
> 
> 
> 
> Still no results. So I thought I could use the asterisk for it:
> 
> +http\://domain\.com\/*
> 
> Still no luck. So I started trying various regexes, but without success.
> 
> 
> 
> I noticed the following messages in hadoop.log:
> 
> INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 
> I've researched this setting but I don't know for sure whether it affects my
> problem. This property is set to false in my configs.
> 
> 
> 
> I don't know if this is even related to the situation above, but maybe it
> helps.
> 
> 
> 
> Any help is very much appreciated! I've tried googling the problem but I
> couldn't find documentation or anyone else with this problem.
> 
> 
> 
> Many thanks in advance.
> 
> 
> 
> With kind regards,
> 
> Roberto Gardenier

-- 
Markus Jelsma - CTO - Openindex
