Hi Sebastian, 

I have looked at the RFC and I'm convinced that I don't need to take any
further action on this issue; this website is simply not following the rules.
Just like Twitter... but who cares.
It's not our problem anymore. Thank you so much for your reply!

Kind regards,
Roberto Gardenier 
    

-----Original Message-----
From: Sebastian Nagel [mailto:[email protected]] 
Sent: Tuesday, May 1, 2012 23:21
To: [email protected]
Subject: Re: Crawl sites with hashtags in url

Hi Roberto,

As defined in RFC 3986 (ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt), the
hash ('#') separates the "fragment" from the rest of the URL.
The RFC explicitly delegates the semantics of the fragment to the media
type of the document. In good old HTML the fragment is just an "anchor"
and should be removed - otherwise the same physical document would be
fetched multiple times via different URLs. That's the current behavior of
Nutch; see Markus' explanations.
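To illustrate the point (a minimal sketch in plain Python, not Nutch code, reusing the example URLs from the original report): once the fragment is stripped, all of the hash URLs collapse to the same URL, so a crawler would keep fetching the same physical document.

```python
from urllib.parse import urldefrag

# Fragment-style URLs from the report; after fragment removal they
# all point to the same physical document.
urls = [
    "http://domain.com/#/page1",
    "http://domain.com/#/page1/subpage1",
    "http://domain.com/#/page2",
]
for u in urls:
    print(urldefrag(u).url)  # prints "http://domain.com/" each time
```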

Nowadays (with AJAX), the situation is changing and anchors are used to
address not a different view but indeed different content. Have a look
at NUTCH-1323 and Markus' comment on NUTCH-1339, maybe this will help
you to solve the problem.
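For background, the AJAX-crawling convention of that era (Google's "hashbang" scheme) rewrote a `#!` URL into a query-string form that a server can actually answer. A rough sketch of that mapping in plain Python, assuming the site used `#!` rather than a bare `#` (`escaped_fragment_url` is a hypothetical helper name, not a Nutch API):

```python
from urllib.parse import quote

def escaped_fragment_url(url: str) -> str:
    """Rewrite a '#!' (hashbang) URL into its '_escaped_fragment_'
    query form, per Google's AJAX-crawling convention."""
    base, sep, frag = url.partition("#!")
    if not sep:
        return url  # no hashbang; leave the URL unchanged
    joiner = "&" if "?" in base else "?"
    return base + joiner + "_escaped_fragment_=" + quote(frag, safe="")

print(escaped_fragment_url("http://domain.com/#!/page1"))
# http://domain.com/?_escaped_fragment_=%2Fpage1
```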

Sebastian


On 05/01/2012 01:44 PM, Markus Jelsma wrote:
> Hi,
>
> URLs are passed through a series of normalizers. By default both the
> RegexNormalizer and the BasicNormalizer affect URLs with anchors; the
> latter removes the anchor completely and is not configurable.
>
> You can either hack your way around it by simply disabling the removal
> of the page reference, or make it configurable. In that case you're
> welcome to attach a patch to a new issue in Jira.
>
> Cheers,
>
>
> On Tuesday 01 May 2012 13:25:25 Roberto Gardenier wrote:
>> Hello,
>>
>>
>>
>> I'm currently trying to crawl a site which uses hashtags in the URLs. I
>> don't seem to get any results, and I'm hoping I'm just overlooking
>> something.
>>
>> I created a JIRA bug report because I was not aware of the existence of
>> this mailing list. It's my first time using such channels, so I hope I'm
>> sending this message correctly.
>>
>> Link: https://issues.apache.org/jira/browse/NUTCH-1343
>>
>>
>>
>> The site structure that I'm trying to index is as follows:
>>
>> http://domain.com (landingpage)
>>
>> http://domain.com/#/page1
>>
>> http://domain.com/#/page1/subpage1
>>
>> http://domain.com/#/page2
>>
>> http://domain.com/#/page2/subpage1
>>
>> and so on.
>>
>>
>>
>> I've pointed Nutch to http://domain.com as the start URL, and in my
>> filter I've placed all kinds of rules.
>>
>> First I thought this would be sufficient:
>>
>> +http\://domain\.com\/#
>>
>> But then I realised that # is used for comments, so I escaped it:
>>
>> +http\://domain\.com\/\#
>>
>>
>>
>> Still no results. So I thought I could use an asterisk:
>>
>> +http\://domain\.com\/*
>>
>> Still no luck... So I tried various regex constructs, but without success.
>>
>>
>>
>> I noticed the following messages in hadoop.log:
>>
>> INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>
>> I've researched this setting, but I don't know for sure whether it
>> affects my problem in any way. This property is set to false in my
>> configs.
>>
>>
>>
>> I don't know if this is even related to the situation above, but maybe
>> it helps.
>>
>>
>>
>> Any help is very much appreciated! I've tried googling the problem, but
>> I couldn't find any documentation or anyone else with this problem.
>>
>>
>>
>> Many thanks in advance.
>>
>>
>>
>> With kind regards,
>>
>> Roberto Gardenier
>

