Hello,
I'm currently trying to crawl a site which uses hashtags in its URLs. I don't get any results, and I'm hoping I'm just overlooking something.

I created a JIRA bug report because I was not aware of the existence of this mailing list. It's my first time using such channels, so I hope I am sending this message correctly.

Link: https://issues.apache.org/jira/browse/NUTCH-1343

The structure of the site that I'm trying to index is as follows:

http://domain.com (landing page)
http://domain.com/#/page1
http://domain.com/#/page1/subpage1
http://domain.com/#/page2
http://domain.com/#/page2/subpage1

and so on.

I've pointed Nutch to http://domain.com as the start URL, and in my filter I've tried all kinds of rules. At first I thought this would be sufficient:

+http\://domain\.com\/#

But then I realised that # is used for comments, so I escaped it:

+http\://domain\.com\/\#

Still no results. So I thought I could use the asterisk for it:

+http\://domain\.com\/*

Still no luck. I then tried various other regular expressions, but without success.

I noticed the following message in hadoop.log:

INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off

I've researched this setting, but I don't know for sure whether it affects my problem. The property is set to false in my configs. I don't know if it is even related to the situation above, but maybe it helps.

Any help is very much appreciated! I've tried googling the problem, but I couldn't find documentation or anyone else with this problem.

Many thanks in advance.

With kind regards,
Roberto Gardenier
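
P.S. To look at the patterns in isolation, here is a minimal Java sketch. It rests on assumptions I have not verified against the Nutch source: that the filter compiles each rule with java.util.regex, that a rule applies when the pattern is found anywhere in the URL, and that the first applicable rule decides. The class name UrlRuleSketch and the method accepts are made up for illustration and are not part of Nutch. The escaped variant uses \# for the hash.

```java
import java.util.List;
import java.util.regex.Pattern;

public class UrlRuleSketch {

    // Hypothetical helper, not a Nutch API.
    // Walks the rules in order; the first rule whose regex is found in the URL
    // decides: '+' accepts, anything else rejects. No applicable rule means reject.
    static boolean accepts(List<String> rules, String url) {
        for (String rule : rules) {
            boolean include = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The three rules from the message, written as Java string literals.
        List<String> rules = List.of(
            "+http\\://domain\\.com\\/#",
            "+http\\://domain\\.com\\/\\#",
            "+http\\://domain\\.com\\/*");

        List<String> urls = List.of(
            "http://domain.com/#/page1",
            "http://domain.com/#/page1/subpage1",
            "http://domain.com/#/page2");

        // Try every rule on its own against every sample URL.
        for (String rule : rules) {
            for (String url : urls) {
                System.out.println(rule + "  " + url + "  accepted="
                    + accepts(List.of(rule), url));
            }
        }
    }
}
```

If those assumptions hold, each of the three patterns matches the hashtag URLs when tested as a plain Java regex, so the patterns themselves may not be what keeps the pages out of the crawl.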

