RE: Crawl sites with hashtags in url

2012-05-07 Thread Roberto Gardenier
...@gmail.com] Sent: Wednesday, 2 May 2012 2:21 To: user@nutch.apache.org Subject: Re: Crawl sites with hashtags in url Hi Roberto, If you're having an invalid URI error, then this might help you: http://lucene.472066.n3.nabble.com/Invalid-uri-td3742047.html Remi On Tue, May 1, 2012 at 7

Re: Crawl sites with hashtags in url

2012-05-07 Thread Siddharth Jain
Hi, Has any of you worked on crawling https sites with a certificate? Please let me know. -- View this message in context: http://lucene.472066.n3.nabble.com/Re-Crawl-sites-with-hashtags-in-url-tp3954098p3968209.html Sent from the Nutch - User mailing list archive at Nabble.com.

Crawl sites with hashtags in url

2012-05-01 Thread Roberto Gardenier
Hello, I'm currently trying to crawl a site which uses hashtags in the URLs. I don't seem to get any results and I'm hoping I'm just overlooking something. I have created a JIRA bug report because I was not aware of the existence of this mailing list. It's my first time using such channels so I

Re: Crawl sites with hashtags in url

2012-05-01 Thread Markus Jelsma
Hi, URLs are passed through a series of normalizers. By default both the RegexNormalizer and the BasicNormalizer affect URLs with anchors; the latter removes the anchor completely and is not configurable. You can either hack your way through it by simply disabling the removal of the page reference
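
To illustrate the point about normalizers, here is a minimal Python sketch of a chain of URL normalizers in which one step drops the anchor. It is not Nutch code: the function names `lowercase_host`, `strip_fragment`, `normalize` and the `CHAIN` list are hypothetical stand-ins, and the example.com URLs are made up. It only shows the general effect that two URLs differing solely in the part after '#' become identical once such a step has run.

```python
from urllib.parse import urldefrag, urlsplit


def lowercase_host(url: str) -> str:
    """Hypothetical normalizer: lowercase the host part of the URL."""
    parts = urlsplit(url)
    return parts._replace(netloc=parts.netloc.lower()).geturl()


def strip_fragment(url: str) -> str:
    """Hypothetical normalizer: drop everything from '#' onwards."""
    return urldefrag(url).url


def normalize(url: str, normalizers) -> str:
    """Pass the URL through each normalizer in order."""
    for step in normalizers:
        url = step(url)
    return url


# Assumed order for this sketch only; it does not reflect Nutch's configuration.
CHAIN = [lowercase_host, strip_fragment]

if __name__ == "__main__":
    for url in ("http://Example.COM/app#/news", "http://example.com/app#/contact"):
        print(url, "->", normalize(url, CHAIN))
```

Both sample URLs normalize to the same string, so a crawler that deduplicates on the normalized form would treat them as a single page.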

RE: [jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url

2012-05-01 Thread Roberto Gardenier
: Markus Jelsma (JIRA) [mailto:j...@apache.org] Sent: Tuesday, 1 May 2012 13:40 To: r.garden...@simgroep.nl Subject: [jira] [Closed] (NUTCH-1343) Crawl sites with hashtags in url [ https://issues.apache.org/jira/browse/NUTCH-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all

Re: Crawl sites with hashtags in url

2012-05-01 Thread Sebastian Nagel
Hi Roberto, as defined in ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt the hash ('#') is used to separate the fragment from the rest of the URL. The RFC explicitly delegates the semantics of the fragment to the media type of the document. In good old HTML the fragment is just an anchor and
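
As a small illustration of the separation RFC 3986 describes, the following Python sketch splits a URL at the '#' using the standard library's `urllib.parse.urlsplit`. The helper name `split_fragment` and the example.com URLs are assumptions made for this example, not anything taken from the thread.

```python
from urllib.parse import urlsplit


def split_fragment(url: str) -> tuple[str, str]:
    """Return (URL without fragment, fragment) for the given URL."""
    parts = urlsplit(url)
    base = parts._replace(fragment="").geturl()
    return base, parts.fragment


if __name__ == "__main__":
    samples = (
        "http://example.com/docs/page.html#section-2",  # classic HTML anchor
        "http://example.com/#!/products/42",            # fragment used as application state
        "http://example.com/plain",                     # no fragment at all
    )
    for url in samples:
        print(split_fragment(url))
```

An HTTP client does not send the fragment to the server as part of the request, so only the first element of each pair is what actually gets fetched.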

Re: Crawl sites with hashtags in url

2012-05-01 Thread remi tassing
Hi Roberto, If you're having an invalid URI error, then this might help you: http://lucene.472066.n3.nabble.com/Invalid-uri-td3742047.html Remi On Tue, May 1, 2012 at 7:25 PM, Roberto Gardenier r.garden...@simgroep.nl wrote: Hello, I'm currently trying to crawl a site which uses