Re: Crawling and redirects to the same URL

Nutch User - 1 Sun, 18 Sep 2011 07:47:03 -0700

On 15.09.2011 22:25, Elisabeth Adler wrote:

Hi,
I am having issues crawling an intranet site with an (imho) odd redirect mechanism. One part of the intranet website requires authentication, which Nutch can bypass by sending a special http.agent.name. This works fine.
The issue I am facing is that, after successful authentication, the server sends a redirect (302) to the same URL. Nutch is not following the redirect. My guess is that Nutch omits the site because it has been visited before...
Any pointers on how to overcome this and index the site after the redirect has happened are very welcome. My configuration is below.
Thanks a lot,
Elisabeth


I am using nutch-1.3 with
http.agent.name = my-nutch-1.3
generate.max.per.host = -1
fetcher.threads.per.host = 5
fetcher.threads.fetch = 5
fetcher.server.delay = 1
http.redirect.max = 10
plugin.includes = protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
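
To make the situation described above concrete, here is a minimal Python sketch of how one might check whether a response is a redirect back to the URL that was just requested. It is not Nutch code and does not reflect how Nutch handles redirects internally. The function names `normalize` and `is_self_redirect` and the host `intranet.example` are hypothetical, and the normalization is a simplifying assumption.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def normalize(url: str) -> str:
    """Simplified normalization (an assumption for this sketch):
    lower-case the scheme and network location, drop the fragment,
    and treat an empty path as '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def is_self_redirect(request_url: str, status: int, location: str) -> bool:
    """Return True if the response redirects back to the requested URL."""
    if status not in REDIRECT_CODES:
        return False
    # The Location header may be relative, so resolve it against the request URL.
    target = urljoin(request_url, location)
    return normalize(target) == normalize(request_url)


if __name__ == "__main__":
    url = "http://intranet.example/docs/page"
    print(is_self_redirect(url, 302, "/docs/page"))  # redirect to the same URL
    print(is_self_redirect(url, 302, "/login"))      # redirect elsewhere
```

If a check like this reports a self-redirect for the status code and Location header the server returns, a crawler that skips URLs it has already seen would never refetch the page, which is consistent with the guess above.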


These could give some explanation:

http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html
https://issues.apache.org/jira/browse/NUTCH-1044
