On 15.09.2011 22:25, Elisabeth Adler wrote:
Hi,

I am having issues crawling an intranet site with an (imho) odd redirect mechanism. One part of the intranet website requires authentication, which Nutch can bypass by sending a special http.agent.name. This works fine.

The issue I am facing is that, after successful authentication, the server sends a redirect (302) to the same URL. Nutch does not follow the redirect. My guess is that Nutch skips the URL because it has been visited before...

Any pointers on how to overcome this and index the site after the redirect has happened are very welcome. My configuration is below.
Thanks a lot,
Elisabeth


I am using nutch-1.3 with
http.agent.name = my-nutch-1.3
generate.max.per.host = -1
fetcher.threads.per.host = 5
fetcher.threads.fetch = 5
fetcher.server.delay = 1
http.redirect.max = 10
plugin.includes = protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)



These threads and the JIRA issue might give some explanation:

http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html
https://issues.apache.org/jira/browse/NUTCH-1044
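
Before digging into Nutch itself, it may help to confirm what the server actually sends. The following is only a rough sketch, not anything taken from Nutch: the function names probe and is_self_redirect are made up for illustration, and the User-Agent value is an assumption (Nutch builds the real header from several http.agent.* properties, so the string on the wire may differ from http.agent.name alone). It fetches a URL without following redirects and reports whether a redirect points back at the same URL.

```python
import http.client
from urllib.parse import urldefrag, urljoin, urlsplit

# Status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_self_redirect(request_url, status, location):
    """Return True if the response redirects back to the requested URL."""
    if status not in REDIRECT_CODES or not location:
        return False
    # The Location header may be relative, so resolve it against the request URL.
    target = urljoin(request_url, location)
    # Compare without fragments, since they are never sent to the server.
    return urldefrag(target).url == urldefrag(request_url).url


def probe(url, agent):
    """Send one GET with the given User-Agent and do not follow redirects.

    Returns (status, Location header or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": agent})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical URL; replace it with the protected intranet page.
    url = "http://intranet.example/protected/"
    status, location = probe(url, "my-nutch-1.3")
    print(status, location, is_self_redirect(url, status, location))
```

If this reports a 302 whose target is the URL that was requested, that would be consistent with the guess in the original message that the redirect target is being treated as already seen.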
