On 15.09.2011 22:25, Elisabeth Adler wrote:
Hi,
I am having issues crawling an intranet site with an (imho) odd
redirect mechanism. One part of the intranet website requires
authentication, which Nutch can bypass by sending a special
http.agent.name. This works fine.
The issue I am facing is that, after successful authentication, the
server sends a redirect (302) to the same URL. Nutch is not following the
redirect. My guess is that Nutch skips the URL because it has already
been fetched...
Any pointers on how to overcome this and index the site after the
redirect happened are very welcome. My configuration is below.
Thanks a lot,
Elisabeth
I am using nutch-1.3 with
http.agent.name = my-nutch-1.3
generate.max.per.host = -1
fetcher.threads.per.host = 5
fetcher.threads.fetch = 5
fetcher.server.delay = 1
http.redirect.max = 10
plugin.includes = protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
These could give some explanation:
http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html
https://issues.apache.org/jira/browse/NUTCH-1044
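
Before changing anything in Nutch, it may help to confirm the guess from the original question outside the crawler: that the 302 really points back at the URL that was just requested. Below is a minimal Python sketch of such a check. It is not part of Nutch; the function names `is_self_redirect` and `probe` and the host `intranet.example` are made up for illustration, and the User-Agent string Nutch actually sends may contain more than the bare http.agent.name value, so it is worth comparing against the server logs.

```python
import http.client
from urllib.parse import urljoin, urldefrag, urlsplit


def is_self_redirect(request_url, status, location):
    """Return True if a redirect response points back at the requested URL."""
    if status not in (301, 302, 303, 307, 308) or not location:
        return False
    # The Location header may be relative; resolve it against the request URL.
    target = urljoin(request_url, location)
    # Fragments are never sent to the server, so ignore them when comparing.
    return urldefrag(target).url == urldefrag(request_url).url


def probe(url, agent):
    """Fetch url once WITHOUT following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # Assumption: the server keys its authentication bypass on User-Agent.
        conn.request("GET", path, headers={"User-Agent": agent})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


if __name__ == "__main__":
    # Hypothetical URL; replace with a protected page of the intranet site.
    url = "http://intranet.example/protected/page.html"
    status, location = probe(url, "my-nutch-1.3")
    print(status, location, is_self_redirect(url, status, location))
```

If `is_self_redirect` reports True for the protected pages, the server is indeed redirecting to the URL that was just requested, which is the situation described in the question and discussed in the threads linked above.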