The config file is used for some proof-of-concept testing, so its content might be confusing; please ignore any incorrect parts.
Yes, from my end I can see that the crawl for http://www.scotland.gov.uk is redirected as expected. However, the website I am trying to crawl is a bit trickier. Here's what I want to do:

1. Set http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B as the seed page.
2. Try to crawl one of its links (http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT) as a test.

If you click the link, you'll find that the website uses redirects and cookies to control page navigation, so I used the protocol-httpclient plugin instead of protocol-http to handle the cookies. However, the redirect does not happen as expected. The only way I can fetch the second link is to manually change the call "response = getResponse(u, datum, *false*)" to "response = getResponse(u, datum, *true*)" in org.apache.nutch.protocol.http.api.HttpBase.java and recompile the lib-http plugin.

So my issue is specific to this site: http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B

lewis john mcgibbney wrote
>
> I've checked working with redirects and everything seems to work fine for
> me.
>
> The site I checked on
>
> http://www.scotland.gov.uk
>
> temp redirect to
>
> http://home.scotland.gov.uk/home
>
> Nutch gets this fine when I do some tweaking with nutch-site.xml
>
> redirects property -1 (just to demonstrate, I would usually not set it so)
>
> Lewis
>

--
View this message in context: http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3772115.html
Sent from the Nutch - User mailing list archive at Nabble.com.
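
For readers following the thread, the two settings discussed above would sit in nutch-site.xml roughly as sketched below. This is an illustrative fragment, not the poster's actual config. It assumes that the "redirects property" Lewis mentions is http.redirect.max (the name in the thread URL), and that the fetcher follows up to that many redirects immediately when the value is positive, while a zero or negative value makes it record redirect targets for a later fetch round. The value 3 and every entry in plugin.includes other than protocol-httpclient are placeholders.

```xml
<?xml version="1.0"?>
<!-- Sketch only: values are assumptions chosen for illustration. -->
<configuration>

  <!-- Assumed semantics: a positive value lets the fetcher follow up to
       that many redirects immediately; 0 or a negative value makes it
       record the redirect target for a later fetch round instead. -->
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>

  <!-- Swap protocol-http for protocol-httpclient so that cookies are
       handled. The rest of this plugin list is a placeholder and should
       match whatever the installation already uses. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

</configuration>
```

Whether these settings alone are enough for the FDA site is exactly what the thread leaves open, since the poster reports needing the getResponse(u, datum, *true*) change in HttpBase.java as well.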

