Re: http.redirect.max

xuyuanme Fri, 24 Feb 2012 01:31:11 -0800

The config file was used for some proof-of-concept testing, so its content
might be confusing; please ignore any incorrect parts.


Yes, from my end I can see that the crawl of http://www.scotland.gov.uk
is redirected as expected.

However, the website I am trying to crawl is a bit trickier.

Here's what I want to do:

1. Set
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
as the seed page

2. Try to crawl one of its links
(http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT)
as a test

If you click the link, you'll find that the website uses redirects and cookies
to control page navigation. So I used the protocol-httpclient plugin instead of
protocol-http to handle the cookies.
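
For reference, the plugin switch looks roughly like this in nutch-site.xml.
This is only a sketch: the rest of the plugin.includes value is assumed to
match the stock nutch-default.xml list and may differ in your version; the
only change that matters here is protocol-httpclient in place of
protocol-http.

```xml
<!-- nutch-site.xml: swap protocol-http for protocol-httpclient.
     The remaining plugins in the list are an assumption based on the
     stock defaults; keep whatever your installation already uses. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```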

However, the redirect does not happen as expected. The only way I can fetch
the second link is to manually change the "response = getResponse(u, datum,
*false*)" call to "response = getResponse(u, datum, *true*)" in
org.apache.nutch.protocol.http.api.HttpBase.java and recompile the
lib-http plugin.
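
For comparison, the fetcher-level setting this thread is about would be
overridden in nutch-site.xml roughly as below. This is a sketch, and the
value 3 is purely illustrative. As far as I understand the description in
nutch-default.xml, a value of 0 or a negative number makes the fetcher record
redirect targets for a later fetch instead of following them immediately.

```xml
<!-- nutch-site.xml: maximum number of redirects the fetcher follows
     immediately for one page. The value 3 is an illustrative assumption. -->
<property>
  <name>http.redirect.max</name>
  <value>3</value>
</property>
```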

So my issue is specific to this site:
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B


lewis john mcgibbney wrote
> 
> I've checked working with redirects and everything seems to work fine for
> me.
> 
> The site I checked on
> 
> http://www.scotland.gov.uk
> 
> temp redirect to
> 
> http://home.scotland.gov.uk/home
> 
> Nutch gets this fine when I do some tweaking with nutch-site.xml
> 
> redirects property -1 (just to demonstrate, I would usually not set it so)
> 
> Lewis
> 

--
View this message in context: 
http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3772115.html
Sent from the Nutch - User mailing list archive at Nabble.com.
