Hi Alex,

Can you please have a look at NUTCH-1042?

Could it be that your redirect has a crawl-delay, which then hits the
boundary case we see in the issue above?

You may want to change your log properties to debug for a while and run
some small crawls on your problem URLs. You could also try adding some
LOG.debug statements to see which conditions are being satisfied around
the fetcher areas mentioned in NUTCH-1042.
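
Something along these lines, purely as a sketch of the kind of debug statement I mean. It is not Nutch source: the class RedirectDebugSketch and the helper shouldFollowNow are hypothetical, and it uses java.util.logging only so that it runs standalone (Nutch classes have their own LOG objects). The rule it logs is a simplified assumption based on the http.redirect.max description in nutch-default.xml, where a value of 0 or less means redirects are recorded for later fetching instead of being followed immediately.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class RedirectDebugSketch {
    private static final Logger LOG =
        Logger.getLogger(RedirectDebugSketch.class.getName());

    // Hypothetical helper: a simplified model of the "follow now or defer"
    // decision. The redirectCount check is an assumption for illustration,
    // not a copy of the real fetcher logic.
    static boolean shouldFollowNow(int maxRedirect, int redirectCount) {
        boolean follow = maxRedirect > 0 && redirectCount < maxRedirect;
        // Guard the message so the string is only built when debug is on.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("redirect check: http.redirect.max=" + maxRedirect
                + " redirectCount=" + redirectCount
                + " followNow=" + follow);
        }
        return follow;
    }

    public static void main(String[] args) {
        // java.util.logging prints INFO and above by default, so attach a
        // handler that lets FINE (debug-level) messages through.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        LOG.addHandler(handler);
        LOG.setLevel(Level.FINE);
        LOG.setUseParentHandlers(false);

        // The three values tried in the thread: -1, 1 and 2.
        for (int max : new int[] {-1, 1, 2}) {
            shouldFollowNow(max, 0);
        }
    }
}
```

Logging the inputs together with the outcome on one line makes it easier to see, per URL, which branch was taken when you grep the log after a small crawl.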

hth

On Thu, Mar 1, 2012 at 8:09 PM, <[email protected]> wrote:

>
>  Hello,
>
> I tried 1, 2, and -1 for the config http.redirect.max, but Nutch still
> postpones redirected URLs to later depths.
> What is the correct config setting to have Nutch crawl redirected URLs
> immediately? I need it because my crawl depth is restricted to at most 2.
>
> Thanks.
> Alex.
>
>
>
>
>
> -----Original Message-----
> From: xuyuanme <[email protected]>
> To: user <[email protected]>
> Sent: Fri, Feb 24, 2012 1:31 am
> Subject: Re: http.redirect.max
>
>
> The config file is used for some proof-of-concept testing, so the content
> might be confusing; please ignore any incorrect parts.
>
> Yes, from my end I can see the crawl for website http://www.scotland.gov.uk
> is redirected as expected.
>
> However the website I tried to crawl is a bit more tricky.
>
> Here's what I want to do:
>
> 1. Set
>
> http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
> as the seed page
>
> 2. And try to crawl one of the links
> (
> http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT
> )
> as a test
>
> If you click the link, you'll find the website uses redirects and cookies to
> control page navigation. So I used the protocol-httpclient plugin instead of
> protocol-http to handle the cookies.
>
> However, the redirect does not happen as expected. The only way I can fetch
> the second link is to manually change the "response = getResponse(u, datum,
> *false*)" call to "response = getResponse(u, datum, *true*)" in the
> org.apache.nutch.protocol.http.api.HttpBase.java file and recompile the
> lib-http plugin.
>
> So my issue is related to this specific site
>
> http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
>
>
> lewis john mcgibbney wrote
> >
> > I've checked working with redirects and everything seems to work fine for
> > me.
> >
> > The site I checked on
> >
> > http://www.scotland.gov.uk
> >
> > temp redirect to
> >
> > http://home.scotland.gov.uk/home
> >
> > Nutch gets this fine when I do some tweaking with nutch-site.xml
> >
> > redirects property -1 (just to demonstrate, I would usually not set it
> > so)
> >
> > Lewis
> >
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/http-redirect-max-tp3513652p3772115.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>
>


-- 
*Lewis*
