Hi Kartik,

I had a similar enquiry a long time ago and, from what I remember, Nutch
will save the new URL and crawl it at some point in the future, which is
not the behavior needed here.

To solve this problem, I customized protocol-httpclient (the HttpResponse
class) to open the second URL right after the first one.
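To illustrate the idea (the class and method names below are mine, not Nutch's, and this is a standalone sketch with java.net.HttpURLConnection rather than an actual HttpResponse customization): read the 301's Location header yourself, hit the redirect target once so it sets the session cookies, then re-request the original URL with those cookies attached.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

// Hypothetical sketch of "open the second URL right after the first one".
// None of these names are real Nutch APIs.
public class RedirectCookieHelper {

    // Collapse a list of Set-Cookie response headers into a single Cookie
    // request header value (attributes such as Path/HttpOnly are dropped).
    public static String joinSetCookies(List<String> setCookieHeaders) {
        StringBuilder sb = new StringBuilder();
        for (String h : setCookieHeaders) {
            String pair = h.split(";", 2)[0].trim(); // keep only "name=value"
            if (sb.length() > 0) sb.append("; ");
            sb.append(pair);
        }
        return sb.toString();
    }

    // Three steps: see the 301 ourselves, let the redirect target set the
    // session cookies, then fetch the original URL again with the cookies.
    public static int fetchWithSessionCookies(String originalUrl) throws Exception {
        // 1. First request: do not follow the redirect automatically.
        HttpURLConnection first = (HttpURLConnection) new URL(originalUrl).openConnection();
        first.setInstanceFollowRedirects(false);
        String location = first.getHeaderField("Location");
        first.disconnect();

        // 2. Second request: the redirect target sets the session cookies.
        HttpURLConnection second = (HttpURLConnection) new URL(location).openConnection();
        second.getResponseCode();
        List<String> setCookies =
                second.getHeaderFields().getOrDefault("Set-Cookie", List.of());
        second.disconnect();

        // 3. Third request: the original URL again, now carrying the cookies.
        HttpURLConnection again = (HttpURLConnection) new URL(originalUrl).openConnection();
        again.setRequestProperty("Cookie", joinSetCookies(setCookies));
        return again.getResponseCode();
    }
}
```

In a real customization this logic would live inside the fetch path of HttpResponse, but the request/cookie flow is the same.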

Crawling internal websites generally needs a lot of customization
(authentication with POST requests, JavaScript redirection, NTLM
authentication, and so on). My general approach was to create "handlers"
that are called in HttpResponse depending on the site being crawled.
Plugins could perhaps be used instead, but I thought that was a little
overkill for the job.
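A minimal sketch of that handler approach (again, all names here are hypothetical, not Nutch APIs): keep a map from host name to a per-site handler and fall back to the normal fetch when no handler is registered.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-site handler registry, as one might call from inside a
// customized HttpResponse. Names are illustrative only.
public class SiteHandlers {

    public interface SiteHandler {
        // A handler might do NTLM auth, follow a JS redirect, replay a
        // POST login, etc. Here it just returns a marker string.
        String handle(String url);
    }

    private final Map<String, SiteHandler> byHost = new HashMap<>();
    private final SiteHandler fallback = url -> "default-fetch:" + url;

    public void register(String host, SiteHandler handler) {
        byHost.put(host, handler);
    }

    // Crude host extraction, to avoid pulling in java.net.URL here.
    static String hostOf(String url) {
        String s = url.replaceFirst("^[a-zA-Z]+://", "");
        int slash = s.indexOf('/');
        return slash < 0 ? s : s.substring(0, slash);
    }

    // Pick the handler registered for the URL's host, or the plain fetch.
    public String dispatch(String url) {
        return byHost.getOrDefault(hostOf(url), fallback).handle(url);
    }
}
```

Usage: register a handler such as the redirect/cookie workaround for the problematic intranet host, and every other site goes through the default path untouched.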

I hope that helps!

Remi

On Tue, Dec 2, 2014 at 4:51 PM, Krishnanand, Kartik <
[email protected]> wrote:

> Hi,
>
> I am crawling an internal site and am having trouble with one particular
> URL that I want to crawl. I hope that someone can help.
>
> When I load this URL in the browser, it does a 301 redirect to another URL
> that sets cookies lasting until the end of the session. When I load the
> URL again in the browser, the page loads successfully.
>
> I don't know how to simulate this in my crawler setup. I am aware of the
> "http.redirect.max" setting in our Nutch configuration XMLs. But if I
> understand it correctly, the crawler will follow the redirect and not come
> back to the original URL. Is my understanding correct?
>
> How would I be able to crawl this URL?
>
> Thanks,
>
> Kartik
>
>
