Re: Unable to crawl a URL unless session cookies are set

remi tassing Thu, 04 Dec 2014 04:33:49 -0800

Hi Kartik,

I'm not 100% sure I understand your question, but have a look at the
HttpResponse class as I mentioned. You can hack it freely and crawl the same
URL (or other URLs) in there.
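
To make the idea concrete, here is a minimal sketch of the "visit the redirect
target, keep its cookies, then request the original URL again" logic. It is
not Nutch code: CookiePrimingFetcher, Response and Transport are hypothetical
names made up for illustration, and Transport stands in for whatever actually
performs the HTTP request. It also assumes the Location header is an absolute
URL.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; none of these types are Nutch classes. */
class CookiePrimingFetcher {

    /** Simplified response: status, Location header (may be null), cookies set, body. */
    static final class Response {
        final int status;
        final String location;
        final Map<String, String> setCookies;
        final String body;

        Response(int status, String location, Map<String, String> setCookies, String body) {
            this.status = status;
            this.location = location;
            this.setCookies = setCookies;
            this.body = body;
        }
    }

    /** Stand-in for the real HTTP client (assumption, not a Nutch interface). */
    interface Transport {
        Response get(String url, Map<String, String> cookies);
    }

    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /**
     * Fetches url. While the answer is a redirect, visits the redirect target
     * only to collect its cookies, then requests the original url again with
     * those cookies attached. Gives up after maxAttempts rounds.
     */
    static Response fetch(Transport transport, String url, int maxAttempts) {
        Map<String, String> cookies = new LinkedHashMap<>();
        Response response = transport.get(url, cookies);
        cookies.putAll(response.setCookies);

        for (int attempt = 0; attempt < maxAttempts
                && isRedirect(response.status) && response.location != null; attempt++) {
            // Visit the redirect target so the server can set its session cookies.
            Response target = transport.get(response.location, cookies);
            cookies.putAll(target.setCookies);

            // Request the original URL again, now carrying the cookies.
            response = transport.get(url, cookies);
            cookies.putAll(response.setCookies);
        }
        return response;
    }
}
```

In a customized HttpResponse, the Transport role would be played by the
existing request code, and the maxAttempts bound keeps a misbehaving site from
causing an endless loop.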


Remi

On Wed, Dec 3, 2014 at 4:11 PM, Krishnanand, Kartik <
kartik.krishnan...@bankofamerica.com> wrote:

> Hi, Remi
>
> How do you force the crawler to crawl the same URL? If I were to check for
> certain cookie values, and they match, I would like to be able to crawl the
> same URL again.
>
> Kartik
>
> -----Original Message-----
> From: remi tassing [mailto:tassingr...@gmail.com]
> Sent: Tuesday, December 02, 2014 5:24 PM
> To: user@nutch.apache.org
> Subject: Re: Unable to crawl a URL unless session cookies are set
>
> Hi Kartik,
>
> I had a similar enquiry a long time ago and from what I remember, Nutch
> will save the new URL and crawl it in the future...which is not the needed
> behavior here.
>
> To solve this problem, I've customized my protocol-httpclient (HttpResponse
> class) to just open the 2nd URL right after the first one.
>
> Crawling internal websites generally needs a lot of customization
> (authentication with post request, javascript redirection, NTLM
> authentication ...). And my general choice was to create "handlers" that
> are called in HttpResponse depending on the site to be crawled. Maybe
> plugins could be used but I thought it was a little bit overkill for the
> job.
>
> I hope that helps!
>
> Remi
>
> On Tue, Dec 2, 2014 at 4:51 PM, Krishnanand, Kartik <
> kartik.krishnan...@bankofamerica.com> wrote:
>
> > Hi,
> >
> > I am crawling an internal site and am having trouble with the URL that
> > I want to crawl. I hope that someone can help.
> >
> > When I load this URL in the browser, it does a 301 redirect to another
> > URL that sets cookies that expire at the end of the session. When I
> > load the original URL again in the browser, it now loads successfully.
> >
> > I don't know how to simulate this in my crawler settings. I am aware of
> > the "http.redirect.max" property in our Nutch configuration XMLs.
> > But if I understand this correctly, the crawler will follow the
> > redirect and not come back to the original URL. Is my understanding correct?
> >
> > How would I be able to crawl this URL?
> >
> > Thanks,
> >
> > Kartik
> >
> > ----------------------------------------------------------------------
> > This message, and any attachments, is for the intended recipient(s)
> > only, may contain information that is privileged, confidential and/or
> > proprietary and subject to important terms and conditions available at
> > http://www.bankofamerica.com/emaildisclaimer.   If you are not the
> > intended recipient, please delete this message.
> >
>
