Hi Kartik,

I'm not 100% sure I understand your question, but have a look at the HttpResponse class I mentioned: you can hack it however you like and fetch the same URL (or other URLs) again from in there.
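To make the idea concrete, here is a rough sketch of the "follow the redirect, keep the cookies, come back to the original URL" logic. It is not Nutch code: Response, Transport and fetch are hypothetical names made up for illustration, and the transport is faked so that the example runs without a network. In a real setup the same loop would sit somewhere inside a customized HttpResponse in protocol-httpclient.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: none of these types are Nutch classes. */
class SessionCookieRefetch {

    /** Hypothetical, simplified view of an HTTP response. */
    static final class Response {
        final int status;
        final String location;                // redirect target, or null
        final Map<String, String> setCookies; // cookies the server asked us to keep
        final String body;

        Response(int status, String location, Map<String, String> setCookies, String body) {
            this.status = status;
            this.location = location;
            this.setCookies = setCookies;
            this.body = body;
        }
    }

    /** Hypothetical transport: performs one GET, sending the given cookies. */
    interface Transport {
        Response get(String url, Map<String, String> cookies);
    }

    /**
     * Fetches url. On a redirect, opens the redirect target right away to
     * collect its cookies, then requests the original url again with them.
     * Gives up after maxRetries re-fetches of the original url.
     */
    static Response fetch(Transport transport, String url, int maxRetries) {
        Map<String, String> jar = new HashMap<>();
        Response response = transport.get(url, jar);
        jar.putAll(response.setCookies);

        for (int retry = 0; retry < maxRetries && isRedirect(response); retry++) {
            if (response.location != null) {
                // Open the second URL immediately instead of queuing it for later.
                Response hop = transport.get(response.location, jar);
                jar.putAll(hop.setCookies);
            }
            // Come back to the original URL, now carrying the session cookies.
            response = transport.get(url, jar);
            jar.putAll(response.setCookies);
        }
        return response;
    }

    static boolean isRedirect(Response r) {
        return r.status >= 300 && r.status < 400;
    }

    public static void main(String[] args) {
        // Fake server: /page redirects to /login until a "session" cookie is sent.
        Transport fake = (url, cookies) -> {
            if (url.equals("/login")) {
                Map<String, String> granted = new HashMap<>();
                granted.put("session", "abc");
                return new Response(200, null, granted, "logged in");
            }
            if (cookies.containsKey("session")) {
                return new Response(200, null, new HashMap<>(), "protected content");
            }
            return new Response(301, "/login", new HashMap<>(), "");
        };
        Response r = fetch(fake, "/page", 2);
        System.out.println(r.status + " " + r.body); // prints "200 protected content"
    }
}
```

The retry limit is there so that a site which always redirects cannot trap the fetcher in a loop. A per-site "handler" could be as simple as choosing a different fetch routine based on the host of the URL.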
Remi

On Wed, Dec 3, 2014 at 4:11 PM, Krishnanand, Kartik <
kartik.krishnan...@bankofamerica.com> wrote:

> Hi, Remi
>
> How do you force the crawler to crawl the same URL? If I were to check for
> certain cookie values, and they match, I would like to be able to crawl the
> same URL again.
>
> Kartik
>
> -----Original Message-----
> From: remi tassing [mailto:tassingr...@gmail.com]
> Sent: Tuesday, December 02, 2014 5:24 PM
> To: user@nutch.apache.org
> Subject: Re: Unable to crawl a URL unless session cookies are set
>
> Hi Kartik,
>
> I had a similar enquiry a long time ago and from what I remember, Nutch
> will save the new URL and crawl it in the future...which is not the needed
> behavior here.
>
> To solve this problem, I've customized my protocol-httpclient (HttpResponse
> class) to just open the 2nd URL right after the first one.
>
> Crawling internal websites generally needs a lot of customization
> (authentication with post request, javascript redirection, NTLM
> authentication ...). And my general choice was to create "handlers" that
> are called in HttpResponse depending on the site to be crawled. Maybe
> plugins could be used but I thought it was a little bit overkill for the
> job.
>
> I hope that helps!
>
> Remi
>
> On Tue, Dec 2, 2014 at 4:51 PM, Krishnanand, Kartik <
> kartik.krishnan...@bankofamerica.com> wrote:
>
> > Hi,
> >
> > I am crawling an internal site where the URL that I want to crawl. I
> > hope that someone can help
> >
> > When I load this URL in the browser, it does a 301 redirect to another
> > URL that sets up cookies that will expire until end of session. When
> > I load the URL again in the browser, I am now able to load the URL.
> >
> > I don't know how to simulate this in my crawler setting. I am aware of
> > "http.redirect.max" configuration in our nutch configuration XMLs.
> > But if I understand this correctly, the crawler will follow the
> > redirect and not come back to original URL. Is my understanding correct?
> >
> > How would I be able to crawl this URL?
> >
> > Thanks,
> >
> > Kartik
> >
> > ----------------------------------------------------------------------
> > This message, and any attachments, is for the intended recipient(s)
> > only, may contain information that is privileged, confidential and/or
> > proprietary and subject to important terms and conditions available at
> > http://www.bankofamerica.com/emaildisclaimer. If you are not the
> > intended recipient, please delete this message.