Hi,

We are trying to crawl a site that requires cookies to be set. When
this is tried from a browser, the original URL is redirected to a
page containing a JavaScript snippet. The script simply resends the
URL with an additional parameter indicating that the referrer is the
script. A cookie seems to be set in the request header for this
page, and the server then sends back the actual HTML content.

When we crawl this page through Nutch, it stops at the JavaScript
page. When we gave both the original URL and the redirected URL as
seeds in the same fetcher instance, it still did not fetch the
content. We were hoping that the second seed URL would reuse the
cookie. We tried this with the protocol-httpclient plugin enabled as
well. Debugging org.apache.nutch.protocol.httpclient.Http in
standalone mode, we see that the second GET request does set the
cookie, but this didn't help.
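For reference, here is the behavior we were expecting from the second request: a client with a shared cookie store should attach the cookie received in the first response to any follow-up request to the same host. This is a minimal standalone sketch using only the JDK's java.net.CookieManager (the class name, URL, and cookie value below are illustrative, not Nutch's API):

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

public class CookieCarryover {
    // Returns the Cookie header values the client would attach to a
    // second GET of the same URL, after the first response set a cookie.
    public static List<String> secondRequestCookies() throws Exception {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        URI site = new URI("http://example.com/page"); // placeholder URL

        // Simulate the Set-Cookie header from the first (JavaScript) response.
        manager.put(site, Map.of("Set-Cookie", List.of("session=abc123")));

        // Cookies the client would send on the follow-up request to the same URI.
        Map<String, List<String>> headers = manager.get(site, Map.of());
        return headers.get("Cookie");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(secondRequestCookies());
    }
}
```

If the fetcher creates a fresh HTTP client (and thus a fresh cookie store) per request, the carry-over above never happens, which would explain what we are seeing.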

What is the recommended configuration to handle such crawl
requirements? Can anyone suggest how to debug this further?

Thanks
hemanth
