Hi,

We are trying to crawl a site that requires cookies to be set. When the original URL is fetched from a browser, it is redirected to a page containing JavaScript; the script simply re-requests the URL with an additional parameter indicating that the referrer is the script. A cookie appears to be set in the request header for that second request, and the server then sends back the actual HTML content.
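To illustrate what we expect the protocol layer to do here (the URLs and cookie value below are made up for illustration), a small sketch with java.net.CookieManager: the Set-Cookie from the first response has to be replayed on the follow-up request, otherwise the server keeps serving the JavaScript page.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

public class CookieReplayDemo {
    /**
     * Stores the Set-Cookie header from the first response and returns the
     * request headers that a follow-up fetch of followUpUrl should carry.
     */
    static Map<String, List<String>> replayHeaders(String firstUrl,
                                                   String followUpUrl,
                                                   String setCookie)
            throws Exception {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        // First response (the JavaScript page) arrives with a Set-Cookie header.
        manager.put(new URI(firstUrl),
                    Map.of("Set-Cookie", List.of(setCookie)));
        // The second request must replay that cookie to get the real HTML.
        return manager.get(new URI(followUpUrl), Map.of());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical URLs; the real site appends a referrer parameter.
        System.out.println(replayHeaders(
                "http://example.com/landing",
                "http://example.com/landing?ref=script",
                "session=abc123; Path=/"));
    }
}
```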
When we crawl this page through Nutch, the crawl stops at the JavaScript page. We tried giving both the original URL and the redirected URL as seeds in the same fetcher instance, hoping the second seed would reuse the cookie, but it still did not fetch the contents. We also tried this with the protocol-httpclient plugin enabled. Debugging org.apache.nutch.protocol.httpclient.Http in standalone mode, we can see that the second GET request does send the cookie, but this did not help.

What is the recommended configuration to handle such crawl requirements? Can anyone suggest how to debug this further?

Thanks,
hemanth
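For reference, the change we made in nutch-site.xml to enable the plugin looks roughly like this (the exact plugin list besides protocol-httpclient is just our local setup, not a recommendation):

```xml
<property>
  <name>plugin.includes</name>
  <!-- protocol-httpclient replaces the default protocol-http so that
       cookies from earlier responses are kept across requests -->
  <value>protocol-httpclient|urlfilter-regex|parse-html|index-basic|scoring-opic</value>
</property>
```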

