Hi,

> We are trying to crawl a site that requires cookies to be set. When
> this is tried from the browser, the original URL is redirected to a
> page that contains JavaScript. The script just resends the URL with
> an additional parameter indicating that the referrer is the script.
> A cookie appears to be set in the request header for this page, and
> the server then sends back the actual HTML content.
>
> When we crawled this page through Nutch, the crawl stopped at the
> JavaScript page. When we gave both the original URL and the referred
> URL as seeds in the same fetcher instance, it still did not get the
> contents; we were hoping that the second seed URL would actually use
> the cookies. We tried this with the protocol-httpclient plugin
> enabled as well. Debugging
> org.apache.nutch.protocol.httpclient.Http in standalone mode, we see
> that the second GET request does set the cookie, but this didn't
> help.
>
> What is the recommended configuration to handle such crawl
> requirements? Can anyone suggest how to debug this further?
>

We debugged this problem using a standalone program and the Apache
HttpClient source code. It turned out that the site was sending a
non-standard 'expires' format for the cookie: the value was missing
the day-of-week field with which cookie date formats are supposed to
begin. We were able to fix the standalone program by registering a
custom date format - something like:

import java.util.ArrayList;
import java.util.Collection;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

GetMethod method = new GetMethod(url);
HttpMethodParams params = method.getParams();
Collection patterns = new ArrayList();
patterns.add("EEE, dd MMM yyyy HH:mm:ss zzz"); // standard RFC 1123 format
patterns.add("dd MMM yyyy HH:mm:ss zzz");      // site's day-of-week-less format
params.setParameter(HttpMethodParams.DATE_PATTERNS, patterns);
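The underlying parse failure can be reproduced with plain
java.text.SimpleDateFormat, independent of HttpClient: the standard
RFC 1123 pattern insists on a leading day-of-week, so a value without
one is rejected until the extra pattern is tried. (The 'expires' value
below is made up for illustration.)

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class ExpiresDemo {
    public static void main(String[] args) throws ParseException {
        // An 'expires' value shaped like the one the site sends: no day-of-week.
        String expires = "01 Jan 2037 00:00:00 GMT";

        // The standard RFC 1123 pattern expects "EEE, " up front,
        // so it rejects the value.
        try {
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US)
                    .parse(expires);
            System.out.println("RFC 1123 pattern: parsed");
        } catch (ParseException e) {
            System.out.println("RFC 1123 pattern: failed");
        }

        // The custom pattern accepts it.
        System.out.println("custom pattern: "
                + new SimpleDateFormat("dd MMM yyyy HH:mm:ss zzz", Locale.US)
                        .parse(expires));
    }
}
```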

While this fixed the standalone application, doing something similar
in Nutch looks more difficult. We'd appreciate any input on whether
there is a way to configure the httpclient classes when they are used
in the context of Nutch.
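One avenue we have not yet tried: since the plugin builds on
commons-httpclient 3.x, the pattern could be registered on the
library's process-wide defaults, which each HttpMethod consults unless
a per-method value overrides it. The class and constant names below
are from commons-httpclient 3.x; whether Nutch's protocol-httpclient
plugin leaves these defaults untouched is an assumption.

```java
import java.util.Arrays;
import org.apache.commons.httpclient.params.DefaultHttpParams;
import org.apache.commons.httpclient.params.HttpMethodParams;

// Sketch: register the date patterns globally (e.g. from a plugin's
// initialization path) so methods created inside protocol-httpclient
// inherit them without per-request configuration.
DefaultHttpParams.getDefaultParams().setParameter(
        HttpMethodParams.DATE_PATTERNS,
        Arrays.asList(
                "EEE, dd MMM yyyy HH:mm:ss zzz", // RFC 1123 (standard)
                "EEEE, dd-MMM-yy HH:mm:ss zzz",  // RFC 1036
                "EEE MMM d HH:mm:ss yyyy",       // ANSI C asctime()
                "dd MMM yyyy HH:mm:ss zzz"));    // the site's variant
```

Note that replacing the default list wholesale drops any pattern not
re-listed here, so the standard formats are repeated explicitly.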

Thanks
Hemanth
