Hi, (using Nutch 1.3 on Ubuntu 11.04, kernel 2.6.38-10-generic-pae)

I would like to crawl a news site and am having trouble with cookie support
(which I haven't understood well enough yet).

Here is the status of the project. When I crawl the URLs I get these
messages in the Nutch log (and on the console):
> Skipping http://ntvmsnbc.com/id/25237248 as content is not fetched successfully
> Skipping http://ntvmsnbc.com/id/25237249 as content is not fetched successfully
> Skipping http://ntvmsnbc.com/id/25237253 as content is not fetched successfully


I looked at the segment dump and saw that the site redirects the client to
an address that sets a cookie, then redirects back to the original address
and shows the content. To clarify:
1. The client requests the address: http://ntvmsnbc.com/id/25237248
2. The server responds with a Location header redirecting to:
http://www.ntvmsnbc.com/redirect.aspx?to=http%3a%2f%2fwww.ntvmsnbc.com%2fid%2f25237248%2f&from=http%3a%2f%2fntvmsnbc.com%2fid%2f25237248%2f&mskey=323dcf07efd1a45a3851a0199679e65e
3. That address sets a cookie and redirects the client back to the
originally requested address: http://ntvmsnbc.com/id/25237248
4. The content is shown this time.
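
To make the flow above concrete, here is a minimal sketch in Python (not
Nutch code, and it makes no network requests) that simulates it. The
fake_server function, the cookie name "seen", and the shortened redirect
URL are my own assumptions for illustration; the real site may behave
differently, and the sketch ignores cookie domains entirely. It only shows
why a client that drops cookies would keep bouncing between the two
addresses until the redirect limit is reached:

```python
# Toy simulation of the redirect flow described above; no network access.
# The server logic and the cookie name are assumptions for illustration only.

ARTICLE = "http://ntvmsnbc.com/id/25237248"
REDIRECT = "http://www.ntvmsnbc.com/redirect.aspx"  # query string omitted


def fake_server(url, cookies):
    """Return (status, location, set_cookie) the way the site seems to behave."""
    if url == REDIRECT:
        # Step 3: set a cookie and send the client back to the article.
        return 302, ARTICLE, ("seen", "1")
    if "seen" in cookies:
        # Step 4: cookie present, so the content is served.
        return 200, None, None
    # Step 2: no cookie yet, so bounce the client to the redirect page.
    return 302, REDIRECT, None


def fetch(url, accept_cookies, max_redirects=5):
    """Follow redirects up to max_redirects; return the final status or None."""
    cookies = {}
    for _ in range(max_redirects + 1):
        status, location, set_cookie = fake_server(url, cookies)
        if set_cookie and accept_cookies:
            name, value = set_cookie
            cookies[name] = value
        if status != 302:
            return status
        url = location
    return None  # redirect limit exhausted, nothing fetched


print(fetch(ARTICLE, accept_cookies=True))   # content reached
print(fetch(ARTICLE, accept_cookies=False))  # loops until the limit
```

In this toy model the fetch also needs at least two redirects to succeed,
so a redirect limit of 0 or 1 would fail even when cookies are kept.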

With the standard configuration, I cannot get any content at all. I thought
I should try to accept cookies. However, *I don't know how to accept cookies
yet*. My settings work for other sites that do not require cookies.

I tried changing these settings in nutch-default.xml:
http.redirect.max = 5
http.useHttp11 = true
plugin.includes = protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)

Do you think something is wrong with my config?

Appreciate any ideas,
Dincer
