Hi, (USING: nutch 1.3 on Ubuntu 11.04 - 2.6.38-10-generic-pae) I would like to crawl a news site and am having trouble with cookie support (which I haven't understood well enough yet).
Here is the status of the project. When I crawl the URLs, I get these messages in the Nutch log (and on the console):

> Skipping http://ntvmsnbc.com/id/25237248 as content is not fetched successfully
> Skipping http://ntvmsnbc.com/id/25237249 as content is not fetched successfully
> Skipping http://ntvmsnbc.com/id/25237253 as content is not fetched successfully

I looked at the segment dump and saw that the site redirects the client to an address that sets a cookie, then redirects back to the original address and shows the content. To clarify:

1. The client requests the address: http://ntvmsnbc.com/id/25237248
2. The server sends a Location header redirecting to: http://www.ntvmsnbc.com/redirect.aspx?to=http%3a%2f%2fwww.ntvmsnbc.com%2fid%2f25237248%2f&from=http%3a%2f%2fntvmsnbc.com%2fid%2f25237248%2f&mskey=323dcf07efd1a45a3851a0199679e65e
3. This address sets a cookie and redirects the client back to the originally requested address: http://ntvmsnbc.com/id/25237248
4. This time the content is shown.

With the standard configuration, I cannot get any content at all. I thought I should try to accept cookies. However, *I don't know how to accept cookies yet*. My settings work for other sites that do not require cookies.

I tried changing these properties in nutch-default.xml:

http.redirect.max = 5
http.useHttp11 = true
plugin.includes = protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)

Do you think something is wrong with my config?

I'd appreciate any ideas,
Dincer
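
P.S. In case the exact form matters, here is roughly how the three overrides above look when written out as property entries. This is only a sketch of the layout, assuming the usual Hadoop-style <configuration>/<property> structure of the Nutch config files. The names and values are exactly the ones listed above, and nothing here is cookie-specific.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Follow up to 5 redirects during the fetch -->
  <property>
    <name>http.redirect.max</name>
    <value>5</value>
  </property>

  <!-- Use HTTP/1.1 for requests -->
  <property>
    <name>http.useHttp11</name>
    <value>true</value>
  </property>

  <!-- protocol-httpclient is the protocol plugin selected here -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

As far as I know, the same entries can also be placed in nutch-site.xml, which overrides nutch-default.xml.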