Hi Rémy,

> I'm wondering about how Nutch handles cookies defined while fetching a page.
>
> 1) are those cookies used when Nutch is crawling urls generated from that page ?

Generally, cookies are ignored. But have a look at
https://issues.apache.org/jira/browse/NUTCH-827

Your problem is almost the same as POST authentication via a login page: the starting shop page is just a login page without user name and password.

> 2) is there a way to configure Nutch so the values of some of those cookies are considered as part of the identity of the page (as well as the URL) (ready to do some dev if necessary)
URLs are the unique keys used to associate all kinds of information about a page (content, crawl status, inlinks). Different content for the same URL is impossible within one crawl. You have to split the crawl (one for each shop).

Maybe the shop system also accepts the shop id as a query parameter. Then you can fake the URLs.

Maybe there are better solutions or work-arounds, but I don't know a good one.

Sebastian

On 04/05/2012 11:28 AM, Rémy Amouroux wrote:
> Hi all
>
> 2) is there a way to configure Nutch so the values of some of those cookies are considered as part of the identity of the page (as well as the URL) (ready to do some dev if necessary)
>
> For the last point, I'm trying to fetch an e-commerce web site working for different shops selling the same products. You can enter a shop via a specific url (shop-home) that will set a cookie for this shop. The urls for the products are then exactly the same whatever the shop, but the information on the page (price, availability and so on) is different depending on the cookie defining the shop.
>
> Thus, with the usual Nutch config, beginning the fetch using all "shop-home" urls as seeds, Nutch will fetch only one page per product (url being the identity) and not one page per product / shop.
>
> Is my analysis correct ? Is there a way around that ?
>
> Regards
> RemyA
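
A minimal sketch of the query-parameter workaround mentioned in the reply, assuming the shop system really does accept the shop id in the query string. The parameter name `shop`, the `example.com` URLs and the helper names `with_shop_id` and `seed_urls` are hypothetical and not part of Nutch. The idea is only to make each product / shop pair a distinct URL, so that a crawler keyed on URLs stores one record per pair:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_shop_id(url, shop_id, param="shop"):
    """Return `url` with the (hypothetical) shop query parameter set to `shop_id`."""
    parts = urlsplit(url)
    # Drop any existing value of the parameter so the result is deterministic.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, shop_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


def seed_urls(product_urls, shop_ids, param="shop"):
    """Build one distinct URL per (shop, product) pair, e.g. for a seed list."""
    return [with_shop_id(url, shop, param) for shop in shop_ids for url in product_urls]


if __name__ == "__main__":
    products = ["http://example.com/product/1", "http://example.com/product/2"]
    for seed in seed_urls(products, ["paris", "lyon"]):
        print(seed)
```

Depending on the URL filter and normalizer rules in use, URLs carrying a query string may be rejected or rewritten, so those rules would need checking before feeding such a list to a crawl. Links discovered on the fetched pages would also lack the parameter unless the site itself emits it.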

