Hi all

I'm wondering how Nutch handles cookies that are set while fetching a page.

1) Are those cookies used when Nutch crawls the URLs generated from that page?

2) Is there a way to configure Nutch so that the values of some of those 
cookies are considered part of the identity of the page, along with the URL? 
(I'm ready to do some development if necessary.)

For the last point, I'm trying to crawl an e-commerce web site that serves 
several shops selling the same products. You enter a shop via a specific 
URL (shop-home) that sets a cookie for that shop. The product URLs are then 
exactly the same whatever the shop, but the information on the page (price, 
availability and so on) differs depending on the cookie that identifies the 
shop.

Thus, with the usual Nutch config, starting the crawl with all the "shop-home" 
URLs as seeds, Nutch will fetch only one page per product (the URL being the 
identity) and not one page per product / shop.
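
To make the problem concrete, here is a minimal sketch of what I mean. It is 
plain Python, not Nutch code: the function name, URLs, shop names and prices 
are all made up for illustration, and it only assumes that pages are stored 
under a single identity key.

```python
# Illustration only: this is NOT Nutch code.
# The URLs, shop names and prices below are hypothetical.

def unique_pages(fetches, key):
    """Keep the first fetch seen for each identity key."""
    store = {}
    for fetch in fetches:
        store.setdefault(key(fetch), fetch)
    return store

# Each fetch is (url, shop_cookie, price):
# same product URL, different content per shop.
fetches = [
    ("http://example.com/product/42", "shop-a", "10.00"),
    ("http://example.com/product/42", "shop-b", "12.50"),
]

# Identity = URL only: the second shop's page collapses into the first.
by_url = unique_pages(fetches, key=lambda f: f[0])

# Identity = URL + shop cookie: one entry per product / shop.
by_url_and_shop = unique_pages(fetches, key=lambda f: (f[0], f[1]))

print(len(by_url))           # 1
print(len(by_url_and_shop))  # 2
```

The second keying is the behaviour I would like to get.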

Is my analysis correct?
Is there a way around that?

Regards

RemyA
