Hi Rémy,

> I'm wondering about how nutch handle cookies defined while fetching a page.
>
> 1) are those cookies used when nutch is crawling urls generated from that page ?
Generally, cookies are ignored. But have a look at
  https://issues.apache.org/jira/browse/NUTCH-827
Your problem is almost the same as the POST authentication via a login page.
The starting shop page is just a login page without user name and password.

> 2) is there a way to configure Nutch so the values of some of those cookies are considered as part of the identity of the page (as well as the URL) (ready to do some dev if necessary)

URLs are the unique keys that tie together all kinds of information about a
page (content, crawl status, inlinks). Different content for the same URL is
impossible within one crawl. You have to split the crawl (one for each shop).
Maybe the shop system also accepts the shop id as a query parameter. Then you
can fake the URLs.
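
To illustrate the query-parameter idea, here is a minimal sketch in Python (not
Nutch code). It assumes the shop system accepts a parameter named "shop", which
is purely hypothetical, and it uses a plain dict as a toy stand-in for a
URL-keyed store. The helper name with_shop and the example.com URLs are also
made up for the illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter name: the real shop system may use another one,
# or may not accept the shop id in the URL at all.
SHOP_PARAM = "shop"


def with_shop(url, shop_id):
    """Return url with the shop id set as a query parameter."""
    parts = urlsplit(url)
    # Keep existing parameters, but drop any previous shop value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != SHOP_PARAM
    ]
    query.append((SHOP_PARAM, shop_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Toy stand-ins for a store keyed by URL (not Nutch's actual data structures).
plain = {}
per_shop = {}
product = "http://example.com/product/123"

for shop in ("paris", "lyon"):
    # Same key for every shop: the later entry replaces the earlier one.
    plain[product] = "page as seen by " + shop
    # One distinct key per (product, shop) pair.
    per_shop[with_shop(product, shop)] = "page as seen by " + shop
```

With the plain URL only one entry survives, whereas the rewritten URLs give one
entry per shop. Whether this works in practice depends entirely on the shop
system honouring such a parameter.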

Maybe there are better solutions or work-arounds. But I don't know a good one.

Sebastian

On 04/05/2012 11:28 AM, Rémy Amouroux wrote:
Hi all


2) is there a way to configure Nutch so the values of some of those cookies are 
considered as part of the identity of the page (as well as the URL) (ready to 
do some dev if necessary)

For the last point, I'm trying to fetch an e-commerce web site working for 
different shops selling the same products. You can enter a shop via a specific 
url (shop-home) that will set a cookie for this shop. And then, the urls for 
the product are exactly the same whatever the shop, but the information on the 
page (price, availability and so on) is different depending on the cookie 
defining the shop.

Thus, with the usual nutch config, beginning the fetch using all "shop-home" 
urls as seeds, nutch will fetch only one page per product (url being the identity) and 
not one page per product / shop.

Is my analysis correct ?
Is there a way around that ?

Regards

RemyA
