When a page is retrieved, Nutch calculates a digest of its content. If the page changes, the digest will be different, which can be a sign that the page has new content. But be careful: modern sites carry a lot of ads, and those can change the digest even when the real content has not changed.
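
To make the digest idea concrete, here is a rough sketch in plain Python. It is not Nutch's own signature code. The function names `content_digest` and `changed_pages` are made up for illustration, and stripping `script`, `style` and `iframe` blocks is only an assumed heuristic for where ads tend to live. It hashes the visible text of a page and compares two daily snapshots of `{url: digest}`.

```python
import hashlib
import re


def content_digest(html):
    """Return an MD5 hex digest of the page text with volatile parts removed.

    Assumption: ads and other noise mostly sit in script/style/iframe blocks.
    """
    # Drop whole script/style/iframe elements, including their content.
    stripped = re.sub(r"(?is)<(script|style|iframe)\b.*?</\1\s*>", " ", html)
    # Drop the remaining tags, keeping only the text between them.
    text = re.sub(r"(?s)<[^>]+>", " ", stripped)
    # Collapse whitespace and lowercase so trivial layout changes are ignored.
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def changed_pages(old, new):
    """Compare two {url: digest} snapshots (hypothetical daily dumps)."""
    added = sorted(url for url in new if url not in old)
    modified = sorted(url for url in new if url in old and new[url] != old[url])
    return {"added": added, "modified": modified}


if __name__ == "__main__":
    yesterday = {"www.url1.com/": content_digest("<p>Old text</p>")}
    today = {
        "www.url1.com/": content_digest("<p>Old text</p><script>ad()</script>"),
        "www.url1.com/new_content": content_digest("<p>Fresh article</p>"),
    }
    print(changed_pages(yesterday, today))
```

Run once a day over the URLs from the crawl, the "added" list gives something close to the "www.url1.com added new content" report asked for. How well the ad filtering works depends on the sites, so the heuristic will probably need tuning.
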
Your task requires some brainstorming and thinking.

Best Regards
Alexander Aristov

On 5 November 2010 15:13, Chris <[email protected]> wrote:

> Yes .. that looks good - there is a white list for enterprise searches.
> Sounds exactly as one part I need.
>
> How about the other?
> Is there a way of doing a diff between two versions?
> Do you know that?
>
> On 05.11.2010 13:49, Eric Martin wrote:
>
>> I know urlfilter will allow you to specify domain crawl only. (no crawl
>> outside links)
>>
>> -----Original Message-----
>> From: Chris [mailto:[email protected]]
>> Sent: Thursday, November 04, 2010 10:43 PM
>> To: [email protected]
>> Subject: Updates of websites
>>
>> Hello,
>>
>> I read a bit of the documentation but I never installed Nutch or so.
>> First of all, I am wondering whether what I want is possible with Nutch.
>>
>> I have a bunch of websites .. like 200 or so and I'd like to monitor
>> them - see whether someone adds new content etc.
>>
>> With bin/nutch inject crawl/crawldb seed it is possible to add my list
>> of URLs as I read.
>>
>> Two things: Can I tell Nutch, not to follow outgoing links?
>> Is it possible to see a website / statistics / whatever like:
>> Today, 23rd October 2010
>> Website: www.url1.com added new content: www.url1.com/new_content
>>
>> And I'd like to have this daily.
>>
>> Is there a way of doing it with Nutch?
>>
>> Thanks already
>> Best regards
>> Chris

