Sounds like too much work at the moment (learning curve, time, money).
Is there an alternative to Nutch that could do what I want?
Best Regards
Chris
On 06.11.2010 17:24, Alexander Aristov wrote:
When a page is retrieved, Nutch calculates a digest. If a page changes, the
digest will be different, which might be a sign that the page is new. But
be careful, since modern sites have a lot of ads, which might influence the digest.
Your task requires some brainstorming and thinking.
Best Regards
Alexander Aristov
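
The digest comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Nutch's own signature code: the function names are made up for this example, and dropping <script> blocks before hashing is an assumed, fairly crude way to keep ad code from changing the digest.

```python
import hashlib
import re

# Assumption: ad code often lives in <script> blocks, so drop them before hashing.
SCRIPT_BLOCK = re.compile(r"<script\b.*?</script>", re.IGNORECASE | re.DOTALL)


def page_digest(html: str) -> str:
    """Return an MD5 hex digest of the page with scripts removed and whitespace collapsed."""
    without_scripts = SCRIPT_BLOCK.sub("", html)
    normalized = re.sub(r"\s+", " ", without_scripts).strip()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def has_changed(old_html: str, new_html: str) -> bool:
    """Report whether two fetches of the same URL differ after normalization."""
    return page_digest(old_html) != page_digest(new_html)
```

A real site would likely need a more careful notion of "noise" (ad containers, timestamps, session IDs) than this single pattern.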
On 5 November 2010 15:13, Chris <[email protected]> wrote:
Yes .. that looks good - there is a white list for enterprise searches.
Sounds exactly like one part of what I need.
How about the other?
Is there a way of doing a diff between two versions?
Do you know that?
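
One possible way to diff two versions outside of Nutch is to keep the text of each fetch and compare it with Python's standard difflib module. The sketch below assumes both versions are already available as plain-text strings; the helper name added_lines is hypothetical.

```python
import difflib
from typing import List


def added_lines(old_text: str, new_text: str) -> List[str]:
    """Return the lines that appear in new_text but not in old_text."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # ndiff prefixes each added line with "+ "; keep only those, minus the prefix.
    return [line[2:] for line in difflib.ndiff(old_lines, new_lines)
            if line.startswith("+ ")]
```

This only reports additions; removed lines carry the "- " prefix in the same ndiff output and could be collected the same way.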
On 05.11.2010 13:49, Eric Martin wrote:
I know urlfilter will allow you to restrict the crawl to a domain (no
crawling of outside links).
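
To make the idea of a domain-only crawl concrete, here is a small Python sketch of the check involved. In Nutch this is handled through URL filter configuration, not code like this; the function names are hypothetical, and the sketch assumes every URL carries a scheme such as http://.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def same_host(seed_url: str, link: str) -> bool:
    """True if link points at the same host as the seed URL."""
    # Assumption: both URLs include a scheme, otherwise hostname is None.
    seed_host = urlparse(seed_url).hostname
    return seed_host is not None and seed_host == urlparse(link).hostname


def internal_links(seed_url: str, links: Iterable[str]) -> List[str]:
    """Keep only the links that stay on the seed's host."""
    return [link for link in links if same_host(seed_url, link)]
```

Comparing exact hostnames treats www.url1.com and blog.url1.com as different sites; whether that is desirable depends on the sites being monitored.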
-----Original Message-----
From: Chris [mailto:[email protected]]
Sent: Thursday, November 04, 2010 10:43 PM
To: [email protected]
Subject: Updates of websites
Hello,
I have read a bit of the documentation, but I have never installed Nutch.
First of all, I am wondering whether what I want is possible with Nutch.
I have a bunch of websites .. like 200 or so and I'd like to monitor
them - see whether someone adds new content etc.
From what I read, bin/nutch inject crawl/crawldb seed lets me add my list
of URLs.
Two things: Can I tell Nutch not to follow outgoing links?
Is it possible to see a website / statistics / whatever like:
Today, 23rd October 2010
Website: www.url1.com added new content: www.url1.com/new_content
And I'd like to have this daily.
Is there a way of doing it with Nutch?
Thanks in advance
Best regards
Chris
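
The daily report asked for in the original message could be approximated by comparing the set of URLs known from earlier crawls with the set found today. The Python sketch below assumes those two URL lists have already been exported from the crawl by some means; the function name and the ISO date format are choices made for this example only.

```python
from datetime import date
from typing import Iterable, List


def new_content_report(site: str,
                       known_urls: Iterable[str],
                       crawled_urls: Iterable[str],
                       today: date) -> List[str]:
    """Build report lines for URLs seen today that were not known before."""
    new_urls = sorted(set(crawled_urls) - set(known_urls))
    if not new_urls:
        return []  # nothing new for this site today
    lines = [f"Today, {today.isoformat()}"]
    lines.extend(f"Website: {site} added new content: {url}" for url in new_urls)
    return lines
```

Run once a day per site (for example from cron), this reports newly appeared URLs only; changes to existing pages would need a digest or diff comparison on top.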