When a page is retrieved, Nutch calculates a digest of its content. If the
page changes, the digest will be different, which can serve as a sign that the
page has new content. But be careful: modern sites carry a lot of ads, and
these can change the digest even when the real content has not changed.
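
A rough sketch of the digest idea, written in plain Python rather than against
Nutch itself. It assumes you can get the fetched content of each URL from two
crawl runs. The function names and the sample URLs and pages are made up for
illustration, and MD5 is used here only as an example hash:

```python
import hashlib


def digest(content: bytes) -> str:
    """Return an MD5 hex digest of the raw page bytes."""
    return hashlib.md5(content).hexdigest()


def compare_snapshots(old: dict, new: dict):
    """Compare two {url: digest} maps and return (added, changed) URL lists."""
    # URLs present today but not seen in the previous run
    added = sorted(url for url in new if url not in old)
    # URLs seen in both runs whose digest differs
    changed = sorted(url for url in new if url in old and new[url] != old[url])
    return added, changed


# Hypothetical sample data standing in for two daily crawls
yesterday = {
    "http://www.url1.com/": digest(b"<html>home v1</html>"),
    "http://www.url1.com/about": digest(b"<html>about</html>"),
}
today = {
    "http://www.url1.com/": digest(b"<html>home v2</html>"),
    "http://www.url1.com/about": digest(b"<html>about</html>"),
    "http://www.url1.com/new_content": digest(b"<html>new</html>"),
}

added, changed = compare_snapshots(yesterday, today)
print("added:", added)
print("changed:", changed)
```

Note that a digest over the raw bytes flags any change at all, including a
rotated ad, so in practice you would probably want to hash only the extracted
main text of the page.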

Your task requires some brainstorming and thought.

Best Regards
Alexander Aristov


On 5 November 2010 15:13, Chris <[email protected]> wrote:

> Yes, that looks good - there is a whitelist for enterprise searches.
> That sounds like exactly one part of what I need.
>
> How about the other?
> Is there a way of doing a diff between two versions?
> Do you know that?
>
> On 05.11.2010 13:49, Eric Martin wrote:
>
>> I know urlfilter will allow you to restrict the crawl to specific domains
>> (no crawling of outside links).
>>
>> -----Original Message-----
>> From: Chris [mailto:[email protected]]
>> Sent: Thursday, November 04, 2010 10:43 PM
>> To: [email protected]
>> Subject: Updates of websites
>>
>> Hello,
>>
>> I have read a bit of the documentation, but I have never installed Nutch.
>> First of all, I am wondering whether what I want is possible with Nutch.
>>
>> I have a bunch of websites  .. like 200 or so and I'd like to monitor
>> them - see whether someone adds new content etc.
>>
>> With  bin/nutch inject crawl/crawldb seed  it is possible to add my list
>> of URLs as I read.
>>
>> Two things: Can I tell Nutch, not to follow outgoing links?
>> Is it possible to see a website / statistics / whatever like:
>> Today, 23rd October 2010
>> Website: www.url1.com added new content: www.url1.com/new_content
>>
>> And I'd like to have this daily.
>>
>> Is there a way of doing it with Nutch?
>>
>> Thanks already
>> Best regards
>> Chris
>>
>>
>>
>>
