Adam,
I finally got the command to work on another server (see below). To
change the retry interval, should I just add the two properties to
nutch-site.xml? (I tried this before and it didn't work.)
http://mysite/ Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Jan 08 15:42:33 EST 2010
Yes, just add those properties to nutch-site.xml and it should work. But are
you going to recrawl every hour??? I see 3600 seconds!!
One other thing: you have to make an initial clean crawl with the new
fetch time, because the crawldb will not change the fetch time
automatically.
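For reference, the override in question presumably looks something like the sketch below. This is an assumption, not a confirmed recipe: the property name db.fetch.interval.default (interval in seconds) is taken from 1.x-era nutch-default.xml, so check it against your Nutch version before relying on it.

```xml
<!-- Hypothetical nutch-site.xml override: re-fetch pages after one hour
     (3600 seconds). Property name per 1.x-era nutch-default.xml; verify
     against the nutch-default.xml shipped with your version. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
  <description>Default interval between re-fetches of a page, in seconds.</description>
</property>
```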
Thanks.
I'm on a development system, so every hour is okay.
I guess that's why it didn't take effect the last time I changed the
properties file (because the crawldb won't change the fetch time
automatically).
I'll give this a try - thanks much.
Vijaya Peters
SRA International, Inc.
But just think about one thing... if you are recrawling too many URLs and the
crawl takes more than 1 hour, your crawl will never finish, because
every time it finds a URL it will see that the fetch time is due and
fetch it again.
To set your fetch time well, you have to time your crawl first.
Okay. Our fetch finishes in less than 10 minutes (just intranet). But
I'll set it to 2 hours.
Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA 22033
Tel: 703-502-1184
www.sra.com
Named to FORTUNE's 100 Best Companies to Work For list for 10
consecutive years
Adam,
I'm using Cygwin to run the scripts and EditPlus to edit the files, but
EditPlus won't let me open the .crc file. I'll see if I can FTP the file
to a Unix machine.
Vijaya Peters
SRA International, Inc.
12500 Fair Lakes Circle
Room 3507
Fairfax, VA 22033
Tel: 703-222-9207
Hi,
you shouldn't open the .crc file; you have to open the other one, which is
part-0.
Use vi to edit it.
If you don't find this file, your dump failed... just check the
logs/hadoop.log file.
Subject: RE: how to force nutch to do a recrawl
Date: Fri, 11 Dec 2009 09:14:26
Hi,
check the fetch time in your crawldb. You can dump the whole crawldb like this:
./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db
entries will look like this:
http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
Modified time: Wed Dec 31
Adam,
I tried running that command and got the following (it created a
whole_db directory, but it's not dumping the contents to the
console):
$ bin/nutch readdb crawl/crawldb/ -dump whole_db
CrawlDb dump: starting
CrawlDb db: crawl/crawldb/
CrawlDb dump: done
Vijaya Peters
SRA International,
It will not dump to the console!
whole_db is a folder; you have to edit the file you will find in that folder.
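If all you need is a quick look rather than a full edit, grep works too. A sketch, assuming the dump landed in whole_db/part-00000 (Hadoop's usual name for the first output file); the sample entry below only mirrors the format shown earlier, a real run writes this file itself:

```shell
# Create a sample dump entry in the readdb output format (illustration
# only; in a real run, bin/nutch readdb writes this file for you).
mkdir -p whole_db
cat > whole_db/part-00000 <<'EOF'
http://www.YOUR_URL_TO_FETCH
Status: 2 (db_fetched)
Fetch time: Thu Dec 10 09:19:18 EST 2009
EOF

# Pull out just the fetch times to see when entries become due again.
FETCH_TIMES=$(grep '^Fetch time:' whole_db/part-00000)
echo "$FETCH_TIMES"
```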
Subject: RE: how to force nutch to do a recrawl
Date: Thu, 10 Dec 2009 14:26:30 -0500
From: vijaya_pet...@sra.com
To: nutch-user@lucene.apache.org
Adam,
What do I use to open a CRC file? I tried QuickSFV. Thanks in advance!
Vijaya Peters
Just use vi or vim.
I use vi to edit the file.
Subject: RE: how to force nutch to do a recrawl
Date: Thu, 10 Dec 2009 15:58:24 -0500
From: vijaya_pet...@sra.com
To: nutch-user@lucene.apache.org
Adam,
I'm on Windows, unfortunately!! I'm using cygdrive, but it doesn't
recognize vi. Any ideas for opening it on Windows? Notepad didn't work
either.
Vijaya Peters
But then how are you running the sh scripts...?
You have to use Cygwin to be able to edit Linux files.
Subject: RE: how to force nutch to do a recrawl
Date: Thu, 10 Dec 2009 16:09:13 -0500
From: vijaya_pet...@sra.com
To: nutch-user@lucene.apache.org
What do you mean by recrawl?
Does the following command meet what you need?
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Change the destination directory to one different from the last crawl.
On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya vijaya_pet...@sra.com wrote:
I'm running Nutch
I tried that and it worked a few times, but now I get 0 records selected for
fetching.
$ bin/nutch crawl urls -dir crawl9a -depth 15 -topN 50
crawl started in: crawl9a
rootUrlDir = urls
threads = 10
depth = 15
topN = 50
Injector: starting
Injector: crawlDb: crawl9a/crawldb
Injector: urlDir: urls
Nutch only recrawls every 30 days by default. So set the numberDays
adequately and it will recrawl. Read nutch-default.xml to get the
details.
2009/12/9, xiao yang yangxiao9...@gmail.com:
I tried that too.
In nutch-site.xml, I added the below, but it had no effect:
<property>
  <name>db.default.fetch.interval</name>
  <value>0</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a
  page. Value was 30.</description>
</property>
<property>
What about the configuration in crawl-urlfilter.txt?
On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya vijaya_pet...@sra.com wrote:
I didn't see a setting to override in crawl-urlfilter. How do I set
numberDays? I have regular expressions to include/exclude certain extensions
and certain URLs, but that's all I have in there.
Please send me an example and I'll give it a try.
Thanks!
Vijaya Peters
SRA International, Inc.
I don't think you can use the nutch crawl command to do that; it's a one-stop-shop
command.
You probably want to use the individual commands.
Type nutch generate to get the help and you will see the -adddays option;
read that page on the wiki to get a feel for how you should do it.
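The individual-command sequence might look like the sketch below. It is a dry run that only echoes the commands rather than executing them; NUTCH, CRAWL, the segment name, and the -topN and -adddays values are all illustrative assumptions, so adjust them (and drop the echoes) to match your install:

```shell
# Dry-run sketch of a manual recrawl with individual Nutch commands.
# All paths and values below are assumptions for illustration.
NUTCH=bin/nutch                          # assumed Nutch launcher path
CRAWL=crawl                              # assumed crawl directory
SEGMENT=$CRAWL/segments/20091211091426   # hypothetical newest segment name

GEN_CMD="$NUTCH generate $CRAWL/crawldb $CRAWL/segments -topN 50 -adddays 31"
FETCH_CMD="$NUTCH fetch $SEGMENT"
UPDATE_CMD="$NUTCH updatedb $CRAWL/crawldb $SEGMENT"

echo "$GEN_CMD"     # -adddays 31 makes 30-day-old entries due for fetch
echo "$FETCH_CMD"   # fetch the segment that generate just created
echo "$UPDATE_CMD"  # write the new fetch times back into the crawldb
```

In a real run you would pick up the segment that generate actually created (e.g. the newest directory under $CRAWL/segments) instead of hard-coding it.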
Okay. I'll dig a little deeper. I saw a few scripts that people had
created, but I couldn't get them to work.
Thanks much.
Vijaya Peters