RE: how to force nutch to do a recrawl

2009-12-14 Thread Peters, Vijaya
Adam, I finally got the command to work on another server (see below). To change the retry interval, should I just add the two properties to nutch-site.xml (though I tried this before and it didn't work)? http://mysite/ Version: 7 Status: 2 (db_fetched) Fetch time: Fri Jan 08 15:42:33 EST 2010

RE: how to force nutch to do a recrawl

2009-12-14 Thread BELLINI ADAM
Yes, just add those settings to nutch-site.xml and it should work. But are you really going to recrawl every hour? I see 3600 seconds! Another thing: you have to do an initial clean crawl with the new fetch time, because the fetch time already stored in the crawldb will not change automatically.

RE: how to force nutch to do a recrawl

2009-12-14 Thread Peters, Vijaya
Thanks. I'm on a development system, so every hour is okay. I guess that's why the last time I changed the properties file it didn't take any effect (because crawldb won't change the fetch time automatically). I'll give this a try - thanks much. Vijaya Peters SRA International, Inc. 4350 Fair

RE: how to force nutch to do a recrawl

2009-12-14 Thread BELLINI ADAM
But think about one thing: if you are recrawling too many URLs and the crawl takes more than one hour, your crawl will never finish, because every time it finds a URL it will see that the fetch time is due and fetch it again. To set your fetch time properly you have to crawl
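The point about a crawl that never finishes can be illustrated with a small sketch. It is a simplified model, not Nutch's actual scheduling code: it assumes a page becomes due once the current time (optionally shifted forward by a number of days, as the -adddays option of generate is described as doing) reaches the fetch time stored in the crawldb. The function name is_due and its arguments are hypothetical.

```shell
# Simplified model (assumption): a page is due when "now", shifted forward
# by add_days, has reached the fetch time recorded for the page.
is_due() {
  next_fetch=$1        # scheduled fetch time, in epoch seconds
  now=$2               # current time, in epoch seconds
  add_days=${3:-0}     # optional forward shift, in days
  [ $(( now + add_days * 86400 )) -ge "$next_fetch" ]
}

# A page fetched at t=0 with a 3600 s interval is scheduled for t=3600.
# If the crawl is still running at t=7200, the page is already due again.
if is_due 3600 7200; then
  echo "page is due again before the crawl has finished"
fi
```

With a two-hour interval and a ten-minute crawl, the same check stays false for the whole run, which is why an interval comfortably longer than the crawl duration avoids the loop.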

RE: how to force nutch to do a recrawl

2009-12-14 Thread Peters, Vijaya
Okay. Our fetch finishes in less than 10 minutes (just intranet). But, I'll set it to 2 hours.

RE: how to force nutch to do a recrawl

2009-12-11 Thread Peters, Vijaya
Adam, I'm using cygwin to run the scripts. I use EditPlus to edit the files. But EditPlus won't allow me to edit the crc file. I'll see if I can ftp the file to a unix machine. Vijaya Peters SRA International, Inc. 12500 Fair Lakes Circle Room 3507 Fairfax, VA 22033 Tel: 703-222-9207

RE: how to force nutch to do a recrawl

2009-12-11 Thread BELLINI ADAM
Hi, you shouldn't open the crc file; you have to open the other one, which is part-0. Use vi to edit part-. If you can't find this file, your dump failed; just check the logs/hadoop.log file.

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM
Hi, check the fetch time in your crawldb. You can dump the whole crawldb like this: ./bin/nutch readdb $your_crawl_rep/crawldb/ -dump whole_db Entries will look like this: http://www.YOUR_URL_TO_FETCH Status: 2 (db_fetched) Fetch time: Thu Dec 10 09:19:18 EST 2009 Modified time: Wed Dec 31
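When only one page is of interest, a full dump may be more than is needed. The sketch below wraps the -url and -stats modes of readdb in two small helpers; the helper names and the NUTCH variable are illustrative assumptions, and the crawldb path and URL in the usage comment are placeholders taken from this thread.

```shell
# Path to the nutch launcher script (assumption: run from the Nutch home).
NUTCH=${NUTCH:-bin/nutch}

# Print the crawldb record (status, fetch time, ...) for a single URL.
show_entry() {
  "$NUTCH" readdb "$1" -url "$2"
}

# Print summary counts for the whole crawldb.
show_stats() {
  "$NUTCH" readdb "$1" -stats
}

# Usage (placeholder values):
#   show_entry crawl/crawldb http://mysite/
#   show_stats crawl/crawldb
```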

RE: how to force nutch to do a recrawl

2009-12-10 Thread Peters, Vijaya
Adam, I tried running that command and get the following (it created a whole_db directory, but it's not dumping out the contents to the console): $ bin/nutch readdb crawl/crawldb/ -dump whole_db CrawlDb dump: starting CrawlDb db: crawl/crawldb/ CrawlDb dump: done

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM
It will not dump to the console! whole_db is a folder, and you have to open the file you will find in that folder.
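Since the dump lands in a folder rather than on the console, a short sketch of reading it from the shell may help. It assumes the dump is plain text, that each record starts with a line beginning with the URL, and that a later line starts with "Fetch time:", as in the entries quoted in this thread. The function name list_fetch_times is hypothetical, and the glob part-* is used so that no exact file name is assumed.

```shell
# Print "URL<TAB>fetch time" for every record in a readdb dump directory.
# $1: the directory given to "readdb ... -dump", e.g. whole_db
list_fetch_times() {
  # part-* matches the data files; the hidden checksum files are skipped
  # because they start with a dot.
  cat "$1"/part-* | awk '
    /^https?:\/\// { url = $1 }                 # remember the current URL
    /^Fetch time:/ {
      sub(/^Fetch time:[ \t]*/, "")             # keep only the timestamp
      print url "\t" $0
    }
  '
}

# Usage (placeholder directory):
#   list_fetch_times whole_db
```

This also works under cygwin, so no separate editor is needed just to look at the fetch times.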

RE: how to force nutch to do a recrawl

2009-12-10 Thread Peters, Vijaya
Adam, What do I use to open a CRC file? I tried QuickSFV. Thanks in advance!

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM
Just use vi or vim. I use vi to edit the file.

RE: how to force nutch to do a recrawl

2009-12-10 Thread Peters, Vijaya
Adam, I'm on windows unfortunately!! I'm using cygdrive, but it doesn't recognize vi. Any idea for opening it in windows? Notepad didn't work either.

RE: how to force nutch to do a recrawl

2009-12-10 Thread BELLINI ADAM
But how are you running the sh scripts? You have to use cygwin to be able to edit Linux files.

Re: how to force nutch to do a recrawl

2009-12-09 Thread xiao yang
What do you mean by recrawl? Does the following command meet your needs? bin/nutch crawl urls -dir crawl -depth 3 -topN 50 Change the destination directory to one different from the last crawl. On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya vijaya_pet...@sra.com wrote: I'm running Nutch

RE: how to force nutch to do a recrawl

2009-12-09 Thread Peters, Vijaya
I tried that and it worked a few times, but now I get 0 records selected for fetching. $ bin/nutch crawl urls -dir crawl9a -depth 15 -topN 50 crawl started in: crawl9a rootUrlDir = urls threads = 10 depth = 15 topN = 50 Injector: starting Injector: crawlDb: crawl9a/crawldb Injector: urlDir: urls

Re: how to force nutch to do a recrawl

2009-12-09 Thread MilleBii
Nutch only recrawls every 30 days by default. So set numberDays appropriately and it will recrawl. Read nutch-default.xml for the details.

RE: how to force nutch to do a recrawl

2009-12-09 Thread Peters, Vijaya
I tried that too. In nutch-site.xml, I added the below, but this had no effect. <property> <name>db.default.fetch.interval</name> <value>0</value> <description>(DEPRECATED) The default number of days between re-fetches of a page. value was 30</description> </property> <property>

Re: how to force nutch to do a recrawl

2009-12-09 Thread xiao yang
What about the configuration in crawl-urlfilter.txt?

RE: how to force nutch to do a recrawl

2009-12-09 Thread Peters, Vijaya
I didn't see a setting to override in crawl-urlfilter. How do I set numberDays? I have regular expressions to include/exclude certain extensions and certain urls, but that's all I have in there. Please send me an example and I'll give it a try. Thanks!

Re: how to force nutch to do a recrawl

2009-12-09 Thread MilleBii
I don't think you can use the nutch crawl command to do that; it is a one-stop-shop command. You probably want to use the individual commands. Type nutch generate to get the help and you will see the option -adddays. Read that page on the wiki to get a feel for how you should do it:
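As a rough sketch of what "individual commands" could look like, the function below runs one generate, fetch and updatedb round against an existing crawl directory, passing -adddays to generate so that pages count as due earlier than their stored fetch time. It is an illustration under assumptions, not a tested recrawl script: the function name recrawl_once, the NUTCH variable and the way the newest segment is picked are choices made here, and rebuilding the index afterwards is left out.

```shell
# Path to the nutch launcher script (assumption: run from the Nutch home).
NUTCH=${NUTCH:-bin/nutch}

# One generate/fetch/updatedb round.
# $1: crawl directory containing crawldb/ and segments/ (e.g. crawl)
# $2: days to add to the current time when selecting pages (e.g. 31)
# $3: maximum number of pages to select (e.g. 50)
recrawl_once() {
  crawl_dir=$1
  add_days=$2
  top_n=$3

  # Select pages as if add_days days had already passed.
  "$NUTCH" generate "$crawl_dir/crawldb" "$crawl_dir/segments" \
      -topN "$top_n" -adddays "$add_days" || return 1

  # Segment directories are named by timestamp, so the newest sorts last.
  segment=$(ls -d "$crawl_dir"/segments/* | tail -n 1)

  # Fetch the selected pages, then fold the results back into the crawldb.
  "$NUTCH" fetch "$segment" || return 1
  "$NUTCH" updatedb "$crawl_dir/crawldb" "$segment"
}

# Usage (placeholder values):
#   recrawl_once crawl 31 50
```

Choosing 31 for the second argument is meant to step past the 30-day default mentioned earlier in the thread; a shorter configured interval would need a correspondingly smaller value.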

RE: how to force nutch to do a recrawl

2009-12-09 Thread Peters, Vijaya
Okay. I'll dig a little deeper. I saw a few scripts that people had created, but I couldn't get them to work. Thanks much.