How to get all the crawled pages for a particular domain

2009-12-09 Thread bhavin pandya
Hi, I have set up Nutch 1.0 on a cluster of 3 nodes. We are running two applications. 1. A Nutch-based search application. We have successfully crawled approx. 25M pages on 3 nodes. It's working as expected. 2. I am running an application which needs to extract some information for a particular

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi, Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh. I don't know what to do! I've tried all kinds of stuff but with no luck... :(

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Andrzej Bialecki
Eran Zinman wrote: Hi, Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh Do you use the scripts in place, i.e. without deploying the

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi Andrzej, Thanks for your help (as always). Still getting same exception when running on standalone Hadoop cluster. Getting same exceptions as before - also in the datanode log I'm getting: 2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call to 10.0.0.2:9000 failed on

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi, Running new Nutch version status: 1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode). 2. Nutch doesn't work when I set it up to work with Hadoop, either in a single-node or cluster setup. I'm getting an exception: ERROR namenode.NameNode - java.lang.SecurityException:

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Dennis Kubes
1) Is this a new or existing Hadoop cluster? 2) What Java version are you using and what is your environment? Dennis Eran Zinman wrote: Hi, Running new Nutch version status: 1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode). 2. Nutch doesn't work when I set it up to

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi Dennis, 1) I've initially tried to run on my existing DFS and it didn't work. I then made a backup of my DFS and performed a format and it still didn't work... 2) I'm using: java version 1.6.0_0 OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12) OpenJDK Client VM (build

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Dennis Kubes
Did you do a fresh install of Nutch with Hadoop 0.20 or did you just copy over the new jars? The sealing violation means multiple copies of the same jars are being loaded, and the Jetty versions changed between Hadoop 0.19 and 0.20. Dennis Eran Zinman wrote: Hi Dennis, 1) I've initially tried to run

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi Dennis, Thanks for trying to help. I don't know what fresh install means exactly. Here is what I've done: 1) Downloaded latest version of Nutch from the SVN to a new folder. 2) Copied all the custom plugins I've written to the new folder 3) Edited all configuration files. 4) Executed ant

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Eran Zinman
Hi all, thanks Dennis - you helped me solve the problem. The problem was that I had two versions of Jetty in my lib folder. I deleted the old version and voilà - it works. The problem is that both versions exist in the SVN! Although I took a fresh copy of the SVN I had both versions in my lib

Re: Nutch Hadoop 0.20 - Exception

2009-12-09 Thread Dennis Kubes
Done. I have removed the old Jetty jars from the SVN. Thanks for bringing this issue forward. Dennis Eran Zinman wrote: Hi all, thanks Dennis - you helped me solve the problem. The problem was that I had two versions of Jetty in my lib folder. I deleted the old version and voilà - it
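
The duplicate-jar situation described in this thread can be checked for mechanically. The following is a minimal Python sketch, not part of Nutch or Hadoop: it assumes jars are named `<artifact>-<version>.jar` with the version starting with a digit, and the function name `find_duplicate_jars` and the file names used in the usage note are hypothetical illustrations.

```python
import re
from collections import defaultdict

# Assumption: jar files follow the common "<artifact>-<version>.jar" naming,
# where the version part starts with a digit (e.g. "jetty-6.1.14.jar").
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.\-]*)\.jar$")


def find_duplicate_jars(filenames):
    """Return {artifact: sorted list of versions} for every artifact
    that appears under more than one version in the given file names."""
    versions = defaultdict(set)
    for name in filenames:
        match = JAR_NAME.match(name)
        if match:  # names that do not fit the assumed pattern are ignored
            versions[match.group("artifact")].add(match.group("version"))
    return {
        artifact: sorted(found)
        for artifact, found in versions.items()
        if len(found) > 1
    }
```

One possible use is `find_duplicate_jars(os.listdir("lib"))`, run from the Nutch directory after `import os`. With illustrative names such as `jetty-5.1.4.jar` and `jetty-6.1.14.jar` in the list, the result would contain an entry for `jetty` listing both versions, which is the kind of clash that can trigger a sealing violation.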

Nutch 1.0 and Office 2007 documents

2009-12-09 Thread Joe Bell
Hi, I'm also curious as to whether anyone has had success with Nutch and parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same errors as seen here - http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949 Is a

how to force nutch to do a recrawl

2009-12-09 Thread Peters, Vijaya
I'm running Nutch 1.0 on Windows. How do I force Nutch to do a complete recrawl? thanks, - Vijaya Vijaya Peters SRA International, Inc. 4350 Fair Lakes Court North Room 4004 Fairfax, VA 22033 Tel: 703-502-1184 www.sra.com http://www.sra.com/ Named to FORTUNE's 100 Best Companies to

Re: how to force nutch to do a recrawl

2009-12-09 Thread xiao yang
What do you mean by recrawl? Does the following command meet your needs? bin/nutch crawl urls -dir crawl -depth 3 -topN 50 Change the destination directory to one different from the last crawl's. On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya vijaya_pet...@sra.com wrote: I'm running Nutch

RE: how to force nutch to do a recrawl

2009-12-09 Thread Peters, Vijaya
I tried that and it worked a few times, but now I get 0 records selected for fetching. $ bin/nutch crawl urls -dir crawl9a -depth 15 -topN 50 crawl started in: crawl9a rootUrlDir = urls threads = 10 depth = 15 topN = 50 Injector: starting Injector: crawlDb: crawl9a/crawldb Injector: urlDir: urls

Re: how to force nutch to do a recrawl

2009-12-09 Thread MilleBii
Nutch only recrawls a page every 30 days by default, so set numberDays appropriately and it will recrawl. Read nutch-default.xml for the details. 2009/12/9, xiao yang yangxiao9...@gmail.com: What do you mean by recrawl? Does the following command meet your needs? bin/nutch crawl urls -dir

RE: how to force nutch to do a recrawl

2009-12-09 Thread Peters, Vijaya
I tried that too. In nutch-site.xml, I added the property below, but it had no effect. <property> <name>db.default.fetch.interval</name> <value>0</value> <description>(DEPRECATED) The default number of days between re-fetches of a page. value was 30 </description> </property> <property>

Re: how to force nutch to do a recrawl

2009-12-09 Thread xiao yang
What about the configuration in crawl-urlfilter.txt? On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya vijaya_pet...@sra.com wrote: I tried that too. In nutch-site.xml, I added the property below, but it had no effect. <property> <name>db.default.fetch.interval</name> <value>0</value>

RE: how to force nutch to do a recrawl

2009-12-09 Thread Peters, Vijaya
I didn't see a setting to override in crawl-urlfilter. How do I set numberDays? I have regular expressions to include/exclude certain extensions and certain urls, but that's all I have in there. Please send me an example and I'll give it a try. Thanks!

Re: how to force nutch to do a recrawl

2009-12-09 Thread MilleBii
I don't think you can use the nutch crawl command to do that; it is a one-stop-shop command. You probably want to use the individual commands. Type nutch generate to get the help and you will see the -adddays option. Read that page on the wiki to get a feel for how you should do it:

RE: how to force nutch to do a recrawl

2009-12-09 Thread Peters, Vijaya
Okay. I'll dig a little deeper. I saw a few scripts that people had created, but I couldn't get them to work. Thanks much.
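
The scheduling arithmetic discussed in this thread (a 30-day default re-fetch interval, and the -adddays option of nutch generate) can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not Nutch's actual code: it assumes a page becomes eligible once its next fetch time is no later than the current time shifted forward by the -adddays value. The function names `next_fetch_time` and `is_due` and the dates in the usage note are hypothetical.

```python
from datetime import datetime, timedelta

# Default re-fetch interval mentioned in the thread (30 days).
DEFAULT_INTERVAL_DAYS = 30


def next_fetch_time(last_fetch, interval_days=DEFAULT_INTERVAL_DAYS):
    """Earliest time at which a page fetched at `last_fetch` is due again."""
    return last_fetch + timedelta(days=interval_days)


def is_due(last_fetch, now, add_days=0, interval_days=DEFAULT_INTERVAL_DAYS):
    """Simplified model (an assumption, not Nutch's implementation):
    the page is selected when its next fetch time is not later than
    `now` shifted forward by `add_days`."""
    shifted_now = now + timedelta(days=add_days)
    return next_fetch_time(last_fetch, interval_days) <= shifted_now
```

Under this model, a page fetched on `datetime(2009, 12, 1)` is not due on `datetime(2009, 12, 9)` with `add_days=0`, which would be consistent with a generate step selecting nothing shortly after a crawl. With `add_days=31` the same page is due, because the shifted clock has moved past the 30-day interval.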