Hi,
I have set up Nutch 1.0 on a cluster of 3 nodes.
We are running two applications.
1. A Nutch-based search application.
We have successfully crawled approx. 25M pages on the 3 nodes.
It's working as expected.
2. An application which needs to extract some information
for particular
Hi,
Sorry to bother you guys again, but it seems that no matter what I do I
can't run the new version of Nutch with Hadoop 0.20.
I am getting the following exceptions in my logs when I execute
bin/start-all.sh
I don't know what to do! I've tried all kinds of things, but with no luck... :(
Eran Zinman wrote:
Do you use the scripts in place, i.e. without deploying the
Hi Andrzej,
Thanks for your help (as always).
Still getting the same exception when running on a standalone Hadoop cluster.
Getting the same exceptions as before; also, in the datanode log I'm getting:
2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call
to 10.0.0.2:9000 failed on
Hi,
Running new Nutch version status:
1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode).
2. Nutch doesn't work when I set it up to work with Hadoop, either in a
single-node or cluster setup.
I'm getting an exception:
ERROR namenode.NameNode - java.lang.SecurityException:
1) Is this a new or existing Hadoop cluster?
2) What Java version are you using and what is your environment?
Dennis
Eran Zinman wrote:
Hi Dennis,
1) I initially tried to run on my existing DFS and it didn't work. I then
made a backup of my DFS, performed a format, and it still didn't work...
2) I'm using:
java version 1.6.0_0
OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12)
OpenJDK Client VM (build
Did you do a fresh install of Nutch with Hadoop 0.20, or did you just
copy over the new jars? A sealing violation means multiple copies of the
same jars are being loaded, and the Jetty versions changed between Hadoop
0.19 and 0.20.
Dennis
Eran Zinman wrote:
Hi Dennis,
Thanks for trying to help.
I'm not sure what exactly counts as a fresh install.
Here is what I've done:
1) Downloaded the latest version of Nutch from SVN into a new folder.
2) Copied all the custom plugins I've written into the new folder.
3) Edited all configuration files.
4) Executed ant
Hi all,
Thanks Dennis - you helped me solve the problem.
The problem was that I had two versions of Jetty in my lib folder.
I deleted the old version and voila - it works.
The problem is that both versions exist in the SVN! Although I took a fresh
copy of the SVN, I had both versions in my lib
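For anyone else hitting this, here is a rough way to spot duplicate jars in the lib folder (the version-stripping pattern is a heuristic of mine, not something from the Nutch docs):

```shell
# List jars whose base name (version suffix stripped) occurs more than
# once in lib/ -- e.g. two jetty versions would print "jetty".
# The version-stripping pattern is a heuristic and may miss some jars.
ls lib/*.jar | sed 's/.*\///; s/-[0-9][0-9.]*\.jar$//' | sort | uniq -d
```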
Done. I have removed the old Jetty jars from the SVN. Thanks for
bringing this issue forward.
Dennis
Eran Zinman wrote:
Hi,
I'm also curious as to whether anyone has had success with Nutch and
parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
errors as seen here:
http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949
Is a
I'm running Nutch 1.0 on Windows. How do I force Nutch to do a complete
recrawl?
thanks,
- Vijaya
Vijaya Peters
SRA International, Inc.
4350 Fair Lakes Court North
Room 4004
Fairfax, VA 22033
Tel: 703-502-1184
www.sra.com
Named to FORTUNE's 100 Best Companies to
What do you mean by recrawl?
Does the following command meet what you need?
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Change the destination directory to a different one from the last crawl.
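For example, something like this (the directory name is just an illustration):

```shell
# Start the recrawl in a fresh directory so the fetch timestamps from
# the previous crawl cannot suppress refetching. "crawl-fresh" is an
# example name; urls/ is the usual seed directory.
bin/nutch crawl urls -dir crawl-fresh -depth 3 -topN 50
```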
On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya vijaya_pet...@sra.com wrote:
I tried that and it worked a few times, but now I get 0 records selected for
fetching.
$ bin/nutch crawl urls -dir crawl9a -depth 15 -topN 50
crawl started in: crawl9a
rootUrlDir = urls
threads = 10
depth = 15
topN = 50
Injector: starting
Injector: crawlDb: crawl9a/crawldb
Injector: urlDir: urls
Nutch only recrawls every 30 days by default. So set numberDays
adequately and it will recrawl; read nutch-default.xml to get the
details.
2009/12/9, xiao yang yangxiao9...@gmail.com:
I tried that too.
In nutch-site.xml, I added the below, but this had no effect.
<property>
  <name>db.default.fetch.interval</name>
  <value>0</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a
  page. Value was 30.
  </description>
</property>
<property>
What about the configuration in crawl-urlfilter.txt?
On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya vijaya_pet...@sra.com wrote:
I didn't see a setting to override in crawl-urlfilter. How do I set
numberDays? I have regular expressions to include/exclude certain extensions
and certain urls, but that's all I have in there.
Please send me an example and I'll give it a try.
Thanks!
Vijaya Peters
SRA International, Inc.
I don't think you can use the nutch crawl command to do that; it is a
one-stop-shop command.
You probably want to use the individual commands.
Type nutch generate to get the help and you will see the option -adddays;
read that page on the wiki to get a feel for how you should do it:
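Roughly, one recrawl cycle with the individual commands could look like this (the paths and the day count are examples; I haven't tested this on your setup):

```shell
# One recrawl cycle using individual commands instead of "crawl".
# -adddays 31 makes pages fetched within the default 30-day interval
# eligible for fetching again. The crawl/ paths are examples.
bin/nutch generate crawl/crawldb crawl/segments -topN 50 -adddays 31
segment=`ls -d crawl/segments/* | tail -1`   # newest segment
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment
```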
Okay. I'll dig a little deeper. I saw a few scripts that people had
created, but I couldn't get them to work.
Thanks much.
Vijaya Peters