Hi
Is there a separate mailing list for hadoop right now since it has been
separated from nutch?
Even then, I feel that if any critical bugs are fixed in Hadoop, it
would help if the Nutch group were aware of them.
Interested people can stay abreast of developments in Hadoop as well.
Rgds
Prab
Richard Braman wrote:
I copied in hadoop-default.xml and mapred-default.xml
from hadoop trunk
into conf
and still same error.
What input directory is it looking for?
You didn't say at all what you are trying to do, or what your
environment is. What was the cmd-line? Where is the data l
-Original Message-
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 23, 2006 1:11 AM
To: nutch-user@lucene.apache.org
Sub
But that didn't make it work. Borrowing from the nutch hadoop tutorial, this
is my hadoop-site.xml:
fs.default.name
local
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
mapred.job.trac
Hadoop-site.xml from trunk is missing
fs.default.name
local
The name of the default file system. Either the literal string
"local" or a host:port for NDFS.
This should be added or posted to the Nutch 0.8 tutorial.
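For reference, here is a minimal sketch of how that setting would look as a complete hadoop-site.xml property element. This follows the Hadoop 0.1-era configuration file format; treat it as an illustration rather than the exact tutorial text:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>local</value>
    <description>The name of the default file system. Either the
    literal string "local" or a host:port for NDFS.</description>
  </property>
</configuration>
```

Values placed in hadoop-site.xml override the defaults shipped in hadoop-default.xml, so a file like this is enough to force local (non-NDFS) operation.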
-Original Message-
Fr
I am not trying to use hadoop dfs, this is just a single nutcher and a
single searcher on a single server configuration.
-Original Message-
From: Richard Braman [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 22, 2006 11:41 PM
To: nutch-user@lucene.apache.org
Subject: .08 java.io.IOExcep
Got the following error running inject on Nutch 0.8 trunk. What am I
doing wrong?
060322 235123 parsing
jar:file:/T:/nutch-trunk/lib/hadoop-0.1-dev.jar!/mapred-default.xml
060322 235123 parsing
\tmp\hadoop\mapred\local\job_vdg9ku.xml\localRunner
060322 235123 parsing file:/T:/nutch-trunk/conf/ha
Getting back to nutch after doing some more legwork on PDF parsing, I
got nutch from HEAD and built it. I noticed that there is a .job file
created by the build. Is this something new in 0.8? Can you run nutch
as a scheduled task now?
Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voic
I would forward this to [EMAIL PROTECTED]
-Original Message-
From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED]
Sent: Tuesday, March 21, 2006 12:23 PM
To: nutch-user@lucene.apache.org
Subject: Can't index Japanese PDF
In my quick experiments, Nutch 0.7.1 (with bundled PDFBox
which I thou
In nutch-default.xml, include the plugins for Word and PDF as below.
plugin.includes
protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)
Regular expression naming plugin directory names to
include. Any plugin not matching this expression is exc
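Spelled out as a complete property element, as it might appear in a conf/nutch-site.xml override file, the setting would look roughly like this. The plugin list is a sketch and should match what your installation actually needs:

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|pdf)|index-basic|query-(basic|site|url|jobs)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  </description>
</property>
```

With parse-msword and parse-pdf matched by the expression, Word and PDF documents fetched during a crawl will be parsed and indexed.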
hi there,
Is there any specific setting that needs to be added to the
configuration file in order to crawl and index PDF and
Word files?
thanks,
Michael,
Hi sudhendra,
Thanks for the reply. It's src/java/org/apache/nutch/tools.PruneDB, not
src/java/org/apache/nutch/toos.PruneDB.
Best regards,
Keren
sudhendra seshachala <[EMAIL PROTECTED]> wrote: I guess the problem is with the
package name
src/java/org/apache/nutch/tools.PruneDB and
I guess the problem is with the package name
src/java/org/apache/nutch/tools.PruneDB and
src/java/org/apache/nutch/toos.PruneDB...
Can you please verify again? It seems to be a typo.
Thanks
keren nutch <[EMAIL PROTECTED]> wrote:
Hi Matt,
Thanks for reply. I put Prun
Hi Matt,
Thanks for the reply. I put PruneDB.java in src/java/org/apache/nutch/tools and
ran ant. But when I run 'nutch org.apache.nutch.toos.PruneDB db -s', I get the
error:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/nutch/tools/PruneDB
Please let me know where I'm
I'm puzzled by the claim that "It takes ~4 hours to remove a url from
the webdb.". If you're removing them one at a time, yes, because you
have to rewrite the entire webdb for any change. But you want to
process them in bulk. So it should only take:
= (time to rewrite webdb) + (time to proce
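The argument above can be sketched numerically. This is an illustrative cost model with hypothetical numbers, not a measurement of any actual webdb:

```python
# Illustrative cost model for removing n_urls URLs from the webdb.
# Any change forces a full rewrite of the webdb, so deleting URLs
# one at a time pays that rewrite cost once per URL, while a bulk
# pass pays it exactly once.

def one_at_a_time(n_urls, rewrite_hours, per_url_hours):
    # One full webdb rewrite per removed URL.
    return n_urls * (rewrite_hours + per_url_hours)

def bulk(n_urls, rewrite_hours, per_url_hours):
    # Single rewrite plus per-URL processing.
    return rewrite_hours + n_urls * per_url_hours

if __name__ == "__main__":
    # Hypothetical: a 4-hour rewrite, negligible per-URL cost.
    print(one_at_a_time(400_000, 4.0, 0.0))  # one rewrite per URL
    print(bulk(400_000, 4.0, 0.0))           # a single 4-hour rewrite
```

The gap is a factor of roughly n_urls, which is why bulk processing is the only sensible way to prune a large webdb.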
Actually, we have 11,000,000 URLs in the webdb.
Keren
"Insurance Squared Inc." <[EMAIL PROTECTED]> wrote: We've got a website that is
causing our crawler to slow down (from
20mbits down to 3-5) - 400K pages that are basically not available,
we're just getting 404's. I'd like to remove them
We've got a website that is causing our crawler to slow down (from
20mbits down to 3-5) - 400K pages that are basically not available,
we're just getting 404's. I'd like to remove them from the DB to get
our crawl speed back up again.
Here's what our developer told me - I'm stumped, that seem
(Moved to the proper list)
Raghavendra Prabhu wrote:
Hi
Does the inlink value change solve the OPIC problem that was there,
namely that on a recrawl the page would get a higher score?
Does this fix that problem?
No, it doesn't. But it prevents your linkDB from growing indefinitely, which
Hello Team,
Thanks to Andrzej for his support and a number of high level pointers in the
matter of performance tuning.
I am running the above version of nutch with mapred/ndfs across a cluster of
five servers. One acting as namenode and jobtracker and all acting as
datanodes and tasktrackers.
E