Hello,
Sorry if this topic has arisen before, but we're trying to enhance Nutch
to accept on-the-fly injections of new content. In other words, we have
a crawler that feeds page injection commands to an HTTP server - this
server, in turn, adds the URL to the crawldb (if necessary), generates
avoiding duplicates.
CC-
-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 4:10 PM
To: nutch-user@lucene.apache.org
Subject: RE: Scaling Nutch 0.8 via Map/Reduce
Thanks for the response. Having a bunch of 50GB segments is more
If you're not worried about scaling up, I think version 0.7.x should be
just fine. I've been working on migrating an 0.7.1 prototype
(integrated with existing infrastructure) to 0.8 -- it's not trivial.
:-) The file and folder structures have changed.
DaveG
-----Original Message-----
From:
that 50GB or so, and do fast recovery via a sync process and a check
daemon.
NOTE: At the time we built our solution NDFS was not production quality
-- not sure where things stand now.
-----Original Message-----
From: Goldschmidt, Dave [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 3
that transform things
from one format to the other, but require some java and nutch skills.
I suggest just start from scratch.
Stefan
On 23.12.2005 at 16:51, Goldschmidt, Dave wrote:
Hello, I know I've posted this question before, but I've had no success
in using a 70GB 0.7.1 segment
Hello, yes, I think 0.7.x without NDFS and Map/Reduce should be fine for
one-machine applications. You should be able to set up Nutch without
NDFS or Map/Reduce by setting the fs.default.name and mapred.job.tracker
properties to local (I think these may already be the default).
HTH,
DaveG
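As a rough illustration of the reply above, the two properties it names could be set with entries like the following. This is only a sketch: the file location (conf/nutch-site.xml) is an assumption, and the enclosing root element is left out, so copy it from the existing configuration file rather than from here. As the reply notes, these values may already be the defaults, in which case no override is needed.

```xml
<!-- Hypothetical override entries; the file name and placement are assumptions. -->
<property>
  <name>fs.default.name</name>
  <value>local</value>  <!-- use the local filesystem instead of NDFS -->
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>  <!-- run jobs in-process, without a separate job tracker -->
</property>
```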
Marko, thanks also from me! Just found this in the archives
DaveG
Hi Mike,
Exception in thread main java.lang.NullPointerException
at org.apache.nutch.mapred.JobTrackerInfoServer.init(JobTrackerInfoServer.java:67)
at
This is a little bug. You must copy
Hello, I ran into the same problem (which I think is fixed in future
releases). For Nutch 0.7.1, just create the missing directories and run
the ant script again.
HTH,
DaveG
-----Original Message-----
From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED]
Sent: Monday, December 19, 2005 2:38 PM
it searchable.
updatedb doesn't affect the segments. it just updates the webdb with the
results of the fetch. so if you already ran updatedb on the pre-merged
segments, you don't need to run it again.
On 12/8/05, Goldschmidt, Dave [EMAIL PROTECTED] wrote:
Hi, just wanted to be sure - after I
Hi, just wanted to be sure - after I merge segments via the mergesegs
tool, I need to use the updatedb tool before dropping the new indexes
in, correct?
And, as just posted, I need to shut down and restart Tomcat, too, yes?
Thanks,
DaveG
Hello,
I'm currently aiming for a ~7M page index - but what exactly am I aiming
for? :-)
In other words, should I end up with only ONE Nutch segment when I'm
done? Currently, I have 52. Will having one Nutch segment be quickest
for searches?
If so, should I perform the merge
05.12.2005 at 22:24, Goldschmidt, Dave wrote:
Hello,
In searching for solutions, I found an old post from Doug on tuning
these parameters -- but this old message applied to ~30,000 documents
only:
http://marc.theaimsgroup.com/?l=lucene-user&m=110235452004799&w=2
I've upped both the mergeFactor
Hello,
I'm currently indexing ~50 segments, each ~2GB in size, for a total of
only ~7,000,000 pages. From the log output, I see an index rate of ~72
records/second. Doing the math, this is over 24 hours of time to index
these segments.
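The estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the two figures quoted in the message (~7,000,000 pages at ~72 records/second); the function name is hypothetical, and it assumes the rate stays constant for the whole run.

```python
def estimate_index_hours(total_pages: int, records_per_second: float) -> float:
    """Return the estimated indexing time in hours at a constant rate."""
    seconds = total_pages / records_per_second
    return seconds / 3600


# Figures quoted in the message above.
hours = estimate_index_hours(7_000_000, 72)
print(f"Estimated indexing time: about {hours:.1f} hours")
```

At these figures the result is roughly 27 hours, which agrees with the "over 24 hours" stated in the message.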
Does this sound slow? If so, any suggestions as to
of your nutch-site settings yet?
-byron
--- Goldschmidt, Dave [EMAIL PROTECTED] wrote:
Hello,
I'm currently indexing ~50 segments, each ~2GB in size, for a total of
only ~7,000,000 pages. From the log output, I see an index rate of ~72
records/second. Doing the math, this is over 24
Hello, MapReduce is described on Nutch's Wiki:
http://wiki.apache.org/nutch/Presentations
Specifically:
http://wiki.apache.org/nutch-data/attachments/Presentations/attachments/mapred.pdf
Hope this helps,
DaveG
-----Original Message-----
From: Gal Nitzan [mailto:[EMAIL PROTECTED]
Sent: