Hi,
As a test, I recently did a quick incremental crawl. First, I did a
crawl with 10 seed urls using 4 nodes (1 jobTracker/nameNode + 3
taskTrackers/dataNodes). So far, so good: the fetches were distributed
among the 3 nodes (3/3/4) and a segment was generated. Running a quick
-stats on the
Sorry for such a basic question, but how do we run a search on a
generated index?
I read about how to set up tomcat w/ nutch 0.8 and you have to run it
in the directory where the index resides (apparently it looks in the
segments dir from where it's run). However, this won't work w/ nutch 0.8
Err, my mistake: I meant nutch server, not nutch search
--Flo
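For what it's worth, with the 0.8 webapp you shouldn't have to start tomcat from the index directory: the searcher.dir property tells the search code where to look. A minimal nutch-site.xml snippet for the deployed webapp, assuming the crawl output lives under /path/to/crawl:

<property>
  <name>searcher.dir</name>
  <!-- /path/to/crawl is a placeholder: the directory containing index/ and segments/ -->
  <value>/path/to/crawl</value>
</property>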
Florent Gluck wrote:
Sorry for such a basic question, but how do we run a search on a
generated index?
I read about how to set up tomcat w/ nutch 0.8 and you have to run it
in the directory where the index resides (apparently it looks
Never mind, I got tomcat working.
After looking at the code, it seems nutch parse does nothing yet.
The last remaining thing is how to use NutchBean to output the segments'
content.
Thanks,
--Flo
Florent Gluck wrote:
Sorry for such a basic question, but how do we run a search on a
generated
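A minimal sketch of what searching and pulling content back out with NutchBean can look like against the 0.8-era API (class and method names from org.apache.nutch.searcher as I remember them; exact imports may differ by revision, so treat this as an illustration rather than a verified listing):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SearchDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // NutchBean locates the index and segments via the searcher.dir property
    NutchBean bean = new NutchBean(conf);
    Query query = Query.parse(args[0], conf);
    Hits hits = bean.search(query, 10);
    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);
      // the raw page bytes come back out of the segments
      byte[] content = bean.getContent(details);
      System.out.println(details.getValue("url") + " (" + content.length + " bytes)");
    }
  }
}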
Hi,
I'm using the map reduce branch, 1 master and 3 slaves, and they are
configured the standard way (master as a jobtracker + namenode).
After having created an index, I run dedup on it, but I get an
IOException. Here is an extract of the log:
051215 160733 Dedup: starting
051215 160733 Dedup:
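For reference, dedup here was run from the command line against the index directory, along these lines (the path is a placeholder):

bin/nutch dedup crawl/indexes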
Pushpesh,
We extended nutch with a whitelist filter and you might find it useful.
Check the comments from Matt Kangas here:
http://issues.apache.org/jira/browse/NUTCH-87;jsessionid=6F6AD5423357184CF57B51B003201C49?page=all
--Flo
Pushpesh Kr. Rajwanshi wrote:
hmmm... actually my requirement is
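In case it helps to see the shape of such a plugin: a whitelist filter hooks into the URLFilter extension point. The sketch below is not the NUTCH-87 code, just an illustration of the interface (depending on the revision, a real plugin may also need the Configurable plumbing):

import java.net.URL;
import java.util.HashSet;
import java.util.Set;

import org.apache.nutch.net.URLFilter;

// Hypothetical whitelist filter -- illustrative only, not the NUTCH-87 patch.
public class WhitelistURLFilter implements URLFilter {

  private Set allowedHosts = new HashSet();

  public WhitelistURLFilter() {
    // assumption: a real plugin would read its host list from a config file
    allowedHosts.add("www.example.com");
  }

  // Returning the url keeps it; returning null drops it from the crawl.
  public String filter(String urlString) {
    try {
      String host = new URL(urlString).getHost().toLowerCase();
      return allowedHosts.contains(host) ? urlString : null;
    } catch (Exception e) {
      return null;
    }
  }
}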
Hi,
I'm running nutch trunk as of today. I have 3 slaves and a master. I'm
using *mapred.map.tasks=20* and *mapred.reduce.tasks=4*
There is something I'm really confused about.
When I inject 25000 urls and fetch them (depth = 1) and do a readdb
-stats, I get:
060110 171347 Statistics for
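For reference, those two values are plain job-conf properties, set with something like this (values as in the message above; which conf file they live in depends on your setup):

<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>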
Ken Krugler wrote:
Hello fellow Nutchers,
I followed the steps described here by Doug:
http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL
PROTECTED]
...to start a test run of the new (0.8, as of 1/12/2006) version of
Nutch.
It ran for quite a while on my
I'm having the exact same problem.
I noticed that changing the number of map/reduce tasks gives me
different DB_fetched results.
Looking at the logs, a lot of urls are actually missing. I can't find
their trace *anywhere* in the logs (whether on the slaves or the
master). I'm puzzled. Currently
Hi Mike,
Your different tests are really interesting, thanks for sharing!
I didn't do as many tests. I changed the number of fetch threads and the
number of map and reduce tasks and noticed that it gave me quite
different results in terms of pages fetched.
Then, I wanted to see if this issue
of Generator.java, but it didn't change the
situation.
I'll try to do some more testing.
Thanks, Mike
On 1/19/06, Doug Cutting [EMAIL PROTECTED] wrote:
Florent Gluck wrote:
I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead
affect the local crawl as well since they
have nothing to do w/ ndfs.
It therefore seems that /protocol-httpclient/ is reliable enough to be
used (well, at least in my case).
--Flo
Florent Gluck wrote:
Andrzej Bialecki wrote:
Could you please check (on a smaller sample ;-) ) which of these two
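The plugin swap mentioned above happens in the plugin.includes property. A sketch of what the edited value can look like, with protocol-http substituted for protocol-httpclient (the rest of the list is illustrative; the exact default varies by revision):

<property>
  <name>plugin.includes</name>
  <!-- protocol-http swapped in for protocol-httpclient; other entries illustrative -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>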
Hi,
I have 4 boxes (1 master, 3 slaves), about 33GB worth of segment data
and 4.6M fetched urls in my crawldb. I'm using the mapred code from
trunk (revision 374061, Wed, 01 Feb 2006).
I was able to generate the indexes from the crawldb and linkdb, but I
started to see this error recently while
master machine. We've increased this twice now, and each time it solved
similar problems. We now have it at 16K. See my other post today (re: Corrupt
NDFS?) for more details.
Good Luck,
- Chris
At 11:07 AM -0500 2/10/06, Florent Gluck wrote:
Hi,
I have 4 boxes (1 master, 3 slaves), about
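Assuming the limit Chris refers to is the per-process open-file limit (the 16K figure suggests file descriptors), it is typically raised in the shell that launches the daemons, e.g.:

ulimit -n 16384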
Hi,
I'm using nutch revision 385671 from the trunk. I'm running it on a
single machine using the local filesystem.
I just started with a seed of a single url: http://www.osnews.com
Then I ran a crawl cycle of depth 2 (generate/fetch/updatedb) and
dumped the crawl db. Here is where I got quite
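For reference, that cycle maps onto the individual commands roughly like this (paths are placeholders, and SEGMENT stands for the timestamped directory that generate creates):

bin/nutch generate crawl/crawldb crawl/segments
bin/nutch fetch crawl/segments/SEGMENT
bin/nutch updatedb crawl/crawldb crawl/segments/SEGMENT
bin/nutch readdb crawl/crawldb -dump crawldump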
Hi Andrzej,
Well, I think for now I'll just disable the parse-js plugin since I
don't really need it anyway.
I'll let you know if I ever work on it (I may need it in the future).
Thanks,
--Flo
Andrzej Bialecki wrote:
Florent Gluck wrote:
Some urls are totally bogus. I didn't investigate
Thank you for the explanation. It was a bit confusing at first, but it
actually makes sense.
Florent
Doğacan Güney wrote:
Hi,
On 5/17/07, Florent Gluck [EMAIL PROTECTED] wrote:
Hi all,
I've noticed that when doing a segment dump using readseg, several
instances of the same CrawlDatum can
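For reference, the dump in question comes from the segment reader, invoked along these lines (paths are placeholders):

bin/nutch readseg -dump crawl/segments/SEGMENT segdump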