Thank you for the explanation. It was a bit confusing at first, but it
actually makes sense.
Florent
Doğacan Güney wrote:
Hi,
On 5/17/07, Florent Gluck [EMAIL PROTECTED] wrote:
Hi all,
I've noticed that when doing a segment dump using readseg, several
instances of the same CrawlDatum can
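For readers following along, a segment dump of the kind mentioned above can be sketched like this. This is a dry-run sketch (DRY=echo only prints the commands); the segment directory name is a placeholder, not one from this thread, and it assumes a Nutch 0.8-era `bin/nutch` on your PATH.

```shell
# Dry-run sketch: DRY=echo prints the commands instead of running them.
# Clear DRY to actually run against a real crawl.
DRY=echo
SEGMENT=segments/20060101000000   # placeholder: pick a real dir under segments/
$DRY bin/nutch readseg -dump "$SEGMENT" dump_out
# The dump lands in dump_out/dump as plain text, one record per entry.
```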
Hi Andrzej,
Well, I think for now I'll just disable the parse-js plugin since I
don't really need it anyway.
I'll let you know if I ever work on it (I may need it in the future).
Thanks,
--Flo
Andrzej Bialecki wrote:
Florent Gluck wrote:
Some urls are totally bogus. I didn't investigate
Hi,
I'm using nutch revision 385671 from the trunk. I'm running it on a
single machine using the local filesystem.
I just started with a seed of one single url: http://www.osnews.com
Then I ran a crawl cycle of depth 2 (generate/fetch/updatedb) and
dumped the crawl db. Here is where I got quite
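The depth-2 cycle described above can be sketched as a dry-run script. DRY=echo only prints the commands (clear it to actually run them); `seeds`, `crawldb`, and `segments/latest` are placeholder paths, not the exact ones from this thread, and after `generate` you would substitute the newest directory under `segments/`.

```shell
# Dry-run sketch of a depth-2 generate/fetch/updatedb cycle.
DRY=echo
$DRY bin/nutch inject crawldb seeds        # seed list, e.g. http://www.osnews.com
for round in 1 2; do                       # depth 2 = two rounds
  $DRY bin/nutch generate crawldb segments
  SEGMENT=segments/latest                  # placeholder: newest dir under segments/
  $DRY bin/nutch fetch "$SEGMENT"
  $DRY bin/nutch updatedb crawldb "$SEGMENT"
done
$DRY bin/nutch readdb crawldb -dump crawldb_dump   # dump the crawl db, as above
```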
master machine. We've increased this twice now, and each time it solved
similar problems. We now have it at 16K. See my other post today (re: Corrupt
NDFS?) for more details.
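Chris's fix reads like raising a per-process limit on the master; assuming the "16K" value is the open-file-descriptor limit (an assumption, since the message doesn't name the setting), checking and raising it on Linux looks like this:

```shell
# Report the current per-process open-file limit for this shell.
CURRENT=$(ulimit -n)
echo "open-file limit: $CURRENT"
# Raising it (e.g. to 16384, the value mentioned above) may require root
# or an entry in /etc/security/limits.conf:
# ulimit -n 16384
```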
Good Luck,
- Chris
At 11:07 AM -0500 2/10/06, Florent Gluck wrote:
Hi,
I have 4 boxes (1 master, 3 slaves), about
Hi,
I have 4 boxes (1 master, 3 slaves), about 33GB worth of segment data
and 4.6M fetched urls in my crawldb. I'm using the mapred code from
trunk (revision 374061, Wed, 01 Feb 2006).
I was able to generate the indexes from the crawldb and linkdb, but I
started to see this error recently while
of Generator.java, but it didn't change the
situation.
I'll try to do some more testing.
Thanks, Mike
On 1/19/06, Doug Cutting [EMAIL PROTECTED] wrote:
Florent Gluck wrote:
I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead
affect the local crawl as well since they
have nothing to do w/ ndfs.
It therefore seems that /protocol-httpclient/ is reliable enough to be
used (well, at least in my case).
--Flo
Florent Gluck wrote:
Andrzej Bialecki wrote:
Could you please check (on a smaller sample ;-) ) which of these two
Hi Mike,
Your different tests are really interesting, thanks for sharing!
I didn't do as many tests. I changed the number of fetch threads and the
number of map and reduce tasks and noticed that it gave me quite
different results in terms of pages fetched.
Then, I wanted to see if this issue
Ken Krugler wrote:
Hello fellow Nutchers,
I followed the steps described here by Doug:
http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL PROTECTED]
...to start a test run of the new (0.8, as of 1/12/2006) version of
Nutch.
It ran for quite a while on my
I'm having the exact same problem.
I noticed that changing the number of map/reduce tasks gives me
different DB_fetched results.
Looking at the logs, a lot of urls are actually missing. I can't find
their trace *anywhere* in the logs (whether on the slaves or the
master). I'm puzzled. Currently
Hi,
I'm running nutch trunk as of today. I have 3 slaves and a master. I'm
using *mapred.map.tasks=20* and *mapred.reduce.tasks=4*
There is something I'm really confused about.
When I inject 25000 urls and fetch them (depth = 1) and do a readdb
-stats, I get:
060110 171347 Statistics for
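The task counts quoted in the message above would normally live in a config override. A minimal sketch of the corresponding `nutch-site.xml` fragment, using the property names exactly as given in the message (the file name and surrounding `<configuration>` element are the usual convention, not quoted from this thread):

```xml
<!-- nutch-site.xml override with the task counts from the message above -->
<property>
  <name>mapred.map.tasks</name>
  <value>20</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
```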
Pushpesh,
We extended nutch with a whitelist filter and you might find it useful.
Check the comments from Matt Kangas here:
http://issues.apache.org/jira/browse/NUTCH-87;jsessionid=6F6AD5423357184CF57B51B003201C49?page=all
--Flo
Pushpesh Kr. Rajwanshi wrote:
hmmm... actually my requirement is
Hi,
I'm using the map reduce branch, 1 master and 3 slaves, and they are
configured the standard way (master as a jobtracker + namenode)
After having created an index, I run dedup on it, but I get a
IOException. Here is an extract of the log:
051215 160733 Dedup: starting
051215 160733 Dedup:
Sorry for such a basic question, but how do we run a search on a
generated index ?
I read about how to set up tomcat w/ nutch 0.8 and you have to run it
in the directory where the index resides (apparently it looks in the
segments dir from where it's run). However this won't work w/ nutch 0.8
Err, I meant nutch server, not nutch search
--Flo
Florent Gluck wrote:
Sorry for such a basic question, but how do we run a search on a
generated index ?
I read about how to setup tomcat w/ nutch 0.8 and you have to run it
in the directory where the index resides (apparently it looks
Never mind, I got tomcat working.
After looking at the code, it seems nutch parse does nothing yet.
The last remaining thing is how to use NutchBean to output the segments'
content.
Thanks,
--Flo
Florent Gluck wrote:
Sorry for such a basic question, but how do we run a search on a
generated
Hi,
As a test, I recently did a quick incremental crawl. First, I did a
crawl with 10 seed urls using 4 nodes (1 jobTracker/nameNode + 3
taskTrackers/dataNodes). So far, so good, the fetches were distributed
among the 3 nodes (3/3/4) and a segment was generated. Running a quick
-stats on the