Florent Gluck wrote:
Both return more or less the same results (with a difference of ~1.5% in
the number of fetches, which is not surprising on a 100k set).
I checked the logs and in both cases I see exactly 100'000 fetch attempts.
You were right, it actually makes sense that the settings in
/mapred-default.xml/ ...
Andrzej,
I ran 2 crawls of 1 pass each, injecting 100'000 urls.
Here is the output of /readdb -stats/ when crawling with /protocol-http/:
060123 162250 TOTAL urls: 119221
060123 162250 avg score:1.023
060123 162250 max score:240.666
060123 162250 min score:1.0
...
Andrzej Bialecki wrote:
> Florent Gluck wrote:
>
>> Hi Mike,
>>
>> I finally got everything working properly!
>> What I did was to switch to /protocol-http/ and move the following from
>> /nutch-site.xml/ to /mapred-default.xml/:
>>
>
>
> Could you please check (on a smaller sample ;-) ) which of these two
> changes was necessary? ...
Florent Gluck wrote:
Hi Mike,
I finally got everything working properly!
What I did was to switch to /protocol-http/ and move the following from
/nutch-site.xml/ to /mapred-default.xml/:
Could you please check (on a smaller sample ;-) ) which of these two
changes was necessary? First, second, or both? ...
Hi Mike,
I finally got everything working properly!
What I did was to switch to /protocol-http/ and move the following from
/nutch-site.xml/ to /mapred-default.xml/:
<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
  <description>The default number of map tasks per job. Typically set
  to a prime several times greater than number of available hosts.
  </description>
</property>
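(The thread also varies the number of reduce tasks; a matching entry in
/mapred-default.xml/ would look roughly like this. The value is purely
illustrative and the description text is assumed from the stock config:)

<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>
  <description>The default number of reduce tasks per job. Typically set
  to a prime close to the number of available hosts.
  </description>
</property>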
Hi Florent,
Thanks for the inquiry and reply. I did some more tests based on your
suggestion.
Using the old protocol-http, the problem is solved on a single machine. But
when I have datanodes running on two other machines, the problem still
exists, though the number of unfetched pages is smaller than before.
Florent Gluck wrote:
I then decided to switch to using the old HTTP protocol plugin,
protocol-http (in nutch-default.xml), instead of protocol-httpclient.
With the old protocol I got 5 as expected.
There have been a number of complaints about unreliable fetching with
protocol-httpclient, so ...
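(For reference, the protocol plugin is selected via the plugin.includes
property; a minimal sketch, assuming a stock plugin list that you would
adapt to your own setup:)

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Swap protocol-httpclient for protocol-http here.</description>
</property>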
Hi Mike,
Your different tests are really interesting, thanks for sharing!
I didn't do as many tests. I changed the number of fetch threads and the
number of map and reduce tasks, and noticed that it gave me quite
different results in terms of pages fetched.
Then, I wanted to see if this issue would ...
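(The fetch-thread count mentioned above maps to the fetcher.threads.fetch
property in the Nutch config; the value shown is just the usual default,
not a recommendation:)

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of FetcherThreads the fetcher should use.</description>
</property>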
Hi Florent,
I did some more testing. Here are the results:
I have 3 machines, P4 with 1G RAM. All three are datanodes and one is
the namenode. I started from 8 seed urls and tried to see the effect of
a depth 1 crawl for different configurations.
The number of unfetched pages changes with different configurations ...
I've experienced the same effect. When I decrease the number of map/reduce
tasks, I can fetch more web pages, but increasing them increases the number
of unfetched pages. I also get some "java.net.SocketTimeoutException: Read
timed out" exceptions in my datanode log files. But those timeout problems
couldn't cause ...
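(It isn't clear from the snippet whether these timeouts come from the IPC
layer or from the data-transfer sockets; if it is the former, which is an
assumption on my part, the relevant knob in the early configs is
ipc.client.timeout:)

<property>
  <name>ipc.client.timeout</name>
  <value>120000</value>
  <description>Timeout for IPC calls, in milliseconds. The 120000 here is
  only an illustrative bump over the usual 60000 default.
  </description>
</property>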
I'm having the exact same problem.
I noticed that changing the number of map/reduce tasks gives me
different DB_fetched results.
Looking at the logs, a lot of urls are actually missing. I can't find
their trace *anywhere* in the logs (whether on the slaves or the
master). I'm puzzled. Currently ...
Hi,
I have set up four boxes using MapReduce and everything goes smoothly. I
fed in about 8 seed urls to begin with and crawled to depth 2.
Only 1900 pages (about 300 MB) of data were fetched and the rest is marked
as db_unfetched.
Does anyone know what could be wrong?
This is the output of (bin/nutch ...