What about scoring in mapred? I have looked at crawl/Crawl.java but did
not find anything concerned with calculating page scores. Does
mapred use a ranking system somehow?
Is it possible to use mapred for clustered whole-web crawling, or does it
work with intranet crawling only?
The page score calculation is done in the indexer.
Yes, it works with complete web crawls as well, and it works very well
for that. :-)
Stefan
On 08.11.2005 at 11:22, Anton Potehin wrote:
Alright, I see that in crawl/Indexer.java, in the reduce method, there is an
object of class dbDatum which contains the score. But where is this score
calculated? What formula is used to calculate it?
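A rough sketch of the kind of formula involved: the indexer can damp the
CrawlDb score with a configurable exponent before using it as the Lucene
document boost. The class name, method name, and the 0.5 exponent below are
illustrative assumptions in the spirit of Nutch's "indexer.score.power"
setting, not the exact shipped code.

```java
// Hypothetical sketch: turning a CrawlDb page score into an index-time
// document boost. The square-root damping (exponent 0.5) is an assumed
// default; a real setup would read it from the configuration.
public class ScoreBoost {
    // Damping exponent applied to the raw CrawlDb score (assumption).
    static final double SCORE_POWER = 0.5;

    static float boost(float crawlDbScore) {
        return (float) Math.pow(crawlDbScore, SCORE_POWER);
    }

    public static void main(String[] args) {
        // A page with score 4.0 gets boost 2.0 under this damping.
        System.out.println(boost(4.0f)); // prints 2.0
    }
}
```

Damping keeps pages with very large link-derived scores from completely
dominating the ranking at query time.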
-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 08, 2005 1:54 PM
To:
After looking through Crawl.java, I split all the tasks into several phases:
1) Inject - here we add web links into the crawlDb
2) Generate segment - here we create a data segment
3) Fetching
4) Parse segment
5) Update crawlDb - here the information from the segment is added
Does this mean that the job at every phase may be distributed across several
machines (for example, generate or any of the other phases may be performed
in parallel on several machines)?
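Each of the phases above runs as its own MapReduce job, so each one can be
partitioned across machines. As a toy illustration, the "update crawlDb"
phase can be written as a reduce: all segment entries for one URL key arrive
together, and the merged entry keeps the newest fetch time and the highest
score. The class and field names here are illustrative, not Nutch's actual
API, and the merge policy is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Toy reduce for an "update crawlDb" style phase: merge the existing
// CrawlDb entry for a URL with new entries from a fetched segment.
// Because the input is partitioned by URL key, many reducers can run
// on many machines at once.
public class UpdateDbSketch {
    static class Datum {
        final long fetchTime;
        final float score;
        Datum(long fetchTime, float score) {
            this.fetchTime = fetchTime;
            this.score = score;
        }
    }

    // All values for one URL key arrive together; keep the newest
    // fetch time and the best score (assumed merge policy).
    static Datum reduce(List<Datum> values) {
        long newest = Long.MIN_VALUE;
        float best = Float.NEGATIVE_INFINITY;
        for (Datum d : values) {
            newest = Math.max(newest, d.fetchTime);
            best = Math.max(best, d.score);
        }
        return new Datum(newest, best);
    }

    public static void main(String[] args) {
        Datum merged = reduce(Arrays.asList(
                new Datum(100L, 1.0f), new Datum(200L, 0.5f)));
        System.out.println(merged.fetchTime + " " + merged.score); // prints 200 1.0
    }
}
```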
Could you give us the URL for the presentation on the wiki, please?
-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Nutch uses the concept of segments, and yes, you are able to update
part of the index by just deleting older segments and generating /
fetching new segments.
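Since Nutch names segment directories with a timestamp (yyyyMMddHHmmss), a
simple pruning policy can select segments older than a cutoff for deletion
once newer segments have re-fetched their pages. The cutoff and the
re-fetch guarantee are assumptions in this sketch; only the name-based
selection is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: pick expired segments by their timestamp-style directory
// names. Fixed-width yyyyMMddHHmmss names sort lexicographically in
// date order, so a plain string comparison suffices.
public class SegmentPrune {
    static List<String> expired(List<String> segments, String cutoff) {
        List<String> old = new ArrayList<>();
        for (String s : segments)
            if (s.compareTo(cutoff) < 0) // older than the cutoff timestamp
                old.add(s);
        return old;
    }

    public static void main(String[] args) {
        List<String> segs = Arrays.asList("20051001120000", "20051108093000");
        // Everything before November 1st is a candidate for deletion.
        System.out.println(expired(segs, "20051101000000")); // prints [20051001120000]
    }
}
```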
Stefan
On 08.11.2005 at 18:38, Jack Tang wrote:
Hi
I read the GFS document and the NFS document on the wiki. One interesting
question here:
Rod Taylor wrote:
The attached patches for Generator.java and Injector.java allow a
specific temporary directory to be specified. This gives Nutch the full
path to these temporary directories and seems to fix the "No input
directories" issue when using a local filesystem with multiple task
I'm starting some work using Nutch's MapReduce for parallel computation
unrelated to web indexing. Over the last few days I've been getting
familiar with how the implementation works, and I've been very
impressed.
I ran some tests using the Grep demo to get a feel for how it works with
large files,
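For readers unfamiliar with the Grep demo: its core is just a map phase that
emits the lines matching a pattern, with an identity reduce collecting them.
The sketch below simulates that map step in a plain loop; in the real demo
the input splits are processed on many machines in parallel.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Toy version of a Grep-style map phase: each mapper receives a split
// of the input lines and emits only those matching the pattern.
public class GrepMap {
    static List<String> map(List<String> lines, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String line : lines)
            if (p.matcher(line).find())
                hits.add(line);
        return hits;
    }

    public static void main(String[] args) {
        List<String> in = Arrays.asList("fetch ok", "parse failed", "fetch retry");
        System.out.println(map(in, "fetch")); // prints [fetch ok, fetch retry]
    }
}
```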
Jack Tang wrote:
Hi Stefan
Deleting is totally OK if there are no references to the chunks (segments).
Also, will the master balance the search requests? Say there are 3
slaves: slave 1, 2, and 3,
and three copies of the chunks are distributed across the slaves. If slave 1
is 90% busy, and slave 2 is 80% busy, slave 3 is
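The policy being asked about can be sketched as least-loaded replica
selection: if a chunk is replicated on several slaves, the master routes
each search request to the least-busy replica. Whether Nutch's master
actually does this is exactly the open question above; the code only
illustrates the policy, and the load numbers are the ones from the example.

```java
// Sketch of least-loaded replica selection: among the slaves holding a
// replica of the needed chunk, pick the one with the lowest load.
public class PickReplica {
    static int leastBusy(double[] load, int[] replicas) {
        int best = replicas[0];
        for (int r : replicas)
            if (load[r] < load[best])
                best = r;
        return best;
    }

    public static void main(String[] args) {
        // Slaves 1..3 from the example, 0-indexed: 90%, 80%, and lightly loaded.
        double[] load = {0.90, 0.80, 0.10};
        System.out.println(leastBusy(load, new int[]{0, 1, 2})); // prints 2
    }
}
```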