rank system

2005-11-08 Thread Anton Potehin
What about scoring in mapred? I have looked at crawl/crawl.java but did not find anything related to calculating page scores. Does mapred use a ranking system somehow? Is it possible to use mapred for clustered whole-web crawling, or does it work only with intranet crawling?

Re: rank system

2005-11-08 Thread Stefan Groschupf
Pre score calculation is done in the indexer. Yes, it works with complete web crawls as well, and it works very well for that. :-) Stefan On 08.11.2005 at 11:22, Anton Potehin wrote: What about scoring in mapred? I have looked crawl/crawl.java but I did not found anything concerned with

RE: rank system

2005-11-08 Thread anton
Alright, I see that in crawl/Indexer.java, in the reduce method, there is a dbDatum object that contains a score. But where is this score calculated? What formula is used to calculate it? -Original Message- From: Stefan Groschupf [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 08, 2005 1:54 PM To:

questions

2005-11-08 Thread Anton Potehin
After looking through Crawl.java, I broke the work down into several phases:
1) Inject - here we add web-links into crawlDb
2) Generate segment - here we create data segment
3) Fetching
4) Parse segment
5) Update crawlDb - here the information is added from segment
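
The phases listed above can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: the class `CrawlCycleSketch`, its methods, and the `fakeWeb` map are hypothetical names made up for illustration and are not Nutch classes or APIs. It shows only how one cycle might feed the next, with the crawlDb growing after each update.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the five phases. All names here are hypothetical, not Nutch's.
class CrawlCycleSketch {
    // crawlDb: URL -> true once the URL has been fetched
    final Map<String, Boolean> crawlDb = new LinkedHashMap<>();
    // Stand-in for the web (assumption): URL -> outlinks found on that page
    final Map<String, List<String>> fakeWeb;

    CrawlCycleSketch(Map<String, List<String>> fakeWeb) {
        this.fakeWeb = fakeWeb;
    }

    // 1) Inject: add seed links to the crawlDb
    void inject(List<String> seeds) {
        for (String url : seeds) {
            crawlDb.putIfAbsent(url, false);
        }
    }

    // 2) Generate: build a segment from the URLs that are not fetched yet
    List<String> generate() {
        List<String> segment = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : crawlDb.entrySet()) {
            if (!e.getValue()) {
                segment.add(e.getKey());
            }
        }
        return segment;
    }

    // 3) Fetch and 4) Parse: look up each page and collect its outlinks
    Map<String, List<String>> fetchAndParse(List<String> segment) {
        Map<String, List<String>> parsed = new LinkedHashMap<>();
        for (String url : segment) {
            parsed.put(url, fakeWeb.getOrDefault(url, Collections.emptyList()));
        }
        return parsed;
    }

    // 5) Update: fold the segment's results back into the crawlDb
    void update(Map<String, List<String>> parsed) {
        for (Map.Entry<String, List<String>> e : parsed.entrySet()) {
            crawlDb.put(e.getKey(), true);           // mark as fetched
            for (String out : e.getValue()) {
                crawlDb.putIfAbsent(out, false);     // newly discovered link
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("http://example.com/",
                Arrays.asList("http://example.com/a", "http://example.com/b"));
        CrawlCycleSketch crawl = new CrawlCycleSketch(web);
        crawl.inject(Arrays.asList("http://example.com/"));
        for (int round = 1; round <= 2; round++) {
            List<String> segment = crawl.generate();
            crawl.update(crawl.fetchAndParse(segment));
            System.out.println("round " + round + " fetched " + segment);
        }
    }
}
```

In this toy model each round fetches only what earlier rounds discovered, which is why the update step has to run before the next generate.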

RE: questions

2005-11-08 Thread anton
Does this mean that the job in each phase can be split across several machines (for example, generate or any of the other phases could run in parallel on several machines)? Could you give us the URL of the presentation on the wiki, please? -Original Message- From: Stefan Groschupf [mailto:[EMAIL PROTECTED]

Re: Index update and Google Dance

2005-11-08 Thread Stefan Groschupf
Nutch uses the concept of segments, and yes, you can update part of the index by just deleting older segments and generating / fetching new segments. Stefan On 08.11.2005 at 18:38, Jack Tang wrote: Hi, I read the GFS document and the NFS document on the wiki. One interesting question here:

Re: mapred bug -- bad part calculation?

2005-11-08 Thread Doug Cutting
Rod Taylor wrote: The attached patches for Generator.java and Injector.java allow a specific temporary directory to be specified. This gives Nutch the full path to these temporary directories and seems to fix the No input directories issue when using a local filesystem with multiple task

mapreduce with large amounts of data

2005-11-08 Thread Andrew McNabb
I'm starting some work using Nutch's MapReduce for parallel computation unrelated to web indexing. The last few days I've been becoming familiar with how the implementation works, and I've been very impressed. I ran some tests using the Grep demo to get a feel for how it works with large files,

Re: Index update and Google Dance

2005-11-08 Thread Andrzej Bialecki
Jack Tang wrote: Hi Stefan, deleting is totally OK if there are NO references to the chunks (segments). Also, will the master balance the search requests? Say there are 3 slaves: Slave 1, 2, 3, and three copies of the chunks are distributed on the slaves. If slave 1 is 90% busy, and 2 is 80% busy, 3 is