Re: OPIC scoring differences

2007-07-11 Thread Doğacan Güney
On 7/9/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Carl Cerecke wrote: Hi, The docs for the OPICScoringFilter mention that the plugin implements a variant of OPIC from Abiteboul et al.'s paper. What exactly is different? How does the difference affect the scores? As it is now, the

Re: OPIC scoring differences

2007-07-11 Thread Andrzej Bialecki
Doğacan Güney wrote: Andrzej, nice to see you working on this. There is one thing that I don't understand about your presentation. Assume that page A is the only url in our crawldb and it contains n outlinks. t = 0 - Generate runs, A is generated. t = 1 - Page A is fetched and its cash is

Re: OPIC scoring differences

2007-07-09 Thread Doğacan Güney
Hi, On 7/9/07, Carl Cerecke [EMAIL PROTECTED] wrote: Hi, The docs for the OPICScoringFilter mention that the plugin implements a variant of OPIC from Abiteboul et al.'s paper. What exactly is different? How does the difference affect the scores? Also, there's a comment in the code: // XXX (ab)

Re: OPIC scoring differences

2007-07-09 Thread Andrzej Bialecki
Carl Cerecke wrote: Hi, The docs for the OPICScoringFilter mention that the plugin implements a variant of OPIC from Abiteboul et al.'s paper. What exactly is different? How does the difference affect the scores? As it is now, the implementation doesn't preserve the total cash value in the

Re: OPIC score calculation issues

2006-03-14 Thread Andrzej Bialecki
(Better late than never... I forgot I didn't yet respond to your posting). Doug Cutting wrote: I think all that you're saying is that we should not run two CrawlDB updates at once, right? But there are lots of reasons we cannot do that besides the OPIC calculation. When we used WebDB it was

Re: OPIC score calculation issues

2006-03-14 Thread Doug Cutting
Andrzej Bialecki wrote: When we used WebDB it was possible to overlap generate / fetch / update cycles, because we would lock pages selected by FetchListTool for a period of time. Now we don't do this. The advantage is that we don't have to rewrite CrawlDB. But operations on CrawlDB are

Re: OPIC score calculation issues

2006-02-28 Thread Doug Cutting
Andrzej Bialecki wrote: * CrawlDBReducer (used by CrawlDB.update()) collects all CrawlDatum-s from crawl_parse with the same URL, which means that we get: * the original CrawlDatum * (optionally a CrawlDatum that contains just a Signature) * all CrawlDatum.LINKED entries pointing to

Re: OPIC

2005-10-21 Thread Andrzej Bialecki
Massimo Miccoli wrote: Sorry Andrzej, I mean in DeleteDuplicates.java, not at runtime. Is that the correct place to integrate something like shingling or n-grams? Yes. But there is a small issue of high dimensionality to solve, otherwise it will be very inefficient... Both shingling and n-gram

Re: OPIC

2005-10-20 Thread Massimo Miccoli
Hi Doug, Many thanks for your patch. I'll try it now. I'm also thinking of integrating an algorithm for near-duplicate URL detection, something like shingling. Is dedup the best place to integrate the algorithm? Thanks, Massimo Doug Cutting wrote: Here is a patch that implements this. I'm

Re: OPIC

2005-10-20 Thread Andrzej Bialecki
Massimo Miccoli wrote: Hi Doug, Many thanks for your patch. I'll try it now. I'm also thinking of integrating an algorithm for near-duplicate URL detection, something like shingling. Is dedup the best place to integrate the algorithm? That would be lovely. Dedup is the place to start, but certainly

Re: OPIC

2005-10-19 Thread Doug Cutting
Here is a patch that implements this. I'm still testing it. If it appears to work well, I will commit it. Doug Cutting wrote: Massimo Miccoli wrote: Any news about the integration of OPIC in mapred? I have time to develop OPIC on Nutch mapred. Can you help me get started? By the email from