On 7/9/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:
Carl Cerecke wrote:
Hi,
The docs for the OPICScoringFilter mention that the plugin implements a
variant of OPIC from Abiteboul et al.'s paper. What exactly is different?
How does the difference affect the scores?
As it is now, the
Doğacan Güney wrote:
Andrzej, nice to see you working on this.
There is one thing that I don't understand about your presentation.
Assume that page A is the only url in our crawldb and it contains n
outlinks.
t = 0 - Generate runs, A is generated.
t = 1 - Page A is fetched and its cash is
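The scenario above (a single page A with n outlinks, generated at t = 0 and fetched at t = 1) can be sketched as follows. In OPIC, a fetched page's cash is split evenly among its n outlinks, so the total cash in the system stays constant. This is a minimal sketch of that accounting, not Nutch's actual OPICScoringFilter; the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class OpicSketch {
    // Cash held by each URL; initially page A holds all of it.
    static Map<String, Double> cash = new HashMap<>();

    // On fetch, a page's cash is split evenly among its outlinks,
    // and the page's own cash drops to zero (it has been spent).
    static void fetch(String url, String[] outlinks) {
        double c = cash.getOrDefault(url, 0.0);
        double share = c / outlinks.length;
        for (String out : outlinks) {
            cash.merge(out, share, Double::sum);
        }
        cash.put(url, 0.0);
    }

    public static void main(String[] args) {
        cash.put("A", 1.0);
        fetch("A", new String[] {"B", "C", "D", "E"}); // n = 4 outlinks
        // Each outlink now holds 1.0 / 4; the total cash is still 1.0.
        System.out.println(cash);
    }
}
```

Note that the invariant this sketch maintains (total cash is preserved) is exactly the property the discussion below says the current implementation does not preserve.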
Hi,
On 7/9/07, Carl Cerecke [EMAIL PROTECTED] wrote:
Hi,
The docs for the OPICScoringFilter mention that the plugin implements a
variant of OPIC from Abiteboul et al.'s paper. What exactly is different?
How does the difference affect the scores?
Also, there's a comment in the code:
// XXX (ab)
Carl Cerecke wrote:
Hi,
The docs for the OPICScoringFilter mention that the plugin implements a
variant of OPIC from Abiteboul et al.'s paper. What exactly is different?
How does the difference affect the scores?
As it is now, the implementation doesn't preserve the total cash value
in the
(Better late than never... I forgot I didn't yet respond to your posting).
Doug Cutting wrote:
I think all that you're saying is that we should not run two CrawlDB
updates at once, right? But there are lots of reasons we cannot do
that besides the OPIC calculation.
When we used WebDB it was
Andrzej Bialecki wrote:
When we used WebDB it was possible to overlap generate / fetch / update
cycles, because we would lock pages selected by FetchListTool for a
period of time.
Now we don't do this. The advantage is that we don't have to rewrite
CrawlDB. But operations on CrawlDB are
Andrzej Bialecki wrote:
* CrawlDBReducer (used by CrawlDB.update()) collects all CrawlDatum-s
from crawl_parse with the same URL, which means that we get:
* the original CrawlDatum
* (optionally a CrawlDatum that contains just a Signature)
* all CrawlDatum.LINKED entries pointing to
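The reduce-side merge described above can be sketched in simplified form: for each URL, the original CrawlDatum is combined with the cash carried by the incoming LINKED entries. This is an assumption-laden model, not the real CrawlDbReducer; the Datum class here is a stand-in for Nutch's CrawlDatum with just the fields needed to show the idea.

```java
import java.util.List;

public class UpdateSketch {
    // Simplified stand-in for Nutch's CrawlDatum: a score plus a flag
    // telling whether this record is a LINKED inlink contribution.
    static class Datum {
        double score;
        boolean linked; // true for a CrawlDatum.LINKED entry
        Datum(double score, boolean linked) {
            this.score = score;
            this.linked = linked;
        }
    }

    // Reduce-side merge for one URL: start from the original datum's
    // score and add the cash carried by each LINKED entry (the OPIC
    // contribution of the pages linking to this URL).
    static Datum merge(Datum original, List<Datum> values) {
        double score = original.score;
        for (Datum d : values) {
            if (d.linked) {
                score += d.score;
            }
        }
        return new Datum(score, false);
    }
}
```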
Massimo Miccoli wrote:
Sorry Andrzej,
I mean in DeleteDuplicates.java, not at runtime. Is that the correct
place to integrate something like shingling or n-grams?
Yes. But there is a small issue of high dimensionality to solve,
otherwise it will be very inefficient...
Both shingling and n-gram
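For context, shingling represents a document as its set of contiguous word k-grams and compares documents by Jaccard similarity of those sets. A minimal sketch (not Nutch code; class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ShingleSketch {
    // Break a document into word-level k-shingles (contiguous k-grams).
    static Set<String> shingles(String text, int k) {
        String[] tokens = text.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + k)));
        }
        return out;
    }

    // Jaccard similarity of two shingle sets; near-duplicate
    // documents score close to 1.0.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }
}
```

The high-dimensionality problem mentioned above is that comparing every pair of shingle sets is expensive; in practice the sets are usually compressed into small fixed-size sketches (e.g. min-hash signatures) before comparison.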
Hi Doug,
Many thanks for your patch. I'll try it now. I'm also thinking of integrating
an algorithm for near-duplicate URL detection, something like shingling.
Is dedup the best place to integrate the algorithm?
Thanks,
Massimo
Doug Cutting wrote:
Here is a patch that implements this. I'm
Massimo Miccoli wrote:
Hi Doug,
Many thanks for your patch. I'll try it now. I'm also thinking of integrating
an algorithm for near-duplicate URL detection, something like shingling.
Is dedup the best place to integrate the algorithm?
That would be lovely. Dedup is the place to start, but certainly
Here is a patch that implements this. I'm still testing it. If it
appears to work well, I will commit it.
Doug Cutting wrote:
Massimo Miccoli wrote:
Any news about integrating OPIC in mapred? I have time to develop
OPIC on Nutch mapred. Can you help me get started?
By the email from