Re: Creating a new scoring filter.

2007-02-22 Thread Andrzej Bialecki

(moved from nutch-user)

Nicolás Lichtmaier wrote:



Should I post these kind of questions to the dev list instead?


Yes :)

Hi, I'm working with a fixed set of URLs and I'd like to replace the 
standard OPIC scoring plugin with something different. I'd like to 
create a scoring plugin that bases its score entirely on the parsed 
document data (yes, I will trust the document text itself to decide 
its relevance).


I've been reading the code and the ScoringFilter interface seems to be 
targeted at OPIC-like algorithms. For example, the method invoked after 
parsing is called passScoreAfterParsing(), which already tells me what 
I'm supposed to do there, and the method that sets the scores is called 
distributeScoreToOutlink(). All of this scares me... would it be safe 
to use these methods differently and, e.g., modify the document score 
in passScoreAfterParsing() instead of just passing it along?


You can modify it whichever way you want - it's up to you. These methods 
simply ensure that the score data (not just CrawlDatum.getScore(), but 
possibly a multitude of metadata collected along the way) is passed to 
the appropriate segment parts.


E.g. in distributeScoreToOutlink() you could simply set the default 
score for new pages to a fixed value, without actually using the score 
information from the source page.
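
As a rough illustration only - the method signatures below follow the 
ScoringFilter interface from memory and may need adjusting, and the 
package name, metadata key and fixed score are made up - such a 
content-based filter could look roughly like this:

// Just a sketch; the remaining ScoringFilter methods are omitted and
// could simply pass the existing values through unchanged.
package org.example.scoring;                                 // made-up package

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.scoring.ScoringFilterException;

public class ContentScoringFilter /* implements ScoringFilter */ {

  private static final float FIXED_OUTLINK_SCORE = 1.0f;            // assumed value
  private static final String SCORE_KEY = "example.content.score";  // made-up key

  /** Compute a score from the parsed text and stash it in the parse
   * metadata, instead of merely passing the fetch-time score along. */
  public void passScoreAfterParsing(Text url, Content content, Parse parse)
      throws ScoringFilterException {
    float score = scoreFromText(parse.getText());
    parse.getData().getContentMeta().set(SCORE_KEY, Float.toString(score));
  }

  /** Give every newly discovered outlink the same fixed score, ignoring
   * the score of the source page. */
  public CrawlDatum distributeScoreToOutlink(Text fromUrl, Text toUrl,
      ParseData parseData, CrawlDatum target, CrawlDatum adjust,
      int allCount, int validCount) throws ScoringFilterException {
    target.setScore(FIXED_OUTLINK_SCORE);
    return adjust;  // no adjustment to the source page's CrawlDatum
  }

  /** Toy relevance heuristic - replace with whatever text analysis
   * actually decides the document's relevance. */
  private float scoreFromText(String text) {
    if (text == null) return 0.0f;
    return Math.min(1.0f, text.length() / 10000.0f);
  }
}

The value stashed in the parse metadata can then be read back in the 
other interface methods (e.g. when updating the crawldb or indexing) 
once you fill in the rest of the filter.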


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




RE: Performance optimization for Nutch index / query

2007-02-22 Thread Steve Severance
Hi,

I would like to comment if I might. I am not a Nutch/Lucene hacker yet - I have 
been working with it for only a few weeks - but I am looking at extending it 
significantly to add some new features, some of which will require 
extending Lucene as well. First, I have a test implementation of PageRank 
(really an approximation) that runs on top of MapReduce. Are people interested 
in having this in the index? I am interested in how this and other metadata 
might interact with your super field. For instance, I am also looking at using 
relevance feedback and having that as one of the criteria for ranking 
documents. I was also considering using an outside data source, possibly even 
another Lucene index, to store these values on a per-document basis. 
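
Roughly, what I mean by a MapReduce PageRank approximation is one 
iteration per job, along these lines (plain Java rather than the actual 
Hadoop API; the names and the damping factor are only illustrative):

import java.util.*;

// Rough sketch of one PageRank iteration expressed as map and reduce steps.
public class PageRankIteration {

  static final double DAMPING = 0.85;

  // "map" phase: each page splits its current rank evenly among its outlinks
  static Map<String, List<Double>> map(Map<String, Double> ranks,
                                       Map<String, List<String>> outlinks) {
    Map<String, List<Double>> contributions = new HashMap<String, List<Double>>();
    for (Map.Entry<String, List<String>> e : outlinks.entrySet()) {
      if (e.getValue().isEmpty()) continue;          // dangling page: nothing to emit
      double share = ranks.get(e.getKey()) / e.getValue().size();
      for (String target : e.getValue()) {
        if (!contributions.containsKey(target)) {
          contributions.put(target, new ArrayList<Double>());
        }
        contributions.get(target).add(share);
      }
    }
    return contributions;
  }

  // "reduce" phase: each page sums the contributions addressed to it
  static Map<String, Double> reduce(Map<String, List<Double>> contributions,
                                    Set<String> allPages) {
    Map<String, Double> newRanks = new HashMap<String, Double>();
    for (String page : allPages) {
      double sum = 0.0;
      List<Double> in = contributions.get(page);
      if (in != null) {
        for (double c : in) sum += c;
      }
      newRanks.put(page, (1.0 - DAMPING) + DAMPING * sum);
    }
    return newRanks;
  }
}

Each iteration would be one job over the link data, repeated until the 
ranks converge or a fixed iteration count is reached.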

The other major feature I am thinking about is using distance between words and 
text type. Do you know of anyone who has done this?

Regards,

Steve


iVirtuoso, Inc
Steve Severance
Partner, Chief Technology Officer
[EMAIL PROTECTED]
mobile: (240) 472 - 9645


-Original Message-
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 22, 2007 7:44 PM
To: nutch-dev@lucene.apache.org
Subject: Performance optimization for Nutch index / query

Hi all,

This very long post is meant to initiate a discussion. There is no code 
yet. Be warned that it discusses low-level Nutch/Lucene stuff.

Nutch queries are currently translated into complex Lucene queries. This 
is necessary in order to take into account score factors coming from 
various document parts, such as URL, host, title, content, and anchors.

Typically, the translation provided by query-basic looks like this for 
single term queries:

(1)
Query: term1
Parsed: term1
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0)

For queries consisting of two or more terms it looks like this (Nutch 
uses implicit AND):

(2)
Query: term1 term2
Parsed: term1 term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) url:"term1 
term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1 
term2"~2147483647 title:"term1 term2"~2147483647^1.5 host:"term1 
term2"~2147483647^2.0

By the way, please note the absurd default slop value - in case of 
anchors it defeats the purpose of having the ANCHOR_GAP ...

Let's list other common query types:

(3)
Query: term1 term2 term3
Parsed: term1 term2 term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) +(url:term3^4.0 
anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) 
url:"term1 term2 term3"~2147483647^4.0 anchor:"term1 term2 term3"~4^2.0 
content:"term1 term2 term3"~2147483647 title:"term1 term2 
term3"~2147483647^1.5 host:"term1 term2 term3"~2147483647^2.0

For phrase queries it looks like this:

(4)
Query: "term1 term2"
Parsed: "term1 term2"
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 
content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0)

For mixed term and phrase queries it looks like this:

(5)
Query: term1 "term2 term3"
Parsed: term1 "term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:"term2 term3"^4.0 anchor:"term2 
term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2 
term3"^2.0)

For queries with NOT operator it looks like this:

(6)
Query: term1 -term2
Parsed: term1 -term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) -(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0)

(7)
Query: term1 term2 -term3
Parsed: term1 term2 -term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0 
content:term2 title:term2^1.5 host:term2^2.0) -(url:term3^4.0 
anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0) 
url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 
content:"term1 term2"~2147483647 title:"term1 term2"~2147483647^1.5 
host:"term1 term2"~2147483647^2.0

(8)
Query: "term1 term2" -term3
Parsed: "term1 term2" -term3
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0 
content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0) 
-(url:term3^4.0 anchor:term3^2.0 content:term3 title:term3^1.5 
host:term3^2.0)

(9)
Query: term1 -"term2 term3"
Parsed: term1 -"term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1 
title:term1^1.5 host:term1^2.0) -(url:"term2 term3"^4.0 anchor:"term2 
term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2 
term3"^2.0)
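
(For reference, the translations above can be dumped with a small tool 
along these lines - just a sketch from memory, so the searcher class 
and method names may need adjusting against the current trunk:)

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.searcher.QueryFilters;
import org.apache.nutch.util.NutchConfiguration;

public class DumpTranslatedQuery {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    Query nutchQuery = Query.parse(args[0], conf);      // parse the user query
    org.apache.lucene.search.Query luceneQuery =
        new QueryFilters(conf).filter(nutchQuery);      // run the query-filter plugins
    System.out.println("Parsed:     " + nutchQuery);
    System.out.println("Translated: " + luceneQuery);
  }
}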


WHEW ... !!! Are you tired? Well, Lucene is tired of these queries too. 
They are too long! They are absurdly long and complex. For large indexes 
the time to evaluate them may run into several 

Why not make SOLR the Nutch SE

2007-02-22 Thread Gal Nitzan
Hi,

Since I ran into SOLR the other day I have been wondering why we can't join 
forces between the two projects.

Both projects complement each other.

Any thoughts?

Gal.