Hi,
I would like to comment, if I may. I am not a Nutch/Lucene hacker yet; I have
been working with it for only a few weeks. However, I am looking at extending it
significantly to add some new features, and some of these will require
extending Lucene as well. First, I have a test implementation of PageRank (really
an approximation) that runs on top of MapReduce. Are people interested
in having this in the index? I am interested in how this and other metadata
might interact with your super field. For instance, I am also looking at using
relevance feedback as one of the criteria for ranking
documents. I was also considering using an outside data source, possibly even
another Lucene index, to store these values on a per-document basis.
The other major feature I am thinking about is using distance between words and
text type. Do you know of anyone who has done this?
Regards,
Steve
iVirtuoso, Inc
Steve Severance
Partner, Chief Technology Officer
[EMAIL PROTECTED]
mobile: (240) 472 - 9645
-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 22, 2007 7:44 PM
To: nutch-dev@lucene.apache.org
Subject: Performance optimization for Nutch index / query
Hi all,
This very long post is meant to initiate a discussion. There is no code
yet. Be warned that it discusses low-level Nutch/Lucene stuff.
Nutch queries are currently translated into complex Lucene queries. This
is necessary in order to take into account score factors coming from
various document parts, such as URL, host, title, content, and anchors.
Typically, the translation provided by query-basic looks like this for
single term queries:
(1)
Query: term1
Parsed: term1
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1
title:term1^1.5 host:term1^2.0)
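
As a rough illustration of translation (1), here is a minimal Python sketch that expands one term into a required group of boosted per-field clauses. The field names and boosts are copied from the translated query above; the names FIELD_BOOSTS, field_clause and translate_term are made up for this sketch and are not the query-basic code.

```python
# Field boosts as printed in the translated query above.
# A boost of 1.0 is not printed, which matches the bare "content:term1" clause.
FIELD_BOOSTS = [
    ("url", 4.0),
    ("anchor", 2.0),
    ("content", 1.0),
    ("title", 1.5),
    ("host", 2.0),
]


def field_clause(field, text, boost):
    """Render one 'field:text^boost' clause, omitting a boost of 1.0."""
    clause = f"{field}:{text}"
    if boost != 1.0:
        clause += f"^{boost}"
    return clause


def translate_term(term):
    """Expand a single term into a required group of per-field clauses."""
    clauses = [field_clause(field, term, boost) for field, boost in FIELD_BOOSTS]
    return "+(" + " ".join(clauses) + ")"


print(translate_term("term1"))
```

Even for a one-word query the result already carries five clauses, one per field.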
For queries consisting of two or more terms it looks like this (Nutch
uses implicit AND):
(2)
Query: term1 term2
Parsed: term1 term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0
content:term2 title:term2^1.5 host:term2^2.0) url:"term1
term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0 content:"term1
term2"~2147483647 title:"term1 term2"~2147483647^1.5 host:"term1
term2"~2147483647^2.0
By the way, please note the absurd default slop value - in case of
anchors it defeats the purpose of having the ANCHOR_GAP ...
Let's list other common query types:
(3)
Query: term1 term2 term3
Parsed: term1 term2 term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0
content:term2 title:term2^1.5 host:term2^2.0) +(url:term3^4.0
anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0)
url:"term1 term2 term3"~2147483647^4.0 anchor:"term1 term2 term3"~4^2.0
content:"term1 term2 term3"~2147483647 title:"term1 term2
term3"~2147483647^1.5 host:"term1 term2 term3"~2147483647^2.0
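
The pattern in (2) and (3) can be sketched in a few lines of Python: one required group per term, followed by one optional sloppy-phrase clause per field once there are two or more terms. The boosts and slop values are taken from the translations above (2147483647 is Java's Integer.MAX_VALUE). The function names are invented for this sketch, and the phrase text is quoted the way Lucene query syntax writes phrases.

```python
MAX_SLOP = 2147483647  # Java's Integer.MAX_VALUE, as printed in the translations

# (field, boost, slop) triples read off the translated queries above.
FIELDS = [
    ("url", 4.0, MAX_SLOP),
    ("anchor", 2.0, 4),
    ("content", 1.0, MAX_SLOP),
    ("title", 1.5, MAX_SLOP),
    ("host", 2.0, MAX_SLOP),
]


def boost_suffix(boost):
    """A boost of 1.0 is not printed."""
    return "" if boost == 1.0 else f"^{boost}"


def translate_terms(terms):
    """Translate a list of plain terms (implicit AND) into a Lucene-style string."""
    parts = []
    # One required group per term, spanning all fields.
    for term in terms:
        clauses = [f"{field}:{term}{boost_suffix(boost)}" for field, boost, _ in FIELDS]
        parts.append("+(" + " ".join(clauses) + ")")
    # With two or more terms, add an optional sloppy phrase per field.
    if len(terms) > 1:
        phrase = " ".join(terms)
        for field, boost, slop in FIELDS:
            parts.append(f'{field}:"{phrase}"~{slop}{boost_suffix(boost)}')
    return " ".join(parts)


print(translate_terms(["term1", "term2", "term3"]))
```

In this sketch a query of n plain terms produces 5n required clauses plus 5 sloppy-phrase clauses, so the three-term query above ends up with 20 per-field clauses.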
For phrase queries it looks like this:
(4)
Query: "term1 term2"
Parsed: "term1 term2"
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0
content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0)
For mixed term and phrase queries it looks like this:
(5)
Query: term1 "term2 term3"
Parsed: term1 "term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1
title:term1^1.5 host:term1^2.0) +(url:"term2 term3"^4.0 anchor:"term2
term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2
term3"^2.0)
For queries with the NOT operator it looks like this:
(6)
Query: term1 -term2
Parsed: term1 -term2
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1
title:term1^1.5 host:term1^2.0) -(url:term2^4.0 anchor:term2^2.0
content:term2 title:term2^1.5 host:term2^2.0)
(7)
Query: term1 term2 -term3
Parsed: term1 term2 -term3
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1
title:term1^1.5 host:term1^2.0) +(url:term2^4.0 anchor:term2^2.0
content:term2 title:term2^1.5 host:term2^2.0) -(url:term3^4.0
anchor:term3^2.0 content:term3 title:term3^1.5 host:term3^2.0)
url:"term1 term2"~2147483647^4.0 anchor:"term1 term2"~4^2.0
content:"term1 term2"~2147483647 title:"term1 term2"~2147483647^1.5
host:"term1 term2"~2147483647^2.0
(8)
Query: "term1 term2" -term3
Parsed: "term1 term2" -term3
Translated: +(url:"term1 term2"^4.0 anchor:"term1 term2"^2.0
content:"term1 term2" title:"term1 term2"^1.5 host:"term1 term2"^2.0)
-(url:term3^4.0 anchor:term3^2.0 content:term3 title:term3^1.5
host:term3^2.0)
(9)
Query: term1 -"term2 term3"
Parsed: term1 -"term2 term3"
Translated: +(url:term1^4.0 anchor:term1^2.0 content:term1
title:term1^1.5 host:term1^2.0) -(url:"term2 term3"^4.0 anchor:"term2
term3"^2.0 content:"term2 term3" title:"term2 term3"^1.5 host:"term2
term3"^2.0)
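
To see how quickly the clause count grows across cases (1) to (9), here is a hedged Python sketch that also covers phrases and the NOT operator. The rules are inferred only from the examples above (in particular, the assumption that sloppy-phrase clauses are built from the required, non-phrase terms when there are at least two of them); Clause, translate and count_field_clauses are hypothetical names, not Nutch classes.

```python
from dataclasses import dataclass

MAX_SLOP = 2147483647  # Java's Integer.MAX_VALUE

# (field, boost, slop) triples read off the translated queries above.
FIELDS = [
    ("url", 4.0, MAX_SLOP),
    ("anchor", 2.0, 4),
    ("content", 1.0, MAX_SLOP),
    ("title", 1.5, MAX_SLOP),
    ("host", 2.0, MAX_SLOP),
]


@dataclass
class Clause:
    text: str                 # a single term, or the words of a phrase
    phrase: bool = False      # True for a quoted phrase
    prohibited: bool = False  # True for a clause prefixed with '-'


def boost_suffix(boost):
    return "" if boost == 1.0 else f"^{boost}"


def translate(clauses):
    parts = []
    for c in clauses:
        text = f'"{c.text}"' if c.phrase else c.text
        sign = "-" if c.prohibited else "+"
        per_field = [f"{f}:{text}{boost_suffix(b)}" for f, b, _ in FIELDS]
        parts.append(sign + "(" + " ".join(per_field) + ")")
    # Assumption from the examples: only required plain terms feed the
    # optional sloppy-phrase clauses, and only when there are two or more.
    plain = [c.text for c in clauses if not c.phrase and not c.prohibited]
    if len(plain) > 1:
        joined = " ".join(plain)
        parts.extend(f'{f}:"{joined}"~{s}{boost_suffix(b)}' for f, b, s in FIELDS)
    return " ".join(parts)


def count_field_clauses(translated):
    """Count per-field clauses by counting 'field:' prefixes."""
    return sum(translated.count(field + ":") for field, _, _ in FIELDS)


query7 = [Clause("term1"), Clause("term2"), Clause("term3", prohibited=True)]
print(translate(query7))
print(count_field_clauses(translate(query7)))
```

Under these assumptions query (7), which is three words as typed by the user, expands to 20 per-field clauses.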
WHEW ... !!! Are you tired? Well, Lucene is tired of these queries too.
They are too long! They are absurdly long and complex. For large indexes
the time to evaluate them may run into several