Re: NUTCH-479 "Support for OR queries" - what is this about

2007-07-09 Thread Kai_testing Middleton
Hi Andrzej:  Thanks for the thorough reply!

To round out the discussion a bit, I've done a little homework of my own, 
reading "Lucene in Action" by Otis Gospodnetic and Erik Hatcher.  In section 
10.1 Nutch: "The NPR of search engines" (page 327) it says:
"The Query Handler does some light processing of the
query and forwards the search terms to a large set of
Index Searcher machines.  The Nutch query system might
seem much simpler than Lucene's, but that's largely
because search engine users have a strong idea of what
kind of queries they like to perform.  Lucene's system
is very flexible and allows for many different kinds of
queries.  The simple-looking Nutch query is converted
into a very specific Lucene one.  This is discussed
further later.  Each Index Searcher works in parallel
and returns a ranked list of document IDs."
(see for instance 
http://www.lucenebook.com/search?query=The+query+handler+does+some+light+processing)
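
As a rough illustration of what "converted into a very specific Lucene one" 
could mean, the sketch below expands each user term into a required clause 
that searches several index fields with different weights.  The class name, 
the field names and the boost values are assumptions made up for this 
example, not Nutch's actual configuration, and the result is only a string 
in Lucene's classic query-parser syntax rather than real Lucene Query objects.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: expands a plain user query into a multi-field query
// string. This is NOT Nutch's actual translation code.
class QueryExpansionSketch {

    // Assumed field names and boosts; a real deployment configures its own.
    static final Map<String, Float> FIELD_BOOSTS = new LinkedHashMap<String, Float>();
    static {
        FIELD_BOOSTS.put("url", 4.0f);
        FIELD_BOOSTS.put("title", 1.5f);
        FIELD_BOOSTS.put("content", 1.0f);
    }

    // Each whitespace-separated term becomes one required (+) group that
    // matches the term in any of the fields. Special characters are not escaped.
    static String expand(String userQuery) {
        StringBuilder out = new StringBuilder();
        for (String term : userQuery.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input splits into a single empty token
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append("+(");
            String separator = "";
            for (Map.Entry<String, Float> field : FIELD_BOOSTS.entrySet()) {
                out.append(separator)
                   .append(field.getKey()).append(':').append(term)
                   .append('^').append(field.getValue());
                separator = " ";
            }
            out.append(')');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("nutch lucene"));
    }
}
```

With these assumed fields, the two-word query "nutch lucene" becomes 
+(url:nutch^4.0 title:nutch^1.5 content:nutch^1.0) +(url:lucene^4.0 title:lucene^1.5 content:lucene^1.0), 
which shows how a short user query can turn into a much longer Lucene one.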

Part of what I'm trying to do for my particular implementation is figure out 
how detailed a search capability I need.  I'm basically going to crawl the 
web using Nutch (or Heritrix, etc.) to create a specific corpus, then index 
it with Lucene (or Xapian, etc.) so that I can query it.  I find Nutch 
convenient because it comes with Lucene built in (although, as you point out, 
"Nutch is NOT an extension of Lucene").  The queries I'll process will come from 
a Perl backend, so I'm considering using Solr as you mention below (also 
mentioned in NUTCH-442: Integrate Solr/Nutch, and Sami Siren's blog: 
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html).

--Kai

----- Original Message -----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: nutch-user@lucene.apache.org
Sent: Saturday, July 7, 2007 1:26:22 PM
Subject: Re: NUTCH-479 "Support for OR queries" - what is this about

Briggs wrote:
> Please keep this thread going as I am also curious to know why this
> has been 'forked'.   I am sure that most of this lies within the
> original OPIC filter but I still can't understand why straightforward
> lucene queries have not been used within the application.

No, this has actually almost nothing to do with the scoring filters 
(which were added much later).

The decision to use a different query syntax than the one from Lucene 
was motivated by a few reasons:

* to avoid the need to support low-level index and searcher operations, 
which the Lucene API would require us to implement.

* to keep the Nutch core largely independent of Lucene, so that it's 
possible to use Nutch with different back-end searcher implementations. 
This started to materialize only now, with the ongoing effort to use 
Solr as a possible backend.

* to limit the query syntax to those queries that provide best tradeoff 
between functionality and performance, in a large-scale search engine.


> On 7/6/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

>> Ok, so I guess what I don't understand is what is the "Nutch query 
>> syntax"?

Query syntax is defined in an informal way on the Help page in 
nutch.war, or here:

http://wiki.apache.org/nutch/Features

Formal syntax definition can be gleaned from 
org.apache.nutch.analysis.NutchAnalysis.jj.


>>
>> The main discussion I found on nutch-user is this:
>> http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
>> I was wondering why the query syntax is so limited.
>> There are no OR queries, there are no fielded queries,
>> or fuzzy, or approximate... Why? The underlying index
>> supports all these operations.

Actually, it's possible to configure Nutch to allow raw field queries - 
you need to add a raw field query plugin for this. Please see 
RawFieldQueryFilter class, and existing plugins that use fielded 
queries: query-site, and query-more. Query-more / DateQueryFilter is 
especially interesting, because it shows how to use raw token values 
from a parsed query to build complex Lucene queries.


>>
>> I notice by looking at the or.patch file 
>> (https://issues.apache.org/jira/secure/attachment/12360659/or.patch) 
>> that one of the programs under consideration is:
>> nutch/searcher/Query.java
>> The code for this is distinct from
>> lucene/search/Query.java

See above - they are completely different classes, with completely 
different purpose. The use of the same class name is unfortunate and 
misleading.

Nutch Query class is intended to express queries entered by search 
engine users, in a tokenized and parsed way, so that the rest of Nutch 
may deal with Clauses, Terms and Phrases instead of plain String-s.

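On the other hand, Lucene Query is intended to express arbitrarily 
complex Lucene queries - many of these queries would be prohibitively 
expensive for a large search engine (e.g. wildcard queries).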

Re: NUTCH-479 "Support for OR queries" - what is this about

2007-07-09 Thread Briggs

Thanks for the answer. That was helpful.

I was sooo wrong.

On 7/7/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Briggs wrote:
> Please keep this thread going as I am also curious to know why this
> has been 'forked'.   I am sure that most of this lies within the
> original OPIC filter but I still can't understand why straightforward
> lucene queries have not been used within the application.

No, this has actually almost nothing to do with the scoring filters
(which were added much later).

The decision to use a different query syntax than the one from Lucene
was motivated by a few reasons:

* to avoid the need to support low-level index and searcher operations,
which the Lucene API would require us to implement.

* to keep the Nutch core largely independent of Lucene, so that it's
possible to use Nutch with different back-end searcher implementations.
This started to materialize only now, with the ongoing effort to use
Solr as a possible backend.

* to limit the query syntax to those queries that provide best tradeoff
between functionality and performance, in a large-scale search engine.


> On 7/6/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

>> Ok, so I guess what I don't understand is what is the "Nutch query
>> syntax"?

Query syntax is defined in an informal way on the Help page in
nutch.war, or here:

http://wiki.apache.org/nutch/Features

Formal syntax definition can be gleaned from
org.apache.nutch.analysis.NutchAnalysis.jj.



>>
>> The main discussion I found on nutch-user is this:
>> http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
>> I was wondering why the query syntax is so limited.
>> There are no OR queries, there are no fielded queries,
>> or fuzzy, or approximate... Why? The underlying index
>> supports all these operations.


Actually, it's possible to configure Nutch to allow raw field queries -
you need to add a raw field query plugin for this. Please see
RawFieldQueryFilter class, and existing plugins that use fielded
queries: query-site, and query-more. Query-more / DateQueryFilter is
especially interesting, because it shows how to use raw token values
from a parsed query to build complex Lucene queries.


>>
>> I notice by looking at the or.patch file
>> (https://issues.apache.org/jira/secure/attachment/12360659/or.patch)
>> that one of the programs under consideration is:
>> nutch/searcher/Query.java
>> The code for this is distinct from
>> lucene/search/Query.java


See above - they are completely different classes, with completely
different purpose. The use of the same class name is unfortunate and
misleading.

Nutch Query class is intended to express queries entered by search
engine users, in a tokenized and parsed way, so that the rest of Nutch
may deal with Clauses, Terms and Phrases instead of plain String-s.

On the other hand, Lucene Query is intended to express arbitrarily
complex Lucene queries - many of these queries would be prohibitively
expensive for a large search engine (e.g. wildcard queries).


>>
>> It looks like this is an architecture issue that I don't understand.
>> If nutch is an "extension" of lucene, why does it define a different
>> Query class?

Nutch is NOT an extension of Lucene. It's an application that uses
Lucene as a library.


>>  Why don't we just use the Lucene code to query the
>> indexes?  Does this have something to do with the nutch webapp
>> (nutch.war)?  What is the historical genesis of this issue (or is that
>> even relevant)?

Nutch webapp doesn't have anything to do with it. The limitations in the
query syntax have different roots (see above).

--
Best regards,
Andrzej Bialecki <><
  ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com





--
"Conscious decisions by conscious minds are what make reality real"


Re: NUTCH-479 "Support for OR queries" - what is this about

2007-07-07 Thread Andrzej Bialecki

Briggs wrote:

> Please keep this thread going as I am also curious to know why this
> has been 'forked'.   I am sure that most of this lies within the
> original OPIC filter but I still can't understand why straightforward
> lucene queries have not been used within the application.


No, this has actually almost nothing to do with the scoring filters 
(which were added much later).


The decision to use a different query syntax than the one from Lucene 
was motivated by a few reasons:


* to avoid the need to support low-level index and searcher operations, 
which the Lucene API would require us to implement.


* to keep the Nutch core largely independent of Lucene, so that it's 
possible to use Nutch with different back-end searcher implementations. 
This started to materialize only now, with the ongoing effort to use 
Solr as a possible backend.


* to limit the query syntax to those queries that provide best tradeoff 
between functionality and performance, in a large-scale search engine.
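
The second and third points can be sketched in plain Java.  Everything 
below is hypothetical (SimpleQuery, SearchBackend and InMemoryBackend are 
names invented for this example, not Nutch classes): the query carries only 
required and prohibited terms, which is the implicit AND and NOT mentioned 
in the JIRA description, and any back-end that implements the small 
interface can answer it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a backend-neutral query; NOT actual Nutch code.
class BackendSketch {

    // Deliberately limited query: all required terms must match (implicit
    // AND) and no prohibited term may match (NOT). Terms are expected in
    // lower case.
    static final class SimpleQuery {
        final List<String> required;
        final List<String> prohibited;

        SimpleQuery(List<String> required, List<String> prohibited) {
            this.required = required;
            this.prohibited = prohibited;
        }
    }

    // The seam between the core and whatever engine sits behind it.
    interface SearchBackend {
        List<String> search(SimpleQuery query);
    }

    // Toy back-end that scans documents held in memory.
    static final class InMemoryBackend implements SearchBackend {
        private final Map<String, List<String>> docs =
                new LinkedHashMap<String, List<String>>();

        void add(String id, String text) {
            docs.put(id, Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\s+")));
        }

        @Override
        public List<String> search(SimpleQuery query) {
            List<String> hits = new ArrayList<String>();
            for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
                List<String> words = doc.getValue();
                if (words.containsAll(query.required)
                        && Collections.disjoint(words, query.prohibited)) {
                    hits.add(doc.getKey());
                }
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        InMemoryBackend backend = new InMemoryBackend();
        backend.add("d1", "Nutch uses Lucene as a library");
        backend.add("d2", "Nutch crawls the web");
        SearchBackend seam = backend;
        // Required "nutch", prohibited "lucene": only d2 matches.
        System.out.println(seam.search(new SimpleQuery(
                Arrays.asList("nutch"), Arrays.asList("lucene"))));
    }
}
```

Because the caller only sees SearchBackend, a Lucene-based or Solr-based 
implementation could in principle be swapped in without touching the code 
that builds the query.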




> On 7/6/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

>> Ok, so I guess what I don't understand is what is the "Nutch query
>> syntax"?


Query syntax is defined in an informal way on the Help page in 
nutch.war, or here:


http://wiki.apache.org/nutch/Features

Formal syntax definition can be gleaned from 
org.apache.nutch.analysis.NutchAnalysis.jj.





>> The main discussion I found on nutch-user is this:
>> http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
>> I was wondering why the query syntax is so limited.
>> There are no OR queries, there are no fielded queries,
>> or fuzzy, or approximate... Why? The underlying index
>> supports all these operations.


Actually, it's possible to configure Nutch to allow raw field queries - 
you need to add a raw field query plugin for this. Please see 
RawFieldQueryFilter class, and existing plugins that use fielded 
queries: query-site, and query-more. Query-more / DateQueryFilter is 
especially interesting, because it shows how to use raw token values 
from a parsed query to build complex Lucene queries.
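
The principle behind a raw field query plugin can be sketched without the 
Nutch classes.  In the stand-alone Java below, FieldClause and 
RawFieldFilterSketch are hypothetical names (the real RawFieldQueryFilter 
API is not reproduced here), and lower-casing the value is an assumption.  
The sketch only shows that a clause on one designated field is passed 
through as a single raw term, while clauses on other fields are left for 
other filters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of the raw-field idea; NOT the real RawFieldQueryFilter API.
class RawFieldFilterSketch {

    // A parsed clause that names the field it applies to.
    static final class FieldClause {
        final String field;
        final String value;
        final boolean prohibited;

        FieldClause(String field, String value, boolean prohibited) {
            this.field = field;
            this.value = value;
            this.prohibited = prohibited;
        }
    }

    private final String field;

    RawFieldFilterSketch(String field) {
        this.field = field;
    }

    // Emits one term per clause on this filter's field. The value is kept
    // as a single token (lower-cased here, an assumption) instead of being
    // run through text analysis. Clauses on other fields are ignored.
    List<String> filter(List<FieldClause> clauses) {
        List<String> terms = new ArrayList<String>();
        for (FieldClause clause : clauses) {
            if (field.equals(clause.field)) {
                terms.add((clause.prohibited ? "-" : "+") + field + ":"
                        + clause.value.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<FieldClause> clauses = Arrays.asList(
                new FieldClause("site", "Lucene.Apache.org", false),
                new FieldClause("content", "nutch", false));
        // Only the "site" clause is handled by this filter.
        System.out.println(new RawFieldFilterSketch("site").filter(clauses));
    }
}
```

Running main prints [+site:lucene.apache.org]; the "content" clause is 
skipped because this filter instance only handles the "site" field.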





>> I notice by looking at the or.patch file
>> (https://issues.apache.org/jira/secure/attachment/12360659/or.patch)
>> that one of the programs under consideration is:
>> nutch/searcher/Query.java
>> The code for this is distinct from
>> lucene/search/Query.java


See above - they are completely different classes, with completely 
different purpose. The use of the same class name is unfortunate and 
misleading.


Nutch Query class is intended to express queries entered by search 
engine users, in a tokenized and parsed way, so that the rest of Nutch 
may deal with Clauses, Terms and Phrases instead of plain String-s.


On the other hand, Lucene Query is intended to express arbitrarily 
complex Lucene queries - many of these queries would be prohibitively 
expensive for a large search engine (e.g. wildcard queries).
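
To make "tokenized and parsed" concrete, here is a minimal sketch in plain 
Java.  ParsedQuerySketch and Clause are hypothetical stand-ins, not the real 
org.apache.nutch.searcher.Query API, and the parsing rules (whitespace 
separates clauses, double quotes delimit a phrase, a leading '-' prohibits a 
clause) are simplifying assumptions rather than the grammar in 
NutchAnalysis.jj.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for a parsed user query; NOT the real Nutch Query class.
class ParsedQuerySketch {

    // One clause: a single term or a quoted phrase, required or prohibited.
    static final class Clause {
        final boolean prohibited;
        final boolean phrase;
        final String text;

        Clause(boolean prohibited, boolean phrase, String text) {
            this.prohibited = prohibited;
            this.phrase = phrase;
            this.text = text;
        }

        @Override
        public String toString() {
            String body = phrase ? "\"" + text + "\"" : text;
            return (prohibited ? "-" : "+") + body;
        }
    }

    // Assumed rules: whitespace separates clauses, double quotes delimit a
    // phrase, and a leading '-' marks a clause as prohibited.
    static List<Clause> parse(String input) {
        List<Clause> clauses = new ArrayList<Clause>();
        int i = 0;
        while (i < input.length()) {
            if (Character.isWhitespace(input.charAt(i))) {
                i++;
                continue;
            }
            boolean prohibited = input.charAt(i) == '-';
            if (prohibited) {
                i++; // skip the '-'
            }
            boolean phrase = i < input.length() && input.charAt(i) == '"';
            int start = phrase ? i + 1 : i;
            int end = start;
            // A phrase runs to the closing quote, a term to the next whitespace.
            while (end < input.length()
                    && (phrase ? input.charAt(end) != '"'
                               : !Character.isWhitespace(input.charAt(end)))) {
                end++;
            }
            if (end > start) {
                clauses.add(new Clause(prohibited, phrase,
                        input.substring(start, end).toLowerCase(Locale.ROOT)));
            }
            i = phrase ? Math.min(end + 1, input.length()) : end;
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("Nutch \"search engine\" -lucene"));
    }
}
```

Running main prints [+nutch, +"search engine", -lucene].  Code further down 
the pipeline can then work with these clause objects instead of re-parsing 
the raw string, and nothing in this structure can express a wildcard or 
other expensive query.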





>> It looks like this is an architecture issue that I don't understand.
>> If nutch is an "extension" of lucene, why does it define a different
>> Query class?


Nutch is NOT an extension of Lucene. It's an application that uses 
Lucene as a library.



>>  Why don't we just use the Lucene code to query the
>> indexes?  Does this have something to do with the nutch webapp
>> (nutch.war)?  What is the historical genesis of this issue (or is that
>> even relevant)?


Nutch webapp doesn't have anything to do with it. The limitations in the 
query syntax have different roots (see above).


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: NUTCH-479 "Support for OR queries" - what is this about

2007-07-07 Thread Briggs

Please keep this thread going as I am also curious to know why this
has been 'forked'.   I am sure that most of this lies within the
original OPIC filter but I still can't understand why straightforward
lucene queries have not been used within the application.



On 7/6/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

I've been reading up on NUTCH-479 "Support for OR queries" but I must be 
missing something obvious because I don't understand what the JIRA is about:

https://issues.apache.org/jira/browse/NUTCH-479

   Description:
   There have been many requests from users to extend Nutch query syntax
   to add support for OR queries, in addition to the implicit AND and NOT
   queries supported now.

Ok, so I guess what I don't understand is what is the "Nutch query syntax"?

The main discussion I found on nutch-user is this:
http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
I was wondering why the query syntax is so limited.
There are no OR queries, there are no fielded queries,
or fuzzy, or approximate... Why? The underlying index
supports all these operations.

I notice by looking at the or.patch file 
(https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one 
of the programs under consideration is:
nutch/searcher/Query.java
The code for this is distinct from
lucene/search/Query.java

It looks like this is an architecture issue that I don't understand.  If nutch is an 
"extension" of lucene, why does it define a different Query class?  Why don't 
we just use the Lucene code to query the indexes?  Does this have something to do with 
the nutch webapp (nutch.war)?  What is the historical genesis of this issue (or is that 
even relevant)?










--
"Conscious decisions by conscious minds are what make reality real"


NUTCH-479 "Support for OR queries" - what is this about

2007-07-06 Thread Kai_testing Middleton
I've been reading up on NUTCH-479 "Support for OR queries" but I must be 
missing something obvious because I don't understand what the JIRA is about:

https://issues.apache.org/jira/browse/NUTCH-479

   Description:
   There have been many requests from users to extend Nutch query syntax
   to add support for OR queries, in addition to the implicit AND and NOT
   queries supported now.

Ok, so I guess what I don't understand is what is the "Nutch query syntax"? 

The main discussion I found on nutch-user is this:
http://osdir.com/ml/search.nutch.devel/2004-02/msg7.html
I was wondering why the query syntax is so limited.
There are no OR queries, there are no fielded queries,
or fuzzy, or approximate... Why? The underlying index
supports all these operations.

I notice by looking at the or.patch file 
(https://issues.apache.org/jira/secure/attachment/12360659/or.patch) that one 
of the programs under consideration is:
nutch/searcher/Query.java
The code for this is distinct from
lucene/search/Query.java

It looks like this is an architecture issue that I don't understand.  If nutch 
is an "extension" of lucene, why does it define a different Query class?  Why 
don't we just use the Lucene code to query the indexes?  Does this have 
something to do with the nutch webapp (nutch.war)?  What is the historical 
genesis of this issue (or is that even relevant)?





 
