RE: bi-grams for common terms - any analyzers do that?

2010-09-27 Thread Burton-West, Tom
Hi Yonik,

>>If the new "autoGeneratePhraseQueries" is off, position doesn't matter, and 
>>the query will 
>>be treated as "index" OR "reader".

Just wanted to make sure: in Solr, does autoGeneratePhraseQueries = "off" mean 
the query is handled with the *default* query operator as set in the config, 
rather than necessarily with the Boolean "OR" operator?

i.e.  if the default query operator is set to AND and autoGeneratePhraseQueries = off,

then "IndexReader" -> "index"  "reader" -> "index" AND "reader"

Tom




RE: bi-grams for common terms - any analyzers do that?

2010-09-27 Thread Burton-West, Tom
Hi Jonathan,

>> I'm afraid I'm having trouble understanding   "if the analyzer returns more 
>> than one position back from a "queryparser token"

>>I'm not sure if "the queryparser forms a phrase query without explicit phrase 
>>quotes" is a problem for me, I had no idea it happened until now, never 
>>noticed, and still don't really understand in what circumstances it happens.

The problem I had was that for a Boolean query "l'art AND histoire", the 
WordDelimiterFilter tokenized "l'art" as two tokens: "l" at position 1 and 
"art" at position 2.  So the query parser decided this meant a phrase query for 
"l" followed immediately by "art".  See 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance 
for details.

This would happen whenever any token filter splits a token into more than one 
token, for example a filter that splits foo-bar into "foo" and "bar".  The 
exception is SynonymFilter or something like it.  In the case of 
SynonymFilter, it's not really a case of "splitting" one token into multiple 
tokens; given one token of input, it outputs all the synonyms of the term, 
and all of the tokens have the same position attribute. (see: 
http://www.lucidimagination.com/search/document/CDRG_ch05_5.6.19?q=synonym%20filter)

So for example, for the string "the small thing", if you had a synonym rule for 
small:
small => tiny, teeny

input:
position | 1   | 2     | 3
token    | the | small | thing

would output:

position | 1   | 2     | 2    | 2     | 3
token    | the | small | tiny | teeny | thing

In this case, when the query parser gets back "small tiny teeny", since the 
tokens all have the same position, they are not turned into a phrase query.
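
A query-time analyzer along these lines would produce that kind of stacked 
output (a simplified sketch; the synonyms file name is just an example):

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- with expand="true" and a rule like "small, tiny, teeny", the filter
         keeps "small" and adds "tiny" and "teeny" at the same position -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>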

for "l'art"

input
postion|1 
token  |l'art

output
postion|1|2 
token  |l|art
In this case there are two tokens with different positions so it treats them as 
a phrase query.
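
The splitting itself comes from the WordDelimiterFilter; a query analyzer along 
these lines is enough to trigger it (a sketch, not our production chain):

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- splits "l'art" into "l" (position 1) and "art" (position 2) -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" splitOnCaseChange="1"/>
  </analyzer>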

Tom Burton-West


Re: bi-grams for common terms - any analyzers do that?

2010-09-25 Thread Yonik Seeley
On Sat, Sep 25, 2010 at 8:21 PM, Jonathan Rochkind  wrote:
> Huh, okay, I didn't know that #2 happened at all. Can you explain or point me 
> to documentation to explain when it happens?  I'm afraid I'm having trouble 
> understanding <<  if the analyzer returns more than one position back from a 
> "queryparser token" (whitespace). >>
>
> Not entirely sure what that means.  Can you give an example?

It's always happened, up until recently, when it was made configurable.
An example is IndexReader being split into two tokens by
WordDelimiterFilter and searched as "index reader" (i.e. the two terms
must be directly next to each other for the document to match).  If
the new "autoGeneratePhraseQueries" is off, position doesn't matter,
and the query will be treated as "index" OR "reader".

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


RE: bi-grams for common terms - any analyzers do that?

2010-09-25 Thread Jonathan Rochkind
Huh, okay, I didn't know that #2 happened at all. Can you explain or point me 
to documentation to explain when it happens?  I'm afraid I'm having trouble 
understanding <<  if the analyzer returns more than one position back from a 
"queryparser token" (whitespace). >>

Not entirely sure what that means.  Can you give an example?

As much as the query parser pre-tokenization is a problem in many cases (for me 
too), I'm not sure dismax could work without some pre-tokenization. Doesn't it 
need that so it can combine the scores of the individual words by "maximum 
disjunction"? It has to know what the individual terms are if it's going to 
dismax-combine them, no?

I'm not sure if "the queryparser forms a phrase query without explicit phrase 
quotes" is a problem for me, I had no idea it happened until now, never 
noticed, and still don't really understand in what circumstances it happens. 

Jonathan

From: Robert Muir [rcm...@gmail.com]
Sent: Saturday, September 25, 2010 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: bi-grams for common terms - any analyzers do that?

On Sat, Sep 25, 2010 at 10:33 AM, Jonathan Rochkind wrote:

> Wow, I never heard of autoGeneratePhraseQueries before. Is there any
> documentation of what it does?
>
> My initial reaction is being confused because this sounds kind of like the
> opposite of the original issue. The original issue is that the query parsers
> are splitting on whitespace _before_ they give tokens to the field
> analyzers.  The query parsers actually do this only with queries that are
> NOT explicit phrase queries.  I wouldn't call this behavior "automatically
> generating phrase queries" exactly, and wouldn't expect that turning off
> "automatic generating of phrase queries" would prevent the pre-tokenization
> by the query parser.  But... it does somehow?
>

this is in reference to Tom's comment on his "l'art" problem (
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance
 ).

so, there are two problems:
1. that the queryparser "pre-tokenizes" on whitespace at all.
2. that the queryparser forms a phrase query, if the analyzer returns more
than one position back from a "queryparser token" (whitespace).

turning off autoGeneratePhraseQueries only solves problem #2 (automatically
generated phrase queries aren't appropriate for many languages). Before this
option existed (e.g. in Solr 1.4.x), you had to use the PositionFilter to work
around the problem. But PositionFilter simply "flattens/stacks" the positions
(makes it seem as if they are all synonyms). With PositionFilter you couldn't
have phrase queries at all... and you don't get a BooleanQuery coordination
factor.

with autoGeneratePhraseQueries=false, you won't get a phrase query unless it
was in double quotes... it's just that simple.

fixing problem #1 altogether is the way to go, because then the
tokenization would be left to the analyzer completely, and you would have a
lot more flexibility: https://issues.apache.org/jira/browse/LUCENE-2605

--
Robert Muir
rcm...@gmail.com


Re: bi-grams for common terms - any analyzers do that?

2010-09-25 Thread Robert Muir
On Sat, Sep 25, 2010 at 10:33 AM, Jonathan Rochkind wrote:

> Wow, I never heard of autoGeneratePhraseQueries before. Is there any
> documentation of what it does?
>
> My initial reaction is being confused because this sounds kind of like the
> opposite of the original issue. The original issue is that the query parsers
> are splitting on whitespace _before_ they give tokens to the field
> analyzers.  The query parsers actually do this only with queries that are
> NOT explicit phrase queries.  I wouldn't call this behavior "automatically
> generating phrase queries" exactly, and wouldn't expect that turning off
> "automatic generating of phrase queries" would prevent the pre-tokenization
> by the query parser.  But... it does somehow?
>

this is in reference to Tom's comment on his "l'art" problem (
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance
 ).

so, there are two problems:
1. that the queryparser "pre-tokenizes" on whitespace at all.
2. that the queryparser forms a phrase query, if the analyzer returns more
than one position back from a "queryparser token" (whitespace).

turning off autoGeneratePhraseQueries only solves problem #2 (automatically
generated phrase queries aren't appropriate for many languages). Before this
option existed (e.g. in Solr 1.4.x), you had to use the PositionFilter to work
around the problem. But PositionFilter simply "flattens/stacks" the positions
(makes it seem as if they are all synonyms). With PositionFilter you couldn't
have phrase queries at all... and you don't get a BooleanQuery coordination
factor.
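
for reference, the old PositionFilter workaround looked roughly like this (a
sketch; query side only, and the tokenizer/filters are just examples):

  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
    <!-- flattens all positions so the query parser treats the split tokens
         like synonyms rather than a phrase; legitimate phrase queries break too -->
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>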

with autoGeneratePhraseQueries=false, you won't get a phrase query unless it
was in double quotes... it's just that simple.

fixing problem #1 altogether is the way to go, because then the
tokenization would be left to the analyzer completely, and you would have a
lot more flexibility: https://issues.apache.org/jira/browse/LUCENE-2605

-- 
Robert Muir
rcm...@gmail.com


RE: bi-grams for common terms - any analyzers do that?

2010-09-25 Thread Jonathan Rochkind
Wow, I never heard of autoGeneratePhraseQueries before. Is there any 
documentation of what it does?  

My initial reaction is being confused because this sounds kind of like the 
opposite of the original issue. The original issue is that the query parsers 
are splitting on whitespace _before_ they give tokens to the field analyzers.  
The query parsers actually do this only with queries that are NOT explicit 
phrase queries.  I wouldn't call this behavior "automatically generating phrase 
queries" exactly, and wouldn't expect that turning off "automatic generating of 
phrase queries" would prevent the pre-tokenization by the query parser.  But... 
it does somehow?

Can anyone point me to more info about what autoGeneratePhraseQueries does 
exactly?  If I can use it to turn off that behavior (ideally in a way that only 
turns it off for some fields but not others, even in a multi-field dismax query 
somehow), that would be pretty darn useful; I've been struggling with that for 
a while. 

Jonathan

From: Robert Muir [rcm...@gmail.com]
Sent: Saturday, September 25, 2010 6:46 AM
To: solr-user@lucene.apache.org
Subject: Re: bi-grams for common terms - any analyzers do that?

On Sat, Sep 25, 2010 at 1:04 AM, Andy  wrote:

>
> But I thought specialized analyzers like CJKAnalyzer are designed for those
> languages, which don't use whitespace to separate words.
>

yes


>
> Isn't it up to the tokenizer, not the QueryParser, to decide how to split
> the query into tokens?
>

yes


> I'm really confused.
>

actually it sounds like you understand the situation perfectly!!


> If Solr's QueryParser will only split on whitespace no matter what then
> what is the point of using CJKAnalyzer?


> It sounds like Solr would be pretty useless for languages like CJK. Is
> there any work around for this? Any CJK sites using Solr?
>

if you do not want all queries to be phrase queries, you should use:
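(roughly; a sketch with an example field type, where the CJK tokenizer stands
in for whatever analyzer you actually use:)

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100"
             autoGeneratePhraseQueries="false">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>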



then the lack of whitespace between words will not cause phrase queries. If
you use this option, phrase queries will only be generated when the user
explicitly puts terms in double quotes.

--
Robert Muir
rcm...@gmail.com


Re: bi-grams for common terms - any analyzers do that?

2010-09-25 Thread Robert Muir
On Sat, Sep 25, 2010 at 1:04 AM, Andy  wrote:

>
> But I thought specialized analyzers like CJKAnalyzer are designed for those
> languages, which don't use whitespace to separate words.
>

yes


>
> Isn't it up to the tokenizer, not the QueryParser, to decide how to split
> the query into tokens?
>

yes


> I'm really confused.
>

actually it sounds like you understand the situation perfectly!!


> If Solr's QueryParser will only split on whitespace no matter what then
> what is the point of using CJKAnalyzer?


> It sounds like Solr would be pretty useless for languages like CJK. Is
> there any work around for this? Any CJK sites using Solr?
>

if you do not want all queries to be phrase queries, you should use:
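(roughly; a sketch with an example field type, where the CJK tokenizer stands
in for whatever analyzer you actually use:)

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100"
             autoGeneratePhraseQueries="false">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>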



then the lack of whitespace between words will not cause phrase queries. If
you use this option, phrase queries will only be generated when the user
explicitly puts terms in double quotes.

-- 
Robert Muir
rcm...@gmail.com


RE: bi-grams for common terms - any analyzers do that?

2010-09-24 Thread Dennis Gearon
I'm looking at doing CJK applications by the middle of next year, and also 
European/Russian languages. Are the analyzers for all of those up and running?


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Fri, 9/24/10, Andy  wrote:

> From: Andy 
> Subject: RE: bi-grams for common terms - any analyzers do that?
> To: solr-user@lucene.apache.org
> Date: Friday, September 24, 2010, 10:04 PM
> 
> --- On Thu, 9/23/10, Burton-West, Tom 
> wrote:
> 
> > It also splits on whitespace which causes all CJK
> queries
> > to be treated as phrase queries regardless of the CJK
> > tokenizer you use. 
> 
> But I thought specialized analyzers like CJKAnalyzer are
> designed for those languages, which don't use whitespace to
> separate words. 
> 
> Isn't it up to the tokenizer, not the QueryParser, to
> decide how to split the query into tokens?
> 
> I'm really confused.
> 
> If Solr's QueryParser will only split on whitespace no
> matter what then what is the point of using CJKAnalyzer?
> 
> It sounds like Solr would be pretty useless for languages
> like CJK. Is there any work around for this? Any CJK sites
> using Solr?
> 
> 
>       
>


RE: bi-grams for common terms - any analyzers do that?

2010-09-24 Thread Andy

--- On Thu, 9/23/10, Burton-West, Tom  wrote:

> It also splits on whitespace which causes all CJK queries
> to be treated as phrase queries regardless of the CJK
> tokenizer you use. 

But I thought specialized analyzers like CJKAnalyzer are designed for those 
languages, which don't use whitespace to separate words. 

Isn't it up to the tokenizer, not the QueryParser, to decide how to split the 
query into tokens?

I'm really confused.

If Solr's QueryParser will only split on whitespace no matter what then what is 
the point of using CJKAnalyzer?

It sounds like Solr would be pretty useless for languages like CJK. Is there 
any work around for this? Any CJK sites using Solr?


  


Re: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Robert Muir
On Thu, Sep 23, 2010 at 12:02 PM, Burton-West, Tom wrote:
>
> The problem with "l'art" is actually due to a bug or feature in the
> QueryParser.  Currently the QueryParser interacts with the token chain and
> decides whether the tokens coming back from a token filter should be treated
> as a phrase query based on whether or not more than one non-synonym token
> comes back from the token stream for a single 'queryparser token'.
>

Just a note: in solr's trunk or 3x branch you have a lot more flexibility
already with this stuff:

1. for the specific problem of l'art: you can use the ElisionFilterFactory;
it's actually designed to address this. Before, it was a bit unwieldy to use
(you had to supply your own list of French contractions: l', m', etc.), but
with trunk or 3x you can just add it to your analyzer, and if you don't specify
a list it uses the default list from Lucene's FrenchAnalyzer (see the sketch
after point 3).

2. if you are using WordDelimiterFilter, you can customize how it splits on
a per-character basis. See https://issues.apache.org/jira/browse/SOLR-2059 ;
a user gave a nice example there of how you can treat '#' and '@' specially
for Twitter messages.

3. in all cases, if you don't want phrase queries automatically formed
unless the user puts them in quotes, you can turn it off in your fieldtype:
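a sketch of what I mean (factory and attribute names as in trunk/3x; the field
name and the rest of the analyzer chain are just examples, and it also shows
the ElisionFilter from point 1 with no explicit articles list):

  <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100"
             autoGeneratePhraseQueries="false">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- no "articles" attribute: falls back to the default French
           contraction list (l', m', d', ...) from Lucene's FrenchAnalyzer -->
      <filter class="solr.ElisionFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>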


(somewhat related)
Tom, thanks for posting your schema. Given your problems with huge numbers
of terms, I looked at your previous messages, ran some quick math, and
guesstimated that your average term length must be quite large.

Yet I notice from your website (
http://www.hathitrust.org/visualizations_languages ) that it says you have
18,329 Thai books (and you have no ThaiWordFilter in your schema).

Are you sure your terms are not filled with tons of very long untokenized
Thai sentences? (Thai uses no spaces between words.) Just an idea
:)

-- 
Robert Muir
rcm...@gmail.com


RE: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Burton-West, Tom
Hi all,

The CommonGrams filter is designed to only kick in for phrase queries.  It is 
meant to solve the problem of slow phrase queries containing common words, when 
you don't want to use stop words.  It would not make sense for Boolean queries; 
Boolean queries just get passed through unchanged. 

For background on the CommonGramsFilter please see: 
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

There are two filters, CommonGramsFilter and CommonGramsQueryFilter: you use 
CommonGramsFilter at index time and CommonGramsQueryFilter at query time.  
CommonGramsFilter outputs both common grams and unigrams so that Boolean queries 
(i.e. non-phrase queries) will work.  For example, "the rain" would produce 3 
tokens:
the       position 1
rain      position 2
the-rain  position 1
When you have a phrase query, you want Solr to search for the token "the-rain", 
so you don't want the unigrams.
When you have a Boolean query, the CommonGramsQueryFilter only gets one token 
at a time as input and simply outputs it unchanged.

Appended below is a sample config from our schema.xml.

For background on the problem with "l'art" please see: 
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance  
We used a custom filter to change all punctuation to spaces.  You could 
probably use one of the other filters to do this (see the comments from David 
Smiley at the end of the blog post regarding possible approaches).  At the 
time, I just couldn't get WordDelimiterFilter to behave as documented with 
various combinations of parameters and was not aware of the other filters 
David mentions.

The problem with "l'art" is actually due to a bug or feature in the 
QueryParser.  Currently the QueryParser interacts with the token chain and 
decides whether the tokens coming back from a tokenfilter should be treated as 
a phrase query based on whether or not more than one non-synonym token comes 
back from the tokestream for a single 'queryparser token'.
It also splits on whitespace which causes all CJK queries to be treated as 
phrase queries regardless of the CJK tokenizer you use. This is a contentious 
issue.  See https://issues.apache.org/jira/browse/LUCENE-2458.  There is a 
semi-workaround using PositionFilter, but it has many undesirable side effects. 
 I believe Robert Muir, who is an expert on the various problems involved and  
opened Lucene-2458 is working on a better fix.

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
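
Sample config (a simplified sketch of the relevant parts; our production field 
type has more filters, and the word-list path is just an example):

  <fieldType name="CommonGramsText" class="solr.TextField" positionIncrementGap="100">
    <!-- index side: emit both the common grams and the unigrams -->
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt"
              ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <!-- query side: for phrase queries emit only the common grams;
         a single term passes through unchanged -->
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.CommonGramsQueryFilterFactory" words="commongrams.txt"
              ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>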






RE: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Jonathan Rochkind
I've been thinking about the CommonGramsFilter for a while, and am confused 
about how it works.  Can anyone provide examples?  Are you meant to include the 
filter in the analyzer at both index and query time?  The description on the 
wiki says, among other things: "The CommonGramsQueryFilter converts the phrase 
query "the cat" into the single term query the_cat." -- does that mean it 
_only_ works on phrase queries?  If you've indexed with common grams, what will 
happen at query time to a non-phrase query <> ?  Very confused. 

From: Steven A Rowe [sar...@syr.edu]
Sent: Thursday, September 23, 2010 8:21 AM
To: solr-user@lucene.apache.org
Subject: RE: bi-grams for common terms - any analyzers do that?

<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory>


> -Original Message-
> From: Andy [mailto:angelf...@yahoo.com]
> Sent: Thursday, September 23, 2010 6:05 AM
> To: solr-user@lucene.apache.org
> Subject: bi-grams for common terms - any analyzers do that?
>
> Hi,
>
> I was going thru this LucidImagination presentation on analysis:
>
> http://www.slideshare.net/LucidImagination/analyze-this-tips-and-tricks-
> on-getting-the-lucene-solr-analyzer-to-index-and-search-your-content-right
>
> 1) on p.31-33, it talks about forming bi-grams for the 32 most common
> terms during indexing. Is there an analyzer that does that?
>
> 2) on p. 34, it mentions that the default Solr configuration would turn
> "L'art" into the phrase query "L art" but it is much more efficient to
> turn it into a single token 'L art'. Which analyzer would do that?
>
> Thanks.
> Andy
>
>
>


RE: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Steven A Rowe

<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory>

> -Original Message-
> From: Andy [mailto:angelf...@yahoo.com]
> Sent: Thursday, September 23, 2010 6:05 AM
> To: solr-user@lucene.apache.org
> Subject: bi-grams for common terms - any analyzers do that?
> 
> Hi,
> 
> I was going thru this LucidImagination presentation on analysis:
> 
> http://www.slideshare.net/LucidImagination/analyze-this-tips-and-tricks-
> on-getting-the-lucene-solr-analyzer-to-index-and-search-your-content-right
> 
> 1) on p.31-33, it talks about forming bi-grams for the 32 most common
> terms during indexing. Is there an analyzer that does that?
> 
> 2) on p. 34, it mentions that the default Solr configuration would turn
> "L'art" into the phrase query "L art" but it is much more efficient to
> turn it into a single token 'L art'. Which analyzer would do that?
> 
> Thanks.
> Andy
> 
> 
>