RE: bi-grams for common terms - any analyzers do that?
Hi Yonik,

>> If the new "autoGeneratePhraseQueries" is off, position doesn't matter, and
>> the query will be treated as "index" OR "reader".

Just wanted to make sure, in Solr does autoGeneratePhraseQueries = "off" treat the query with the *default* query operator as set in SolrConfig rather than necessarily using the Boolean "OR" operator?

i.e. if and autoGeneratePhraseQueries = off then "IndexReader" -> "index" "reader" -> "index" AND "reader"

Tom
RE: bi-grams for common terms - any analyzers do that?
Hi Jonathan,

>> I'm afraid I'm having trouble understanding "if the analyzer returns more
>> than one position back from a "queryparser token"

>> I'm not sure if "the queryparser forms a phrase query without explicit phrase
>> quotes" is a problem for me, I had no idea it happened until now, never
>> noticed, and still don't really understand in what circumstances it happens.

The problem I had was with a Boolean query "l'art AND historie": the WordDelimiterFilter tokenized "l'art" as two tokens, "l" at position 1 and "art" at position 2. So the queryparser decided this means a phrase query for "l" followed immediately by "art". See http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance for details.

This would happen whenever any token filter split a token into more than one token. For example a filter that splits foo-bar into "foo" "bar".

The exception is SynonymFilter or something like it. In the case of SynonymFilter, it's not really a case of "splitting" one token into multiple tokens; given one token of input, it outputs all the synonyms of the term. However, all the tokens have the same position attribute. (see: http://www.lucidimagination.com/search/document/CDRG_ch05_5.6.19?q=synonym%20filter)

So for example for the string "the small thing", if you had a synonym list for small (small=>tiny,teeny):

input:
position | 1   | 2     | 3
token    | the | small | thing

would output:
position | 1   | 2     | 2    | 2     | 3
token    | the | small | tiny | teeny | thing

In this case when the queryParser gets back "small teeny tiny", since they have the same position, they are not turned into a phrase query.

For "l'art":

input:
position | 1
token    | l'art

output:
position | 1 | 2
token    | l | art

In this case there are two tokens with different positions so it treats them as a phrase query.

Tom Burton-West
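The position rule described in this message can be pictured with a small Python sketch. It is only an illustration of the decision as explained here, not Lucene's actual query parser code; the function name `interpret_chunk` and the string labels it returns are made up for the example, and how the separate terms are combined when phrase generation is off is deliberately left open.

```python
from typing import List, Tuple

# A token as the analyzer would hand it back: (term, position).
Token = Tuple[str, int]


def interpret_chunk(tokens: List[Token],
                    auto_generate_phrase_queries: bool = True) -> str:
    """Describe how one whitespace-separated query chunk would be treated.

    Hypothetical helper: it models only the rule discussed in this thread.
    """
    terms = [term for term, _ in tokens]
    positions = {pos for _, pos in tokens}

    if len(tokens) == 1:
        # One token in, one token out: a plain term query.
        return "term(" + terms[0] + ")"
    if len(positions) == 1:
        # Several tokens stacked at the same position, as SynonymFilter
        # produces: treated as alternatives, never as a phrase.
        return "any_of(" + ", ".join(terms) + ")"
    if auto_generate_phrase_queries:
        # Several positions came back from a single chunk: implicit phrase.
        return 'phrase("' + " ".join(terms) + '")'
    # Phrase generation switched off: the tokens stay separate terms.
    return "separate_terms(" + ", ".join(terms) + ")"


print(interpret_chunk([("l", 1), ("art", 2)]))
print(interpret_chunk([("small", 2), ("tiny", 2), ("teeny", 2)]))
print(interpret_chunk([("l", 1), ("art", 2)],
                      auto_generate_phrase_queries=False))
```

The first call models "l'art" after WordDelimiterFilter, the second models the synonym case, and the third models the same "l'art" tokens with automatic phrase generation turned off.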
Re: bi-grams for common terms - any analyzers do that?
On Sat, Sep 25, 2010 at 8:21 PM, Jonathan Rochkind wrote:
> Huh, okay, I didn't know that #2 happened at all. Can you explain or point me
> to documentation to explain when it happens? I'm afraid I'm having trouble
> understanding << if the analyzer returns more than one position back from a
> "queryparser token" (whitespace). >>
>
> Not entirely sure what that means. Can you give an example?

It's always happened, up until recently when it's been made configurable.

An example is IndexReader being split into two tokens by WordDelimiterFilter and searched as "index reader" (i.e. the two terms must be directly next to each other for the document to match).

If the new "autoGeneratePhraseQueries" is off, position doesn't matter, and the query will be treated as "index" OR "reader".

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
RE: bi-grams for common terms - any analyzers do that?
Huh, okay, I didn't know that #2 happened at all. Can you explain or point me to documentation to explain when it happens? I'm afraid I'm having trouble understanding << if the analyzer returns more than one position back from a "queryparser token" (whitespace). >>

Not entirely sure what that means. Can you give an example?

As much as the query parser pre-tokenization is a problem in many cases (for me too), I'm not sure if dismax could happen without some pre-tokenization, doesn't it need that so it can combine the scores of the individual words by "maximum disjunction" -- it's got to know what the individual terms are, if it's going to dismax combine them, no?

I'm not sure if "the queryparser forms a phrase query without explicit phrase quotes" is a problem for me, I had no idea it happened until now, never noticed, and still don't really understand in what circumstances it happens.

Jonathan

In reply to: Robert Muir [rcm...@gmail.com], Saturday, September 25, 2010 10:58 AM
Re: bi-grams for common terms - any analyzers do that?
On Sat, Sep 25, 2010 at 10:33 AM, Jonathan Rochkind wrote:
> Wow, I never heard of autoGeneratePhraseQueries before. Is there any
> documentation of what it does?
>
> My initial reaction is being confused because this sounds kind of like the
> opposite of the original issue. The original issue is that the query parsers
> are splitting on whitespace _before_ they give tokens to the field
> analyzers. The query parsers actually do this only with queries that are
> NOT explicit phrase queries. I wouldn't call this behavior "automatically
> generating phrase queries" exactly, and wouldn't expect that turning off
> "automatic generating of phrase queries" would prevent the pre-tokenization
> by the query parser. But... it does somehow?

this is in reference to Tom's comment on his "l'art" problem ( http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance ).

so, there are two problems:
1. that the queryparser "pre-tokenizes" on whitespace at all.
2. that the queryparser forms a phrase query, if the analyzer returns more than one position back from a "queryparser token" (whitespace).

turning off autoGeneratePhraseQueries only solves problem #2 (automatic phrase queries are not appropriate for many languages).

Before this option (e.g. Solr 1.4.x), you had to use the PositionFilter to work around this problem. But PositionFilter simply "flattens/stacks" the positions (makes it seem as if they are all synonyms). With PositionFilter you couldn't have phrase queries at all... and you don't get a BooleanQuery coordination factor.

with autoGeneratePhraseQueries=false, you won't get a phrase query unless it was in double quotes... it's just that simple.

fixing problem #1 altogether is the way to go, because then the tokenization would be left to the analyzer completely, and you would have a lot more flexibility: https://issues.apache.org/jira/browse/LUCENE-2605

--
Robert Muir
rcm...@gmail.com
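To make the "flattens/stacks" remark concrete, here is a rough Python sketch of what stacking positions means. It is an illustration under the assumption that every token after the first is placed at the first token's position; the helper name `flatten_positions` is invented for the example and this is not the PositionFilter source.

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term, position)


def flatten_positions(tokens: List[Token]) -> List[Token]:
    """Stack every token on the position of the first one (hypothetical helper)."""
    if not tokens:
        return []
    first_position = tokens[0][1]
    # After flattening, all terms share one position, so a parser that looks
    # only at positions sees them the same way it sees synonyms.
    return [(term, first_position) for term, _ in tokens]


print(flatten_positions([("l", 1), ("art", 2)]))
```

Because the result has a single position, the "more than one position" condition from problem #2 never triggers, which is also why an explicitly quoted phrase can no longer be honored on a field analyzed this way.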
RE: bi-grams for common terms - any analyzers do that?
Wow, I never heard of autoGeneratePhraseQueries before. Is there any documentation of what it does?

My initial reaction is being confused because this sounds kind of like the opposite of the original issue. The original issue is that the query parsers are splitting on whitespace _before_ they give tokens to the field analyzers. The query parsers actually do this only with queries that are NOT explicit phrase queries. I wouldn't call this behavior "automatically generating phrase queries" exactly, and wouldn't expect that turning off "automatic generating of phrase queries" would prevent the pre-tokenization by the query parser. But... it does somehow?

Can anyone point me to more info about what autoGeneratePhraseQueries does exactly? If I can use it to turn off that behavior (in a way that only turns it off for some fields but not others even in a multi-field dismax query somehow?) that would be pretty darn useful, I've been struggling with that for a while.

Jonathan

In reply to: Robert Muir [rcm...@gmail.com], Saturday, September 25, 2010 6:46 AM
Re: bi-grams for common terms - any analyzers do that?
On Sat, Sep 25, 2010 at 1:04 AM, Andy wrote:
> But I thought specialized analyzers like CJKAnalyzer are designed for those
> languages, which don't use whitespace to separate words.

yes

> Isn't it up to the tokenizer, not the QueryParser, to decide how to split
> the query into tokens?

yes

> I'm really confused.

actually it sounds like you understand the situation perfectly!!

> If Solr's QueryParser will only split on whitespace no matter what then
> what is the point of using CJKAnalyzer?
> It sounds like Solr would be pretty useless for languages like CJK. Is
> there any work around for this? Any CJK sites using Solr?

if you do not want all queries to be phrasequeries, you should use: then the lack of whitespace between words will not cause phrase queries. if you use this option, phrase queries will only be caused if the user explicitly puts terms in double quotes.

--
Robert Muir
rcm...@gmail.com
RE: bi-grams for common terms - any analyzers do that?
I'm looking for doing CJK applications by mid next year, also Euro/Russian. Are the analyzers for all those up and running?

Dennis Gearon

In reply to: Andy, Friday, September 24, 2010, 10:04 PM
RE: bi-grams for common terms - any analyzers do that?
--- On Thu, 9/23/10, Burton-West, Tom wrote:
> It also splits on whitespace which causes all CJK queries
> to be treated as phrase queries regardless of the CJK
> tokenizer you use.

But I thought specialized analyzers like CJKAnalyzer are designed for those languages, which don't use whitespace to separate words.

Isn't it up to the tokenizer, not the QueryParser, to decide how to split the query into tokens?

I'm really confused.

If Solr's QueryParser will only split on whitespace no matter what then what is the point of using CJKAnalyzer?

It sounds like Solr would be pretty useless for languages like CJK. Is there any work around for this? Any CJK sites using Solr?
Re: bi-grams for common terms - any analyzers do that?
On Thu, Sep 23, 2010 at 12:02 PM, Burton-West, Tom wrote:
> The problem with "l'art" is actually due to a bug or feature in the
> QueryParser. Currently the QueryParser interacts with the token chain and
> decides whether the tokens coming back from a tokenfilter should be treated
> as a phrase query based on whether or not more than one non-synonym token
> comes back from the tokenstream for a single 'queryparser token'.

Just a note: in solr's trunk or 3x branch you have a lot more flexibility already with this stuff:

1. for the specific problem of l'art: you can use the ElisionFilterFactory, it's actually designed to address this. But before it was a bit unwieldy to use (you had to supply your own list of french contractions: l', m', etc): with trunk or 3x you can just add it to your analyzer, if you don't specify a list it uses the default list from Lucene's FrenchAnalyzer.
2. if you are using WordDelimiterFilter, you can customize how it splits on a per-character basis. See https://issues.apache.org/jira/browse/SOLR-2059 , a user gave a nice example there of how you can treat '#' and '@' special for twitter messages.
3. in all cases, if you don't want phrase queries automatically formed unless the user put them in quotes, you can turn it off in your fieldtype:

(somewhat related) Tom, thanks for posting your schema. given your problems with huge amounts of terms, i looked at your previous messages and ran some quick math and guesstimated your average term length must be quite large. Yet i notice from your website ( http://www.hathitrust.org/visualizations_languages) it says you have 18,329 thai books (and you have no ThaiWordFilter in your schema). Are you sure that your terms are not filled with tons of very long untokenized thai sentences? (thai uses no spaces between words)

just an idea :)

--
Robert Muir
rcm...@gmail.com
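For point 1, the idea behind elision removal can be sketched in a few lines of Python. This is only a rough picture of the technique, not the ElisionFilter implementation: the function name `strip_elision` is invented, and the contraction list below is an assumption for illustration, not the default list shipped with Lucene's FrenchAnalyzer.

```python
# Assumed, illustrative list of elided French articles and pronouns.
ASSUMED_ELISIONS = frozenset({"l", "m", "t", "qu", "n", "s", "j"})


def strip_elision(token: str, elisions=ASSUMED_ELISIONS) -> str:
    """Drop a leading elided article such as l' so one token stays one token."""
    for apostrophe in ("'", "\u2019"):
        head, separator, tail = token.partition(apostrophe)
        # Only strip when an apostrophe was found, something follows it,
        # and the part before it is a known contraction.
        if separator and tail and head.lower() in elisions:
            return tail
    return token


print(strip_elision("l'art"))
print(strip_elision("don't"))
```

The point of the design is that "l'art" leaves the analyzer as the single token "art", so the query parser never sees two positions for one whitespace-separated chunk and has no reason to build a phrase query.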
RE: bi-grams for common terms - any analyzers do that?
Hi all,

The CommonGrams filter is designed to only work on phrase queries. It is designed to solve the problem of slow phrase queries with phrases containing common words, when you don't want to use stop words. It would not make sense for Boolean queries. Boolean queries just get passed through unchanged.

For background on the CommonGramsFilter please see: http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2

There are two filters, CommonGramsFilter and CommonGramsQueryFilter: you use CommonGramsFilter on indexing and CommonGramsQueryFilter for query processing. CommonGramsFilter outputs both CommonGrams and Unigrams so that Boolean queries (i.e. non-phrase queries) will work. For example "the rain" would produce 3 tokens:

the       position 1
rain      position 2
the-rain  position 1

When you have a phrase query, you want Solr to search for the token "the-rain" so you don't want the unigrams. When you have a Boolean query, the CommonGramsQueryFilter only gets one token as input and simply outputs it.

Appended below is a sample config from our schema.xml.

For background on the problem with "l'art" please see: http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance

We used a custom filter to change all punctuation to spaces. You could probably use one of the other filters to do this. (See the comments from David Smiley at the end of the blog post regarding possible approaches.) At the time, I just couldn't get WordDelimiterFilter to behave as documented with various combinations of parameters and was not aware of the other filters David mentions.

The problem with "l'art" is actually due to a bug or feature in the QueryParser. Currently the QueryParser interacts with the token chain and decides whether the tokens coming back from a tokenfilter should be treated as a phrase query based on whether or not more than one non-synonym token comes back from the tokenstream for a single 'queryparser token'.
It also splits on whitespace, which causes all CJK queries to be treated as phrase queries regardless of the CJK tokenizer you use. This is a contentious issue. See https://issues.apache.org/jira/browse/LUCENE-2458. There is a semi-workaround using PositionFilter, but it has many undesirable side effects. I believe Robert Muir, who is an expert on the various problems involved and opened LUCENE-2458, is working on a better fix.

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
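The index-time and query-time behavior of the two CommonGrams filters described in this message can be sketched in Python as follows. This is a simplified model, not the Solr code: the function names are invented, the "-" separator simply follows the "the-rain" example in the message, and forming a bigram whenever either neighbor is a common word is an assumption of the sketch.

```python
from typing import List, Set, Tuple


def common_grams_index(words: List[str], common: Set[str],
                       sep: str = "-") -> List[Tuple[str, int]]:
    """Index side: keep every unigram and add a bigram next to common words."""
    tokens = []
    for i, word in enumerate(words):
        position = i + 1
        tokens.append((word, position))
        has_next = i + 1 < len(words)
        if has_next and (word in common or words[i + 1] in common):
            # The bigram shares the position of its first word.
            tokens.append((word + sep + words[i + 1], position))
    return tokens


def common_grams_query(words: List[str], common: Set[str],
                       sep: str = "-") -> List[str]:
    """Query side: prefer bigrams and drop unigrams that a bigram covers."""
    tokens = []
    covered_by_previous = False
    for i, word in enumerate(words):
        has_next = i + 1 < len(words)
        if has_next and (word in common or words[i + 1] in common):
            tokens.append(word + sep + words[i + 1])
            covered_by_previous = True
        else:
            if not covered_by_previous:
                tokens.append(word)
            covered_by_previous = False
    return tokens


print(common_grams_index(["the", "rain"], {"the"}))
print(common_grams_query(["the", "rain"], {"the"}))  # phrase query
print(common_grams_query(["rain"], {"the"}))         # single Boolean term
```

With "the" as the only common word, the index side yields the three tokens listed in the message, the phrase "the rain" collapses to the single term "the-rain" at query time, and a lone Boolean term passes through unchanged.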
RE: bi-grams for common terms - any analyzers do that?
I've been thinking about the CommonGramsFilter for a while, and am confused about how it works. Can anyone provide examples? Are you meant to include the analyzer at both index and query time?

The description on the wiki says among other things: "The CommonGramsQueryFilter converts the phrase query "the cat" into the single term query the_cat." -- does that mean it _only_ works on phrase queries? If you've indexed with commongrams, what will happen at query time to a non-phrase query <> ? Very confused.

In reply to: Steven A Rowe [sar...@syr.edu], Thursday, September 23, 2010 8:21 AM
RE: bi-grams for common terms - any analyzers do that?
<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory>

In reply to: Andy [mailto:angelf...@yahoo.com], Thursday, September 23, 2010 6:05 AM
bi-grams for common terms - any analyzers do that?
Hi,

I was going thru this Lucid Imagination presentation on analysis:

http://www.slideshare.net/LucidImagination/analyze-this-tips-and-tricks-on-getting-the-lucene-solr-analyzer-to-index-and-search-your-content-right

1) on p.31-33, it talks about forming bi-grams for the 32 most common terms during indexing. Is there an analyzer that does that?

2) on p. 34, it mentions that the default Solr configuration would turn "L'art" into the phrase query "L art" but it is much more efficient to turn it into a single token 'L art'. Which analyzer would do that?

Thanks.
Andy