Re: Dismax , query phrases
: how would it fit c:"some phrase" into that structure?
:
: does this make sense?
:
: ( (a:some | b:some ) (a:phrase | b:phrase) ( c:"some phrase") )

that's pretty much exactly what pf does, the only distinction is you get...

    +( (a:some | b:some ) (a:phrase | b:phrase) ) ( c:"some phrase" )

...where the mm param only applies to the (mandatory) boolean built using the qf.

-Hoss
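As a rough illustration of the structure described above, here is a minimal Python sketch that builds the same kind of query string from q, qf and pf. It is a toy string model, not Solr's implementation: the function name dismax_sketch is hypothetical, and it ignores boosts, tie, mm and per-field analysis.

```python
def dismax_sketch(q, qf, pf):
    """Return a string resembling the parsed dismax query (toy model only)."""
    # Whitespace chunking happens before any field-specific analysis.
    chunks = q.split()
    # One disjunction per chunk, trying that chunk against every qf field.
    per_chunk = [
        "(" + " | ".join(f"{field}:{chunk}" for field in qf) + ")"
        for chunk in chunks
    ]
    mandatory = "+( " + " ".join(per_chunk) + " )"
    if not pf:
        return mandatory
    # pf fields receive the whole input as one phrase, as an optional clause.
    phrase = " | ".join(f'{field}:"{q}"' for field in pf)
    return f"{mandatory} ( {phrase} )"


print(dismax_sketch("some phrase", qf=["a", "b"], pf=["c"]))
# prints: +( (a:some | b:some) (a:phrase | b:phrase) ) ( c:"some phrase" )
```

Only the first, mandatory group is built chunk by chunk; the pf clause sits outside it, which is why it can raise the score of exact matches without restricting which documents match.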
Re: Dismax , query phrases
On Tue, 30 Sep 2008 11:43:57 -0700 (PDT)
Chris Hostetter [EMAIL PROTECTED] wrote:

> : That's why I was wondering how Dismax breaks it all apart. It makes sense...I
> : suppose what I'd like to have is a way to tell dismax which fields NOT to
> : tokenize the input for. For these fields, it would pass the full q instead of
> : each part of it. Does this make sense? would it be useful at all?
>
> the *goal* makes sense, but the implementation would be ... problematic.
>
> you have to remember the DisMax parser's whole way of working is to make
> each chunk of input match against any qf field, and find the highest
> scoring field for each chunk, with this input...
>
>     q = some phrase
>     qf = a b c
>
> ...you get...
>
>     ( (a:some | b:some | c:some) (a:phrase | b:phrase | c:phrase) )
>
> ...even if dismax could tell that c was a field that should only support
> exact matches,

thanks Hoss, it would be a configuration option.

> how would it fit c:"some phrase" into that structure?

does this make sense?

    ( (a:some | b:some ) (a:phrase | b:phrase) ( c:"some phrase") )

> I've already kinda forgotten how this thread started ...

trying to get *exact* matches to always score higher using dismax - keeping in mind that I have multiple exact fields, with different boosts...

> but would it make sense to just use your exact fields in the pf, and have
> inexact versions of them in the qf? then docs that match your input
> exactly should score at the top, but less exact matches will also still
> match.

aha! right, i think that makes sense...i obviously haven't got my head properly around all the different functionality of dismax.

I will try it when I'm back @ work... right now, i seem to have solved the problem by using shingles - the fields are artists, song and album titles, so a high match on shingles is a close approximation of exact matching - except that I had to remove stopwords, so that impacts on performance.

Thanks again :)
B
{Beto|Norberto|Numard} Meijome

Which is worse: ignorance or apathy? Don't know. Don't care.
I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: Dismax , query phrases
: That's why I was wondering how Dismax breaks it all apart. It makes sense...I
: suppose what I'd like to have is a way to tell dismax which fields NOT to
: tokenize the input for. For these fields, it would pass the full q instead of
: each part of it. Does this make sense? would it be useful at all?

the *goal* makes sense, but the implementation would be ... problematic.

you have to remember the DisMax parser's whole way of working is to make each chunk of input match against any qf field, and find the highest scoring field for each chunk, with this input...

    q = some phrase
    qf = a b c

...you get...

    ( (a:some | b:some | c:some) (a:phrase | b:phrase | c:phrase) )

...even if dismax could tell that c was a field that should only support exact matches, how would it fit c:"some phrase" into that structure?

I've already kinda forgotten how this thread started ... but would it make sense to just use your exact fields in the pf, and have inexact versions of them in the qf? then docs that match your input exactly should score at the top, but less exact matches will also still match.

-Hoss
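The suggestion above (exact fields in pf, inexact versions in qf) might translate into request parameters along these lines. This is only a sketch: the field names and boosts are borrowed from the config posted in this thread, while the host, port and path in the URL are assumptions about a local test install, and build_params is a hypothetical helper.

```python
from urllib.parse import urlencode, parse_qs


def build_params(user_query):
    """Assemble dismax parameters with exact fields moved from qf to pf."""
    return {
        "defType": "dismax",
        "q": user_query,
        # qf: tokenized ("inexact") fields, matched one chunk at a time.
        "qf": "title^6.0 artist^4.0 title_ngram3^4.5 artist_ngram3^3.5",
        # pf: keyword-tokenized ("exact") fields, matched against the whole input.
        "pf": "title_exact^200.0 artist_exact^100.0",
        "debugQuery": "true",
    }


query_string = urlencode(build_params("the doors"))
# Assumed local endpoint, for illustration only.
print("http://localhost:8983/solr/select?" + query_string)
```

With debugQuery enabled, the parsedquery in the response shows whether the pf clause really carries the whole input as a single term on the *_exact fields.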
Re: Dismax , query phrases
On Fri, 26 Sep 2008 10:42:42 -0700 (PDT)
Chris Hostetter [EMAIL PROTECTED] wrote:

> : <tokenizer
> : class="solr.KeywordTokenizerFactory" /> <!-- The LowerCase TokenFilter does
>
> : Now, when I search with ?q=the doors , all the terms in my q= aren't used
> : together to build the dismaxQuery , so I never get a match on the _exact fields:
>
> The query parser (even the dismax queryparser) does its white space
> chunking before handing any input off to the analyzer for the appropriate
> field, so with [[ ?q=the doors ]] "the" and "doors" are going to get
> analyzed separately ... which is why you see artist_exact:the^100.0 and
> artist_exact:doors^100.0 in your parsedquery -- *BUT* since you used
> KeywordTokenizer at index time, you'll never get a match for either of
> those on any document (unless the artist is just "the" or "doors")

Hi Hoss :)
thanks for the feedback - I arrived @ the same conclusion. The biz requirement is that these *_exact fields match exactly the original contents of the field. Right now we are using Dismax, and changing this means rewriting a lot of the queries, which isn't possible.

That's why I was wondering how Dismax breaks it all apart. It makes sense...I suppose what I'd like to have is a way to tell dismax which fields NOT to tokenize the input for. For these fields, it would pass the full q instead of each part of it. Does this make sense? would it be useful at all?

> : I've tried with other queries that don't include stopwords (smashing pumpkins,
> : for example), and in all cases, if I don't use quotes, only the LAST word is used
> : with my _exact fields ( tried with 1, 2 and 3 words, always the same against my
> : _exact fields..)
>
> this LAST word part doesn't make sense to me ... you can see "the" making
> it into your query on the *_exact fields in the first
> DisjunctionMaxQuery, do you have toStrings for these other queries we
> could see to understand what you mean?

I agree, it makes sense as you say...i must have missed the initial tokens. I can't confirm atm, so I'll follow the common sense path :)

As usual, thanks for your time and insights :)
B
{Beto|Norberto|Numard} Meijome

"Humans die and turn to dust, but writing makes us remembered"
   4000-year-old words of an Egyptian scribe
Re: Dismax , query phrases
I'm not fully following everything you've got here, but one thing jumped out at me...

: <tokenizer
: class="solr.KeywordTokenizerFactory" /> <!-- The LowerCase TokenFilter does

: Now, when I search with ?q=the doors , all the terms in my q= aren't used
: together to build the dismaxQuery , so I never get a match on the _exact fields:

The query parser (even the dismax queryparser) does its white space chunking before handing any input off to the analyzer for the appropriate field, so with [[ ?q=the doors ]] "the" and "doors" are going to get analyzed separately ... which is why you see artist_exact:the^100.0 and artist_exact:doors^100.0 in your parsedquery -- *BUT* since you used KeywordTokenizer at index time, you'll never get a match for either of those on any document (unless the artist is just "the" or "doors")

: I've tried with other queries that don't include stopwords (smashing pumpkins,
: for example), and in all cases, if I don't use quotes, only the LAST word is used
: with my _exact fields ( tried with 1, 2 and 3 words, always the same against my
: _exact fields..)

this LAST word part doesn't make sense to me ... you can see "the" making it into your query on the *_exact fields in the first DisjunctionMaxQuery, do you have toStrings for these other queries we could see to understand what you mean?

-Hoss
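The mismatch described above can be mimicked in a few lines of Python. This is a simplified simulation under stated assumptions, not Solr code: keyword_lowercase_analyzer stands in for the KeywordTokenizer + LowerCaseFilter + TrimFilter chain, and exact_field_matches is a hypothetical helper that treats a match as simple term equality.

```python
def keyword_lowercase_analyzer(text):
    """Stand-in for the keyword analyzer chain: the whole input becomes one token."""
    return [text.lower().strip()]


def exact_field_matches(indexed_value, q, quoted):
    """Return True if any query term equals the single indexed term."""
    indexed_terms = set(keyword_lowercase_analyzer(indexed_value))
    # Unquoted input is split on whitespace *before* analysis; a quoted
    # phrase reaches the analyzer as a single chunk.
    chunks = [q] if quoted else q.split()
    query_terms = [
        term for chunk in chunks for term in keyword_lowercase_analyzer(chunk)
    ]
    return any(term in indexed_terms for term in query_terms)


print(exact_field_matches("The Doors", "the doors", quoted=False))  # False
print(exact_field_matches("The Doors", "the doors", quoted=True))   # True
print(exact_field_matches("Doors", "the doors", quoted=False))      # True
```

The third call corresponds to the one case where the unquoted query can still hit an exact field: the stored value is just one of the chunks.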
Re: Dismax , query phrases
On Wed, 24 Sep 2008 08:34:57 -0700 (PDT)
Otis Gospodnetic [EMAIL PROTECTED] wrote:

> What happens if you change ps from 100 to 1 and comment out that ord function?

Otis, I think what I am after is what Hoss described in his last paragraph in his reply to your email last year:

http://www.nabble.com/DisMax-and-REQUIRED-OR-REQUIRED-query-rewrite-td13395349.html#a13395349

ie, I want everything that Dismax does, BUT, on certain fields, I want it to search for all the terms in my q=, as a phrase. I am thinking of modifying dismax to allow this to be passed as a configuration (eg, fieldsSearchExact=artist_exact, title_exact), but if I can avoid it that'd be great :).

any other ideas, anyone??

thanks!
B
{Beto|Norberto|Numard} Meijome

"Nature doesn't care how smart you are. You can still be wrong."
   Richard Feynman
Dismax , query phrases
Hello,
I've seen references to this in the list, but not completely explained...my apologies if this is FAQ (and for the length of the email).

I am using dismax across a number of fields on an index with data about music albums and songs - the fields are quite full of stop words. I am trying to boost 'exact' matches - ie, if you search for 'The Doors', those documents with 'The Doors' should be first. I've created the following fieldType and I use it for fields artist_exact and title_exact:

<fieldType name="lowerCaseString" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <!-- The LowerCase TokenFilter does what you expect, which can be
         when you want your sorting to be case insensitive -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>

I then give artist_exact and title_exact pretty high boosts ( title_exact^200.0 artist_exact^100.0 )

Now, when I search with ?q=the doors , all the terms in my q= aren't used together to build the dismaxQuery , so I never get a match on the _exact fields:

(there are a few other fields involved...pretty self explanatory)

<str name="rawquerystring">the doors</str>
<str name="querystring">the doors</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title_ngram2:th he^0.1 | artist_ngram2:th he^0.1 | title_ngram3:the^4.5 | artist_ngram3:the^3.5 | artist_exact:the^100.0 | title_exact:the^200.0)~0.01) DisjunctionMaxQuery((genre:door^0.2 | title_ngram2:do oo or rs^0.1 | artist_ngram2:do oo or rs^0.1 | title_ngram3:doo oor ors^4.5 | title:door^6.0 | artist_ngram3:doo oor ors^3.5 | artist:door^4.0 | artist_exact:doors^100.0 | title_exact:doors^200.0)~0.01))~2) DisjunctionMaxQuery((title:door^2.0 | artist:door^0.8)~0.01) FunctionQuery((ord(release_year))^0.5)
</str>
<str name="parsedquery_toString">
+(((title_ngram2:th he^0.1 | artist_ngram2:th he^0.1 | title_ngram3:the^4.5 | artist_ngram3:the^3.5 | artist_exact:the^100.0 | title_exact:the^200.0)~0.01 (genre:door^0.2 | title_ngram2:do oo or rs^0.1 | artist_ngram2:do oo or rs^0.1 | title_ngram3:doo oor ors^4.5 | title:door^6.0 | artist_ngram3:doo oor ors^3.5 | artist:door^4.0 | artist_exact:doors^100.0 | title_exact:doors^200.0)~0.01)~2) (title:door^2.0 | artist:door^0.8)~0.01 (ord(release_year))^0.5
</str>

but, if I build my search as ?q="the doors"

<str name="parsedquery">
+DisjunctionMaxQuery((genre:door^0.2 | title_ngram2:th he e d do oo or rs^0.1 | artist_ngram2:th he e d do oo or rs^0.1 | title_ngram3:the he e d do doo oor ors^4.5 | title:door^6.0 | artist_ngram3:the he e d do doo oor ors^3.5 | artist:door^4.0 | artist_exact:the doors^100.0 | title_exact:the doors^200.0)~0.01) DisjunctionMaxQuery((title:door^2.0 | artist:door^0.8)~0.01) FunctionQuery((ord(release_year))^0.5)
</str>
<str name="parsedquery_toString">
+(genre:door^0.2 | title_ngram2:th he e d do oo or rs^0.1 | artist_ngram2:th he e d do oo or rs^0.1 | title_ngram3:the he e d do doo oor ors^4.5 | title:door^6.0 | artist_ngram3:the he e d do doo oor ors^3.5 | artist:door^4.0 | artist_exact:the doors^100.0 | title_exact:the doors^200.0)~0.01 (title:door^2.0 | artist:door^0.8)~0.01 (ord(release_year))^0.5
</str>

I've tried with other queries that don't include stopwords (smashing pumpkins, for example), and in all cases, if I don't use quotes, only the LAST word is used with my _exact fields ( tried with 1, 2 and 3 words, always the same against my _exact fields..)

What is the reason for this behaviour?

my full dismax config is :

<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
<str name="spellcheck">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="tie">0.01</str>
<str name="qf">
title_exact^200.0 artist_exact^100.0 title^6.0 title_ngram3^4.5 artist^4.0 artist_ngram3^3.5 title_ngram2^0.1 artist_ngram2^0.1 genre^0.2
</str>
<str name="q.alt">*:*</str>
<str name="spellcheck.collate">true</str>
<str name="defType">dismax</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="rows">10</str>
<str name="pf">title^2.0 artist^0.8</str>
<str name="echoParams">all</str>
<str name="fl">*,score</str>
<str name="bf">ord(release_year)^0.5</str>
<str name="spellcheck.count">1</str>
<str name="ps">100</str>
</lst>

TIA!
B
{Beto|Norberto|Numard} Meijome

"Never offend people with style when you can offend them with substance."
   Sam Brown
Re: Dismax , query phrases
What happens if you change ps from 100 to 1 and comment out that ord function?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Norberto Meijome [EMAIL PROTECTED]
To: SOLR-Usr-ML solr-user@lucene.apache.org
Sent: Wednesday, September 24, 2008 11:23:18 AM
Subject: Dismax , query phrases
Re: Dismax , query phrases
On Wed, 24 Sep 2008 08:34:57 -0700 (PDT)
Otis Gospodnetic [EMAIL PROTECTED] wrote:

> What happens if you change ps from 100 to 1 and comment out that ord function?
>
> Otis

Hi Otis,
no luck - without quotes:

<str name="rawquerystring">smashing pumpkins</str>
<str name="querystring">smashing pumpkins</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((genre:smash^0.2 | title_ngram2:sm ma as sh hi in ng^0.1 | artist_ngram2:sm ma as sh hi in ng^0.1 | title_ngram3:sma mas ash shi hin ing^4.5 | title:smash^6.0 | artist_ngram3:sma mas ash shi hin ing^3.5 | artist:smash^4.0 | artist_exact:smashing^100.0 | title_exact:smashing^200.0)~0.01) DisjunctionMaxQuery((genre:pumpkin^0.2 | title_ngram2:pu um mp pk ki in ns^0.1 | artist_ngram2:pu um mp pk ki in ns^0.1 | title_ngram3:pum ump mpk pki kin ins^4.5 | title:pumpkin^6.0 | artist_ngram3:pum ump mpk pki kin ins^3.5 | artist:pumpkin^4.0 | artist_exact:pumpkins^100.0 | title_exact:pumpkins^200.0)~0.01))~2) DisjunctionMaxQuery((title:smash pumpkin~1^2.0 | artist:smash pumpkin~1^0.8)~0.01)
</str>
<str name="parsedquery_toString">
+(((genre:smash^0.2 | title_ngram2:sm ma as sh hi in ng^0.1 | artist_ngram2:sm ma as sh hi in ng^0.1 | title_ngram3:sma mas ash shi hin ing^4.5 | title:smash^6.0 | artist_ngram3:sma mas ash shi hin ing^3.5 | artist:smash^4.0 | artist_exact:smashing^100.0 | title_exact:smashing^200.0)~0.01 (genre:pumpkin^0.2 | title_ngram2:pu um mp pk ki in ns^0.1 | artist_ngram2:pu um mp pk ki in ns^0.1 | title_ngram3:pum ump mpk pki kin ins^4.5 | title:pumpkin^6.0 | artist_ngram3:pum ump mpk pki kin ins^3.5 | artist:pumpkin^4.0 | artist_exact:pumpkins^100.0 | title_exact:pumpkins^200.0)~0.01)~2) (title:smash pumpkin~1^2.0 | artist:smash pumpkin~1^0.8)~0.01
</str>

Still OK if I include quotes ...

I am trying on another setup, with same data, to work with shingles rather than on 'exact' ... dismax seems to handle it much better...but it may be that I haven't added to that config all the ngram3 fields for substring matching...

the resulting params were :

<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
<str name="spellcheck">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="tie">0.01</str>
<str name="tr">store_albums.xsl</str>
<str name="qf">
title_exact^200.0 artist_exact^100.0 title^6.0 title_ngram3^4.5 artist^4.0 artist_ngram3^3.5 title_ngram2^0.1 artist_ngram2^0.1 genre^0.2
</str>
<str name="q.alt">*:*</str>
<str name="spellcheck.collate">true</str>
<str name="wt">xml</str>
<str name="defType">dismax</str>
<str name="rows">10</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="pf">title^2.0 artist^0.8</str>
<str name="echoParams">all</str>
<str name="fl">*,score</str>
<str name="spellcheck.count">1</str>
<str name="ps">1</str>
<str name="debugQuery">true</str>
<str name="echoParams">all</str>
<str name="wt">xml</str>
<str name="q">smashing pumpkins</str>

thanks,
B
{Beto|Norberto|Numard} Meijome

"Don't remember what you can infer."
   Harry Tennant