Re: Dismax , query phrases

2008-10-02 Thread Chris Hostetter

:  how would it fit c:"some phrase" into that structure?
: 
: does this make sense?
: 
:  ( (a:some | b:some ) (a:phrase | b:phrase) ( c:"some phrase") )

that's pretty much exactly what pf does, the only distinction is you 
get...

 +( (a:some | b:some ) (a:phrase | b:phrase) )  ( c:"some phrase" ) 

...where the mm param only applies to the (mandatory) boolean built 
using the qf.
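
A rough sketch of that shape in Python, for illustration only. The function name dismax_structure is made up here, and the code just assembles a string that looks like the parsed query; it is not how Solr builds the query internally.

```python
def dismax_structure(q, qf, pf):
    """Return a string resembling the boolean structure described above.

    Illustration only: this mimics the shape of the parsed query,
    not Solr's actual DisMax implementation.
    """
    chunks = q.split()  # whitespace chunking happens first
    # One disjunction across the qf fields for each chunk of input.
    per_chunk = [
        "(" + " | ".join(f"{field}:{chunk}" for field in qf) + ")"
        for chunk in chunks
    ]
    # The qf part is mandatory (leading +); mm would apply to this part only.
    mandatory = "+( " + " ".join(per_chunk) + " )"
    # pf fields see the whole input as one phrase; the clause is optional
    # and only boosts documents that already matched the mandatory part.
    phrase = " | ".join(f'{field}:"{q}"' for field in pf)
    return f"{mandatory} ( {phrase} )"


print(dismax_structure("some phrase", qf=["a", "b"], pf=["c"]))
```

With qf = a b and pf = c this prints the same structure as the line above, with the phrase clause outside the mandatory boolean.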


-Hoss



Re: Dismax , query phrases

2008-10-01 Thread Norberto Meijome
On Tue, 30 Sep 2008 11:43:57 -0700 (PDT)
Chris Hostetter [EMAIL PROTECTED] wrote:

 
> : That's why I was wondering how Dismax breaks it all apart. It makes sense...I
> : suppose what I'd like to have is a way to tell dismax which fields NOT to
> : tokenize the input for. For these fields, it would pass the full q instead of
> : each part of it. Does this make sense? would it be useful at all?
>
> the *goal* makes sense, but the implementation would be ... problematic.
>
> you have to remember the DisMax parser's whole way of working is to make
> each chunk of input match against any qf field, and find the highest
> scoring field for each chunk, with this input...
>
>   q = some phrase   qf = a b c
>
> ...you get...
>
>   ( (a:some | b:some | c:some) (a:phrase | b:phrase | c:phrase) )
>
> ...even if dismax could tell that c was a field that should only support
> exact matches,

thanks Hoss,

it would be a configuration option.

> how would it fit c:"some phrase" into that structure?

does this make sense?

 ( (a:some | b:some ) (a:phrase | b:phrase) ( c:"some phrase") )


> I've already kinda forgotten how this thread started ...

trying to get *exact* matches to always score higher using dismax - keeping in
mind that I have multiple exact fields, with different boosts...

> but would it make
> sense to just use your exact fields in the pf, and have inexact versions
> of them in the qf?  then docs that match your input exactly should score
> at the top, but less exact matches will also still match.

aha! right, i think that makes sense...i obviously haven't got my head properly
around all the different functionality of dismax.

I will try it when I'm back @ work... right now, i seem to have solved the
problem by using shingles - the fields are artists, song & album titles, so high
matching on shingles is a close approximation of exact matching - except that I
had to remove stopwords, so that impacts on performance.

Thanks again :)
B
_
{Beto|Norberto|Numard} Meijome

Which is worse: ignorance or apathy?
Don't know. Don't care.

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: Dismax , query phrases

2008-09-30 Thread Chris Hostetter

: That's why I was wondering how Dismax breaks it all apart. It makes sense...I
: suppose what I'd like to have is a way to tell dismax which fields NOT to
: tokenize the input for. For these fields, it would pass the full q instead of
: each part of it. Does this make sense? would it be useful at all? 

the *goal* makes sense, but the implementation would be ... problematic.

you have to remember the DisMax parser's whole way of working is to make 
each chunk of input match against any qf field, and find the highest 
scoring field for each chunk, with this input...

q = some phrase   qf = a b c

...you get...

( (a:some | b:some | c:some) (a:phrase | b:phrase | c:phrase) )

...even if dismax could tell that c was a field that should only support 
exact matches, how would it fit c:"some phrase" into that structure?

I've already kinda forgotten how this thread started ... but would it make 
sense to just use your exact fields in the pf, and have inexact versions 
of them in the qf?  then docs that match your input exactly should score 
at the top, but less exact matches will also still match.
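
A minimal sketch of what such a request could look like, written in Python. The field names and boosts are the ones used elsewhere in this thread; the helper name build_dismax_params and the choice of title and artist as the only inexact qf fields are assumptions made for the example, not a tested configuration.

```python
from urllib.parse import urlencode, parse_qs


def build_dismax_params(user_query):
    """Build a query string with inexact fields in qf and exact fields in pf.

    Hypothetical helper: the parameter names (defType, q, qf, pf) are
    standard dismax parameters, the field choices are assumptions.
    """
    params = {
        "defType": "dismax",
        "q": user_query,
        # Inexact (tokenized) fields: each chunk of input is matched here.
        "qf": "title^6.0 artist^4.0",
        # Exact fields: the whole input is tried against these as one phrase,
        # so an exact match adds a large boost on top of the qf match.
        "pf": "title_exact^200.0 artist_exact^100.0",
    }
    return urlencode(params)


query_string = build_dismax_params("the doors")
print(query_string)
print(parse_qs(query_string)["pf"])
```

The idea is that documents still match through qf, while the pf clause only affects ranking.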



-Hoss



Re: Dismax , query phrases

2008-09-29 Thread Norberto Meijome
On Fri, 26 Sep 2008 10:42:42 -0700 (PDT)
Chris Hostetter [EMAIL PROTECTED] wrote:

> :   <tokenizer class="solr.KeywordTokenizerFactory" />
> :   <!-- The LowerCase TokenFilter does
>
> : Now, when I search with ?q=the doors , all the terms in my q= aren't used
> : together to build the dismaxQuery , so I never get a match on the _exact
> : fields:
>
> The query parser (even the dismax queryparser) does its white space
> chunking before handing any input off to the analyzer for the
> appropriate field, so with [[ ?q=the doors ]] "the" and "doors" are going
> to get analyzed separately ... which is why you see artist_exact:the^100.0
> and artist_exact:doors^100.0 in your parsedquery -- *BUT* since you used
> KeywordTokenizer at index time, you'll never get a match for either of
> those on any document (unless the artist is just "the" or "doors")

Hi Hoss :)
thanks for the feedback - I arrived @ the same conclusion. The biz requirement
is that these *_exact fields match exactly the original contents of the field.
Right now we are using Dismax, and changing this means rewriting a lot of the
queries, which isn't possible.

That's why I was wondering how Dismax breaks it all apart. It makes sense...I
suppose what I'd like to have is a way to tell dismax which fields NOT to
tokenize the input for. For these fields, it would pass the full q instead of
each part of it. Does this make sense? would it be useful at all? 

> : I've tried with other queries that don't include stopwords (smashing
> : pumpkins, for example), and in all cases, if I don't use " " , only the LAST
> : word is used with my _exact fields ( tried with 1, 2 and 3 words, always
> : the same against my _exact fields..)
>
> this LAST word part doesn't make sense to me ... you can see "the"
> making it into your query on the *_exact fields in the first
> DisjunctionMaxQuery, do you have toStrings for these other queries we
> could see to understand what you mean?

I agree, it makes sense as you say...i must have missed the initial tokens. I
can't confirm atm, so I'll follow the common sense path :)

As usual, thanks for your time and insights :)

B
_
{Beto|Norberto|Numard} Meijome

Humans die and turn to dust, but writing makes us remembered
  4000-year-old words of an Egyptian scribe



Re: Dismax , query phrases

2008-09-26 Thread Chris Hostetter

I'm not fully following everything you've got here, but one thing jumped 
out at me...

:   <tokenizer class="solr.KeywordTokenizerFactory" />
:   <!-- The LowerCase TokenFilter does

: Now, when I search with ?q=the doors , all the terms in my q= aren't used
: together to build the dismaxQuery , so I never get a match on the _exact 
: fields:

The query parser (even the dismax queryparser) does its white space 
chunking before handing any input off to the analyzer for the 
appropriate field, so with [[ ?q=the doors ]] "the" and "doors" are going 
to get analyzed separately ... which is why you see artist_exact:the^100.0 
and artist_exact:doors^100.0 in your parsedquery -- *BUT* since you used 
KeywordTokenizer at index time, you'll never get a match for either of 
those on any document (unless the artist is just "the" or "doors")
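
To make the mismatch concrete, here is a small Python simulation. It is only a stand-in: keyword_analyze imitates the KeywordTokenizer + LowerCaseFilter + TrimFilter chain from the fieldType, and exact_field_matches is a made-up helper that models the white space chunking, not Solr code.

```python
def keyword_analyze(text):
    # Stand-in for KeywordTokenizer + LowerCaseFilter + TrimFilter:
    # the whole value becomes a single lowercased, trimmed token.
    return [text.strip().lower()]


def exact_field_matches(indexed_value, user_query, quoted):
    """Return True if any query term equals the single indexed token."""
    indexed_terms = set(keyword_analyze(indexed_value))
    # Unquoted input is split on white space BEFORE analysis;
    # quoted input reaches the analyzer as one chunk.
    chunks = [user_query] if quoted else user_query.split()
    query_terms = [t for chunk in chunks for t in keyword_analyze(chunk)]
    return any(term in indexed_terms for term in query_terms)


print(exact_field_matches("The Doors", "the doors", quoted=False))  # False
print(exact_field_matches("The Doors", "the doors", quoted=True))   # True
print(exact_field_matches("Doors", "the doors", quoted=False))      # True
```

The first call fails because neither "the" nor "doors" equals the indexed token "the doors"; the last call succeeds only because the artist is just "doors".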

: I've tried with other queries that don't include stopwords (smashing pumpkins,
: for example), and in all cases, if I don't use " " , only the LAST word is used
: with my _exact fields ( tried with 1, 2 and 3 words, always the same against 
: my _exact fields..)

this LAST word part doesn't make sense to me ... you can see "the" 
making it into your query on the *_exact fields in the first 
DisjunctionMaxQuery, do you have toStrings for these other queries we 
could see to understand what you mean?



-Hoss



Re: Dismax , query phrases

2008-09-25 Thread Norberto Meijome
On Wed, 24 Sep 2008 08:34:57 -0700 (PDT)
Otis Gospodnetic [EMAIL PROTECTED] wrote:

> What happens if you change ps from 100 to 1 and comment out that ord function?
 
 

Otis, I think what I am after is what Hoss described in his last paragraph in 
his reply to your email last year :

http://www.nabble.com/DisMax-and-REQUIRED-OR-REQUIRED-query-rewrite-td13395349.html#a13395349

ie, I want everything that Dismax does, BUT , on certain fields, I want it to 
search for all the terms in my q= , as a phrase.

I am thinking of modifying dismax to allow this to be passed as a configuration 
( eg, fieldsSearchExact=artist_exact, title_exact), but if I can avoid it 
that'd be great :).

any other ideas, anyone??

thanks!
B
_
{Beto|Norberto|Numard} Meijome

Nature doesn't care how smart you are. You can still be wrong.
  Richard Feynman



Dismax , query phrases

2008-09-24 Thread Norberto Meijome
Hello,
I've seen references to this in the list, but not completely explained...my
apologies if this is FAQ (and for the length of the email).

I am using dismax across a number of fields on an index with data about music
albums & songs - the fields are quite full of stop words. I am trying to boost
'exact' matches - ie, if you search for 'The Doors', those documents with 'The
Doors' should be first. I've created the following fieldType and I use it for 
fields artist_exact and title_exact:


<fieldType name="lowerCaseString" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <!-- The LowerCase TokenFilter does what you expect, which can be
         when you want your sorting to be case insensitive -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>

I then give artist_exact and title_exact pretty high boosts ( title_exact^200.0
artist_exact^100.0 )

Now, when I search with ?q=the doors , all the terms in my q= aren't used
together to build the dismaxQuery , so I never get a match on the _exact fields:

(there are a few other fields involved...pretty self explanatory)

<str name="rawquerystring">the doors</str>
<str name="querystring">the doors</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((title_ngram2:"th he"^0.1 | artist_ngram2:"th he"^0.1 |
title_ngram3:the^4.5 | artist_ngram3:the^3.5 | artist_exact:the^100.0 |
title_exact:the^200.0)~0.01) DisjunctionMaxQuery((genre:door^0.2 |
title_ngram2:"do oo or rs"^0.1 | artist_ngram2:"do oo or rs"^0.1 |
title_ngram3:"doo oor ors"^4.5 | title:door^6.0 | artist_ngram3:"doo oor
ors"^3.5 | artist:door^4.0 | artist_exact:doors^100.0 |
title_exact:doors^200.0)~0.01))~2) DisjunctionMaxQuery((title:door^2.0 |
artist:door^0.8)~0.01) FunctionQuery((ord(release_year))^0.5) </str>

<str name="parsedquery_toString"> +(((title_ngram2:"th he"^0.1 |
artist_ngram2:"th he"^0.1 | title_ngram3:the^4.5 | artist_ngram3:the^3.5 |
artist_exact:the^100.0 | title_exact:the^200.0)~0.01 (genre:door^0.2 |
title_ngram2:"do oo or rs"^0.1 | artist_ngram2:"do oo or rs"^0.1 |
title_ngram3:"doo oor ors"^4.5 | title:door^6.0 | artist_ngram3:"doo oor
ors"^3.5 | artist:door^4.0 | artist_exact:doors^100.0 |
title_exact:doors^200.0)~0.01)~2) (title:door^2.0 | artist:door^0.8)~0.01
(ord(release_year))^0.5</str>


but, if I build my search as ?q="the doors" 

<str name="parsedquery">
+DisjunctionMaxQuery((genre:door^0.2 | title_ngram2:"th he e   d do oo or
rs"^0.1 | artist_ngram2:"th he e   d do oo or rs"^0.1 | title_ngram3:"the he  e
d  do doo oor ors"^4.5 | title:door^6.0 | artist_ngram3:"the he  e d  do doo
oor ors"^3.5 | artist:door^4.0 | artist_exact:the doors^100.0 | title_exact:the
doors^200.0)~0.01) DisjunctionMaxQuery((title:door^2.0 | artist:door^0.8)~0.01)
FunctionQuery((ord(release_year))^0.5) </str>

<str name="parsedquery_toString"> +(genre:door^0.2 | title_ngram2:"th he e   d
do oo or rs"^0.1 | artist_ngram2:"th he e   d do oo or rs"^0.1 |
title_ngram3:"the he  e d  do doo oor ors"^4.5 | title:door^6.0 |
artist_ngram3:"the he  e d  do doo oor ors"^3.5 | artist:door^4.0 |
artist_exact:the doors^100.0 | title_exact:the doors^200.0)~0.01
(title:door^2.0 | artist:door^0.8)~0.01 (ord(release_year))^0.5</str>

I've tried with other queries that don't include stopwords (smashing pumpkins,
for example), and in all cases, if I don't use " " , only the LAST word is used
with my _exact fields ( tried with 1, 2 and 3 words, always the same against my
_exact fields..)

What is the reason for this behaviour? 

my full dismax config is :

<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
<str name="spellcheck">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="tie">0.01</str>
<str name="qf">
title_exact^200.0 artist_exact^100.0 title^6.0 title_ngram3^4.5 artist^4.0
artist_ngram3^3.5 title_ngram2^0.1 artist_ngram2^0.1 genre^0.2 </str>
<str name="q.alt">*:*</str>
<str name="spellcheck.collate">true</str>
<str name="defType">dismax</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="rows">10</str>
<str name="pf">title^2.0 artist^0.8</str>
<str name="echoParams">all</str>
<str name="fl">*,score</str>
<str name="bf">ord(release_year)^0.5</str>
<str name="spellcheck.count">1</str>
<str name="ps">100</str>
</lst>

TIA!
B
_
{Beto|Norberto|Numard} Meijome

Never offend people with style when you can offend them with substance.
  Sam Brown



Re: Dismax , query phrases

2008-09-24 Thread Otis Gospodnetic
What happens if you change ps from 100 to 1 and comment out that ord function?


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: Dismax , query phrases

2008-09-24 Thread Norberto Meijome
On Wed, 24 Sep 2008 08:34:57 -0700 (PDT)
Otis Gospodnetic [EMAIL PROTECTED] wrote:

> What happens if you change ps from 100 to 1 and comment out that ord function?
>
>
> Otis

Hi Otis,

no luck - without " " :
<str name="rawquerystring">smashing pumpkins</str>
<str name="querystring">smashing pumpkins</str>
<str name="parsedquery">
+((DisjunctionMaxQuery((genre:smash^0.2 | title_ngram2:"sm ma as sh hi in 
ng"^0.1 | artist_ngram2:"sm ma as sh hi in ng"^0.1 | title_ngram3:"sma mas ash 
shi hin ing"^4.5 | title:smash^6.0 | artist_ngram3:"sma mas ash shi hin 
ing"^3.5 | artist:smash^4.0 | artist_exact:smashing^100.0 | 
title_exact:smashing^200.0)~0.01) DisjunctionMaxQuery((genre:pumpkin^0.2 | 
title_ngram2:"pu um mp pk ki in ns"^0.1 | artist_ngram2:"pu um mp pk ki in 
ns"^0.1 | title_ngram3:"pum ump mpk pki kin ins"^4.5 | title:pumpkin^6.0 | 
artist_ngram3:"pum ump mpk pki kin ins"^3.5 | artist:pumpkin^4.0 | 
artist_exact:pumpkins^100.0 | title_exact:pumpkins^200.0)~0.01))~2) 
DisjunctionMaxQuery((title:"smash pumpkin"~1^2.0 | artist:"smash 
pumpkin"~1^0.8)~0.01)
</str>
<str name="parsedquery_toString">
+(((genre:smash^0.2 | title_ngram2:"sm ma as sh hi in ng"^0.1 | 
artist_ngram2:"sm ma as sh hi in ng"^0.1 | title_ngram3:"sma mas ash shi hin 
ing"^4.5 | title:smash^6.0 | artist_ngram3:"sma mas ash shi hin ing"^3.5 | 
artist:smash^4.0 | artist_exact:smashing^100.0 | 
title_exact:smashing^200.0)~0.01 (genre:pumpkin^0.2 | title_ngram2:"pu um mp pk 
ki in ns"^0.1 | artist_ngram2:"pu um mp pk ki in ns"^0.1 | title_ngram3:"pum 
ump mpk pki kin ins"^4.5 | title:pumpkin^6.0 | artist_ngram3:"pum ump mpk pki 
kin ins"^3.5 | artist:pumpkin^4.0 | artist_exact:pumpkins^100.0 | 
title_exact:pumpkins^200.0)~0.01)~2) (title:"smash pumpkin"~1^2.0 | 
artist:"smash pumpkin"~1^0.8)~0.01
</str>

Still OK if I include " " ...

I am trying on another setup, with the same data, to work with shingles rather
than with 'exact' fields ... dismax seems to handle it much better...but it may
be that I haven't added to that config all the ngram2 / ngram3 fields for
substring matching...

the resulting params were :

<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
<str name="spellcheck">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="tie">0.01</str>
<str name="tr">store_albums.xsl</str>
<str name="qf">
title_exact^200.0 artist_exact^100.0 title^6.0 title_ngram3^4.5 artist^4.0 
artist_ngram3^3.5 title_ngram2^0.1 artist_ngram2^0.1 genre^0.2
</str>
<str name="q.alt">*:*</str>
<str name="spellcheck.collate">true</str>
<str name="wt">xml</str>
<str name="defType">dismax</str>
<str name="rows">10</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="pf">title^2.0 artist^0.8</str>
<str name="echoParams">all</str>
<str name="fl">*,score</str>
<str name="spellcheck.count">1</str>
<str name="ps">1</str>
<str name="debugQuery">true</str>
<str name="echoParams">all</str>
<str name="wt">xml</str>
<str name="q">smashing pumpkins</str>

thanks,
B
_
{Beto|Norberto|Numard} Meijome

Don't remember what you can infer.
   Harry Tennant
