Oh my ... thinking even more about it, I have to admit
you're right :) But that leaves me somewhat clueless again.
So I'll just try and share my thoughts on this. Maybe
someone will read this and can point me to a possible
solution ... or tell me where I'm wrong.
Say we have a schema with fields f1, f2 and f3, and the user
queries for "a b c" (without the quotes). The resulting query
I would expect (leaving out details like tie, boosting etc.)
is:
((f1:a OR f2:a OR f3:a) AND (f1:b OR f2:b OR f3:b) AND (f1:c
OR f2:c OR f3:c))
OR
((f1:ab OR f2:ab OR f3:ab) AND (f1:c OR f2:c OR f3:c))
OR
((f1:a OR f2:a OR f3:a) AND (f1:bc OR f2:bc OR f3:bc))
(possibly also f1:abc OR f2:abc ... and/or f1:a b c OR f2:a
b c etc.)
So every possibility of how to write compound words is covered.
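
To make the intended expansion concrete, here is a minimal Python
sketch that enumerates every way of joining adjacent tokens into
compounds and builds the corresponding boolean query string. The
function names are made up for illustration, this is not anything
Solr provides, and real terms would still need query-syntax escaping.

```python
from itertools import combinations


def segmentations(tokens):
    """Yield every way to join adjacent tokens into compound words."""
    n = len(tokens)
    for k in range(n):
        # choose k cut positions between tokens; uncut neighbours are joined
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield ["".join(tokens[i:j]) for i, j in zip(bounds, bounds[1:])]


def any_field(term, fields):
    """One term may match in any of the given fields."""
    return "(" + " OR ".join(f"{f}:{term}" for f in fields) + ")"


def build_query(user_input, fields):
    """OR together one AND-clause per segmentation of the user input."""
    tokens = user_input.split()
    clauses = [
        "(" + " AND ".join(any_field(t, fields) for t in seg) + ")"
        for seg in segmentations(tokens)
    ]
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(build_query("a b c", ["f1", "f2", "f3"]))
```

For n tokens this produces 2^(n-1) clauses, so it is only practical
for short queries.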
But then there is the problem that some fields require
exact matching (something like EAN, manufacturer code or
product serial number). Unfortunately these can contain
whitespace etc., so a b c can also be a valid manufacturer
code which should match as a whole.
So I modeled the fields in the schema accordingly: exact
match fields are of type string, and content fields get a
ShingleFilter and a WordDelimiterFilter. I thought each
field's analyzer stack would take care of how to process the
user input.
But when I pass the user query as a phrase to the DisMax
handler (so that every field gets to see the whole user
query and can tokenize and shingle it) I get a query like this:
(f1:a b c)
OR
(f2:a OR f2:ab OR f2:b OR f2:bc OR f2:c)
OR
(f3:a OR f3:ab OR f3:b OR f3:bc OR f3:c)
which is clearly not what I need, as it would also find,
for example, documents that only contain a or b. When
using phrase fields, this query is just added to the normal
query and therefore the query fails to find the compound words.
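
The over-matching can be reproduced with a small Python simulation.
This is only a sketch under assumptions: shingles() is a stand-in
that emits unigrams plus bigrams joined without a separator (as in
the query above), not the real ShingleFilter, and matching is a
plain OR over terms.

```python
def shingles(tokens, max_size=2, sep=""):
    """Unigrams plus joined runs of up to max_size adjacent tokens."""
    out = []
    for i in range(len(tokens)):
        for size in range(1, max_size + 1):
            if i + size <= len(tokens):
                out.append(sep.join(tokens[i:i + size]))
    return out


def matches_any(query_terms, doc_terms):
    """OR semantics: a single shared term is enough for a match."""
    return any(t in doc_terms for t in query_terms)


query_terms = shingles("a b c".split())
doc_only_a = shingles(["a"])      # document containing just "a"
doc_compound = ["abc"]            # document containing the compound "abc"

print(query_terms)
print(matches_any(query_terms, doc_only_a))
print(matches_any(query_terms, doc_compound))
```

With these assumptions the document containing only a matches, while
the compound abc does not, because no shingle longer than two tokens
is generated.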
Using the FieldQuery Analyzer does not yield the desired
results either, as the parsed queries in fact look just
like the phrase queries from the DisMax parser.
I tried dozens of variations and I'm still pretty sure that
there must be a way to get this working. It doesn't look
that hard. But for now I will let this rest for the weekend :)
Have a nice weekend all and thanks in advance for any
comments or replies.
Tobi
Chris Hostetter wrote:
: Many thanks for your explanation. That really helped me a lot in understanding
: DisMax - and finally I realized that DisMax is not at all what I need.
: Actually I do not want results where "blue" is in one field and "tooth" in
: another (imagine you search for a notebook with blue tooth and get some blue
: products that accidentally have tooth in some field).
except that if you use the "pf" param as well, a search for...
blue tooth
can score products where "blue tooth" appears in one field higher than
products where "blue" appears in one field and "tooth" appears in another
field.
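
As a rough illustration of such a request, the following Python
snippet assembles dismax parameters into a query string. The q, qf,
pf and tie parameters are the dismax ones discussed in this thread;
the field list, the boost value and the use of defType=dismax to
select the parser are assumptions that depend on the Solr version
and on solrconfig.xml.

```python
from urllib.parse import urlencode

params = {
    "defType": "dismax",               # assumption: how dismax is selected varies by setup
    "q": "blue tooth",
    "qf": "product_type features",     # fields searched term by term
    "pf": "features^2",                # whole-phrase match in one field gets a boost
    "tie": "0.1",                      # illustrative value only
}

query_string = urlencode(params)
print(query_string)
```

The pf clause does not restrict the result set; it only raises the
score of documents where the whole input matches as a phrase.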
The approach you are describing might give you better precision (ie:
fewer total results) but it will have a loss in recall, a query like
this...
blue tooth notebook
...probably won't be able to find documents matching the terms
"product_type:notebook features:blue features:tooth" ... but dismax
can.
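
The difference can be shown with a toy Python model of the two
matching strategies. This is a sketch, not how Solr evaluates
queries: it assumes whitespace tokenization, that every term is
required, and a made-up document with the two fields named above.

```python
doc = {"product_type": "notebook", "features": "blue tooth"}


def per_term_any_field(terms, doc):
    """dismax-like: each term must match in at least one field."""
    return all(any(t in text.split() for text in doc.values()) for t in terms)


def all_terms_one_field(terms, doc):
    """Stricter: all terms must appear together in a single field."""
    return any(all(t in text.split() for t in terms) for text in doc.values())


terms = "blue tooth notebook".split()
print(per_term_any_field(terms, doc))
print(all_terms_one_field(terms, doc))
```

Under these assumptions only the per-term strategy finds the
document, which is the loss the stricter approach accepts.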
-Hoss