Oh my ... thinking even more about it, I have to admit you're right :) But that leaves me somewhat clueless again.

So I'll just try and share my thoughts on this. Maybe someone will read this and can point me to a possible solution ... or tell me where I'm wrong.

Say we have a schema with fields f1, f2 and f3, and the user queries for "a b c" (without the quotes). The resulting query I would expect is (leaving out details like tie, boosting etc.):

((f1:a OR f2:a OR f3:a) AND (f1:b OR f2:b OR f3:b) AND (f1:c OR f2:c OR f3:c))
OR
((f1:ab OR f2:ab OR f3:ab) AND (f1:c OR f2:c OR f3:c))
OR
((f1:a OR f2:a OR f3:a) AND (f1:bc OR f2:bc OR f3:bc))

(possibly also f1:abc OR f2:abc ... and/or f1:a b c OR f2:a b c etc.)

So every possibility of how to write compound words is covered.
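
To make the expansion above concrete, here is a minimal Python sketch that enumerates every way of joining adjacent terms into compound words and builds the corresponding boolean query string. This is only an illustration of the query I have in mind, not something Solr provides: the function names segmentations and build_query are made up for this example, and the output is a plain string, not a parsed Lucene query.

```python
from itertools import product


def segmentations(terms):
    """Yield every way to join adjacent terms into compound words.

    For n terms there are 2**(n-1) results, because each gap between
    two neighbouring terms is either a word boundary or not.
    """
    if not terms:
        return
    for cuts in product([True, False], repeat=len(terms) - 1):
        groups, current = [], terms[0]
        for term, cut in zip(terms[1:], cuts):
            if cut:
                # Word boundary: finish the current compound.
                groups.append(current)
                current = term
            else:
                # No boundary: glue the term onto the compound.
                current += term
        groups.append(current)
        yield groups


def build_query(terms, fields):
    """Build the OR-of-ANDs query string (hypothetical helper)."""
    clauses = []
    for groups in segmentations(terms):
        per_word = [
            "(" + " OR ".join(f"{field}:{word}" for field in fields) + ")"
            for word in groups
        ]
        clauses.append("(" + " AND ".join(per_word) + ")")
    return "\nOR\n".join(clauses)


if __name__ == "__main__":
    print(build_query(["a", "b", "c"], ["f1", "f2", "f3"]))
```

For "a b c" this yields four alternatives: a/b/c, a/bc, ab/c and abc. Note that the number of alternatives doubles with every additional term, so a real implementation would probably have to limit the compound length.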

But then there is the problem that some fields require exact matching (something like EAN, manufacturer code or product serial number). Unfortunately these can contain whitespace etc., so "a b c" can also be a valid manufacturer code which should match as a whole.

So I modeled the fields in the schema accordingly: exact-match fields are of type string, and content fields get a ShingleFilter and a WordDelimiterFilter. I thought each field's analyzer stack would take care of how to process the user input.

But when I pass the user query as a phrase to the DisMax handler (so that every field gets to see the whole user query and can tokenize and shingle it), I get a query like this:
(f1:a b c)
OR
(f2:a OR f2:ab OR f2:b OR f2:bc OR f2:c)
OR
(f3:a OR f3:ab OR f3:b OR f3:bc OR f3:c)

which clearly is not what I need, as it would also find, for example, documents that contain only a or only b. When using phrase fields, this query is just added to the normal query, and therefore the query still fails to find the compound words.
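
The problem can be reproduced with a small Python sketch. It only imitates what the shingled fields seem to do in the query above (single tokens plus concatenated pairs of neighbours); it is not Lucene's ShingleFilter, and the names shingles and or_query_matches are invented for this illustration.

```python
def shingles(tokens, max_size=2):
    """Return each token plus concatenated runs of up to max_size neighbours."""
    out = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                out.append("".join(tokens[start:start + size]))
    return out


def or_query_matches(doc_terms, query_terms):
    """An OR query matches as soon as a single query term is in the document."""
    return any(term in doc_terms for term in query_terms)


if __name__ == "__main__":
    query_terms = shingles(["a", "b", "c"])
    print(query_terms)
    # A document containing only "a" still matches the OR of all shingles.
    print(or_query_matches({"a"}, query_terms))
```

Because all shingles end up as optional clauses, a document with just one of the terms is enough for a hit, which is exactly the over-matching described above.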

Also, using the FieldQuery Analyzer does not yield the desired results, as the parsed queries look just like the phrase queries from the DisMax parser.

I tried dozens of variations and I'm still pretty sure that there must be a way to get this working. It doesn't look that hard. But for now I will let this rest for the weekend :)

Have a nice weekend all and thanks in advance for any comments or replies.

Tobi


Chris Hostetter wrote:
: Many thanks for your explanation. That really helped me a lot in understanding
: DisMax - and finally I realized that DisMax is not at all what I need.
: Actually I do not want results where "blue" is in one field and "tooth" in
: another (imagine you search for a notebook with blue tooth and get some blue
: products that accidentally have tooth in some field).

except that if you use the "pf" param as well, a search for...

        blue tooth

can score products where "blue tooth" appears in one field higher than products where "blue" appears in one field and "tooth" appears in another field. The approach you are describing might give you better precision (ie: fewer total results) but it will have a loss in recall, a query like this...

        blue tooth notebook

...probably won't be able to find documents matching the terms "product_type:notebook features:blue features:tooth" ... but dismax can.
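
As a rough illustration of the "pf" suggestion, the request parameters could be assembled like this in Python. The field names come from the example above; the boost on pf, the tie value and the helper name dismax_params are assumptions made up for this sketch, and depending on the Solr version the dismax parser may be selected differently than via defType.

```python
from urllib.parse import urlencode


def dismax_params(user_query):
    """Build dismax request parameters (hypothetical helper, example values)."""
    return {
        "defType": "dismax",
        "q": user_query,
        # qf: each term may match in any of these fields.
        "qf": "product_type features",
        # pf: boost documents where the whole query appears as a phrase
        # in a single field. The boost of 5 is an arbitrary example value.
        "pf": "features^5",
        # tie: how much non-best field matches add to the score (example value).
        "tie": "0.1",
    }


if __name__ == "__main__":
    print(urlencode(dismax_params("blue tooth notebook")))
```

With parameters along these lines, documents matching the terms across product_type and features can still be found, while documents with the full phrase in one field get an extra boost.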


-Hoss

