First of all: sorry Chris, Walter ... I did not mean to put pressure on anyone. It's just that when you're stuck on something, there's that little needle stinging you and saying: maybe you're just too damn stupid for this ... :) So, thanks a lot for your answers.

As for index-time expansion using synonyms: I think this is not an option for me, since it would mean that I have to a) find all the words that might cause problems and b) find every variant that customers might possibly use. And then I would have to keep all my synonym files up to date. But the main design goal for my search implementation is little to no maintenance.

My original assumption about the DisMax handler was that it would just take the original query string and pass it to every field in its field list, using each field's configured analyzer stack. Maybe at the end it would add some stuff for the special options and so on ... and then send the query to Lucene. Can you explain why this approach was not chosen?

Thanks
Tobi


Chris Hostetter wrote:
: Hmmm was my mail so weird or my question so stupid ... or is there simply
: no one with an answer? Not even a hint? :(

patience my friend, i've got a backlog of ~500 Lucene related messages in my INBOX, and i was just reading your original email when this reply came in.

In general this is a fairly hard problem ... the easiest solution i know of that works in most cases is to do index time expansion using the SynonymFilter, so regardless of whether a document contains "usbcable", "usb-cable" or "usb cable", all three variants get indexed, and then the user can search for any of them.

the downside is that it can throw off your tf/idf stats for some terms (if they appear by themselves, and as part of a compound) and it can result in false positives for esoteric phrase searches (but that tends to be more of a theoretical problem than an actual one).
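
To make the index time expansion idea concrete, here is a toy sketch in Python. It is not Solr's SynonymFilter; the synonym table, the `expand` and `build_index` names, and the treatment of each document as a single term are all assumptions made purely for illustration.

```python
# Toy illustration only: a hypothetical synonym table and a tiny inverted index.
# This is NOT Solr's SynonymFilter, just the idea behind index time expansion.
VARIANTS = ["usbcable", "usb-cable", "usb cable"]
SYNONYMS = {variant: VARIANTS for variant in VARIANTS}


def expand(text):
    """Return every variant that should be indexed for this piece of text."""
    key = text.lower()
    return SYNONYMS.get(key, [key])


def build_index(docs):
    """Map each indexed variant to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for variant in expand(text):
            index.setdefault(variant, set()).add(doc_id)
    return index


docs = {1: "usbcable", 2: "USB-Cable", 3: "usb cable", 4: "hdmi cable"}
index = build_index(docs)
```

With this table, looking up any of the three spellings in `index` returns documents 1, 2 and 3, because each of those documents was indexed under all three variants.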

: > But this never happens since with the DisMax Searcher the parser produces a
: > query like this:
: > ((category:blue | name:blue)~0.1 (category:tooth | name:tooth)~0.1)
        ...
: > to deal with this compound word problem? Is there another query parser that
: > already does the trick?

take a look at the FieldQParserPlugin ... it passes the raw query string to the analyzer of a specified field -- this would let your TokenFilters see the "stream" of tokens (which isn't possible with the conventional QueryParser tokenization rules) but it doesn't have any of the "field/query matrix cross product" goodness of dismax -- you'd only be able to query the one field.
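
A rough Python sketch of why seeing the whole "stream" matters. The `analyze` function below is a made-up stand-in for a field's analyzer whose filter joins adjacent tokens; it is not Lucene or Solr code, and all the names are hypothetical.

```python
def analyze(text):
    """Hypothetical analyzer: lowercase, split on whitespace, then also emit
    each adjacent pair of tokens joined together (a crude compound filter)."""
    tokens = text.lower().split()
    joined = [a + b for a, b in zip(tokens, tokens[1:])]
    return tokens + joined


def presplit_then_analyze(query):
    """Split on whitespace first, then analyze each chunk separately, so the
    filter never sees two neighbouring tokens at once."""
    out = []
    for chunk in query.split():
        out.extend(analyze(chunk))
    return out


def analyze_whole_string(query):
    """Hand the raw query string to the analyzer in one piece."""
    return analyze(query)


split_tokens = presplit_then_analyze("blue tooth")
whole_tokens = analyze_whole_string("blue tooth")
```

In this toy, `split_tokens` only ever contains "blue" and "tooth", while `whole_tokens` also contains "bluetooth", because only the second path lets the filter look at neighbouring tokens.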

(Hmmm.... i wonder if DisMaxQParser 2.0 could have an option to let you specify a FieldType whose analyzer was used to tokenize the query string instead of using the Lucene QueryParser JavaCC tokenization, and *then* the tokens resulting from that initial analyzer could be passed to the analyzers of the various qf fields ... hmmm, that might be just crazy enough to be too crazy to work)
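
Purely as a thought experiment, that two-stage idea might look something like the Python sketch below. Nothing here is an existing Solr API: `query_analyzer`, `FIELD_ANALYZERS` and `build_query` are invented names, the per-field analyzers are trivial lowercasing stand-ins, and the output only mimics the shape of the dismax query quoted above.

```python
def query_analyzer(text):
    """Stage 1 (hypothetical): the analyzer of one chosen FieldType tokenizes
    the raw query string and also emits adjacent tokens joined together."""
    tokens = text.lower().split()
    return tokens + [a + b for a, b in zip(tokens, tokens[1:])]


# Stage 2 (hypothetical): one analyzer per qf field, here just lowercasing.
FIELD_ANALYZERS = {
    "category": lambda token: [token.lower()],
    "name": lambda token: [token.lower()],
}


def build_query(query, qf, tie=0.1):
    """Cross every stage-1 token with every qf field, dismax style."""
    clauses = []
    for token in query_analyzer(query):
        parts = []
        for field in qf:
            for term in FIELD_ANALYZERS[field](token):
                parts.append(f"{field}:{term}")
        clauses.append("(" + " | ".join(parts) + f")~{tie}")
    return "(" + " ".join(clauses) + ")"


result = build_query("blue tooth", ["category", "name"])
```

Under these assumptions `result` keeps the per-word clauses for "blue" and "tooth" and gains a third clause for "bluetooth" across both fields.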




-Hoss


