First of all: sorry Chris, Walter ... I did not mean to put
pressure on anyone. It's just that when you're stuck on
something, there's that little needle stinging, saying:
maybe you're just too damn stupid for this ... :) So, thanks
a lot for your answers.
As for index-time expansion using synonyms: I think this is
not an option for me, since it would mean that I have to a)
find all the words that might cause problems and b) find
every variant that customers might possibly use. And then I
would have to keep all my synonym files up to date. But the
main design goal for my search implementation is little to
no maintenance.
My original assumption about the DisMax handler was that it
would simply take the original query string and pass it to
every field in its field list, using each field's configured
analyzer stack, maybe add some extra clauses for the special
options at the end ... and then send the query to Lucene.
Can you explain why this approach was not chosen?
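Just to make my assumption concrete: I expected a request
roughly like

  q=usb cable&qf=category name&defType=dismax

(parameters only sketched from my earlier example) to hand
the complete string "usb cable" to the analyzers of the
category and name fields, so that a compound-aware
TokenFilter would get to see both tokens together.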
Thanks
Tobi
Chris Hostetter wrote:
: Hmmm was my mail so weird or my question so stupid ... or is there simply
: no one with an answer? Not even a hint? :(
Patience, my friend: I've got a backlog of ~500 Lucene-related messages in
my INBOX, and I was just reading your original email when this reply came
in.
In general this is a fairly hard problem ... the easiest solution I know
of that works in most cases is to do index-time expansion using the
SynonymFilter, so that regardless of whether a document contains "usbcable",
"usb-cable", or "usb cable", all three variants get indexed, and the user
can then search for any of them.
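For example, a minimal sketch of such an index-time setup in schema.xml
might look something like this (the field type name and the synonyms file
name are just made up):

  <fieldType name="text_syn" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- expand="true" indexes every variant listed in synonyms.txt -->
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

with a synonyms.txt line along the lines of:

  usbcable, usb-cable, usb cable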
The downside is that it can throw off your tf/idf stats for some terms (if
they appear both by themselves and as part of a compound), and it can result
in false positives for esoteric phrase searches (but that tends to be more
of a theoretical problem than an actual one).
: > But this never happens since with the DisMax Searcher the parser produces a
: > query like this:
: >
: > ((category:blue | name:blue)~0.1 (category:tooth | name:tooth)~0.1)
...
: > to deal with this compound word problem? Is there another query parser that
: > already does the trick?
Take a look at the FieldQParserPlugin ... it passes the raw query string
to the analyzer of a specified field -- this would let your TokenFilters
see the whole "stream" of tokens (which isn't possible with the conventional
QueryParser tokenization rules), but it doesn't have any of the
"field/query matrix cross product" goodness of dismax -- you'd only be
able to query the one field.
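For example (field name borrowed from the query you quoted above), a
request along the lines of

  q={!field f=name}usb cable

would hand the entire string "usb cable" to the analyzer of the name
field instead of pre-splitting it on whitespace.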
(Hmmm ... I wonder if a DisMaxQParser 2.0 could have an option to let you
specify a FieldType whose analyzer is used to tokenize the query string
instead of the Lucene QueryParser's JavaCC tokenization, and *then* the
tokens resulting from that initial analysis could be passed to the
analyzers of the various qf fields ... hmmm, that might be just crazy
enough to be too crazy to work.)
-Hoss