:   I have the following use case. I could implement the solution but
: performance is affected. I need some smart ways of doing this.
: Use case:
: Incoming data has two fields which have values like 'WAL MART STORES INC'
: and 'wal-mart-stores-inc'.
: Users can search the data as 'walmart', 'wal mart' or 'wal-mart', and
: also partially on any part of the name from the start of a word, like
: 'wal', 'walm', 'wal m', etc. I could get the solution by using two
: indexes: one as a text field for the first ('wal mart') column, and one
: for the sub-word 'wal-mart-stores' column (with the
: WordDelimiterFilterFactory filter).

There are lots of solutions that could work, all depending on what *else* 
you need to be able to match on besides just prefix queries where 
whitespace/punctuation are ignored.

One example: using KeywordTokenizer, along with a PatternReplaceFilter 
that throws away non-letter characters and a LowerCaseFilter, and then 
issuing all your queries as PrefixQueries, will get w*, wa*, wal* and 
walm* to all match "wal mart", "WALMART", "WAL-mart", etc. That won't 
let "mart" match a document containing "wal mart", but you can always 
use copyField and hit one field for the first type of query, and the 
other field for "normal" queries.
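
To make the effect of that analysis chain concrete, here is a rough 
Python simulation (not Solr code). The function names normalize and 
prefix_match are made up for illustration, and the sketch assumes the 
query string gets the same normalization as the indexed value; whether 
your query parser does that for prefix queries is something to check, 
otherwise the client has to clean up the query itself.

```python
import re

def normalize(value: str) -> str:
    """Roughly mimic KeywordTokenizer + PatternReplaceFilter + LowerCaseFilter:
    keep the whole value as one token, drop non-letters, then lowercase."""
    return re.sub(r"[^A-Za-z]", "", value).lower()

def prefix_match(query: str, field_value: str) -> bool:
    """Roughly mimic a PrefixQuery against the single normalized token.
    Assumption: the query is normalized the same way as the field."""
    return normalize(field_value).startswith(normalize(query))

docs = ["WAL MART STORES INC", "wal-mart-stores-inc"]
for q in ["wal", "walm", "wal m", "wal-mart", "walmart", "mart"]:
    # "mart" matches nothing here, which is why a second field is needed
    print(repr(q), [prefix_match(q, d) for d in docs])
```

Every query except "mart" matches both values, since both normalize to 
the single token "walmartstoresinc".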

Depending on the nature of your data (i.e. how many documents, how 
common certain prefixes are, etc.) you might get better performance at 
the expense of a larger index if you use something like the 
EdgeNGramTokenFilter or EdgeNGramTokenizer to index all the prefixes of 
various sizes, so you don't need to do a prefix query.
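
The edge n-gram trade-off can be sketched the same way in plain Python 
(again a simulation, not Solr code). The names edge_ngrams, build_index 
and search, the sample documents, and the min_gram/max_gram defaults are 
all assumptions for illustration. Every prefix is stored at index time, 
so a search becomes an exact term lookup instead of a prefix scan.

```python
import re

def normalize(value):
    # Same assumed cleanup as a keyword + strip-non-letters + lowercase chain.
    return re.sub(r"[^A-Za-z]", "", value).lower()

def edge_ngrams(token, min_gram=1, max_gram=20):
    # All leading substrings of the token between min_gram and max_gram long.
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

def build_index(docs, min_gram=1, max_gram=20):
    # Map each stored prefix to the ids of the documents that produced it.
    index = {}
    for doc_id, value in docs.items():
        for gram in edge_ngrams(normalize(value), min_gram, max_gram):
            index.setdefault(gram, set()).add(doc_id)
    return index

def search(index, query):
    # Exact lookup of the normalized query; no prefix expansion needed.
    return index.get(normalize(query), set())

docs = {1: "WAL MART STORES INC", 2: "wal-mart-stores-inc", 3: "Walgreens"}
index = build_index(docs)
for q in ["wal", "wal m", "mart"]:
    print(repr(q), sorted(search(index, q)))
```

The index holds one entry per prefix length per distinct value, which 
is the "larger index" cost; queries longer than max_gram would find 
nothing in this sketch, so that limit needs to fit your data.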

The bottom line: there are *lots* of options, you'll need to experiment 
to find the right solution that matches when you want to match, and 
doesn't when you don't.



-Hoss
