Re: Solr - Make Exact Search on Field with Fuzzy Query
Hi Erickson,

Thanks for your valuable reply. We did try storing just one field and always highlighting on that field, whether we search on it or not. That sometimes causes a problem. For example, if I search for the term 'hospitality' and highlight on the field that has stemming applied, it returns highlights on both 'hospital' and 'hospitality', whereas it should highlight only 'hospitality', since I am doing an exact term search. Can you suggest anything for this? Is there a way to eliminate this issue while still highlighting on the original field (the one with stemming applied)?

The other solutions sound really good, but as you said they are hard to implement, and at this point we want to use built-in solutions if possible. Please suggest how we can eliminate the highlighting issue explained above.

Thanks.

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Make-Exact-Search-on-Field-with-Fuzzy-Query-tp4012888p4013067.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - Make Exact Search on Field with Fuzzy Query
Right, and going the other way (storing and highlighting on the non-stemmed field) would be unsatisfactory because you'd get a hit on hospital in the stemmed field, but it wouldn't be highlighted if you searched on hospitality.

I really don't see a good solution here. Highlighting seems to be one of those things that's easy in concept but has a zillion ways to go wrong.

I guess I'd really just go with the copyField approach unless you can prove that it's really a problem. Perhaps lost in my first e-mail is that storing the field twice doesn't really affect search speed or _search_ requirements at all. Take a look here:
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/fileformats.html#file-names

Note that the *.fdt and *.fdx files are where the original raw copy goes (i.e. where data gets written when you specify stored=true), and they are completely independent of the files that contain the searchable data. So unless you're disk-space constrained, the additional storage really doesn't cost you much.

Best
Erick

On Thu, Oct 11, 2012 at 2:31 AM, meghana meghana.rav...@amultek.com wrote:

> Hi Erickson,
>
> Thanks for your valuable reply. We did try storing just one field and always highlighting on that field, whether we search on it or not. That sometimes causes a problem. For example, if I search for the term 'hospitality' and highlight on the field that has stemming applied, it returns highlights on both 'hospital' and 'hospitality', whereas it should highlight only 'hospitality', since I am doing an exact term search. Can you suggest anything for this? Is there a way to eliminate this issue while still highlighting on the original field (the one with stemming applied)?
>
> The other solutions sound really good, but as you said they are hard to implement, and at this point we want to use built-in solutions if possible. Please suggest how we can eliminate the highlighting issue explained above.
>
> Thanks.
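To make the two failure modes discussed in this message concrete, here is a minimal Python sketch. It is not Solr code: `toy_stem` is a hypothetical stand-in for a real stemming filter, and `highlight` is an illustrative helper that marks every word whose analyzed form equals the analyzed query term. It only models the idea that a highlighter sees the same analyzed terms the field type produces.

```python
import re


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ity", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def no_stem(token):
    """Analysis for the exact (non-stemmed) field: leave the token alone."""
    return token


def highlight(text, query, analyze):
    """Wrap each word whose analyzed form equals the analyzed query in <em>."""
    target = analyze(query.lower())

    def mark(match):
        word = match.group(0)
        if analyze(word.lower()) == target:
            return "<em>" + word + "</em>"
        return word

    return re.sub(r"\w+", mark, text)


text = "The hospital values hospitality."

# Highlighting on the stemmed field: both words reduce to the same stem,
# so an exact search for 'hospitality' also lights up 'hospital'.
on_stemmed = highlight(text, "hospitality", toy_stem)

# Highlighting on the non-stemmed field: only the literal term is marked.
on_exact = highlight(text, "hospitality", no_stem)

# The other direction: a document found through the stemmed field
# gets no highlight at all when highlighting uses the non-stemmed field.
missed = highlight("A new hospital opened.", "hospitality", no_stem)

print(on_stemmed)
print(on_exact)
print(missed)
```

With a single highlighting field, one of the two cases always comes out wrong, which is why keeping both the stemmed field and the copyField stored is the straightforward way out.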
Solr - Make Exact Search on Field with Fuzzy Query
We are using Solr 3.6. We have a field named Description. We want search both with stemming and without stemming (exact word/phrase search), with highlighting in both cases.

After a lot of research, we concluded that we should use a copy field whose field type has no stemming filter factory. This is working fine for now (the main field has stemming and the copy field does not).

The data for that field is very large and we have millions of documents. Since we want both searching and highlighting on these fields, we need to keep the copy field both stored and indexed, which will increase the index size a lot. We need to eliminate this duplication if at all possible.

From recent research, we read that combining fuzzy search with dismax would fulfill our requirement (we have tried it a bit, without success).

Please let me know if this is possible, or if there is any other solution to make this happen.

Thanks in advance

-- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Make-Exact-Search-on-Field-with-Fuzzy-Query-tp4012888.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - Make Exact Search on Field with Fuzzy Query
There's nothing really built in to Solr to allow this. Are you absolutely sure you can't just use the copyField? Have you actually tried it?

But I don't think you need to store the contents twice. Just store it once and always highlight on that field whether you search it or not. Since it's the raw text, you should be fine. You'll have two versions of the field tokenized of course, but that should take less space than you might think. You probably want to store the version with the stemming turned on...

That said, storing twice only uses up some disk space; it doesn't require additional memory for searching. So unless you're running out of disk space you can just keep two stored versions around.

But if none of that works, you might write a custom filter that emits two tokens for each input token at indexing time, similar to what synonyms do. The first should be the original with some special character appended, say $, and the second should be the result of stemming (note, there will be two tokens even if no stemming is done). So, indexing "running" would index "running$" and "run". Now, when you need to search for an exact match on "running", you search for "running$".

This works for the reverse too. Since the rule is to append $ to all original tokens, "run" gets indexed as "run$" and "run". Now, searching for "run" matches, as does "run$". But "run$" does not match the doc that had "running", since the two tokens emitted in that case are "run" and "running$".

But look at what's happened here. You're indexing two tokens for every one token in the input. Furthermore, you're adding a bunch of unique tokens to the index. It's hard to see how this results in any savings over just using copyField. You have to index the two tokens since you have to distinguish between the stemmed and un-stemmed version.

You might be able to do something really exotic with payloads. This is _really_ out of left field, but it just occurred to me. You'd have to define a transformation from the original word into the stemmed word that created a unique value. Something like:

no stemming - 0
removing ing - 1
removing s - 2
etc.

Actually, this would have to be some kind of function on the letters removed, so that each removed letter contributes its ordinal position in the alphabet multiplied by a power of 100 that depends on its position in the suffix. So ing would map to ('i' - 'a') + ('n' - 'a') * 100 + ('g' - 'a') * 10000, i.e. 8 + 1300 + 60000 = 61308, etc. (you'd have to take considerable care to get this right for any code sets that had more than 100 possible code points)...

Now you've included the information about what the original word was, and could use the payload to reject the match in the exact-match case. Of course the other issue would be figuring out the syntax to get the fact that you wanted an exact match down into your custom scorer.

But as you can see, any scheme is harder than just flipping a switch, so I'd _really_ verify that you can't just use copyField.

Best
Erick

On Wed, Oct 10, 2012 at 7:38 AM, meghana meghana.rav...@amultek.com wrote:

> We are using Solr 3.6. We have a field named Description. We want search both with stemming and without stemming (exact word/phrase search), with highlighting in both cases.
>
> After a lot of research, we concluded that we should use a copy field whose field type has no stemming filter factory. This is working fine for now (the main field has stemming and the copy field does not).
>
> The data for that field is very large and we have millions of documents. Since we want both searching and highlighting on these fields, we need to keep the copy field both stored and indexed, which will increase the index size a lot. We need to eliminate this duplication if at all possible.
>
> From recent research, we read that combining fuzzy search with dismax would fulfill our requirement (we have tried it a bit, without success).
>
> Please let me know if this is possible, or if there is any other solution to make this happen.
>
> Thanks in advance
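The two-token scheme from this message can be sketched in a few lines of Python. This is an illustration of the indexing rule only, not a Lucene TokenFilter: `toy_stem`, `index_tokens` and `matches` are hypothetical names, the stemmer is a crude suffix stripper, and a document is modeled as a plain set of terms.

```python
MARKER = "$"  # special character appended to every original token


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer (just enough for run/running)."""
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_tokens(text):
    """Emit two terms per input token: original + marker, and the stem."""
    terms = set()
    for token in text.lower().split():
        terms.add(token + MARKER)   # exact form, e.g. running$
        terms.add(toy_stem(token))  # stemmed form, e.g. run
    return terms


def matches(doc_terms, word, exact=False):
    """Exact search looks up word + marker; stemmed search looks up the stem."""
    word = word.lower()
    term = word + MARKER if exact else toy_stem(word)
    return term in doc_terms


doc_running = index_tokens("running")
doc_run = index_tokens("run")

print(sorted(doc_running))
print(sorted(doc_run))
print(matches(doc_running, "running", exact=True))
print(matches(doc_running, "run", exact=True))
print(matches(doc_running, "run"))
```

Each document carries twice as many terms as it has input tokens, and every marked term is a new unique entry in the term dictionary, which is the cost pointed out above when comparing this scheme with copyField.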