Re: Solr - Make Exact Search on Field with Fuzzy Query

2012-10-11 Thread meghana
Hi Erickson,

Thanks for your valuable reply.

We had actually tried storing just one field and always highlighting on that
field, whether we search on it or not.

That sometimes causes a problem. For example, if I search for the term
'hospitality' and highlight on the field that has stemming applied, the
highlights include both 'hospital' and 'hospitality', whereas they should
include only 'hospitality' because I am doing an exact term search. Can you
suggest anything for this? Is there a way to eliminate this issue while still
highlighting on the original field (the one with stemming applied)?

The other solutions sound really good, but as you said they are hard to
implement, and at this point we want to use built-in solutions if
possible.

Please suggest how we can eliminate the highlighting issue explained above.

Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Make-Exact-Search-on-Field-with-Fuzzy-Query-tp4012888p4013067.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr - Make Exact Search on Field with Fuzzy Query

2012-10-11 Thread Erick Erickson
Right, and going the other way (storing and highlighting on the non-stemmed
field) would be unsatisfactory because you'd get a hit on hospital in the
stemmed field, but it wouldn't be highlighted if you searched on hospitality.
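
To make the mismatch concrete, here is a minimal sketch in Python. It is not Solr's highlighter: the two-entry stem table, the `highlight` helper and the sample sentence are all made up for illustration, and the only point is that whichever analysis is applied to the highlighted field decides which words get marked.

```python
# Toy stem table (an assumption for illustration, not a real stemmer).
TOY_STEMS = {"hospitality": "hospit", "hospital": "hospit"}


def stemmed(token):
    """Analysis for the stemmed field: map known words to a shared stem."""
    return TOY_STEMS.get(token, token)


def unstemmed(token):
    """Analysis for the non-stemmed (exact) field: leave the token alone."""
    return token


def highlight(text, query, analyze):
    """Wrap every word whose analyzed form equals the analyzed query term."""
    target = analyze(query.lower())
    marked = []
    for word in text.split():
        core = word.strip(".,;:!?")  # ignore surrounding punctuation
        if core and analyze(core.lower()) == target:
            marked.append(word.replace(core, "<em>" + core + "</em>"))
        else:
            marked.append(word)
    return " ".join(marked)


text = "The hospital praised its hospitality."
on_stemmed_field = highlight(text, "hospitality", stemmed)
on_exact_field = highlight(text, "hospitality", unstemmed)
```

Highlighting on the stemmed analysis marks both words, while the unstemmed analysis marks only the literal term, which is the trade-off described above.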

I really don't see a good solution here. Highlighting seems to be one of those
things that's easy in concept but has a zillion ways to go wrong.

I guess I'd really just go with the copyField approach unless you can prove that
it's really a problem. Perhaps lost in my first e-mail is that storing the field
twice doesn't really affect search speed or _search_ requirements at all. Take a
look here:
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_0/fileformats.html#file-names

Note that the *.fdt and *.fdx files are where the original raw copy goes
(i.e. where data gets written when you specify stored=true)
and they are completely independent of the files that contain the searchable
data. So unless you're disk-space constrained, the additional storage really
doesn't cost you much.
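
For what it's worth, with both fields stored the request side stays simple: query and highlight on whichever field matches the kind of search. A rough sketch in Python that only builds the request parameters. The field name `Description_exact` is hypothetical (the thread never names the copy field), the term is wrapped in quotes as a phrase without any escaping, and `q`, `hl`, `hl.fl` and `wt` are the standard Solr request parameters.

```python
from urllib.parse import urlencode

STEMMED_FIELD = "Description"      # field name taken from the original question
EXACT_FIELD = "Description_exact"  # hypothetical name for the non-stemmed copy


def build_select_params(term, exact):
    """Query and highlight on the same field, chosen by search mode."""
    field = EXACT_FIELD if exact else STEMMED_FIELD
    return {
        "q": '{}:"{}"'.format(field, term),  # no escaping of special characters here
        "hl": "true",
        "hl.fl": field,
        "wt": "json",
    }


def build_query_string(term, exact):
    """URL-encode the parameters for a /select request."""
    return urlencode(build_select_params(term, exact))


exact_params = build_select_params("hospitality", exact=True)
stemmed_query_string = build_query_string("hospitality", exact=False)
```

The resulting string would be appended to the core's select URL; the host and core name depend on the installation and are left out here.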

Best
Erick



Solr - Make Exact Search on Field with Fuzzy Query

2012-10-10 Thread meghana

We are using Solr 3.6.

We have a field named Description. We want to search it both with stemming and
without stemming (exact word/phrase search), with highlighting in both
cases.

After a lot of research, we concluded that we should use a copy field whose
field type has no stemming filter factory. This is working
fine at present.

(The main field has stemming and the copy field does not.)

The data for that field is very large and we have millions of
documents. Because we want both searching and highlighting on both fields, we
need to keep the copy field both stored and indexed, which will increase the
index size a lot.

We need to eliminate this duplication if at all possible.

From recent research, we read that combining fuzzy search with dismax
would fulfill our requirement. (We have tried it a bit but without success.)

Please let me know if this is possible, or whether there are other solutions to
make this happen.

Thanks in advance.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Make-Exact-Search-on-Field-with-Fuzzy-Query-tp4012888.html


Re: Solr - Make Exact Search on Field with Fuzzy Query

2012-10-10 Thread Erick Erickson
There's nothing really built in to Solr to allow this. Are you
absolutely sure you can't just use the copyField? Have you
actually tried it?

But I don't think you need to store the contents twice. Just
store it once and always highlight on that field whether you
search it or not. Since it's the raw text, you should be fine.
You'll have two versions of the field tokenized of course, but
that should take less space than you might think. You
probably want to store the version with the stemming turned on...

That said, storing twice only uses up some disk space, it
doesn't require additional memory for searching. So unless
you're running out of disk space you can just keep two stored
versions around.

But if none of that works, you might write a custom filter that
emits two tokens for each input token at indexing
time, similar to what synonyms do. The original should
have some special character appended, say $ and the
second should be the results of stemming (note, there
will be two tokens even if there is no stemming done).
So, indexing running would index running$ and run.
Now, when you need to search for an exact match on
running, you search for running$.

This works for the reverse too. Since the rule is append
$ to all original tokens run gets indexed as run$ and run.
Now, searching for run matches as does run$. But
run$ does not match the doc that had running since the two
tokens emitted in that case are run and running$.
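
The two-token scheme described above can be sketched outside of Solr in a few lines of Python. This is only a simulation of what such a custom filter would emit, not a Lucene TokenFilter: the `toy_stem` rules are a stand-in for a real stemmer and the helper names are made up for illustration.

```python
EXACT_MARKER = "$"  # special character appended to every original token


def toy_stem(token):
    """Stand-in stemmer (assumption): strip a few suffixes from longer words."""
    for suffix in ("ning", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_tokens(text):
    """Emit two tokens per input word: the marked original and the stem."""
    tokens = set()
    for word in text.lower().split():
        tokens.add(word + EXACT_MARKER)  # always emitted, even if stemming is a no-op
        tokens.add(toy_stem(word))
    return tokens


def matches(doc_text, term, exact):
    """Exact search looks up term$, stemmed search looks up the stem."""
    term = term.lower()
    query_token = term + EXACT_MARKER if exact else toy_stem(term)
    return query_token in index_tokens(doc_text)
```

With this, a document containing "running" is indexed as running$ and run, so an exact search for running and a stemmed search for run both match it, but an exact search for run (that is, run$) does not.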

But look at what's happened here. You're indexing two tokens
for every one token in the input. Furthermore, you're adding
a bunch of unique tokens to the index. It's hard to see how this
results in any savings over just using copyField. You have
to index the two tokens since you have to distinguish between
the stemmed and un-stemmed version.

You might be able to do something really exotic with payloads.
This is _really_ out of left field, but it just occurred to me. You'd
have to define a transformation from the original word into the
stemmed word that created a unique value. Something like
no stemming - 0
removing ing - 1
removing s - 2

etc. Actually, this would have to be some kind of function on the
letters removed so that removing ing mapped to, say, the sum over the removed
letters of the ordinal position of the letter in the alphabet times a
multiplier that grows by a factor of 100 with each position. So
ing would map to ('i' - 'a') + ('n' - 'a') * 100 + ('g' - 'a') * 10000 etc...
(you'd have to take considerable care to get this right for any
code sets that had more than 100 possible code points)...
Now, you've included the information about what the original
word was and could use the payload to fail to match in the
exact-match case. Of course the other issue would be to figure
out the syntax to get the fact that you wanted an exact match
down into your custom scorer.
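
A rough Python sketch of the kind of encoding meant here, with no Lucene payload API involved. It assumes lowercase a-z only and pure suffix removal, and the function names are made up. One deliberate deviation from the formula above: letters are numbered from 1 rather than 0, so that a removed "a" cannot collide with the value 0 that stands for "no stemming".

```python
BASE = 100  # multiplier step per suffix position, as suggested above


def encode_removed_suffix(original, stemmed):
    """Encode the letters that stemming removed as a single integer."""
    if not original.startswith(stemmed):
        raise ValueError("this sketch only handles pure suffix removal")
    suffix = original[len(stemmed):]
    code = 0
    for position, letter in enumerate(suffix):
        if not "a" <= letter <= "z":
            raise ValueError("this sketch only handles lowercase a-z")
        # 1-based letter value so that 0 is reserved for "nothing removed"
        code += (ord(letter) - ord("a") + 1) * BASE ** position
    return code


def decode_removed_suffix(code):
    """Recover the removed letters from the integer."""
    letters = []
    while code > 0:
        code, digit = divmod(code, BASE)
        letters.append(chr(digit - 1 + ord("a")))
    return "".join(letters)


def is_exact_match(query, stemmed, payload):
    """Rebuild the original word from stem + payload and compare to the query."""
    return stemmed + decode_removed_suffix(payload) == query
```

For example, stemming walking to walk would store the code for "ing" alongside the token walk, and an exact-match check for walk against that token would fail because the rebuilt word is walking.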

But as you can see, any scheme is harder than just flipping a switch,
so I'd _really_ verify that you can't just use copyField.

Best
Erick
