On Jan 10, 2011, at 5:04 PM, lee carroll wrote:

> Hi Grant,
>
> Its a search relevancy problem. For example:
>
> a document about london reads like
>
> London is not very good for a peaceful break.
>
> we analyse this at the (i can't remember the technical term) is it lexical
> level? (bloody hell i think you may have wrote the book !) anyway which
> produces tokens in our index of say
>
> "London good peaceful holiday"
>
> users search for cities which would be nice for them to take a holiday in
> say the search is
> "good for a peaceful break"
>
> and bang london is top. talk about a relevancy problem :-)
First question, why are you getting rid of "not"? Despite its reputation as a "stopword", it does carry a significant amount of meaning for you. Then, you could probably do some phrase-based searching that would help in some cases.

>
> now i was thinking of using phrase matches in the synonyms file but is that
> the best approach or could nlp help here?

I suppose it could. During indexing, you could detect that a sentence has a negative connotation and change it to "bad for a peaceful break" or something like that. I'm not aware of any system that does that. You could also use sentiment analysis to determine that the sentence is negative and tag it as such, so that your query takes that into account. Payloads and/or marker tokens would likely help here.

-Grant

>
> cheers lee
>
>
>
>
> On 10 January 2011 18:21, Grant Ingersoll <gsing...@apache.org> wrote:
>
>>
>> On Jan 10, 2011, at 12:42 PM, lee carroll wrote:
>>
>>> Hi
>>>
>>> I'm indexing a set of documents which have a conversational writing style.
>>> In particular the authors are very fond
>>> of listing facts in a variety of ways (this is to keep a human reader
>>> interested) but its causing my index trouble.
>>>
>>> For example instead of listing facts like: the house is white, the castle is
>>> pretty.
>>>
>>> We get the house is the complete opposite of black and the castle is not
>>> ugly.
>>>
>>> What are the best approaches to resolve these sorts of issues. Even if its
>>> just handling "not" correctly would be a good start
>>>
>>
>> Hmm, good problem. I guess I'd start by stepping back and ask what is the
>> problem you are trying to solve? You've stated, I think, one half of the
>> problem, namely that your authors have a conversational style, but you
>> haven't stated what your users are expecting to do with this information?
>> Is this a pure search app? Is it something else that is just backed by
>> Solr but the user would never do a search?
>>
>> Do you have a relevance problem? Also, what is your notion of handling
>> "not" correctly? In other words, more details are welcome!
>>
>> -Grant
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search
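
P.S. To make the marker-token suggestion in the reply concrete, here is a rough sketch in plain Python. It is not a Solr or Lucene analyzer, and every name in it (analyze, NEGATORS, STOPWORDS, the NOT_ prefix) is made up for illustration. It assumes a simple scoping rule borrowed from common sentiment-analysis practice: a negator marks every following term until the next punctuation mark. Whether that rule suits your documents is something you would have to test.

```python
import re

# Hypothetical word lists for illustration only; a real analyzer would use
# its own stopword and negation configuration.
NEGATORS = {"not", "no", "never"}
STOPWORDS = {"is", "very", "for", "a", "the"}
CLAUSE_END = re.compile(r"[.,;:!?]")


def analyze(text):
    """Tokenize text, prefixing terms inside a negated clause with NOT_."""
    tokens = []
    negated = False
    for raw in re.findall(r"[a-z']+|[.,;:!?]", text.lower()):
        if CLAUSE_END.fullmatch(raw):
            negated = False  # assumed rule: negation scope ends at punctuation
            continue
        if raw in NEGATORS:
            negated = True   # drop the negator itself, but remember its effect
            continue
        if raw in STOPWORDS:
            continue
        tokens.append("NOT_" + raw if negated else raw)
    return tokens


doc_tokens = analyze("London is not very good for a peaceful break.")
# ['london', 'NOT_good', 'NOT_peaceful', 'NOT_break']

query_tokens = analyze("good for a peaceful break")
# ['good', 'peaceful', 'break']

# The query terms no longer match the negated document terms.
shared = set(doc_tokens) & set(query_tokens)
```

With the negated terms indexed under distinct marker tokens, the London document shares no terms with the query "good for a peaceful break", so it would not rank on those words. The trade-off is that a query which is itself negated needs the same treatment at query time.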