Hi Alejandro,

N-grams <http://en.wikipedia.org/wiki/N-gram> might be a good fit.

Using bigrams (n-grams of length 2) for "london", you'd get tokens "lo", "on", 
"nd", "do", "on".  This should provide the hit ordering you want.
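
As a rough illustration of how those tokens come about, here is a minimal sketch in plain Python. It is not Solr's actual analysis code, and the function name char_ngrams is made up for this example:

```python
def char_ngrams(text, n=2):
    """Return the character n-grams of `text`, in order, duplicates kept."""
    # Slide a window of width n across the string, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("london"))  # ['lo', 'on', 'nd', 'do', 'on']
```

Note that "on" appears twice, since duplicates are kept in this sketch.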

Although it's not listed on Solr's analysis factories wiki page 
<http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters>, there is an 
NGramFilterFactory, with attributes maxGramSize and minGramSize.  See the 
example usage on the javadocs here: 
<http://lucene.apache.org/solr/api/org/apache/solr/analysis/NGramFilterFactory.html>.
There is also a tokenizer variant, NGramTokenizerFactory: 
<http://lucene.apache.org/solr/api/org/apache/solr/analysis/NGramTokenizerFactory.html>.

Steve

-----Original Message-----
From: Alejandro Cuesta [mailto:alejandro.cue...@gmail.com] 
Sent: Wednesday, May 16, 2012 12:51 PM
To: solr-user@lucene.apache.org
Subject: Sort by length percentage match

Hi,

I have a field containing "cities" and I'd like to sort the results based on 
length percentage match.

Example:

Assuming I've got these cities in the index:

   london, south west london, londonderry, oxford

And I search for "london", I'd like to get a list sorted like this:

london               (6/6, 100% match)
londonderry          (6/11, 54% match)
south west london    (6/17, 35% match)
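
The percentages above are the query length divided by the length of the city name, with spaces counted. A minimal sketch of that arithmetic in plain Python follows. It is independent of Solr and Lucene; the function name length_match is hypothetical, and treating a "match" as simple substring containment is an assumption:

```python
def length_match(query, candidate):
    """Percentage of `candidate` covered by `query`, or None if there is no match.

    Assumption: a "match" means `query` occurs as a substring of `candidate`.
    """
    if query not in candidate:
        return None
    return 100.0 * len(query) / len(candidate)


cities = ["london", "south west london", "londonderry", "oxford"]

# Score every city, drop the non-matches, and sort best match first.
scored = [(city, length_match("london", city)) for city in cities]
ranked = sorted(
    (pair for pair in scored if pair[1] is not None),
    key=lambda pair: pair[1],
    reverse=True,
)

for city, pct in ranked:
    print(f"{city}: {pct:.1f}%")
```

This ranks london first, then londonderry, then south west london, and leaves oxford out.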

I know Lucene uses a different scoring algorithm based on term frequency and 
inverse document frequency (tf & idf), but in my specific example I need to use 
this scoring strategy.

Can anyone give me a clue or a starting point, please?
Is there a better technology to perform this kind of search?

Thanks,

Alejandro
