Re: Custom scoring for searching geographic objects

2010-12-19 Thread Alexey Serba
Hi Pavel,

I had a similar problem several years ago: I had to find
geographical locations in textual descriptions, geocode those objects
to lat/long during indexing, and let users filter/sort
search results by specific geographical areas. The important issue was
that there were several types of geographical objects: street < town
< region < country. The idea was to geocode to the narrowest
geographical area possible. The relevance logic in this case can be
stated as: find the narrowest result that is uniquely identified by
your text or search query. So I came up with a custom algorithm that
was quite good in terms of performance and precision/recall. Here's
a simple description:
* Intersect all text/search query terms with a locations
dictionary to keep only the geo terms.
* Search your locations Lucene index, filtering to street objects only
(the narrowest areas). Thanks to the tf*idf formula you'll get the most
relevant results first. Then post-process the top N (3/5/10) results and
verify that they are indeed matches. I intersected the search terms with
each result's terms and ran another Lucene search to verify that these
terms uniquely identify the match. If they do, return the matching street.
* If there is no match, repeat the same algorithm with towns,
then regions, then countries.
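
The steps above can be sketched in plain Python. This is only an
illustration under loud assumptions, not the original implementation:
the in-memory LOCATIONS list stands in for the locations Lucene index,
ranking by the number of shared terms stands in for tf*idf, and the
names Location, make, and geocode are hypothetical.

```python
from dataclasses import dataclass

LEVELS = ["street", "town", "region", "country"]  # narrowest first


@dataclass(frozen=True)
class Location:
    name: str
    level: str
    terms: frozenset


def tokenize(text):
    return text.lower().replace(",", " ").split()


def make(name, level):
    return Location(name, level, frozenset(tokenize(name)))


# Hypothetical sample data standing in for the locations index.
LOCATIONS = [
    make("Altayskaya street Moscow", "street"),
    make("Tverskaya street Moscow", "street"),
    make("Altayskaya street Ivanovo", "street"),
    make("Moscow", "town"),
    make("Ivanovo", "town"),
    make("Russia", "country"),
]


def geocode(text, locations, top_n=5):
    # Step 1: keep only the terms that occur in the locations dictionary.
    dictionary = set().union(*(loc.terms for loc in locations))
    geo_terms = {t for t in tokenize(text) if t in dictionary}
    # Steps 2 and 3: try the narrowest level first, then widen.
    for level in LEVELS:
        pool = [loc for loc in locations if loc.level == level]
        # Stand-in for the Lucene search: rank by number of shared terms.
        ranked = sorted(pool, key=lambda loc: len(loc.terms & geo_terms),
                        reverse=True)
        for cand in ranked[:top_n]:
            shared = cand.terms & geo_terms
            if not shared:
                continue
            # Verification: the shared terms must identify exactly one
            # object at this level, otherwise the match is ambiguous.
            same = [loc for loc in pool if shared <= loc.terms]
            if same == [cand]:
                return cand
    return None
```

With this sample data, geocode("office on Tverskaya street in Moscow",
LOCATIONS) resolves to the street, "Moscow, Russia" falls through to
the town because "Moscow" alone does not identify a single street, and
"Altayskaya street" returns None because two streets share those terms.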

HTH,
Alexey




Custom scoring for searching geographic objects

2010-12-15 Thread Pavel Minchenkov
Hi,
Please advise me on how to create custom scoring. I need the resulting
documents to be ordered by how popular each term in the document is
(popular = how many times it appears in the index) and by the length of
the document (fewer terms = higher in the search results).

For example, the index contains the following data:

ID | SEARCH_FIELD
---+------------------------------------------
1  | Russia
2  | Russia, Moscow
3  | Russia, Volgograd
4  | Russia, Ivanovo
5  | Russia, Ivanovo, Altayskaya street 45
6  | Russia, Moscow, Kremlin
7  | Russia, Moscow, Altayskaya street
8  | Russia, Moscow, Altayskaya street 15
9  | Russia, Moscow, Altayskaya street 15/26


And I should get the following results:

Query      | Document result set
-----------+---------------------
Russia     | 1,2,4,3,6,7,8,9,5
Moscow     | 2,6,7,8,9
Ivanovo    | 4,5
Altayskaya | 7,8,9,5

In fact, this is a search for geographic objects (cities, streets, houses).
Only part of the address may be given, and the most relevant results
should appear first.
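
The ordering described above can be sketched in plain Python. This is
only an illustration, assuming whitespace/comma tokenization, document
frequency as the measure of "popularity", and a sort key of (fewer
terms first, higher summed popularity first, lower ID first); the names
DOCS, tokenize, and search are hypothetical, and nothing here uses
Lucene or Solr.

```python
from collections import Counter

DOCS = {
    1: "Russia",
    2: "Russia, Moscow",
    3: "Russia, Volgograd",
    4: "Russia, Ivanovo",
    5: "Russia, Ivanovo, Altayskaya street 45",
    6: "Russia, Moscow, Kremlin",
    7: "Russia, Moscow, Altayskaya street",
    8: "Russia, Moscow, Altayskaya street 15",
    9: "Russia, Moscow, Altayskaya street 15/26",
}


def tokenize(text):
    return text.lower().replace(",", " ").split()


def search(query, docs=DOCS):
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    # Popularity of a term = number of documents that contain it.
    df = Counter(t for terms in tokens.values() for t in set(terms))
    wanted = set(tokenize(query))
    hits = [d for d, terms in tokens.items() if wanted <= set(terms)]
    # Shorter documents first; among equal lengths, the document whose
    # terms are more popular overall first; document ID breaks ties.
    return sorted(
        hits,
        key=lambda d: (len(tokens[d]), -sum(df[t] for t in tokens[d]), d),
    )


print(search("Russia"))
```

Under these assumptions the sort key gives the orderings listed in the
table above for all four example queries.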

Thanks.
-- 
Pavel Minchenkov


Re: Custom scoring for searching geographic objects

2010-12-15 Thread Grant Ingersoll
Have a look at http://lucene.apache.org/java/3_0_2/scoring.html to see how
Lucene's scoring works. You can also override the Similarity class in Solr
via the schema.xml file.


--
Grant Ingersoll
http://www.lucidimagination.com