Some further details from memory:
- it is a stream-based feature
- IDF estimates get updated and refined as more and more documents pass through
- it is actually IDF weighting with stopwords and boosting
-- stopwords should be ignored and not get vectorized
-- boosting should give some boost to vectors
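
A minimal sketch of the stream-based part described above, assuming whitespace-tokenized documents and a plain natural-log IDF. The class and method names (StreamingDocFreq, observe, idf) are made up for illustration and are not FAST or Solr APIs. Boosting is left out because its exact behaviour is not specified here.

```python
import math
from collections import Counter


class StreamingDocFreq:
    """Running document-frequency counts over a stream of documents (hypothetical helper)."""

    def __init__(self, stopwords=()):
        self.stopwords = frozenset(stopwords)
        self.docs_seen = 0          # number of documents passed through so far
        self.doc_freq = Counter()   # term -> number of documents containing it

    def observe(self, tokens):
        """Update the counts with one document; stopwords are ignored."""
        self.docs_seen += 1
        # Count each term once per document, no matter how often it occurs.
        self.doc_freq.update(set(tokens) - self.stopwords)

    def idf(self, term):
        """Current IDF estimate; 0.0 for terms not seen yet (an assumption)."""
        df = self.doc_freq[term]
        if df == 0:
            return 0.0
        return math.log(self.docs_seen / df)


stats = StreamingDocFreq(stopwords={"the"})
for doc in ["the cat sat", "the dog sat", "the cat ran"]:
    stats.observe(doc.split())
    # The estimate for "cat" changes as more documents pass through.
    print(stats.docs_seen, stats.idf("cat"))
```

The estimate for a term only becomes stable once enough documents have passed through, which is the refinement effect mentioned above.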

There are some further configuration parameters.
nmin - minimum number of occurrences
type (of IDF weighting) - flat, linear, logarithmic
- flat: gives IDF the value 0 if the number of occurrences of the string
        in the document is less than nmin, else 1.
- linear: interpolates linearly between 0 and 1,
          returns 0 if the number of occurrences is below nmin, else
          returns (1 - (# of docs with string found / # of docs passed through))
- logarithmic: uses the natural logarithm, weights rarity more heavily,
          returns 0 if the number of occurrences is below nmin, else
          returns ln(# of docs passed through / # of docs with string found)
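
A minimal sketch of the three types as described above, written from this description and not taken from the FAST implementation. The function name idf_weight, its parameter names and the default nmin=1 are assumptions for illustration only.

```python
import math


def idf_weight(kind, occurrences, docs_with_term, docs_seen, nmin=1):
    """Return the IDF weight for one string in one document.

    kind           -- "flat", "linear" or "logarithmic"
    occurrences    -- occurrences of the string in the current document
    docs_with_term -- number of docs in which the string was found so far
    docs_seen      -- number of docs passed through so far
    nmin           -- minimum number of occurrences (default is an assumption)
    """
    # All three types return 0 when the string occurs fewer than nmin times.
    if occurrences < nmin:
        return 0.0
    if kind == "flat":
        return 1.0
    if docs_with_term <= 0 or docs_seen <= 0:
        raise ValueError("document counts must be positive")
    if kind == "linear":
        return 1.0 - docs_with_term / docs_seen
    if kind == "logarithmic":
        return math.log(docs_seen / docs_with_term)
    raise ValueError("unknown IDF weighting type: %r" % (kind,))


# A string found in 5 of 100 docs, occurring 3 times in the current document.
for kind in ("flat", "linear", "logarithmic"):
    print(kind, idf_weight(kind, occurrences=3, docs_with_term=5,
                           docs_seen=100, nmin=2))
```

Note that linear stays within 0 and 1, while logarithmic is unbounded and grows as the string gets rarer.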

I think logarithmic was the default (as far as I can remember).


One question that came up while thinking about this feature: is it possible
with Solr/Lucene to access the IDF of strings from the index while processing
new documents?


-- Bernd

On 05.09.2014 16:35, Jack Krupansky wrote:
> Sounds like a great feature to add to Solr, especially if it would facilitate 
> more automatic relevancy enhancement. LucidWorks Search has a
> feature called "unsupervised feedback" that does that but something like a 
> docvector might make it a more realistic default.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: "Jürgen Wagner (DVT)"
> Sent: Friday, September 5, 2014 10:29 AM
> To: solr-user@lucene.apache.org
> Subject: Re: FAST-like document vector data structures in Solr?
> 
> Thanks for posting this. I was just about to send off a message of
> similar content :-)
> 
> Important to add:
> 
> - In FAST ESP, you could have more than one such docvector associated
> with a document, in order to reflect different metrics.
> 
> - Term weights in docvectors are document-relative, not absolute.
> 
> - Processing is done in the search processor (close to the index), not
> in the QR server (providing transformations on the result list).
> 
> This docvector could be used for unsupervised clustering,
> related-to/similarity search, tag clouds or more weird stuff like
> identifying experts on topics contained in a particular document.
> 
> With Solr, it seems I have to handcraft the term vectors to reflect the
> right weights, to approximate the effect of FAST docvectors, e.g., by
> normalizing them to [0...10000). Processing performance would still be
> different from the classical FAST docvectors. The space consumption may
> become ugly for a 200+ GB range shard; however, FAST has also been quite
> generous with disk space, anyway.
> 
> So, the interesting question is whether there is a more canonical way of
> handling this in Solr/Lucene, or if something like it is planned for 5.0+.
> 
> Best regards,
> --Jürgen
> 
> On 05.09.2014 16:02, Jack Krupansky wrote:
>> For reference:
>>
>> “Item Similarity Vector Reference
>>
>> This property represents a similarity reference when searching for similar 
>> items. This is a similarity vector representation that is returned
>> for each item in the query result in the docvector managed property.
>>
>> The value is a string formatted according to the following format:
>>
>> [string1,weight1][string2,weight2]...[stringN,weightN]
>>
>> When performing a find similar query, the SimilarTo element should contain a 
>> string parameter with the value of the docvector managed property
>> of the item that is to be used as the similarity reference. The similarity 
>> vector consists of a set of "term,weight" expressions, indicating
>> the most important terms or concepts in the item and the corresponding 
>> perceived importance (weight). Terms can be single words or phrases.
>>
>> The weight is a float value between 0 and 1, where 1 indicates the highest 
>> relevance.
>>
>> The similarity vector is created during item processing and indicates the 
>> most important terms or concepts in the item and the corresponding
>>  weight.”
>>
>> See:
>> http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx
>>
>> -- Jack Krupansky
> 

-- 
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
