Sounds like a great feature to add to Solr, especially if it would facilitate more automatic relevancy enhancement. LucidWorks Search has a feature called "unsupervised feedback" that does that, but something like a docvector might make it a more realistic default.

-- Jack Krupansky

-----Original Message-----
From: "Jürgen Wagner (DVT)"
Sent: Friday, September 5, 2014 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: FAST-like document vector data structures in Solr?

Thanks for posting this. I was just about to send off a message of
similar content :-)

Important to add:

- In FAST ESP, you could have more than one such docvector associated
with a document, in order to reflect different metrics.

- Term weights in docvectors are document-relative, not absolute.

- Processing is done in the search processor (close to the index), not
in the QR server (providing transformations on the result list).

This docvector could be used for unsupervised clustering,
related-to/similarity search, tag clouds or more weird stuff like
identifying experts on topics contained in a particular document.

With Solr, it seems I have to handcraft the term vectors to reflect the
right weights in order to approximate the effect of FAST docvectors,
e.g., by normalizing them to [0...10000). Processing performance would
still differ from that of the classical FAST docvectors. Space
consumption may become ugly for a shard in the 200+ GB range; then
again, FAST has also been quite generous with disk space.
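
To illustrate the normalization mentioned above, here is a minimal sketch. The function name normalize_weights, the choice of scaling by the largest weight in the document, and the sample weights are all assumptions made for illustration; nothing here is a Solr or FAST API.

```python
def normalize_weights(raw, scale=10000):
    """Map raw per-document term weights to integers in [0, scale).

    Assumption: weights are made document-relative by dividing by the
    largest weight in the same document. Other schemes are possible.
    """
    if not raw:
        return {}
    top = max(raw.values())
    if top <= 0:
        # No positive weight to scale against, so everything maps to 0.
        return {term: 0 for term in raw}
    normalized = {}
    for term, weight in raw.items():
        scaled = int(max(0.0, weight) / top * scale)
        # Clamp so the top term stays inside the half-open range.
        normalized[term] = min(scale - 1, scaled)
    return normalized


# Hypothetical raw weights (e.g. tf-idf scores) for one document.
doc_weights = {"solr": 4.0, "fast": 2.0, "index": 1.0}
print(normalize_weights(doc_weights))
```

The integers produced this way would still have to be stored per term by whatever indexing code builds the term vectors.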

So, the interesting question is whether there is a more canonical way of
handling this in Solr/Lucene, or whether something like it is planned
for 5.0+.

Best regards,
--Jürgen

On 05.09.2014 16:02, Jack Krupansky wrote:
For reference:

“Item Similarity Vector Reference

This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property.

The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of "term,weight" expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases.

The weight is a float value between 0 and 1, where 1 indicates the highest relevance.

The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.”

See:
http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx
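
As a rough sketch of how a string in the quoted format could be handled on the client side, the following parses "[term,weight]" pairs into a dictionary and compares two vectors. The function names parse_docvector and cosine_similarity, the sample strings, and the use of cosine similarity are assumptions for illustration only; the quoted documentation does not say how FAST computes similarity.

```python
import math
import re

# One "[term,weight]" entry. The term may itself contain commas, so the
# greedy term group backtracks to the last comma before the weight.
_ENTRY = re.compile(r"\[([^\[\]]+),\s*([0-9]*\.?[0-9]+)\s*\]")


def parse_docvector(text):
    """Parse '[string1,weight1][string2,weight2]...' into {term: weight}."""
    return {term.strip(): float(weight) for term, weight in _ENTRY.findall(text)}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-weight vectors (an assumed metric)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical docvector strings for two items.
first = parse_docvector("[solr,0.9][document vector,0.5]")
second = parse_docvector("[solr,0.7][lucene,0.4]")
print(first)
print(cosine_similarity(first, second))
```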

-- Jack Krupansky
