Re: FAST-like document vector data structures in Solr?

Jack Krupansky Fri, 05 Sep 2014 07:37:05 -0700

Sounds like a great feature to add to Solr, especially if it would facilitate more automatic relevancy enhancement. LucidWorks Search has a feature called "unsupervised feedback" that does that, but something like a docvector might make it a more realistic default.


-- Jack Krupansky

-----Original Message-----
From: "Jürgen Wagner (DVT)"

Sent: Friday, September 5, 2014 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: FAST-like document vector data structures in Solr?

Thanks for posting this. I was just about to send off a message of
similar content :-)

Important to add:

- In FAST ESP, you could have more than one such docvector associated
with a document, in order to reflect different metrics.

- Term weights in docvectors are document-relative, not absolute.

- Processing is done in the search processor (close to the index), not
in the QR server (providing transformations on the result list).

This docvector could be used for unsupervised clustering,
related-to/similarity search, tag clouds or more weird stuff like
identifying experts on topics contained in a particular document.

With Solr, it seems I have to handcraft the term vectors to reflect the
right weights, to approximate the effect of FAST docvectors, e.g., by
normalizing them to [0...10000). Processing performance would still be
different from the classical FAST docvectors. Space consumption may
become ugly for a shard in the 200+ GB range; then again, FAST has also
been quite generous with disk space.
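
To make the normalization idea concrete, here is a minimal sketch in
Python. It assumes the raw weights are non-negative scores per term
(e.g. tf-idf values computed outside Solr) and scales them relative to
the largest weight in the same document. The function name and the
choice of "divide by the maximum" are assumptions for illustration, not
how FAST ESP computed its weights.

```python
def normalize_weights(raw_weights, scale=10000):
    """Map raw term weights to integers in [0, scale).

    Weights are made document-relative by dividing by the largest
    weight in this document (an assumed normalization scheme).
    """
    if not raw_weights:
        return {}
    top = max(raw_weights.values())
    if top <= 0:
        # No positive weight in this document: everything maps to 0.
        return {term: 0 for term in raw_weights}
    normalized = {}
    for term, weight in raw_weights.items():
        scaled = int(max(weight, 0.0) / top * scale)
        # Clamp so the top term stays inside the half-open range.
        normalized[term] = min(scale - 1, scaled)
    return normalized


print(normalize_weights({"solr": 2.0, "fast": 1.0}))
```

The integers could then be stored alongside the terms, for example as
payloads or in a separate field; which of these fits best depends on
the schema and is not settled here.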

So, the interesting question is whether there is a more canonical way of
handling this in Solr/Lucene, or if something like it is planned for 5.0+.

Best regards,
--Jürgen

On 05.09.2014 16:02, Jack Krupansky wrote:

For reference:

“Item Similarity Vector Reference
This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property.
The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]
When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference. The similarity vector consists of a set of "term,weight" expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases.
The weight is a float value between 0 and 1, where 1 indicates the highest relevance.
The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.”
See:
http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx
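
As a rough illustration of the format quoted above, the following
Python sketch parses a docvector string into a term-to-weight mapping
and compares two such vectors. The parsing assumes that terms contain
no square brackets and that the weight follows the last comma in each
pair. Cosine similarity is used here only as one plausible measure; the
quoted documentation does not say how FAST scores similarity, and the
function names are hypothetical.

```python
import math
import re

# One "[term,weight]" pair; the term may be a phrase containing spaces.
_PAIR = re.compile(r"\[([^\[\]]*),\s*([^\[\],]+)\]")


def parse_docvector(text):
    """Parse "[term1,w1][term2,w2]..." into a dict of term -> float."""
    vector = {}
    for term, weight in _PAIR.findall(text):
        vector[term.strip()] = float(weight)
    return vector


def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = parse_docvector("[solr,0.9][document vector,0.5]")
doc_b = parse_docvector("[solr,0.4][relevancy,0.8]")
print(cosine(doc_a, doc_b))
```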

-- Jack Krupansky
