For reference: “Item Similarity Vector Reference
This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property. The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference.

The similarity vector consists of a set of "term,weight" expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases. The weight is a float value between 0 and 1, where 1 indicates the highest relevance. The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.”

See:
http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx

-- Jack Krupansky

From: "Jürgen Wagner (DVT)"
Sent: Friday, September 5, 2014 7:03 AM
To: solr-user@lucene.apache.org
Subject: Re: FAST-like document vector data structures in Solr?

Hello Jim,

yes, I am aware of the TermVector and MoreLikeThis stuff. I am presently mapping docvectors to these mechanisms and creating term vectors myself from third-party text mining components. However, it's not quite like the FAST docvectors. In particular, the performance of MoreLikeThis queries based on TermVectors is suboptimal on large document sets, so more efficient support for such retrievals in the Lucene kernel would be preferred.
Cheers,
--Jürgen

On 05.09.2014 10:55, jim ferenczi wrote:

Hi,

Something like this?
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component

And just to show some impressive search functionality of the wiki: ;)
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors

Cheers,
Jim

2014-09-05 9:44 GMT+02:00 "Jürgen Wagner (DVT)" <juergen.wag...@devoteam.com>:

Hello all,

as the migration from FAST to Solr is a relevant topic for several of our customers, there is one issue that does not seem to be addressed by Lucene/Solr: document vectors FAST-style. These document vectors are used to form metrics of similarity, i.e., they may be used as a "semantic fingerprint" of documents to define similarity relations.

I can think of several ways of approximating a mapping of this mechanism to Solr, but there are always drawbacks, mostly performance-wise. Has anybody else encountered and possibly approached this challenge so far? Is there anything on the Solr roadmap that I have missed that addresses this issue?

Your input is greatly appreciated!

Cheers,
--Jürgen

--
Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center "Intelligence" & Senior Cloud Consultant
Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
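
For anyone mapping FAST docvectors onto their own structures, here is a minimal Python sketch of reading the "[string1,weight1][string2,weight2]..." format quoted at the top of this thread into a term-to-weight map. The function names (parse_docvector, cosine_similarity) and the sample values are hypothetical. Cosine similarity is only one assumed way of turning two such vectors into a similarity metric; the quoted documentation does not say which measure FAST uses.

```python
import math
import re

# Assumed pattern: "[term,weight]" groups. A term may be a phrase and may
# itself contain commas, so the weight is the text after the LAST comma.
_PAIR = re.compile(r"\[([^\[\]]*),\s*([^,\[\]]+)\]")


def parse_docvector(value):
    """Parse '[term,weight][term,weight]...' into a dict of term -> float."""
    vector = {}
    for term, weight in _PAIR.findall(value):
        vector[term.strip()] = float(weight)
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (assumed metric)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Illustrative input only, not taken from a real FAST index.
print(parse_docvector("[solr,1][document vector,0.75]"))
# → {'solr': 1.0, 'document vector': 0.75}
```

This only covers the parsing side; it does nothing about the retrieval performance problem on large document sets raised above.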