Re: FAST-like document vector data structures in Solr?

Jack Krupansky Fri, 05 Sep 2014 07:03:32 -0700

For reference:

“Item Similarity Vector Reference


This property represents a similarity reference when searching for similar 
items. This is a similarity vector representation that is returned for each 
item in the query result in the docvector managed property.

The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain a 
string parameter with the value of the docvector managed property of the item 
that is to be used as the similarity reference. The similarity vector consists 
of a set of "term,weight" expressions, indicating the most important terms or 
concepts in the item and the corresponding perceived importance (weight). Terms 
can be single words or phrases.

The weight is a float value between 0 and 1, where 1 indicates the highest 
relevance.

The similarity vector is created during item processing and indicates the most 
important terms or concepts in the item and the corresponding weight.”

See:
http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx
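
For anyone who wants to handle these values outside of FAST, here is a minimal sketch of a parser for the string format quoted above. It assumes that terms never contain square brackets and that each weight is a plain decimal number; the function name parse_docvector is made up for illustration and is not part of any FAST or Solr API.

```python
import re

# One "[term,weight]" pair. Assumption: terms contain no square brackets,
# and the weight is the plain decimal number after the last comma.
_PAIR = re.compile(r"\[([^\[\]]+?),\s*([0-9]*\.?[0-9]+)\]")


def parse_docvector(value):
    """Parse '[term1,w1][term2,w2]...' into a list of (term, weight) tuples."""
    return [(term.strip(), float(weight)) for term, weight in _PAIR.findall(value)]


if __name__ == "__main__":
    # Terms can be single words or phrases, as the reference describes.
    print(parse_docvector("[solr,1][search engine,0.75][lucene,0.5]"))
    # prints [('solr', 1.0), ('search engine', 0.75), ('lucene', 0.5)]
```

Malformed pairs are silently skipped in this sketch; stricter handling may be preferable for real data.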

-- Jack Krupansky

From: "Jürgen Wagner (DVT)" 
Sent: Friday, September 5, 2014 7:03 AM
To: solr-user@lucene.apache.org 
Subject: Re: FAST-like document vector data structures in Solr?

Hello Jim,
  yes, I am aware of the TermVector and MoreLikeThis stuff. I am presently 
mapping docvectors to these mechanisms and creating term vectors myself from 
third-party text mining components.
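
One conceivable way to do such a mapping, sketched here purely as an illustration and not as the approach actually used in this thread, is to turn the (term, weight) pairs of a docvector into a boosted Lucene query using the standard "phrase"^boost syntax. The field name "text" and the function name docvector_to_query are assumptions.

```python
def docvector_to_query(pairs, field="text"):
    """Turn (term, weight) pairs into a boosted Lucene query string.

    Assumption: `field` is a hypothetical indexed text field.
    """
    clauses = []
    for term, weight in pairs:
        # Quote every term so multi-word phrases stay intact,
        # escaping backslashes and double quotes inside the term.
        escaped = term.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"^{weight}')
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(docvector_to_query([("solr", 1.0), ("search engine", 0.75)]))
    # prints text:"solr"^1.0 OR text:"search engine"^0.75
```

This only shows the shape of the query; it makes no claim about how such a query performs on large document sets.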

However, it's not quite like the FAST docvectors. In particular, the 
performance of MoreLikeThis queries based on TermVectors is suboptimal on large 
document sets, so more efficient support for such retrievals in the Lucene 
kernel would be preferred.

Cheers,
--Jürgen

On 05.09.2014 10:55, jim ferenczi wrote:

Hi,
Something like this?
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
And just to show some impressive search functionality of the wiki: ;)
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors

Cheers,
Jim


2014-09-05 9:44 GMT+02:00 "Jürgen Wagner (DVT)" <juergen.wag...@devoteam.com>:
Hello all,
  as the migration from FAST to Solr is a relevant topic for several of
our customers, there is one issue that does not seem to be addressed by
Lucene/Solr: document vectors FAST-style. These document vectors are
used to form metrics of similarity, i.e., they may be used as a
"semantic fingerprint" of documents to define similarity relations. I
can think of several ways of approximating a mapping of this mechanism
to Solr, but there are always drawbacks - mostly performance-wise.
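
To make "metrics of similarity" concrete, here is a minimal sketch that compares two document vectors with cosine similarity. Cosine similarity is an assumption chosen for illustration; the thread does not say which metric FAST uses. Each vector is represented as a dict mapping term to weight.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = {"solr": 1.0, "lucene": 0.5}
    doc2 = {"solr": 0.8, "search engine": 0.6}
    print(cosine_similarity(doc1, doc2))
```

Computing this pairwise over a large collection is exactly the kind of work that becomes expensive without index support.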

Has anybody else encountered and possibly approached this challenge so far?

Is there anything on the Solr roadmap that addresses this issue and that I
have overlooked?

Your input is greatly appreciated!

Cheers,
--Jürgen





-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center "Intelligence"
& Senior Cloud Consultant 

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de


--------------------------------------------------------------------------------
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht 
Darmstadt HRB 6450; Tax Number: DE 172 993 071
