Re: FAST-like document vector data structures in Solr?

2014-09-08 Thread Bernd Fehling
Some further details from memory:
- it is a stream-based feature
- IDF estimates get updated and refined as more and more documents pass through
- it is actually IDF weighting with stopwords and boosting
-- stopwords should be ignored and not get vectorized
-- boosting should give some boost to vectors
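
As a rough illustration of the stream-based idea above, the following Python sketch keeps running document-frequency counts and refines a logarithmic IDF estimate as documents pass through. The class name StreamingIdf, its methods and the sample tokens are hypothetical; this is not how FAST implemented it, only a minimal sketch assuming that each distinct term counts once per document and that stopwords are skipped. Boosting is left out.

```python
import math
from collections import Counter


class StreamingIdf:
    """Running document-frequency counts, refined as documents pass through."""

    def __init__(self, stopwords=()):
        self.stopwords = set(stopwords)
        self.docs_seen = 0          # number of documents passed through so far
        self.doc_freq = Counter()   # term -> number of documents containing it

    def update(self, tokens):
        """Register one document; each distinct non-stopword term counts once."""
        self.docs_seen += 1
        self.doc_freq.update(set(tokens) - self.stopwords)

    def idf(self, term):
        """Estimate ln(docs seen / docs with term); 0.0 for unseen terms and stopwords."""
        df = self.doc_freq.get(term, 0)
        if df == 0:
            return 0.0
        return math.log(self.docs_seen / df)


est = StreamingIdf(stopwords={"the", "in"})
est.update(["the", "vector", "in", "solr"])
est.update(["solr", "index"])
```

After these two documents, a term seen in both gets an estimate of 0, while a term seen in only one gets ln(2); the estimates shift as more documents arrive.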

There are some further configuration parameters:
nmin - minimum number of occurrences
type (of IDF weighting) - flat, linear, logarithmic
- flat: gives IDF the value 0 if the number of occurrences of the string
  in the document is less than nmin, else 1.
- linear: interpolates linearly between 0 and 1,
  returns 0 if the number of occurrences is below nmin, otherwise
  returns (1 - (# of docs with string found / # of docs passed through))
- logarithmic: uses the natural logarithm, weights rarity more heavily,
  returns 0 if the number of occurrences is below nmin, otherwise
  returns ln(# of docs passed through / # of docs with string found)

I think logarithmic was default (as far as I can remember).
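
The three weighting types described above can be sketched as a single Python function. The name idf_weight and its parameter names are made up for illustration, and the formulas follow the description given from memory above rather than any FAST documentation.

```python
import math


def idf_weight(kind, occurrences, nmin, docs_with_term, docs_seen):
    """Return the IDF weight of one string for the given weighting type.

    occurrences    -- occurrences of the string in the current document
    nmin           -- minimum number of occurrences
    docs_with_term -- documents seen so far that contain the string
    docs_seen      -- documents passed through so far
    """
    if occurrences < nmin:
        return 0.0  # all three types return 0 below the threshold
    if kind == "flat":
        return 1.0
    if kind == "linear":
        return 1.0 - docs_with_term / docs_seen
    if kind == "logarithmic":
        return math.log(docs_seen / docs_with_term)
    raise ValueError(f"unknown weighting type: {kind}")


print(idf_weight("linear", occurrences=3, nmin=2, docs_with_term=25, docs_seen=100))
```

Under this reading the logarithmic variant is not bounded by 1, so it would still need scaling if weights are expected to lie between 0 and 1.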


A question that came up while thinking about this feature: is it possible with
Solr/Lucene to access the IDF of strings in the index while processing new
documents?


-- Bernd

On 05.09.2014 16:35, Jack Krupansky wrote:
 Sounds like a great feature to add to Solr, especially if it would facilitate 
 more automatic relevancy enhancement. LucidWorks Search has a
 feature called unsupervised feedback that does that, but something like a 
 docvector might make it a more realistic default.
 
 -- Jack Krupansky
 
 -Original Message- From: Jürgen Wagner (DVT)
 Sent: Friday, September 5, 2014 10:29 AM
 To: solr-user@lucene.apache.org
 Subject: Re: FAST-like document vector data structures in Solr?
 
 Thanks for posting this. I was just about to send off a message of
 similar content :-)
 
 Important to add:
 
 - In FAST ESP, you could have more than one such docvector associated
 with a document, in order to reflect different metrics.
 
 - Term weights in docvectors are document-relative, not absolute.
 
 - Processing is done in the search processor (close to the index), not
 in the QR server (providing transformations on the result list).
 
 This docvector could be used for unsupervised clustering,
 related-to/similarity search, tag clouds or more weird stuff like
 identifying experts on topics contained in a particular document.
 
 With Solr, it seems I have to handcraft the term vectors to reflect the
 right weights, to approximate the effect of FAST docvectors, e.g., by
 normalizing them to [0...1). Processing performance would still be
 different from the classical FAST docvectors. The space consumption may
 become ugly for a 200+ GB range shard; however, FAST has also been quite
 generous with disk space, anyway.
 
 So, the interesting question is whether there is a more canonical way of
 handling this in Solr/Lucene, or if something like it is planned for 5.0+.
 
 Best regards,
 --Jürgen
 
 On 05.09.2014 16:02, Jack Krupansky wrote:
 For reference:

 “Item Similarity Vector Reference

 This property represents a similarity reference when searching for similar 
 items. This is a similarity vector representation that is returned
 for each item in the query result in the docvector managed property.

 The value is a string formatted according to the following format:

 [string1,weight1][string2,weight2]...[stringN,weightN]

 When performing a find similar query, the SimilarTo element should contain a 
 string parameter with the value of the docvector managed property
 of the item that is to be used as the similarity reference. The similarity 
 vector consists of a set of term,weight expressions, indicating
 the most important terms or concepts in the item and the corresponding 
 perceived importance (weight). Terms can be single words or phrases.

 The weight is a float value between 0 and 1, where 1 indicates the highest 
 relevance.

 The similarity vector is created during item processing and indicates the 
 most important terms or concepts in the item and the corresponding
  weight.”

 See:
 http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx

 -- Jack Krupansky
 

-- 
*
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*


FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Hello all,
  as the migration from FAST to Solr is a relevant topic for several of
our customers, there is one issue that does not seem to be addressed by
Lucene/Solr: document vectors FAST-style. These document vectors are
used to form metrics of similarity, i.e., they may be used as a
semantic fingerprint of documents to define similarity relations. I
can think of several ways of approximating a mapping of this mechanism
to Solr, but there are always drawbacks - mostly performance-wise.

Has anybody else encountered and possibly approached this challenge so far?

Is there anything in the roadmap of Solr that has not revealed itself to
me, addressing this issue?

Your input is greatly appreciated!

Cheers,
--Jürgen



Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread jim ferenczi
Hi,
Something like this?
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
And just to show some impressive search functionality of the wiki: ;)
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors

Cheers,
Jim
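
For readers who have not used the Term Vector Component: a request typically goes to a handler that has the component enabled (the Solr example configuration registers one at /tvrh) and switches on the tv.* parameters. The Python sketch below only builds such a URL and does not send it. The base URL, the document id and the field name "text" are assumptions, and the field has to be indexed with termVectors="true" for anything to come back.

```python
from urllib.parse import urlencode


def term_vector_url(base_url, doc_id, field):
    """Build a Term Vector Component request for a single document."""
    params = {
        "q": f"id:{doc_id}",   # select the document by its unique key
        "fl": "id",            # keep the regular result list small
        "tv": "true",          # enable the component
        "tv.fl": field,        # field to return term vectors for
        "tv.tf": "true",       # term frequency within the document
        "tv.df": "true",       # document frequency within the index
        "tv.tf_idf": "true",   # tf divided by df
        "wt": "json",
    }
    return f"{base_url}/tvrh?{urlencode(params)}"


# Hypothetical host, core name and document id.
url = term_vector_url("http://localhost:8983/solr/collection1", "42", "text")
print(url)
```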




Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Hello Jim,
  yes, I am aware of the TermVector and MoreLikeThis stuff. I am
presently mapping docvectors to these mechanisms and creating term vectors
myself from third-party text mining components.

However, it's not quite like the FAST docvectors. In particular, the
performance of MoreLikeThis queries based on TermVectors is suboptimal
on large document sets, so more efficient support for such retrievals
in the Lucene kernel would be preferred.

Cheers,
--Jürgen



-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
*i.A. Jürgen Wagner*
Head of Competence Center Intelligence
 Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de


Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071




Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jack Krupansky
For reference:

“Item Similarity Vector Reference

This property represents a similarity reference when searching for similar 
items. This is a similarity vector representation that is returned for each 
item in the query result in the docvector managed property.

The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain a 
string parameter with the value of the docvector managed property of the item 
that is to be used as the similarity reference. The similarity vector consists 
of a set of term,weight expressions, indicating the most important terms or 
concepts in the item and the corresponding perceived importance (weight). Terms 
can be single words or phrases.

The weight is a float value between 0 and 1, where 1 indicates the highest 
relevance.

The similarity vector is created during item processing and indicates the most 
important terms or concepts in the item and the corresponding weight.”

See:
http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx

-- Jack Krupansky
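
To make the quoted string format concrete, here is a small Python sketch that reads and writes the [string,weight] pairs. The function names are hypothetical, and the sketch assumes that terms contain neither commas nor square brackets, which the quoted documentation does not state.

```python
import re

# One [term,weight] pair; the term may be a phrase containing spaces.
_PAIR = re.compile(r"\[([^\[\],]+),([^\[\],]+)\]")


def parse_docvector(value):
    """Turn '[solr,0.9][term vector,0.5]' into a list of (term, weight) tuples."""
    return [(term.strip(), float(weight)) for term, weight in _PAIR.findall(value)]


def format_docvector(pairs):
    """Inverse of parse_docvector: serialize (term, weight) tuples."""
    return "".join(f"[{term},{weight}]" for term, weight in pairs)


pairs = parse_docvector("[solr,0.9][term vector,0.5]")
print(pairs)
print(format_docvector(pairs))
```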



Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Mikhail Khludnev
Jürgen,

I don't quite get it. Can you tell us more about this feature or point to the docs?
Thanks



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com


Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Thanks for posting this. I was just about to send off a message of
similar content :-)

Important to add:

- In FAST ESP, you could have more than one such docvector associated
with a document, in order to reflect different metrics.

- Term weights in docvectors are document-relative, not absolute.

- Processing is done in the search processor (close to the index), not
in the QR server (providing transformations on the result list).

This docvector could be used for unsupervised clustering,
related-to/similarity search, tag clouds or more weird stuff like
identifying experts on topics contained in a particular document.
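
One common way to turn such vectors into a similarity measure is the cosine between two of them. The Python sketch below treats a docvector as a plain dict of term weights; it illustrates the general technique and makes no claim about the metric FAST ESP actually used. The function name and the sample vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


doc_a = {"solr": 0.9, "vector": 0.5}
doc_b = {"solr": 0.4, "lucene": 0.8}
print(cosine_similarity(doc_a, doc_b))
```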

With Solr, it seems I have to handcraft the term vectors to reflect the
right weights, to approximate the effect of FAST docvectors, e.g., by
normalizing them to [0...1). Processing performance would still be
different from the classical FAST docvectors. The space consumption may
become ugly for a 200+ GB range shard; however, FAST has also been quite
generous with disk space, anyway.

So, the interesting question is whether there is a more canonical way of
handling this in Solr/Lucene, or if something like it is planned for 5.0+.

Best regards,
--Jürgen
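
The hand-crafted normalization to [0...1) mentioned above could look roughly like the following Python sketch. It scales each weight by the largest weight in the same document, which also makes the result document-relative. The function name, the eps parameter and the sample weights are assumptions for illustration only.

```python
def normalize_weights(raw, eps=1e-9):
    """Scale raw term weights of one document into the half-open range [0, 1).

    Dividing by slightly more than the largest weight keeps the top term
    strictly below 1; negative inputs are clamped to 0.
    """
    if not raw:
        return {}
    top = max(raw.values())
    if top <= 0:
        return {term: 0.0 for term in raw}
    scale = top * (1.0 + eps)
    return {term: max(weight, 0.0) / scale for term, weight in raw.items()}


vector = normalize_weights({"solr": 4.0, "fast": 2.0, "index": 0.0})
print(vector)
```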



Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jack Krupansky
Sounds like a great feature to add to Solr, especially if it would facilitate 
more automatic relevancy enhancement. LucidWorks Search has a feature called 
unsupervised feedback that does that, but something like a docvector might 
make it a more realistic default.


-- Jack Krupansky
