Re: FAST-like document vector data structures in Solr?
Some further details off the top of my head:

- It is a stream-based feature.
- IDF estimates get updated and refined as more and more documents pass through.
- It is actually IDF weighting with stopwords and boosting:
  -- stopwords should be ignored and not get vectorized
  -- boosting should give some boost to vectors

There are some further configuration parameters:

- nmin: minimum number of occurrences
- type (of IDF weighting): flat, linear, logarithmic
  - flat: gives IDF the value 0 if the number of occurrences of the string in the document is less than nmin, else 1.
  - linear: interpolates linearly between 0 and 1; returns 0 if the number of occurrences is below nmin, otherwise (1 - (# of docs with string found / # of docs passed through)).
  - logarithmic: uses the natural logarithm and weights rarity more heavily; returns 0 if the number of occurrences is below nmin, otherwise exponential_log(# of docs passed through / # of docs with string found).

I think logarithmic was the default (as far as I can remember).

A question while thinking about this feature: is it possible with Solr/Lucene to access the IDF for strings from the index while processing new documents?

-- Bernd

On 05.09.2014 16:35, Jack Krupansky wrote:
Sounds like a great feature to add to Solr, especially if it would facilitate more automatic relevancy enhancement. LucidWorks Search has a feature called unsupervised feedback that does that but something like a docvector might make it a more realistic default.
-- Jack Krupansky

--
Bernd Fehling, Dipl.-Inform. (FH)
Bielefeld University Library
LibTec - Library Technology and Knowledge Management
Universitätsstr. 25, 33615 Bielefeld
Tel. +49 521 106-4060
bernd.fehling(at)uni-bielefeld.de
BASE - Bielefeld Academic Search Engine - www.base-search.net
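The three weighting types described in this message can be sketched roughly as follows. This is only an illustration of the description given from memory, not FAST's actual implementation: the class and method names are made up, nmin is interpreted as a per-document occurrence threshold, the logarithmic variant is taken to be the plain natural logarithm of (docs passed through / docs with string found), and boosting is left out entirely.

```python
import math
from collections import Counter


class StreamingIdf:
    """Running IDF estimate that is refined as documents pass through.

    Hypothetical helper for illustration; not a FAST or Solr API.
    """

    def __init__(self, kind="logarithmic", nmin=1, stopwords=()):
        if kind not in ("flat", "linear", "logarithmic"):
            raise ValueError(f"unknown weighting type: {kind}")
        self.kind = kind
        self.nmin = nmin
        self.stopwords = frozenset(stopwords)
        self.docs_seen = 0
        self.doc_freq = Counter()  # term -> number of documents containing it

    def observe(self, tokens):
        """Update the statistics with one document and return its term counts."""
        # Stopwords are dropped up front, so they never get vectorized.
        counts = Counter(t for t in tokens if t not in self.stopwords)
        self.docs_seen += 1
        self.doc_freq.update(counts.keys())
        return counts

    def weight(self, term, occurrences):
        """IDF weight for a term that occurs `occurrences` times in a document."""
        if occurrences < self.nmin or self.doc_freq[term] == 0:
            return 0.0
        if self.kind == "flat":
            return 1.0
        df, n = self.doc_freq[term], self.docs_seen
        if self.kind == "linear":
            return 1.0 - df / n
        # Assumption: "logarithmic" means ln(n / df).
        return math.log(n / df)


idf = StreamingIdf(kind="linear", stopwords={"the"})
for text in ["the solr index", "the lucene index", "the fast docvector"]:
    idf.observe(text.split())
print(idf.weight("index", 1), idf.weight("solr", 1), idf.weight("the", 1))
```

Because the counters only ever grow, weights computed early in the stream are based on fewer documents than later ones, which matches the "updated and refined" behaviour described above.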
FAST-like document vector data structures in Solr?
Hello all,

as the migration from FAST to Solr is a relevant topic for several of our customers, there is one issue that does not seem to be addressed by Lucene/Solr: document vectors FAST-style. These document vectors are used to form metrics of similarity, i.e., they may be used as a semantic fingerprint of documents to define similarity relations.

I can think of several ways of approximating a mapping of this mechanism to Solr, but there are always drawbacks - mostly performance-wise. Has anybody else encountered and possibly approached this challenge so far? Is there anything in the roadmap of Solr that has not revealed itself to me, addressing this issue?

Your input is greatly appreciated!

Cheers,
--Jürgen
Re: FAST-like document vector data structures in Solr?
Hi,

Something like this?
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component

And just to show some impressive search functionality of the wiki: ;)
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors

Cheers,
Jim

2014-09-05 9:44 GMT+02:00 Jürgen Wagner (DVT) <juergen.wag...@devoteam.com>:
Hello all, as the migration from FAST to Solr is a relevant topic for several of our customers, there is one issue that does not seem to be addressed by Lucene/Solr: document vectors FAST-style. [...]
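For readers who have not used the Term Vector Component linked above, a request for the per-term statistics of one document might be built roughly like this. It is a sketch under assumptions: the host, the collection name `collection1`, the document id, and the field name `text` are placeholders, and the `/tvrh` handler path is the one used in Solr's example configuration, so it may differ in a given solrconfig.xml. The field also has to be indexed with termVectors="true" for the component to return anything.

```python
from urllib.parse import urlencode


def term_vector_url(base_url, doc_id, field):
    """Build a Term Vector Component request for one document and one field."""
    params = {
        "q": f"id:{doc_id}",
        "fl": "id",
        "tv": "true",
        "tv.fl": field,       # field whose term vector is returned
        "tv.tf": "true",      # term frequency within the document
        "tv.df": "true",      # document frequency in the index
        "tv.tf_idf": "true",  # tf-idf value per term, as computed by the component
        "wt": "json",
    }
    return f"{base_url}/tvrh?{urlencode(params)}"


# Placeholder host, collection, id, and field; adjust to the actual setup.
print(term_vector_url("http://localhost:8983/solr/collection1", "doc42", "text"))
```

The response contains raw index statistics per term, so any FAST-style weighting on a 0..1 scale would still have to be computed on the client side.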
Re: FAST-like document vector data structures in Solr?
Hello Jim,

yes, I am aware of the TermVector and MoreLikeThis stuff. I am presently mapping docvectors to these mechanisms and creating term vectors myself from third-party text mining components. However, it's not quite like the FAST docvectors. In particular, the performance of MoreLikeThis queries based on TermVectors is suboptimal on large document sets, so more efficient support for such retrievals in the Lucene kernel would be preferred.

Cheers,
--Jürgen

On 05.09.2014 10:55, jim ferenczi wrote:
Hi, Something like this? https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component [...]

--
Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
i.A. Jürgen Wagner
Head of Competence Center Intelligence
Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de

Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
Re: FAST-like document vector data structures in Solr?
For reference:

“Item Similarity Vector Reference

This property represents a similarity reference when searching for similar items. This is a similarity vector representation that is returned for each item in the query result in the docvector managed property. The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain a string parameter with the value of the docvector managed property of the item that is to be used as the similarity reference.

The similarity vector consists of a set of term,weight expressions, indicating the most important terms or concepts in the item and the corresponding perceived importance (weight). Terms can be single words or phrases. The weight is a float value between 0 and 1, where 1 indicates the highest relevance.

The similarity vector is created during item processing and indicates the most important terms or concepts in the item and the corresponding weight.”

See: http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx

-- Jack Krupansky

From: Jürgen Wagner (DVT)
Sent: Friday, September 5, 2014 7:03 AM
To: solr-user@lucene.apache.org
Subject: Re: FAST-like document vector data structures in Solr?

Hello Jim, yes, I am aware of the TermVector and MoreLikeThis stuff. [...]
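The bracketed docvector string format quoted above is simple enough to parse with a regular expression. The following is a minimal sketch, assuming that terms never contain square brackets and that the last comma inside each bracket pair separates the term from its weight; the quoted reference does not spell out any escaping rules, so real docvector values may need more care. The function names are made up for illustration.

```python
import re

# One "[term,weight]" pair; the term may contain spaces and commas (phrases),
# the weight is whatever follows the last comma inside the brackets.
_PAIR = re.compile(r"\[([^\[\]]*),([^,\[\]]*)\]")


def parse_docvector(value):
    """Parse '[term1,w1][term2,w2]...' into a list of (term, weight) tuples."""
    return [(term.strip(), float(weight)) for term, weight in _PAIR.findall(value)]


def format_docvector(pairs):
    """Serialize (term, weight) tuples back into the bracketed string form."""
    return "".join(f"[{term},{weight}]" for term, weight in pairs)


vector = parse_docvector("[search engine,0.9][solr,0.5][index,0.25]")
print(vector)
print(format_docvector(vector))
```

Keeping the pairs in a list rather than a dict preserves the order in which the terms appear in the string.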
Re: FAST-like document vector data structures in Solr?
Jürgen,

I don't quite follow. Can you tell us more about this feature or point to the docs?

Thanks

On Fri, Sep 5, 2014 at 11:44 AM, Jürgen Wagner (DVT) <juergen.wag...@devoteam.com> wrote:
Hello all, as the migration from FAST to Solr is a relevant topic for several of our customers, there is one issue that does not seem to be addressed by Lucene/Solr: document vectors FAST-style. [...]

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: FAST-like document vector data structures in Solr?
Thanks for posting this. I was just about to send off a message of similar content :-)

Important to add:

- In FAST ESP, you could have more than one such docvector associated with a document, in order to reflect different metrics.
- Term weights in docvectors are document-relative, not absolute.
- Processing is done in the search processor (close to the index), not in the QR server (providing transformations on the result list).

This docvector could be used for unsupervised clustering, related-to/similarity search, tag clouds, or weirder stuff like identifying experts on topics contained in a particular document.

With Solr, it seems I have to handcraft the term vectors to reflect the right weights in order to approximate the effect of FAST docvectors, e.g., by normalizing them to [0...1). Processing performance would still be different from the classical FAST docvectors. The space consumption may become ugly for a shard in the 200+ GB range; however, FAST has also been quite generous with disk space anyway.

So, the interesting question is whether there is a more canonical way of handling this in Solr/Lucene, or whether something like it is planned for 5.0+.

Best regards,
--Jürgen

On 05.09.2014 16:02, Jack Krupansky wrote:
For reference: “Item Similarity Vector Reference This property represents a similarity reference when searching for similar items. [...]” See: http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx
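To make the "document-relative weights normalized to [0...1)" idea from this message concrete, here is a small sketch of one way such handcrafted vectors could be scaled and compared. It is an assumption-laden illustration, not how FAST computes docvectors: weights are scaled by the largest weight in the same document (with a tiny margin so that values stay strictly below 1), and similarity is measured with plain cosine similarity. The function names and the `eps` parameter are hypothetical.

```python
import math


def normalize(weights, eps=1e-9):
    """Scale raw term weights relative to this document into [0, 1)."""
    if not weights:
        return {}
    top = max(weights.values())
    if top <= 0:
        return {term: 0.0 for term in weights}
    # Dividing by slightly more than the maximum keeps every value below 1.
    scale = top * (1.0 + eps)
    return {term: max(w, 0.0) / scale for term, w in weights.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc1 = normalize({"solr": 3.0, "docvector": 1.5, "index": 0.5})
doc2 = normalize({"solr": 2.0, "lucene": 2.0})
print(cosine(doc1, doc2))
```

Since cosine similarity ignores the overall length of each vector, the per-document scaling changes the stored weights but not the resulting similarity ranking.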
Re: FAST-like document vector data structures in Solr?
Sounds like a great feature to add to Solr, especially if it would facilitate more automatic relevancy enhancement. LucidWorks Search has a feature called unsupervised feedback that does that, but something like a docvector might make it a more realistic default.

-- Jack Krupansky

-----Original Message-----
From: Jürgen Wagner (DVT)
Sent: Friday, September 5, 2014 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: FAST-like document vector data structures in Solr?

Thanks for posting this. I was just about to send off a message of similar content :-) [...]