Some further details from memory:
- it is a stream-based feature
- IDF estimates get updated and refined as more and more documents pass through
- it is actually IDF weighting with stopwords and boosting
-- stopwords should be ignored and not get vectorized
-- boosting should give some boost to vectors

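A minimal Python sketch of such a stream-based estimate, combined with the nmin threshold and the three weighting types listed under the configuration parameters, might look like this. It is not FAST code: the class name StreamingIdf, its parameters, and the treatment of boosting as a per-term multiplier are assumptions made purely for illustration, and exponential_log is read here as the natural logarithm.

```python
import math
from collections import Counter


class StreamingIdf:
    """Running IDF estimate that is refined as documents pass through.

    Hypothetical illustration only; names and the boost model are assumptions.
    """

    def __init__(self, nmin=1, kind="logarithmic", stopwords=(), boosts=None):
        if kind not in ("flat", "linear", "logarithmic"):
            raise ValueError("unknown weighting type: " + kind)
        self.nmin = nmin                    # minimum occurrences in the document
        self.kind = kind                    # flat, linear or logarithmic
        self.stopwords = set(stopwords)     # never vectorized
        self.boosts = dict(boosts or {})    # assumed: per-term multiplier
        self.docs_seen = 0                  # number of docs passed through
        self.doc_freq = Counter()           # number of docs with string found

    def _idf(self, term, occurrences):
        # All three types return 0 below the nmin threshold.
        if occurrences < self.nmin:
            return 0.0
        if self.kind == "flat":
            return 1.0
        df = self.doc_freq[term]
        if self.kind == "linear":
            return 1.0 - df / self.docs_seen
        # logarithmic: natural log, weights rarity more heavily
        return math.log(self.docs_seen / df)

    def process(self, tokens):
        """Update the estimates with one document and return its vector."""
        counts = Counter(t for t in tokens if t not in self.stopwords)
        # Counters are updated first, so df >= 1 for every term weighted below.
        self.docs_seen += 1
        self.doc_freq.update(counts.keys())
        return {
            term: self._idf(term, n) * self.boosts.get(term, 1.0)
            for term, n in counts.items()
        }


estimator = StreamingIdf(kind="linear", stopwords={"the"})
first_vector = estimator.process(["the", "solr", "index"])
second_vector = estimator.process(["the", "lucene", "index"])
```

Because the counters are updated before weighting, the weights of early documents are based on very few observations; they only become meaningful once enough documents have passed through.
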
There are some further configuration parameters.

nmin - minimum number of occurrences
type (of IDF weighting) - flat, linear, logarithmic
- flat: gives IDF the value 0 if the number of occurrences of the string in the
  document is less than nmin, else 1.
- linear: interpolates linearly between 0 and 1; returns 0 if the number of
  occurrences is below nmin, else
  (1 - (# of docs with string found / # of docs passed through))
- logarithmic: uses the natural logarithm and weights rarity more heavily;
  returns 0 if the number of occurrences is below nmin, else
  exponential_log(# of docs passed through / # of docs with string found)

I think logarithmic was the default (as far as I can remember).

A question while thinking about this feature: is it possible with Solr/Lucene
to have access to the IDF for strings from the index while processing new documents?

--
Bernd

On 05.09.2014 16:35, Jack Krupansky wrote:
> Sounds like a great feature to add to Solr, especially if it would facilitate
> more automatic relevancy enhancement. LucidWorks Search has a
> feature called "unsupervised feedback" that does that but something like a
> docvector might make it a more realistic default.
>
> -- Jack Krupansky
>
> -----Original Message----- From: "Jürgen Wagner (DVT)"
> Sent: Friday, September 5, 2014 10:29 AM
> To: solr-user@lucene.apache.org
> Subject: Re: FAST-like document vector data structures in Solr?
>
> Thanks for posting this. I was just about to send off a message of
> similar content :-)
>
> Important to add:
>
> - In FAST ESP, you could have more than one such docvector associated
> with a document, in order to reflect different metrics.
>
> - Term weights in docvectors are document-relative, not absolute.
>
> - Processing is done in the search processor (close to the index), not
> in the QR server (providing transformations on the result list).
>
> This docvector could be used for unsupervised clustering,
> related-to/similarity search, tag clouds or more weird stuff like
> identifying experts on topics contained in a particular document.
>
> With Solr, it seems I have to handcraft the term vectors to reflect the
> right weights, to approximate the effect of FAST docvectors, e.g., by
> normalizing them to [0...10000). Processing performance would still be
> different from the classical FAST docvectors. The space consumption may
> become ugly for a 200+ GB range shard, however, FAST has also been quite
> generous with disk space, anyway.
>
> So, the interesting question is whether there is a more canonical way of
> handling this in Solr/Lucene, or if something the like is planned for 5.0+.
>
> Best regards,
> --Jürgen
>
> On 05.09.2014 16:02, Jack Krupansky wrote:
>> For reference:
>>
>> “Item Similarity Vector Reference
>>
>> This property represents a similarity reference when searching for similar
>> items. This is a similarity vector representation that is returned
>> for each item in the query result in the docvector managed property.
>>
>> The value is a string formatted according to the following format:
>>
>> [string1,weight1][string2,weight2]...[stringN,weightN]
>>
>> When performing a find similar query, the SimilarTo element should contain a
>> string parameter with the value of the docvector managed property
>> of the item that is to be used as the similarity reference. The similarity
>> vector consists of a set of "term,weight" expressions, indicating
>> the most important terms or concepts in the item and the corresponding
>> perceived importance (weight). Terms can be single words or phrases.
>>
>> The weight is a float value between 0 and 1, where 1 indicates the highest
>> relevance.
>>
>> The similarity vector is created during item processing and indicates the
>> most important terms or concepts in the item and the corresponding
>> weight.”
>>
>> See:
>> http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx
>>
>> -- Jack Krupansky
>

--
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************