Sort Facet Values by "Interestingness"?

heuwing Wed, 03 Aug 2016 05:21:05 -0700

Dear everybody,

As the JSON-API now makes configuration of facets and sub-facets easier, there appears to be a lot of potential to enable instant calculation of facet-recommendations for a query, that is, to sort facets by their relative importance/interestingness/significance for a current query relative to the complete collection or relative to a result set defined by a different query.

An example would be to show the most typical terms which are used in descriptions of horror-movies, in contrast to the most popular ones for this query, as these may include terms that occur as often in other genres.


This feature has been discussed earlier in the context of Solr:
* http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity

* http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html

In Elasticsearch, the specific feature that I am looking for is called Significant Terms Aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation


As of now, I have two questions:

a) Are there workarounds in the current Solr implementation or known patches that implement such a sort-option for fields with a large number of possible values, e.g. text-fields? (For smaller vocabularies it is easy to do this client-side with two queries.)

b) Are there plans to implement this in facet.pivot or in the facet.json-API?
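
For small vocabularies, the two-query workaround mentioned in (a) might look roughly like the sketch below. It only builds the two request URLs and parses the flat facet list that Solr returns by default for JSON responses. The core URL, the field name `description_terms` and the sample response are made-up placeholders, and no request is actually sent.

```python
from urllib.parse import urlencode

# Hypothetical core URL, for illustration only.
SOLR_SELECT = "http://localhost:8983/solr/movies/select"


def facet_url(query, field):
    """Build a select URL that returns all facet counts of `field` for `query`."""
    params = {
        "q": query,
        "rows": 0,            # only facet counts are needed, no documents
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.limit": -1,    # every value; only viable for small vocabularies
        "facet.mincount": 1,
    }
    return SOLR_SELECT + "?" + urlencode(params)


def parse_facet_counts(response, field):
    """Turn the flat [term, count, term, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


# One query for the result set, one for the whole collection.
foreground_url = facet_url("genre:horror", "description_terms")
background_url = facet_url("*:*", "description_terms")

# Made-up response fragment in the shape of the default JSON facet output.
sample = {
    "facet_counts": {
        "facet_fields": {"description_terms": ["blood", 40, "love", 25]}
    }
}
counts = parse_facet_counts(sample, "description_terms")
```

The two resulting count dictionaries can then be compared client-side. With `facet.limit=-1` on a text field this quickly becomes too expensive, which is why a server-side sort-option would be preferable.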

The first step could be to define "interestingness" as a sort-option for facets and to define interestingness as facet-count in the result-set as compared to the complete collection: documentfrequency_termX(bucket) * inverse_documentfrequency_termX(collection)
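
As a rough client-side illustration of this score, the sketch below multiplies the document frequency of a term in the result set by its inverse document frequency in the collection. Taking the IDF as log(N / df) is only one common choice and an assumption here, and the function names and sample counts are invented for the example.

```python
import math


def interestingness(fg_counts, bg_counts, bg_total):
    """Score each term as df(term, result set) * idf(term, collection).

    fg_counts: term -> document frequency in the result set (bucket)
    bg_counts: term -> document frequency in the complete collection
    bg_total:  number of documents in the complete collection
    """
    scores = {}
    for term, fg_df in fg_counts.items():
        # A term seen in the result set occurs at least that often overall.
        bg_df = max(bg_counts.get(term, 0), fg_df)
        idf = math.log(bg_total / bg_df)  # assumed IDF variant
        scores[term] = fg_df * idf
    return scores


def top_terms(scores, n=5):
    """Return the n terms with the highest score."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Invented counts: "love" is the most frequent term in the result set,
# but it is also very common in the collection as a whole.
fg = {"blood": 40, "love": 60, "night": 30}
bg = {"blood": 50, "love": 800, "night": 300}

ranking = top_terms(interestingness(fg, bg, bg_total=1000))
by_popularity = sorted(fg, key=fg.get, reverse=True)
```

With these counts, sorting by popularity puts "love" first, while sorting by the score puts "blood" first and "love" last.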

As an extension, the JSON-API could be used to change the domain used as base for the comparison. Another interesting option would be to compare facet-counts against a current parent-facet for nested facets, e.g. the 5 most interesting terms by genre for a query on 70s movies, returning the terms specific to horror, comedy, action etc. compared to all terminology at the time (i.e. in the parent-query).

A call-back-function could be used to define other measures of interestingness such as the log-likelihood-ratio (http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). Most measures need at least the following 4 values: document-frequency for a term for the result-set, document-frequency for the result-set, document-frequency for a term in the index (or base-domain), document-frequency in the index (or base-domain).
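
To make the four values concrete, here is a minimal sketch of a log-likelihood-ratio (G²) score computed from them. It assumes that the result set is a subset of the base domain, so that the counts can be arranged in a 2x2 contingency table; the function and parameter names are invented for the example.

```python
import math


def llr(term_fg, size_fg, term_bg, size_bg):
    """Log-likelihood ratio (G^2) for one term.

    term_fg: documents in the result set containing the term
    size_fg: documents in the result set
    term_bg: documents in the base domain containing the term
    size_bg: documents in the base domain
    Assumes the result set is a subset of the base domain.
    """
    k11 = term_fg                    # term present, inside result set
    k12 = size_fg - term_fg          # term absent, inside result set
    k21 = term_bg - term_fg          # term present, outside result set
    k22 = (size_bg - size_fg) - k21  # term absent, outside result set
    observed = [[k11, k12], [k21, k22]]
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / size_bg
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Term occurs at the same rate inside and outside the result set.
neutral = llr(10, 100, 100, 1000)
# Term is concentrated in the result set.
distinctive = llr(40, 100, 50, 1000)
```

Note that this score is high for any strong deviation from the expected counts, so a term that is unusually rare in the result set also scores high. A sort-option would probably need to check the direction of the deviation as well.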

I guess this feature might be of interest for those who want to do some small-scale term-analysis in addition to search, e.g. as in my case in digital humanities projects. But it might also be an interesting navigation device, e.g. when searching on job-offers to show the skills that are most distinctive for a category.

It would be great to know if others are interested in this feature. If there are any implementations out there or if anybody else is working on this, a pointer would be a great start. In the absence of existing solutions: Perhaps somebody has some idea on where and how to start implementing this?


Best regards,

Ben
