Dear everybody,

As the JSON-API now makes it easier to configure facets and sub-facets, there appears to be a lot of potential for instant calculation of facet recommendations for a query, that is, for sorting facets by their relative importance/interestingness/significance for the current query, relative either to the complete collection or to a result set defined by a different query.

An example would be to show the most typical terms which are used in descriptions of horror-movies, in contrast to the most popular ones for this query, as these may include terms that occur as often in other genres.

This feature has been discussed earlier in the context of Solr:
* http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
* http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html

In Elasticsearch, the specific feature that I am looking for is called Significant Terms Aggregation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation

As of now, I have two questions:

a) Are there workarounds in the current Solr implementation, or known patches, that implement such a sort option for fields with a large number of possible values, e.g. text fields? (For smaller vocabularies it is easy to do this client-side with two queries.)

b) Are there plans to implement this in facet.pivot or in the facet.json-API?

A first step could be to offer "interestingness" as a sort option for facets, defined as the facet count in the result set weighed against the complete collection:

    documentfrequency_termX(bucket) * inverse_documentfrequency_termX(collection)
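To make the formula above concrete, here is a minimal client-side sketch in Python. It is only an illustration, not an existing Solr feature: the function names are made up, the inverse document frequency is assumed to be log(N / df), and the facet counts are assumed to have been fetched already (one facet request for the query, one for the whole collection).

```python
import math


def interestingness(df_bucket, df_collection, n_collection):
    """Score = document frequency in the bucket * IDF in the collection.

    Assumption: IDF is log(N / df); other IDF variants would work as well.
    """
    if df_collection <= 0 or n_collection <= 0:
        return 0.0
    return df_bucket * math.log(n_collection / df_collection)


def rank_terms(bucket_counts, collection_counts, n_collection, limit=5):
    """Sort the facet terms of a result set by interestingness, best first."""
    scored = [
        (term, interestingness(count, collection_counts.get(term, 0), n_collection))
        for term, count in bucket_counts.items()
    ]
    # Highest score first; ties are broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:limit]


# Hypothetical facet counts: terms in horror-movie descriptions vs. all movies.
horror_counts = {"blood": 40, "the": 95, "love": 30}
all_counts = {"blood": 50, "the": 990, "love": 400}

ranking = rank_terms(horror_counts, all_counts, n_collection=1000)
print([term for term, _ in ranking])
```

With these made-up numbers "the" is the most frequent term in the bucket but ends up last, because it is nearly as frequent everywhere else. This is also why the approach does not scale client-side for text fields: the collection-wide counts are needed for every candidate term.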

As an extension, the JSON-API could be used to change the domain used as base for the comparison. Another interesting option would be to compare facet-counts against a current parent-facet for nested facets, e.g. the 5 most interesting terms by genre for a query on 70s movies, returning the terms specific to horror, comedy, action etc. compared to all terminology at the time (i.e. in the parent-query).

A callback function could be used to define other measures of interestingness, such as the log-likelihood ratio (http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). Most measures need at least the following four values: the document frequency of the term in the result set, the number of documents in the result set, the document frequency of the term in the index (or base domain), and the number of documents in the index (or base domain).
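As a sketch of what such a callback might compute from the four values, here is the log-likelihood ratio in Python, using the entropy-based formulation of the G² statistic. The function and parameter names are hypothetical, and the code assumes that the result set is a subset of the base domain, so that the background counts can be obtained by subtraction.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalised entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(df_term_result, n_result, df_term_index, n_index):
    """G^2 score for a term from the four document frequencies.

    Assumption: the result set is contained in the index (base domain).
    """
    k11 = df_term_result                 # result-set docs containing the term
    k12 = n_result - df_term_result      # result-set docs without the term
    k21 = df_term_index - df_term_result # other docs containing the term
    k22 = n_index - n_result - k21       # other docs without the term
    if min(k11, k12, k21, k22) < 0:
        raise ValueError("inconsistent document frequencies")
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


strong = log_likelihood_ratio(40, 100, 50, 1000)
weak = log_likelihood_ratio(10, 100, 50, 1000)
neutral = log_likelihood_ratio(10, 100, 100, 1000)
print(strong > weak > neutral)
```

Note that the score is two-sided: a term that is unusually rare in the result set also scores high, so a real sort option would probably have to check the direction of the deviation as well.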

I guess this feature might be of interest to those who want to do some small-scale term analysis in addition to search, e.g. as in my case in digital humanities projects. But it might also be an interesting navigation device, e.g. showing the skills that are most distinctive for a category when searching job offers.

It would be great to know whether others are interested in this feature. If there are any implementations out there, or if anybody else is working on this, a pointer would be a great start. In the absence of existing solutions: perhaps somebody has an idea of where and how to start implementing this?

Best regards,

Ben

