Dear everybody,
As the JSON-API now makes configuration of facets and sub-facets easier,
there appears to be a lot of potential for instant calculation of
facet recommendations for a query, that is, for sorting facets by their
importance/interestingness/significance for the current query
relative to the complete collection or relative to a result set defined
by a different query.
An example would be to show the most typical terms which are used in
descriptions of horror-movies, in contrast to the most popular ones for
this query, as these may include terms that occur as often in other genres.
This feature has been discussed earlier in the context of Solr:
http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html
In Elasticsearch, the specific feature that I am looking for is called
Significant Terms Aggregation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation
As of now, I have two questions:
a) Are there workarounds in the current Solr implementation, or known
patches, that implement such a sort option for fields with a large number
of possible values, e.g. text fields? (For smaller vocabularies it is
easy to do this client-side with two queries.)
b) Are there plans to implement this in facet.pivot or in the
facet.json-API?
A first step could be to add "interestingness" as a sort option for
facets and to define it as the facet count of a term in the result set
weighted against its frequency in the complete collection:
documentfrequency_termX(bucket) * inverse_documentfrequency_termX(collection)
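
To make the proposed score concrete, here is a minimal client-side sketch
in Python. It assumes the facet counts for the result set and for the
complete collection have already been fetched (e.g. with two queries) and
are available as plain dicts. The function name, its parameters, and the
choice of log(N / df) as the inverse document frequency are my own
assumptions for illustration, not anything Solr provides.

```python
import math


def rank_by_interestingness(bucket_counts, collection_counts,
                            collection_size, top_n=5):
    """Rank terms by df(term, bucket) * idf(term, collection).

    bucket_counts:     term -> facet count within the result set
    collection_counts: term -> facet count within the whole collection
    collection_size:   number of documents in the whole collection
    All names are hypothetical; idf is taken as log(N / df), which is
    only one of several common variants.
    """
    scores = {}
    for term, df_bucket in bucket_counts.items():
        df_collection = collection_counts.get(term, 0)
        if df_collection == 0:
            # No collection count available, so idf is undefined: skip.
            continue
        idf = math.log(collection_size / df_collection)
        scores[term] = df_bucket * idf
    # Highest score first, truncated to the requested number of terms.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]
```

With this weighting, a term that occurs in every document of the
collection gets a score of zero regardless of how often it occurs in the
result set, which is the intended contrast to sorting by plain count.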
As an extension, the JSON-API could be used to change the domain used as
base for the comparison. Another interesting option would be to compare
facet-counts against a current parent-facet for nested facets, e.g. the
5 most interesting terms by genre for a query on 70s movies, returning
the terms specific to horror, comedy, action etc. compared to all
terminology at the time (i.e. in the parent-query).
A call-back-function could be used to define other measures of
interestingness such as the log-likelihood-ratio
(http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html).
Most measures need at least the following four values: the document
frequency of a term in the result set, the number of documents in the
result set, the document frequency of the term in the index (or base
domain), and the number of documents in the index (or base domain).
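
As an illustration of how such a measure could be computed from these
four values, here is a rough Python sketch of the log-likelihood ratio in
its standard G^2 form over a 2x2 contingency table. It assumes that the
result set is a subset of the base domain; the function and parameter
names are hypothetical, and this is only a sketch of one possible
call-back, not an existing Solr function.

```python
import math


def log_likelihood_ratio(df_term_result, result_size,
                         df_term_base, base_size):
    """G^2 statistic for a term, computed from four document counts.

    Assumes the result set is contained in the base domain, so the
    counts outside the result set can be obtained by subtraction.
    """
    # 2x2 contingency table: rows = term present / absent,
    # columns = inside / outside the result set.
    k11 = df_term_result
    k12 = df_term_base - df_term_result
    k21 = result_size - df_term_result
    k22 = base_size - result_size - k12
    if min(k11, k12, k21, k22) < 0:
        raise ValueError("counts are inconsistent with a subset relation")

    cells = [[k11, k12], [k21, k22]]
    row_sums = [k11 + k12, k21 + k22]
    col_sums = [k11 + k21, k12 + k22]
    total = base_size

    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = cells[i][j]
            if observed > 0:  # empty cells contribute nothing
                expected = row_sums[i] * col_sums[j] / total
                g += observed * math.log(observed / expected)
    return 2.0 * g
```

Note that this statistic is large for terms that are either over- or
under-represented in the result set, so a sort option built on it would
probably also need to check the direction of the deviation.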
I guess this feature might be of interest to those who want to do some
small-scale term analysis in addition to search, e.g. as in my case in
digital humanities projects. But it might also be an interesting
navigation device, e.g. when searching job offers, to show the skills
that are most distinctive for a category.
It would be great to know whether others are interested in this feature.
If there are any implementations out there, or if anybody else is working
on this, a pointer would be a great start. In the absence of existing
solutions: perhaps somebody has an idea of where and how to start
implementing this?
Best regards,
Ben