Re: Sort Facet Values by "Interestingness"?

Ben Heuwing Wed, 03 Aug 2016 08:23:07 -0700

Hi Joel,

Thank you, this sounds great!

As to your first proposal: I am a bit out of my depth here, as I have not worked with streaming expressions so far. But I will try out your example using the facet() expression on a simple use case as soon as you publish it.

Using the TermsComponent directly, would that imply that I have to retrieve all possible candidates and then send them back as a terms.list to get their df? However, I assume that this would still be faster than having two repeated facet calls. Or did you suggest using the component in a customized RequestHandler?
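
A minimal sketch of the second half of that round trip, assuming the candidate terms have already been collected from a facet call. It only builds the request URL. The parameter names terms.list and terms.stats are taken from my reading of SOLR-9243 and the TermsComponent documentation, so please check them against your Solr version. The base URL, the collection name "movies", the field "description" and the presence of a /terms handler are all hypothetical.

```python
from urllib.parse import urlencode


def build_terms_request(base_url, field, candidates):
    """Build a TermsComponent URL asking for the docFreq of each candidate term.

    Assumes a /terms request handler is configured for the collection.
    """
    params = {
        "terms": "true",
        "terms.fl": field,                    # field to look the terms up in
        "terms.list": ",".join(candidates),   # candidate terms, comma-separated
        "terms.stats": "true",                # assumed to add numDocs to the response
        "wt": "json",
    }
    return f"{base_url}/terms?{urlencode(params)}"


# Hypothetical collection and field names, for illustration only.
url = build_terms_request(
    "http://localhost:8983/solr/movies", "description", ["zombie", "haunted"]
)
```

The docFreq values and numDocs in the response would then be combined with the facet counts on the client side.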


Regards,

Ben

On 03.08.2016 at 14:57, Joel Bernstein wrote:

Also the TermsComponent now can export the docFreq for a list of terms and
the numDocs for the index. This can be used as a general purpose mechanism
for scoring facets with a callback.

https://issues.apache.org/jira/browse/SOLR-9243

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 8:52 AM, Joel Bernstein <joels...@gmail.com> wrote:

What you're describing is implemented with Graph aggregations in this
ticket using tf-idf. Other scoring methods can be implemented as well.

https://issues.apache.org/jira/browse/SOLR-9193

I'll update this thread with a description of how this can be used with
the facet() streaming expression as well as with graph queries later today.



Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 8:18 AM, <heuw...@uni-hildesheim.de> wrote:

Dear everybody,

as the JSON-API now makes configuration of facets and sub-facets easier,
there appears to be a lot of potential to enable instant calculation of
facet-recommendations for a query, that is, to sort facets by their
relative importance/interestingness/significance for a current query relative
to the complete collection or relative to a result set defined by a
different query.

An example would be to show the most typical terms which are used in
descriptions of horror-movies, in contrast to the most popular ones for
this query, as these may include terms that occur as often in other genres.

This feature has been discussed earlier in the context of Solr:
* http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
* http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html

In Elasticsearch, the specific feature that I am looking for is called
Significant Terms Aggregation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation

As of now, I have two questions:

a) Are there workarounds in the current Solr implementation or known
patches that implement such a sort option for fields with a large number of
possible values, e.g. text fields? (For smaller vocabularies it is easy to
do this client-side with two queries.)
b) Are there plans to implement this in facet.pivot or in the JSON Facet
API?

The first step could be to define "interestingness" as a sort-option for
facets and to define interestingness as facet-count in the result-set as
compared to the complete collection: documentfrequency_termX(bucket) *
inverse_documentfrequency_termX(collection)
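
As a rough client-side sketch of this definition: the two dictionaries stand for the facet counts of the result set and of the complete collection, as obtained from two separate queries. The plain log(N / df) form of the idf is only one of several common variants and is my assumption here, as are all function names and the sample numbers.

```python
import math


def interestingness(df_bucket, df_collection, num_docs):
    """Document frequency in the bucket times idf in the collection.

    Uses idf = ln(num_docs / df_collection); other idf variants are possible.
    """
    return df_bucket * math.log(num_docs / df_collection)


def rank_terms(bucket_counts, collection_counts, num_docs, limit=10):
    """Return the terms of the bucket, most interesting first."""
    scored = {
        term: interestingness(df, collection_counts[term], num_docs)
        for term, df in bucket_counts.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:limit]


# Made-up counts: "night" is frequent everywhere, "zombie" mostly in the bucket.
bucket = {"night": 80, "zombie": 40}
collection = {"night": 5000, "zombie": 50}
ranking = rank_terms(bucket, collection, num_docs=10000)
```

With these made-up counts, "zombie" ranks ahead of "night" although it has the lower facet count, which is the intended effect.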

As an extension, the JSON-API could be used to change the domain used as
base for the comparison. Another interesting option would be to compare
facet-counts against a current parent-facet for nested facets, e.g. the 5
most interesting terms by genre for a query on 70s movies, returning the
terms specific to horror, comedy, action etc. compared to all terminology
at the time (i.e. in the parent-query).

A call-back-function could be used to define other measures of
interestingness such as the log-likelihood-ratio (
http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). Most
measures need at least the following 4 values: document-frequency for a
term for the result-set, document-frequency for the result-set,
document-frequency for a term in the index (or base-domain),
document-frequency in the index (or base-domain).
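
To make the four values concrete, here is a sketch of such a callback for the log-likelihood ratio, computed as the G² statistic over the 2x2 table of observed document counts. This is my own reading of the measure described in the linked post, not code from Solr or from that post, and the function and argument names are hypothetical.

```python
import math


def log_likelihood_ratio(df_term_result, df_result, df_term_index, df_index):
    """G^2 statistic for one term: result set versus the rest of the index."""
    # Observed 2x2 table: rows = term present / absent,
    # columns = document inside / outside the result set.
    k11 = df_term_result
    k12 = df_term_index - df_term_result
    k21 = df_result - df_term_result
    k22 = df_index - df_result - k12
    if min(k11, k12, k21, k22) < 0:
        raise ValueError("inconsistent document frequencies")

    observed = [[k11, k12], [k21, k22]]
    rows = [k11 + k12, k21 + k22]
    cols = [k11 + k21, k12 + k22]

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / df_index
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

Note that the statistic itself does not distinguish between terms that are over-represented and terms that are under-represented in the result set, so a sort would additionally have to compare the observed count with the expected count.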

I guess this feature might be of interest for those who want to do some
small-scale term-analysis in addition to search, e.g. as in my case in
digital humanities projects. But it might also be an interesting navigation
device, e.g. when searching on job-offers to show the skills that are most
distinctive for a category.

It would be great to know whether others are interested in this feature. If
there are any implementations out there or if anybody else is working on
this, a pointer would be a great start. In the absence of existing
solutions: perhaps somebody has an idea of where and how to start
implementing this?

Best regards,

Ben


--

Ben Heuwing, Dr. phil.
Research Associate
Institut für Informationswissenschaft und Sprachtechnologie
Universität Hildesheim

Postal address:
Universitätsplatz 1
D-31141 Hildesheim


Office:
Lübeckerstraße 3
Room L017

+49(0)5121 883-30316
heuw...@uni-hildesheim.de

Homepage <https://www.uni-hildesheim.de/fb3/institute/iwist/mitglieder/heuwing/>

Dissertation published: /Usability-Ergebnisse als Wissensressource in Organisationen/ - Print <http://www.vwh-verlag.de/vwh/?p=995> | Online <http://nbn-resolving.de/urn:nbn:de:gbv:hil2-opus4-3914>
