Re: Sort Facet Values by "Interestingness"?

Joel Bernstein Wed, 03 Aug 2016 05:58:04 -0700

Also, the TermsComponent can now export the docFreq for a list of terms and
the numDocs for the index. This can be used as a general-purpose mechanism
for scoring facets with a callback.


https://issues.apache.org/jira/browse/SOLR-9243
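
As a rough illustration of the callback idea, here is a minimal client-side
sketch in Python. It assumes the facet counts for the result set and the
docFreq/numDocs values have already been fetched from Solr; the function
names, the sample numbers and the log-based IDF are assumptions made for
this example, not part of either ticket.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical scorer signature: (facet count in result set, docFreq, numDocs) -> score
Scorer = Callable[[int, int, int], float]


def tf_idf(bucket_count: int, doc_freq: int, num_docs: int) -> float:
    """Facet count weighted by a log inverse document frequency (assumed IDF form)."""
    return bucket_count * math.log(num_docs / doc_freq)


def rank_facets(
    bucket_counts: Dict[str, int],
    doc_freqs: Dict[str, int],
    num_docs: int,
    score: Scorer = tf_idf,
) -> List[Tuple[str, float]]:
    """Score every facet term with the callback and sort by descending score."""
    scored = [
        (term, score(count, doc_freqs[term], num_docs))
        for term, count in bucket_counts.items()
        if doc_freqs.get(term, 0) > 0  # skip terms without index statistics
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up numbers for illustration only.
bucket_counts = {"blood": 40, "the": 95, "haunted": 20}   # counts in the result set
doc_freqs = {"blood": 200, "the": 9900, "haunted": 60}    # docFreq in the whole index
num_docs = 10000                                          # numDocs for the index

ranked = rank_facets(bucket_counts, doc_freqs, num_docs)
for term, value in ranked:
    print(f"{term}\t{value:.2f}")
```

With these numbers a very frequent term such as "the" drops to the bottom
even though it has the highest raw count. Any other measure that fits the
same three-argument signature can be passed in as the score argument.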

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 8:52 AM, Joel Bernstein <joels...@gmail.com> wrote:

> What you're describing is implemented with Graph aggregations in this
> ticket using tf-idf. Other scoring methods can be implemented as well.
>
> https://issues.apache.org/jira/browse/SOLR-9193
>
> I'll update this thread with a description of how this can be used with
> the facet() streaming expression as well as with graph queries later today.
>
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Aug 3, 2016 at 8:18 AM, <heuw...@uni-hildesheim.de> wrote:
>
>> Dear everybody,
>>
>> as the JSON-API now makes configuration of facets and sub-facets easier,
>> there appears to be a lot of potential to enable instant calculation of
>> facet recommendations for a query, that is, to sort facets by their
>> relative importance/interestingness/significance for the current query,
>> compared either to the complete collection or to a result set defined by
>> a different query.
>>
>> An example would be to show the most typical terms which are used in
>> descriptions of horror-movies, in contrast to the most popular ones for
>> this query, as these may include terms that occur as often in other genres.
>>
>> This feature has been discussed earlier in the context of Solr:
>> * http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
>> * http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html
>>
>> In Elasticsearch, the specific feature that I am looking for is called
>> Significant Terms Aggregation:
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation
>>
>> As of now, I have two questions:
>>
>> a) Are there workarounds in the current Solr implementation, or known
>> patches, that implement such a sort option for fields with a large number
>> of possible values, e.g. text fields? (For smaller vocabularies it is easy
>> to do this client-side with two queries.)
>> b) Are there plans to implement this in facet.pivot or in the
>> facet.json-API?
>>
>> The first step could be to define "interestingness" as a sort option for
>> facets, computed as the facet count in the result set weighted against
>> the complete collection:
>>
>>   documentfrequency_termX(bucket) * inverse_documentfrequency_termX(collection)
>>
>> As an extension, the JSON-API could be used to change the domain used as
>> base for the comparison. Another interesting option would be to compare
>> facet-counts against a current parent-facet for nested facets, e.g. the 5
>> most interesting terms by genre for a query on 70s movies, returning the
>> terms specific to horror, comedy, action etc. compared to all terminology
>> at the time (i.e. in the parent-query).
>>
>> A callback function could be used to define other measures of
>> interestingness such as the log-likelihood ratio (
>> http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html). Most
>> measures need at least the following 4 values:
>> * document frequency of the term in the result set
>> * number of documents in the result set
>> * document frequency of the term in the index (or base domain)
>> * number of documents in the index (or base domain)
>>
>> I guess this feature might be of interest for those who want to do some
>> small-scale term-analysis in addition to search, e.g. as in my case in
>> digital humanities projects. But it might also be an interesting navigation
>> device, e.g. when searching on job-offers to show the skills that are most
>> distinctive for a category.
>>
>> It would be great to know if others are interested in this feature. If
>> there are any implementations out there or if anybody else is working on
>> this, a pointer would be a great start. In the absence of existing
>> solutions: Perhaps somebody has some idea on where and how to start
>> implementing this?
>>
>> Best regards,
>>
>> Ben
>>
>>
>>
>
