Ok, let's explore how to use scoreNodes() with the facet() expression. scoreNodes is a graph expression, so it expects certain fields to be present on each Tuple. The fields are:
1) node: The node id gathered by the gatherNodes() function.
2) collection: The collection that the node belongs to.
3) field: The field the node was gathered from.

The facet function does not include these fields automatically, so we'll need to adjust the tuples returned by the facet function using the select function. The pseudo code is:

scoreNodes(select(facet(...)))

To add some detail to this, let's take a simple case:

facet(collection1,
      q="*:*",
      buckets="author",
      bucketSorts="count(*) desc",
      bucketSizeLimit=100,
      count(*))

The tuples for this would look like this:

author: joel, count(*): 5
author: jim,  count(*): 4

So three things need to be done to these tuples to make them work with scoreNodes:

1) The author field needs to be renamed "node", so it looks like the node id of a gatherNodes function.
2) The "collection" field needs to be added.
3) The "field" field needs to be added.

So we can wrap a scoreNodes and a select function around the facet function like this:

scoreNodes(
  select(facet(collection1,
               q="*:*",
               buckets="author",
               bucketSorts="count(*) desc",
               bucketSizeLimit=100,
               count(*)),
         author as node,
         count(*),
         replace(collection, null, withValue=collection1),
         replace(field, null, withValue=author)))

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 11:53 AM, Joel Bernstein <joels...@gmail.com> wrote:

> You first gather the candidates and then call the TermsComponent with a
> callback. The scoreNodes expression does this, and it's very fast because
> streaming expressions are run from a Solr node in the same cluster.
>
> The TermsComponent will return the global docFreq for the terms and the
> global numDocs for the collection, so you'll be able to compute idf for
> each term.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Aug 3, 2016 at 11:22 AM, Ben Heuwing <heuw...@uni-hildesheim.de>
> wrote:
>
>> Hi Joel,
>>
>> thank you, this sounds great!
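[As an aside for readers of the archive: the tf-idf style scoring that scoreNodes applies, using the global docFreq and numDocs that the TermsComponent returns, can be sketched in a few lines. The numbers and the exact weighting formula below are illustrative assumptions, not Solr's actual implementation.]

```python
import math

# Illustrative stand-ins for a TermsComponent response: the global
# docFreq of each candidate term and the collection's numDocs.
# (These numbers are made up for the example.)
num_docs = 100_000
doc_freq = {"joel": 50, "jim": 40_000}

# Facet buckets as returned by the facet() expression: term -> count(*)
# within the current query.
buckets = {"joel": 5, "jim": 4}

def tf_idf(term, count):
    # Weight the in-query count by the term's rarity in the whole
    # collection. This is the general idea behind scoreNodes' tf-idf
    # ranking; the exact formula Solr uses may differ.
    return count * math.log(num_docs / doc_freq[term])

ranked = sorted(buckets, key=lambda t: tf_idf(t, buckets[t]), reverse=True)
print(ranked)  # "joel" outranks "jim": "jim" occurs in 40% of all docs
```

The point of the weighting is visible in the toy data: the raw counts (5 vs 4) are nearly equal, but the globally common term is pushed down because it carries little information about this particular result set.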
>>
>> As to your first proposal: I am a bit out of my depth here, as I have not
>> worked with streaming expressions so far. But I will try out your example
>> using the facet() expression on a simple use case as soon as you publish
>> it.
>>
>> Using the TermsComponent directly, would that imply that I have to
>> retrieve all possible candidates and then send them back as a terms.list
>> to get their df? However, I assume that this would still be faster than
>> making two separate facet calls. Or did you suggest using the component
>> in a customized RequestHandler?
>>
>> Regards,
>>
>> Ben
>>
>>
>> On 03.08.2016 at 14:57, Joel Bernstein wrote:
>>
>>> Also, the TermsComponent can now export the docFreq for a list of terms
>>> and the numDocs for the index. This can be used as a general purpose
>>> mechanism for scoring facets with a callback.
>>>
>>> https://issues.apache.org/jira/browse/SOLR-9243
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>> On Wed, Aug 3, 2016 at 8:52 AM, Joel Bernstein <joels...@gmail.com>
>>> wrote:
>>>
>>>> What you're describing is implemented with Graph aggregations in this
>>>> ticket using tf-idf. Other scoring methods can be implemented as well.
>>>>
>>>> https://issues.apache.org/jira/browse/SOLR-9193
>>>>
>>>> I'll update this thread with a description of how this can be used with
>>>> the facet() streaming expression as well as with graph queries later
>>>> today.
>>>>
>>>> Joel Bernstein
>>>> http://joelsolr.blogspot.com/
>>>>
>>>> On Wed, Aug 3, 2016 at 8:18 AM, <heuw...@uni-hildesheim.de> wrote:
>>>>
>>>>> Dear everybody,
>>>>>
>>>>> as the JSON API now makes configuration of facets and sub-facets
>>>>> easier, there appears to be a lot of potential to enable instant
>>>>> calculation of facet recommendations for a query, that is, to sort
>>>>> facets by their relative importance/interestingness/significance for
>>>>> the current query relative to the complete collection, or relative to
>>>>> a result set defined by a different query.
>>>>>
>>>>> An example would be to show the most typical terms used in
>>>>> descriptions of horror movies, in contrast to the most popular terms
>>>>> for this query, as the popular ones may include terms that occur just
>>>>> as often in other genres.
>>>>>
>>>>> This feature has been discussed earlier in the context of Solr:
>>>>> * http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
>>>>> * http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html
>>>>>
>>>>> In Elasticsearch, the specific feature that I am looking for is called
>>>>> the Significant Terms Aggregation:
>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation
>>>>>
>>>>> As of now, I have two questions:
>>>>>
>>>>> a) Are there workarounds in the current Solr implementation, or known
>>>>> patches, that implement such a sort option for fields with a large
>>>>> number of possible values, e.g. text fields? (For smaller vocabularies
>>>>> it is easy to do this client-side with two queries.)
>>>>> b) Are there plans to implement this in facet.pivot or in the
>>>>> facet.json API?
>>>>>
>>>>> The first step could be to define "interestingness" as a sort option
>>>>> for facets, and to define interestingness as the facet count in the
>>>>> result set compared to the complete collection:
>>>>> document_frequency_termX(bucket) * inverse_document_frequency_termX(collection)
>>>>>
>>>>> As an extension, the JSON API could be used to change the domain used
>>>>> as the base for the comparison. Another interesting option would be to
>>>>> compare facet counts against the current parent facet for nested
>>>>> facets, e.g. the 5 most interesting terms by genre for a query on 70s
>>>>> movies, returning the terms specific to horror, comedy, action etc.
>>>>> compared to all terminology at the time (i.e. in the parent query).
>>>>>
>>>>> A callback function could be used to define other measures of
>>>>> interestingness, such as the log-likelihood ratio
>>>>> (http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html).
>>>>> Most measures need at least the following 4 values: the document
>>>>> frequency of the term in the result set, the number of documents in
>>>>> the result set, the document frequency of the term in the index (or
>>>>> base domain), and the number of documents in the index (or base
>>>>> domain).
>>>>>
>>>>> I guess this feature might be of interest for those who want to do
>>>>> some small-scale term analysis in addition to search, e.g., as in my
>>>>> case, in digital humanities projects. But it might also be an
>>>>> interesting navigation device, e.g. when searching job offers, to show
>>>>> the skills that are most distinctive for a category.
>>>>>
>>>>> It would be great to know if others are interested in this feature. If
>>>>> there are any implementations out there, or if anybody else is working
>>>>> on this, a pointer would be a great start. In the absence of existing
>>>>> solutions: perhaps somebody has an idea of where and how to start
>>>>> implementing this?
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Ben
>>
>> --
>>
>> Ben Heuwing, Dr. phil.
>> Research Associate
>> Institut für Informationswissenschaft und Sprachtechnologie
>> Universität Hildesheim
>>
>> Postal address:
>> Universitätsplatz 1
>> D-31141 Hildesheim
>>
>> Office:
>> Lübeckerstraße 3
>> Room L017
>>
>> +49(0)5121 883-30316
>> heuw...@uni-hildesheim.de
>> Homepage <https://www.uni-hildesheim.de/fb3/institute/iwist/mitglieder/heuwing/>
>>
>> Doctoral dissertation published: /Usability-Ergebnisse als
>> Wissensressource in Organisationen/ - Print <http://www.vwh-verlag.de/vwh/?p=995>
>> | Online <http://nbn-resolving.de/urn:nbn:de:gbv:hil2-opus4-3914>
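[Archive note: the log-likelihood-ratio measure mentioned in the thread (Dunning's G² test, per the linked blog post) can be computed from exactly the four values Ben lists. Here is a minimal Python sketch; it assumes the result set is a subset of the base domain, and the function and variable names are my own, not taken from any Solr API.]

```python
import math

def llr(df_fg, n_fg, df_bg, n_bg):
    """Dunning's log-likelihood ratio (G^2) from the four document
    frequencies discussed in the thread: term df and doc count in the
    result set (foreground), term df and doc count in the whole index
    (background). Assumes the result set is a subset of the background."""
    # Build the 2x2 contingency table:
    # (term present / term absent) x (in result set / rest of collection).
    k11 = df_fg                   # term present, in result set
    k12 = n_fg - df_fg            # term absent, in result set
    k21 = df_bg - df_fg           # term present, outside result set
    k22 = (n_bg - n_fg) - k21     # term absent, outside result set

    def h(*ks):
        # Unnormalized entropy term: sum of k * ln(k / grand_total).
        return sum(k * math.log(k / n_bg) for k in ks if k > 0)

    # Entropy form of G^2, equivalent to 2 * sum O * ln(O / E).
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)    # row sums
                - h(k11 + k21, k12 + k22))   # column sums

# A term with the same relative frequency inside and outside the
# result set scores ~0; an over-represented term scores high.
print(llr(5, 50, 50, 500))    # ~0: term in 10% of docs in both domains
print(llr(40, 50, 50, 500))   # large: 80% of result set vs 10% overall
```

The entropy formulation avoids computing the expected counts explicitly and stays numerically stable for zero cells; it is the same formulation Dunning sketches in the linked post.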