You first gather the candidates and then call the TermsComponent with a
callback. The scoreNodes expression does this, and it is very fast because
streaming expressions run on a Solr node in the same cluster.

The TermsComponent will return the global docFreq for the terms and global
numDocs for the collection, so you'll be able to compute idf for each term.
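
As a rough sketch of that last step (not the scoreNodes implementation
itself): assuming the candidate facet counts have already been fetched and
the TermsComponent response has been parsed into a term-to-docFreq map plus
a numDocs value, the scoring could look like this in Python. The function
name, the input shapes and the plain log(numDocs / docFreq) idf variant are
assumptions chosen for illustration.

```python
import math


def score_candidates(facet_counts, doc_freqs, num_docs):
    """Rank candidate terms by (count in result set) * idf.

    facet_counts: term -> count within the current result set (hypothetical input)
    doc_freqs:    term -> global docFreq, as reported by the TermsComponent
    num_docs:     global numDocs for the collection
    """
    scored = []
    for term, count in facet_counts.items():
        df = doc_freqs.get(term, 0)
        if df <= 0:
            # No global statistics for this term, so it cannot be scored.
            continue
        # Assumed idf variant: natural log of numDocs / docFreq.
        idf = math.log(num_docs / df)
        scored.append((term, count * idf))
    # Highest score first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Made-up numbers: "the" is frequent in the result set but also everywhere else.
ranked = score_candidates(
    facet_counts={"blood": 40, "the": 90},
    doc_freqs={"blood": 100, "the": 9000},
    num_docs=10000,
)
```

With these made-up counts, "blood" ranks above "the" even though "the" has
the higher raw facet count, which is the effect the idf weighting is meant
to produce.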

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Aug 3, 2016 at 11:22 AM, Ben Heuwing <heuw...@uni-hildesheim.de>
wrote:

> Hi Joel,
>
> thank you, this sounds great!
>
> As to your first proposal: I am a bit out of my depth here, as I have not
> worked with streaming expressions so far. But I will try out your example
> using the facet() expression on a simple use case as soon as you publish it.
>
> Using the TermsComponent directly, would that imply that I have to
> retrieve all possible candidates and then send them back as a terms.list
> to get their df? I assume that this would still be faster than having
> two repeated facet calls. Or were you suggesting using the component in a
> customized RequestHandler?
>
> Regards,
>
> Ben
>
>
> On 03.08.2016 at 14:57, Joel Bernstein wrote:
>
>> Also, the TermsComponent can now export the docFreq for a list of terms and
>> the numDocs for the index. This can be used as a general-purpose mechanism
>> for scoring facets with a callback.
>>
>> https://issues.apache.org/jira/browse/SOLR-9243
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>> On Wed, Aug 3, 2016 at 8:52 AM, Joel Bernstein<joels...@gmail.com>
>> wrote:
>>
>>> What you're describing is implemented with Graph aggregations in this
>>> ticket using tf-idf. Other scoring methods can be implemented as well.
>>>
>>> https://issues.apache.org/jira/browse/SOLR-9193
>>>
>>> I'll update this thread with a description of how this can be used with
>>> the facet() streaming expression as well as with graph queries later
>>> today.
>>>
>>>
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>> On Wed, Aug 3, 2016 at 8:18 AM,<heuw...@uni-hildesheim.de>  wrote:
>>>
>>>> Dear everybody,
>>>>
>>>> as the JSON-API now makes configuration of facets and sub-facets easier,
>>>> there appears to be a lot of potential to enable instant calculation of
>>>> facet-recommendations for a query, that is, to sort facets by their
>>>> relative importance/interestingness/significance for the current query,
>>>> compared to the complete collection or to a result set defined by a
>>>> different query.
>>>>
>>>> An example would be to show the most typical terms which are used in
>>>> descriptions of horror-movies, in contrast to the most popular ones for
>>>> this query, as these may include terms that occur as often in other
>>>> genres.
>>>>
>>>> This feature has been discussed earlier in the context of Solr:
>>>>
>>>> * http://stackoverflow.duapp.com/questions/26399264/how-can-i-sort-facets-by-their-tf-idf-score-rather-than-popularity
>>>> * http://lucene.472066.n3.nabble.com/Facets-with-an-IDF-concept-td504070.html
>>>>
>>>> In Elasticsearch, the specific feature that I am looking for is called
>>>> Significant Terms Aggregation:
>>>>
>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#search-aggregations-bucket-significantterms-aggregation
>>>>
>>>> As of now, I have two questions:
>>>>
>>>> a) Are there workarounds in the current Solr implementation or known
>>>> patches that implement such a sort-option for fields with a large
>>>> number of possible values, e.g. text-fields? (For smaller vocabularies
>>>> it is easy to do this client-side with two queries.)
>>>> b) Are there plans to implement this in facet.pivot or in the
>>>> facet.json-API?
>>>>
>>>> The first step could be to define "interestingness" as a sort-option for
>>>> facets and to define interestingness as facet-count in the result-set as
>>>> compared to the complete collection: documentfrequency_termX(bucket) *
>>>> inverse_documentfrequency_termX(collection)
>>>>
>>>> As an extension, the JSON-API could be used to change the domain used as
>>>> base for the comparison. Another interesting option would be to compare
>>>> facet-counts against a current parent-facet for nested facets, e.g. the
>>>> 5 most interesting terms by genre for a query on 70s movies, returning
>>>> the terms specific to horror, comedy, action etc. compared to all
>>>> terminology at the time (i.e. in the parent-query).
>>>>
>>>> A callback function could be used to define other measures of
>>>> interestingness such as the log-likelihood ratio (
>>>> http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html).
>>>> Most measures need at least the following 4 values: the
>>>> document-frequency of a term in the result-set, the number of
>>>> documents in the result-set, the document-frequency of the term in the
>>>> index (or base-domain), and the number of documents in the index (or
>>>> base-domain).
>>>>
>>>> I guess this feature might be of interest for those who want to do some
>>>> small-scale term-analysis in addition to search, e.g. as in my case in
>>>> digital humanities projects. But it might also be an interesting
>>>> navigation device, e.g. when searching on job-offers to show the skills
>>>> that are most distinctive for a category.
>>>>
>>>> It would be great to know if others are interested in this feature. If
>>>> there are any implementations out there or if anybody else is working on
>>>> this, a pointer would be a great start. In the absence of existing
>>>> solutions: Perhaps somebody has some idea on where and how to start
>>>> implementing this?
>>>>
>>>> Best regards,
>>>>
>>>> Ben
>>>>
>>>>
>>>>
>>>>
> --
>
> Ben Heuwing, Dr. phil.
> Research Associate
> Institut für Informationswissenschaft und Sprachtechnologie
> Universität Hildesheim
>
> Postal address:
> Universitätsplatz 1
> D-31141 Hildesheim
>
>
> Office:
> Lübeckerstraße 3
> Room L017
>
> +49(0)5121 883-30316
> heuw...@uni-hildesheim.de
> Homepage <
> https://www.uni-hildesheim.de/fb3/institute/iwist/mitglieder/heuwing/>
>
> Dissertation published: /Usability-Ergebnisse als
> Wissensressource in Organisationen/ - Print <
> http://www.vwh-verlag.de/vwh/?p=995> | Online <
> http://nbn-resolving.de/urn:nbn:de:gbv:hil2-opus4-3914>
>
>
