As an experiment, you could try the significantTerms Streaming Expression,
which uses the foreground and background counts to score terms.

https://solr.apache.org/guide/8_9/search-sample.html#significantterms
https://solr.apache.org/guide/8_9/stream-source-reference.html#significantterms-parameters
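
As a rough sketch of how such an expression could be put together, the
Python below builds a significantTerms expression string and the URL for the
/stream handler. The collection name "papers", the foreground query
"abstract:mitochondria" and the localhost:8983 address are assumptions made
up for illustration; only the field name comes from this thread. It sends
nothing over the network.

```python
from urllib.parse import urlencode


def significant_terms_expr(collection, query, field, limit=3):
    """Build a significantTerms streaming expression as a string."""
    return (
        f"significantTerms({collection}, "
        f'q="{query}", field="{field}", limit="{limit}")'
    )


# Hypothetical collection and foreground query; the field is from the thread.
expr = significant_terms_expr(
    "papers",
    "abstract:mitochondria",
    "abstracts_gene_pubtator_annotation_ids",
)

# Assumed local Solr address; the expression is passed as the "expr" parameter.
stream_url = "http://localhost:8983/solr/papers/stream?" + urlencode({"expr": expr})

print(expr)
print(stream_url)
```

The second link above lists further parameters that could be added in the
same way.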








Joel Bernstein
http://joelsolr.blogspot.com/


On Wed, Jun 22, 2022 at 2:37 AM Danilo Tomasoni <tomas...@cosbi.eu> wrote:

> Hello Dave, first of all thank you for your answer.
>
> I need to clarify that I've used separate (and quite good) NER algorithms
> offline and the results were imported into Solr.
>
> Unfortunately, the approach that you suggest using the MoreLikeThis
> functionality is not suitable for my needs, since I need to discover
> statistically significant relations between NER entities, while MLT will
> give me NER entities "similar" to the ones I'm looking for, as far as I
> understand.
>
> Does anyone know why the relatedness is high even if the foreground (and
> even background) popularity is 0?
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
>
> As for the European General Data Protection Regulation 2016/679 on the
> protection of natural persons with regard to the processing of personal
> data, we inform you that all the data we possess are object of treatment in
> the respect of the normative provided for by the cited GDPR.
> It is your right to be informed on which of your data are used and how;
> you may ask for their correction, cancellation or you may oppose to their
> use by written request sent by recorded delivery to The Microsoft Research
> – University of Trento Centre for Computational and Systems Biology Scarl,
> Piazza Manifattura 1, 38068 Rovereto (TN), Italy.
> Please don't print this e-mail unless you really need to
> ________________________________
> From: Dave <hastings.recurs...@gmail.com>
> Sent: Tuesday, June 21, 2022 19:51
> To: users@solr.apache.org <users@solr.apache.org>
> Subject: Re: Semantic Knowledge Graph theoric question
>
>
> Two hints: the NER from Solr isn't very good, and the relatedness function
> is iffy at best.
>
> I would take a different approach. Take the NER data as you have it now and
> use shingles to build a better-formed, complete index using stop words, then
> use the MLT mechanism to see if it's better. If it is, great; if not, it's
> just an idea.
>
>
> > On Jun 21, 2022, at 12:02 PM, Danilo Tomasoni <tomas...@cosbi.eu> wrote:
> >
> > Hello all,
> > I'm experimenting with the SKG features available through the json.facet
> > API in Solr 8.11 to discover semantic relations in medical text
> > pre-annotated with NER algorithms.
> > I store the NER annotations, annotation IDs, spans, etc. in separate Solr
> > fields, to keep the text clean.
> >
> > The first results look promising, but I found a behaviour that surprises
> > me.
> > To give a bit of context, I'm looking for covid-related papers with a
> > standard query (q parameter).
> > Then I set my foreground query to be a set of mitochondria-related
> > keywords joined with OR, and the background query is set to *.
> >
> > Then the json.facet parameters are like
> >
> > "json.facet": {
> >   "gene": {
> >     "type": "terms",
> >     "field": "abstracts_gene_pubtator_annotation_ids",
> >     "sort": { "r1": "desc" },
> >     "limit": 3,
> >     "facet": {
> >       "r1": "relatedness($fore,$back)"
> >     }
> >   }
> > }
> > This should return the genes stored in
> > abstracts_gene_pubtator_annotation_ids that are more likely to occur in
> > mitochondrial papers.
> > Running a test query gives me this surprising result:
> >
> > ...
> >        "gene": {
> >          "buckets": [
> >            {
> >              "val": "3091",
> >              "count": 1,
> >              "rtitles1": {
> >                "relatedness": 0.55649,
> >                "foreground_popularity": 0,
> >                "background_popularity": 0.00018
> >              }
> >            },
> > ...
> > or, for a similar query, even bigger relatedness values:
> > ...
> >    "buckets": [
> >      {
> >        "val": "MESH:D028361",
> >        "count": 1,
> >        "rabstract_conclusions0": {
> >          "relatedness": 0.91506,
> >          "foreground_popularity": 5e-05,
> >          "background_popularity": 5e-05
> >        },
> >
> > ...
> >
> > But if I recall the z-score formula
> >
> > countFG("3091") - totalFG * probBG
> > ------------------------------------------------
> > sqrt( totalFG * (1-probBG)*probBG )
> >
> > and set countFG("3091") to 1, the relatedness should be negative (or at
> > most 0) whenever totalFG * probBG >= 1, whereas here I find a clearly
> > positive relatedness.
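
To make the reasoning in the paragraph above concrete, here is a small
Python sketch that evaluates the quoted z-score formula as written. The
totalFG values are hypothetical, and treating the reported
background_popularity (0.00018) as probBG is an assumption. The sketch does
not try to reproduce how Solr turns a score into the reported relatedness
value.

```python
import math


def z_score(count_fg, total_fg, prob_bg):
    """Evaluate the z-score formula quoted in the message above."""
    expected = total_fg * prob_bg  # totalFG * probBG
    std_dev = math.sqrt(total_fg * (1 - prob_bg) * prob_bg)
    return (count_fg - expected) / std_dev


# Hypothetical foreground sizes; 0.00018 is the background_popularity
# reported for term "3091", assumed here to stand in for probBG.
z_small_fg = z_score(count_fg=1, total_fg=1_000, prob_bg=0.00018)
z_large_fg = z_score(count_fg=1, total_fg=10_000, prob_bg=0.00018)

print("totalFG * probBG < 1  ->", z_small_fg)
print("totalFG * probBG >= 1 ->", z_large_fg)
```

With countFG = 1 the sign flips exactly where totalFG * probBG crosses 1,
so under this formula a positive score for a single foreground hit is only
possible when the expected count is below 1.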
> > Maybe this can be controlled with min_popularity, but I don't understand
> > how to use it in conjunction with type=terms and
> > field=abstracts_gene_pubtator_annotation_ids.
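
On the min_popularity syntax asked about above: as far as I recall from the
JSON Facet API section of the Solr Reference Guide, relatedness() accepts
options when the nested facet is written in its object form ("type":
"func"). The Python below only builds and prints that JSON; the 0.001
threshold is a made-up value, and the exact option handling should be
checked against the guide for your Solr version.

```python
import json


def relatedness_facet(field, min_popularity, limit=3):
    """Build a terms facet whose nested relatedness() carries options."""
    return {
        "gene": {
            "type": "terms",
            "field": field,
            "sort": {"r1": "desc"},
            "limit": limit,
            "facet": {
                # Object form of the function facet, so options can be added.
                "r1": {
                    "type": "func",
                    "func": "relatedness($fore,$back)",
                    "min_popularity": min_popularity,  # hypothetical threshold
                }
            },
        }
    }


facet = relatedness_facet("abstracts_gene_pubtator_annotation_ids", 0.001)
print(json.dumps({"json.facet": facet}, indent=2))
```

The type=terms and field settings stay on the outer facet exactly as in the
original request; only the "r1" entry changes from a plain string to an
object.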
> >
> > Can you please tell me the correct syntax, and whether my reasoning is
> > correct?
> > Thank you
> > Danilo
> >
> > Danilo Tomasoni
>
