Re: Handling acronyms

2021-01-15 Thread Michael Gibney
EDIT: "the equivalent terms are separated by commas (as they should be)" =>
"the equivalent terms are _not_ separated by commas (as they should be)"


Re: Handling acronyms

2021-01-15 Thread Michael Gibney
Shaun,

I'm not 100% sure, but don't give up on this just yet:

> For example if I enter diabetes it finds the acronym DM for diabetes
mellitus

I think the behavior you're observing may simply be a side-effect of a
misconfiguration of synonyms.txt. In the example you posted, the equivalent
terms are _not_ separated by commas (as they should be), which would lead to
treating line `DM diabetes mellitus` as effectively "DM == diabetes ==
mellitus", which as you point out is clearly not what you want. Do you see
similar results for `DM, diabetes mellitus` (which should be parsed as
meaning "DM == 'diabetes mellitus'", which iiuc _is_ what you want)?
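
To make the difference between the two lines concrete, here is a rough Python sketch of how a line in the comma/arrow synonym format splits into entries. It is a simplified model for illustration only, not Solr's actual parser: the function name `parse_synonym_line` is made up, and escaping, comments, and analysis of the terms are all ignored.

```python
def parse_synonym_line(line):
    """Split one synonyms.txt-style line into (inputs, outputs).

    Each entry is a tuple of whitespace-separated words. Entries are
    separated by commas; "=>" separates inputs from outputs. Without
    "=>", every entry is treated as equivalent to every other entry.
    """
    def entries(side):
        # One entry per comma-separated chunk, ignoring empty chunks.
        return [tuple(chunk.split()) for chunk in side.split(",") if chunk.strip()]

    if "=>" in line:
        lhs, rhs = line.split("=>", 1)
        return entries(lhs), entries(rhs)
    same = entries(line)
    return same, same


# No comma: a single three-word entry.
print(parse_synonym_line("DM diabetes mellitus"))
# With comma: two entries, "DM" and "diabetes mellitus".
print(parse_synonym_line("DM, diabetes mellitus"))
```

In this model the line without a comma yields only one three-word entry, so `DM` and `diabetes mellitus` are never seen as two separate equivalent terms, whatever the filter then does with that single entry.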

(see the note about ensuring proper comma-separation in my earlier response)

Michael



Re: Handling acronyms

2021-01-15 Thread Shaun Campbell
Hi Michael

Thanks for that I'll have a study later.  It's just reminded me of the
expand option which I meant to have a look at.

Thanks
Shaun



Re: Handling acronyms

2021-01-15 Thread Shaun Campbell
Hi Charlie

I was indexing at index time only. The synonyms/acronyms were coming from
the published journals xml files so I wasn't expecting to maintain them
myself.  If it worked, I was expecting, hopefully, to update the synonyms
file automatically.

As I just explained to Bernd I'm finding that because I'm just using
supplied acronyms from the documents there's some overlap on the words used
and it's giving me unexpected results.  For example if I enter diabetes it
finds the acronym DM for diabetes mellitus, which then coincides with an
author's initials and puts them at the top of the list, which is completely
wrong, or is it?  Perhaps I was looking for an author DM. Just too much
noise to be useful I think.

Thanks for your input anyway.
Shaun





Re: Handling acronyms

2021-01-15 Thread Shaun Campbell
Hi Bernd

Thanks for that. I think it is working, but I think unfortunately what I'm
trying to do is impossible/not logical.  When I enter a term it goes off
and searches using all the matching acronyms, because I'm finding a term
used in more than one synonym, e.g. diabetes.

I think at the end of the day this produces too much "noise" to make any
sense of the results.   Think I will have to park this for now.

Thanks
Shaun



Re: Handling acronyms

2021-01-15 Thread Michael Gibney
The equivalent terms on the right-hand side of the `=>` operator in the
example you sent should be separated by a comma. You mention you already
tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
and that that yielded unexpected results as well. I would recommend
pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case), and
applying the synonym filter _after_ case normalization in the analysis
chain (there are other ways you could do this, but the key point is that you
need to pay attention to case and how it interacts with the order in which
filters are applied).
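
The interaction between case and filter order can be shown with a toy analysis chain in Python. This is only a sketch under stated assumptions: `lowercase` and `make_synonym_filter` are hypothetical helpers, the synonym lookup is an exact string match, and any case-handling options the real Solr filter offers are ignored.

```python
def lowercase(tokens):
    """Case-normalization step of the toy analysis chain."""
    return [t.lower() for t in tokens]


def make_synonym_filter(rules):
    """Return a filter that appends synonyms for exactly matching tokens."""
    def apply(tokens):
        out = []
        for token in tokens:
            out.append(token)
            out.extend(rules.get(token, []))
        return out
    return apply


mixed_case_rules = {"SRN": ["stroke research network"]}
lower_case_rules = {"srn": ["stroke research network"]}
tokens = ["SRN", "trial"]

# Lower-casing runs first, but the rules are still keyed on "SRN":
# the lookup sees "srn" and the rule never fires.
missed = make_synonym_filter(mixed_case_rules)(lowercase(tokens))

# Lower-casing runs first and the rules were lower-cased in advance:
# the lookup sees "srn" and the rule fires.
matched = make_synonym_filter(lower_case_rules)(lowercase(tokens))

print(missed)
print(matched)
```

The same mismatch appears in the other direction if the synonym filter runs before case normalization and the rules are lower-cased, which is why the terms in the file and the position of the filter in the chain need to agree.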

Re: Charlie's recommendation to apply these at index-time, a word of
caution (and it's possible that this is in fact the underlying cause of
some of the unexpected behavior you're observing?): be careful if you're
using term _expansion_ at index-time (i.e., mapping single terms to
multiple terms, which I note appears to be what you're trying to do in the
example lines you provided). Multi-term index-time synonyms can lead to
unexpected results for positional queries (either explicit phrase queries,
or implicit, e.g. as configured by `pf` param in edismax). I'm aware of at
least two good overviews of this topic, one by Mike McCandless focusing on
Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The underlying
issue is related to LUCENE-4312 [3], so both posts (ES- & Solr-related) are
relevant.
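
A toy model may help show why positional queries go wrong here. The Python sketch below assumes a flat positional index in which every token has a position but no length, which is how I understand the limitation discussed in the linked posts; the helper names are hypothetical and nothing here is Lucene code.

```python
def index_positions(tokens, expansions):
    """Build {term: set(positions)} for one document.

    Expansion words are stacked starting at the position of the token
    they expand, one position per word, with no record of how many
    positions the expansion was meant to span.
    """
    postings = {}
    for pos, token in enumerate(tokens):
        postings.setdefault(token, set()).add(pos)
        for offset, word in enumerate(expansions.get(token, [])):
            postings.setdefault(word, set()).add(pos + offset)
    return postings


def phrase_matches(postings, phrase):
    """Exact phrase match (slop 0): each word one position after the last."""
    starts = postings.get(phrase[0], set())
    return any(
        all(start + i in postings.get(word, set()) for i, word in enumerate(phrase))
        for start in starts
    )


doc = ["srn", "trial", "results"]
postings = index_positions(doc, {"srn": ["stroke", "research", "network"]})

# A phrase one would expect to match the expanded document, but does not.
expected_hit = phrase_matches(postings, ["stroke", "research", "network", "trial"])
# A phrase that never occurs in the text, but matches anyway.
unexpected_hit = phrase_matches(postings, ["stroke", "trial"])

print(expected_hit, unexpected_hit)
```

In this model "research" and "network" land on the same positions as "trial" and "results", so the expanded phrase no longer lines up with the words that follow the acronym.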

One way to work around this is to "collapse" (rather than expand) synonyms,
at both index and query time. Another option would be to apply synonym
expansion only at query-time. It's also worth noting that increasing phrase
slop (`ps` param, etc.) can cause the issues with index-time synonym
expansion to "fly under the radar" a little, wrt the most blatant "false
negative" manifestations of index-time synonym issues for phrase queries.
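
As an illustration of the "collapse" approach, the Python sketch below maps a multi-word phrase to a single canonical token and applies the same function to both the indexed text and the query. The mapping `CANONICAL` and the function `collapse` are hypothetical; a real setup would express this through synonym rules in the analysis chain rather than application code.

```python
# Hypothetical mapping from a multi-word phrase to one canonical token.
CANONICAL = {("stroke", "research", "network"): "srn"}


def collapse(tokens, canonical=CANONICAL):
    """Replace each known multi-word phrase with its single canonical token."""
    out, i = [], 0
    while i < len(tokens):
        for phrase, target in canonical.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(target)
                i += len(phrase)
                break
        else:
            # No phrase starts at this position: keep the token as it is.
            out.append(tokens[i])
            i += 1
    return out


indexed = collapse("the stroke research network trial".split())
query_by_acronym = collapse("srn trial".split())
query_by_text = collapse("stroke research network trial".split())

print(indexed, query_by_acronym, query_by_text)
```

Because both sides are reduced to the same single token, the acronym and the spelled-out form occupy one position each and phrase queries line up again. The trade-off in this sketch is that the two forms can no longer be told apart after analysis.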

[1]
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
[2]
https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
[3] https://issues.apache.org/jira/browse/LUCENE-4312



Re: Handling acronyms

2021-01-15 Thread Charlie Hull
I'm wondering if you should be using these acronyms at index time, not 
search time. It will make your index bigger and you'll have to re-index 
to add new synonyms (as they may apply to old documents) but this could 
be an occasional task, and in the meantime you could use query-time 
synonyms for the new ones.


Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to me.

Cheers

Charlie



--
Charlie Hull - Managing Consultant at OpenSource Connections Limited 

Founding member of The Search Network  
and co-author of Searching the Enterprise 


tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828


Re: Handling acronyms

2021-01-15 Thread Bernd Fehling

If you are using multiword synonyms, acronyms, ...
You should escape the spaces within the multiword terms.

As synonyms.txt:
SRN, Stroke\ Research\ Network
IGBP, isolated\ gastric\ bypass
...

Regards
Bernd




Handling acronyms

2021-01-15 Thread Shaun Campbell
I have a medical journals search application and I've a list of some 9,000
acronyms like this:

MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
SRN=>SRN Stroke Research Network
IGBP=>IGBP isolated gastric bypass
TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive
sleep apnoea–hypopnoea
SRM=>SRM standardised response mean
SRT=>SRT substrate reduction therapy
SRS=>SRS Sexual Rating Scale
SRU=>SRU stroke rehabilitation unit
T2w=>T2w T2-weighted
Ab-P=>Ab-P Aberdeen participation restriction subscale
MSOA=>MSOA middle-layer super output area
SSA=>SSA site-specific assessment
SSC=>SSC Study Steering Committee
SSB=>SSB short-stretch bandage
SSE=>SSE sum squared error
SSD=>SSD social services department
NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument

I tried to put them in a synonyms file, either just with a comma between,
or with an arrow in between and the acronym repeated on the right like
above, and no matter what I try I'm getting really strange search results.
It's like words in one acronym are matching with the same word in another
acronym and then searching with that acronym which is completely unrelated.
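
One way to see where this kind of cross-matching could come from is to check which words are shared between expansions. The Python sketch below is only a diagnostic idea, assuming the list is available as acronym/expansion pairs; `shared_words` is a hypothetical helper and the sample uses a few entries from the list above.

```python
from collections import defaultdict


def shared_words(acronyms):
    """Map each lower-cased word to the acronyms whose expansion contains it,
    keeping only words that occur under more than one acronym."""
    owners = defaultdict(set)
    for acronym, expansion in acronyms.items():
        for word in expansion.lower().split():
            owners[word].add(acronym)
    return {word: sorted(names) for word, names in owners.items() if len(names) > 1}


sample = {
    "SRN": "Stroke Research Network",
    "SRU": "stroke rehabilitation unit",
    "SRM": "standardised response mean",
    "SRS": "Sexual Rating Scale",
}
overlaps = shared_words(sample)
print(overlaps)
```

If the individual words of an expansion end up being treated as synonyms in their own right, a shared word such as "stroke" would link SRN and SRU even though the two acronyms are unrelated.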

I don't think Solr can handle this, but does anyone know of any crafty
tricks in Solr to handle this situation where I can either search by the
acronym or by the text?

Shaun