Hi Mikhail, Elisabeth, and Atin,

Thank you for your input. After working on this issue for days, I finally
found the main culprits, and I've taken the following steps:

1. Doing synonym expansion only at query time, not at index time, in order
to get correct multi-word synonyms.
2. Using WhitespaceTokenizer instead of StandardTokenizer or
KeywordTokenizer. StandardTokenizer always splits hyphenated words like
immuno-oncology into immuno and oncology, which are not found in the
synonym definitions, and KeywordTokenizer keeps the whole field value as a
single token.
3. Applying the SnowballPorterFilter for stemming only after the
SynonymGraphFilter. Otherwise, immuno-oncology is stemmed to
immuno-oncolog, which does not match the immuno-oncology entry in the
synonyms.txt file.
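
As a minimal sketch, a field type that follows these three steps could look
roughly like the one below. The name text_syn is hypothetical, the synonyms
file name is taken from the schema quoted further down, and the index
analyzer is an assumption: it simply mirrors the query analyzer without the
synonym filter.

```xml
<!-- Sketch only: "text_syn" is a hypothetical field type name. -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
    <!-- Assumed index analyzer: no synonym expansion at index time (step 1). -->
    <analyzer type="index">
       <!-- Whitespace tokenization keeps immuno-oncology as one token (step 2). -->
       <tokenizer class="solr.WhitespaceTokenizerFactory" />
       <filter class="solr.LowerCaseFilterFactory" />
       <filter class="solr.SnowballPorterFilterFactory" language="English" />
    </analyzer>
    <analyzer type="query">
       <tokenizer class="solr.WhitespaceTokenizerFactory" />
       <filter class="solr.LowerCaseFilterFactory" />
       <!-- Synonyms are expanded at query time only (step 1). -->
       <filter class="solr.SynonymGraphFilterFactory"
               synonyms="country-synonyms.txt" ignoreCase="true" expand="true" />
       <!-- Stemming runs after synonym expansion (step 3). -->
       <filter class="solr.SnowballPorterFilterFactory" language="English" />
    </analyzer>
</fieldType>
```

Checking a term such as immuno-oncology on the Analysis page of the Solr
Admin UI should show whether each stage emits the expected tokens.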

I found this presentation
https://www.slideshare.net/BertrandRigaldies/the-solr-multiterms-synonyms-maze-graphs
incredibly helpful. Setting up a minimal example index containing only
5 documents helped a lot as well.
It may turn out at a later point that I need to use synonyms at index-time
for speed, in which case I would only index the single-word synonyms there,
as suggested by Bertrand Rigaldies in the above presentation.
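
If index-time expansion of single-word synonyms does become necessary, the
index analyzer might be sketched as follows. This is an untested fragment
built on assumptions: single-word-synonyms.txt is a hypothetical file that
would hold only the single-token entries (for example Afghanistan, AF, AFG),
and FlattenGraphFilterFactory follows the synonym filter because an index
cannot store a token graph.

```xml
<!-- Sketch only: index-time expansion restricted to single-word synonyms. -->
<analyzer type="index">
   <tokenizer class="solr.WhitespaceTokenizerFactory" />
   <filter class="solr.LowerCaseFilterFactory" />
   <!-- Hypothetical file containing single-word entries only. -->
   <filter class="solr.SynonymGraphFilterFactory"
           synonyms="single-word-synonyms.txt" ignoreCase="true" expand="true" />
   <filter class="solr.FlattenGraphFilterFactory" />
   <!-- Stemming still comes after synonym expansion. -->
   <filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
```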

@Mikhail: "It's usually tough." I've noticed :)
@Elisabeth: Thank you for your suggestion. From the description, it seems
that hon-lucene-synonyms fixes query-time expansion of multi-word synonyms,
which the SynonymGraphFilter and the query parser already handle correctly
in newer Solr versions.

Best regards,
Annika



On Thu, Mar 7, 2024 at 12:56 PM atin janki <[email protected]> wrote:

> Hi Annika,
>
> Can you please share a sample query and how it is being expanded?
> Also, please share how you expect it to be expanded.
> It would help to replicate your scenario and understand the problem better.
>
> Best Regards,
> Atin Janki
>
>
> On Tue, Mar 5, 2024 at 4:21 PM elisabeth benoit <[email protected]> wrote:
>
> > Hello Annika,
> >
> > For multi-word synonyms, we have been using the
> > https://github.com/healthonnet/hon-lucene-synonyms
> > jar, which we just rebuilt with Solr 9.2.1 (a modification is needed, if
> > you ever need details).
> >
> > It overrides the edismax query parser and expands multi-word synonyms at
> > query time.
> >
> > We didn't want to expand synonyms at index time because we had this problem:
> >
> > in the index: mairie
> > synonym: hotel de ville
> >
> > and then at query time, with the query 'hotel', mairie would match.
> >
> > With hon-lucene, when a user asks for "hotel de ville", we match
> > mairie, but "hotel" doesn't match mairie.
> >
> > You might have performance issues with hon-lucene if you have hundreds of
> > synonyms. But it's worth testing.
> >
> > Best regards,
> > Elisabeth
> >
> > On Mon, Mar 4, 2024 at 5:16 PM Mikhail Khludnev <[email protected]> wrote:
> >
> > > Hello Annika,
> > > You can use the Solr Admin Analysis page and the debugQuery and
> > > explainOther params to dig into a particular case. It's usually tough.
> > > I've found one clue in the ref guide:
> > > "To get fully correct positional queries when your synonym replacements
> > > are multiple tokens, you should instead apply synonyms using this filter
> > > at query time."
> > > You could probably start with something simple.
> > >
> > > On Mon, Mar 4, 2024 at 5:23 PM Annika Gable
> > > <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I'm using Solr 9.1, and I'm trying to set up synonyms. I managed to get
> > > > synonyms to work for single-word synonyms, but not for multiword and
> > > > hyphenated synonyms.
> > > >
> > > > In the final state, I am planning on having a very extensive synonym file
> > > > (hundreds, if not thousands of lines) because I want to always find results
> > > > for all child terms and other synonyms of a given search term. This is why
> > > > I thought it may make sense to list all synonyms in the index. But getting
> > > > it to work with query-time synonym expansion would also be great already.
> > > >
> > > > For now, I am testing with equivalent synonyms. I am always querying using
> > > > quotation marks around the multi-word query.
> > > >
> > > > What I have tried:
> > > > 1. I included sow=false in the query as recommended here:
> > > > https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
> > > >
> > > > 2. I used the SynonymGraphFilter either only at query time, or at index
> > > > time, or both -> I got the same number of results when querying single-word
> > > > synonyms, as expected (e.g. TIGIT, domvanalimab), but querying multi-word
> > > > synonyms did not find the other synonyms correctly.
> > > > 3. I made all text fields into a text_field (which uses the
> > > > KeywordTokenizer) instead of text_general (which uses the
> > > > StandardTokenizer), in order to prevent splitting up multi-word queries. ->
> > > > This still did not make multiword-synonyms work.
> > > >
> > > >
> > > > My country-synonyms.txt file looks like this:
> > > >
> > > > TIGIT, domvanalimab, COM902, BMS-986207, Anti-TIGIT Antibody
> > > > immuno-oncology, immunooncology
> > > > Afghanistan, AF, AFG
> > > > Albania, AL, ALB
> > > >
> > > >
> > > > And the relevant query fields from my schema.xml look like this, with
> > > > text_general being the fieldtype of the catchall field
> > > >
> > > > <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100">
> > > >     <analyzer type="index">
> > > >        <tokenizer class="solr.KeywordTokenizerFactory" />
> > > >        <filter class="solr.LowerCaseFilterFactory" />
> > > >        <filter class="solr.SynonymGraphFilterFactory" synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
> > > >        <filter class="solr.FlattenGraphFilterFactory"/>
> > > >     </analyzer>
> > > >     <analyzer type="query">
> > > >        <tokenizer class="solr.KeywordTokenizerFactory" />
> > > >        <filter class="solr.LowerCaseFilterFactory" />
> > > >        <filter class="solr.SynonymGraphFilterFactory" synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
> > > >     </analyzer>
> > > > </fieldType>
> > > > <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
> > > >     <analyzer type="index">
> > > >        <tokenizer class="solr.StandardTokenizerFactory" />
> > > >        <filter class="solr.LowerCaseFilterFactory" />
> > > >        <filter class="solr.SnowballPorterFilterFactory" language="English" />
> > > >        <filter class="solr.SynonymGraphFilterFactory" synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
> > > >        <filter class="solr.FlattenGraphFilterFactory"/>
> > > >     </analyzer>
> > > >     <analyzer type="query">
> > > >        <tokenizer class="solr.StandardTokenizerFactory" />
> > > >        <filter class="solr.LowerCaseFilterFactory" />
> > > >        <filter class="solr.SnowballPorterFilterFactory" language="English" />
> > > >        <filter class="solr.SynonymGraphFilterFactory" synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
> > > >     </analyzer>
> > > > </fieldType>
> > > >
> > > >
> > > > Any hints would be appreciated!
> > > >
> > > > --
> > > > PRIVILEGED AND CONFIDENTIAL
> > > > PLEASE NOTE: The information contained in this
> > > > message is privileged and confidential, and is intended only for the use of
> > > > the individual to whom it is addressed and others who have been
> > > > specifically authorized to receive it. If you are not the intended
> > > > recipient, you are hereby notified that any dissemination, distribution or
> > > > copying of this communication is strictly prohibited. If you have received
> > > > this communication in error, or if any problems occur with transmission,
> > > > please contact the sender and kindly delete any copies of this
> > > > communication. Thank you.
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > >
> >
>



