Re: Multi-word synonyms not working

Charlie Hull Mon, 11 Mar 2024 02:33:23 -0700

Hi Annika,

Glad you like Bertrand's Haystack presentation! My colleague Daniel Wrigley recently wrote an overview blog on synonyms here: https://opensourceconnections.com/blog/2023/03/29/applying-synonyms-types-strategies-tools-and-a-glimpse-into-the-future/ which links to several other synonym blogs on our site.


Best

Charlie

On 08/03/2024 15:09, Annika Gable wrote:

Hi Mikhail, Elisabeth, and Atin,

Thank you for your inputs. After working on this issue for days, I finally
found the main culprits, and I've taken the following steps:

1. Doing synonym expansion only at query time, not at index time, in order
to get correct multi-word synonyms.
2. Using WhitespaceTokenizer instead of StandardTokenizer or
KeywordTokenizer; otherwise, hyphenated words like immuno-oncology will
always be split into immuno and oncology, which will not be found in the
synonym definitions!
3. Using the SnowballPorterFilter for stemming only after
SynonymGraphFilter; otherwise, immuno-oncology will be stemmed into
immuno-oncolog, which does not match the immuno-oncology in the
synonyms.txt file.
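
The three steps above can be sketched as a single field type. This is only a sketch under assumptions: the field type name text_synonyms is hypothetical, and the synonyms file name is the one from the original question further down the thread. The filter order (lowercase, then synonyms, then stemming) follows step 3.

```xml
<!-- Hypothetical field type name; adjust to your own schema. -->
<fieldType name="text_synonyms" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: no synonym expansion (step 1). -->
  <analyzer type="index">
    <!-- Whitespace tokenization keeps immuno-oncology as one token (step 2). -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <!-- Query time: synonyms are expanded before stemming (step 3). -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

With this order, the synonym filter sees unstemmed tokens, and the stemmer then runs over both the original token and its expansions, so they should end up in the same stemmed form as the indexed text.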

I found this presentation
https://www.slideshare.net/BertrandRigaldies/the-solr-multiterms-synonyms-maze-graphs
incredibly helpful, as well as setting up a minimal example of an index
containing only 5 documents.
It may turn out at a later point that I need to use synonyms at index-time
for speed, in which case I would only index the single-word synonyms there,
as suggested by Bertrand Rigaldies in the above presentation.
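
If index-time synonyms do become necessary, that suggestion might look roughly like the index analyzer below. The file name single-word-synonyms.txt is hypothetical (a copy of the synonyms file reduced to single-token entries). FlattenGraphFilter follows the synonym filter, as in the index analyzers from the original question.

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Hypothetical file containing only single-token synonyms. -->
  <filter class="solr.SynonymGraphFilterFactory"
          synonyms="single-word-synonyms.txt" ignoreCase="true" expand="true"/>
  <!-- Graph output is flattened before it is written to the index. -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"/>
</analyzer>
```

The multi-word entries would then stay in the query-time synonyms file only.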

@Mikhail "It's usually tough." I've noticed :)
@Elisabeth: Thank you for your suggestion. From the description, it seems
like this fixes query-time expansion of synonyms, which the
SynonymGraphFilter and the query parser handle correctly in newer Solr
versions.

Best regards,
Annika



On Thu, Mar 7, 2024 at 12:56 PM atin janki <[email protected]> wrote:

Hi Annika,

Can you please share a sample query and how it is being expanded?
Also, please share how you expect it to be expanded.
It would help to replicate your scenario and understand the problem better.

Best Regards,
Atin Janki


On Tue, Mar 5, 2024 at 4:21 PM elisabeth benoit <[email protected]> wrote:

Hello Annika,

For multi-word synonyms, we have been using the
https://github.com/healthonnet/hon-lucene-synonyms
jar, which we just rebuilt with Solr 9.2.1 (a modification is needed; ask
if you ever need details).

It overrides the edismax query parser and expands multi-word synonyms at
query time.

We didn't want to expand synonyms at index time because we had this problem:

in the index: mairie
synonym: hotel de ville

and then at query time, with the query 'hotel', mairie would match.

With hon-lucene, when a user asks for "hotel de ville", we match mairie,
but "hotel" doesn't match mairie.

You might have performance issues with hon-lucene if you have hundreds of
synonyms. But it's worth testing.

Best regards,
Elisabeth

On Mon, Mar 4, 2024 at 17:16, Mikhail Khludnev <[email protected]> wrote:

Hello Annika,
You may use the Solr Admin Analysis page and the debugQuery and explainOther
params to dig into a particular case. It's usually tough.
I've found one clue in the ref guide:
  To get fully correct positional queries when your synonym replacements
  are multiple tokens, you should instead apply synonyms using this filter at
  query time.
Probably you may start from something simple.

On Mon, Mar 4, 2024 at 5:23 PM Annika Gable
<[email protected]> wrote:

Hello,

I'm using Solr 9.1, and I'm trying to set up synonyms. I managed to get
synonyms to work for single-word synonyms, but not for multiword and
hyphenated synonyms.

In the final state, I am planning on having a very extensive synonym file
(hundreds, if not thousands of lines) because I want to always find results
for all child terms and other synonyms of a given search term. This is why
I thought it may make sense to list all synonyms in the index. But getting
it to work with query-time synonym expansion would also be great already.

For now, I am testing with equivalent synonyms. I am always querying using
quotation marks around the multi-word query.

What I have tried:
1. I included sow=false in the query as recommended here:
https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
2. I used the SynonymGraphFilter either only at query time, or at index
time, or both -> I got the same number of results when querying single-word
synonyms, as expected (e.g. TIGIT, domvanalimab), but querying multi-word
synonyms did not find the other synonyms correctly.
3. I made all text fields into a text_field (which uses the
KeywordTokenizer) instead of text_general (which uses the
StandardTokenizer), in order to prevent splitting up multi-word queries. ->
This still did not make multiword synonyms work.
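
For reference, the sow=false parameter from item 1 can be passed per request or set as a default on the request handler. The fragment below is a sketch for solrconfig.xml; the /select handler name and the edismax defType are assumptions and are not taken from the configuration in this thread.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Assumed query parser; use whichever parser the application already relies on. -->
    <str name="defType">edismax</str>
    <!-- Analyze the query text as a whole instead of splitting it on whitespace first. -->
    <str name="sow">false</str>
  </lst>
</requestHandler>
```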


My country-synonyms.txt file looks like this:

TIGIT, domvanalimab, COM902, BMS-986207, Anti-TIGIT Antibody
immuno-oncology, immunooncology
Afghanistan, AF, AFG
Albania, AL, ALB


And the relevant query fields from my schema.xml look like this, with
text_general being the fieldtype of the catchall field:

<fieldType name="text_field" class="solr.TextField"
           positionIncrementGap="100">
     <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SynonymGraphFilterFactory"
                synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
     </analyzer>
     <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SynonymGraphFilterFactory"
                synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
     </analyzer>
</fieldType>
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
     <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory" language="English" />
        <filter class="solr.SynonymGraphFilterFactory"
                synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
     </analyzer>
     <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory" language="English" />
        <filter class="solr.SynonymGraphFilterFactory"
                synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
     </analyzer>
</fieldType>


Any hints would be appreciated!

--
Sincerely yours
Mikhail Khludnev

--
Charlie Hull - Managing Consultant at OpenSource Connections Limited
Founding member of The Search Network and co-author of Searching the Enterprise
tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828

OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
Amtsgericht Charlottenburg | HRB 230712 B
Geschäftsführer: John M. Woodell | David E. Pugh
Finanzamt: Berlin Finanzamt für Körperschaften II
