Hi Annika,
Glad you like Bertrand's Haystack presentation! My colleague Daniel
Wrigley recently wrote an overview blog on synonyms here
https://opensourceconnections.com/blog/2023/03/29/applying-synonyms-types-strategies-tools-and-a-glimpse-into-the-future/
which links to several other synonym blogs on our site.
Best
Charlie
On 08/03/2024 15:09, Annika Gable wrote:
Hi Mikhail, Elisabeth, and Atin,
Thank you for your inputs. After working on this issue for days, I finally
found the main culprits, and I've taken the following steps:
1. Doing synonym expansion only at query-time, not at index time, in order
to get correct multi-word synonyms.
2. Using WhitespaceTokenizer instead of StandardTokenizer or
KeywordTokenizer. Otherwise, StandardTokenizer always splits hyphenated
words like immuno-oncology into immuno and oncology, which are not found in
the synonym definitions, and KeywordTokenizer keeps the whole field value as
a single token.
3. Using the SnowballPorterFilter for stemming only after
SynonymGraphFilter => otherwise, immuno-oncology will be stemmed into
immuno-oncolog, which does not match the immuno-oncology in the synonyms.txt
file.
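A minimal sketch of how the three steps above might look as a field type. The type name text_synonyms is hypothetical, and the synonyms file name is the one from the original question; this is an illustration, not the exact schema in use:

```xml
<!-- Hypothetical field type; the name "text_synonyms" is an assumption. -->
<fieldType name="text_synonyms" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Whitespace tokenization keeps immuno-oncology as a single token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- No synonym filter at index time; only stemming. -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Synonyms are expanded before stemming, so entries match unstemmed tokens. -->
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

One trade-off to keep in mind: the whitespace tokenizer leaves punctuation attached to tokens, so a token such as "oncology," (with the trailing comma) would not match a synonym entry unless it is cleaned up separately.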
I found this presentation
https://www.slideshare.net/BertrandRigaldies/the-solr-multiterms-synonyms-maze-graphs
incredibly helpful, as well as setting up a minimal example of an index
containing only 5 documents.
It may turn out at a later point that I need to use synonyms at index-time
for speed, in which case I would only index the single-word synonyms there,
as suggested by Bertrand Rigaldies in the above presentation.
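If index-time expansion does become necessary, one possible shape is an index analyzer that loads only the single-word entries. In this sketch, single-word-synonyms.txt is a hypothetical file holding only the single-token synonym lines, while the multi-word entries would stay in the query-time file:

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Hypothetical file containing only single-token synonyms. -->
  <filter class="solr.SynonymGraphFilterFactory"
          synonyms="single-word-synonyms.txt" ignoreCase="true" expand="true"/>
  <!-- A graph filter used at index time must be followed by FlattenGraphFilter. -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"/>
</analyzer>
```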
@Mikhail "It's usually tough." I've noticed :)
@Elisabeth: Thank you for your suggestion. From the description, it seems
like this fixes query-time expansion of synonyms, which the
SynonymGraphFilter and the query parser handle correctly in newer Solr
versions.
Best regards,
Annika
On Thu, Mar 7, 2024 at 12:56 PM atin janki <[email protected]> wrote:
Hi Annika,
Can you please share a sample query and how it is being expanded?
Also, please share how you expect it to be expanded.
It would help to replicate your scenario and understand the problem better.
Best Regards,
Atin Janki
On Tue, Mar 5, 2024 at 4:21 PM elisabeth benoit <[email protected]> wrote:
Hello Annika,
For multi-word synonyms, we have been using the
https://github.com/healthonnet/hon-lucene-synonyms
jar, which we just rebuilt against Solr 9.2.1 (a modification is needed; let
me know if you ever need details).
It overrides the edismax query parser and expands multi-word synonyms at
query time.
We didn't want to expand synonyms at index time because we had this problem:
in the index: mairie
synonym: hotel de ville
and then at query time, with the query 'hotel', mairie would match.
With hon-lucene, when a user asks for "hotel de ville", we match
mairie, but "hotel" doesn't match mairie.
You might have performance issues with hon-lucene if you have hundreds of
synonyms. But it's worth testing.
Best regards,
Elisabeth
On Mon, Mar 4, 2024 at 5:16 PM Mikhail Khludnev <[email protected]> wrote:
Hello Annika,
You may use the Solr Admin Analysis page, and the debugQuery and explainOther
params, to dig into a particular case. It's usually tough.
I've found one clue in the ref guide:
"To get fully correct positional queries when your synonym replacements are
multiple tokens, you should instead apply synonyms using this filter at
query time."
You might start with something simple.
On Mon, Mar 4, 2024 at 5:23 PM Annika Gable <[email protected]> wrote:
Hello,
I'm using Solr 9.1, and I'm trying to set up synonyms. I managed to get
synonyms to work for single-word synonyms, but not for multiword and
hyphenated synonyms.
In the final state, I am planning on having a very extensive synonym file
(hundreds, if not thousands of lines) because I want to always find results
for all child terms and other synonyms of a given search term. This is why
I thought it may make sense to list all synonyms in the index. But getting
it to work with query-time synonym expansion would also be great already.
For now, I am testing with equivalent synonyms. I am always querying using
quotation marks around the multi-word query.
What I have tried:
1. I included sow=false in the query as recommended here
https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
2. I used the SynonymGraphFilter either only at query time, or at index
time, or both -> I got the same number of results when querying single-word
synonyms, as expected (e.g. TIGIT, domvanalimab), but querying multi-word
synonyms did not find the other synonyms correctly.
3. I made all text fields into a text_field (which uses the
KeywordTokenizer) instead of text_general (which uses the
StandardTokenizer), in order to prevent splitting up multi-word queries. ->
This still did not make multiword-synonyms work.
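For reference, the sow=false parameter from step 1 can also be set as a request handler default instead of being passed with every query. A minimal sketch for solrconfig.xml, assuming the stock /select handler and the edismax parser (both are assumptions, not taken from the setup described here):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Assumed parser; sow is also accepted by the standard lucene parser. -->
    <str name="defType">edismax</str>
    <!-- Do not split the query on whitespace before analysis,
         so the synonym filter can see multi-word terms together. -->
    <str name="sow">false</str>
  </lst>
</requestHandler>
```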
My country-synonyms.txt file looks like this:
TIGIT, domvanalimab, COM902, BMS-986207, Anti-TIGIT Antibody
immuno-oncology, immunooncology
Afghanistan, AF, AFG
Albania, AL, ALB
And the relevant query fields from my schema.xml look like this, with
text_general being the fieldtype of the catchall field
<fieldType name="text_field" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
Any hints would be appreciated!
--
Sincerely yours
Mikhail Khludnev
--
Charlie Hull - Managing Consultant at OpenSource Connections Limited
Founding member of The Search Network and co-author of Searching the Enterprise