Re: Solr 6.4 new SynonymGraphFilter help for multi-word synonyms

2017-02-02 Thread Cliff Dickinson
Steve and Shawn, thanks for your replies/explanations!

I eagerly await the completion of the Solr JIRA ticket referenced above in
a future release.  Many thanks for addressing this challenge that has had
me banging my head against my desk off and on for the last couple years!

Cliff

On Thu, Feb 2, 2017 at 1:01 PM, Steve Rowe <sar...@gmail.com> wrote:

> Hi Cliff,
>
> The Solr query parsers (standard/“Lucene” and e/dismax anyway) have a
> problem that prevents SynonymGraphFilter from working: the text fed to your
> query analyzer is first split on whitespace.  So e.g. a query containing
> “United States” will never match the multi-word synonym “United States” -> “US”,
> since the analyzer will first see “United” and then, separately, “States”.
>
> I fixed the whitespace splitting problem in the classic Lucene query
> parser in <https://issues.apache.org/jira/browse/LUCENE-2605>.  (Note
> that this is *not* the same as Solr’s standard/“Lucene” query parser, which
> is actually a fork of Lucene’s query parser with added functionality.)
>
> There is a Solr JIRA I’m working on to fix the whitespace splitting
> problem: <https://issues.apache.org/jira/browse/SOLR-9185>.  I hope to
> get it committed in time for inclusion in Solr 6.5.
>
> --
> Steve
> www.lucidworks.com
>
> > On Feb 2, 2017, at 9:50 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> >
> > On 2/2/2017 7:36 AM, Cliff Dickinson wrote:
> >> The SynonymGraphFilter API documentation contains the following statement
> >> at the end:
> >>
> >> "To get fully correct positional queries when your synonym replacements are
> >> multiple tokens, you should instead apply synonyms using this TokenFilter
> >> at query time and translate the resulting graph to a TermAutomatonQuery
> >> e.g. using TokenStreamToTermAutomatonQuery."
> >
> > Lucene is a programming API for search.  That documentation is intended
> > for people who are writing Lucene programs.  Those users would be
> > constructing query objects in their own code, so they would most likely
> > know exactly which object needs to be changed to TermAutomatonQuery.
> >
> > Solr is a Lucene program ... and an immensely complicated one.  Many
> > Lucene improvements require changes in the end program for full
> > support.  I suspect that Solr's capability has not been updated to use
> > this new feature in Lucene.  I cannot say for sure; I hope someone who
> > is familiar with this Lucene change and Solr internals can comment.
> >
> > Thanks,
> > Shawn
> >
>
>
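
The whitespace-splitting problem Steve describes above can be illustrated with a toy model. This is a minimal sketch in plain Java, not Solr or Lucene code: the class name, the rule table, and the methods `analyze`, `parsePreSplit` and `parseWhole` are all hypothetical and exist only for this illustration. It assumes a single replacement rule, "united states" -> "us", and shows that the rule can only fire when the analyzer receives both words in one call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Toy model of query-time synonym matching; NOT Solr or Lucene code. */
final class WhitespaceSplitDemo {

    // Hypothetical rule table: multi-word phrase -> single replacement token.
    static final Map<List<String>, String> RULES =
            Map.of(List.of("united", "states"), "us");

    /** Analyzes one piece of text, trying the longest phrase first at each position. */
    static List<String> analyze(String text) {
        List<String> tokens = Arrays.asList(text.toLowerCase().trim().split("\\s+"));
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            boolean matched = false;
            for (int end = tokens.size(); end > i; end--) {
                String replacement = RULES.get(tokens.subList(i, end));
                if (replacement != null) {
                    out.add(replacement);
                    i = end;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    /** Mimics the behaviour described in the thread: split on whitespace, analyze each chunk alone. */
    static List<String> parsePreSplit(String query) {
        List<String> out = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            out.addAll(analyze(chunk));
        }
        return out;
    }

    /** Hands the whole query text to the analyzer in a single call. */
    static List<String> parseWhole(String query) {
        return analyze(query);
    }

    public static void main(String[] args) {
        System.out.println(parsePreSplit("United States")); // [united, states]
        System.out.println(parseWhole("United States"));    // [us]
    }
}
```

In the pre-split case the analyzer is only ever given one word at a time, so a two-word rule has nothing to match against, regardless of how the synonym filter itself is implemented.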


Solr 6.4 new SynonymGraphFilter help for multi-word synonyms

2017-02-02 Thread Cliff Dickinson
I've been eagerly awaiting the release of the new SynonymGraphFilter in
Solr 6.4.  We have the need to support multi-word synonyms, which were
always problematic with the old SynonymFilterFactory.  I've upgraded to
Solr 6.4 and replaced the old filter with the new one, but am not seeing
the results that I had hoped for yet.  I suspect my configuration is
lacking something important.

I'm starting with the simple example in the SynonymGraphFilterFactory API
documentation:

[XML configuration example not preserved in the archive]

An example entry in the synonyms.txt file is:

booster, representative of athletics interest

My problem with the old filter has always been that if I run a query for
"booster", I get results containing any of the following words: booster,
representative, athletics, interest.  This is way more results than I
want.  A document that contains only "athletics", but none of the other words
in the synonym, is returned.  What I really want are documents that contain
"booster" or the full synonym phrase of "representative of athletics
interest".  How could I accomplish this?
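
To make the difference between the two behaviours concrete, here is a minimal sketch in plain Java. It is a toy model, not how Solr or Lucene actually score or match documents: the class and method names are hypothetical, documents are assumed to be plain whitespace-separated text, and the word list in `BROAD` is simply the four words listed above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy comparison of "any word matches" vs. "term OR whole phrase"; NOT Solr code. */
final class SynonymMatchDemo {

    // The four words that the over-broad behaviour ends up matching individually.
    static final List<String> BROAD =
            List.of("booster", "representative", "athletics", "interest");

    // The full multi-word synonym, in order.
    static final List<String> PHRASE =
            List.of("representative", "of", "athletics", "interest");

    /** Assumption: lowercase and split on whitespace; no punctuation handling. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    /** Over-broad behaviour: a document matches if it contains any one of the words. */
    static boolean matchesAnyWord(String doc) {
        List<String> tokens = tokenize(doc);
        for (String word : BROAD) {
            if (tokens.contains(word)) {
                return true;
            }
        }
        return false;
    }

    /** Desired behaviour: "booster", or the complete phrase as consecutive tokens. */
    static boolean matchesTermOrPhrase(String doc) {
        List<String> tokens = tokenize(doc);
        return tokens.contains("booster")
                || Collections.indexOfSubList(tokens, PHRASE) >= 0;
    }

    public static void main(String[] args) {
        String doc = "the athletics department";
        System.out.println(matchesAnyWord(doc));      // true
        System.out.println(matchesTermOrPhrase(doc)); // false
    }
}
```

Written out by hand in query syntax, the desired behaviour corresponds roughly to `booster OR "representative of athletics interest"`; the question is how to get the synonym filter to produce that automatically.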

The SynonymGraphFilter API documentation contains the following statement
at the end:

"To get fully correct positional queries when your synonym replacements are
multiple tokens, you should instead apply synonyms using this TokenFilter
at query time and translate the resulting graph to a TermAutomatonQuery
e.g. using TokenStreamToTermAutomatonQuery."

How do I use TokenStreamToTermAutomatonQuery?  Can it be configured in Solr,
or can it only be used by writing code against Lucene?  Would this even
address my issue?

I've found synonyms to be very frustrating in Solr and am hoping this new
filter will be a big improvement.  Thanks in advance for the help!