Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-28 Thread Doug Turnbull
I like that idea, Alan. The trick is that for QueryBuilder's 'newSynonymQuery' to
be useful in that context, you need to pass terms with metadata down to the
subclass. This is what I started working on a few weeks ago:

https://github.com/o19s/lucene-solr/commit/0fc3930671ef002cfbb5e3d52b6f8edc3715bf14

I don't think it's as simple as overriding analyzeBoolean/analyzeMultiBoolean
as Rob suggests, as there's also analyzeGraphBoolean and others that would
also need to collect this metadata. I wouldn't want to copy-paste all this
code into a subclass just to add one token attribute.
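
A minimal sketch of what "passing terms with metadata down to the subclass" could look like, written with plain Java stand-ins rather than Lucene classes. TypedTerm, SketchQueryBuilder and OriginalFirstQueryBuilder are hypothetical names, queries are rendered as strings, and the 0.5 weight is an arbitrary assumption for illustration. It is not the code in the commit linked above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a query term plus the token type that analysis assigned to it.
final class TypedTerm {
    final String text;
    final String type; // e.g. "word" for an original token, "SYNONYM" for an inserted one

    TypedTerm(String text, String type) {
        this.text = text;
        this.type = type;
    }
}

// Hypothetical simplified builder, NOT Lucene's QueryBuilder. Queries are plain strings here.
class SketchQueryBuilder {
    // Hook that receives every term at one position, with its metadata.
    protected String newSynonymQuery(List<TypedTerm> terms) {
        List<String> texts = new ArrayList<>();
        for (TypedTerm t : terms) {
            texts.add(t.text);
        }
        return "Synonym(" + String.join(" ", texts) + ")";
    }

    // Single entry point that hands the collected metadata to the hook,
    // so a subclass only has to override newSynonymQuery.
    final String build(List<TypedTerm> termsAtPosition) {
        return newSynonymQuery(termsAtPosition);
    }
}

// A subclass that treats the original term and the generated terms differently.
class OriginalFirstQueryBuilder extends SketchQueryBuilder {
    @Override
    protected String newSynonymQuery(List<TypedTerm> terms) {
        List<String> clauses = new ArrayList<>();
        for (TypedTerm t : terms) {
            // Assumed policy for illustration: inserted synonyms get a lower weight.
            clauses.add("SYNONYM".equals(t.type) ? t.text + "^0.5" : t.text);
        }
        return "(" + String.join(" OR ", clauses) + ")";
    }
}
```

The point of the design is that the metadata is gathered once by the base class, so the subclass does not need to copy any of the analysis loops.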

-Doug



On Wed, Nov 28, 2018 at 12:25 PM Alan Woodward  wrote:

> I think we can expose this information now with a small tweak to the
> SynonymGraphFilter, using the already-existing TypeAttribute.
>
> SGF is hard-coded to set the type attribute to “SYNONYM” on all tokens
> that it inserts into the stream.  It should be simple to add another
> constructor parameter allowing users to change this; then you can chain
> synonym filters, one for each type of expansion you want: synonym, hyponym,
> hypernym, whatever, each setting the type attribute differently.
>
> > On 28 Nov 2018, at 15:59, Michael Gibney wrote:
> >
> > I think the objection to "boosting" in token filters isn't because it
> > is "too much", but rather because it breaks the abstraction of the
> > analysis chain to directly target scoring (as implied by
> > characterizing as "boosting").
> >
> > That said, I'm sympathetic to an approach that would establish an
> > Attribute to expose the kind of information that would be useful in
> > the context of synonyms (or other sorts of derived tokens discussed
> > here, where it could be useful to express information about token
> > derivation). Such an Attribute would not be directly related to
> > scoring/boosting, but would be related to analysis per se, (e.g.,
> > source token text, thesaurus, degree of confidence, etc.); support
> > could be selectively implemented by TokenFilters, and optionally
> > leveraged by query builders (e.g., translated to boosts) or even
> > recorded to index Payloads by a final custom analysis component 
> >
> > "You can look at any attribute on the tokenstream you want", "rely on
> > abstract attributes (type, ...) then it should be easy to sub-class
> > the query builder to access them".  Obviously that works iff analysis
> > components record the relevant information in attributes on the
> > tokenstream, which I think they currently don't (for much of the
> > information that has been discussed here) ... and I know of no
> > standard way to express the relevant information on the tokenstream.
> >
> > I can see that such an Attribute would be out of place (too
> > specialized) in the context of the Attributes in lucene/core; but
> > there are lots of more specialized Attributes in the various
> > submodules under lucene/analysis/* (SynonymGraphFilter lives in
> > analysis-common, FWIW). Again, this doesn't strike me as terribly
> > specialized, if one thinks of it more generally as a
> > "derivation/relationship" Attribute.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >
>
>
> --
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug


Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-28 Thread Alan Woodward
I think we can expose this information now with a small tweak to the 
SynonymGraphFilter, using the already-existing TypeAttribute.

SGF is hard-coded to set the type attribute to “SYNONYM” on all tokens that it 
inserts into the stream.  It should be simple to add another constructor 
parameter allowing users to change this; then you can chain synonym filters, 
one for each type of expansion you want: synonym, hyponym, hypernym, whatever, 
each setting the type attribute differently.
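
A rough sketch of the chaining idea, using a toy filter instead of SynonymGraphFilter. Tok and ExpansionFilter are hypothetical stand-ins, and the constructor parameter for the type label is exactly the part that does not exist yet in SGF, so this only illustrates the intended behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical token: text, a type label, and whether it is stacked on the previous token.
final class Tok {
    final String text;
    final String type;
    final boolean stacked; // true corresponds to a position increment of 0

    Tok(String text, String type, boolean stacked) {
        this.text = text;
        this.type = type;
        this.stacked = stacked;
    }

    @Override
    public String toString() {
        return text + "/" + type;
    }
}

// Toy expansion filter, NOT SynonymGraphFilter. The inserted type is a constructor parameter.
final class ExpansionFilter {
    private final Map<String, List<String>> expansions;
    private final String insertedType;

    ExpansionFilter(Map<String, List<String>> expansions, String insertedType) {
        this.expansions = expansions;
        this.insertedType = insertedType;
    }

    List<Tok> apply(List<Tok> input) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : input) {
            out.add(t); // existing tokens keep whatever type they already have
            // Only expand original (unstacked) tokens, so chained filters
            // do not expand each other's output.
            if (!t.stacked && expansions.containsKey(t.text)) {
                for (String e : expansions.get(t.text)) {
                    out.add(new Tok(e, insertedType, true));
                }
            }
        }
        return out;
    }
}
```

Chaining one filter configured with "SYNONYM" and another with "HYPERNYM" leaves each inserted token tagged with the kind of expansion that produced it, which a query builder can then inspect.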

> On 28 Nov 2018, at 15:59, Michael Gibney  wrote:
> 
> I think the objection to "boosting" in token filters isn't because it
> is "too much", but rather because it breaks the abstraction of the
> analysis chain to directly target scoring (as implied by
> characterizing as "boosting").
> 
> That said, I'm sympathetic to an approach that would establish an
> Attribute to expose the kind of information that would be useful in
> the context of synonyms (or other sorts of derived tokens discussed
> here, where it could be useful to express information about token
> derivation). Such an Attribute would not be directly related to
> scoring/boosting, but would be related to analysis per se, (e.g.,
> source token text, thesaurus, degree of confidence, etc.); support
> could be selectively implemented by TokenFilters, and optionally
> leveraged by query builders (e.g., translated to boosts) or even
> recorded to index Payloads by a final custom analysis component 
> 
> "You can look at any attribute on the tokenstream you want", "rely on
> abstract attributes (type, ...) then it should be easy to sub-class
> the query builder to access them".  Obviously that works iff analysis
> components record the relevant information in attributes on the
> tokenstream, which I think they currently don't (for much of the
> information that has been discussed here) ... and I know of no
> standard way to express the relevant information on the tokenstream.
> 
> I can see that such an Attribute would be out of place (too
> specialized) in the context of the Attributes in lucene/core; but
> there are lots of more specialized Attributes in the various
> submodules under lucene/analysis/* (SynonymGraphFilter lives in
> analysis-common, FWIW). Again, this doesn't strike me as terribly
> specialized, if one thinks of it more generally as a
> "derivation/relationship" Attribute.
> 





Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-28 Thread Michael Gibney
I think the objection to "boosting" in token filters isn't because it
is "too much", but rather because it breaks the abstraction of the
analysis chain to directly target scoring (as implied by
characterizing as "boosting").

That said, I'm sympathetic to an approach that would establish an
Attribute to expose the kind of information that would be useful in
the context of synonyms (or other sorts of derived tokens discussed
here, where it could be useful to express information about token
derivation). Such an Attribute would not be directly related to
scoring/boosting, but would be related to analysis per se, (e.g.,
source token text, thesaurus, degree of confidence, etc.); support
could be selectively implemented by TokenFilters, and optionally
leveraged by query builders (e.g., translated to boosts) or even
recorded to index Payloads by a final custom analysis component 

"You can look at any attribute on the tokenstream you want", "rely on
abstract attributes (type, ...) then it should be easy to sub-class
the query builder to access them".  Obviously that works iff analysis
components record the relevant information in attributes on the
tokenstream, which I think they currently don't (for much of the
information that has been discussed here) ... and I know of no
standard way to express the relevant information on the tokenstream.

I can see that such an Attribute would be out of place (too
specialized) in the context of the Attributes in lucene/core; but
there are lots of more specialized Attributes in the various
submodules under lucene/analysis/* (SynonymGraphFilter lives in
analysis-common, FWIW). Again, this doesn't strike me as terribly
specialized, if one thinks of it more generally as a
"derivation/relationship" Attribute.




Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-22 Thread jim ferenczi
My proposal was to tweak the boosting directly in the token filters through
a single Attribute, but if we feel that it is too much to add to the
analysis chain, I agree that we don't need to add any API. If you rely on
abstract attributes (type, ...) then it should be easy to sub-class the
query builder to access them and implement the logic you want there.

On Thu, Nov 22, 2018 at 13:18, Robert Muir wrote:

> There is already analyzeBoolean/analyzeMultiBoolean there that you can
> use for this. You can look at any attribute on the tokenstream you
> want. I don't see any need to add any more API.
>
> On 11/21/18, Doug Turnbull  wrote:
> > I agree there is a tension between analysis and query parser
> > responsibilities (or external to how queries are constructed). I wonder
> > what you'd think of making QueryBuilder more easily subclassible by
> passing
> > more term metadata to newSynonymQuery (such as types etc). This would let
> > you select an alt strategy (such as some of the scoring systems used in
> the
> > query expansion paper https://arxiv.org/pdf/1708.00247.pdf). Or doing
> > something with a term labeled a hyponym/hypernym in a QueryBuilder
> > subclass..
> >
> > -Doug
> >
> > On Wed, Nov 21, 2018 at 8:09 AM Robert Muir  wrote:
> >
> >> I don't think we should put scoring stuff into the analysis chain like
> >> this. It already has a laundry list of responsibilities.
> >>
> >> Analysis chain can tell you the term is stacked or its a certain type
> >> or occurs a certain number of times, but it shouldn't be supplying
> >> things such as floating point boosts. That kind of scoring
> >> manipulation needs to really happen in query parsing/somewhere else.
> >>
> >> On 11/20/18, jim ferenczi  wrote:
> >> > Sorry for the late reply,
> >> >
> >> >> So perhaps one way forward to contribute this sort of thing into
> >> >> Lucene
> >> > is we could implement additional QueryBuilder implementations that
> >> provide
> >> > such functionality?
> >> >
> >> > I am not sure, I mentioned Solr and ES because I thought it was about
> >> > adding taxonomies and complex expansion mechanisms to query builders
> >> > but
> >> I
> >> > wonder if we can have a simple
> >> > mechanism to just (de)boost stacked tokens in the QueryBuilder. It
> >> > could
> >> be
> >> > a new attribute that token filters would use when they produce stacked
> >> > tokens and that the QueryBuilder checks when he builds the
> >> > SynonymQuery.
> >> We
> >> > already have a TermFrequencyAttribute to alter the frequency of a term
> >> when
> >> > indexing so we could have the same mechanism for query term boosting ?
> >> >
> >> > On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
> >> > dturnb...@opensourceconnections.com> wrote:
> >> >
> >> >> Thanks Jim
> >> >>
> >> >> Yeah, now that I think about it - I agree that perhaps the simplest
> >> >> option
> >> >> would to create alternate query builders. I think there's a couple of
> >> >> enhancement to the base class that would be nice, such as
> >> >> - Some additional token attributes passed to newSynonymQuery, such as
> >> the
> >> >> type (was this a synonym or hyponym or something else...)
> >> >> - The ability to differentiate between the original query term and
> the
> >> >> generated synonym terms
> >> >> - Consistent support for phrases
> >> >>
> >> >> I think part of my goal too is to help people without the use of
> >> plugins.
> >> >> As we often are in scenarios at OpenSource Connections where people
> >> won't
> >> >> be able to use a plugin. In this case alternate expansions around
> >> >> hypernyms/hyponyms/?... are a pretty frequent gap that search teams
> >> >> have
> >> >> using Solr/Lucene/ES.
> >> >>
> >> >> So perhaps one way forward to contribute this sort of thing into
> >> >> Lucene
> >> >> is
> >> >> we could implement additional QueryBuilder implementations that
> >> >> provide
> >> >> such functionality?
> >> >>
> >> >> Thanks
> >> >> -Doug
> >> >>
> >> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi wrote:
> >> >>
> >> >>> You can easily customize the query that is used for synonyms in a
> >> custom
> >> >>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
> >> >>> intended for subclasses that wish to customize the generated
> >> >>> queries."
> >> so
> >> >>> I
> >> >>> don't think we need to do anything there. I agree that it is
> >> >>> sometimes
> >> >>> better to use something different than the SynonymQuery but in the
> >> >>> general
> >> >>> case it works as expected and can be combined with other terms
> >> >>> naturally.
> >> >>> The kind of customization you want to achieve could be done in a
> >> >>> plugin
> >> >>> (or
> >> >>> in Solr or ES) that extends the QueryBuilder, you can also use
> custom
> >> >>> token
> >> >>> filters and alter the query the way you want. My point here is that
> >> >>> the
> >> >>> QueryBuilder should remain simple, you can add the complexity you
> >> >>> want
> >> in
> >> >>> a
> >> >>> 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-22 Thread Robert Muir
There is already analyzeBoolean/analyzeMultiBoolean there that you can
use for this. You can look at any attribute on the tokenstream you
want. I don't see any need to add any more API.
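
To make the "look at any attribute on the tokenstream" point concrete, here is a simplified sketch of the kind of loop involved: tokens are grouped into positions by their position increment, and the type of each stacked token is available when the clause for that position is built. StreamToken and PositionGrouper are hypothetical plain-Java stand-ins, not the actual QueryBuilder code, and the string output is only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token carrying the three attributes the loop reads.
final class StreamToken {
    final String term;
    final int posInc; // 0 means "stacked on the previous token"
    final String type;

    StreamToken(String term, int posInc, String type) {
        this.term = term;
        this.posInc = posInc;
        this.type = type;
    }
}

final class PositionGrouper {
    // Groups tokens into positions: a non-zero increment starts a new position.
    static List<List<StreamToken>> group(List<StreamToken> stream) {
        List<List<StreamToken>> positions = new ArrayList<>();
        for (StreamToken t : stream) {
            if (t.posInc != 0 || positions.isEmpty()) {
                positions.add(new ArrayList<>());
            }
            positions.get(positions.size() - 1).add(t);
        }
        return positions;
    }

    // Renders one clause per position; stacked positions list each term with its type.
    static String describe(List<StreamToken> stream) {
        List<String> clauses = new ArrayList<>();
        for (List<StreamToken> pos : group(stream)) {
            if (pos.size() == 1) {
                clauses.add(pos.get(0).term);
                continue;
            }
            List<String> parts = new ArrayList<>();
            for (StreamToken t : pos) {
                parts.add(t.term + ":" + t.type);
            }
            clauses.add("Synonym(" + String.join(", ", parts) + ")");
        }
        return String.join(" ", clauses);
    }
}
```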

On 11/21/18, Doug Turnbull  wrote:
> I agree there is a tension between analysis and query parser
> responsibilities (or external to how queries are constructed). I wonder
> what you'd think of making QueryBuilder more easily subclassible by passing
> more term metadata to newSynonymQuery (such as types etc). This would let
> you select an alt strategy (such as some of the scoring systems used in the
> query expansion paper https://arxiv.org/pdf/1708.00247.pdf). Or doing
> something with a term labeled a hyponym/hypernym in a QueryBuilder
> subclass..
>
> -Doug
>
> On Wed, Nov 21, 2018 at 8:09 AM Robert Muir  wrote:
>
>> I don't think we should put scoring stuff into the analysis chain like
>> this. It already has a laundry list of responsibilities.
>>
>> Analysis chain can tell you the term is stacked or its a certain type
>> or occurs a certain number of times, but it shouldn't be supplying
>> things such as floating point boosts. That kind of scoring
>> manipulation needs to really happen in query parsing/somewhere else.
>>
>> On 11/20/18, jim ferenczi  wrote:
>> > Sorry for the late reply,
>> >
>> >> So perhaps one way forward to contribute this sort of thing into
>> >> Lucene
>> > is we could implement additional QueryBuilder implementations that
>> provide
>> > such functionality?
>> >
>> > I am not sure, I mentioned Solr and ES because I thought it was about
>> > adding taxonomies and complex expansion mechanisms to query builders
>> > but
>> I
>> > wonder if we can have a simple
>> > mechanism to just (de)boost stacked tokens in the QueryBuilder. It
>> > could
>> be
>> > a new attribute that token filters would use when they produce stacked
>> > tokens and that the QueryBuilder checks when he builds the
>> > SynonymQuery.
>> We
>> > already have a TermFrequencyAttribute to alter the frequency of a term
>> when
>> > indexing so we could have the same mechanism for query term boosting ?
>> >
>> > On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
>> > dturnb...@opensourceconnections.com> wrote:
>> >
>> >> Thanks Jim
>> >>
>> >> Yeah, now that I think about it - I agree that perhaps the simplest
>> >> option
>> >> would to create alternate query builders. I think there's a couple of
>> >> enhancement to the base class that would be nice, such as
>> >> - Some additional token attributes passed to newSynonymQuery, such as
>> the
>> >> type (was this a synonym or hyponym or something else...)
>> >> - The ability to differentiate between the original query term and the
>> >> generated synonym terms
>> >> - Consistent support for phrases
>> >>
>> >> I think part of my goal too is to help people without the use of
>> plugins.
>> >> As we often are in scenarios at OpenSource Connections where people
>> won't
>> >> be able to use a plugin. In this case alternate expansions around
>> >> hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>> >> have
>> >> using Solr/Lucene/ES.
>> >>
>> >> So perhaps one way forward to contribute this sort of thing into
>> >> Lucene
>> >> is
>> >> we could implement additional QueryBuilder implementations that
>> >> provide
>> >> such functionality?
>> >>
>> >> Thanks
>> >> -Doug
>> >>
>> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
>> >> wrote:
>> >>
>> >>> You can easily customize the query that is used for synonyms in a
>> custom
>> >>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
>> >>> intended for subclasses that wish to customize the generated
>> >>> queries."
>> so
>> >>> I
>> >>> don't think we need to do anything there. I agree that it is
>> >>> sometimes
>> >>> better to use something different than the SynonymQuery but in the
>> >>> general
>> >>> case it works as expected and can be combined with other terms
>> >>> naturally.
>> >>> The kind of customization you want to achieve could be done in a
>> >>> plugin
>> >>> (or
>> >>> in Solr or ES) that extends the QueryBuilder, you can also use custom
>> >>> token
>> >>> filters and alter the query the way you want. My point here is that
>> >>> the
>> >>> QueryBuilder should remain simple, you can add the complexity you
>> >>> want
>> in
>> >>> a
>> >>> subclass.
>> >>> However I think there is another area we need to fix, the scoring of
>> >>> multi-terms synonyms is broken (compared to the SynonymQuery) and
>> >>> could
>> >>> be
>> >>> improved so we need something similar than the SynonymQuery that
>> handles
>> >>> multi phrases.
>> >>>
>> >>>
>> >>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
>> >>> dturnb...@opensourceconnections.com> wrote:
>> >>>
>> >>>> Yes that is another good area (there are many). Although of course
>> >>>> embeddings have their own challenges and complexities. (they often
>> >>>> capture
>> >>>> shared context, but not shared meaning).
>> >>>>
>> >>>> It's a data 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
There are a lot of different topics and ideas here, so we captured the use
cases we see being discussed in this Google doc:
https://docs.google.com/document/d/1w4G9bEICJ1aarr3l7OodwR5aecPkbFTISOgymErpZfQ/edit#heading=h.pszpx5dpxq7a

Basically, we've seen 5 high-level use cases discussed:
- Alt Labels (what SynonymQuery does well now)
- Synonyms (looser synonyms with close meaning that need to be scored
somehow - `notebook,laptop`)
- Taxonomies (hierarchies of concepts/terms `dress shoes\oxfords`)
- Ontologies / Knowledge Graphs (networks of concepts)
- Embeddings (distributed representations of a term)

It's a doc in progress; embeddings need more work and are probably the
hardest thing on the list. There's possible other

The goal isn't so much to make Lucene implement all of these (it would
create a lot of maintenance headaches to shove this all in), but some of it
is just defining practices / patterns / tools that enable these things in
Lucene-based search. Some may require no work, or some may require
supporting functionality.

-Doug

On Wed, Nov 21, 2018 at 9:23 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> I agree there is a tension between analysis and query parser
> responsibilities (or external to how queries are constructed). I wonder
> what you'd think of making QueryBuilder more easily subclassible by passing
> more term metadata to newSynonymQuery (such as types etc). This would let
> you select an alt strategy (such as some of the scoring systems used in the
> query expansion paper https://arxiv.org/pdf/1708.00247.pdf). Or doing
> something with a term labeled a hyponym/hypernym in a QueryBuilder
> subclass..
>
> -Doug
>
> On Wed, Nov 21, 2018 at 8:09 AM Robert Muir  wrote:
>
>> I don't think we should put scoring stuff into the analysis chain like
>> this. It already has a laundry list of responsibilities.
>>
>> Analysis chain can tell you the term is stacked or its a certain type
>> or occurs a certain number of times, but it shouldn't be supplying
>> things such as floating point boosts. That kind of scoring
>> manipulation needs to really happen in query parsing/somewhere else.
>>
>> On 11/20/18, jim ferenczi  wrote:
>> > Sorry for the late reply,
>> >
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> > is we could implement additional QueryBuilder implementations that
>> provide
>> > such functionality?
>> >
>> > I am not sure, I mentioned Solr and ES because I thought it was about
>> > adding taxonomies and complex expansion mechanisms to query builders
>> but I
>> > wonder if we can have a simple
>> > mechanism to just (de)boost stacked tokens in the QueryBuilder. It
>> could be
>> > a new attribute that token filters would use when they produce stacked
>> > tokens and that the QueryBuilder checks when he builds the
>> SynonymQuery. We
>> > already have a TermFrequencyAttribute to alter the frequency of a term
>> when
>> > indexing so we could have the same mechanism for query term boosting ?
>> >
>> > On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
>> > dturnb...@opensourceconnections.com> wrote:
>> >
>> >> Thanks Jim
>> >>
>> >> Yeah, now that I think about it - I agree that perhaps the simplest
>> >> option
>> >> would to create alternate query builders. I think there's a couple of
>> >> enhancement to the base class that would be nice, such as
>> >> - Some additional token attributes passed to newSynonymQuery, such as
>> the
>> >> type (was this a synonym or hyponym or something else...)
>> >> - The ability to differentiate between the original query term and the
>> >> generated synonym terms
>> >> - Consistent support for phrases
>> >>
>> >> I think part of my goal too is to help people without the use of
>> plugins.
>> >> As we often are in scenarios at OpenSource Connections where people
>> won't
>> >> be able to use a plugin. In this case alternate expansions around
>> >> hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>> have
>> >> using Solr/Lucene/ES.
>> >>
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> >> is
>> >> we could implement additional QueryBuilder implementations that provide
>> >> such functionality?
>> >>
>> >> Thanks
>> >> -Doug
>> >>
>> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
>> >> wrote:
>> >>
>> >>> You can easily customize the query that is used for synonyms in a
>> custom
>> >>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
>> >>> intended for subclasses that wish to customize the generated
>> queries." so
>> >>> I
>> >>> don't think we need to do anything there. I agree that it is sometimes
>> >>> better to use something different than the SynonymQuery but in the
>> >>> general
>> >>> case it works as expected and can be combined with other terms
>> >>> naturally.
>> >>> The kind of customization you want to achieve could be done in a
>> plugin
>> >>> (or
>> >>> in Solr or ES) that extends the QueryBuilder, you can 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Michael Gibney
On the analysis chain side, could the desired functionality be scoped to:
providing a framework (Attribute?) to express information about the
relationship between a derived token and its corresponding input? For
example, one might include information about:
1. corresponding input token (i.e., input token text?)
2. relationship between derived token and input (e.g., synonym, hyponym,
hypernym ... but perhaps not limited to these)
3. degree of confidence/weight in the derived token? This would represent a
concept distinct from "weight" for the purpose of scoring, and could thus
be appropriate to the analysis chain.
4. source/reason of token derivation relationship (e.g., specific ontology,
taxonomy, etc...)
5. 

This could provide all the information necessary to support custom indexing
strategies and/or query strategies, while remaining strictly focused on
analysis per se. This type of approach (if relationship info were recorded
in index, e.g. via Payload) could also support explicitly navigable facets
that are ontology-aware, or other potentially interesting things ...
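
As a sketch only, items 1 to 4 above could be carried by a small value type, with a byte encoding suitable for storing as a payload. Derivation, its field names and the byte layout are all assumptions made for illustration; nothing here is an existing Lucene Attribute or payload format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical derivation record mirroring items 1-4 of the list above.
final class Derivation {
    final String sourceText;   // 1. text of the input token
    final String relationship; // 2. e.g. "synonym", "hyponym", "hypernym"
    final float confidence;    // 3. analysis-level confidence, not a scoring boost
    final String source;       // 4. e.g. the name of a thesaurus or taxonomy

    Derivation(String sourceText, String relationship, float confidence, String source) {
        this.sourceText = sourceText;
        this.relationship = relationship;
        this.confidence = confidence;
        this.source = source;
    }

    // Assumed layout: three length-prefixed UTF-8 strings, then a 4-byte float.
    byte[] toPayload() {
        byte[] a = sourceText.getBytes(StandardCharsets.UTF_8);
        byte[] b = relationship.getBytes(StandardCharsets.UTF_8);
        byte[] c = source.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(3 * 4 + a.length + b.length + c.length + 4);
        put(buf, a);
        put(buf, b);
        put(buf, c);
        buf.putFloat(confidence);
        return buf.array();
    }

    // Reads the fields back in the same order they were written.
    static Derivation fromPayload(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        String sourceText = get(buf);
        String relationship = get(buf);
        String source = get(buf);
        float confidence = buf.getFloat();
        return new Derivation(sourceText, relationship, confidence, source);
    }

    private static void put(ByteBuffer buf, byte[] s) {
        buf.putInt(s.length);
        buf.put(s);
    }

    private static String get(ByteBuffer buf) {
        byte[] s = new byte[buf.getInt()];
        buf.get(s);
        return new String(s, StandardCharsets.UTF_8);
    }
}
```

Keeping confidence as descriptive metadata, and leaving any translation into a boost to the consumer, is what keeps this on the analysis side of the line.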

Michael


On Wed, Nov 21, 2018 at 9:24 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> I agree there is a tension between analysis and query parser
> responsibilities (or external to how queries are constructed). I wonder
> what you'd think of making QueryBuilder more easily subclassible by passing
> more term metadata to newSynonymQuery (such as types etc). This would let
> you select an alt strategy (such as some of the scoring systems used in the
> query expansion paper https://arxiv.org/pdf/1708.00247.pdf). Or doing
> something with a term labeled a hyponym/hypernym in a QueryBuilder
> subclass..
>
> -Doug
>
> On Wed, Nov 21, 2018 at 8:09 AM Robert Muir  wrote:
>
>> I don't think we should put scoring stuff into the analysis chain like
>> this. It already has a laundry list of responsibilities.
>>
>> Analysis chain can tell you the term is stacked or its a certain type
>> or occurs a certain number of times, but it shouldn't be supplying
>> things such as floating point boosts. That kind of scoring
>> manipulation needs to really happen in query parsing/somewhere else.
>>
>> On 11/20/18, jim ferenczi  wrote:
>> > Sorry for the late reply,
>> >
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> > is we could implement additional QueryBuilder implementations that
>> provide
>> > such functionality?
>> >
>> > I am not sure, I mentioned Solr and ES because I thought it was about
>> > adding taxonomies and complex expansion mechanisms to query builders
>> but I
>> > wonder if we can have a simple
>> > mechanism to just (de)boost stacked tokens in the QueryBuilder. It
>> could be
>> > a new attribute that token filters would use when they produce stacked
>> > tokens and that the QueryBuilder checks when he builds the
>> SynonymQuery. We
>> > already have a TermFrequencyAttribute to alter the frequency of a term
>> when
>> > indexing so we could have the same mechanism for query term boosting ?
>> >
>> > On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
>> > dturnb...@opensourceconnections.com> wrote:
>> >
>> >> Thanks Jim
>> >>
>> >> Yeah, now that I think about it - I agree that perhaps the simplest
>> >> option
>> >> would to create alternate query builders. I think there's a couple of
>> >> enhancement to the base class that would be nice, such as
>> >> - Some additional token attributes passed to newSynonymQuery, such as
>> the
>> >> type (was this a synonym or hyponym or something else...)
>> >> - The ability to differentiate between the original query term and the
>> >> generated synonym terms
>> >> - Consistent support for phrases
>> >>
>> >> I think part of my goal too is to help people without the use of
>> plugins.
>> >> As we often are in scenarios at OpenSource Connections where people
>> won't
>> >> be able to use a plugin. In this case alternate expansions around
>> >> hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>> have
>> >> using Solr/Lucene/ES.
>> >>
>> >> So perhaps one way forward to contribute this sort of thing into Lucene
>> >> is
>> >> we could implement additional QueryBuilder implementations that provide
>> >> such functionality?
>> >>
>> >> Thanks
>> >> -Doug
>> >>
>> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
>> >> wrote:
>> >>
>> >>> You can easily customize the query that is used for synonyms in a
>> custom
>> >>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
>> >>> intended for subclasses that wish to customize the generated
>> queries." so
>> >>> I
>> >>> don't think we need to do anything there. I agree that it is sometimes
>> >>> better to use something different than the SynonymQuery but in the
>> >>> general
>> >>> case it works as expected and can be combined with other terms
>> >>> naturally.
>> >>> The kind of customization you want to achieve could be done in a
>> plugin
>> >>> (or
>> >>> in Solr or ES) 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
I agree there is a tension between analysis and query parser
responsibilities (or external to how queries are constructed). I wonder
what you'd think of making QueryBuilder more easily subclassable by passing
more term metadata to newSynonymQuery (such as types etc). This would let
you select an alternative strategy (such as some of the scoring systems used
in the query expansion paper https://arxiv.org/pdf/1708.00247.pdf), or do
something with a term labeled as a hyponym/hypernym in a QueryBuilder
subclass.

-Doug

On Wed, Nov 21, 2018 at 8:09 AM Robert Muir  wrote:

> I don't think we should put scoring stuff into the analysis chain like
> this. It already has a laundry list of responsibilities.
>
> Analysis chain can tell you the term is stacked or its a certain type
> or occurs a certain number of times, but it shouldn't be supplying
> things such as floating point boosts. That kind of scoring
> manipulation needs to really happen in query parsing/somewhere else.
>
> On 11/20/18, jim ferenczi  wrote:
> > Sorry for the late reply,
> >
> >> So perhaps one way forward to contribute this sort of thing into Lucene
> > is we could implement additional QueryBuilder implementations that
> provide
> > such functionality?
> >
> > I am not sure, I mentioned Solr and ES because I thought it was about
> > adding taxonomies and complex expansion mechanisms to query builders but
> I
> > wonder if we can have a simple
> > mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
> be
> > a new attribute that token filters would use when they produce stacked
> > tokens and that the QueryBuilder checks when he builds the SynonymQuery.
> We
> > already have a TermFrequencyAttribute to alter the frequency of a term
> when
> > indexing so we could have the same mechanism for query term boosting ?
> >
> > On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
> > dturnb...@opensourceconnections.com> wrote:
> >
> >> Thanks Jim
> >>
> >> Yeah, now that I think about it - I agree that perhaps the simplest
> >> option
> >> would to create alternate query builders. I think there's a couple of
> >> enhancement to the base class that would be nice, such as
> >> - Some additional token attributes passed to newSynonymQuery, such as
> the
> >> type (was this a synonym or hyponym or something else...)
> >> - The ability to differentiate between the original query term and the
> >> generated synonym terms
> >> - Consistent support for phrases
> >>
> >> I think part of my goal too is to help people without the use of
> plugins.
> >> As we often are in scenarios at OpenSource Connections where people
> won't
> >> be able to use a plugin. In this case alternate expansions around
> >> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
> >> using Solr/Lucene/ES.
> >>
> >> So perhaps one way forward to contribute this sort of thing into Lucene
> >> is
> >> we could implement additional QueryBuilder implementations that provide
> >> such functionality?
> >>
> >> Thanks
> >> -Doug
> >>
> >> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
> >> wrote:
> >>
> >>> You can easily customize the query that is used for synonyms in a
> custom
> >>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
> >>> intended for subclasses that wish to customize the generated queries."
> so
> >>> I
> >>> don't think we need to do anything there. I agree that it is sometimes
> >>> better to use something different than the SynonymQuery but in the
> >>> general
> >>> case it works as expected and can be combined with other terms
> >>> naturally.
> >>> The kind of customization you want to achieve could be done in a plugin
> >>> (or
> >>> in Solr or ES) that extends the QueryBuilder, you can also use custom
> >>> token
> >>> filters and alter the query the way you want. My point here is that the
> >>> QueryBuilder should remain simple, you can add the complexity you want
> in
> >>> a
> >>> subclass.
> >>> However I think there is another area we need to fix, the scoring of
> >>> multi-terms synonyms is broken (compared to the SynonymQuery) and could
> >>> be
> >>> improved so we need something similar than the SynonymQuery that
> handles
> >>> multi phrases.
> >>>
> >>>
> >>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
> >>> dturnb...@opensourceconnections.com> wrote:
> >>>
>  Yes that is another good area (there are many). Although of course
>  embeddings have their own challenges and complexities. (they often
>  capture
>  shared context, but not shared meaning).
> 
>  It's a data point though of something we'd want to include in such a
>  framework, though not sure where it would go on the roadmap...
> 
>  On Sat, Nov 17, 2018 at 1:15 AM J. Delgado  >
>  wrote:
> 
> > What about the use of word embeddings (see
> >
> >
> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa
> )
> > to compute word similarity?
> >
> > On 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Robert Muir
I don't think we should put scoring stuff into the analysis chain like
this. It already has a laundry list of responsibilities.

Analysis chain can tell you the term is stacked or it's a certain type
or occurs a certain number of times, but it shouldn't be supplying
things such as floating point boosts. That kind of scoring
manipulation really needs to happen in query parsing or somewhere else.
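
To make that separation concrete, here is a minimal sketch in plain Python
(not Lucene's Java API). The analysis side only reports a term and its type;
a query-building layer owns the numeric boosts. The Token class, the type
labels other than "SYNONYM", and the TYPE_BOOSTS table are all hypothetical,
picked only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """What the analysis chain reports: text and a type label, no scores."""
    text: str
    type: str  # e.g. "word" for an original term, "SYNONYM" for an inserted one


# Hypothetical boost table, owned by the query-building layer (not by analysis).
TYPE_BOOSTS = {"word": 1.0, "SYNONYM": 0.9, "HYPONYM": 0.7, "HYPERNYM": 0.5}


def build_clauses(tokens, boosts=TYPE_BOOSTS, default=1.0):
    """Turn typed tokens into (term, boost) clauses at query-build time."""
    return [(t.text, boosts.get(t.type, default)) for t in tokens]


tokens = [Token("wingtips", "word"), Token("dress_shoes", "HYPERNYM")]
clauses = build_clauses(tokens)
```

With this split, changing how hypernyms are weighted means editing the
table in the query layer; the token stream itself stays free of scores.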

On 11/20/18, jim ferenczi  wrote:
> Sorry for the late reply,
>
>> So perhaps one way forward to contribute this sort of thing into Lucene
> is we could implement additional QueryBuilder implementations that provide
> such functionality?
>
> I am not sure, I mentioned Solr and ES because I thought it was about
> adding taxonomies and complex expansion mechanisms to query builders but I
> wonder if we can have a simple
> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could be
> a new attribute that token filters would use when they produce stacked
> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
> already have a TermFrequencyAttribute to alter the frequency of a term when
> indexing so we could have the same mechanism for query term boosting ?
>
> On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Thanks Jim
>>
>> Yeah, now that I think about it - I agree that perhaps the simplest
>> option
>> would to create alternate query builders. I think there's a couple of
>> enhancement to the base class that would be nice, such as
>> - Some additional token attributes passed to newSynonymQuery, such as the
>> type (was this a synonym or hyponym or something else...)
>> - The ability to differentiate between the original query term and the
>> generated synonym terms
>> - Consistent support for phrases
>>
>> I think part of my goal too is to help people without the use of plugins.
>> As we often are in scenarios at OpenSource Connections where people won't
>> be able to use a plugin. In this case alternate expansions around
>> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
>> using Solr/Lucene/ES.
>>
>> So perhaps one way forward to contribute this sort of thing into Lucene
>> is
>> we could implement additional QueryBuilder implementations that provide
>> such functionality?
>>
>> Thanks
>> -Doug
>>
>> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
>> wrote:
>>
>>> You can easily customize the query that is used for synonyms in a custom
>>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
>>> intended for subclasses that wish to customize the generated queries." so
>>> I
>>> don't think we need to do anything there. I agree that it is sometimes
>>> better to use something different than the SynonymQuery but in the
>>> general
>>> case it works as expected and can be combined with other terms
>>> naturally.
>>> The kind of customization you want to achieve could be done in a plugin
>>> (or
>>> in Solr or ES) that extends the QueryBuilder, you can also use custom
>>> token
>>> filters and alter the query the way you want. My point here is that the
>>> QueryBuilder should remain simple, you can add the complexity you want in
>>> a
>>> subclass.
>>> However I think there is another area we need to fix, the scoring of
>>> multi-terms synonyms is broken (compared to the SynonymQuery) and could
>>> be
>>> improved so we need something similar than the SynonymQuery that handles
>>> multi phrases.
>>>
>>>
>>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
>>> dturnb...@opensourceconnections.com> wrote:
>>>
 Yes that is another good area (there are many). Although of course
 embeddings have their own challenges and complexities. (they often
 capture
 shared context, but not shared meaning).

 It's a data point though of something we'd want to include in such a
 framework, though not sure where it would go on the roadmap...

 On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
 wrote:

> What about the use of word embeddings (see
>
> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
> to compute word similarity?
>
> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Hey folks,
>>
>> I wanted to open up a discussion about a change to the usage of
>> SynonymQuery. The goal here is to have a broader library of queries
>> that
>> can address other cases where related terms occupy the same position
>> but
>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>> ambiguous terms, and other query expansion situations).
>>
>>
>> I bring this up because we've noticed (as I'm sure many of you have)
>> the pattern of clients jamming any related term into a synonyms file
>> and
>> being surprised with odd results. I like the idea of enforcing
>> "synonyms"
>> means 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
Alessandro, reading your post, I realized I made a mistake: you need to go
both up and down the hierarchy when blending. When a user searches for
'dress shoes', going down a level (or two) is just as important, so you
also need the hyponym terms.

This works out if you do an index time expansion (child terms get parent
terms injected) but doesn't work out if you want a 100% query time blending.

In this case, I think I would revise my blending idea to

- Search for the term 'wingtips' (lowest doc freq, smallest set)
- Search for the term 'wingtips' blended with all child terms
- Search for parent & sibling concepts (the set of all dress shoes)
- Search for grandparent, aunt, uncle, cousins... (the set of all shoes,
highest df)

In this case, I don't *think* you need any special weighting, as the true doc
freq of each concept recreates the priority ordering you guys came up with.
That's pretty neat!
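
A rough illustration of that claim, as a plain Python sketch with made-up
postings (the doc ids, NUM_DOCS and the helper names are assumptions, and
the BM25-style IDF stands in for whatever similarity is configured). Since
'wingtips' is a leaf in the example taxonomy, the "child terms" level is
the same as the first level and is left out here.

```python
import math

# Toy postings: term -> set of doc ids (made-up data for illustration).
POSTINGS = {
    "wingtips": {1, 2},
    "oxfords": {2, 3, 4},
    "dress_shoes": {5},
    "loafers": {6, 7, 8},
    "sketchers": {9, 10},
    "shoes": {11},
}
NUM_DOCS = 100


def blended_df(terms):
    """Doc freq of a concept: number of docs matching ANY of its terms."""
    return len(set().union(*(POSTINGS[t] for t in terms)))


def idf(df, n=NUM_DOCS):
    """BM25-style IDF; always positive and decreasing in df."""
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


LEVELS = [
    ["wingtips"],                                   # exact term
    ["wingtips", "oxfords", "dress_shoes"],         # parent & siblings
    ["wingtips", "oxfords", "dress_shoes",
     "loafers", "sketchers", "shoes"],              # grandparent & cousins
]
dfs = [blended_df(level) for level in LEVELS]
idfs = [idf(df) for df in dfs]
```

Each wider level matches more documents, so its blended IDF is lower, and
the narrowest match contributes the most without any hand-set weights.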

-Doug

On Wed, Nov 21, 2018 at 7:20 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Great thoughts Jim - +1 to your idea
>
> One brainstorm I had, is taxonomies have a kind of 'ideal scoring' that I
> think would lead to a different blending strategy for taxonomies than
> synonyms.
>
> If you have a taxonomy:
>
> \shoes\dress_shoes\oxfords
> \shoes\dress_shoes\wingtips
> \shoes\lazy_shoes\loafers
> \shoes\lazy_shoes\sketchers
>
> This taxonomy states - if a document mentions 'oxfords', it's also
> discussing the concept of dress shoes. If it only mentions 'wingtips' it
> also is discussing dress shoes.
>
> Thus ideally, the true document frequency of the parent concept 'dress
> shoes' is the combination of the children. This is the number of documents
> that discuss this concept.
>
> You can repeat this for grandparent concepts. The number of documents with
> 'shoes' really is all the documents mentioning oxfords, wingtips, loafers,
> sketchers, and the like...
>
> We have implemented this idea at index time, with index-time semantic
> expansion to inject the parent concepts. (manually put dress_shoes into
> documents that just mention wingtips). This is mentioned in this blog post
> https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/
>  and
> conference talk https://www.youtube.com/watch?v=90F30PS-884 This is
> annoying and requires reindexing. Though it's the most accurate.
>
> BUT I think a blended query-time query would capture the same semantics.
> You basically want to score a taxonomy like the following. Imagine a user
> query of wingtips, you could imagine 3 should clauses that blend at
> different levels
>
> - Search for the term 'wingtips' (lowest doc freq, smallest set)
> - Search for parent & sibling concepts (the set of all dress shoes)
> - Search for grandparent, aunt, uncle, cousins... (the set of all shoes,
> highest df)
>
> text:wingtips OR Blended(text:wingtips, text:oxfords, text:dress_shoes) OR
> Blended(text:wingtips, text:oxfords, text:dress_shoes, text:sketchers,
> text:loafers, ...)
>
> Right now this can be accomplished by just issuing 3 SHOULD queries with 3
> different query-time analyzers each with different synonym expansions
> (exact user term, child => parent/sibling, child => parent, grandparent,
> etc...). And maybe it should stay that way.
>
> But this is why I think it's a 'yes AND', yes I think it would be a great
> addition to have synonym weighting. AND I think there are blending
> strategies that are specific to the use case.
>
> -Doug
>
>
>
> On Tue, Nov 20, 2018 at 9:34 PM Michael Sokolov 
> wrote:
>
>> This is a great idea. It would also be compelling to modify the term
>> frequency using this deboosting so that stacked indexed terms can be
>> weighted according to their closeness to the original term.
>>
>> On Tue, Nov 20, 2018, 2:19 PM jim ferenczi >
> Sorry for the late reply,
>>>
>>> > So perhaps one way forward to contribute this sort of thing into
>>> Lucene is we could implement additional QueryBuilder implementations that
>>> provide such functionality?
>>>
>>> I am not sure, I mentioned Solr and ES because I thought it was about
>>> adding taxonomies and complex expansion mechanisms to query builders but I
>>> wonder if we can have a simple
>>> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
>>> be a new attribute that token filters would use when they produce stacked
>>> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
>>> already have a TermFrequencyAttribute to alter the frequency of a term when
>>> indexing so we could have the same mechanism for query term boosting ?
>>>
>>> On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
>>> dturnb...@opensourceconnections.com> wrote:
>>>
>> Thanks Jim

 Yeah, now that I think about it - I agree that perhaps the simplest
 option would to create alternate query builders. I think there's a couple
 of enhancement to the base class that would be nice, such as
 - 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Doug Turnbull
Great thoughts Jim - +1 to your idea

One brainstorm I had is that taxonomies have a kind of 'ideal scoring' that I
think would lead to a different blending strategy for taxonomies than for
synonyms.

If you have a taxonomy:

\shoes\dress_shoes\oxfords
\shoes\dress_shoes\wingtips
\shoes\lazy_shoes\loafers
\shoes\lazy_shoes\sketchers

This taxonomy states - if a document mentions 'oxfords', it's also
discussing the concept of dress shoes. If it only mentions 'wingtips' it
also is discussing dress shoes.

Thus ideally, the true document frequency of the parent concept 'dress
shoes' is the combination of the children. This is the number of documents
that discuss this concept.

You can repeat this for grandparent concepts. The number of documents with
'shoes' really is all the documents mentioning oxfords, wingtips, loafers,
sketchers, and the like...

We have implemented this idea at index time, with index-time semantic
expansion to inject the parent concepts. (manually put dress_shoes into
documents that just mention wingtips). This is mentioned in this blog post
https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/
and conference talk https://www.youtube.com/watch?v=90F30PS-884. This is
annoying and requires reindexing, though it's the most accurate.

BUT I think a blended query-time query would capture the same semantics.
You basically want to score a taxonomy like the following. For a user
query of wingtips, you could imagine 3 SHOULD clauses that blend at
different levels:

- Search for the term 'wingtips' (lowest doc freq, smallest set)
- Search for parent & sibling concepts (the set of all dress shoes)
- Search for grandparent, aunt, uncle, cousins... (the set of all shoes,
highest df)

text:wingtips OR Blended(text:wingtips, text:oxfords, text:dress_shoes) OR
Blended(text:wingtips, text:oxfords, text:dress_shoes, text:sketchers,
text:loafers, ...)

Right now this can be accomplished by just issuing 3 SHOULD queries with 3
different query-time analyzers each with different synonym expansions
(exact user term, child => parent/sibling, child => parent, grandparent,
etc...). And maybe it should stay that way.
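
As a sketch of how those three expansions could be derived from the
taxonomy itself (plain Python, not an existing Solr/Lucene feature): the
function name expansion_levels is hypothetical, and the paths use forward
slashes instead of the backslashes above purely for convenience. Each
level would feed one blended SHOULD clause.

```python
# The example taxonomy, one path per leaf term.
PATHS = [
    "shoes/dress_shoes/oxfords",
    "shoes/dress_shoes/wingtips",
    "shoes/lazy_shoes/loafers",
    "shoes/lazy_shoes/sketchers",
]


def expansion_levels(term, paths=PATHS):
    """Return one term list per blending level, narrowest first."""
    # Locate the path whose leaf is the query term.
    path = next(p.split("/") for p in paths if p.split("/")[-1] == term)
    levels = [[term]]
    # Walk up: at each ancestor, collect the ancestor and everything below it.
    for depth in range(len(path) - 1, 0, -1):
        prefix = path[:depth]
        members = {prefix[-1]}
        for p in paths:
            parts = p.split("/")
            if parts[:depth] == prefix:
                members.update(parts[depth:])
        levels.append(sorted(members))
    return levels


levels = expansion_levels("wingtips")
```

For 'wingtips' this gives the exact term, then the dress_shoes concept
with its children, then everything under shoes.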

But this is why I think it's a 'yes AND', yes I think it would be a great
addition to have synonym weighting. AND I think there are blending
strategies that are specific to the use case.

-Doug



On Tue, Nov 20, 2018 at 9:34 PM Michael Sokolov  wrote:

> This is a great idea. It would also be compelling to modify the term
> frequency using this deboosting so that stacked indexed terms can be
> weighted according to their closeness to the original term.
>
> On Tue, Nov 20, 2018, 2:19 PM jim ferenczi wrote:
>
>> Sorry for the late reply,
>>
>> > So perhaps one way forward to contribute this sort of thing into Lucene
>> is we could implement additional QueryBuilder implementations that provide
>> such functionality?
>>
>> I am not sure, I mentioned Solr and ES because I thought it was about
>> adding taxonomies and complex expansion mechanisms to query builders but I
>> wonder if we can have a simple
>> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
>> be a new attribute that token filters would use when they produce stacked
>> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
>> already have a TermFrequencyAttribute to alter the frequency of a term when
>> indexing so we could have the same mechanism for query term boosting ?
>>
>> On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
>> dturnb...@opensourceconnections.com> wrote:
>>
>>> Thanks Jim
>>>
>>> Yeah, now that I think about it - I agree that perhaps the simplest
>>> option would to create alternate query builders. I think there's a couple
>>> of enhancement to the base class that would be nice, such as
>>> - Some additional token attributes passed to newSynonymQuery, such as
>>> the type (was this a synonym or hyponym or something else...)
>>> - The ability to differentiate between the original query term and the
>>> generated synonym terms
>>> - Consistent support for phrases
>>>
>>> I think part of my goal too is to help people without the use of
>>> plugins. As we often are in scenarios at OpenSource Connections where
>>> people won't be able to use a plugin. In this case alternate expansions
>>> around hypernyms/hyponyms/?... are a pretty frequent gap that search teams
>>> have using Solr/Lucene/ES.
>>>
>>> So perhaps one way forward to contribute this sort of thing into Lucene
>>> is we could implement additional QueryBuilder implementations that provide
>>> such functionality?
>>>
>>> Thanks
>>> -Doug
>>>
>>
>>> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
>>> wrote:
>>>
 You can easily customize the query that is used for synonyms in a
 custom QueryBuilder. The javadocs of the *newSynonymQuery* says "This
 is intended for subclasses that wish to customize the generated queries."
 so I don't think we need to do anything there. I agree that it is 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-21 Thread Alessandro Benedetti
Hi all,
last Sunday we spent a bit of time on this topic; our considerations follow.

N.B. we didn't check the state of the art (thanks Doug for the nice survey
you shared, I will definitely take a look later on).
So we just wanted to figure out an initial improvement that can later be
refined following state-of-the-art formulas.
It is kind of related to Jim's idea.
This was the output of our brainstorming:

*Introduction*
Currently in Apache Solr (and Elasticsearch) there is no supported way to
manage true synonyms, hypernyms and hyponyms at query time.
A first attempt to add support for that was made by Doug Turnbull with
the approach in the following pull requests [1].
We think that approach was a good starting point, but we believe it
could be improved.

*Weaknesses of Current Approach*
In our opinion, the current approach has the following weaknesses:
- it tries to guess the hypernym/hyponym/synonym relation from the DF of the
terms
- it doesn't necessarily favour the original query term
- it favours rarer hypernyms/hyponyms/synonyms and doesn't differentiate
between them.

*Proposed Improvements*

   - Onym Class Priority Order
   - Onyms within a Class Ranked by Popularity


*1 - Onym Class Priority Order*
We believe it should be possible to give different priority to different
classes of onyms (hypernym/hyponym/synonym).
Specifically, we believe it should be possible to model this priority
in scoring:

*Original Query Term > True Synonym > Hyponym > Hypernym*

An additional benefit could be gained if this inequality could be customised
based on user requirements,
*e.g.* by adding different shades of onyms and a slightly different ordering:
Original Query Term > True Synonym > Hyponym > 2nd-level Hyponym > Hypernym

*2 - Onyms within a Class Ranked by Popularity*

Within the same class, we believe we need to favour the most popular
(highest document frequency) onyms: within true synonyms we'll favour the
most popular one, and likewise within hyponyms or hypernyms.

*Proposed Solution*
The proposed solution is to score the different onyms in this way:

*Original Query Term* -> IDFQueryTerm
*True Synonym (boost: 1.0)* -> IDFQueryTerm * 1/(1+IDFSynonym)
*Hyponym (boost < 1.0)* -> IDFQueryTerm * 1/(1+IDFHyponym)
*Hypernym (boost < 1.0)* -> IDFQueryTerm * 1/(1+IDFHypernym)

You may have noticed the introduction of the boost factor.
This is the key point of the onym classification:
all the onyms with the same boost belong to the same class.
This gives users the flexibility to rank the different onym classes
based on their preference.
The boost solves problem 1 (*Onym Class Priority Order*).
Multiplying the original term IDF by the second part of the formula fixes
problem 2 (*Onyms within a Class Ranked by Popularity*) and guarantees that the
original term wins anyway.
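
A small worked sketch of the scoring above, in plain Python. Two things
are assumptions here, not part of the proposal as written: the boost is
applied as a plain multiplier on the formula, and IDF is computed
BM25-style. The document frequencies and boost values are made up.

```python
import math


def idf(df, num_docs):
    """BM25-style IDF (an assumption; any positive, df-decreasing IDF works)."""
    return math.log(1 + (num_docs - df + 0.5) / (df + 0.5))


def onym_score(idf_query, idf_onym, boost):
    """boost * IDFQueryTerm * 1/(1 + IDFOnym); multiplying by boost is assumed."""
    return boost * idf_query / (1.0 + idf_onym)


NUM_DOCS = 1000
idf_q = idf(50, NUM_DOCS)  # original query term, df = 50

scores = {
    "original": idf_q,
    "popular_synonym": onym_score(idf_q, idf(200, NUM_DOCS), boost=1.0),
    "rare_synonym": onym_score(idf_q, idf(5, NUM_DOCS), boost=1.0),
    "hyponym": onym_score(idf_q, idf(200, NUM_DOCS), boost=0.8),
    "hypernym": onym_score(idf_q, idf(200, NUM_DOCS), boost=0.6),
}
```

Under this reading the original term always wins, and within a class the
more popular onym scores higher. Note, though, that a popular hyponym can
outscore a rare true synonym (as it does with these numbers), so the class
order only holds strictly for onyms of comparable document frequency.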

*Implementation*
The suggested implementation will cover different areas:
- Implement the scoring logic through blended DF / proxy term stats / proxy
similarity (the best path to implement the designed scoring must be
investigated).
- Give the user a configuration file to model the onyms. A first modality
is already available through [2]. A first improvement could be to implement
support for taxonomies such as:
/big cats/lion-panthera leo/simba-kimba.
A final solution will allow integration with custom knowledge bases,
WordNet, etc.
- What about performance? You could add a configuration parameter that
cuts the query expansion based on a boost threshold. We can think of the
boost as the distance from the original concept, so the user should be able
to cut down the expanded terms to favour performance.
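
The boost-threshold cut could look roughly like this (plain Python; the
function name prune_expansions, the min_boost parameter and the boost
values attached to the example terms are all hypothetical):

```python
def prune_expansions(expansions, min_boost):
    """Keep only expanded terms whose boost is at least min_boost."""
    return [(term, boost) for term, boost in expansions if boost >= min_boost]


# (term, boost) pairs; a lower boost means further from the original concept.
expansions = [
    ("lion", 1.0),
    ("panthera leo", 1.0),
    ("big cats", 0.6),
    ("simba", 0.4),
]
kept = prune_expansions(expansions, min_boost=0.5)
```

Raising min_boost trades recall for fewer query clauses.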

[1] https://issues.apache.org/jira/browse/SOLR-11662,
https://github.com/elastic/elasticsearch/pull/35422

[2] https://issues.apache.org/jira/browse/SOLR-12238
--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Wed, Nov 21, 2018 at 2:34 AM Michael Sokolov  wrote:

> This is a great idea. It would also be compelling to modify the term
> frequency using this deboosting so that stacked indexed terms can be
> weighted according to their closeness to the original term.
>
> On Tue, Nov 20, 2018, 2:19 PM jim ferenczi 
>> Sorry for the late reply,
>>
>> > So perhaps one way forward to contribute this sort of thing into Lucene
>> is we could implement additional QueryBuilder implementations that provide
>> such functionality?
>>
>> I am not sure, I mentioned Solr and ES because I thought it was about
>> adding taxonomies and complex expansion mechanisms to query builders but I
>> wonder if we can have a simple
>> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
>> be a new attribute that token filters would use when they produce stacked
>> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
>> already have a TermFrequencyAttribute to alter the 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-20 Thread Michael Sokolov
This is a great idea. It would also be compelling to modify the term
frequency using this deboosting so that stacked indexed terms can be
weighted according to their closeness to the original term.

On Tue, Nov 20, 2018, 2:19 PM jim ferenczi wrote:

> Sorry for the late reply,
>
> > So perhaps one way forward to contribute this sort of thing into Lucene
> is we could implement additional QueryBuilder implementations that provide
> such functionality?
>
> I am not sure, I mentioned Solr and ES because I thought it was about
> adding taxonomies and complex expansion mechanisms to query builders but I
> wonder if we can have a simple
> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
> be a new attribute that token filters would use when they produce stacked
> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
> already have a TermFrequencyAttribute to alter the frequency of a term when
> indexing so we could have the same mechanism for query term boosting ?
>
> On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Thanks Jim
>>
>> Yeah, now that I think about it - I agree that perhaps the simplest
>> option would to create alternate query builders. I think there's a couple
>> of enhancement to the base class that would be nice, such as
>> - Some additional token attributes passed to newSynonymQuery, such as the
>> type (was this a synonym or hyponym or something else...)
>> - The ability to differentiate between the original query term and the
>> generated synonym terms
>> - Consistent support for phrases
>>
>> I think part of my goal too is to help people without the use of plugins.
>> As we often are in scenarios at OpenSource Connections where people won't
>> be able to use a plugin. In this case alternate expansions around
>> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
>> using Solr/Lucene/ES.
>>
>> So perhaps one way forward to contribute this sort of thing into Lucene
>> is we could implement additional QueryBuilder implementations that provide
>> such functionality?
>>
>> Thanks
>> -Doug
>>
>> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
>> wrote:
>>
>>> You can easily customize the query that is used for synonyms in a custom
>>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
>>> intended for subclasses that wish to customize the generated queries." so I
>>> don't think we need to do anything there. I agree that it is sometimes
>>> better to use something different than the SynonymQuery but in the general
>>> case it works as expected and can be combined with other terms naturally.
>>> The kind of customization you want to achieve could be done in a plugin (or
>>> in Solr or ES) that extends the QueryBuilder, you can also use custom token
>>> filters and alter the query the way you want. My point here is that the
>>> QueryBuilder should remain simple, you can add the complexity you want in a
>>> subclass.
>>> However I think there is another area we need to fix, the scoring of
>>> multi-terms synonyms is broken (compared to the SynonymQuery) and could be
>>> improved so we need something similar than the SynonymQuery that handles
>>> multi phrases.
>>>
>>>
>>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
>>> dturnb...@opensourceconnections.com> wrote:
>>>
 Yes that is another good area (there are many). Although of course
 embeddings have their own challenges and complexities. (they often capture
 shared context, but not shared meaning).

 It's a data point though of something we'd want to include in such a
 framework, though not sure where it would go on the roadmap...

 On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
 wrote:

> What about the use of word embeddings (see
>
> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
> to compute word similarity?
>
> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Hey folks,
>>
>> I wanted to open up a discussion about a change to the usage of
>> SynonymQuery. The goal here is to have a broader library of queries that
>> can address other cases where related terms occupy the same position but
>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>> ambiguous terms, and other query expansion situations).
>>
>>
>> I bring this up because we've noticed (as I'm sure many of you have)
>> the pattern of clients jamming any related term into a synonyms file and
>> being surprised with odd results. I like the idea of enforcing "synonyms"
>> means exactly-the-same in Lucene-land. It's an easy thing to tell a 
>> client
>> and setup simple patterns. So for synonyms, I think leaving SynonymQuery 
>> in
>> place works great.
>>
>> But I feel if that's the rule, we need to open up 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-20 Thread David Smiley
+1 great idea Jim!

On Tue, Nov 20, 2018 at 2:19 PM jim ferenczi  wrote:

> Sorry for the late reply,
>
> > So perhaps one way forward to contribute this sort of thing into Lucene
> is we could implement additional QueryBuilder implementations that provide
> such functionality?
>
> I am not sure, I mentioned Solr and ES because I thought it was about
> adding taxonomies and complex expansion mechanisms to query builders but I
> wonder if we can have a simple
> mechanism to just (de)boost stacked tokens in the QueryBuilder. It could
> be a new attribute that token filters would use when they produce stacked
> tokens and that the QueryBuilder checks when he builds the SynonymQuery. We
> already have a TermFrequencyAttribute to alter the frequency of a term when
> indexing so we could have the same mechanism for query term boosting ?
>
> On Sun, Nov 18, 2018 at 02:24, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Thanks Jim
>>
>> Yeah, now that I think about it - I agree that perhaps the simplest
>> option would to create alternate query builders. I think there's a couple
>> of enhancement to the base class that would be nice, such as
>> - Some additional token attributes passed to newSynonymQuery, such as the
>> type (was this a synonym or hyponym or something else...)
>> - The ability to differentiate between the original query term and the
>> generated synonym terms
>> - Consistent support for phrases
>>
>> I think part of my goal too is to help people without the use of plugins.
>> As we often are in scenarios at OpenSource Connections where people won't
>> be able to use a plugin. In this case alternate expansions around
>> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
>> using Solr/Lucene/ES.
>>
>> So perhaps one way forward to contribute this sort of thing into Lucene
>> is we could implement additional QueryBuilder implementations that provide
>> such functionality?
>>
>> Thanks
>> -Doug
>>
>> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
>> wrote:
>>
>>> You can easily customize the query that is used for synonyms in a custom
>>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
>>> intended for subclasses that wish to customize the generated queries." so I
>>> don't think we need to do anything there. I agree that it is sometimes
>>> better to use something different than the SynonymQuery but in the general
>>> case it works as expected and can be combined with other terms naturally.
>>> The kind of customization you want to achieve could be done in a plugin (or
>>> in Solr or ES) that extends the QueryBuilder, you can also use custom token
>>> filters and alter the query the way you want. My point here is that the
>>> QueryBuilder should remain simple, you can add the complexity you want in a
>>> subclass.
>>> However I think there is another area we need to fix, the scoring of
>>> multi-terms synonyms is broken (compared to the SynonymQuery) and could be
>>> improved so we need something similar than the SynonymQuery that handles
>>> multi phrases.
>>>
>>>
>>> On Sat, Nov 17, 2018 at 07:19, Doug Turnbull <
>>> dturnb...@opensourceconnections.com> wrote:
>>>
 Yes that is another good area (there are many). Although of course
 embeddings have their own challenges and complexities. (they often capture
 shared context, but not shared meaning).

 It's a data point though of something we'd want to include in such a
 framework, though not sure where it would go on the roadmap...

 On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
 wrote:

> What about the use of word embeddings (see
>
> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
> to compute word similarity?
>
> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Hey folks,
>>
>> I wanted to open up a discussion about a change to the usage of
>> SynonymQuery. The goal here is to have a broader library of queries that
>> can address other cases where related terms occupy the same position but
>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>> ambiguous terms, and other query expansion situations).
>>
>>
>> I bring this up because we've noticed (as I'm sure many of you have)
>> the pattern of clients jamming any related term into a synonyms file and
>> being surprised with odd results. I like the idea of enforcing "synonyms"
>> means exactly-the-same in Lucene-land. It's an easy thing to tell a 
>> client
>> and setup simple patterns. So for synonyms, I think leaving SynonymQuery 
>> in
>> place works great.
>>
>> But I feel if that's the rule, we need to open up discussion of other
>> methods of scoring conceptual 'related term' relationships that usually
>> comes up in the context of query expansion. This paper (
>> 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-20 Thread jim ferenczi
Sorry for the late reply,

> So perhaps one way forward to contribute this sort of thing into Lucene
> is we could implement additional QueryBuilder implementations that provide
> such functionality?

I am not sure. I mentioned Solr and ES because I thought this was about
adding taxonomies and complex expansion mechanisms to query builders, but I
wonder if we can have a simple mechanism to just (de)boost stacked tokens in
the QueryBuilder. It could be a new attribute that token filters set when
they produce stacked tokens and that the QueryBuilder checks when it builds
the SynonymQuery. We already have a TermFrequencyAttribute to alter the
frequency of a term when indexing, so we could have the same mechanism for
query term boosting?
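
To make the idea concrete, here is a rough, language-neutral sketch in
Python rather than Lucene code. The Token class, its boost field and
build_query are all hypothetical names for illustration; nothing here is an
existing Lucene API. It only shows stacked tokens (same position) carrying
a per-token boost that the builder reads when it groups them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    position: int
    boost: float = 1.0  # hypothetical attribute set by a token filter


def build_query(tokens):
    """Group tokens by position; stacked tokens form one weighted group."""
    by_position = defaultdict(list)
    for tok in tokens:
        by_position[tok.position].append((tok.text, tok.boost))
    # One clause per position, emitted in position order.
    return [by_position[pos] for pos in sorted(by_position)]


tokens = [
    Token("khakis", 0),
    Token("trousers", 0, boost=0.5),  # stacked token, de-boosted by the filter
    Token("sale", 1),
]
clauses = build_query(tokens)
```

The builder itself stays simple: it never decides the weights, it only
carries through whatever the analysis chain attached to each token.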

Le dim. 18 nov. 2018 à 02:24, Doug Turnbull <
dturnb...@opensourceconnections.com> a écrit :

> Thanks Jim
>
> Yeah, now that I think about it - I agree that perhaps the simplest option
> would to create alternate query builders. I think there's a couple of
> enhancement to the base class that would be nice, such as
> - Some additional token attributes passed to newSynonymQuery, such as the
> type (was this a synonym or hyponym or something else...)
> - The ability to differentiate between the original query term and the
> generated synonym terms
> - Consistent support for phrases
>
> I think part of my goal too is to help people without the use of plugins.
> As we often are in scenarios at OpenSource Connections where people won't
> be able to use a plugin. In this case alternate expansions around
> hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
> using Solr/Lucene/ES.
>
> So perhaps one way forward to contribute this sort of thing into Lucene is
> we could implement additional QueryBuilder implementations that provide
> such functionality?
>
> Thanks
> -Doug
>
> On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi 
> wrote:
>
>> You can easily customize the query that is used for synonyms in a custom
>> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
>> intended for subclasses that wish to customize the generated queries." so I
>> don't think we need to do anything there. I agree that it is sometimes
>> better to use something different than the SynonymQuery but in the general
>> case it works as expected and can be combined with other terms naturally.
>> The kind of customization you want to achieve could be done in a plugin (or
>> in Solr or ES) that extends the QueryBuilder, you can also use custom token
>> filters and alter the query the way you want. My point here is that the
>> QueryBuilder should remain simple, you can add the complexity you want in a
>> subclass.
>> However I think there is another area we need to fix, the scoring of
>> multi-terms synonyms is broken (compared to the SynonymQuery) and could be
>> improved so we need something similar than the SynonymQuery that handles
>> multi phrases.
>>
>>
>> Le sam. 17 nov. 2018 à 07:19, Doug Turnbull <
>> dturnb...@opensourceconnections.com> a écrit :
>>
>>> Yes that is another good area (there are many). Although of course
>>> embeddings have their own challenges and complexities. (they often capture
>>> shared context, but not shared meaning).
>>>
>>> It's a data point though of something we'd want to include in such a
>>> framework, though not sure where it would go on the roadmap...
>>>
>>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
>>> wrote:
>>>
>>>> What about the use of word embeddings (see
>>>>
>>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>>>> to compute word similarity?
>>>>
>>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>>>> dturnb...@opensourceconnections.com> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> I wanted to open up a discussion about a change to the usage of
>>>>> SynonymQuery. The goal here is to have a broader library of queries that
>>>>> can address other cases where related terms occupy the same position but
>>>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>>>> ambiguous terms, and other query expansion situations).
>>>>>
>>>>>
>>>>> I bring this up because we've noticed (as I'm sure many of you have)
>>>>> the pattern of clients jamming any related term into a synonyms file and
>>>>> being surprised with odd results. I like the idea of enforcing "synonyms"
>>>>> means exactly-the-same in Lucene-land. It's an easy thing to tell a client
>>>>> and setup simple patterns. So for synonyms, I think leaving SynonymQuery
>>>>> in place works great.
>>>>>
>>>>> But I feel if that's the rule, we need to open up discussion of other
>>>>> methods of scoring conceptual 'related term' relationships that usually
>>>>> comes up in the context of query expansion. This paper (
>>>>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>>>> surveys the current thinking for scoring various query expansion scenarios
>>>>> like those we deal with in the messy, ambiguous uses 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-17 Thread Doug Turnbull
Thanks Jim

Yeah, now that I think about it, I agree that perhaps the simplest option
would be to create alternate query builders. I think there are a couple of
enhancements to the base class that would be nice, such as:
- Some additional token attributes passed to newSynonymQuery, such as the
type (was this a synonym or hyponym or something else...)
- The ability to differentiate between the original query term and the
generated synonym terms
- Consistent support for phrases
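
A rough sketch of what the first two bullets could enable, written as a
language-neutral Python model rather than Lucene code. TermInfo,
BaseBuilder, TypeAwareBuilder and the weights are all hypothetical; the
real newSynonymQuery does not receive this metadata today, which is the
enhancement being asked for.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TermInfo:
    text: str
    token_type: str    # e.g. "word", "SYNONYM", "HYPONYM" (illustrative labels)
    is_original: bool  # True for the term the user actually typed


class BaseBuilder:
    def new_synonym_query(self, terms: List[TermInfo]) -> List[Tuple[str, float]]:
        # Default behaviour: every term at the position is treated alike.
        return [(t.text, 1.0) for t in terms]


class TypeAwareBuilder(BaseBuilder):
    # Arbitrary example weights, not recommendations.
    WEIGHTS = {"SYNONYM": 1.0, "HYPONYM": 0.7, "HYPERNYM": 0.4}

    def new_synonym_query(self, terms: List[TermInfo]) -> List[Tuple[str, float]]:
        weighted = []
        for t in terms:
            # The original term keeps full weight; generated terms are
            # weighted by the kind of expansion that produced them.
            w = 1.0 if t.is_original else self.WEIGHTS.get(t.token_type, 0.5)
            weighted.append((t.text, w))
        return weighted


terms = [
    TermInfo("trousers", "word", True),
    TermInfo("pants", "SYNONYM", False),
    TermInfo("khakis", "HYPONYM", False),
]
weighted = TypeAwareBuilder().new_synonym_query(terms)
```

With the metadata passed down, a subclass only has to override one method
instead of re-implementing the analysis loop.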

I think part of my goal too is to help people without the use of plugins.
At OpenSource Connections we are often in scenarios where people won't be
able to use a plugin. In this case alternate expansions around
hypernyms/hyponyms/?... are a pretty frequent gap that search teams have
using Solr/Lucene/ES.

So perhaps one way forward to contribute this sort of thing into Lucene is
we could implement additional QueryBuilder implementations that provide
such functionality?

Thanks
-Doug

On Sat, Nov 17, 2018 at 3:41 PM jim ferenczi  wrote:

> You can easily customize the query that is used for synonyms in a custom
> QueryBuilder. The javadocs of the *newSynonymQuery* says "This is
> intended for subclasses that wish to customize the generated queries." so I
> don't think we need to do anything there. I agree that it is sometimes
> better to use something different than the SynonymQuery but in the general
> case it works as expected and can be combined with other terms naturally.
> The kind of customization you want to achieve could be done in a plugin (or
> in Solr or ES) that extends the QueryBuilder, you can also use custom token
> filters and alter the query the way you want. My point here is that the
> QueryBuilder should remain simple, you can add the complexity you want in a
> subclass.
> However I think there is another area we need to fix, the scoring of
> multi-terms synonyms is broken (compared to the SynonymQuery) and could be
> improved so we need something similar than the SynonymQuery that handles
> multi phrases.
>
>
> Le sam. 17 nov. 2018 à 07:19, Doug Turnbull <
> dturnb...@opensourceconnections.com> a écrit :
>
>> Yes that is another good area (there are many). Although of course
>> embeddings have their own challenges and complexities. (they often capture
>> shared context, but not shared meaning).
>>
>> It's a data point though of something we'd want to include in such a
>> framework, though not sure where it would go on the roadmap...
>>
>> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
>> wrote:
>>
>>> What about the use of word embeddings (see
>>>
>>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>>> to compute word similarity?
>>>
>>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>>> dturnb...@opensourceconnections.com> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> I wanted to open up a discussion about a change to the usage of
>>>> SynonymQuery. The goal here is to have a broader library of queries that
>>>> can address other cases where related terms occupy the same position but
>>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>>> ambiguous terms, and other query expansion situations).
>>>>
>>>>
>>>> I bring this up because we've noticed (as I'm sure many of you have)
>>>> the pattern of clients jamming any related term into a synonyms file and
>>>> being surprised with odd results. I like the idea of enforcing "synonyms"
>>>> means exactly-the-same in Lucene-land. It's an easy thing to tell a client
>>>> and setup simple patterns. So for synonyms, I think leaving SynonymQuery in
>>>> place works great.
>>>>
>>>> But I feel if that's the rule, we need to open up discussion of other
>>>> methods of scoring conceptual 'related term' relationships that usually
>>>> comes up in the context of query expansion. This paper (
>>>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>>> surveys the current thinking for scoring various query expansion scenarios
>>>> like those we deal with in the messy, ambiguous uses of synonyms in prod
>>>> systems (khakis aren't trousers, they're a kind-of trouser).
>>>>
>>>>
>>>> The cool thing is many of the ideas in this paper seem doable with
>>>> existing Lucene index stats. So one might imagine a 'related terms' token
>>>> filter that injected some scoring based on how related it really is to
>>>> the original query term using Jaccard, Dice, or other methods called out in
>>>> this paper.
>>>>
>>>>
>>>> Another insightful set of research is this article on concept scoring (
>>>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>>>> ), which prioritizes related terms by connectedness and other factors.
>>>>
>>>> Needless to say, it's an open area how two terms someone has asserted
>>>> are related to a query term 'should be' scored. It's one of those things
>>>> that likely will forever depend on a number of domain and application
>>>> specific factors. It's possibly a big opportunity of 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-17 Thread jim ferenczi
You can easily customize the query that is used for synonyms in a custom
QueryBuilder. The javadoc of *newSynonymQuery* says "This is intended
for subclasses that wish to customize the generated queries." so I don't
think we need to do anything there. I agree that it is sometimes better to
use something other than SynonymQuery, but in the general case it works as
expected and can be combined with other terms naturally. The kind of
customization you want to achieve could be done in a plugin (or in Solr or
ES) that extends the QueryBuilder; you can also use custom token filters
and alter the query the way you want. My point here is that the
QueryBuilder should remain simple; you can add the complexity you want in a
subclass.
However, I think there is another area we need to fix: the scoring of
multi-term synonyms is broken (compared to SynonymQuery) and could be
improved, so we need something similar to SynonymQuery that handles
multiple phrases.
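
A rough sketch of the scoring difference, as a Python model rather than
Lucene code. It assumes a simplified BM25 with no length normalisation.
blended_score models SynonymQuery as documented (the similarity is invoked
once on the summed term frequencies); pooling the document frequency as
the maximum is an assumption of this sketch. summed_score models scoring
each alternative separately and adding the results, which is only my
reading of what happens to expanded phrase alternatives. All function names
and statistics are made up for illustration.

```python
import math


def idf(doc_freq, num_docs):
    return math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25(tf, doc_freq, num_docs, k1=1.2):
    # Simplified BM25: document length normalisation is left out.
    return idf(doc_freq, num_docs) * tf * (k1 + 1) / (tf + k1)


def blended_score(tfs, doc_freqs, num_docs):
    # Treat all alternatives as one term: sum the frequencies, pool the df.
    return bm25(sum(tfs), max(doc_freqs), num_docs)


def summed_score(tfs, doc_freqs, num_docs):
    # Score every alternative on its own and add the scores up.
    return sum(bm25(tf, df, num_docs) for tf, df in zip(tfs, doc_freqs) if tf > 0)


# One document containing both alternatives once; the second one is rare.
tfs, doc_freqs, num_docs = [1, 1], [100, 10], 1000
blended = blended_score(tfs, doc_freqs, num_docs)
summed = summed_score(tfs, doc_freqs, num_docs)
```

In this model the summed score is never lower than the blended one, and a
rare alternative inflates it further, so a document is rewarded simply for
containing several variants of what is supposed to be the same concept.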


Le sam. 17 nov. 2018 à 07:19, Doug Turnbull <
dturnb...@opensourceconnections.com> a écrit :

> Yes that is another good area (there are many). Although of course
> embeddings have their own challenges and complexities. (they often capture
> shared context, but not shared meaning).
>
> It's a data point though of something we'd want to include in such a
> framework, though not sure where it would go on the roadmap...
>
> On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
> wrote:
>
>> What about the use of word embeddings (see
>>
>> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
>> to compute word similarity?
>>
>> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
>> dturnb...@opensourceconnections.com> wrote:
>>
>>> Hey folks,
>>>
>>> I wanted to open up a discussion about a change to the usage of
>>> SynonymQuery. The goal here is to have a broader library of queries that
>>> can address other cases where related terms occupy the same position but
>>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>>> ambiguous terms, and other query expansion situations).
>>>
>>>
>>> I bring this up because we've noticed (as I'm sure many of you have) the
>>> pattern of clients jamming any related term into a synonyms file and being
>>> surprised with odd results. I like the idea of enforcing "synonyms" means
>>> exactly-the-same in Lucene-land. It's an easy thing to tell a client and
>>> setup simple patterns. So for synonyms, I think leaving SynonymQuery in
>>> place works great.
>>>
>>> But I feel if that's the rule, we need to open up discussion of other
>>> methods of scoring conceptual 'related term' relationships that usually
>>> comes up in the context of query expansion. This paper (
>>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2,
>>> surveys the current thinking for scoring various query expansion scenarios
>>> like those we deal with in the messy, ambiguous uses of synonyms in prod
>>> systems (khakis aren't trousers, they're a kind-of trouser).
>>>
>>>
>>> The cool thing is many of the ideas in this paper seem doable with
>>> existing Lucene index stats. So one might imagine a 'related terms' token
>>> filter that injected some scoring based on how related it really is to
>>> the original query term using Jaccard, Dice, or other methods called out in
>>> this paper.
>>>
>>>
>>> Another insightful set of research is this article on concept scoring (
>>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>>> ), which prioritizes related terms by connectedness and other factors.
>>>
>>> Needless to say, it's an open area how two terms someone has asserted
>>> are related to a query term 'should be' scored. It's one of those things
>>> that likely will forever depend on a number of domain and application
>>> specific factors. It's possibly a big opportunity of improvement for Lucene
>>> - but likely is about putting the right framework in place to allow for
>>> good default set of query-expansion scoring scenarios with options for
>>> customization.
>>>
>>> What I'm proposing is:
>>>
>>>
>>>    - Submit a small patch that restricts SynonymQuery to tokens of type
>>>      "SYNONYM" in the same posn, which allows some short term work to be
>>>      done with the current Lucene QueryBuilder. Any additional non-synonym
>>>      terms would be appended as a boolean query for now
>>>
>>>    - Begin work on alternate 'related-term' scoring systems that also key
>>>      off the token type in QueryBuilder to create custom scoring using
>>>      built-in term stats. The possibilities here are endless, up to weighted
>>>      related terms (ie Alessandro's patch), feeding back Rocchio relevance
>>>      feedback, etc
>>>
>>>
>>> I'm curious what folks would think of a patch for bullet one followed by
>>> other patches down the road for additional functionality?
>>>
>>> (related to discussion in this Elasticsearch PR
>>>
>>>
>>> 

Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-16 Thread Doug Turnbull
Yes, that is another good area (there are many), although of course
embeddings have their own challenges and complexities (they often capture
shared context, but not shared meaning).

It's a data point, though, of something we'd want to include in such a
framework, though I'm not sure where it would go on the roadmap...

On Sat, Nov 17, 2018 at 1:15 AM J. Delgado 
wrote:

> What about the use of word embeddings (see
>
> https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
> to compute word similarity?
>
> On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
>
>> Hey folks,
>>
>> I wanted to open up a discussion about a change to the usage of
>> SynonymQuery. The goal here is to have a broader library of queries that
>> can address other cases where related terms occupy the same position but
>> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
>> ambiguous terms, and other query expansion situations).
>>
>>
>> I bring this up because we've noticed (as I'm sure many of you have) the
>> pattern of clients jamming any related term into a synonyms file and being
>> surprised with odd results. I like the idea of enforcing "synonyms" means
>> exactly-the-same in Lucene-land. It's an easy thing to tell a client and
>> setup simple patterns. So for synonyms, I think leaving SynonymQuery in
>> place works great.
>>
>> But I feel if that's the rule, we need to open up discussion of other
>> methods of scoring conceptual 'related term' relationships that usually
>> comes up in the context of query expansion. This paper (
>> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
>> the current thinking for scoring various query expansion scenarios like
>> those we deal with in the messy, ambiguous uses of synonyms in prod systems
>> (khakis aren't trousers, they're a kind-of trouser).
>>
>>
>> The cool thing is many of the ideas in this paper seem doable with
>> existing Lucene index stats. So one might imagine a 'related terms' token
>> filter that injected some scoring based on how related it really is to
>> the original query term using Jaccard, Dice, or other methods called out in
>> this paper.
>>
>>
>> Another insightful set of research is this article on concept scoring (
>> https://usabilityetc.com/articles/information-retrieval-concept-matching/
>> ), which prioritizes related terms by connectedness and other factors.
>>
>> Needless to say, it's an open area how two terms someone has asserted are
>> related to a query term 'should be' scored. It's one of those things that
>> likely will forever depend on a number of domain and application specific
>> factors. It's possibly a big opportunity of improvement for Lucene - but
>> likely is about putting the right framework in place to allow for good
>> default set of query-expansion scoring scenarios with options for
>> customization.
>>
>> What I'm proposing is:
>>
>>
>>    - Submit a small patch that restricts SynonymQuery to tokens of type
>>      "SYNONYM" in the same posn, which allows some short term work to be
>>      done with the current Lucene QueryBuilder. Any additional non-synonym
>>      terms would be appended as a boolean query for now
>>
>>    - Begin work on alternate 'related-term' scoring systems that also key
>>      off the token type in QueryBuilder to create custom scoring using
>>      built-in term stats. The possibilities here are endless, up to weighted
>>      related terms (ie Alessandro's patch), feeding back Rocchio relevance
>>      feedback, etc
>>
>>
>> I'm curious what folks would think of a patch for bullet one followed by
>> other patches down the road for additional functionality?
>>
>> (related to discussion in this Elasticsearch PR
>>
>> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
>> )
>>
>> --
>> CTO, OpenSource Connections
>> Author, Relevant Search
>> http://o19s.com/doug
>>
--
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug


Re: SynonymQuery / Query Expansion Strategies Discussion

2018-11-16 Thread J. Delgado
What about the use of word embeddings (see
https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
to compute word similarity?
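
For illustration, a minimal sketch of what that could look like, assuming
pre-trained vectors are available as plain lists of floats. The three
vectors below are made up and far too small to be real embeddings, and
cosine and related_terms are hypothetical helper names.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 3-dimensional vectors, invented purely for this example.
vectors = {
    "khakis": [0.9, 0.1, 0.3],
    "trousers": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.0],
}


def related_terms(term, vectors, threshold=0.8):
    """Return (term, similarity) pairs above the threshold, best first."""
    base = vectors[term]
    scored = [(other, cosine(base, vec))
              for other, vec in vectors.items() if other != term]
    kept = [(t, s) for t, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: -pair[1])


expansions = related_terms("khakis", vectors)
```

The similarity value could serve as a candidate weight for the expanded
term, and the threshold is a knob that would need tuning per domain.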

On Sat, Nov 17, 2018 at 5:52 AM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Hey folks,
>
> I wanted to open up a discussion about a change to the usage of
> SynonymQuery. The goal here is to have a broader library of queries that
> can address other cases where related terms occupy the same position but
> don't have the same meaning (such as hypernyms, hyponyms, meronyms,
> ambiguous terms, and other query expansion situations).
>
>
> I bring this up because we've noticed (as I'm sure many of you have) the
> pattern of clients jamming any related term into a synonyms file and being
> surprised with odd results. I like the idea of enforcing "synonyms" means
> exactly-the-same in Lucene-land. It's an easy thing to tell a client and
> setup simple patterns. So for synonyms, I think leaving SynonymQuery in
> place works great.
>
> But I feel if that's the rule, we need to open up discussion of other
> methods of scoring conceptual 'related term' relationships that usually
> comes up in the context of query expansion. This paper (
> https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
> the current thinking for scoring various query expansion scenarios like
> those we deal with in the messy, ambiguous uses of synonyms in prod systems
> (khakis aren't trousers, they're a kind-of trouser).
>
>
> The cool thing is many of the ideas in this paper seem doable with
> existing Lucene index stats. So one might imagine a 'related terms' token
> filter that injected some scoring based on how related it really is to
> the original query term using Jaccard, Dice, or other methods called out in
> this paper.
>
>
> Another insightful set of research is this article on concept scoring (
> https://usabilityetc.com/articles/information-retrieval-concept-matching/),
> which prioritizes related terms by connectedness and other factors.
>
> Needless to say, it's an open area how two terms someone has asserted are
> related to a query term 'should be' scored. It's one of those things that
> likely will forever depend on a number of domain and application specific
> factors. It's possibly a big opportunity of improvement for Lucene - but
> likely is about putting the right framework in place to allow for good
> default set of query-expansion scoring scenarios with options for
> customization.
>
> What I'm proposing is:
>
>
>    - Submit a small patch that restricts SynonymQuery to tokens of type
>      "SYNONYM" in the same posn, which allows some short term work to be
>      done with the current Lucene QueryBuilder. Any additional non-synonym
>      terms would be appended as a boolean query for now
>
>    - Begin work on alternate 'related-term' scoring systems that also key
>      off the token type in QueryBuilder to create custom scoring using
>      built-in term stats. The possibilities here are endless, up to weighted
>      related terms (ie Alessandro's patch), feeding back Rocchio relevance
>      feedback, etc
>
>
> I'm curious what folks would think of a patch for bullet one followed by
> other patches down the road for additional functionality?
>
> (related to discussion in this Elasticsearch PR
>
> https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249
> )
>
> --
> CTO, OpenSource Connections
> Author, Relevant Search
> http://o19s.com/doug
>


SynonymQuery / Query Expansion Strategies Discussion

2018-11-16 Thread Doug Turnbull
Hey folks,

I wanted to open up a discussion about a change to the usage of
SynonymQuery. The goal here is to have a broader library of queries that
can address other cases where related terms occupy the same position but
don't have the same meaning (such as hypernyms, hyponyms, meronyms,
ambiguous terms, and other query expansion situations).


I bring this up because we've noticed (as I'm sure many of you have) the
pattern of clients jamming any related term into a synonyms file and being
surprised by odd results. I like the idea of enforcing that "synonyms" means
exactly-the-same in Lucene-land. It's an easy rule to tell a client and to
set up simple patterns around. So for synonyms, I think leaving SynonymQuery
in place works great.

But I feel if that's the rule, we need to open up discussion of other
methods of scoring the conceptual 'related term' relationships that usually
come up in the context of query expansion. This paper (
https://arxiv.org/pdf/1708.00247.pdf), particularly section 3.2, surveys
the current thinking for scoring various query expansion scenarios like
those we deal with in the messy, ambiguous uses of synonyms in prod systems
(khakis aren't trousers, they're a kind-of trouser).


The cool thing is many of the ideas in this paper seem doable with existing
Lucene index stats. So one might imagine a 'related terms' token filter
that injected some scoring based on how related it really is to the
original query term using Jaccard, Dice, or other methods called out in
this paper.
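
As a rough illustration of the Jaccard and Dice measures, the sketch below
computes both from three counts: the document frequency of each term and
the number of documents containing both. It is Python rather than Lucene
code, the function names are hypothetical, and the postings sets are made
up; in a real index the counts would have to come from term statistics
plus a conjunction count.

```python
def jaccard(df_a, df_b, df_both):
    """|A and B| / |A or B|, from document frequencies."""
    union = df_a + df_b - df_both
    return df_both / union if union else 0.0


def dice(df_a, df_b, df_both):
    """2 * |A and B| / (|A| + |B|), from document frequencies."""
    total = df_a + df_b
    return 2 * df_both / total if total else 0.0


# Invented doc-id sets standing in for the postings of two terms.
postings = {
    "khakis": {1, 2, 3, 4},
    "trousers": {3, 4, 5, 6, 7, 8},
}

df_a = len(postings["khakis"])
df_b = len(postings["trousers"])
df_both = len(postings["khakis"] & postings["trousers"])

j = jaccard(df_a, df_b, df_both)
d = dice(df_a, df_b, df_both)
```

Either value could then serve as the weight of the related term relative
to the original query term.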


Another insightful set of research is this article on concept scoring (
https://usabilityetc.com/articles/information-retrieval-concept-matching/),
which prioritizes related terms by connectedness and other factors.

Needless to say, how two terms that someone has asserted are related to a
query term 'should be' scored is an open area. It's one of those things that
will likely forever depend on a number of domain- and application-specific
factors. It's possibly a big opportunity for improvement in Lucene, but it
is likely about putting the right framework in place to allow for a good
default set of query-expansion scoring scenarios with options for
customization.

What I'm proposing is:


   - Submit a small patch that restricts SynonymQuery to tokens of type
     "SYNONYM" in the same position, which allows some short-term work to be
     done with the current Lucene QueryBuilder. Any additional non-synonym
     terms would be appended as a boolean query for now.

   - Begin work on alternate 'related-term' scoring systems that also key off
     the token type in QueryBuilder to create custom scoring using built-in
     term stats. The possibilities here are endless, up to weighted related
     terms (i.e. Alessandro's patch), feeding back Rocchio relevance
     feedback, etc.
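
To sketch bullet one in a language-neutral way (Python, not Lucene code):
tokens at the same position are split by type, with "SYNONYM" tokens
grouped together and everything else appended as optional clauses. The
sketch assumes the original term keeps the default "word" type and belongs
in the synonym group; the "HYPONYM" type and the split_by_type name are
hypothetical.

```python
# Assumption: the original term carries the default "word" type and is
# grouped together with tokens typed "SYNONYM".
SYNONYM_TYPES = {"word", "SYNONYM"}


def split_by_type(tokens):
    """tokens: iterable of (text, position, token_type) tuples."""
    positions = {}
    for text, position, token_type in tokens:
        clause = positions.setdefault(position, {"synonym": [], "should": []})
        key = "synonym" if token_type in SYNONYM_TYPES else "should"
        clause[key].append(text)
    # One clause per position, in position order.
    return [positions[p] for p in sorted(positions)]


tokens = [
    ("trousers", 0, "word"),
    ("pants", 0, "SYNONYM"),
    ("khakis", 0, "HYPONYM"),  # related, but not exactly-the-same
    ("sale", 1, "word"),
]
clauses = split_by_type(tokens)
```

The "synonym" list is what would feed a SynonymQuery, while the "should"
list would become separate optional clauses that bullet two could later
score differently.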


I'm curious what folks would think of a patch for bullet one followed by
other patches down the road for additional functionality?

(related to discussion in this Elasticsearch PR

https://github.com/elastic/elasticsearch/pull/35422#issuecomment-439095249)

-- 
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug