Re: Shingles behavior

2020-05-21 Thread Radu Gheorghe
It turns out the fix is to set enableGraphQueries="false" on the field type
definition. I completely missed that :(
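
For readers landing on this message first, here is a toy Python model of the difference. It is not Solr's code. It assumes that with graph queries enabled a document must match every term along at least one path through the shingle token graph (which is the shape of the parsed query in the original message), and that with them disabled any single shared shingle is enough. All function names are made up for illustration.

```python
def doc_terms(text, max_size=4):
    """Terms produced by shingling: every run of 1..max_size adjacent words."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for i in range(len(words))
        for n in range(1, max_size + 1)
        if i + n <= len(words)
    }


def paths(words, max_size=4):
    """Every way to cut the query into consecutive shingles (one list per path)."""
    if not words:
        return [[]]
    found = []
    for n in range(1, min(max_size, len(words)) + 1):
        head = " ".join(words[:n])
        found.extend([head] + rest for rest in paths(words[n:], max_size))
    return found


def graph_match(doc, query):
    """Assumed graph behaviour: all terms of at least one path must be present."""
    terms = doc_terms(doc)
    return any(all(t in terms for t in path) for path in paths(query.lower().split()))


def any_term_match(doc, query):
    """Assumed behaviour without graph queries: one shared shingle is enough."""
    return bool(doc_terms(doc) & doc_terms(query))


docs = ["mona lisa smile", "mona lisa", "mona"]
query = "Mona Lisa smile"
graph_hits = [d for d in docs if graph_match(d, query)]
any_hits = [d for d in docs if any_term_match(d, query)]
print("graph-style match:", graph_hits)
print("any-term match:", any_hits)
```

In this model only the first document survives the path-based check, while all three survive the any-term check, which matches what the thread reports before and after the change.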




Re: Shingles behavior

2020-05-20 Thread Radu Gheorghe
Hi Alex, long time no see :)

I tried with sow; setting it to true basically invalidates query-time
shingles (it only matches mona OR lisa OR smile).
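
A short Python sketch of why splitting on whitespace has that effect. It assumes that with sow=true each whitespace-separated word goes through the analysis chain on its own, so the shingle filter never sees two adjacent words. The analyze function is a stand-in for the real analyzer, not a Solr API.

```python
def analyze(text, max_size=4):
    """Stand-in analysis chain: lowercase, split on whitespace, then shingle."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for i in range(len(words))
        for n in range(1, max_size + 1)
        if i + n <= len(words)
    ]


query = "Mona Lisa smile"

# sow=false: the whole query string is analyzed in one pass.
together = analyze(query)

# sow=true (assumed): each word is analyzed by itself, so no shingles can form.
split_first = [term for word in query.split() for term in analyze(word)]

print(together)
print(split_first)
```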

I'm using shingles at both index and query time as a substitute for pf2 and
pf3: the more shingles I match, the more relevant the document. Also,
higher order shingles naturally get lower frequencies, meaning they get a
"natural" boost.

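The "natural" boost can be illustrated on the three sample documents from the original message. This is a rough Python sketch that assumes BM25-style scoring and uses the idf formula ln(1 + (N - n + 0.5) / (n + 0.5)), which is what Lucene's BM25 similarity uses as far as I know. It ignores every other part of scoring.

```python
import math


def shingles(text, max_size=4):
    """Set of shingles (1..max_size adjacent words) for one document."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for i in range(len(words))
        for n in range(1, max_size + 1)
        if i + n <= len(words)
    }


docs = ["mona lisa smile", "mona lisa", "mona"]
indexed = [shingles(d) for d in docs]


def idf(term):
    """BM25-style inverse document frequency over the tiny sample corpus."""
    n = sum(term in terms for terms in indexed)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))


for term in ["mona", "mona lisa", "mona lisa smile"]:
    print(term, round(idf(term), 3))
```

On this corpus the longer the shingle, the fewer documents contain it, so its idf comes out higher.
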
Best regards,
Radu



Re: Shingles behavior

2020-05-20 Thread Alexandre Rafalovitch
Did you try it with the 'sow' parameter both ways? I am not sure I fully
understand the question, especially with shingling on both passes
rather than just at index time. But at least it is something to try, and
it is one of the areas where Solr and ES differ.

Regards,
   Alex.



Shingles behavior

2020-05-19 Thread Radu Gheorghe
Hello Solr users,

I’m quite puzzled about how shingles work. The way tokens are analysed looks 
fine to me, but the query seems too restrictive.

Here’s the sample use-case. I have three documents:

mona lisa smile
mona lisa
mona

I have a shingle filter set up like this (both index- and query-time):

> <filter class="solr.ShingleFilterFactory" [...] maxShingleSize="4"/>

When I query for “Mona Lisa smile” (no quotes), I expect to get all three 
documents back, in that order, because the first document matches all the terms:

mona
mona lisa
mona lisa smile
lisa
lisa smile
smile

The second document matches only some of them, and the third matches only one.
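
The expectation above can be checked with a few lines of Python. This is only a sketch: it assumes lowercasing and whitespace tokenization in front of the shingle filter, assumes unigrams are emitted as well, and simply counts how many of the query's shingles each document contains.

```python
def shingles(text, max_size=4):
    """List shingles in the order shown above: by start word, then by length."""
    words = text.lower().split()
    out = []
    for start in range(len(words)):
        for size in range(1, max_size + 1):
            if start + size <= len(words):
                out.append(" ".join(words[start:start + size]))
    return out


query_terms = shingles("Mona Lisa smile")
docs = ["mona lisa smile", "mona lisa", "mona"]

# Number of query shingles that also appear among each document's shingles.
counts = {doc: sum(term in shingles(doc) for term in query_terms) for doc in docs}

print(query_terms)
print(counts)
```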

Instead, I only get the first document back. That’s because the query expects 
all the “words” to match:

> "parsedquery":"+DisjunctionMaxQuery((((+shingle_field:mona 
> +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona 
> +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile) 
> shingle_field:mona lisa smile)))",

The query above is generated by the Edismax query parser, when I’m using 
“shingle_field” as “df”.

Is there a way to get “any of the words” to match? I’ve tried all the options I 
can think of:
- different query parsers
- q.op=OR
- mm=0 (or 1 or 0% or 10% or…)

Nothing seems to change the parsed query from the above.
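
For reference, the kind of request described here could be assembled like this in Python. The host, port and collection name are hypothetical, and debugQuery=true is an assumption (it is the usual way to get parsedquery back in the response). The remaining parameters are the ones mentioned above, with defType=edismax selecting the Edismax parser.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the real host and collection.
BASE_URL = "http://localhost:8983/solr/shingle_test/select"


def build_url(mm, q_op="OR"):
    """Build a select URL for one combination of q.op and mm."""
    params = {
        "q": "Mona Lisa smile",
        "defType": "edismax",
        "df": "shingle_field",
        "q.op": q_op,
        "mm": mm,
        "debugQuery": "true",  # assumption: used to inspect parsedquery
    }
    return BASE_URL + "?" + urlencode(params)


# The mm values listed above, one request each.
urls = [build_url(mm) for mm in ("0", "1", "0%", "10%")]
for url in urls:
    print(url)
```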

I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by 
default, and minimum_should_match works as expected. The only difference I see 
between the two, on the analysis side, is that tokens start at 0 in 
Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see that 
the default “text_en”, for example, also starts at position 1.

Is it just a bug that mm doesn’t work in the context of shingles? Or is there a 
workaround?

Thanks and best regards,
Radu