Re: Shingles behavior
Turns out, it’s down to setting enableGraphQueries=false in the field definition. I completely missed that :(

> On 21 May 2020, at 07:49, Radu Gheorghe wrote:
>
> Hi Alex, long time no see :)
>
> I tried with sow, and that basically invalidates query-time shingles (it only
> matches mona OR lisa OR smile).
>
> I'm using shingles at both index and query time as a substitute for pf2 and
> pf3: the more shingles I match, the more relevant the document. Also, higher
> order shingles naturally get lower frequencies, meaning they get a "natural"
> boost.
>
> Best regards,
> Radu
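For readers landing on this thread, here is a minimal sketch of where the enableGraphQueries=false setting mentioned at the top of this reply might go in the schema. Only shingle_field, maxShingleSize="4" and enableGraphQueries="false" come from the thread; the field type name text_shingles, the tokenizer, the lowercase filter, minShingleSize, outputUnigrams and positionIncrementGap are assumptions for illustration, not the original configuration.

```xml
<!-- Hypothetical field type. The name, tokenizer, lowercase filter,
     minShingleSize, outputUnigrams and positionIncrementGap are assumed;
     the thread only confirms maxShingleSize="4" and enableGraphQueries="false". -->
<fieldType name="text_shingles" class="solr.TextField"
           positionIncrementGap="100"
           enableGraphQueries="false">
  <!-- One analyzer for both index and query time, as described in the thread -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits unigrams plus shingles of 2 to 4 tokens -->
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2" maxShingleSize="4"
            outputUnigrams="true"/>
  </analyzer>
</fieldType>

<field name="shingle_field" type="text_shingles" indexed="true" stored="true"/>
```

With graph queries disabled on the field type, the intent is that overlapping shingles from the query analyzer become independent optional clauses instead of all-terms-required paths, so that mm and the default operator can apply to them.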
Re: Shingles behavior
Hi Alex, long time no see :)

I tried with sow, and that basically invalidates query-time shingles (it only matches mona OR lisa OR smile).

I'm using shingles at both index and query time as a substitute for pf2 and pf3: the more shingles I match, the more relevant the document. Also, higher-order shingles naturally get lower frequencies, meaning they get a "natural" boost.

Best regards,
Radu

Thu, 21 May 2020, 00:28 Alexandre Rafalovitch wrote:

> Did you try it with 'sow' parameter both ways? I am not sure I fully
> understand the question, especially with shingling on both passes
> rather than just indexing one. But at least it is something to try and
> is one of the difference areas between Solr and ES.
>
> Regards,
> Alex.
Re: Shingles behavior
Did you try it with 'sow' parameter both ways? I am not sure I fully understand the question, especially with shingling on both passes rather than just indexing one. But at least it is something to try and is one of the difference areas between Solr and ES.

Regards,
Alex.

On Tue, 19 May 2020 at 05:59, Radu Gheorghe wrote:
>
> Hello Solr users,
>
> I’m quite puzzled about how shingles work. The way tokens are analysed looks
> fine to me, but the query seems too restrictive.
>
> Here’s the sample use-case. I have three documents:
>
> mona lisa smile
> mona lisa
> mona
>
> I have a shingle filter set up like this (both index- and query-time):
>
> maxShingleSize=“4”/>
>
> When I query for “Mona Lisa smile” (no quotes), I expect to get all three
> documents back, in that order. Because the first document matches all the
> terms:
>
> mona
> mona lisa
> mona lisa smile
> lisa
> lisa smile
> smile
>
> And the second one matches only some, and the third document only matches one.
>
> Instead, I only get the first document back. That’s because the query expects
> all the “words” to match:
>
> > "parsedquery":"+DisjunctionMaxQuery((((+shingle_field:mona
> > +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona
> > +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile)
> > shingle_field:mona lisa smile)))",
>
> The query above is generated by the Edismax query parser, when I’m using
> “shingle_field” as “df”.
>
> Is there a way to get “any of the words” to match? I’ve tried all the options
> I can think of:
> - different query parsers
> - q.OP=OR
> - mm=0 (or 1 or 0% or 10% or…)
>
> Nothing seems to change the parsed query from the above.
>
> I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by
> default, and minimum_should_match works as expected. The only difference I
> see between the two, on the analysis side, is that tokens start at 0 in
> Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see
> that the default “text_en”, for example, also starts at position 1.
>
> Is it just a bug that mm doesn’t work in the context of shingles? Or is there
> a workaround?
>
> Thanks and best regards,
> Radu
Shingles behavior
Hello Solr users,

I’m quite puzzled about how shingles work. The way tokens are analysed looks fine to me, but the query seems too restrictive.

Here’s the sample use-case. I have three documents:

mona lisa smile
mona lisa
mona

I have a shingle filter set up like this (both index- and query-time):

maxShingleSize=“4”/>

When I query for “Mona Lisa smile” (no quotes), I expect to get all three documents back, in that order. Because the first document matches all the terms:

mona
mona lisa
mona lisa smile
lisa
lisa smile
smile

And the second one matches only some, and the third document only matches one.

Instead, I only get the first document back. That’s because the query expects all the “words” to match:

> "parsedquery":"+DisjunctionMaxQuery((((+shingle_field:mona
> +usage_query_view_tags:lisa +shingle_field:smile) (+shingle_field:mona
> +shingle_field:lisa smile) (+shingle_field:mona lisa +shingle_field:smile)
> shingle_field:mona lisa smile)))",

The query above is generated by the Edismax query parser, when I’m using “shingle_field” as “df”.

Is there a way to get “any of the words” to match? I’ve tried all the options I can think of:
- different query parsers
- q.OP=OR
- mm=0 (or 1 or 0% or 10% or…)

Nothing seems to change the parsed query from the above.

I’ve compared this to the behaviour of Elasticsearch. There, I get “OR” by default, and minimum_should_match works as expected. The only difference I see between the two, on the analysis side, is that tokens start at 0 in Elasticsearch and at 1 in Solr. I doubt that’s the problem, because I see that the default “text_en”, for example, also starts at position 1.

Is it just a bug that mm doesn’t work in the context of shingles? Or is there a workaround?

Thanks and best regards,
Radu