Re: DisjunctionMinQuery

2023-11-09 Thread Marc D'Mello
Hi all,

Once again, thanks for the responses! After thinking about this a bit more,
I think Michael's response makes sense now. I do agree that partial matches
shouldn't be ranked higher than conjunctive matches, so I think it doesn't
make sense in my use case to use a DisjunctionMinQuery (I think I would
need an AndMinQuery or something like that). This also answers my initial
question.

I did have a question about this though:

> in that case you should use something like 1/x as your scoring function
> in the sub-clauses
>

Doesn't using 1/x as a scoring function, even in the subclauses, still
cause an issue where the output score will be inversely correlated to the
indexed term score? I think that would break BMW right? Or maybe I am
misunderstanding the suggestion.
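
A minimal sketch of the concern, in plain Java rather than Lucene code (the
class and method names are hypothetical and do not exist in Lucene). It
assumes block-max skipping derives a block's score upper bound from the
largest indexed term score in that block. With a decreasing function such as
1/x, the largest output comes from the smallest indexed score, so a bound
taken from the block maximum would be too low:

```java
import java.util.Arrays;

/** Hypothetical illustration only; none of these names are Lucene APIs. */
final class InverseScoreBoundSketch {

    /** Reciprocal scoring function f(x) = 1/x, intended for x > 0. */
    static double inverse(double x) {
        return 1.0 / x;
    }

    /** Largest f(x) actually attained over a block of indexed term scores. */
    static double trueMaxOfInverse(double[] blockScores) {
        return Arrays.stream(blockScores)
                .map(InverseScoreBoundSketch::inverse)
                .max()
                .orElse(0.0);
    }

    /** Naive bound: apply f to the block's maximum indexed score. */
    static double boundFromBlockMax(double[] blockScores) {
        double maxIndexed = Arrays.stream(blockScores).max().orElse(Double.NaN);
        return inverse(maxIndexed);
    }

    /** Bound that stays valid for a decreasing f: apply f to the block's minimum. */
    static double boundFromBlockMin(double[] blockScores) {
        double minIndexed = Arrays.stream(blockScores).min().orElse(Double.NaN);
        return inverse(minIndexed);
    }
}
```

Under these assumptions, 1/x stays compatible with skipping only if the
per-block metadata can also supply a lower bound on the indexed score, which
is a different quantity from the one a max-based bound relies on.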

Thanks,
Marc

On Thu, Nov 9, 2023 at 10:18 AM Uwe Schindler wrote:

> Hi,
>
> in that case you should use something like 1/x as your scoring function
> in the sub-clauses. In Lucene scores should go up for more relevancy.
> This must also apply for function scoring.
>
> Uwe
>
> On 09.11.2023 at 19:14, Marc D'Mello wrote:
> > Hi Michael,
> >
> > Thanks for the response! So to answer your first question, yes this would
> > keep the lowest score from the matching sub-scorers. Our use case is that
> > we have a custom term-level score overriding term frequency and we want to
> > take the min of that as part of our scoring function. Maybe it's a niche
> > use case?
> >
> > Thanks,
> > Marc
> >
> > On Wed, Nov 8, 2023 at 3:19 PM Michael Froh wrote:
> >
> >> Hi Marc,
> >>
> >> Can you clarify what the semantics of a DisjunctionMinQuery would be? Would
> >> you keep the score for the *lowest* scoring disjunct (plus some tiebreaker
> >> applied to the other matching disjuncts)?
> >>
> >> I'm trying to imagine how that would work compared to the classic DisMax
> >> use-case. Say I'm searching for "dalmatian" using a DisMax query over term
> >> queries against title and body. A match on title is probably going to score
> >> higher than a match against the body, just because the title has a shorter
> >> length (and the doc frequency of individual terms in the title is likely to
> >> be lower, since there are fewer terms overall). With DisMax, a match on
> >> title alone will score higher than a match on body, and the tie-break will
> >> tend to score a match on title and body higher than a match on title alone.
> >>
> >> With a DisMin (assuming you keep the lowest score), then a match on title
> >> and body would probably score lower than a match on title alone. That feels
> >> weird to me, but I might be missing the use-case.
> >>
> >> How would you use a DisMinQuery?
> >>
> >> Thanks,
> >> Froh
> >>
> >>
> >>
> >> On Wed, Nov 8, 2023 at 10:50 AM Marc D'Mello wrote:
> >>
> >>> Hi all,
> >>>
> >>> I noticed we have a DisjunctionMaxQuery
> >>> <https://github.com/apache/lucene/blob/branch_9_7/lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxQuery.java>
> >>> but not a corresponding DisjunctionMinQuery. I was just wondering if there was
> >>> a specific reason for that? Or is it just that it is not a common query to
> >>> use?
> >>>
> >>> Thanks!
> >>> Marc
> >>>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: DisjunctionMinQuery

2023-11-09 Thread Marc D'Mello
Hi Michael,

Thanks for the response! So to answer your first question, yes this would
keep the lowest score from the matching sub-scorers. Our use case is that
we have a custom term-level score overriding term frequency and we want to
take the min of that as part of our scoring function. Maybe it's a niche
use case?
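
To make the semantics under discussion concrete, here is a minimal sketch in
plain Java. The disMax method follows the usual DisjunctionMaxQuery
combination (best sub-score plus a tie-breaker multiple of the other
sub-scores). The disMin method is a hypothetical mirror image; it is an
assumption about what a DisjunctionMinQuery could compute, not an existing
Lucene API:

```java
import java.util.Arrays;

/** Hypothetical illustration only; these are not Lucene classes. */
final class DisjunctionScoreSketch {

    /** DisMax-style: best sub-score plus tieBreaker times the sum of the others. */
    static double disMax(double tieBreaker, double... subScores) {
        double max = Arrays.stream(subScores).max().orElse(0.0);
        double sum = Arrays.stream(subScores).sum();
        return max + tieBreaker * (sum - max);
    }

    /** Assumed DisMin-style: lowest sub-score plus tieBreaker times the sum of the others. */
    static double disMin(double tieBreaker, double... subScores) {
        double min = Arrays.stream(subScores).min().orElse(0.0);
        double sum = Arrays.stream(subScores).sum();
        return min + tieBreaker * (sum - min);
    }
}
```

With made-up scores of 3.0 for a title match and 1.0 for a body match and a
tie-breaker of 0, disMin gives 1.0 for a document matching both fields but
3.0 for a document matching the title alone, which is the ordering Michael
found odd.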

Thanks,
Marc

On Wed, Nov 8, 2023 at 3:19 PM Michael Froh wrote:

> Hi Marc,
>
> Can you clarify what the semantics of a DisjunctionMinQuery would be? Would
> you keep the score for the *lowest* scoring disjunct (plus some tiebreaker
> applied to the other matching disjuncts)?
>
> I'm trying to imagine how that would work compared to the classic DisMax
> use-case. Say I'm searching for "dalmatian" using a DisMax query over term
> queries against title and body. A match on title is probably going to score
> higher than a match against the body, just because the title has a shorter
> length (and the doc frequency of individual terms in the title is likely to
> be lower, since there are fewer terms overall). With DisMax, a match on
> title alone will score higher than a match on body, and the tie-break will
> tend to score a match on title and body higher than a match on title alone.
>
> With a DisMin (assuming you keep the lowest score), then a match on title
> and body would probably score lower than a match on title alone. That feels
> weird to me, but I might be missing the use-case.
>
> How would you use a DisMinQuery?
>
> Thanks,
> Froh
>
>
>
> On Wed, Nov 8, 2023 at 10:50 AM Marc D'Mello wrote:
>
> > Hi all,
> >
> > I noticed we have a DisjunctionMaxQuery
> > <https://github.com/apache/lucene/blob/branch_9_7/lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxQuery.java>
> > but not a corresponding DisjunctionMinQuery. I was just wondering if there was
> > a specific reason for that? Or is it just that it is not a common query to
> > use?
> >
> > Thanks!
> > Marc
> >
>


DisjunctionMinQuery

2023-11-08 Thread Marc D'Mello
Hi all,

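I noticed we have a DisjunctionMaxQuery
<https://github.com/apache/lucene/blob/branch_9_7/lucene/core/src/java/org/apache/lucene/search/DisjunctionMaxQuery.java>
but not a corresponding DisjunctionMinQuery. I was just wondering if there was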
a specific reason for that? Or is it just that it is not a common query to
use?

Thanks!
Marc


Re: When to use StringField and when to use FacetField for categorization?

2023-10-20 Thread Marc D'Mello
Just following up on Mike's comment:


> It used to be that the "doc values" based faceting did not support
> arbitrary hierarchy, but I think that was fixed at some point.


Yeah, it was fixed a year or two ago: SortedSetDocValuesFacetField supports
hierarchical faceting, and I think you just need to enable it in the
FacetsConfig. One thing to keep in mind is that even though SSDV faceting
doesn't require a taxonomy index, it still requires a
SortedSetDocValuesReaderState to be maintained, which can be a little bit
expensive to create, but only needs to be done once. This benchmark code

serves as a pretty basic example of SSDV/hierarchical SSDV faceting.
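
As a rough picture of what hierarchical facet counts mean, here is a plain
Java sketch that does not use the Lucene facet module. The class name, the
"/"-joined labels, and the one-path-per-document simplification are all
assumptions made for illustration. Each path contributes to every one of its
prefixes, so a parent node aggregates its children:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; this is not how Lucene stores or counts facets. */
final class HierarchicalCountSketch {

    /**
     * Counts every prefix of each path. Assumes each entry comes from a
     * different document, so no document is counted twice under a parent.
     */
    static Map<String, Integer> countPrefixes(List<String[]> paths) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] path : paths) {
            StringBuilder prefix = new StringBuilder();
            for (String part : path) {
                if (prefix.length() > 0) {
                    prefix.append('/');
                }
                prefix.append(part);
                // Increment the count for this level of the hierarchy.
                counts.merge(prefix.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

For the paths topic/physics/quantum and topic/physics/astro taken from two
different documents, the parent topic/physics is counted twice and each leaf
once.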

On Fri, Oct 20, 2023 at 7:09 AM Michael Wechner wrote:

> cool, thank you very much!
>
> Michael
>
>
>
> On 20.10.23 at 15:44, Michael McCandless wrote:
> > You can use either the "doc values" implementation for facets
> > (SortedSetDocValuesFacetField), or the "taxonomy" implementation
> > (FacetField, in which case, yes, you need to create a TaxonomyWriter).
> >
> > It used to be that the "doc values" based faceting did not support
> > arbitrary hierarchy, but I think that was fixed at some point.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Fri, Oct 20, 2023 at 9:03 AM Michael Wechner <michael.wech...@wyona.com> wrote:
> >
> >> Hi Mike
> >>
> >> Thanks for your feedback!
> >>
> >> IIUC in order to have the actual advantages of Facets one has to
> >> "connect" it with a TaxonomyWriter
> >>
> >> FacetsConfig config = new FacetsConfig();
> >> DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
> >> indexWriter.addDocument(config.build(taxoWriter, doc));
> >>
> >> right?
> >>
> >> Thanks
> >>
> >> Michael
> >>
> >>
> >>
> >>
> >> On 20.10.23 at 12:19, Michael McCandless wrote:
> >>> There are some differences.
> >>>
> >>> StringField is indexed into the inverted index (postings) so you can do
> >>> efficient filtering.  You can also store in stored fields to retrieve.
> >>>
> >>> FacetField does everything StringField does (filtering, storing (maybe?)),
> >>> but in addition it stores data for faceting.  I.e. you can compute facet
> >>> counts or simple aggregations at search time.
> >>>
> >>> FacetField is also hierarchical: you can filter and facet by different
> >>> points/levels of your hierarchy.
> >>>
> >>> Mike McCandless
> >>>
> >>> http://blog.mikemccandless.com
> >>>
> >>>
> >>> On Fri, Oct 20, 2023 at 5:43 AM Michael Wechner <michael.wech...@wyona.com> wrote:
> >>>
> >>>> Hi
> >>>>
> >>>> I have found the following simple Facet Example
> >>>>
> >>>> https://github.com/apache/lucene/blob/main/lucene/demo/src/java/org/apache/lucene/demo/facet/SimpleFacetsExample.java
> >>>>
> >>>> whereas for a simple categorization of documents I currently use
> >>>> StringField, e.g.
> >>>>
> >>>> doc1.add(new StringField("category", "book"));
> >>>> doc1.add(new StringField("category", "quantum_physics"));
> >>>> doc1.add(new StringField("category", "Neumann"));
> >>>> doc1.add(new StringField("category", "Wheeler"));
> >>>>
> >>>> doc2.add(new StringField("category", "magazine"));
> >>>> doc2.add(new StringField("category", "astro_physics"));
> >>>>
> >>>> which works well, but would it be better to use Facets for this, e.g.
> >>>>
> >>>> doc1.add(new FacetField("media-type", "book"));
> >>>> doc1.add(new FacetField("topic", "physics", "quantum"));
> >>>> doc1.add(new FacetField("author", "Neumann"));
> >>>> doc1.add(new FacetField("author", "Wheeler"));
> >>>>
> >>>> doc2.add(new FacetField("media-type", "magazine"));
> >>>> doc2.add(new FacetField("topic", "physics", "astro"));
> >>>>
> >>>> ?
> >>>>
> >>>> IIUC the StringField approach is more general, whereas the FacetField
> >>>> approach allows a more specific categorization / search.
> >>>> Or do I misunderstand this?
> >>>>
> >>>> Thanks
> >>>>
> >>>> Michael


Disjunctively scoring non-matching conjunctive clauses

2023-07-20 Thread Marc D'Mello
Hi all,

I'm an engineer on Amazon Product Search and I've recently come upon a
situation where I've required conjunctive matching but disjunctive scoring.
As a concrete example, let's say I have a query like this:

(+title:"a" +title:"b" +title:"c") (product_id:1)

This is saying I want to conjunctively match on the title OR I want to
match a specific product document where the product_id is 1.

Let's say the document where product_id = 1 has a title of "a b", so it
doesn't match the title query. In this case, the score for the title clause
will be 0 since to my understanding, Lucene doesn't count scores for
non-matching clauses. However, for my use case, I would like to take into
account that several keywords did in fact match, so, as I stated earlier, I
want disjunctive scoring even though I still want to match conjunctively.

My way of working around this right now is to reconstruct the query as the
following (forgive my made-up Lucene query syntax, hopefully it's still
readable):

+(ConstantScoreQuery: 0 ((+title:"a" +title:"b" +title:"c")
(product_id:1))) (title:"a" title:"b" title:"c")

Pretty much, I separate this into a matching query that is wrapped by a
ConstantScoreQuery so it has no score, and a scoring query that provides
the disjunctive score.
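
The split between matching and scoring can be sketched in plain Java as
below. This is a toy model and not Lucene code: the class name is made up,
documents are reduced to a set of title terms plus a product id, and each
matching term contributes a flat score of 1 as a stand-in for a real term
score:

```java
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; a toy model of "match one way, score another". */
final class MatchThenScoreSketch {

    static final List<String> TERMS = List.of("a", "b", "c");

    /** Matching part: every title term is present, OR the product id equals 1. */
    static boolean matches(Set<String> titleTerms, int productId) {
        return titleTerms.containsAll(TERMS) || productId == 1;
    }

    /** Scoring part: one point per title term present, counted only for matching documents. */
    static double score(Set<String> titleTerms, int productId) {
        if (!matches(titleTerms, productId)) {
            return 0.0;
        }
        return TERMS.stream().filter(titleTerms::contains).count();
    }
}
```

A document with title terms {a, b} and product_id 1 matches and scores 2,
while the same title with a different product_id does not match at all. On
the Lucene side, it may be worth checking whether adding the matching part
as a BooleanClause.Occur.FILTER clause reads more cleanly than a zero-score
ConstantScoreQuery, since FILTER clauses are required but do not contribute
to the score.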

My approach feels a bit convoluted, so I was wondering if there were any
cleaner ways to do this? And if not, are there any drawbacks to my
workaround performance-wise?

Thanks!
Marc D'Mello


RangeFacetsCount Question

2022-04-21 Thread Marc D'Mello
Hi,

I had a quick question about RangeFacetCounts
<https://github.com/apache/lucene/blob/a071180a806d1bb7ae11ae30a07e43e452bea810/lucene/facet/src/java/org/apache/lucene/facet/range/RangeFacetCounts.java#L65>.
I'm a bit confused by the fastMatchQuery param. Specifically, I was
wondering why we need this when we can provide hits from a FacetsCollector
directly without having to run a query? I realize that the fastMatchQuery
is used for filtering provided hits further, but it seems redundant when we
can do all the matching we need before providing the FacetsCollector object
to RangeFacetCounts. SortedSetDocValuesFacetCounts, for example, only has a
FacetsCollector param
<https://github.com/apache/lucene/blob/a071180a806d1bb7ae11ae30a07e43e452bea810/lucene/facet/src/java/org/apache/lucene/facet/sortedset/SortedSetDocValuesFacetCounts.java#L89>
without having the fastMatchQuery param. Maybe I'm misunderstanding something here?
If anyone has an explanation that would be super helpful!
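
One possible reading, offered as an assumption rather than a confirmed
answer: a fast-match filter pays off when the per-document value is
expensive to compute (a geo distance, for example), because a cheap
pre-check lets the counter skip that computation for documents that cannot
fall in any range. The plain Java sketch below models that idea; the class,
method, and parameter names are hypothetical and are not the Lucene API:

```java
import java.util.function.IntPredicate;
import java.util.function.IntToLongFunction;

/** Hypothetical illustration only; not the Lucene range facet implementation. */
final class RangeCountSketch {

    /**
     * Counts hits whose value lies in [min, max]. If fastMatch is non-null,
     * documents failing it are skipped before the costly value is computed.
     */
    static int countInRange(int[] hits, IntPredicate fastMatch,
                            IntToLongFunction costlyValue, long min, long max) {
        int count = 0;
        for (int doc : hits) {
            if (fastMatch != null && !fastMatch.test(doc)) {
                continue; // cheap rejection, the value is never computed for this doc
            }
            long value = costlyValue.applyAsLong(doc);
            if (value >= min && value <= max) {
                count++;
            }
        }
        return count;
    }
}
```

In this model the hits still come from the collector; the fast-match check
only reduces how often the expensive value function runs.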

Thanks!
Marc D'Mello


Re: Issue with Japanese User Dictionary

2022-01-13 Thread Marc D'Mello
Hi Mike,

Thanks for the response! I'm actually not super familiar with
UserDictionaries, but looking at the code, it seems like a single line in
the user provided user dictionary corresponds to a single entry? In that
case, here is the line (or entry) that does have both widths that I believe
is causing the problem:

レコーダー,レコーダー,レコーダー,JA名詞

I'm guessing here the surface is レコーダー and the concatenated segment is the
first occurrence of レコーダー. I'm not sure what surface or concatenated segment
means though, or what it would mean semantically to replace the surface with the
full width version or the concatenated segment with the half width version.
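
To illustrate the width mismatch, here is a small plain Java sketch. It
assumes the two fields differ only in half-width versus full-width katakana,
and it uses java.text.Normalizer with NFKC to fold both to the same form.
The class and the segmentationMatchesSurface helper are hypothetical; they
only imitate the kind of comparison the error message describes and are not
the UserDictionary code:

```java
import java.text.Normalizer;

/** Hypothetical illustration only; not the Lucene UserDictionary implementation. */
final class WidthCheckSketch {

    /** Half-width katakana spelling of "recorder" (includes a separate voicing mark). */
    static final String HALF_WIDTH = "\uFF9A\uFF7A\uFF70\uFF80\uFF9E\uFF70";

    /** Full-width katakana spelling of the same word. */
    static final String FULL_WIDTH = "\u30EC\u30B3\u30FC\u30C0\u30FC";

    /** Folds width variants with Unicode NFKC normalization. */
    static String foldWidth(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    /** Assumed shape of the check: the segments joined together must equal the surface. */
    static boolean segmentationMatchesSurface(String surface, String segmentation) {
        return surface.equals(segmentation.replace(" ", ""));
    }
}
```

If the widths really are the only difference, normalizing both fields of the
dictionary entry before loading would make them compare equal. Note that
NFKC also folds other compatibility characters, so it may change more than
katakana width.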

Thanks,
Marc


On Thu, Jan 13, 2022 at 7:18 AM Michael Sokolov wrote:

> Hi Marc, I wonder if there is a workaround for this issue: eg, could
> we have entries for both widths? I wonder if there is some interaction
> with an analysis chain that is doing half-width -> full-width
> conversion (or vice versa)? I think the UserDictionary has to operate
> on pre-analyzed tokens ... although maybe *after* char filtering,
> which presumably could handle width conversions. A bunch of rambling,
> but maybe the point is - can you share some more information -- what
> is the full entry in the dictionary that causes the problem?
>
> On Wed, Jan 12, 2022 at 7:04 PM Marc D'Mello wrote:
> >
> > Hi,
> >
> > I had a question about the Japanese user dictionary. We have a user
> > dictionary that used to work but after attempting to upgrade Lucene, it
> > fails with the following error:
> >
> > Caused by: java.lang.RuntimeException: Illegal user dictionary entry レコーダー
> > - the concatenated segmentation (レコーダー) does not match the surface form
> > (レコーダー)
> > at
> > org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)
> >
> > The specific commit causing this error is here
> > <https://github.com/apache/lucene/commit/73ba88a50dec64f367caa88d277c26dfd1d8883b#diff-75fd48fadfd3d011e9c34c4310ef66e9009edfbc738fd82deb5661a8edb5c5d9>.
> > The only thing that seems to differ is that the characters are full-width
> > vs half-width, so I was wondering if this is intended behavior or a bug/too
> > restrictive. Any suggestions for fixing this would be greatly appreciated!
> > Thanks!


Issue with Japanese User Dictionary

2022-01-12 Thread Marc D'Mello
Hi,

I had a question about the Japanese user dictionary. We have a user
dictionary that used to work but after attempting to upgrade Lucene, it
fails with the following error:

Caused by: java.lang.RuntimeException: Illegal user dictionary entry レコーダー
- the concatenated segmentation (レコーダー) does not match the surface form
(レコーダー)
at
org.apache.lucene.analysis.ja.dict.UserDictionary.<init>(UserDictionary.java:123)

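The specific commit causing this error is here
<https://github.com/apache/lucene/commit/73ba88a50dec64f367caa88d277c26dfd1d8883b#diff-75fd48fadfd3d011e9c34c4310ef66e9009edfbc738fd82deb5661a8edb5c5d9>.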
The only thing that seems to differ is that the characters are full-width
vs half-width, so I was wondering if this is intended behavior or a bug/too
restrictive. Any suggestions for fixing this would be greatly appreciated!
Thanks!