Re: [VOTE] Release Lucene/Solr 5.4.1 RC2

2016-01-21 Thread Michael Froh
Should the Solr release notes reference the additional fixes that went in there? From your email to start the thread: - SOLR-8496: multi-select faceting and getDocSet(List) can match deleted docs - SOLR-8418: Adapt to changes in LUCENE-6590 for use of boosts with MLTHandler and

Default (no-args) behavior for JapanesePartOfSpeechStopFilterFactory

2020-10-02 Thread Michael Froh
I am currently working on migrating a project from an old version of Solr to Elasticsearch, and came across a funny (to me at least) difference in the "default" behavior of JapanesePartOfSpeechStopFilterFactory. If JapanesePartOfSpeechStopFilterFactory is given empty args, it does nothing. It

Re: Default (no-args) behavior for JapanesePartOfSpeechStopFilterFactory

2020-10-07 Thread Michael Froh
't fix it in 8.x releases... not sure). > > Mike McCandless > > http://blog.mikemccandless.com > > > On Fri, Oct 2, 2020 at 12:10 PM Michael Froh wrote: > >> I am currently working on migrating a project from an old version of Solr >> to Elasticsearch, and cam

Re: [VOTE] Release Lucene/Solr 8.6.0 RC1

2020-07-14 Thread Michael Froh
+1 (Non-binding) Upgraded Amazon Product Search to this RC and found no issues. On Fri, Jul 10, 2020 at 5:03 AM Namgyu Kim wrote: > +1 SUCCESS! [1:25:53.314724] > > On Fri, Jul 10, 2020 at 2:22 PM Tomás Fernández Löbbe < > tomasflo...@gmail.com> wrote: > >> +1 >> >> SUCCESS! [1:04:02.550893]

Re: Processing query clause combinations at indexing time

2020-12-15 Thread Michael Froh
ES/Solr > layer (which I know you don't use, but hypothetically speaking), I'm > dubious there as well. > >> > >> ~ David Smiley > >> Apache Lucene/Solr Search Developer > >> http://www.linkedin.com/in/davidwsmiley > >> > >> > >>

Processing query clause combinations at indexing time

2020-12-14 Thread Michael Froh
My team at work has a neat feature that we've built on top of Lucene that has provided a substantial (20%+) increase in maximum qps and some reduction in query latency. Basically, we run a training process that looks at historical queries to find frequently co-occurring combinations of required

Re: Processing query clause combinations at indexing time

2020-12-15 Thread Michael Froh
the index is smaller? Now that we have > > ConditionalTokenFilter (for branching), can the feature be implemented > > cleanly? > > > > Ideally it wouldn't require a lot of new code, something like checking > > a "set" + conditionaltokenfilter + shinglefilter? > >

Re: Processing query clause combinations at indexing time

2020-12-15 Thread Michael Froh
doing this for > non-scoring cases maybe something is off? > > On Tue, Dec 15, 2020 at 3:19 PM Michael Froh wrote: > > > > It's conceptually similar to CommonGrams in the single-field case, > though it doesn't require terms to appear in any particular positions. > >

Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()

2020-11-18 Thread Michael Froh
I have some code that is kind of abusing IndexWriter.deleteAll(). In short, I'm basically experimenting with using tiny (one block of joined parent/child documents) indexes as a serialized format to index on one fleet and then merge these tiny indexes on another fleet. I'm doing this by indexing a

Re: Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()

2020-11-18 Thread Michael Froh
IndexWriter instances. On Wed, Nov 18, 2020 at 12:25 PM Michael Sokolov wrote: > I'm curious if you tried creating a new IndexWriter for each batch? > > On Wed, Nov 18, 2020 at 1:18 PM Michael Froh wrote: > > > > I have some code that is kind of abusing IndexWriter.deleteA

Re: Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()

2020-11-18 Thread Michael Froh
r > http://www.linkedin.com/in/davidwsmiley > > > On Wed, Nov 18, 2020 at 1:17 PM Michael Froh wrote: > >> I have some code that is kind of abusing IndexWriter.deleteAll(). In >> short, I'm basically experimenting with using tiny (one block of joined >> parent/chi

Re: Multimodal search

2023-10-12 Thread Michael Froh
We recently added multimodal search in OpenSearch: https://github.com/opensearch-project/neural-search/pull/359 Since Lucene ultimately just cares about embeddings, does Lucene itself really need to be multimodal? Wherever the embeddings come from, Lucene can index the vectors and combine with

Boolean field type

2023-11-08 Thread Michael Froh
Hey, I've been musing about ideas for a "clever" Boolean field type on Lucene for a while, and I think I might have an idea that could work. That said, this popped into my head this afternoon and has not been fully-baked. It may not be very clever at all. My experience is that Boolean fields

Re: Boolean field type

2023-11-10 Thread Michael Froh
the posting list goes like >> dense sequentially increasing numbers 1,2,3,4,5.. May it already be >> compressed by codecs like >> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html >> ? >> >> On Thu, Nov 9, 2023 at 3:31 AM Mic

Dense union of doc IDs

2022-11-03 Thread Michael Froh
Hi, I was recently poking around in the createWeight implementation for MultiTermQueryConstantScoreWrapper to get to the bottom of some slow queries, and I realized that the worst-case performance could be pretty bad, but (maybe) possible to optimize for. Imagine if we have a segment with N docs

Unnecessary float[256] allocation on every (non-scoring) BM25Scorer

2023-05-02 Thread Michael Froh
Hi all, I was looking into a customer issue where they noticed some increased GC time after upgrading from Lucene 7.x to 9.x. After taking some heap dumps from both systems, the big difference was tracked down to the float[256] allocated (as a norms cache) when creating a BM25Scorer (in

Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer

2023-05-02 Thread Michael Froh
paring heap dumps from production hosts so far, so I'll try measuring in an environment where I can see what's going on. On Tue, May 2, 2023 at 1:14 PM Robert Muir wrote: > On Tue, May 2, 2023 at 3:24 PM Michael Froh wrote: > > > > > This seems ok if it isn't invasive. I s

Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer

2023-05-02 Thread Michael Froh
ions in the demoscene. I could try inlining those calculations and measuring the impact with the luceneutil benchmarks. On Tue, May 2, 2023 at 11:34 AM Robert Muir wrote: > On Tue, May 2, 2023 at 12:49 PM Michael Froh wrote: > > > > Hi all, > > > > I was looking

UTF-8 well-formedness for SimpleTextCodec

2023-12-18 Thread Michael Froh
Hi there, I was recently writing up a short Lucene file format tutorial ( https://msfroh.github.io/lucene-university/docs/DirectoryFileContents.html), using SimpleTextCodec for educational purposes. I found that SimpleTextSegmentInfo tries to output the segment ID as raw bytes, which will often

Computing weight.count() cheaply in the face of deletes?

2024-02-02 Thread Michael Froh
Hi, On OpenSearch, we've been taking advantage of the various O(1) Weight#count() implementations to quickly compute various aggregations without needing to iterate over all the matching documents (at least when the top-level query is functionally a match-all at the segment level). Of course,

Re: Computing weight.count() cheaply in the face of deletes?

2024-02-05 Thread Michael Froh
ration on the bit set? > > I don't think we can fold it into Weight#count since there is an > expectation that it is negligible compared with the cost of a naive count, > but we may be able to do it in IndexSearcher#count or on the OpenSearch > side. > > Le ven. 2 févr. 20

Re: Improve testing

2024-05-24 Thread Michael Froh
Is your new test uncommitted? The Gradle check will fail if you have uncommitted files, to avoid the situation where it "works on my machine (because of a file that I forgot to commit)". The rough workflow is: 1. Develop stuff (code and/or tests). 2. Commit it. 3. Gradle check. 4. If Gradle

Re: Can we import an HNSW graph into lucene index ?

2024-06-14 Thread Michael Froh
Hi Anand, Interesting that you should bring this up! There was a talk just this week at Berlin Buzzwords talking about using cuVS with Lucene: https://www.youtube.com/watch?v=qiW7iIDFJC0 From that talk, it sounds like the folks at SearchScale have managed to integrate cuVS as a custom codec

[jira] [Created] (SOLR-3526) Remove classfile dependency on ZooKeeper from CoreContainer

2012-06-08 Thread Michael Froh (JIRA)
Michael Froh created SOLR-3526: -- Summary: Remove classfile dependency on ZooKeeper from CoreContainer Key: SOLR-3526 URL: https://issues.apache.org/jira/browse/SOLR-3526 Project: Solr Issue

[jira] [Commented] (SOLR-3526) Remove classfile dependency on ZooKeeper from CoreContainer

2012-06-11 Thread Michael Froh (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292689#comment-13292689 ] Michael Froh commented on SOLR-3526: Oh, thanks a lot for pointing that out, Hoss! I

[jira] [Created] (LUCENE-4185) CharFilters being added twice in Solr

2012-07-02 Thread Michael Froh (JIRA)
Michael Froh created LUCENE-4185: Summary: CharFilters being added twice in Solr Key: LUCENE-4185 URL: https://issues.apache.org/jira/browse/LUCENE-4185 Project: Lucene - Java Issue Type

[jira] [Updated] (LUCENE-4185) CharFilters being added twice in Solr

2012-07-02 Thread Michael Froh (JIRA)
[ https://issues.apache.org/jira/browse/LUCENE-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Froh updated LUCENE-4185: - Affects Version/s: (was: 4.0) 4.0-ALPHA CharFilters being added

[jira] [Created] (SOLR-5330) PerSegmentSingleValuedFaceting overwrites facet values

2013-10-10 Thread Michael Froh (JIRA)
Michael Froh created SOLR-5330: -- Summary: PerSegmentSingleValuedFaceting overwrites facet values Key: SOLR-5330 URL: https://issues.apache.org/jira/browse/SOLR-5330 Project: Solr Issue Type

[jira] [Updated] (SOLR-5330) PerSegmentSingleValuedFaceting overwrites facet values

2013-10-10 Thread Michael Froh (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Froh updated SOLR-5330: --- Attachment: solr-5330.patch Patch attached PerSegmentSingleValuedFaceting overwrites facet values

[jira] [Commented] (SOLR-3526) Remove classfile dependency on ZooKeeper from CoreContainer

2015-11-19 Thread Michael Froh (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014134#comment-15014134 ] Michael Froh commented on SOLR-3526: 3.5 years later, I decided to try taking a stab at this myself

[jira] [Comment Edited] (SOLR-3526) Remove classfile dependency on ZooKeeper from CoreContainer

2015-11-19 Thread Michael Froh (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014134#comment-15014134 ] Michael Froh edited comment on SOLR-3526 at 11/19/15 6:53 PM: -- 3.5 years later

[jira] [Commented] (SOLR-3526) Remove classfile dependency on ZooKeeper from CoreContainer

2015-11-19 Thread Michael Froh (JIRA)
[ https://issues.apache.org/jira/browse/SOLR-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014167#comment-15014167 ] Michael Froh commented on SOLR-3526: Also worth highlighting -- the significant part of the change