Re: Multiple merge-runs from same set of segments

2021-05-27 Thread Patrick Zhai
Sorry for the delayed response. As for caching termDict data across
threads, I'm not aware of any existing Lucene mechanism that could do that
(and it might be tricky since it is across threads), but it may be worth
trying to see whether we can get some extra speed that way!

Patrick

Ravikumar Govindarajan  wrote on Monday, May 24, 2021 at 11:49 AM:

> Thanks Patrick for the help!
>
> May I know what lucene version you're using?
> >
>
> We are using an older version of Lucene as of now (4.7.x) and I believe the
> FilterCodecReader of the current version is akin to FilterAtomicReader and
> should do the job for us!
>
> If it is not available, I'm not sure whether the merge will happen via
> merge
> > policy, maybe you could check the source code and see?
> >
>
> Checked, and AFAIK our old version doesn't support it. But I guess it should
> be fine to wrap a SortingAtomicReader and pass it to the API, so it can
> be done!
>
> But I think the current default directory implementation is MMapDirectory,
> > which delegates the caching to the system and should have
> > already optimized for this situation
> >
>
> We do use the default MMap-dir but I was actually thinking about
> unpacking/walking Term-Dict data (FST) repeatedly from various
> threads, even if via MMap. Are there optimizations here (caching unpacked
> blocks etc..) that we could tap into?
>
> --
> Ravi
>
> On Mon, May 24, 2021 at 11:09 PM Patrick Zhai  wrote:
>
> > Hi Ravi,
> >
> > 1. May I know what Lucene version you're using? As far as I know, the
> > SortingMergePolicy has been deprecated and replaced by
> > IndexWriterConfig.setIndexSort in newer Lucene versions. So if
> > "setIndexSort" is available, I would suggest using that to achieve the
> > sorted index (as you might have already figured out, the IndexRearranger
> > lets you pass in an IndexWriterConfig so that you could set it there). If
> > it is not available, I'm not sure whether the merge will happen via the
> > merge policy; maybe you could check the source code and see?
> > 2. Yeah, it's a good observation, we're doing multiple passes over one
> > segment! But I think the current default directory implementation is
> > MMapDirectory, which delegates the caching to the operating system and
> > should have already optimized for this situation. Here's a great blog post
> > explaining MMapDirectory in Lucene:
> > https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
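For reference, the setIndexSort replacement for SortingMergePolicy that Patrick mentions looks roughly like this on a recent Lucene version (a sketch, not from the thread; the "timestamp" field name is made up and must be indexed as doc values for the sort to work):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.FSDirectory;

public class IndexSortExample {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Replaces the old SortingMergePolicy: every segment is written and
    // merged in this order. "timestamp" must be a NumericDocValuesField.
    iwc.setIndexSort(new Sort(new SortField("timestamp", SortField.Type.LONG)));
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/tmp/sorted-index")), iwc)) {
      // add documents as usual; flushes and merges preserve the sort order
    }
  }
}
```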
> >
> > Best
> > Patrick
> >
> > Ravikumar Govindarajan  wrote on Monday, May 24, 2021 at 9:54 AM:
> >
> > > Thanks Michael!
> > >
> > > This was just what I was looking for! Just a couple of questions.
> > >
> > >
> > >    - When we call addIndexes(IndexReader...), does the merge happen via
> > >      MergePolicy? We use a SortingMergePolicy and would like to maintain the
> > >      sort-order in newly created segments too
> > >    - Concurrency is a cool trick here. But if I understand the patch
> > >      correctly, don't we end up doing multiple passes over the Term Dict, one
> > >      for each Selector? Loading it fully in memory could help here, possibly?
> > >
> > > --
> > > Ravi
> > >
> > > On Mon, May 24, 2021 at 7:37 PM Michael McCandless <
> > > luc...@mikemccandless.com> wrote:
> > >
> > > > Are you trying to rewrite your already created index into a different
> > > > segment geometry?
> > > >
> > > > Maybe have a look at the new IndexRearranger tool?  It is already doing
> > > > something like what you enumerated below, including mocking LiveDocs to
> > > > get the right documents into the right segments.
> > > >
> > > > Mike McCandless
> > > >
> > > > http://blog.mikemccandless.com
> > > >
> > > >
> > > > On Sat, May 22, 2021 at 3:50 PM Ravikumar Govindarajan <
> > > > ravikumar.govindara...@gmail.com> wrote:
> > > >
> > > >> Hello,
> > > >>
> > > >> We have a use-case for index-rewrite on a "frozen index" where no new
> > > >> documents are added. It goes like this:
> > > >>
> > > >>    1. Get all segments for the index (base-segment-list)
> > > >>    2. Create a new segment from base-segment-list with a unique set of
> > > >>       docs (LiveDocs)
> > > >>    3. Repeat step 2, for a fixed count, say 5 or 10 times
> > > >>
> > > >> Is something like this achievable via a MergePolicy? We can disable
> > > >> commits too, till the full run is completed.
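On a recent Lucene version (not the 4.7.x discussed in this thread), step 2 could be sketched with a FilterCodecReader that masks the live docs before handing the segment readers to IndexWriter.addIndexes. The class and field names below are illustrative, not from the thread:

```java
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.FilterCodecReader;
import org.apache.lucene.util.Bits;

/**
 * Presents only a caller-chosen subset of a segment's documents as "live".
 * The mask must be maxDoc() bits long, must also exclude any docs already
 * deleted in the original segment, and numKept must equal its set-bit count.
 */
final class SubsetCodecReader extends FilterCodecReader {
  private final Bits subset;
  private final int numKept;

  SubsetCodecReader(CodecReader in, Bits subset, int numKept) {
    super(in);
    this.subset = subset;
    this.numKept = numKept;
  }

  @Override public Bits getLiveDocs() { return subset; }
  @Override public int numDocs() { return numKept; }
  @Override public CacheHelper getCoreCacheHelper() { return in.getCoreCacheHelper(); }
  // The filtered view has its own notion of deleted docs, so no reader cache:
  @Override public CacheHelper getReaderCacheHelper() { return null; }
}

// One pass of step 2 would then be, roughly:
//   try (IndexWriter w = new IndexWriter(destDir, iwc)) {
//     w.addIndexes(subsetReaders); // one SubsetCodecReader per base segment
//   }
```

Each pass reopens the same base segments with a different subset mask, which matches the "repeat step 2 a fixed number of times" plan.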
> > > >>
> > > >> Any help is appreciated
> > > >>
> > > >> Regards,
> > > >> Ravi
> > > >>
> > > >
> > >
> >
>


Re: Lucene/Solr and BERT

2021-05-27 Thread Julie Tibshirani
Your summary sounds right to me. There are some ideas (being discussed on
the issue), but I don't think we have a detailed understanding yet of the
performance difference.

It would be great to get more eyes on the benchmark if you're interested in
double-checking the results. Mike mentioned that he saw a similar
performance difference in search (7-8x) when he ran his own benchmarks.

Julie




On Thu, May 27, 2021 at 12:55 AM Michael Wechner 
wrote:

> Thank you very much for having done these benchmarks!
>
> IIUC one could state
>
> - Indexing:
>    Lucene is slower than hnswlib/C++, very roughly 10x performance difference
> - Searching (Queries per second):
>    Lucene is slower than hnswlib/C++, very roughly 8x performance difference
>
> Right, but we should double-check these results?
>
> Also it is not clear at the moment why there is this performance
> difference, right?
>
>
> On 27.05.21 at 03:33, Julie Tibshirani wrote:
> > These JIRA issues contain results against two ann-benchmarks datasets.
> It'd
> > be great to get your thoughts/ feedback if you have any:
> > * Searching: https://issues.apache.org/jira/browse/LUCENE-9937
> > * Indexing: https://issues.apache.org/jira/browse/LUCENE-9941
> >
> > The benchmarks are based on the setup here:
> > https://github.com/jtibshirani/lucene/pull/1. I am happy to help if you
> run
> > into issues with it.
> >
> > A note: my motivation for running ann-benchmarks was to understand how
> the
> > current performance compares to other approaches, and to research ideas
> for
> > improvements. The setup in the PR doesn't feel solid/ maintainable as a
> > long term approach to development benchmarks. My personal plan is to
> focus
> > on enhancing luceneutil and our nightly benchmarks (
> > https://github.com/mikemccand/luceneutil) instead of putting a lot of
> > effort into the ann-benchmarks setup.
> >
> > Julie
> >
> > On Wed, May 26, 2021 at 1:04 PM Alex K  wrote:
> >
> >> Thanks Michael. IIRC, the thing that was taking so long was merging
> into a
> >> single segment. Is there already benchmarking code for HNSW
> >> available somewhere? I feel like I remember someone posting benchmarking
> >> results on one of the Jira tickets.
> >>
> >> Thanks,
> >> Alex
> >>
> >> On Wed, May 26, 2021 at 3:41 PM Michael Sokolov 
> >> wrote:
> >>
> >>> This java implementation will be slower than the C implementation. I
> >>> believe the algorithm is essentially the same, however this is new and
> >>> there may be bugs!  I (and I think Julie had similar results IIRC)
> >>> measured something like 8x slower than hnswlib (using ann-benchmarks).
> >>> It is also surprising (to me) though how this varies with
> >>> differently-learned vectors so YMMV. I still think there is value
> >>> here, and look forward to improved performance, especially as JDK16
> >>> has some improved support for vectorized instructions.
> >>>
> >>> Please also understand that the HNSW algorithm interacts with Lucene's
> >>> segmented architecture in a tricky way. Because we build a graph
> >>> *per-segment* when flushing/merging, these graphs must be rebuilt whenever
> >>> segments are merged. So your indexing performance can be heavily
> >>> influenced by how often you flush, as well as by your merge policy
> >>> settings. Also, when searching, there is a bigger than usual benefit
> >>> for searching across fewer segments, since the cost of searching an
> >>> HNSW graph scales more or less with log N (so searching a single large
> >>> graph is cheaper than searching the same documents divided among
> >>> smaller graphs). So I do recommend using a multithreaded collector in
> >>> order to get best latency with HNSW-based search. To get the best
> >>> indexing, and searching, performance, you should generally index as
> >>> large a number of documents as possible before flushing.
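Mike's multithreaded-collector suggestion can be sketched by passing an executor to IndexSearcher, which then searches segments concurrently (a sketch; the index path and pool size are made up):

```java
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class ParallelSearchExample {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try (DirectoryReader reader = DirectoryReader.open(
        FSDirectory.open(Paths.get("/tmp/index")))) {
      // With an executor, IndexSearcher can search segments in parallel,
      // which helps latency when an index has many HNSW-bearing segments.
      IndexSearcher searcher = new IndexSearcher(reader, pool);
      // searcher.search(query, 10) ...
    } finally {
      pool.shutdown();
    }
  }
}
```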
> >>>
> >>> -Mike
> >>>
> >>> On Wed, May 26, 2021 at 9:43 AM Michael Wechner
> >>>  wrote:
>  Hi Alex
> 
>  Thank you very much for your feedback and the various insights!
> 
>  On 26.05.21 at 04:41, Alex K wrote:
> > Hi Michael and others,
> >
> > Sorry just now getting back to you. For your three original questions:
> > - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> > thorough response.
> > - As far as I know Opendistro is calling out to a C/C++ binary to run the
> > actual HNSW algorithm and store the HNSW part of the index. When they
> > implemented it about a year ago, Lucene did not have this yet. I assume the
> > Lucene HNSW implementation is solid, but would not be surprised if it's
> > slower than the C/C++ based implementation, given the JVM has some
> > disadvantages for these kinds of CPU-bound/number crunching algos.
> > - I just haven't had much time to invest into my benchmark recently. In
> > particular, I got stuck on why indexing was taking extremely long. Just

Re: Hierarchical facet select a subtree but one child

2021-05-27 Thread nbuso

Hi,

Yes, my aim was to introduce the functionality in the facet package.
I created a ticket and added a simple patch; the use case seems to apply
only to hierarchical facets, so maybe we can add a validation to avoid
using the method in other cases.

I'm happy to make modifications to the patch if you have any comments.

https://issues.apache.org/jira/browse/LUCENE-9979

Nicola Buso

On 2020-08-17 15:14, Michael McCandless wrote:

I think this is a missing API in DrillDownQuery?

Nicola, could you open an issue?

The filtering is as Mike Sokolov described, but I think we should add
a sugar method, e.g. DrillDownQuery.remove or something, to add a
negated query clause.

And until this API is added and you can upgrade to it, you can
construct your own TermQuery and then add it as a MUST_NOT clause.
Look at how DrillDownQuery.add converts incoming facet paths to terms
and use that public DrillDownQuery.term method it exposes to create
your own negated TermQuery.
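Until such a sugar method exists, the MUST_NOT construction Mike describes might look roughly like this (a sketch against a recent facet API; the dimension "Facet" and path values mirror the example elsewhere in this thread):

```java
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SubtreeMinusChild {
  // Select the whole "Facet/V1" subtree but exclude the child "Facet/V1/V1.1".
  static Query buildQuery(FacetsConfig config) {
    DrillDownQuery base = new DrillDownQuery(config);
    base.add("Facet", "V1"); // drill down into V1 (matches the whole subtree)

    // Build the child's term exactly the way DrillDownQuery.add does,
    // then attach it as a MUST_NOT clause.
    Term child = DrillDownQuery.term(
        config.getDimConfig("Facet").indexFieldName, "Facet", "V1", "V1.1");
    return new BooleanQuery.Builder()
        .add(base, BooleanClause.Occur.MUST)
        .add(new TermQuery(child), BooleanClause.Occur.MUST_NOT)
        .build();
  }
}
```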

Mike McCandless

http://blog.mikemccandless.com

On Sat, Aug 15, 2020 at 11:55 AM Michael Sokolov 
wrote:


If you are trying to show documents that have facet value V1, excluding
those with facet value V1.1, then you would need to issue a query like:

+f:V1 -f:V1.1

assuming your facet values are indexed in a field called "f". I don't
think this really has anything to do with faceting; it's just a
filtering problem.

On Tue, Aug 4, 2020 at 4:47 AM nbuso  wrote:


Hi,

is there someone that can point me to the right API to negate facet
values?
Maybe this DrillDownQuery#add(dim, query) is the API to permit this use
case?

https://lucene.apache.org/core/8_5_2/facet/org/apache/lucene/facet/DrillDownQuery.html#add-java.lang.String-org.apache.lucene.search.Query-

Nicola


On 2020-07-29 10:27, nbuso wrote:

Hi,

I'm a bit rusty with the Lucene facets API and I have a common use case
that I would like to solve.
Suppose the following facet values tree:

Facet
 - V1
   - V1.1
   - V1.2
   - V1.3
   - V1.4
   - (not topK values)
 - V2
   - V2.1
   - V2.2
   - V2.3
   - V2.4
   - (not topK values)

With (not topK values) I mean values you are not showing in the UI
because of space/visualization problems. You usually see them with the
links "More ..."

Use case:
1 - select V1 => all V1.x are selected
2 - de-select V1.1

How can I achieve this? From the search results I know the values
V1.[1-4] but I don't know the values that are not in topK. How can I
select all the V1 subtree but V1.1?

Please let me know if you need more info.


Nicola Buso - EBI





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org









Re: Index backwards compatibility

2021-05-27 Thread Michael Wechner

good point! I have changed it accordingly

https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ#LuceneFAQ-WhenIupradeLucene,forexamplefrom8.8.2to9.0.0,doIhavetoreindex?

Hope it is clear now :-)

On 27.05.21 at 16:39, Michael Sokolov wrote:

LGTM, but perhaps also should state that if possible you *should*
update because the 8.x index may not be able to be read by the
eventual 10 release.

On Thu, May 27, 2021 at 7:52 AM Michael Wechner
 wrote:

I have added a QnA

https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ#LuceneFAQ-WhenIupradeLucene,forexamplefrom8.8.2to9.0.0,doIhavetoreindex?

Hope that makes sense, otherwise let me know and I can correct/update :-)



On 26.05.21 at 23:56, Michael Wechner wrote:

using lucene-backward-codecs-9.0.0-SNAPSHOT.jar makes it work :-)

Thank you very much!

But IIUC it is recommended to reindex when upgrading, right? I guess
similar to what Solr is recommending

https://solr.apache.org/guide/8_0/reindexing.html


On 26.05.21 at 21:26, Michael Sokolov wrote:

I think you need backward-codecs-9.0.0-SNAPSHOT there. It enables 9.0
to read 8.x indexes.

On Wed, May 26, 2021 at 9:27 AM Michael Wechner
 wrote:

Hi

I am using Lucene 8.8.2 in production and I am currently doing some
tests using 9.0.0-SNAPSHOT, where I have included
lucene-backward-codecs, because the log files were asking me
whether I had forgotten to include lucene-backward-codecs.jar:

    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-core</artifactId>
      <version>9.0.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-queryparser</artifactId>
      <version>9.0.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>org.apache.lucene</groupId>
      <artifactId>lucene-backward-codecs</artifactId>
      <version>8.8.2</version>
    </dependency>

But when querying index directories created with Lucene 8.8.2, then I
receive the following error

java.lang.NoClassDefFoundError: Could not initialize class
org.apache.lucene.codecs.Codec$Holder

I am not sure whether I understand the backwards compatibility page
correctly

https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility


but I guess version 9 will not be backwards compatible to version 8? Or
should I do something different?

Thanks

Michael




Re: Index backwards compatibility

2021-05-27 Thread Michael Sokolov
... should *reindex*  ( not update )

On Thu, May 27, 2021 at 10:39 AM Michael Sokolov  wrote:
>
> LGTM, but perhaps also should state that if possible you *should*
> update because the 8.x index may not be able to be read by the
> eventual 10 release.
>
> On Thu, May 27, 2021 at 7:52 AM Michael Wechner
>  wrote:
> >
> > I have added a QnA
> >
> > https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ#LuceneFAQ-WhenIupradeLucene,forexamplefrom8.8.2to9.0.0,doIhavetoreindex?
> >
> > Hope that makes sense, otherwise let me know and I can correct/update :-)
> >
> >
> >
> > > On 26.05.21 at 23:56, Michael Wechner wrote:
> > > using lucene-backward-codecs-9.0.0-SNAPSHOT.jar makes it work :-)
> > >
> > > Thank you very much!
> > >
> > > But IIUC it is recommended to reindex when upgrading, right? I guess
> > > similar to what Solr is recommending
> > >
> > > https://solr.apache.org/guide/8_0/reindexing.html
> > >
> > >
> > >> On 26.05.21 at 21:26, Michael Sokolov wrote:
> > >> I think you need backward-codecs-9.0.0-SNAPSHOT there. It enables 9.0
> > >> to read 8.x indexes.
> > >>
> > >> On Wed, May 26, 2021 at 9:27 AM Michael Wechner
> > >>  wrote:
> > >>> Hi
> > >>>
> > >>> I am using Lucene 8.8.2 in production and I am currently doing some
> > >>> tests using 9.0.0-SNAPSHOT, whereas I have included
> > >>> lucene-backward-codecs, because in the log files it was asking me
> > >>> whether I have forgotten to include lucene-backward-codecs.jar
> > >>>
> > >>> <dependency>
> > >>>   <groupId>org.apache.lucene</groupId>
> > >>>   <artifactId>lucene-core</artifactId>
> > >>>   <version>9.0.0-SNAPSHOT</version>
> > >>> </dependency>
> > >>> <dependency>
> > >>>   <groupId>org.apache.lucene</groupId>
> > >>>   <artifactId>lucene-queryparser</artifactId>
> > >>>   <version>9.0.0-SNAPSHOT</version>
> > >>> </dependency>
> > >>> <dependency>
> > >>>   <groupId>org.apache.lucene</groupId>
> > >>>   <artifactId>lucene-backward-codecs</artifactId>
> > >>>   <version>8.8.2</version>
> > >>> </dependency>
> > >>>
> > >>> But when querying index directories created with Lucene 8.8.2, then I
> > >>> receive the following error
> > >>>
> > >>> java.lang.NoClassDefFoundError: Could not initialize class
> > >>> org.apache.lucene.codecs.Codec$Holder
> > >>>
> > >>> I am not sure whether I understand the backwards compatibility page
> > >>> correctly
> > >>>
> > >>> https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility
> > >>>
> > >>>
> > >>> but I guess version 9 will not be backwards compatible to version 8? Or
> > >>> should I do something different?
> > >>>
> > >>> Thanks
> > >>>
> > >>> Michael
> > >>>



Re: Index backwards compatibility

2021-05-27 Thread Michael Sokolov
LGTM, but perhaps also should state that if possible you *should*
update because the 8.x index may not be able to be read by the
eventual 10 release.

On Thu, May 27, 2021 at 7:52 AM Michael Wechner
 wrote:
>
> I have added a QnA
>
> https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ#LuceneFAQ-WhenIupradeLucene,forexamplefrom8.8.2to9.0.0,doIhavetoreindex?
>
> Hope that makes sense, otherwise let me know and I can correct/update :-)
>
>
>
> > On 26.05.21 at 23:56, Michael Wechner wrote:
> > using lucene-backward-codecs-9.0.0-SNAPSHOT.jar makes it work :-)
> >
> > Thank you very much!
> >
> > But IIUC it is recommended to reindex when upgrading, right? I guess
> > similar to what Solr is recommending
> >
> > https://solr.apache.org/guide/8_0/reindexing.html
> >
> >
> >> On 26.05.21 at 21:26, Michael Sokolov wrote:
> >> I think you need backward-codecs-9.0.0-SNAPSHOT there. It enables 9.0
> >> to read 8.x indexes.
> >>
> >> On Wed, May 26, 2021 at 9:27 AM Michael Wechner
> >>  wrote:
> >>> Hi
> >>>
> >>> I am using Lucene 8.8.2 in production and I am currently doing some
> >>> tests using 9.0.0-SNAPSHOT, whereas I have included
> >>> lucene-backward-codecs, because in the log files it was asking me
> >>> whether I have forgotten to include lucene-backward-codecs.jar
> >>>
> >>>   
> >>>   org.apache.lucene
> >>>   lucene-core
> >>>   9.0.0-SNAPSHOT
> >>>   
> >>>   
> >>>   org.apache.lucene
> >>> lucene-queryparser
> >>>   9.0.0-SNAPSHOT
> >>>   
> >>>   
> >>>   org.apache.lucene
> >>> lucene-backward-codecs
> >>>   8.8.2
> >>>   
> >>>
> >>> But when querying index directories created with Lucene 8.8.2, then I
> >>> receive the following error
> >>>
> >>> java.lang.NoClassDefFoundError: Could not initialize class
> >>> org.apache.lucene.codecs.Codec$Holder
> >>>
> >>> I am not sure whether I understand the backwards compatibility page
> >>> correctly
> >>>
> >>> https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility
> >>>
> >>>
> >>> but I guess version 9 will not be backwards compatible to version 8? Or
> >>> should I do something different?
> >>>
> >>> Thanks
> >>>
> >>> Michael
> >>>
> >>> -
> >>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >>> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>>
> >> -
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> >>
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Index backwards compatibility

2021-05-27 Thread Michael Wechner

I have added a QnA

https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ#LuceneFAQ-WhenIupradeLucene,forexamplefrom8.8.2to9.0.0,doIhavetoreindex?

Hope that makes sense, otherwise let me know and I can correct/update :-)



On 26.05.21 at 23:56, Michael Wechner wrote:

using lucene-backward-codecs-9.0.0-SNAPSHOT.jar makes it work :-)

Thank you very much!

But IIUC it is recommended to reindex when upgrading, right? I guess 
similar to what Solr is recommending


https://solr.apache.org/guide/8_0/reindexing.html


On 26.05.21 at 21:26, Michael Sokolov wrote:

I think you need backward-codecs-9.0.0-SNAPSHOT there. It enables 9.0
to read 8.x indexes.

On Wed, May 26, 2021 at 9:27 AM Michael Wechner
 wrote:

Hi

I am using Lucene 8.8.2 in production and I am currently doing some
tests using 9.0.0-SNAPSHOT, whereas I have included
lucene-backward-codecs, because in the log files it was asking me
whether I have forgotten to include lucene-backward-codecs.jar

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>9.0.0-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>9.0.0-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-backward-codecs</artifactId>
  <version>8.8.2</version>
</dependency>

But when querying index directories created with Lucene 8.8.2, then I
receive the following error

java.lang.NoClassDefFoundError: Could not initialize class
org.apache.lucene.codecs.Codec$Holder

I am not sure whether I understand the backwards compatibility page
correctly

https://cwiki.apache.org/confluence/display/LUCENE/BackwardsCompatibility 



but I guess version 9 will not be backwards compatible to version 8? Or
should I do something different?

Thanks

Michael




Re: Lucene/Solr and BERT

2021-05-27 Thread Michael Wechner

Thank you very much for having done these benchmarks!

IIUC one could state

- Indexing:
   Lucene is slower than hnswlib/C++, very roughly 10x performance difference
- Searching (Queries per second):
   Lucene is slower than hnswlib/C++, very roughly 8x performance difference

Right, but we should double-check these results?

Also it is not clear at the moment why there is this performance 
difference, right?



On 27.05.21 at 03:33, Julie Tibshirani wrote:

These JIRA issues contain results against two ann-benchmarks datasets. It'd
be great to get your thoughts/ feedback if you have any:
* Searching: https://issues.apache.org/jira/browse/LUCENE-9937
* Indexing: https://issues.apache.org/jira/browse/LUCENE-9941

The benchmarks are based on the setup here:
https://github.com/jtibshirani/lucene/pull/1. I am happy to help if you run
into issues with it.

A note: my motivation for running ann-benchmarks was to understand how the
current performance compares to other approaches, and to research ideas for
improvements. The setup in the PR doesn't feel solid/ maintainable as a
long term approach to development benchmarks. My personal plan is to focus
on enhancing luceneutil and our nightly benchmarks (
https://github.com/mikemccand/luceneutil) instead of putting a lot of
effort into the ann-benchmarks setup.

Julie

On Wed, May 26, 2021 at 1:04 PM Alex K  wrote:


Thanks Michael. IIRC, the thing that was taking so long was merging into a
single segment. Is there already benchmarking code for HNSW
available somewhere? I feel like I remember someone posting benchmarking
results on one of the Jira tickets.

Thanks,
Alex

On Wed, May 26, 2021 at 3:41 PM Michael Sokolov 
wrote:


This java implementation will be slower than the C implementation. I
believe the algorithm is essentially the same, however this is new and
there may be bugs!  I (and I think Julie had similar results IIRC)
measured something like 8x slower than hnswlib (using ann-benchmarks).
It is also surprising (to me) though how this varies with
differently-learned vectors so YMMV. I still think there is value
here, and look forward to improved performance, especially as JDK16
has some improved support for vectorized instructions.

Please also understand that the HNSW algorithm interacts with Lucene's
segmented architecture in a tricky way. Because we build a graph
*per-segment* when flushing/merging, these graphs must be rebuilt whenever
segments are merged. So your indexing performance can be heavily
influenced by how often you flush, as well as by your merge policy
settings. Also, when searching, there is a bigger than usual benefit
for searching across fewer segments, since the cost of searching an
HNSW graph scales more or less with log N (so searching a single large
graph is cheaper than searching the same documents divided among
smaller graphs). So I do recommend using a multithreaded collector in
order to get best latency with HNSW-based search. To get the best
indexing, and searching, performance, you should generally index as
large a number of documents as possible before flushing.

-Mike

On Wed, May 26, 2021 at 9:43 AM Michael Wechner
 wrote:

Hi Alex

Thank you very much for your feedback and the various insights!

On 26.05.21 at 04:41, Alex K wrote:

Hi Michael and others,

Sorry just now getting back to you. For your three original questions:

- Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
thorough response.
- As far as I know Opendistro is calling out to a C/C++ binary to run the
actual HNSW algorithm and store the HNSW part of the index. When they
implemented it about a year ago, Lucene did not have this yet. I assume the
Lucene HNSW implementation is solid, but would not be surprised if it's
slower than the C/C++ based implementation, given the JVM has some
disadvantages for these kinds of CPU-bound/number crunching algos.
- I just haven't had much time to invest into my benchmark recently. In
particular, I got stuck on why indexing was taking extremely long. Just
indexing the vectors would have easily exceeded the current time
limitations in the ANN-benchmarks project. Maybe I had some naive mistake
in my implementation, but I profiled and dug pretty deep to make it fast.

I am trying to get Julie's branch running

https://github.com/jtibshirani/lucene/tree/hnsw-bench

Maybe this will help and is comparable



I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?

Yes, for more simple setups I would like to use Lucene standalone, but
for setups which have to scale I would use either Elasticsearch or Solr.

Thanks

Michael




If so, another option you might try for ANN is the elastiknn-models
and elastiknn-lucene packages. elastiknn-models contains the Locality
Sensitive Hashing implementations of ANN used by Elastiknn, and
elastiknn-lucene contains the Lucene queries used by Elastiknn. The
Lucene query is the MatchHashesAndScoreQuery
<