Re: Best strategy migrate indexes

2022-11-07 Thread Michael Sokolov
The error you got

BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
9 (needs to be between 6 and 7)

indicates that the index you are reading was written by Lucene 9, so
things are not set up the way you described (writing with Lucene 7).
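
To read the message: the number before the parenthesis is the format version found in the file, and the range is what the reading library accepts. Below is a minimal sketch of that kind of check. FormatCheck and checkVersion are hypothetical names; the code only mimics the wording of the message above and is not Lucene's code. It also makes no assumption about how a format number maps to a particular Lucene release, which may be worth confirming separately.

```java
// Hypothetical model of a reader-side format check; not Lucene's implementation.
final class FormatCheck {
    /** Throws if the version found in a file is outside the range this reader supports. */
    static void checkVersion(String resource, int found, int min, int max) {
        if (found < min || found > max) {
            throw new IllegalStateException(
                "Format version is not supported (resource " + resource + "): "
                    + found + " (needs to be between " + min + " and " + max + ")");
        }
    }

    public static void main(String[] args) {
        try {
            // A reader that accepts formats 6..7 opening a file stamped with 9.
            checkVersion("segments_1", 9, 6, 7);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Either way, the file was produced by something newer than the tool that is trying to open it.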



RE: [EXT] Re: Efficient sort on SortedDocValues

2022-11-07 Thread Solodin, Andrei (TR Technology)
Ah, of course. Thanks Mikhail. I realize it was a silly question that only 
made sense to me because my query was a MatchAllDocsQuery.


-Original Message-
From: Mikhail Khludnev  
Sent: Monday, November 7, 2022 2:44 AM
To: java-user@lucene.apache.org
Subject: [EXT] Re: Efficient sort on SortedDocValues


Hello, Andrei.
Docs are scored in order (see Weight.scoreAll(), scoreRange()), simply because 
the underlying postings API is in-order. There are a few shortcuts/optimizations, 
but they only skip some iterations/segments, for example by checking competitive 
scores and so on.

On Sun, Nov 6, 2022 at 1:35 AM Solodin, Andrei (TR Technology) wrote:

> One more thing. While the test case passes now, it still iterates in 
> index order, which means that it still collects ~6.4K docs out of 10k matches.
> This is an improvement, but I am still wondering why it's not possible 
> to iterate in the field order. It seems like that would provide a 
> substantial improvement.
>
> From: Solodin, Andrei (TR Technology)
> Sent: Saturday, November 5, 2022 5:18 PM
> To: java-user@lucene.apache.org
> Subject: RE: Efficient sort on SortedDocValues
>
> I just realized that the problem is that the field needs to be indexed 
> as well. Now it works. But I noticed that this only works in Lucene 9. 
> Does not work in Lucene 8 (specifically 8.11.2). This must be new 
> functionality in Lucene 9?
>
> Thanks
>
>
> From: Solodin, Andrei (TR Technology)
> Sent: Saturday, November 5, 2022 1:07 PM
> To: java-user@lucene.apache.org
> Subject: Efficient sort on SortedDocValues
>
> Hello Lucene community, while looking into how to efficiently sort on 
> a field value, I came across a couple of things that I don't quite 
> understand. My assumption was that if I execute a search and sort on a 
> SortedDocValues field, lucene would only iterate over the docs in the 
> order of the field values or at least collect only competitive docs 
> (docs that made it into the topN queue). Neither of those things seems 
> to be happening. Instead, the iteration is happening in index order 
> and all matched docs are collected. Looking at the code, I see that 
> the optimizations are only possible if the index is sorted in the 
> field order to begin with, which is not possible for our use case. We 
> may have dozens of such fields in our index, thus there isn't any one 
> field that can be used to sort the index. So I guess my question is 
> whether what I am trying to achieve is possible. I tried to look through 
> the Solr codebase, but so far couldn't come up with anything. A code example 
> is here: https://pastebin.com/i05E2wZy
> I am using 9.4.1. Thanks in advance.
>
> Andrei
>
>

--
Sincerely yours
Mikhail Khludnev


Re: Best strategy migrate indexes

2022-11-07 Thread Pablo Vázquez Blázquez
Thanks TX for your response.

> I would check that the Luke version matches the Lucene version - if
> the two match, it shouldn't be possible to get issues like this.
> That is, the precise versions of Lucene each is using.


Yes, I am using https://github.com/DmitryKey/luke/releases/tag/luke-7.1.0

It works fine with my newly generated indexes, but not with the
"migrated" ones.


Re: Best strategy migrate indexes

2022-11-07 Thread Trejkaz
The process itself sounds like it should work (it's basically a
reindex so it should be safer than trying to migrate directly.)

I would check that the Luke version matches the Lucene version - if
the two match, it shouldn't be possible to get issues like this.
That is, the precise versions of Lucene each is using.
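
One detail in the listed steps that may be worth a second look is the loop bound. In Lucene, numDocs() counts live documents, while document IDs run from 0 to maxDoc() - 1, so the two differ once the source index contains deletions. The sketch below is a plain-Java model of that difference. It uses no Lucene classes; LiveDocsModel, liveIds and the BitSet of live docs are stand-ins made up for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Plain-Java model: doc IDs 0..maxDoc-1, with a bit set marking live (non-deleted) docs.
final class LiveDocsModel {
    /** Returns the IDs a migration should visit: every live doc below maxDoc. */
    static List<Integer> liveIds(BitSet live, int maxDoc) {
        List<Integer> ids = new ArrayList<>();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (live.get(docId)) {
                ids.add(docId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        int maxDoc = 5;
        BitSet live = new BitSet(maxDoc);
        live.set(0, maxDoc);
        live.clear(1);                     // doc 1 was deleted
        int numDocs = live.cardinality();  // number of live docs

        // Looping up to numDocs would visit IDs 0..3: it touches deleted doc 1
        // and never reaches doc 4.
        System.out.println("numDocs=" + numDocs + ", maxDoc=" + maxDoc);
        System.out.println("live IDs: " + liveIds(live, maxDoc));
    }
}
```

If the source index has no deletions the two bounds coincide, so this only matters for indexes that have seen deletes or updates.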

TX



Re: Best strategy migrate indexes

2022-11-07 Thread Pablo Vázquez Blázquez
Hi!

> I am trying to create a tool to read docs from a lucene5 index and
> generate lucene9 documents from them (with docValues). That might work,
> right? I am shading both lucene5 and lucene9 to avoid package conflicts.

I am doing the following steps:

- create IndexReader with lucene5 package over a lucene5 index
- create IndexWriter with lucene7 package
- iterate over reader.numDocs() to process each Document (lucene5)
- convert each Document (lucene5) to lucene7 Document
- for each IndexableField (lucene5) from Document (lucene5) convert
  it to create an IndexableField (lucene7)
- create a SortedDocValuesField (lucene7) and add it to the
  Document (lucene7)
- add the field to the Document (lucene7)
- add each converted Document to the writer
- close IndexReader and IndexWriter

When I open the resulting migrated lucene7 index with Luke I get an error:
org.apache.lucene.index.IndexFormatTooNewException: Format version is not
supported (resource
BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))):
9 (needs to be between 6 and 7)

When I use the tool "luceneupgrader", I get:
java -jar luceneupgrader-0.5.2-SNAPSHOT.jar info tests_small_index-7.x-migrator
Lucene index version: 7

What am I doing wrong or misunderstanding?

Thanks!

On Wed, 2 Nov 2022 at 21:13, Pablo Vázquez Blázquez wrote:

> Hi,
>
> > Luckily we were already using lucenemigrator
>
> What do you mean by "lucenemigrator"? Is it a public tool?
>
> I am trying to create a tool to read docs from a lucene5 index and
> generate lucene9 documents from them (with docValues). That might work,
> right? I am shading both lucene5 and lucene9 to avoid package conflicts.
>
> Thanks!
>
> On Tue, 1 Nov 2022 at 0:35, Trejkaz wrote:
>
>> Well...
>>
>> There's a way, but I wouldn't necessarily recommend it.
>>
>> You can write custom migration code against some version of Lucene
>> which supports doc values, to create doc values fields. It's going to
>> involve writing a FilterCodecReader which wraps your real index and
>> then pretends to also have doc values, which you'll build in a custom
>> class which works similarly to UninvertingReader. Then you pass those
>> CodecReaders to IndexWriter.addIndexes to create a new index which
>> really has those doc values.
>>
>> We did that ourselves when we had the same issue. The only painful
>> thing about it is having to keep around older versions of lucene to do
>> that migration. Forever. Luckily we were already using lucenemigrator,
>> which has the older versions baked into it with package prefixes. So
>> that library will get fatter and fatter over time but at least our own
>> code only gets fatter at the rate migrations are added.
>>
>> The same approach works for any other kind of ad-hoc migration you
>> might want to perform. e.g., you might want to create points. Or
>> remove an index for a field. Or add an index for a field.
>>
>> TX
>>
>>
>> On Tue, 1 Nov 2022 at 02:57, Pablo Vázquez Blázquez 
>> wrote:
>> >
>> > Hi all,
>> >
>> > Thank you all for your responses.
>> >
>> > So, when updating to a newer (major) Lucene version that modifies its
>> > codecs, there is no way to ensure everything keeps working properly,
>> > unless re-indexing, right?
>> >
>> > Apart from not having some original sources that were indexed (which I
>> > will try to solve by using the *IndexUpgrader* tool), I have another
>> > problem: I was using the org.apache.lucene.uninverting.UninvertingReader
>> > to perform queries against the index, mainly using the grouping API. But
>> > it has been removed (as of Lucene 7.0). So, again, do I have any other
>> > alternative, apart from re-indexing to use docValues?
>> >
>> > To give you more context, I am a developer of a tool that multiple
>> > customers can use to index their data (currently, with Lucene 5.5.5). We
>> > are planning to upgrade to Lucene 9 (because of some vulnerabilities
>> > affecting Lucene 5.5.5) and I think asking them to reindex will not go
>> > down well :(
>> >
>> > Regards,
>> >
>> > On Sat, 29 Oct 2022 at 23:31, Matt Davis wrote:
>> >
>> > > Inside of Zulia search engine, the object being indexed is always a
>> > > JSON/BSON object and we store the BSON as a stored byte field in the
>> > > index. This allows easy internal reindexing when the searchable fields
>> > > change but also allows us to update to the latest lucene version.
>> > > Combined with using lucene-backward-codecs, an index older than the
>> > > current major version can be opened and reindexed. If you have stored
>> > > all the fields (or a json/bson) in the index, it would be easy to
>> > > reindex in the new format. If you have not, maybe opening with
>> > > lucene-backward-codecs will be enough for your use case.
>> > >
>> > > Thanks,
>> > > Matt
>> > >
>> > > On Sat, Oct 29, 2022 at 2:30 PM Baris 

Re: Learning Lucene from ground up

2022-11-07 Thread Adrien Grand
+1 to MyCoy's suggestion.

To answer your most immediate questions:
 - Lucene mostly loads metadata in memory at the time of opening a segment
(dvm, tmd, fdm, vem, nvm, kdm files), other files are memory-mapped and
Lucene relies on the filesystem cache to have their data efficiently
available. This allows Lucene to have a very small memory footprint for
searching.
 - Finite state machines are mostly used for suggesters and for the terms
index (tip file), which essentially stores all prefixes that are shared by
25-40 terms in an FST.
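
As a rough illustration of the memory-mapped side, here is a plain java.nio sketch. It is not Lucene's own directory code, and the class name, method name, temp file and file contents are all made up. Mapping a file yields a buffer whose pages are served by the operating system's filesystem cache on access, rather than being copied onto the Java heap up front.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MmapSketch {
    /** Writes data to a temp file, maps it read-only, and returns the byte at offset. */
    static byte writeThenReadMapped(byte[] data, int offset) {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            try {
                Files.write(file, data);
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                    // No bulk read into a heap array: the OS pages data in when it is touched.
                    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                    return buf.get(offset);
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "lucene".getBytes(StandardCharsets.US_ASCII);
        System.out.println((char) writeThenReadMapped(data, 2));
    }
}
```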

On Sun, Nov 6, 2022 at 2:12 AM MyCoy Z  wrote:

> I just started learning the Lucene HNSW source code last month.
>
> I find the most effective way is to start with the test cases, set debugging
> breakpoints in the code you're interested in, and walk through the code.
>
> Regards
> MyCoy
>
> On Fri, Nov 4, 2022 at 9:24 PM Rahul Goswami 
> wrote:
>
> > Hello,
> > I have been working with Lucene and Solr for quite some time and have a
> > good understanding of a lot of moving parts at the code level. However, I
> > wish to learn Lucene internals from the ground up and want to familiarize
> > myself with all the dirty details. I would like to know what would be the
> > best way to go about it.
> >
> > To kick things off, I have been thinking about picking up “Lucene in
> > Action”, but have been hesitant (possibly wrongly) since it is based on
> > Lucene 3.0 and we have come a long way since then. An example of the
> > level of detail I wish to learn (among other things) would be which
> > parts of a segment (.tim, .tip, etc.) get loaded in memory at search time,
> > which part uses finite state machines and why, etc.
> >
> > I would really appreciate any thoughts/inputs on how I can go about this.
> > Thanks in advance!
> >
> > Regards,
> > Rahul
> >
>


-- 
Adrien


Re: Efficient sort on SortedDocValues

2022-11-07 Thread Adrien Grand
Hi Andrei,

The case that you are describing was optimized in Lucene 9.4.0, provided
your field is also indexed with a StringField:
https://github.com/apache/lucene/pull/1023. See annotation ER at
http://people.apache.org/~mikemccand/lucenebench/TermMonthSort.html.

The way it works is that Lucene will automatically leverage the inverted
index in order to only look at documents that compare better than the
current k-th document in the priority queue.

To make it work with your test case, you will need to:
 - index a StringField with the same name and same value
 - change values to be less random if possible, since this optimization
works better on low-cardinality fields than on high-cardinality fields
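
A toy model of that idea follows, in plain Java rather than Lucene code. Docs are still visited in doc-ID order, but once the queue is full, only docs whose term sorts before the current k-th value are looked at. PrunedSortModel, topK and examined are names made up for this sketch. Recomputing the competitive set on every queue change is a simplification, not a claim about how the change in the pull request above is implemented.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeMap;

final class PrunedSortModel {
    /** Number of docs actually examined by the last call to topK. */
    static int examined;

    /** Returns the k smallest values in ascending order, skipping non-competitive docs. */
    static List<String> topK(String[] valueByDoc, int k) {
        examined = 0;
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Toy "inverted index": term -> docs containing it (the indexed-field side).
        TreeMap<String, BitSet> postings = new TreeMap<>();
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            postings.computeIfAbsent(valueByDoc[doc], t -> new BitSet()).set(doc);
        }
        // Max-heap of the k best values so far; its head is the current k-th value.
        PriorityQueue<String> queue = new PriorityQueue<>(Comparator.reverseOrder());
        BitSet competitive = null; // null means every doc can still compete
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            if (competitive != null) {
                doc = competitive.nextSetBit(doc); // jump over docs that cannot enter the queue
                if (doc < 0) {
                    break;
                }
            }
            examined++;
            String value = valueByDoc[doc]; // the doc-values side
            if (queue.size() < k) {
                queue.add(value);
            } else if (value.compareTo(queue.peek()) < 0) {
                queue.poll();
                queue.add(value);
            } else {
                continue;
            }
            if (queue.size() == k) {
                // Only docs whose term sorts strictly before the k-th value remain competitive.
                competitive = new BitSet();
                for (BitSet docs : postings.headMap(queue.peek(), false).values()) {
                    competitive.or(docs);
                }
            }
        }
        List<String> result = new ArrayList<>(queue);
        result.sort(Comparator.naturalOrder());
        return result;
    }

    public static void main(String[] args) {
        String[] months = {"dec", "jan", "feb", "jan", "mar", "apr", "jan", "dec"};
        System.out.println(topK(months, 2) + " after examining " + examined + " docs");
    }
}
```

In this model, the fewer distinct values there are, the sooner the competitive set shrinks, which is in line with the note above about low-cardinality fields.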






-- 
Adrien


Re: Efficient sort on SortedDocValues

2022-11-07 Thread Mikhail Khludnev
Hello, Andrei.
Docs are scored in order (see Weight.scoreAll(), scoreRange()), simply
because the underlying postings API is in-order. There are a few
shortcuts/optimizations, but they only skip some iterations/segments, for
example by checking competitive scores and so on.
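
To make the in-order point concrete, here is a small plain-Java model; it is not Lucene code, and InOrderTopN and countCompetitive are hypothetical names. Every matching doc is visited in doc-ID order, and the only saving is that a doc which cannot beat the current N-th value is not put into the queue.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class InOrderTopN {
    /** Visits every doc in doc-ID order; returns how many were competitive when visited. */
    static int countCompetitive(int[] valueByDoc, int n) {
        if (n <= 0) {
            return 0;
        }
        // Max-heap of the n smallest values so far; its head is the current n-th value.
        PriorityQueue<Integer> queue = new PriorityQueue<>(Comparator.reverseOrder());
        int competitive = 0;
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            int value = valueByDoc[doc];
            if (queue.size() < n) {
                queue.add(value);
                competitive++;
            } else if (value < queue.peek()) {
                queue.poll();
                queue.add(value);
                competitive++;
            }
            // Otherwise the doc was still visited, it just could not enter the top n.
        }
        return competitive;
    }

    public static void main(String[] args) {
        int[] values = {5, 3, 8, 1, 9, 2};
        System.out.println(countCompetitive(values, 2) + " competitive out of " + values.length);
    }
}
```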

On Sun, Nov 6, 2022 at 1:35 AM Solodin, Andrei (TR Technology) wrote:

> One more thing. While the test case passes now, it still iterates in index
> order, which means that it still collects ~6.4K docs out of 10k matches.
> This is an improvement, but I am still wondering why it's not possible to
> iterate in the field order. It seems like that would provide a substantial
> improvement.
>
> From: Solodin, Andrei (TR Technology)
> Sent: Saturday, November 5, 2022 5:18 PM
> To: java-user@lucene.apache.org
> Subject: RE: Efficient sort on SortedDocValues
>
> I just realized that the problem is that the field needs to be indexed as
> well. Now it works. But I noticed that this only works in Lucene 9. Does
> not work in Lucene 8 (specifically 8.11.2). This must be new functionality
> in Lucene 9?
>
> Thanks
>
>
> From: Solodin, Andrei (TR Technology)
> Sent: Saturday, November 5, 2022 1:07 PM
> To: java-user@lucene.apache.org
> Subject: Efficient sort on SortedDocValues
>
> Hello Lucene community, while looking into how to efficiently sort on a
> field value, I came across a couple of things that I don't quite
> understand. My assumption was that if I execute a search and sort on a
> SortedDocValues field, lucene would only iterate over the docs in the order
> of the field values or at least collect only competitive docs (docs that
> made it into the topN queue). Neither of those things seems to be
> happening. Instead, the iteration is happening in index order and all
> matched docs are collected. Looking at the code, I see that the
> optimizations are only possible if the index is sorted in the field order
> to begin with, which is not possible for our use case. We may have dozens
> of such fields in our index, thus there isn't any one field that can be
> used to sort the index. So I guess my question is whether what I am trying
> to achieve is possible. I tried to look through the Solr codebase, but so far
> couldn't come up with anything. A code example is here:
> https://pastebin.com/i05E2wZy. I am using 9.4.1. Thanks in advance.
>
> Andrei
>
>

-- 
Sincerely yours
Mikhail Khludnev