I'd also love to understand this:
> using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on
Windows for our index sizes which commonly run north of 1 TB)
Is this a known problem on certain versions of Windows? Normally memory
mapped IO can scale to very large sizes (well beyond
I agree it's worth discussing. I opened
https://github.com/apache/lucene/issues/12355 and
https://github.com/apache/lucene/issues/12356.
On Tue, Jun 6, 2023 at 9:17 PM Rahul Goswami wrote:
>
> Thanks Adrien. I spent some time trying to understand the readByte() in
> ReverseRandomAccessReader
Thanks Adrien. I spent some time trying to understand the readByte() in
ReverseRandomAccessReader (through FST) and compare with 7.x. Although I
don't understand ALL of the details and reasoning for always loading the
FST (and in turn the term index) off-heap (as discussed in
Yes, this changed in 8.x:
- 8.0 moved the terms index off-heap for non-PK fields with
MMapDirectory. https://github.com/apache/lucene/issues/9681
- Then in 8.6 the FST was moved off-heap all the time.
https://github.com/apache/lucene/issues/10297
More generally, there's a few files that are no
Thanks Adrien. Is this behavior of FST something that has changed in Lucene
8.x (from 7.x)?
Also, is the terms index not loaded into memory anymore in 8.x?
To your point on MMapDirectoryFactory, it is much faster as you
anticipated, but the indexes commonly being >1 TB makes the Windows machine
+Alan Woodward helped me better understand what is going on here.
BufferedIndexInput (used by NIOFSDirectory and SimpleFSDirectory)
doesn't play well with the fact that the FST reads bytes backwards:
every call to readByte() triggers a refill of 1kB because it wants to
read the byte that is just
My best guess based on your description of the issue is that
SimpleFSDirectory doesn't like the fact that the terms index now reads
data directly from the directory instead of loading the terms index in
heap. Would you be able to run the same benchmark with MMapDirectory
to check if it addresses