Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-09 Thread Michael McCandless
I'd also love to understand this:
> using SimpleFSDirectoryFactory (since Mmap doesn't quite work well on Windows for our index sizes which commonly run north of 1 TB)
Is this a known problem on certain versions of Windows? Normally memory mapped IO can scale to very large sizes (well beyond

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-07 Thread Adrien Grand
I agree it's worth discussing. I opened https://github.com/apache/lucene/issues/12355 and https://github.com/apache/lucene/issues/12356.
On Tue, Jun 6, 2023 at 9:17 PM Rahul Goswami wrote:
> Thanks Adrien. I spent some time trying to understand the readByte() in ReverseRandomAccessReader

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Rahul Goswami
Thanks Adrien. I spent some time trying to understand the readByte() in ReverseRandomAccessReader (through FST) and compare with 7.x. Although I don't understand ALL of the details and reasoning for always loading the FST (and in turn the term index) off-heap (as discussed in

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
Yes, this changed in 8.x:
- 8.0 moved the terms index off-heap for non-PK fields with MMapDirectory. https://github.com/apache/lucene/issues/9681
- Then in 8.6 the FST was moved off-heap all the time. https://github.com/apache/lucene/issues/10297
More generally, there are a few files that are no

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Rahul Goswami
Thanks Adrien. Is this behavior of FST something that has changed in Lucene 8.x (from 7.x)? Also, is the terms index not loaded into memory anymore in 8.x? To your point on MMapDirectoryFactory, it is much faster as you anticipated, but the indexes commonly being >1 TB makes the Windows machine

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
+Alan Woodward helped me better understand what is going on here. BufferedIndexInput (used by NIOFSDirectory and SimpleFSDirectory) doesn't play well with the fact that the FST reads bytes backwards: every call to readByte() triggers a refill of 1kB because it wants to read the byte that is just
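The refill behavior described in this message can be illustrated with a toy model. The sketch below is not Lucene's BufferedIndexInput; it is a minimal simulation that assumes a reader which, on a buffer miss, refills a 1 kB window starting at the requested position and extending forward. The class name ForwardBufferedReader and the helper count_refills are hypothetical names introduced only for this illustration.

```python
class ForwardBufferedReader:
    """Toy model of a forward-oriented buffered reader (NOT Lucene's code)."""

    def __init__(self, data: bytes, buffer_size: int = 1024):
        self.data = data
        self.buffer_size = buffer_size
        self.buf_start = 0  # file offset of the first buffered byte
        self.buf_len = 0    # number of valid bytes currently buffered
        self.refills = 0    # how many times the buffer was refilled

    def read_byte_at(self, pos: int) -> int:
        # Assumption: on a miss, the new window starts at pos and extends
        # forward, so the byte just before pos is never in the buffer.
        in_buffer = self.buf_start <= pos < self.buf_start + self.buf_len
        if not in_buffer:
            self.buf_start = pos
            self.buf_len = min(self.buffer_size, len(self.data) - pos)
            self.refills += 1
        return self.data[pos]


def count_refills(positions, data: bytes, buffer_size: int = 1024) -> int:
    """Read the given positions in order and return the number of refills."""
    reader = ForwardBufferedReader(data, buffer_size)
    for p in positions:
        reader.read_byte_at(p)
    return reader.refills


data = bytes(range(256)) * 16  # 4096 bytes of sample data

forward = count_refills(range(len(data)), data)
backward = count_refills(range(len(data) - 1, -1, -1), data)

print("forward refills: ", forward)   # 4 (one per 1024-byte window)
print("backward refills:", backward)  # 4096 (one per byte read)
```

Under these assumptions, a forward scan of 4096 bytes costs 4 refills, while the same bytes read back to front cost one refill per byte. That is the kind of amplification a backwards-reading FST would see on a buffered, forward-filling input.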

Re: Performance regression in getting doc by id in Lucene 8 vs Lucene 7

2023-06-06 Thread Adrien Grand
My best guess based on your description of the issue is that SimpleFSDirectory doesn't like the fact that the terms index now reads data directly from the directory instead of loading the terms index in heap. Would you be able to run the same benchmark with MMapDirectory to check if it addresses