Re: Crazy increase of MultiPhraseQuery memory usage in Lucene 5 (compared with 3)
Thought I would try some thread necromancy here, because nobody replied about this a year ago.

Now we're on 5.4.1 and the numbers have changed a bit again. Recording best times for each operation:

    Indexing:           5.723 s
    SpanQuery:          25.13 s
    MultiPhraseQuery:   (waited 10 minutes and it hadn't completed)
    TermAutomatonQuery: 19.72 s

So it seems like span query performance is slightly better than it was in 5.2, but MultiPhraseQuery is still no good, and TermAutomatonQuery might be better than both.

TX

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Crazy increase of MultiPhraseQuery memory usage in Lucene 5 (compared with 3)
I spent some time carving out a quick test of the bits that matter and put them up here:

    https://gist.github.com/trejkaz/a72b87277b1aec800c2e

The tests index 1,000,000 docs with just one instance of the field/sub-field trick we're using, plus one unique value. So it's a bit of an artificial test, but benchmarks tend to be like that.

Times for Lucene 3.6:

    Indexing:         3.365 s
    SpanQuery:        20.48 s
    MultiPhraseQuery: 9.641 s

Times for Lucene 5.2:

    Indexing:         4.423 s
    SpanQuery:        31.94 s
    MultiPhraseQuery: (never completes due to OOME)

An aside which is totally a red herring: it seems there is quite a bit of slowdown on indexing and SpanQuery as well, which makes me wonder whether I have incorrectly configured the FieldType when compared with how the same field was indexed for 3.6.

You can also see from these numbers how MultiPhraseQuery used to be much faster than SpanQuery, which was why we stopped using SpanQuery for this particular query in the first place.

Timings aside, MultiPhraseQuery used to complete but now gets an OOME when provided 2GB of RAM for this particular case. I also tried hacking together a TermAutomatonQuery to see what happened with that, and it gets an OOME as well.

TX
Crazy increase of MultiPhraseQuery memory usage in Lucene 5 (compared with 3)
There is a MultiPhraseQuery we use which looks a bit like:

    MultiPhraseQuery query = new MultiPhraseQuery();
    query.add(new Term[] { first });
    query.add(new Term[] { second1, second2, ... });

The actual number of terms in this particular case is 207087. The size of the index itself is 21GB or so, with around 1,300,000 docs. Large but not gigantic. I ran the test with 2GB of RAM which was certainly enough for Lucene 3.

Although I do think that this is abusing MultiPhraseQuery and that SpanQuery is probably a better fit, I think that back in Lucene 3, there were problems with SpanQuery performance which resulted in switching to this as a performance hack.

Anyway, we now get an OOME when running this query and the heap histogram comes out sort of like this:

    Class                                         Objects            Size (bytes)
    int[]                                         995,093 (5.2%)     617,539,592 (31.6%)
    byte[]                                        1,065,597 (5.6%)   434,990,616 (22.3%)
    DocIdSet[]                                    777,620 (4.1%)     149,303,040 (7.6%)
    Lucene50PostingsReader$BlockPostingsEnum      326,022 (1.7%)     67,486,554 (3.5%)
    Lucene50PostingsFormat$IntBlockTermState      621,265 (3.2%)     57,777,645 (3%)

I went looking for the owner of these int arrays and it turns out to be a postings reader which is ultimately (unsurprisingly) being held by the MultiPhraseQuery.

What I'm wondering is:

- Why the increase in memory cost?
- Is our performance hack of using MultiPhraseQuery over SpanQuery really warranted anymore?
- Is there a better way to do this particular query?

Also, just in case this is an X-Y problem, what we're actually implementing here is simulating a large number of integer fields without using a large number of fields. We index the name of the sub-field followed by the value and then use this as a proximity query to say "find values in range X to Y with the sub-field immediately in front". This was done because there was some conventional wisdom saying that having a large number of fields in Lucene is problematic, although whether this still applies is unknown.
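To make the sub-field trick concrete, here is a rough sketch in plain Java, with no Lucene involved. It only models the token layout (sub-field name, then value in the next position) and shows how a range ends up as one candidate term per value in the second phrase position. Everything in it is hypothetical: the class and method names, the "size" and "width" sub-fields, and especially the zero-padded decimal encoding of values, which is an assumption and not necessarily how the real terms are encoded. The real query takes its second-position terms from what is actually in the index rather than enumerating every integer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names come from Lucene or from the real code.
final class SubFieldRangeSketch {

    // Assumption: values are indexed as fixed-width, zero-padded decimal terms.
    static String encodeValue(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("sketch only handles non-negative values");
        }
        return String.format("%010d", value);
    }

    // Token stream for one sub-field: its name, then its value in the next position.
    static List<String> tokensFor(String subField, long value) {
        return Arrays.asList(subField, encodeValue(value));
    }

    // Candidate terms for the second phrase position: one term per value in [min, max].
    static List<String> secondPositionTerms(long min, long max) {
        List<String> terms = new ArrayList<>();
        for (long v = min; v <= max; v++) {
            terms.add(encodeValue(v));
        }
        return terms;
    }

    // True if some position holds subField and the very next position holds
    // any one of the candidate terms (the "immediately in front" condition).
    static boolean matches(List<String> tokens, String subField, Set<String> candidates) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (tokens.get(i).equals(subField) && candidates.contains(tokens.get(i + 1))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A range wide enough to produce as many candidate terms as the real query has.
        List<String> terms = secondPositionTerms(1, 207087);
        System.out.println("terms in second position: " + terms.size());

        Set<String> candidates = new HashSet<>(terms);
        System.out.println(matches(tokensFor("size", 1500), "size", candidates));   // in range
        System.out.println(matches(tokensFor("size", 300000), "size", candidates)); // out of range
        System.out.println(matches(tokensFor("width", 1500), "size", candidates));  // wrong sub-field
    }
}
```

The point of the sketch is only that the width of the range directly sets the number of terms in the second position, which is where the 207087 figure comes from in our case.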
TX