[jira] [Commented] (LUCENE-10302) PriorityQueue: optimize where we collect then iterate by using O(N) heapify
[ https://issues.apache.org/jira/browse/LUCENE-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501358#comment-17501358 ] Greg Miller commented on LUCENE-10302: -- As I worked on this a little more, it occurs to me that I think we might want to implement two separate abstractions here. It feels a bit like overloading the idea of a PQ builder to provide sorted results. I started sketching out the idea of a "top-N collector" (naming isn't great...) that abstracts the idea of just "collecting" the top-n provided elements, either sorted or unsorted. It encapsulates details about whether-or-not it ever uses a heap, and just exposes methods to get the top-n, either sorted or unsorted. I think it's separately still very useful to have an explicit PQ builder for cases where the user knows they need a PQ but can build it in bulk up-front. > PriorityQueue: optimize where we collect then iterate by using O(N) heapify > --- > > Key: LUCENE-10302 > URL: https://issues.apache.org/jira/browse/LUCENE-10302 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Major > Attachments: LUCENE_PriorityQueue_Builder_with_heapify.patch > > > Looking at LUCENE-8875 (LargeNumHitsTopDocsCollector.java ) I got to > wondering if there was faster-than O(N*log(N)) way of loading a PriorityQueue > when we provide a bulk array to initialize the heap/PriorityQueue. It turns > out there is: the JDK's PriorityQueue supports this in its constructors, > referring to "This classic algorithm due to Floyd (1964) is known to be > O(size)" -- heapify() method. There's > [another|https://www.geeksforgeeks.org/building-heap-from-array/] that may > or may not be the same; I didn't look too closely yet. I see a number of > uses of Lucene's PriorityQueue that first collects values and only after > collecting want to do something with the results (typical / unsurprising). > This lends itself to a builder pattern that can look similar to > LargeNumHitsTopDocsCollector in terms of first having an array used like a > list and then move over to the PriorityQueue if/when it gets full (it may > not). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10302) PriorityQueue: optimize where we collect then iterate by using O(N) heapify
[ https://issues.apache.org/jira/browse/LUCENE-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501354#comment-17501354 ] Greg Miller commented on LUCENE-10302: -- [~vigyas] ah, it looks like the URL pulled in a trailing dot ('.'). If you remove that, it should work. Here it is properly: https://github.com/gsmiller/lucene/tree/LUCENE-10302-pq-builder-sketch > PriorityQueue: optimize where we collect then iterate by using O(N) heapify > --- > > Key: LUCENE-10302 > URL: https://issues.apache.org/jira/browse/LUCENE-10302 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Major > Attachments: LUCENE_PriorityQueue_Builder_with_heapify.patch > > > Looking at LUCENE-8875 (LargeNumHitsTopDocsCollector.java ) I got to > wondering if there was faster-than O(N*log(N)) way of loading a PriorityQueue > when we provide a bulk array to initialize the heap/PriorityQueue. It turns > out there is: the JDK's PriorityQueue supports this in its constructors, > referring to "This classic algorithm due to Floyd (1964) is known to be > O(size)" -- heapify() method. There's > [another|https://www.geeksforgeeks.org/building-heap-from-array/] that may > or may not be the same; I didn't look too closely yet. I see a number of > uses of Lucene's PriorityQueue that first collects values and only after > collecting want to do something with the results (typical / unsurprising). > This lends itself to a builder pattern that can look similar to > LargeNumHitsTopDocsCollector in terms of first having an array used like a > list and then move over to the PriorityQueue if/when it gets full (it may > not). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10302) PriorityQueue: optimize where we collect then iterate by using O(N) heapify
[ https://issues.apache.org/jira/browse/LUCENE-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501101#comment-17501101 ] Vigya Sharma commented on LUCENE-10302: --- [~gsmiller] Is the link you pasted working? I get a 404 when I try to open it. > PriorityQueue: optimize where we collect then iterate by using O(N) heapify > --- > > Key: LUCENE-10302 > URL: https://issues.apache.org/jira/browse/LUCENE-10302 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Major > Attachments: LUCENE_PriorityQueue_Builder_with_heapify.patch > > > Looking at LUCENE-8875 (LargeNumHitsTopDocsCollector.java ) I got to > wondering if there was faster-than O(N*log(N)) way of loading a PriorityQueue > when we provide a bulk array to initialize the heap/PriorityQueue. It turns > out there is: the JDK's PriorityQueue supports this in its constructors, > referring to "This classic algorithm due to Floyd (1964) is known to be > O(size)" -- heapify() method. There's > [another|https://www.geeksforgeeks.org/building-heap-from-array/] that may > or may not be the same; I didn't look too closely yet. I see a number of > uses of Lucene's PriorityQueue that first collects values and only after > collecting want to do something with the results (typical / unsurprising). > This lends itself to a builder pattern that can look similar to > LargeNumHitsTopDocsCollector in terms of first having an array used like a > list and then move over to the PriorityQueue if/when it gets full (it may > not). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10302) PriorityQueue: optimize where we collect then iterate by using O(N) heapify
[ https://issues.apache.org/jira/browse/LUCENE-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500898#comment-17500898 ] Greg Miller commented on LUCENE-10302: -- Thanks [~dsmiley]. I'd sketched something out as well and have it sitting over here on a branch: [https://github.com/gsmiller/lucene/tree/LUCENE-10302-pq-builder-sketch.] I'll have a look at your patch file and see where ideas are similar/different. [~vigyas] sounds good! > PriorityQueue: optimize where we collect then iterate by using O(N) heapify > --- > > Key: LUCENE-10302 > URL: https://issues.apache.org/jira/browse/LUCENE-10302 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Major > Attachments: LUCENE_PriorityQueue_Builder_with_heapify.patch > > > Looking at LUCENE-8875 (LargeNumHitsTopDocsCollector.java ) I got to > wondering if there was faster-than O(N*log(N)) way of loading a PriorityQueue > when we provide a bulk array to initialize the heap/PriorityQueue. It turns > out there is: the JDK's PriorityQueue supports this in its constructors, > referring to "This classic algorithm due to Floyd (1964) is known to be > O(size)" -- heapify() method. There's > [another|https://www.geeksforgeeks.org/building-heap-from-array/] that may > or may not be the same; I didn't look too closely yet. I see a number of > uses of Lucene's PriorityQueue that first collects values and only after > collecting want to do something with the results (typical / unsurprising). > This lends itself to a builder pattern that can look similar to > LargeNumHitsTopDocsCollector in terms of first having an array used like a > list and then move over to the PriorityQueue if/when it gets full (it may > not). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10302) PriorityQueue: optimize where we collect then iterate by using O(N) heapify
[ https://issues.apache.org/jira/browse/LUCENE-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500803#comment-17500803 ] David Smiley commented on LUCENE-10302: --- I attached my WIP as a patch file. Looking back, I started with defining the Builder code. I didn't yet implement "heapify". Nobody calls any of it. > PriorityQueue: optimize where we collect then iterate by using O(N) heapify > --- > > Key: LUCENE-10302 > URL: https://issues.apache.org/jira/browse/LUCENE-10302 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Major > Attachments: LUCENE_PriorityQueue_Builder_with_heapify.patch > > > Looking at LUCENE-8875 (LargeNumHitsTopDocsCollector.java ) I got to > wondering if there was faster-than O(N*log(N)) way of loading a PriorityQueue > when we provide a bulk array to initialize the heap/PriorityQueue. It turns > out there is: the JDK's PriorityQueue supports this in its constructors, > referring to "This classic algorithm due to Floyd (1964) is known to be > O(size)" -- heapify() method. There's > [another|https://www.geeksforgeeks.org/building-heap-from-array/] that may > or may not be the same; I didn't look too closely yet. I see a number of > uses of Lucene's PriorityQueue that first collects values and only after > collecting want to do something with the results (typical / unsurprising). > This lends itself to a builder pattern that can look similar to > LargeNumHitsTopDocsCollector in terms of first having an array used like a > list and then move over to the PriorityQueue if/when it gets full (it may > not). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10302) PriorityQueue: optimize where we collect then iterate by using O(N) heapify
[ https://issues.apache.org/jira/browse/LUCENE-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500565#comment-17500565 ] Vigya Sharma commented on LUCENE-10302: --- +1. I'd also like to look at it and maybe pick up a piece. > PriorityQueue: optimize where we collect then iterate by using O(N) heapify > --- > > Key: LUCENE-10302 > URL: https://issues.apache.org/jira/browse/LUCENE-10302 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Major > > Looking at LUCENE-8875 (LargeNumHitsTopDocsCollector.java ) I got to > wondering if there was faster-than O(N*log(N)) way of loading a PriorityQueue > when we provide a bulk array to initialize the heap/PriorityQueue. It turns > out there is: the JDK's PriorityQueue supports this in its constructors, > referring to "This classic algorithm due to Floyd (1964) is known to be > O(size)" -- heapify() method. There's > [another|https://www.geeksforgeeks.org/building-heap-from-array/] that may > or may not be the same; I didn't look too closely yet. I see a number of > uses of Lucene's PriorityQueue that first collects values and only after > collecting want to do something with the results (typical / unsurprising). > This lends itself to a builder pattern that can look similar to > LargeNumHitsTopDocsCollector in terms of first having an array used like a > list and then move over to the PriorityQueue if/when it gets full (it may > not). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10302) PriorityQueue: optimize where we collect then iterate by using O(N) heapify
[ https://issues.apache.org/jira/browse/LUCENE-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17500239#comment-17500239 ] Greg Miller commented on LUCENE-10302: -- {quote}I have the start of some code in progress I can share to anyone interested. I'm not sure when I'll get back to this. {quote} [~dsmiley] I'd be interested in exploring this a little bit as I have some spare time. I'm not going to assign it to myself since I'm not sure how much time I'll have for it, but I'd be curious to take a look at the progress you've made if you don't mind posting. > PriorityQueue: optimize where we collect then iterate by using O(N) heapify > --- > > Key: LUCENE-10302 > URL: https://issues.apache.org/jira/browse/LUCENE-10302 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Major > > Looking at LUCENE-8875 (LargeNumHitsTopDocsCollector.java ) I got to > wondering if there was faster-than O(N*log(N)) way of loading a PriorityQueue > when we provide a bulk array to initialize the heap/PriorityQueue. It turns > out there is: the JDK's PriorityQueue supports this in its constructors, > referring to "This classic algorithm due to Floyd (1964) is known to be > O(size)" -- heapify() method. There's > [another|https://www.geeksforgeeks.org/building-heap-from-array/] that may > or may not be the same; I didn't look too closely yet. I see a number of > uses of Lucene's PriorityQueue that first collects values and only after > collecting want to do something with the results (typical / unsurprising). > This lends itself to a builder pattern that can look similar to > LargeNumHitsTopDocsCollector in terms of first having an array used like a > list and then move over to the PriorityQueue if/when it gets full (it may > not). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10302) PriorityQueue: optimize where we collect then iterate by using O(N) heapify
[ https://issues.apache.org/jira/browse/LUCENE-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456448#comment-17456448 ] David Smiley commented on LUCENE-10302: --- This builder could provide a buildList method in addition to buildQueue since some consumers don't really need a PriorityQueue specifically -- PQ is an implementation detail to get a top-N, typically want sorted top-N, and this list can be sorted quickly & easily (be it the unsorted input array or the heap array). I'm not sure if it's actually faster to pop() the heap array one-by-one; I suspect not. > PriorityQueue: optimize where we collect then iterate by using O(N) heapify > --- > > Key: LUCENE-10302 > URL: https://issues.apache.org/jira/browse/LUCENE-10302 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Major > > Looking at LUCENE-8875 (LargeNumHitsTopDocsCollector.java ) I got to > wondering if there was faster-than O(N*log(N)) way of loading a PriorityQueue > when we provide a bulk array to initialize the heap/PriorityQueue. It turns > out there is: the JDK's PriorityQueue supports this in its constructors, > referring to "This classic algorithm due to Floyd (1964) is known to be > O(size)" -- heapify() method. There's > [another|https://www.geeksforgeeks.org/building-heap-from-array/] that may > or may not be the same; I didn't look too closely yet. I see a number of > uses of Lucene's PriorityQueue that first collects values and only after > collecting want to do something with the results (typical / unsurprising). > This lends itself to a builder pattern that can look similar to > LargeNumHitsTopDocsCollector in terms of first having an array used like a > list and then move over to the PriorityQueue if/when it gets full (it may > not). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10302) PriorityQueue: optimize where we collect then iterate by using O(N) heapify
[ https://issues.apache.org/jira/browse/LUCENE-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456443#comment-17456443 ] David Smiley commented on LUCENE-10302: --- Fun fact: there are 56 usages of Lucene's PriorityQueue in Lucene. I counted via adding the find-usages of both constructors (55 + 1). > PriorityQueue: optimize where we collect then iterate by using O(N) heapify > --- > > Key: LUCENE-10302 > URL: https://issues.apache.org/jira/browse/LUCENE-10302 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Major > > Looking at LUCENE-8875 (LargeNumHitsTopDocsCollector.java ) I got to > wondering if there was faster-than O(N*log(N)) way of loading a PriorityQueue > when we provide a bulk array to initialize the heap/PriorityQueue. It turns > out there is: the JDK's PriorityQueue supports this in its constructors, > referring to "This classic algorithm due to Floyd (1964) is known to be > O(size)" -- heapify() method. There's > [another|https://www.geeksforgeeks.org/building-heap-from-array/] that may > or may not be the same; I didn't look too closely yet. I see a number of > uses of Lucene's PriorityQueue that first collects values and only after > collecting want to do something with the results (typical / unsurprising). > This lends itself to a builder pattern that can look similar to > LargeNumHitsTopDocsCollector in terms of first having an array used like a > list and then move over to the PriorityQueue if/when it gets full (it may > not). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10302) PriorityQueue: optimize where we collect then iterate by using O(N) heapify
[ https://issues.apache.org/jira/browse/LUCENE-10302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456441#comment-17456441 ] David Smiley commented on LUCENE-10302: --- As an aside; it's weird/annoying to have to subclass this PriorityQueue just to supply the comparator. Can't we do a Comparator? I have the start of some code in progress I can share to anyone interested. I'm not sure when I'll get back to this. > PriorityQueue: optimize where we collect then iterate by using O(N) heapify > --- > > Key: LUCENE-10302 > URL: https://issues.apache.org/jira/browse/LUCENE-10302 > Project: Lucene - Core > Issue Type: Improvement >Reporter: David Smiley >Priority: Major > > Looking at LUCENE-8875 (LargeNumHitsTopDocsCollector.java ) I got to > wondering if there was faster-than O(N*log(N)) way of loading a PriorityQueue > when we provide a bulk array to initialize the heap/PriorityQueue. It turns > out there is: the JDK's PriorityQueue supports this in its constructors, > referring to "This classic algorithm due to Floyd (1964) is known to be > O(size)" -- heapify() method. There's > [another|https://www.geeksforgeeks.org/building-heap-from-array/] that may > or may not be the same; I didn't look too closely yet. I see a number of > uses of Lucene's PriorityQueue that first collects values and only after > collecting want to do something with the results (typical / unsurprising). > This lends itself to a builder pattern that can look similar to > LargeNumHitsTopDocsCollector in terms of first having an array used like a > list and then move over to the PriorityQueue if/when it gets full (it may > not). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org