Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
> > On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> > > Hello,
> > >
> > > I recently ran into a problem in development of my extension, which
> > > utilizes the shared buffers, when it released each buffer page.
> > >
> > > This extension transfers contents of the shared buffers to a GPU
> > > device using the DMA feature, then kicks a device kernel code.
> >
> > Wow, that sounds crazy.
> >
> > > Once a backend/extension calls ReadBuffer(), resowner.c tracks which
> > > buffer was referenced by the current resource owner, to ensure these
> > > buffers are released at the end of the transaction.
> > > However, it seems to me the implementation of resowner.c didn't assume
> > > many buffers are referenced by a particular resource owner
> > > simultaneously.
> > > It manages the buffer index using an expandable array, then looks up
> > > the target buffer by a sequential walk, starting from the tail because
> > > a recently pinned buffer tends to be released first.
> > > This caused trouble in my case. My extension pinned multiple thousands
> > > of buffers, so owner->buffers[] was enlarged and became expensive to
> > > walk.
> > > In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
> > > total while hash-joining 2M rows, even though the hash join itself
> > > takes less than 16 seconds.
> > >
> > > What is the best way to solve the problem?
> >
> > How about creating a separate ResourceOwner for these buffer pins, and
> > doing a wholesale ResourceOwnerRelease() on it when you're done?
> >
> Let me clarify your idea.
>
> 1. Create a separate ResourceOwner under the CurrentResourceOwner.
> 2. Switch CurrentResourceOwner to the self-constructed resource owner
>    during ReadBuffer() on thousands of buffers.
> 3. Switch back to the original CurrentResourceOwner.
> 4. Kick DMA and run the GPU device kernel (actual job running...)
> 5. Switch CurrentResourceOwner to the self-constructed resource owner
>    again, during ReleaseBuffer() in reverse order.
> 6. Switch back to the original CurrentResourceOwner, then call
>    ResourceOwnerDelete().
>
> Hmm... at this moment, I cannot find anything that would be hard to
> implement. I'd like to try this idea, then report back later.
>
Let me share the result. The above approach eliminated my problem.

An individual resource owner for each set of shared buffers, plus
ReleaseBuffer() in reverse order, reduced the time to release pinned
buffers from 36sec to 67.3msec when I run a hash join on 20M records;
that involves up to 59168 buffers pinned concurrently (32 asynchronous
requests, each request packing 1849 buffers).

Thanks for the suggestion!

postgres=# explain analyze select * from t0 natural join t1 natural join t2;
INFO:  time to release buffers: 67.289ms
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Custom (GpuHashJoin)  (cost=3468.00..471640.11 rows=19740924 width=139) (actual time=193.086..5913.459 rows=2000 loops=1)
   pseudo scan tlist: 1:(t0.bid), 2:(t0.aid), 3:, 4:, 5:, 6:, 7:, 8:, 9:, 11:, 12:[t2.bid], 10:, 13:[t1.aid]
   hash clause 1: (t0.bid = t2.bid)
   hash clause 2: (t0.aid = t1.aid)
   ->  Custom (GpuScan) on t0  (cost=500.00..467167.24 rows=2024 width=73) (actual time=7.757..1056.426 rows=2000 loops=1)
   ->  Custom (MultiHash)  (cost=734.00..734.00 rows=4 width=37) (actual time=23.382..23.382 rows=4 loops=1)
         hash keys: bid
         ->  Seq Scan on t2  (cost=0.00..734.00 rows=4 width=37) (actual time=0.007..5.124 rows=4 loops=1)
   ->  Custom (MultiHash)  (cost=734.00..734.00 rows=4 width=37) (actual time=11.919..11.919 rows=4 loops=1)
         hash keys: aid
         ->  Seq Scan on t1  (cost=0.00..734.00 rows=4 width=37) (actual time=0.010..5.299 rows=4 loops=1)
 Planning time: 0.904 ms
 Execution time: 6667.986 ms
(13 rows)

--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
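[Editorial note] The six numbered steps above might look roughly like the
following sketch against PostgreSQL's resowner and bufmgr APIs. The GpuChunk
struct and the function names are hypothetical illustrations, not code from
any actual extension:

```c
#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/resowner.h"

/* Hypothetical per-chunk state; names are illustrative only. */
typedef struct GpuChunk
{
	ResourceOwner	resowner;	/* private owner for this chunk's pins */
	int				nbuffers;
	Buffer		   *buffers;
} GpuChunk;

static void
gpu_chunk_pin_buffers(GpuChunk *chunk, Relation rel,
					  BlockNumber *blknos, int nbuffers)
{
	ResourceOwner	saved = CurrentResourceOwner;
	int				i;

	/* 1. separate ResourceOwner under the current one */
	chunk->resowner = ResourceOwnerCreate(saved, "gpu chunk");

	/* 2. pin the chunk's buffers under the private owner */
	CurrentResourceOwner = chunk->resowner;
	for (i = 0; i < nbuffers; i++)
		chunk->buffers[i] = ReadBuffer(rel, blknos[i]);
	chunk->nbuffers = nbuffers;

	/* 3. switch back to the original owner */
	CurrentResourceOwner = saved;
}

/* 4. ... kick DMA, run the device kernel, wait for completion ... */

static void
gpu_chunk_release_buffers(GpuChunk *chunk)
{
	ResourceOwner	saved = CurrentResourceOwner;
	int				i;

	/* 5. release the pins in reverse order, under the private owner */
	CurrentResourceOwner = chunk->resowner;
	for (i = chunk->nbuffers - 1; i >= 0; i--)
		ReleaseBuffer(chunk->buffers[i]);

	/* 6. back to the original owner, then drop the now-empty one */
	CurrentResourceOwner = saved;
	ResourceOwnerDelete(chunk->resowner);
}
```

Because each chunk releases its own buffers newest-first against its own
small owner->buffers[] array, every ReleaseBuffer() hits the tail of the
array on the first probe, which is what turns the 36-second release into
milliseconds.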
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
> Heikki Linnakangas writes:
> > On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> >> What is the best way to solve the problem?
>
> > How about creating a separate ResourceOwner for these buffer pins, and
> > doing a wholesale ResourceOwnerRelease() on it when you're done?
>
> That's a thought. Another point is that if you can release the buffer pins
> in reverse order of acquisition, the existing ResourceOwner code works just
> fine.
>
Yep, I'm now trying that.

> I have a larger question though: how is it useful to transfer raw contents
> of shared buffers to a GPU in the first place?
>
> Surely you're not going to be putting tasks like tuple visibility
> verification into the GPU. So it seems like this whole thread is based
> on a dubious architectural assumption.
>
In my implementation, tuple visibility is checked on the CPU side; the
backend also sends the GPU an array of visible item indexes, in addition to
the buffer contents themselves, so the GPU code picks up only visible
tuples using this index. OLAP workloads tend to have all-visible pages, so
the cost of the visibility check was not expensive.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
> On 03/10/2014 05:53, Kouhei Kaigai wrote:
> > Yep, that's my pain. Even though a usual query does not keep many
> > buffers pinned, my use case needs to fetch megabytes of data at once
> > for performance reasons; a page-by-page synchronous scan leaves the
> > GPU idle.
>
> Doesn't your GPU have an async queue and exec mechanism? Then you could
> do an async DMA to the GPU with an event, use that event in the GPU to
> start the kernel and in the DB to release the pin?
>
That is exactly what I'm doing now. The problem is that it needs to keep
multiple buffers (likely more than a few dozen thousand) in the queue so
as not to make the GPU wait for the data to be calculated. Thus, I want
to pin many more buffers than usual query workloads do.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
Heikki Linnakangas writes:
> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
>> What is the best way to solve the problem?

> How about creating a separate ResourceOwner for these buffer pins, and
> doing a wholesale ResourceOwnerRelease() on it when you're done?

That's a thought. Another point is that if you can release the buffer pins
in reverse order of acquisition, the existing ResourceOwner code works just
fine.

I have a larger question though: how is it useful to transfer raw contents
of shared buffers to a GPU in the first place?

Surely you're not going to be putting tasks like tuple visibility
verification into the GPU. So it seems like this whole thread is based
on a dubious architectural assumption.

			regards, tom lane
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
On 03/10/2014 05:53, Kouhei Kaigai wrote:
> Yep, that's my pain. Even though a usual query does not keep many buffers
> pinned, my use case needs to fetch megabytes of data at once for
> performance reasons; a page-by-page synchronous scan leaves the GPU idle.

Doesn't your GPU have an async queue and exec mechanism? Then you could do
an async DMA to the GPU with an event, use that event in the GPU to start
the kernel and in the DB to release the pin?
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
> On 2014-10-03 10:35:42 +0300, Heikki Linnakangas wrote:
> > On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> > > Hello,
> > >
> > > I recently ran into a problem in development of my extension, which
> > > utilizes the shared buffers, when it released each buffer page.
> > >
> > > This extension transfers contents of the shared buffers to a GPU
> > > device using the DMA feature, then kicks a device kernel code.
> >
> > Wow, that sounds crazy.
>
> Agreed. I doubt that pinning that many buffers is a sane thing to do. At
> the very least you'll heavily interfere with vacuum and such.
>
My assumption is that this extension is used to handle OLAP-type
workloads, with relatively little write traffic to the database. Sorry, I
forgot to mention that.

> > > Once a backend/extension calls ReadBuffer(), resowner.c tracks which
> > > buffer was referenced by the current resource owner, to ensure these
> > > buffers are released at the end of the transaction.
> > > However, it seems to me the implementation of resowner.c didn't assume
> > > many buffers are referenced by a particular resource owner
> > > simultaneously.
> > > It manages the buffer index using an expandable array, then looks up
> > > the target buffer by a sequential walk, starting from the tail because
> > > a recently pinned buffer tends to be released first.
> > > This caused trouble in my case. My extension pinned multiple thousands
> > > of buffers, so owner->buffers[] was enlarged and became expensive to
> > > walk.
> > > In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
> > > total while hash-joining 2M rows, even though the hash join itself
> > > takes less than 16 seconds.
> > >
> > > What is the best way to solve the problem?
> >
> > How about creating a separate ResourceOwner for these buffer pins, and
> > doing a wholesale ResourceOwnerRelease() on it when you're done?
>
> Or even just unpinning them in reverse order? That should already fix the
> performance issues?
>
In case multiple chunks (note: a chunk contains thousands of buffers as a
unit of device kernel execution) are running asynchronously, the order of
GPU job completion is not predictable. So it does not help my situation if
one resource owner tracks all the buffers.

Probably, Heikki suggested creating a separate resource owner per chunk.
In this case, all the buffers in a particular chunk are released at the
same time, so ReleaseBuffer() in reverse order makes sense.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> > Hello,
> >
> > I recently ran into a problem in development of my extension, which
> > utilizes the shared buffers, when it released each buffer page.
> >
> > This extension transfers contents of the shared buffers to a GPU
> > device using the DMA feature, then kicks a device kernel code.
>
> Wow, that sounds crazy.
>
> > Once a backend/extension calls ReadBuffer(), resowner.c tracks which
> > buffer was referenced by the current resource owner, to ensure these
> > buffers are released at the end of the transaction.
> > However, it seems to me the implementation of resowner.c didn't assume
> > many buffers are referenced by a particular resource owner
> > simultaneously.
> > It manages the buffer index using an expandable array, then looks up
> > the target buffer by a sequential walk, starting from the tail because
> > a recently pinned buffer tends to be released first.
> > This caused trouble in my case. My extension pinned multiple thousands
> > of buffers, so owner->buffers[] was enlarged and became expensive to
> > walk.
> > In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
> > total while hash-joining 2M rows, even though the hash join itself
> > takes less than 16 seconds.
> >
> > What is the best way to solve the problem?
>
> How about creating a separate ResourceOwner for these buffer pins, and
> doing a wholesale ResourceOwnerRelease() on it when you're done?
>
Let me clarify your idea.

1. Create a separate ResourceOwner under the CurrentResourceOwner.
2. Switch CurrentResourceOwner to the self-constructed resource owner
   during ReadBuffer() on thousands of buffers.
3. Switch back to the original CurrentResourceOwner.
4. Kick DMA and run the GPU device kernel (actual job running...)
5. Switch CurrentResourceOwner to the self-constructed resource owner
   again, during ReleaseBuffer() in reverse order.
6. Switch back to the original CurrentResourceOwner, then call
   ResourceOwnerDelete().

Hmm... at this moment, I cannot find anything that would be hard to
implement. I'd like to try this idea, then report back later.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
On 2014-10-03 10:35:42 +0300, Heikki Linnakangas wrote:
> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> > Hello,
> >
> > I recently got a trouble on development of my extension that utilizes
> > the shared buffer when it released each buffer page.
> >
> > This extension transfers contents of the shared buffers to GPU device
> > using DMA feature, then kicks a device kernel code.
>
> Wow, that sounds crazy.

Agreed. I doubt that pinning that many buffers is a sane thing to do. At
the very least you'll heavily interfere with vacuum and such.

> > Once backend/extension calls ReadBuffer(), resowner.c tracks which
> > buffer was referenced by the current resource owner, to ensure these
> > buffers being released at end of the transaction.
> > However, it seems to me implementation of resowner.c didn't assume
> > many buffers are referenced by a particular resource owner
> > simultaneously.
> > It manages the buffer index using an expandable array, then looks up
> > the target buffer by sequential walk but from the tail because recently
> > pinned buffer tends to be released first.
> > It made a trouble in my case. My extension pinned multiple thousands
> > buffers, so owner->buffers[] were enlarged and takes expensive cost
> > to walk on.
> > In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
> > total during hash-joining 2M rows; even though hash-joining itself
> > takes less than 16 seconds.
> >
> > What is the best way to solve the problem?
>
> How about creating a separate ResourceOwner for these buffer pins, and
> doing a wholesale ResourceOwnerRelease() on it when you're done?

Or even just unpinning them in reverse order? That should already fix the
performance issues?

Greetings,

Andres Freund

--
Andres Freund	                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> Hello,
>
> I recently got a trouble on development of my extension that utilizes
> the shared buffer when it released each buffer page.
>
> This extension transfers contents of the shared buffers to GPU device
> using DMA feature, then kicks a device kernel code.

Wow, that sounds crazy.

> Once backend/extension calls ReadBuffer(), resowner.c tracks which
> buffer was referenced by the current resource owner, to ensure these
> buffers being released at end of the transaction.
> However, it seems to me implementation of resowner.c didn't assume
> many buffers are referenced by a particular resource owner simultaneously.
> It manages the buffer index using an expandable array, then looks up
> the target buffer by sequential walk but from the tail because recently
> pinned buffer tends to be released first.
> It made a trouble in my case. My extension pinned multiple thousands
> buffers, so owner->buffers[] were enlarged and takes expensive cost
> to walk on.
> In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
> total during hash-joining 2M rows; even though hash-joining itself
> takes less than 16 seconds.
>
> What is the best way to solve the problem?

How about creating a separate ResourceOwner for these buffer pins, and
doing a wholesale ResourceOwnerRelease() on it when you're done?

- Heikki
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
> Kouhei Kaigai writes:
> > Idea-1) Make ResourceOwnerForgetBuffer() O(1), instead of O(N^2).
> > The source of the problem is the data structure in ResourceOwnerData,
> > so a straightforward way is to apply O(1) logic based on hashing,
> > instead of the linear search.
>
> I will bet that this is a dead loss for all normal usage patterns, because
> queries seldom have more than a few buffers pinned. More than that: I do
> not think we should encourage coding patterns that pin lots of buffers.
> There is no way that holding pins on thousands of buffers to do one
> operation is a sane design.
>
Yep, that's my pain. Even though a usual query does not keep many buffers
pinned, my use case needs to fetch megabytes of data at once for
performance reasons; a page-by-page synchronous scan leaves the GPU idle.
It is also the reason why I cared about whether the O(1) logic is harmless
to the existing (major portion of) workloads.

BTW, what is the reason why we assume a small number of pages are pinned
at the same time? Does it come from a hardware assumption of much less RAM
capacity than the data set to be processed? Once we pay attention to
hardware with more than several dozen GB of RAM, which can keep the data
set in memory, it may lead to different conclusions. At least, it seems to
me valuable to support both styles of buffer usage, as long as it is
harmless to the existing workloads. I'd like to dig into the deeper reason
for this.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei
Re: [HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
Kouhei Kaigai writes:
> Idea-1) Put ResourceOwnerForgetBuffer() O(1) logic, instead of O(N^2).
> The source of problem come from data structure in ResourceOwnerData,
> so a straightforward way is to apply O(1) logic based on hashing,
> instead of the linear search.

I will bet that this is a dead loss for all normal usage patterns, because
queries seldom have more than a few buffers pinned. More than that: I do
not think we should encourage coding patterns that pin lots of buffers.
There is no way that holding pins on thousands of buffers to do one
operation is a sane design.

			regards, tom lane
[HACKERS] How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale
Hello,

I recently ran into a problem in development of my extension, which
utilizes the shared buffers, when it released each buffer page.

This extension transfers contents of the shared buffers to a GPU device
using the DMA feature, then kicks a device kernel code. Usually 8KB
(= BLCKSZ) is too small as a unit size for calculation, so this extension
pins multiple pages prior to the DMA transfer, then releases them after
the device kernel execution. For performance reasons, 16MB-64MB is a
preferable data size per device kernel execution. A DMA transfer of
16MB-64MB needs 2048-8192 pages pinned simultaneously.

Once a backend/extension calls ReadBuffer(), resowner.c tracks which
buffer was referenced by the current resource owner, to ensure these
buffers are released at the end of the transaction.
However, it seems to me the implementation of resowner.c didn't assume
many buffers are referenced by a particular resource owner simultaneously.
It manages the buffer index using an expandable array, then looks up the
target buffer by a sequential walk, starting from the tail because a
recently pinned buffer tends to be released first.
This caused trouble in my case. My extension pinned multiple thousands of
buffers, so owner->buffers[] was enlarged and became expensive to walk.
In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in total
while hash-joining 2M rows, even though the hash join itself takes less
than 16 seconds.

What is the best way to solve the problem?

Idea-1) Make ResourceOwnerForgetBuffer() O(1), instead of O(N^2).
The source of the problem is the data structure in ResourceOwnerData, so a
straightforward way is to apply O(1) logic based on hashing, instead of
the linear search. An issue is how beneficial or harmful it is to the core
code, not only my extension. It is probably "potentially" beneficial to
the core backend as well; however, its effect is not easy to observe right
now, because usual workloads take a small enough number of buffers at the
same time.
The attached patch applies O(1) logic to ResourceOwnerForgetBuffer(). It
reduces the time consumption from 36sec to 0.09sec during a 20M-row join
based on hash logic.

Idea-2) Track shared buffers being referenced by the extension itself.
One other, but not preferable, option is to call
ResourceOwnerForgetBuffer() just after ReadBuffer() on the extension side.
Once the resource owner forgets a buffer, the extension becomes
responsible for releasing it at the end of the transaction, even on abort.
It also makes ReleaseBuffer() unavailable to us, so the extension has to
carry a duplicate of ReleaseBuffer() without the
ResourceOwnerForgetBuffer() call. This idea has few advantages over
idea-1; its only advantage is avoiding changes to core PostgreSQL.

Any comments?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei

pgsql-v9.5-resowner-forget-buffer-o1.v1.patch (attachment)