Re: Multiple hosts in connection string failed to failover in non-hot standby mode
Hi Tom, I agree with reporting a detailed error message for each failed host, as your patch 0001 does. As for patch 0004, finding ':' after "could not connect to" may fail for an error message like "could not connect to host "localhost" (::1), port 12345: Connection refused", where p2 will point at "::1" instead of ": Connection refused". But since it's only used for the test cases, we don't need to filter the error message precisely.

```
ecpg_filter_stderr(const char *resultfile, const char *tmpfile)
{
	...
	char *p1 = strstr(linebuf.data, "could not connect to ");

	if (p1)
	{
		char *p2 = strchr(p1, ':');

		if (p2)
			memmove(p1 + 17, p2, strlen(p2) + 1);
	}
}
```

Thanks, Hubert

From: Tom Lane Sent: Monday, January 11, 2021 10:56 AM To: Hubert Zhang Cc: tsunakawa.ta...@fujitsu.com ; pgsql-hack...@postgresql.org Subject: Re: Multiple hosts in connection string failed to failover in non-hot standby mode

I wrote:
> I feel pretty good about 0001: it might be committable as-is. 0002 is
> probably subject to bikeshedding, plus it has a problem in the ECPG tests.
> Two of the error messages are now unstable because they expose
> chosen-at-random socket paths:
> ...
> I don't have any non-hacky ideas what to do about that. The extra detail
> seems useful to end users, but we don't have any infrastructure that
> would allow filtering it out in the ECPG tests.

So far the only solution that comes to mind is to introduce some infrastructure to do that filtering. 0001-0003 below are unchanged, 0004 patches up the ecpg test framework with a rather ad-hoc filtering function. I'd feel worse about this if there weren't already a very ad-hoc filtering function there ;-) This set passes check-world for me; we'll soon see what the cfbot thinks.

regards, tom lane
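The IPv6 pitfall described at the top of this message can be demonstrated in isolation. This is a minimal standalone sketch (the message text is just an example, not output from a live server): strchr() finds the first ':' after "could not connect to ", which lands inside the "(::1)" address, while strrchr() finds the ": <reason>" suffix the filter actually wants.

```c
#include <string.h>

/* Find the first ':' in a connection error line -- this is what the
 * ecpg filter does, and it can land inside an IPv6 address like (::1). */
static const char *
first_colon(const char *msg)
{
	return strchr(msg, ':');
}

/* Find the last ':' instead -- this one precedes the reason text. */
static const char *
last_colon(const char *msg)
{
	return strrchr(msg, ':');
}
```

As the thread notes, precision doesn't matter much for the test harness; this only illustrates why the naive search misfires on IPv6 addresses.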
Recv-Q buffer is filled up because bgwriter continues sending statistics to the un-launched stats collector
Hi hackers, The bgwriter process calls `pgstat_send_bgwriter` to send statistics to the stats collector, but the stats collector is not started when a standby is not in hot standby mode:

```
if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
	PgStatPID = pgstat_start();
```

This leads to the kernel Recv-Q buffer filling up, which can be observed with `netstat -du`. I think we should stop sending statistics in `pgstat_send_bgwriter` when the stats collector is not running. So here are three questions about how to fix it.

1. Is there any way to effectively check the status of the stats collector from the bgwriter process?
2. Is there any way to check that the bgwriter is running on a standby that is not in hot standby mode?
3. Are there any other processes besides the bgwriter that send statistics on a standby? Should we fix them one by one, or add a check in `pgstat_send` directly?

Thanks, Hubert Zhang
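Option 3 above could be sketched roughly as follows. This is a hypothetical stand-in, not PostgreSQL code: `collector_running` represents whatever real check ends up being available to the sender, and the point is only that messages get dropped instead of piling up in the kernel's socket buffer.

```c
#include <stdbool.h>

/* Stand-in for the real "is the stats collector alive?" check. */
static bool collector_running = false;

/* Counter standing in for the sendto() call in the real pgstat_send(). */
static int msgs_sent = 0;

/* Hypothetical sketch: drop statistics messages up front when the
 * collector is known not to be running, so nothing accumulates in the
 * receive queue of the stats UDP socket. */
static void
pgstat_send_sketch(const void *msg, int len)
{
	(void) msg;
	(void) len;

	if (!collector_running)
		return;					/* silently drop the message */

	msgs_sent++;				/* real code would sendto() here */
}
```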
Re: Multiple hosts in connection string failed to failover in non-hot standby mode
Was the primary running and accepting connections when you encountered this error? That is, if you specified host="host1 host2", host1 was the non-hot standby and host2 was a running primary? Or was only the non-hot standby running? If a primary was running, I'd say it's a bug... Perhaps the following part in libpq gives up connection attempts when the above FATAL error is returned from the server. Maybe libpq should differentiate errors using SQLSTATE and continue connection attempts on other hosts.

Yes, the primary was running, but the non-hot standby comes before the primary in the connection string. Hao Wu and I wrote a patch to fix this problem. Client-side libpq should try the other hosts in the connection string when it is rejected by a non-hot standby, or when the first host encounters network problems during the libpq handshake.

Please send emails in text format. Your email was in HTML, and I changed this reply to text format.

Thanks. Is this email in text format now? I just use Outlook in Chrome. Let me know if it is still in HTML format.

Hubert & Hao Wu

From: tsunakawa.ta...@fujitsu.com Sent: Tuesday, October 27, 2020 5:30 PM To: Hubert Zhang Cc: pgsql-hack...@postgresql.org Subject: RE: Multiple hosts in connection string failed to failover in non-hot standby mode

Please send emails in text format. Your email was in HTML, and I changed this reply to text format.

From: Hubert Zhang
> Libpq has supported to specify multiple hosts in connection string and enable
> auto failover when the previous PostgreSQL instance cannot be accessed.
> But when I tried to enable this feature for a non-hot standby, it cannot do
> the failover with the following messages.
>
> psql: error: could not connect to server: FATAL: the database system is
> starting up

Was the primary running and accepting connections when you encountered this error? That is, if you specified host="host1 host2", host1 was the non-hot standby and host2 was a running primary? Or was only the non-hot standby running?
If a primary was running, I'd say it's a bug... Perhaps the following part in libpq gives up connection attempts when the above FATAL error is returned from the server. Maybe libpq should differentiate errors using SQLSTATE and continue connection attempts on other hosts.

[fe-connect.c]
```
/* Handle errors. */
if (beresp == 'E')
{
	if (PG_PROTOCOL_MAJOR(conn->pversion) >= 3)
	...
#endif
	goto error_return;
}

/* It is an authentication request. */
conn->auth_req_received = true;

/* Get the type of request. */
```

Regards Takayuki Tsunakawa

0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patch Description: 0001-Enhance-libpq-to-support-multiple-host-for-non-hot-s.patch
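The SQLSTATE idea suggested above could be sketched like this. This is a hedged illustration, not patch code: a real change would extract the SQLSTATE field from the ErrorResponse message inside fe-connect.c, whereas here the classification is isolated into a helper. "57P03" is ERRCODE_CANNOT_CONNECT_NOW, which the server raises for "the database system is starting up".

```c
#include <stdbool.h>
#include <string.h>

/* Decide whether a server-reported error should cause libpq to move on
 * to the next host in the connection string rather than give up.
 * 57P03 (cannot_connect_now) covers "the database system is starting up". */
static bool
should_try_next_host(const char *sqlstate)
{
	return sqlstate != NULL && strcmp(sqlstate, "57P03") == 0;
}
```

Under this scheme an authentication failure (class 28) would still abort the whole attempt, matching the documented behavior quoted later in the thread.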
Multiple hosts in connection string failed to failover in non-hot standby mode
Hi hackers, Libpq supports specifying multiple hosts in the connection string and enables automatic failover when the previous PostgreSQL instance cannot be accessed. But when I tried to use this feature with a non-hot standby, it could not do the failover, giving the following message:

psql: error: could not connect to server: FATAL: the database system is starting up

The documentation says 'If a connection is established successfully, but authentication fails, the remaining hosts in the list are not tried.' I'm wondering whether this is by design or a bug. If it's a bug, I plan to fix it.

Thanks, Hubert Zhang
Re: Print physical file path when checksum check fails
I have updated the patch based on the previous comments. Sorry for the late patch. I removed `SetZeroDamagedPageInChecksum` and added a `zeroDamagePage` flag to smgrread to determine whether we should zero the damaged page when an error happens; it depends on the caller. `GetRelationFilePath` is removed as well; we print the segno on the fly.

On Thu, Feb 20, 2020 at 2:33 PM Hubert Zhang wrote:
> Thanks,
>
> On Thu, Feb 20, 2020 at 11:36 AM Andres Freund wrote:
>
>> Hi,
>>
>> On 2020-02-19 16:48:45 +0900, Michael Paquier wrote:
>> > On Wed, Feb 19, 2020 at 03:00:54PM +0900, Kyotaro Horiguchi wrote:
>> > > I have had support requests related to broken block several times, and
>> > > (I think) most of *them* had hard time to locate the broken block or
>> > > even broken file. I don't think it is useless at all, but I'm not sure
>> > > it is worth the additional complexity.
>> >
>> > I handle stuff like that from time to time, and any reports usually
>> > go down to people knowledgeable about PostgreSQL enough to know the
>> > difference. My point is that in order to know where a broken block is
>> > physically located on disk, you need to know four things:
>> > - The block number.
>> > - The physical location of the relation.
>> > - The size of the block.
>> > - The length of a file segment.
>> > The first two items are printed in the error message, and you can
>> > guess easily the actual location (file, offset) with the two others.
>>
>> > I am not necessarily against improving the error message here, but
>> > FWIW I think that we need to consider seriously if the code
>> > complications and the maintenance cost involved are really worth
>> > saving from one simple calculation.
>>
>> I don't think it's that simple for most.
>>
>> And if we e.g. ever get the undo stuff merged, it'd get more
>> complicated, because they segment entirely differently. Similar, if we
>> ever manage to move SLRUs into the buffer pool and checksummed, it'd
>> again work differently.
>>
>> Nor is it architecturally appealing to handle checksums in multiple
>> places above the smgr layer: For one, that requires multiple places to
>> compute verify them. But also, as the way checksums are computed depends
>> on the page format etc, it'll likely change for things like undo/slru -
>> which then again will require additional smarts if done above the smgr
>> layer.
>
> So considering the undo stuff, it's better to move the checksum logic into md.c.
> I will keep it there in the new patch.
>
> On 2020-02-19 16:48:45 +0900, Michael Paquier wrote:
>> > Particularly, quickly reading through the patch, I am rather unhappy
>> > about the shape of the second patch which pushes down the segment
>> > number knowledge into relpath.c, and creates more complication around
>> > the handling of zero_damaged_pages and zero'ed pages. -- Michael
>>
>> I do not like the SetZeroDamagedPageInChecksum stuff at all however.
>
> I'm +1 on the first concern, and I will delete the newly added function
> `GetRelationFilePath` in relpath.c and append the segno directly to the
> error message inside the function `VerifyPage`.
>
> As for SetZeroDamagedPageInChecksum, the reason why I introduced it is
> that when we are doing smgrread() and one of the damaged pages fails to
> pass the checksum check, we cannot directly error out, since the caller
> of smgrread() may tolerate this error and just zero the damaged page
> plus emit a warning message.
> Also, we cannot just use the GUC zero_damaged_pages for this branch,
> since we also have a ReadBufferMode (i.e. RBM_ZERO_ON_ERROR) to control it.
>
> To get rid of SetZeroDamagedPageInChecksum, one idea is to pass a
> zero_damaged_page flag into smgrread(), something like below:
> ==
> extern void smgrread(SMgrRelation reln, ForkNumber forknum,
>                      BlockNumber blocknum, char *buffer, int flag);
> ===
>
> Any comments?
> --
> Thanks
> Hubert Zhang

--
Thanks
Hubert Zhang

0003-Print-physical-file-path-when-verify-checksum-failed.patch Description: Binary data
Re: Yet another vectorized engine
Hi Konstantin, I also ran vimdiff on nodeAgg.c in your PG13 branch against nodeAgg.c in PG's main repo. Many functions have changed from PG96 to PG13, e.g. 'advance_aggregates' and 'lookup_hash_entry'. The vectorized nodeAgg still seems to follow the PG96 way of implementing these functions. In general, I think we'd better port the PG13 executor to a vectorized PG13 executor instead of merging some PG13 code into the PG96 vectorized executor to make it work, because it's hard to determine which functions need to be merged, and it's buggy if executor code from both PG13 and PG96 exists in one branch. What's your opinion?
Re: Yet another vectorized engine
On Wed, Feb 26, 2020 at 7:59 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote:
>
> On 26.02.2020 13:11, Hubert Zhang wrote:
>
>> and with JIT:
>>
>> 13.88% postgres postgres [.] tts_buffer_heap_getsomeattrs
>> 7.15% postgres vectorize_engine.so [.] vfloat8_accum
>> 6.03% postgres postgres [.] HeapTupleSatisfiesVisibility
>> 5.55% postgres postgres [.] bpchareq
>> 4.42% postgres vectorize_engine.so [.] VExecStoreColumns
>> 4.19% postgres postgres [.] hashbpchar
>> 4.09% postgres vectorize_engine.so [.] vfloat8pl
>
> I also tested Q1 with your latest code. The vectorized result is still slow.
> PG13 native: 38 secs
> PG13 Vec: 30 secs
> PG13 JIT: 23 secs
> PG13 JIT+Vec: 27 secs
>
> It is strange that your results are much slower than mine and the profile is
> very different.
> Which postgres configuration are you using?

./configure CFLAGS="-O3 -g -march=native" --prefix=/usr/local/pgsql/ --disable-cassert --enable-debug --with-llvm

I also use `PGXS := $(shell $(PG_CONFIG) --pgxs)` to compile vectorize_engine, so it shares the same compile configuration. My perf result is below. There are three parts:

> 1. lookup_hash_entry(43.5%) this part is not vectorized yet.
>
> It is vectorized in some sense: lookup_hash_entry performs the bulk of
> hash lookups and passes an array with the results of such lookups to
> aggregate transition functions.
> It will be possible to significantly increase the speed of HashAgg if we
> store data in order of the grouping attributes and use RLE (run length
> encoding) to perform just one hash lookup for a group of values. But it
> requires creation of special partitions (like it is done in Vertica and
> VOPS).

Yes, Vertica's partitions need to be pre-sorted on user-defined columns, so for TPCH Q1 on Postgres we cannot make that assumption. And my Q1 plan uses HashAgg instead of GroupAgg based on cost.

> 2. scan part: fetch_input_tuple(36%)
> 3. 
vadvance_aggregates part(20%) > I also perfed on PG96 vectorized version and got similar perf results and > running time of vectorized PG96 and PG13 are also similar. But PG13 is much > faster than PG96. So I just wonder whether we merge all the latest executor > code of PG13 into the vectorized PG13 branch? > > > Sorry, I do not understand the question. vectorize_executor contains > patched versions of nodeSeqscan and nodeAgg from standard executor. > When performing porting to PG13, I took the latest version of nodeAgg and > tried to apply your patches to it. Certainly not always it was possible and > I have to rewrite a lt of places. Concerning nodeSeqscan - I took old > version from vectorize_executor and port it to PG13. > > It is strange that I am not seeing lookup_hash_entry in profile in my > case. > > So you already have the PG13 nodeAgg, that is good. Yes, it is strange. Hash table probing is always the costly part. My perf command `perf record --call-graph dwarf -p pid` Could you share your lineitem schema and Q1 query? 
My schema and Q1 query are:

```
CREATE TABLE lineitem (
    l_orderkey      BIGINT NOT NULL,
    l_partkey       INTEGER NOT NULL,
    l_suppkey       INTEGER NOT NULL,
    l_linenumber    INTEGER NOT NULL,
    l_quantity      double precision NOT NULL,
    l_extendedprice double precision NOT NULL,
    l_discount      double precision NOT NULL,
    l_tax           double precision NOT NULL,
    l_returnflag    CHAR(1) NOT NULL,
    l_linestatus    CHAR(1) NOT NULL,
    l_shipdate      DATE NOT NULL,
    l_commitdate    DATE NOT NULL,
    l_receiptdate   DATE NOT NULL,
    l_shipinstruct  CHAR(25) NOT NULL,
    l_shipmode      CHAR(10) NOT NULL,
    l_comment       VARCHAR(44) NOT NULL
);

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(l_discount) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '106 day'
group by l_returnflag, l_linestatus;
```

-- Thanks Hubert Zhang
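The RLE idea Konstantin mentions upthread (one hash lookup per run of equal grouping keys instead of one per tuple) can be sketched in isolation. The types and names below are simplified stand-ins, not VOPS or vectorize_engine code; it assumes input already ordered on the grouping key, as stated in his message.

```c
#include <stddef.h>

/* A run of consecutive equal grouping keys: one hash lookup serves
 * all `len` tuples in the run. */
typedef struct
{
	int		key;
	size_t	len;
} Run;

/* Collapse a key-ordered column into runs; returns the number of runs.
 * `out` must have room for up to `n` entries. */
static size_t
rle_runs(const int *keys, size_t n, Run *out)
{
	size_t	nruns = 0;
	size_t	i = 0;

	while (i < n)
	{
		size_t	j = i + 1;

		while (j < n && keys[j] == keys[i])
			j++;
		out[nruns].key = keys[i];
		out[nruns].len = j - i;	/* one lookup covers j - i tuples */
		nruns++;
		i = j;
	}
	return nruns;
}
```

For Q1, where the grouping keys (l_returnflag, l_linestatus) have very few distinct values, such runs would be long, which is why the thread notes this only pays off with specially ordered (partitioned) storage.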
Re: Yet another vectorized engine
Hi Konstantin, On Tue, Feb 25, 2020 at 6:44 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 25.02.2020 11:06, Hubert Zhang wrote: > > Hi Konstantin, > > I checkout your branch pg13 in repo > https://github.com/zhangh43/vectorize_engine > After I fixed some compile error, I tested Q1 on TPCH-10G > The result is different from yours and vectorize version is too slow. Note > that I disable parallel worker by default. > no JIT no Vectorize: 36 secs > with JIT only: 23 secs > with Vectorize only: 33 secs > JIT + Vectorize: 29 secs > > My config option is `CFLAGS='-O3 -g -march=native' > --prefix=/usr/local/pgsql/ --disable-cassert --enable-debug --with-llvm` > I will do some spike on why vectorized is so slow. Could you please > provide your compile option and the TPCH dataset size and your > queries(standard Q1?) to help me to debug on it. > > > > Hi, Hubert > > Sorry, looks like I have used slightly deteriorated snapshot of master so > I have not noticed some problems. > Fixes are committed. > > Most of the time is spent in unpacking heap tuple > (tts_buffer_heap_getsomeattrs): > > 24.66% postgres postgres [.] tts_buffer_heap_getsomeattrs >8.28% postgres vectorize_engine.so [.] VExecStoreColumns >5.94% postgres postgres [.] HeapTupleSatisfiesVisibility >4.21% postgres postgres [.] bpchareq >4.12% postgres vectorize_engine.so [.] vfloat8_accum > > > In my version of nodeSeqscan I do not keep all fetched 1024 heap tuples > but stored there attribute values in vector columns immediately. > But to avoid extraction of useless data it is necessary to know list of > used columns. > The same problem is solved in zedstore, but unfortunately there is no > existed method in Postgres to get list > of used attributes. I have done it but my last implementation contains > error which cause loading of all columns. > Fixed version is committed. > > Now profile without JIT is: > > 15.52% postgres postgres [.] 
tts_buffer_heap_getsomeattrs
> 10.25% postgres postgres [.] ExecInterpExpr
> 6.54% postgres postgres [.] HeapTupleSatisfiesVisibility
> 5.12% postgres vectorize_engine.so [.] VExecStoreColumns
> 4.86% postgres postgres [.] bpchareq
> 4.80% postgres vectorize_engine.so [.] vfloat8_accum
> 3.78% postgres postgres [.] tts_minimal_getsomeattrs
> 3.66% postgres vectorize_engine.so [.] VExecAgg
> 3.38% postgres postgres [.] hashbpchar
>
> and with JIT:
>
> 13.88% postgres postgres [.] tts_buffer_heap_getsomeattrs
> 7.15% postgres vectorize_engine.so [.] vfloat8_accum
> 6.03% postgres postgres [.] HeapTupleSatisfiesVisibility
> 5.55% postgres postgres [.] bpchareq
> 4.42% postgres vectorize_engine.so [.] VExecStoreColumns
> 4.19% postgres postgres [.] hashbpchar
> 4.09% postgres vectorize_engine.so [.] vfloat8pl

I also tested Q1 with your latest code. The vectorized result is still slow.

PG13 native: 38 secs
PG13 Vec: 30 secs
PG13 JIT: 23 secs
PG13 JIT+Vec: 27 secs

My perf result is below. There are three parts:
1. lookup_hash_entry (43.5%); this part is not vectorized yet.
2. scan part: fetch_input_tuple (36%)
3. vadvance_aggregates part (20%)

I also perfed the PG96 vectorized version and got similar perf results, and the running times of vectorized PG96 and PG13 are also similar. But PG13 is much faster than PG96. So I wonder whether we should merge all the latest executor code of PG13 into the vectorized PG13 branch? 
```
- agg_fill_hash_table
  - 43.50% lookup_hash_entry (inlined)
    + 39.07% LookupTupleHashEntry
      0.56% ExecClearTuple (inlined)
  - 36.06% fetch_input_tuple
    - ExecProcNode (inlined)
      - 36.03% VExecScan
        - 34.60% ExecScanFetch (inlined)
          - ExecScanFetch (inlined)
            - VSeqNext
              + 16.64% table_scan_getnextslot (inlined)
              - 10.29% slot_getsomeattrs (inlined)
                - 10.17% slot_getsomeattrs_int
                  + tts_buffer_heap_getsomeattrs
              7.14% VExecStoreColumns
        + 1.38% ExecQual (inlined)
  - 20.30% Vadvance_aggregates (inlined)
    - 17.46% Vadvance_transition_function (inlined)
      + 11.95% vfloat8_accum
      + 4.74% vfloat8pl
        0.75% vint8inc_any
    + 2.77% ExecProject (inlined)
```

-- Thanks Hubert Zhang
Re: Yet another vectorized engine
Hi Konstantin, I checked out your branch pg13 in the repo https://github.com/zhangh43/vectorize_engine. After I fixed some compile errors, I tested Q1 on TPCH-10G. The result is different from yours, and the vectorized version is too slow. Note that I disabled parallel workers by default.

no JIT no Vectorize: 36 secs
with JIT only: 23 secs
with Vectorize only: 33 secs
JIT + Vectorize: 29 secs

My config options are `CFLAGS='-O3 -g -march=native' --prefix=/usr/local/pgsql/ --disable-cassert --enable-debug --with-llvm`. I will investigate why the vectorized version is so slow. Could you please provide your compile options, the TPCH dataset size, and your queries (standard Q1?) to help me debug it.

On Mon, Feb 24, 2020 at 8:43 PM Hubert Zhang wrote: > Hi Konstantin, > I have added you as a collaborator on github. Please accepted and try > again. > I think non collaborator could also open pull requests. > > On Mon, Feb 24, 2020 at 8:02 PM Konstantin Knizhnik < > k.knizh...@postgrespro.ru> wrote: > >> >> >> On 24.02.2020 05:08, Hubert Zhang wrote: >> >> Hi >> >> On Sat, Feb 22, 2020 at 12:58 AM Konstantin Knizhnik < >> k.knizh...@postgrespro.ru> wrote: >> >>> >>> >>> On 12.02.2020 13:12, Hubert Zhang wrote: >>> >>> On Tue, Feb 11, 2020 at 1:20 AM Konstantin Knizhnik < >>> k.knizh...@postgrespro.ru> wrote: >>> >>>> >>>> So looks like PG-13 provides significant advantages in OLAP queries >>>> comparing with 9.6! >>>> Definitely it doesn't mean that vectorized executor is not needed for >>>> new version of Postgres. >>>> Once been ported, I expect that it should provide comparable >>>> improvement of performance. >>>> >>>> But in any case I think that vectorized executor makes sense only been >>>> combine with columnar store. >>>> >>> >>> Thanks for the test. +1 on vectorize should be combine with columnar >>> store. I think when we support this extension >>> on master, we could try the new zedstore. >>> I'm not active on this work now, but will continue when I have time. 
>>> Feel free to join bring vops's feature into this extension. >>> >>> Thanks >>> >>> Hubert Zhang >>> >>> >>> I have ported vectorize_engine to the master. >>> It takes longer than I expected: a lot of things were changed in >>> executor. >>> >>> Results are the following: >>> >>> >>> par.warkers >>> PG9_6 >>> vectorize=off >>> PG9_6 >>> vectorize=on >>> master >>> vectorize=off >>> jit=on >>> master >>> vectorize=off >>> jit=off master >>> vectorize=on >>> jit=ofn master >>> vectorize=on >>> jit=off >>> 0 >>> 36 >>> 20 >>> 16 >>> 25.5 >>> 15 >>> 17.5 >>> 4 >>> 10 >>> - >>> 5 7 >>> - >>> - >>> >>> So it proves the theory that JIT provides almost the same speedup as >>> vector executor (both eliminates interpretation overhead but in different >>> way). >>> I still not sure that we need vectorized executor: because with standard >>> heap it provides almost no improvements comparing with current JIT version. >>> But in any case I am going to test it with vertical storage (zedstore or >>> cstore). >>> >>> >> Thanks for the porting and testing. >> Yes, PG master and 9.6 have many changes, not only executor, but also >> tupletableslot interface. >> >> What matters the performance of JIT and Vectorization is its >> implementation. This is just the beginning of vectorization work, just as >> your vops extension reported, vectorization could run 10 times faster in >> PG. With the overhead of row storage(heap), we may not reach that speedup, >> but I think we could do better. Also +1 on vertical storage. >> >> BTW, welcome to submit your PR for the PG master version. >> >> >> >> Sorry, but I have no permissions to push changes to your repository. >> I can certainly create my own fork of vectorize_engine, but I think it >> will be beter if I push pg13 branch in your repository. >> >> >> > > -- > Thanks > > Hubert Zhang > -- Thanks Hubert Zhang
Re: Yet another vectorized engine
Hi Konstantin, I have added you as a collaborator on github. Please accept and try again. I think non-collaborators can also open pull requests.

On Mon, Feb 24, 2020 at 8:02 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 24.02.2020 05:08, Hubert Zhang wrote: > > Hi > > On Sat, Feb 22, 2020 at 12:58 AM Konstantin Knizhnik < > k.knizh...@postgrespro.ru> wrote: > >> >> >> On 12.02.2020 13:12, Hubert Zhang wrote: >> >> On Tue, Feb 11, 2020 at 1:20 AM Konstantin Knizhnik < >> k.knizh...@postgrespro.ru> wrote: >> >>> >>> So looks like PG-13 provides significant advantages in OLAP queries >>> comparing with 9.6! >>> Definitely it doesn't mean that vectorized executor is not needed for >>> new version of Postgres. >>> Once been ported, I expect that it should provide comparable >>> improvement of performance. >>> >>> But in any case I think that vectorized executor makes sense only been >>> combine with columnar store. >>> >> >> Thanks for the test. +1 on vectorize should be combine with columnar >> store. I think when we support this extension >> on master, we could try the new zedstore. >> I'm not active on this work now, but will continue when I have time. Feel >> free to join bring vops's feature into this extension. >> >> Thanks >> >> Hubert Zhang >> >> >> I have ported vectorize_engine to the master. >> It takes longer than I expected: a lot of things were changed in executor. >> >> Results are the following:
>>
>> par.workers | PG9_6 vec=off | PG9_6 vec=on | master vec=off jit=on | master vec=off jit=off | master vec=on jit=on | master vec=on jit=off
>> 0           | 36            | 20           | 16                    | 25.5                   | 15                   | 17.5
>> 4           | 10            | -            | 5                     | 7                      | -                    | -
>>
>> So it proves the theory that JIT provides almost the same speedup as >> vector executor (both eliminates interpretation overhead but in different >> way). 
>> I still not sure that we need vectorized executor: because with standard >> heap it provides almost no improvements comparing with current JIT version. >> But in any case I am going to test it with vertical storage (zedstore or >> cstore). >> >> > Thanks for the porting and testing. > Yes, PG master and 9.6 have many changes, not only executor, but also > tupletableslot interface. > > What matters the performance of JIT and Vectorization is its > implementation. This is just the beginning of vectorization work, just as > your vops extension reported, vectorization could run 10 times faster in > PG. With the overhead of row storage(heap), we may not reach that speedup, > but I think we could do better. Also +1 on vertical storage. > > BTW, welcome to submit your PR for the PG master version. > > > > Sorry, but I have no permissions to push changes to your repository. > I can certainly create my own fork of vectorize_engine, but I think it > will be beter if I push pg13 branch in your repository. > > > -- Thanks Hubert Zhang
Re: Yet another vectorized engine
Hi On Sat, Feb 22, 2020 at 12:58 AM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 12.02.2020 13:12, Hubert Zhang wrote: > > On Tue, Feb 11, 2020 at 1:20 AM Konstantin Knizhnik < > k.knizh...@postgrespro.ru> wrote: > >> >> So looks like PG-13 provides significant advantages in OLAP queries >> comparing with 9.6! >> Definitely it doesn't mean that vectorized executor is not needed for new >> version of Postgres. >> Once been ported, I expect that it should provide comparable improvement >> of performance. >> >> But in any case I think that vectorized executor makes sense only been >> combine with columnar store. >> > > Thanks for the test. +1 on vectorize should be combine with columnar > store. I think when we support this extension > on master, we could try the new zedstore. > I'm not active on this work now, but will continue when I have time. Feel > free to join bring vops's feature into this extension. > > Thanks > > Hubert Zhang > > > I have ported vectorize_engine to the master. > It takes longer than I expected: a lot of things were changed in executor. > > Results are the following:
>
> par.workers | PG9_6 vec=off | PG9_6 vec=on | master vec=off jit=on | master vec=off jit=off | master vec=on jit=on | master vec=on jit=off
> 0           | 36            | 20           | 16                    | 25.5                   | 15                   | 17.5
> 4           | 10            | -            | 5                     | 7                      | -                    | -
>
> So it proves the theory that JIT provides almost the same speedup as > vector executor (both eliminates interpretation overhead but in different > way). > I still not sure that we need vectorized executor: because with standard > heap it provides almost no improvements comparing with current JIT version. > But in any case I am going to test it with vertical storage (zedstore or > cstore). > >

Thanks for the porting and testing. Yes, PG master and 9.6 have many changes, not only in the executor but also in the TupleTableSlot interface. What matters for the performance of JIT and vectorization is the implementation. 
This is just the beginning of the vectorization work; as your vops extension reported, vectorization can run 10 times faster in PG. With the overhead of row storage (heap), we may not reach that speedup, but I think we can do better. Also +1 on vertical storage. BTW, welcome to submit your PR for the PG master version. -- Thanks Hubert Zhang
Re: Print physical file path when checksum check fails
Thanks Kyotaro, On Wed, Feb 19, 2020 at 2:02 PM Kyotaro Horiguchi wrote: > At Wed, 19 Feb 2020 13:28:04 +0900, Michael Paquier > wrote in > > On Wed, Feb 19, 2020 at 01:07:36PM +0900, Kyotaro Horiguchi wrote: > > > If we also verify checksum in md layer, callback is overkill since the > > > immediate caller consumes the event immediately. We can signal the > > > error by somehow returning a file tag. > > > > FWIW, I am wondering if there is any need for a change here and > > complicate more the code. If you know the block number, the page size > > and the segment file size you can immediately guess where is the > > damaged block. The first information is already part of the error > > I have had support requests related to broken block several times, and > (I think) most of *them* had hard time to locate the broken block or > even broken file. I don't think it is useless at all, but I'm not sure > it is worth the additional complexity. > > > damaged block. The first information is already part of the error > > message, and the two other ones are constants defined at > > compile-time. > > May you have misread the snippet? > > What Hubert proposed is: > > "invalid page in block %u of relation file %s; zeroing out page", > blkno, > > The second format in my messages just before is: > "invalid page in block %u in relation %u, file \"%s\"", > blockNum, smgr->smgr_rnode.node.relNode, smgrfname() > > All of them are not compile-time constant at all. >

I like your error message; the block number is relation-level, not file-level. I'll change the error message to "invalid page in block %u of relation %u, file %s". -- Thanks Hubert Zhang
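The revised message could be assembled as below. This is only a formatting sketch: the actual patch would report it through ereport() in the checksum-verification path, and the OID and segment path here are the example values from the original report, not real catalog state.

```c
#include <stdio.h>

/* Build the proposed checksum-failure message: block number is
 * relation-level, plus the relation OID and the physical segment file. */
static int
format_checksum_error(char *buf, size_t buflen,
					  unsigned blkno, unsigned reloid, const char *filepath)
{
	return snprintf(buf, buflen,
					"invalid page in block %u of relation %u, file \"%s\"",
					blkno, reloid, filepath);
}
```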
Re: Print physical file path when checksum check fails
Thanks, On Thu, Feb 20, 2020 at 11:36 AM Andres Freund wrote: > Hi, > > On 2020-02-19 16:48:45 +0900, Michael Paquier wrote: > > On Wed, Feb 19, 2020 at 03:00:54PM +0900, Kyotaro Horiguchi wrote: > > > I have had support requests related to broken block several times, and > > > (I think) most of *them* had hard time to locate the broken block or > > > even broken file. I don't think it is useles at all, but I'm not sure > > > it is worth the additional complexity. > > > > I handle stuff like that from time to time, and any reports usually > > go down to people knowledgeable about PostgreSQL enough to know the > > difference. My point is that in order to know where a broken block is > > physically located on disk, you need to know four things: > > - The block number. > > - The physical location of the relation. > > - The size of the block. > > - The length of a file segment. > > The first two items are printed in the error message, and you can > > guess easily the actual location (file, offset) with the two others. > > > I am not necessarily against improving the error message here, but > > FWIW I think that we need to consider seriously if the code > > complications and the maintenance cost involved are really worth > > saving from one simple calculation. > > I don't think it's that simple for most. > > And if we e.g. ever get the undo stuff merged, it'd get more > complicated, because they segment entirely differently. Similar, if we > ever manage to move SLRUs into the buffer pool and checksummed, it'd > again work differently. > > Nor is it architecturally appealing to handle checksums in multiple > places above the smgr layer: For one, that requires multiple places to > compute verify them. But also, as the way checksums are computed depends > on the page format etc, it'll likely change for things like undo/slru - > which then again will require additional smarts if done above the smgr > layer. 
So considering the undo stuff, it's better to move the checksum logic into md.c. I will keep it that way in the new patch. On 2020-02-19 16:48:45 +0900, Michael Paquier wrote: > Particularly, quickly reading through the patch, I am rather unhappy > > about the shape of the second patch which pushes down the segment > > number knowledge into relpath.c, and creates more complication around > > the handling of zero_damaged_pages and zero'ed pages. -- Michael > > I do not like the SetZeroDamagedPageInChecksum stuff at all however. > > I'm +1 on the first concern, and I will delete the newly added function `GetRelationFilePath` in relpath.c and append the segno directly in the error message inside `VerifyPage`. As for SetZeroDamagedPageInChecksum, the reason why I introduced it is that when we are doing smgrread() and a damaged page fails the checksum check, we cannot directly error out, since the caller of smgrread() may tolerate this error and just zero the damaged page and emit a warning message. Also, we cannot just use the GUC zero_damaged_pages to drive this branch, since we also have ReadBufferMode (i.e. RBM_ZERO_ON_ERROR) to control it. To get rid of SetZeroDamagedPageInChecksum, one idea is to pass a zero_damaged_page flag into smgrread(), something like below:

```
extern void smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, int flag);
```

Any comments? -- Thanks Hubert Zhang
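To make the proposed control flow concrete, here is a minimal standalone sketch (not the actual patch; the function names and the corruption flag are illustrative) of how a bufmgr-level caller could react to a checksum-failure status returned from the md layer instead of having md.c raise an immediate ERROR:

```c
#include <string.h>
#include <stdbool.h>

#define BLCKSZ 8192

typedef enum { RBM_NORMAL, RBM_ZERO_ON_ERROR } ReadBufferMode;

/* Stand-in for the md-layer read: fills the buffer and reports whether
 * checksum verification passed, rather than erroring out itself. */
static bool
md_read_page(char *buffer, bool simulate_corruption)
{
    memset(buffer, simulate_corruption ? 0x7f : 0x00, BLCKSZ);
    return !simulate_corruption;     /* false => checksum failed */
}

/* Stand-in for the bufmgr-level caller: it owns the RBM_ZERO_ON_ERROR
 * decision, so the zero-damaged-page policy stays out of md.c. */
static bool
read_buffer(char *buffer, ReadBufferMode mode, bool simulate_corruption)
{
    if (!md_read_page(buffer, simulate_corruption))
    {
        if (mode != RBM_ZERO_ON_ERROR)
            return false;            /* a real backend would ERROR here */
        memset(buffer, 0, BLCKSZ);   /* tolerate: zero the damaged page */
    }
    return true;
}
```

This keeps the verification result as data flowing up to the caller, which is the shape the flag-passing idea above aims for.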
Re: Print physical file path when checksum check fails
On Wed, Feb 12, 2020 at 5:22 PM Hubert Zhang wrote: > Thanks Andres, > > On Tue, Feb 11, 2020 at 5:30 AM Andres Freund wrote: > >> HHi, >> >> On 2020-02-10 16:04:21 +0800, Hubert Zhang wrote: >> > Currently we only print block number and relation path when checksum >> check >> > fails. See example below: >> > >> > ERROR: invalid page in block 333571 of relation base/65959/656195 >> >> > DBA complains that she needs additional work to calculate which physical >> > file is broken, since one physical file can only contain `RELSEG_SIZE` >> > number of blocks. For large tables, we need to use many physical files >> with >> > additional suffix, e.g. 656195.1, 656195.2 ... >> > >> > Is that a good idea to also print the physical file path in error >> message? >> > Like below: >> > >> > ERROR: invalid page in block 333571 of relation base/65959/656195, file >> > path base/65959/656195.2 >> >> I think that'd be a nice improvement. But: >> >> I don't think the way you did it is right architecturally. The >> segmenting is really something that lives within md.c, and we shouldn't >> further expose it outside of that. And e.g. the undo patchset uses files >> with different segmentation - but still goes through bufmgr.c. >> >> I wonder if this partially signals that the checksum verification piece >> is architecturally done in the wrong place currently? It's imo not good >> that every place doing an smgrread() needs to separately verify >> checksums. OTOH, it doesn't really belong inside smgr.c. >> >> >> This layering issue was also encountered in >> >> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3eb77eba5a51780d5cf52cd66a9844cd4d26feb0 >> so perhaps we should work to reuse the FileTag it introduces to >> represent segments, without hardcoding the specific segment size? >> >> > I checked the FileTag commit. It calls `register_xxx_segment` inside md.c > to store the sync request into a hashtable and used by checkpointer later. > > Checksum verify is simpler. 
We could move the `PageIsVerified` into md.c > (mdread). But we cannot elog an error inside md.c because read buffer mode > RBM_ZERO_ON_ERROR is at the bufmgr.c level. > > One idea is to save the error message (or the FileTag) somewhere (e.g. a > static string in bufmgr.c) by calling `register_checksum_failure` inside > mdread in md.c. > > As for your concern about the need to do checksum verification after every > smgrread, we now move the checksum verify logic into md.c, but we still > need to check the checksum verify result after smgrread and reset the buffer to > zero if the mode is RBM_ZERO_ON_ERROR. > > If this idea is OK, I will submit the new PR. > > Based on Andres's comments, here is the new patch, which moves the checksum verify logic into mdread() instead of calling PageIsVerified in every smgrread(). It also uses FileTag to print the physical file name when checksum verification fails, handling segmenting inside md.c as well. -- Thanks Hubert Zhang 0002-Print-physical-file-path-when-verify-checksum-failed.patch Description: Binary data
Re: Yet another vectorized engine
On Tue, Feb 11, 2020 at 1:20 AM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > So looks like PG-13 provides significant advantages in OLAP queries > comparing with 9.6! > Definitely it doesn't mean that vectorized executor is not needed for new > version of Postgres. > Once been ported, I expect that it should provide comparable improvement > of performance. > > But in any case I think that vectorized executor makes sense only been > combine with columnar store. > Thanks for the test. +1 on combining the vectorized executor with a columnar store. I think when we support this extension on master, we could try the new zedstore. I'm not active on this work now, but will continue when I have time. Feel free to join in bringing VOPS's features into this extension. Thanks Hubert Zhang
Re: Print physical file path when checksum check fails
Thanks Andres, On Tue, Feb 11, 2020 at 5:30 AM Andres Freund wrote: > Hi, > > On 2020-02-10 16:04:21 +0800, Hubert Zhang wrote: > > Currently we only print block number and relation path when checksum > check > > fails. See example below: > > > > ERROR: invalid page in block 333571 of relation base/65959/656195 > > > DBA complains that she needs additional work to calculate which physical > > file is broken, since one physical file can only contain `RELSEG_SIZE` > > number of blocks. For large tables, we need to use many physical files > with > > additional suffix, e.g. 656195.1, 656195.2 ... > > > > Is that a good idea to also print the physical file path in error > message? > > Like below: > > > > ERROR: invalid page in block 333571 of relation base/65959/656195, file > > path base/65959/656195.2 > > I think that'd be a nice improvement. But: > > I don't think the way you did it is right architecturally. The > segmenting is really something that lives within md.c, and we shouldn't > further expose it outside of that. And e.g. the undo patchset uses files > with different segmentation - but still goes through bufmgr.c. > > I wonder if this partially signals that the checksum verification piece > is architecturally done in the wrong place currently? It's imo not good > that every place doing an smgrread() needs to separately verify > checksums. OTOH, it doesn't really belong inside smgr.c. > > > This layering issue was also encountered in > > https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3eb77eba5a51780d5cf52cd66a9844cd4d26feb0 > so perhaps we should work to reuse the FileTag it introduces to > represent segments, without hardcoding the specific segment size? > > I checked the FileTag commit. It calls `register_xxx_segment` inside md.c to store the sync request into a hashtable that is used by the checkpointer later. Checksum verification is simpler. We could move `PageIsVerified` into md.c (mdread).
But we cannot elog an error inside md.c, because the read buffer mode RBM_ZERO_ON_ERROR lives at the bufmgr.c level. One idea is to save the error message (or the FileTag) somewhere (e.g. a static string in bufmgr.c) by calling `register_checksum_failure` inside mdread in md.c. As for your concern about the need to do checksum verification after every smgrread: we now move the checksum verify logic into md.c, but we still need to check the checksum verify result after smgrread and reset the buffer to zero if the mode is RBM_ZERO_ON_ERROR. If this idea is OK, I will submit the new PR. Thanks Hubert Zhang
Print physical file path when checksum check fails
Hi hacker, Currently we only print block number and relation path when checksum check fails. See example below: ERROR: invalid page in block 333571 of relation base/65959/656195 DBA complains that she needs additional work to calculate which physical file is broken, since one physical file can only contain `RELSEG_SIZE` number of blocks. For large tables, we need to use many physical files with additional suffix, e.g. 656195.1, 656195.2 ... Is that a good idea to also print the physical file path in error message? Like below: ERROR: invalid page in block 333571 of relation base/65959/656195, file path base/65959/656195.2 Patch is attached. -- Thanks Hubert Zhang 0001-Print-physical-file-path-when-checksum-check-fails.patch Description: Binary data
Re: Yet another vectorized engine
Thanks Konstantin, Your suggestions are very helpful. I have added them into issues of vectorize_engine repo https://github.com/zhangh43/vectorize_engine/issues On Wed, Dec 4, 2019 at 10:08 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 04.12.2019 12:13, Hubert Zhang wrote: > > 3. Why you have to implement your own plan_tree_mutator and not using > expression_tree_mutator? > > I also want to replace plan node, e.g. Agg->CustomScan(with VectorAgg > implementation). expression_tree_mutator cannot be used to mutate plan node > such as Agg, am I right? > > > O, sorry, I see. > > > >> 4. As far as I understand you now always try to replace SeqScan with your >> custom vectorized scan. But it makes sense only if there are quals for this >> scan or aggregation is performed. >> In other cases batch+unbatch just adds extra overhead, doesn't it? >> > Probably extra overhead for heap format and query like 'select i from t;' > without qual, projection, aggregation. > But with column store, VectorScan could directly read batch, and no > additional batch cost. Column store is the better choice for OLAP queries. > > > Generally, yes. > But will it be true for the query with a lot of joins? > > select * from T1 join T2 on (T1.pk=T2.fk) join T3 on (T2.pk=T3.fk) join T4 > ... > > How can batching improve performance in this case? > Also if query contains LIMIT clause or cursors, then batching can cause > fetching of useless records (which never will be requested by client). > > Can we conclude that it would be better to use vector engine for OLAP > queries and row engine for OLTP queries. > > 5. Throwing and catching exception for queries which can not be vectorized >> seems to be not the safest and most efficient way of handling such cases. >> May be it is better to return error code in plan_tree_mutator and >> propagate this error upstairs? 
> > > Yes, as for efficiency, another way is to enable some plan node to be > vectorized and leave other nodes not vectorized and add batch/unbatch layer > between them(Is this what you said "propagate this error upstairs"). As you > mentioned, this could introduce additional overhead. Is there any other > good approaches? > What do you mean by not safest? > PG catch will receive the ERROR, and fallback to the original > non-vectorized plan. > > > The problem with catching and ignoring exception was many times discussed > in hackers. > Unfortunately Postgres PG_TRY/PG_CATCH mechanism is not analog of > exception mechanism in more high level languages, like C++, Java... > It doesn't perform stack unwind. If some resources (files, locks, > memory,...) were obtained before throwing error, then them are not > reclaimed. > Only rollback of transaction is guaranteed to release all resources. And > it actually happen in case of normal error processing. > But if you catch and ignore exception , trying to continue execution, then > it can cause many problems. > > May be in your case it is not a problem, because you know for sure where > error can happen: it is thrown by plan_tree_mutator > and looks like there are no resources obtained by this function. But in > any case overhead of setjmp is much higher than of explicit checks of > return code. > So checking return codes will not actually add some noticeable overhead > except code complication by adding extra checks. > But in can be hidden in macros which are used in any case (like MUTATE). > > > 7. How vectorized scan can be combined with parallel execution (it is >> already supported in9.6, isn't it?) >> > > We didn't implement it yet. But the idea is the same as non parallel one. > Copy the current parallel scan and implement vectorized Gather, keeping > their interface to be VectorTupleTableSlot. 
> Our basic idea to reuse most of the current PG executor logic, and make > them vectorized, then tuning performance gradually. > > > Parallel scan is scattering pages between parallel workers. > To fill VectorTupleSlot with data you may need more than one page (unless > you make a decision that it can fetch tuples only from single page). > So it should be somehow take in account specific of parallel search. > Also there is special nodes for parallel search so if we want to provide > parallel execution for vectorized operations we need also to substitute > this nodes with > custom nodes. > > -- > Konstantin Knizhnik > Postgres Professional: http://www.postgrespro.com > <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.postgrespro.com=DwMDaQ=lnl9vOaLMzsy2niBC8-h_K-7QJuNJEsFrzdndhuJ3Sw=lz-kpGdw_rtpgYV2ho3DjDSB5Psxis_b-3VZKON7K7c=vdzzVhvy3WXoHG6U6a8YqBZnVe-7lCDU5SzNWwPDxSM=0TXQmqH_G8_Nao7F_n5m-ekne2NfeaJJPCaRkH_4_ME=> > The Russian Postgres Company > > -- Thanks Hubert Zhang
Re: Yet another vectorized engine
Thanks Konstantin for your detailed review! On Tue, Dec 3, 2019 at 5:58 PM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > > > On 02.12.2019 4:15, Hubert Zhang wrote: > > > The prototype extension is at https://github.com/zhangh43/vectorize_engine > > > I am very sorry, that I have no followed this link. > Few questions concerning your design decisions: > > 1. Will it be more efficient to use native arrays in vtype instead of > array of Datum? I think it will allow compiler to generate more efficient > code for operations with float4 and int32 types. > It is possible to use union to keep fixed size of vtype. Yes, I'm also considering that: when scanning a column store, the column batch is loaded into a contiguous memory region. For int32, the size of this region is 4*BATCHSIZE, while for int16 it is 2*BATCHSIZE. So with a native array, a single memcpy could fill the vtype batch. > 2. Why VectorTupleSlot contains array (batch) of heap tuples rather than > vectors (array of vtype)? > a. VectorTupleSlot stores the array of vtype in the tts_values field, which reduces code changes and lets us reuse functions like ExecProject. Of course we could use a separate field to store vtypes. b. VectorTupleSlot also contains an array of heap tuples. This is used for heap tuple deforming. In fact, the tuples in a batch may span many pages, so we also need to pin an array of related pages instead of just one page. 3. Why you have to implement your own plan_tree_mutator and not using > expression_tree_mutator? > I also want to replace plan nodes, e.g. Agg->CustomScan (with a VectorAgg implementation). expression_tree_mutator cannot be used to mutate plan nodes such as Agg, am I right? > 4. As far as I understand you now always try to replace SeqScan with your > custom vectorized scan. But it makes sense only if there are quals for this > scan or aggregation is performed. > In other cases batch+unbatch just adds extra overhead, doesn't it? 
Probably there is extra overhead for the heap format and a query like 'select i from t;' without quals, projection, or aggregation. But with a column store, VectorScan could directly read a batch, with no additional batching cost. A column store is the better choice for OLAP queries. Can we conclude that it would be better to use the vector engine for OLAP queries and the row engine for OLTP queries? 5. Throwing and catching exception for queries which can not be vectorized > seems to be not the safest and most efficient way of handling such cases. > May be it is better to return error code in plan_tree_mutator and > propagate this error upstairs? Yes, as for efficiency, another way is to enable some plan nodes to be vectorized, leave other nodes non-vectorized, and add a batch/unbatch layer between them (is this what you mean by "propagate this error upstairs"?). As you mentioned, this could introduce additional overhead. Are there any other good approaches? What do you mean by "not safest"? PG_CATCH will receive the ERROR, and we fall back to the original non-vectorized plan. > 6. Have you experimented with different batch size? I have done similar > experiments in VOPS and find out that tile size larger than 128 are not > providing noticable increase of performance. > You are currently using batch size 1024 which is significantly larger than > typical amount of tuples on one page. > Good point. We will do some experiments on it. 7. How vectorized scan can be combined with parallel execution (it is > already supported in9.6, isn't it?) > We didn't implement it yet. But the idea is the same as the non-parallel one: copy the current parallel scan and implement a vectorized Gather, keeping VectorTupleTableSlot as their interface. Our basic idea is to reuse most of the current PG executor logic, make it vectorized, and then tune performance gradually. -- Thanks Hubert Zhang
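Konstantin's suggestion 1 (native arrays inside a union, keeping vtype at a fixed size) could look roughly like the following standalone sketch; the field and function names are illustrative, not the extension's actual definitions:

```c
#include <stdbool.h>

#define BATCHSIZE 1024

/* Fixed-size vtype using a union of native arrays instead of Datum[],
 * per Konstantin's suggestion.  Loading a column batch then becomes a
 * single memcpy into the matching union member. */
typedef struct vtype
{
    int  dim;                      /* number of valid entries in the batch */
    bool isnull[BATCHSIZE];
    union
    {
        short  i16[BATCHSIZE];
        int    i32[BATCHSIZE];
        float  f4[BATCHSIZE];
        double f8[BATCHSIZE];
    } values;                      /* one contiguous native array per batch */
} vtype;

/* With native arrays the inner loop needs no per-element Datum
 * conversion, so the compiler is free to auto-vectorize it. */
static float
vfloat4_sum(const vtype *v)
{
    float s = 0.0f;
    int   i;

    for (i = 0; i < v->dim; i++)
        if (!v->isnull[i])
            s += v->values.f4[i];
    return s;
}
```

The union keeps sizeof(vtype) independent of the element type, which is the property that makes batch allocation and reuse simple.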
Re: Yet another vectorized engine
On Sun, Dec 1, 2019 at 10:05 AM Michael Paquier wrote: > On Thu, Nov 28, 2019 at 05:23:59PM +0800, Hubert Zhang wrote: > > Note that the vectorized executor engine is based on PG9.6 now, but it > > could be ported to master / zedstore with some effort. We would > appreciate > > some feedback before moving further in that direction. > > There has been no feedback yet, unfortunately. The patch does not > apply anymore, so a rebase is necessary. For now I am moving the > patch to next CF, waiting on author. > -- > Michael > Thanks, we'll rebase and resubmit the patch. -- Thanks Hubert Zhang
Re: Yet another vectorized engine
Hi Konstantin, Thanks for your reply. On Fri, Nov 29, 2019 at 12:09 AM Konstantin Knizhnik < k.knizh...@postgrespro.ru> wrote: > On 28.11.2019 12:23, Hubert Zhang wrote: > > We just want to introduce another POC for vectorized execution engine > https://github.com/zhangh43/vectorize_engine and want to get some > feedback on the idea. > > But I do not completely understand why you are proposing to implement it > as extension. > Yes, custom nodes makes it possible to provide vector execution without > affecting Postgres core. > But for efficient integration of zedstore and vectorized executor we need > to extend table-AM (VectorTupleTableSlot and correspondent scan functions). > Certainly it is easier to contribute vectorized executor as extension, but > sooner or later I think it should be added to Postgres core. > > As far as I understand you already have some prototype implementation > (otherwise how you got the performance results)? > If so, are you planning to publish it or you think that executor should be > developed from scratch? > The prototype extension is at https://github.com/zhangh43/vectorize_engine I agree the vectorized executor should be added to Postgres core someday. But it is such a huge feature: it needs changes not only to the extended table-AM you mentioned but also to every executor node, such as Agg, Join, Sort etc. What's more, we need to supply vectorized versions of the expression evaluation functions and of the aggregates' transition and combine functions. Hence, we implement it as an extension first, and if it proves popular among the community and stable, we could merge it into Postgres core whenever we want. We do want to get some feedback from the community about CustomScan. CustomScan is just an abstract layer. It's typically used to support user-defined scan nodes, but some other PG extensions (e.g. PG-Strom) have already used it as a general custom node, e.g. Agg, Join etc. 
Since the vectorized engine needs to support vectorized processing in all executor nodes, following the above idea, our choice is to use CustomScan. > Some my concerns based on VOPS experience: > > 1. Vertical (columnar) model is preferable for some kind of queries, but > there are some classes of queries for which it is less efficient. > Moreover, data is used to be imported in the database in row format. > Inserting it in columnar store record-by-record is very inefficient. > So you need some kind of bulk loader which will be able to buffer input > data before loading it in columnar store. > Actually this problem it is more related with data model rather than > vectorized executor. But what I want to express here is that it may be > better to have both representation (horizontal and vertical) > and let optimizer choose most efficient one for particular query. > > Yes, in general, row format is better for OLTP queries and column format is better for OLAP queries. As for the storage type (or data model), I think the DBA should choose row or column store for each specific table. As for the executor, it's a good idea to let the optimizer choose based on cost. That is a long-term goal; for now our extension falls back to the original row executor for Insert, Update, and IndexScan cases in a rough way. We want the extension to be enhanced in a gradual way. > 2. Columnar store and vectorized executor are most efficient for query > like "select sum(x) from T where ...". > Unfortunately such simple queries are rarely used in real life. Usually > analytic queries contain group-by and joins. > And here vertical model is not always optimal (you have to reconstruct > rows from columns to perform join or grouping). > To provide efficient execution of queries you may need to create multiple > different projections of the same data (sorted by different subset of > attributes). > This is why Vertica (one of the most popular columnar store DBMS) is > supporting projections. 
> The same can be done in VOPS: using create_projection function you can > specify which attributes should be scalar (grouping attributes) and which > vectorized. > In this case you can perform grouping and joins using standard Postgres > executor, while perform vectorized operations for filtering and > accumulating aggregates. > > This is why Q1 is 20 times faster in VOPS and not 2 times as in your > prototype. > So I think that columnar store should make it possible to maintain several > projections of table and optimizer should be able to automatically choose > one of them for particular query. > Definitely synchronization of projections is challenged problem. > Fortunately OLAP usually not require most recent data. > Projection in Vertica is useful. I tested, VOPS is really faster. It could be nice if you could contribute it to PG core. Our extension is aimed t
Yet another vectorized engine
Hi hackers, We just want to introduce another POC for a vectorized execution engine, https://github.com/zhangh43/vectorize_engine, and want to get some feedback on the idea. The basic idea is to extend TupleTableSlot and introduce VectorTupleTableSlot, which is an array of datums organized by projected columns. The array of datums for each column is contiguous in memory. This makes expression evaluation cache-friendly, and SIMD can be utilized. We have currently refactored SeqScanNode and AggNode to support VectorTupleTableSlot. Below are the features of our design. 1. Pure extension. We don't hack any code into the postgres kernel. 2. CustomScan node. We use the CustomScan framework to replace original executor nodes such as SeqScan, Agg etc. Based on CustomScan, we can extend the CustomScanState, BeginCustomScan(), ExecCustomScan(), EndCustomScan() interfaces to implement the vectorized executor logic. 3. Post-planner hook. After the plan is generated, we use plan_tree_walker to traverse the plan tree and check whether it can be vectorized. If yes, the non-vectorized nodes (SeqScan, Agg etc.) are replaced with vectorized nodes (in the form of CustomScan nodes) and the vectorized executor is used. If no, we revert to the original plan and use the non-vectorized executor. In future this part could be enhanced: for example, instead of reverting to the original plan when some nodes cannot be vectorized, we could add Batch/UnBatch nodes to generate a plan with both vectorized and non-vectorized nodes. 4. Support implementing new vectorized executor nodes gradually. We have currently only vectorized SeqScan and Agg, but other queries, including those with joins, can also run when the vectorize extension is enabled. 5. Inherit original executor code. Instead of rewriting the whole executor, we choose a smoother method: modify the current Postgres executor nodes and make them vectorized. We copy the current executor node's C file into our extension and add vectorize logic based on it. 
When Postgres enhances its executor, we could relatively easily merge them back. We want to know whether this is a good way to write a vectorized executor extension. 6. Pluggable storage. Postgres supports pluggable storage now. TupleTableSlot has been refactored around the abstract struct TupleTableSlotOps. VectorTupleTableSlot could be implemented under this framework when we upgrade the extension to the latest PG. We ran the TPCH (10G) benchmark, and the result for Q1 is 50 sec (PG) vs. 28 sec (vectorized PG). The performance gain can be improved further: 1. Heap tuple deforming occupies a lot of CPU. We will try zedstore in future, since the vectorized executor is more compatible with a column store. 2. Vectorized agg is not fully vectorized and we have many optimizations to do, for example batch-computing the hash values and optimizing the hash table for vectorized HashAgg. 3. Conversion cost from Datum to the actual type and vice versa is also high, for example DatumGetFloat4 & Float4GetDatum. One optimization may be to store the actual type in VectorTupleTableSlot directly, instead of an array of datums. Related works: 1. VOPS is a vectorized execution extension. Link: https://github.com/postgrespro/vops. It doesn't use the custom scan framework; it uses UDFs to do the vectorized operations, e.g. it changes the SQL syntax to do aggregation. 2. The Citus vectorized executor is another POC. Link: https://github.com/citusdata/postgres_vectorization_test. It uses ExecutorRun_hook to run the vectorized executor and uses cstore_fdw to support column storage. Note that the vectorized executor engine is based on PG9.6 now, but it could be ported to master / zedstore with some effort. We would appreciate some feedback before moving further in that direction. Thanks, Hubert Zhang, Gang Xiong, Ning Yu, Asim Praveen
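Point 3 above can be illustrated with a standalone sketch; the Datum pack/unpack helpers below are minimal stand-ins for PostgreSQL's pass-by-value float4 conversions, not the real implementations:

```c
#include <stdint.h>
#include <string.h>

typedef uintptr_t Datum;

/* Illustrative stand-ins for DatumGetFloat4 / Float4GetDatum: a float
 * packed into and out of a Datum via memcpy. */
static Datum Float4GetDatum(float f) { Datum d = 0; memcpy(&d, &f, sizeof(f)); return d; }
static float DatumGetFloat4(Datum d) { float f;     memcpy(&f, &d, sizeof(f)); return f; }

/* Datum-based batch: every element pays a pack/unpack on each use. */
static float
sum_datums(const Datum *vals, int n)
{
    float s = 0.0f;
    int   i;

    for (i = 0; i < n; i++)
        s += DatumGetFloat4(vals[i]);
    return s;
}

/* Native-typed batch: no per-element conversion, and the loop is
 * SIMD-friendly -- the layout the optimization above proposes. */
static float
sum_native(const float *vals, int n)
{
    float s = 0.0f;
    int   i;

    for (i = 0; i < n; i++)
        s += vals[i];
    return s;
}
```

Both produce the same result; the difference is purely the per-element conversion work and the opportunity it denies the compiler's vectorizer.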
Re: accounting for memory used for BufFile during hash joins
ich is 8 now. So we have to split some tuples in R3 to disk file R7, *which is not necessary*. When probing S3, we also need to split some tuples in S3 into S7, *which is not necessary either*. Since R3 could be loaded into memory entirely, spilling part of R3 to a disk file not only introduces more files and file buffers (which is the problem Tomas is trying to solve), but also slows down performance. After the third batch is processed, we get: disk files: batch4(R4,S4), batch(R6,S6), batch(R7,S7). Next, we begin to process R4 and S4. Similar to R3, some tuples in R4 also need to be spilled to file R8. But after this split, suppose R4 is still skewed, and we increase the batch number again to 16. As a result, some tuples in R4 will be spilled to files R12 and R16. When probing S4, we similarly need to split some tuples in S4 into S8, S12 and S16. After this step, we get: disk files: batch(R6,S6), batch(R7,S7), batch(R8,S8), batch(R12,S12), batch(R16,S16). Next, when we begin to process R6 and S6, even if we could build the hash table for R6 entirely in memory, we still have to split R6 based on the new batch number 16 and spill to file R14. *It's not necessary.* Now we can conclude that increasing the batch number introduces unnecessary repeated spills not only on the original batch (R3,S3) but also on newly generated batches (R6,S6) in a cascading way. *In a worse case, suppose R2 is super skewed and needs to split 10 times, while R3 is OK to build its hash table entirely in memory. In this case, we have to introduce R7,R11,...,R4095, in total 1023 unnecessary spill files. Each of these files may contain fewer than ten tuples. Also, we need to palloc a file buffer (512KB) for each of these spill files. This is the so-called batch explosion problem.* *Solutions:* To avoid these unnecessary repeated spills, I propose to turn the function ExecHashGetBucketAndBatch into a hash function chain to determine the batchno. 
Here is the original implementation of ExecHashGetBucketAndBatch:

```
// nbatch is the global batch number
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
```

We can see the original hash function basically calculates the MOD of the global batch number. A real hybrid hash join should use a hash function chain to determine the batchno. In the new algorithm, the components of the hash function chain are defined as: MOD of #IBN, MOD of #IBN*2, MOD of #IBN*4, MOD of #IBN*8 etc., where #IBN is the initial batch number. A small batch will just use the first hash function in the chain, while a skewed batch will use as many hash functions in the chain as the number of times it has been split. Here is the new implementation of ExecHashGetBucketAndBatch:

```
/* i is the current batchno we are processing */
/* hashChainLen records the number of times batch i has been split */
for (j = 0; j < hashChainLen[i]; j++)
{
    batchno = (hashvalue >> hashtable->log2_nbuckets) & ((#IBN) * (2^j) - 1);

    /*
     * If the calculated batchno is still i, we need to call more hash
     * functions in the chain to determine the final batchno; else we
     * can return directly.
     */
    if (batchno != i)
        return batchno;
}
return batchno;
```

A quick example. Suppose R3's input is 3,7,11,15,19,23,27,31,35,15,27 (all of them satisfy MOD 4 = 3). Suppose the initial batch number is 4 and memory can hold 4 tuples, so the 5th tuple triggers a batch split. Step 1: batch3 processes 3,7,11,15,19 and now needs to split, chainLen[3]=2 batch3: 3,11,19 batch7: 7,15 Step 2: 23,27,31 coming batch3: 3,11,19,27 batch7: 7,15,23,31 Step 3: 35 coming, batch3 needs to split again, chainLen[3]=3 batch3: 3,19,35 batch7: 7,15,23,31 batch11: 11,27 Step 4: 15 coming, HashFun1 15%4=3, HashFun2 15%8=7; since 7!=3, spill 15 to batch7. Step 5: 27 coming, 27%4=3, 27%8=3, 27%16=11; since 11!=3, spill 27 to batch11. 
Final state: chainLen[3]=3 batch3: 3,19,35 batch7: 7,15,23,31,15 batch11: 11,27,27 Here is pseudo code for processing batch i:

```
/* Step 1: build hash table for Ri */
tuple = ReadFromFile(Ri);
/* get batchno by the new function */
batchno = NewExecHashGetBucketAndBatch();
/* spill if the tuple does not belong to the current batch */
if (batchno != i)
    spill to file[batchno]
flag = InsertTupleToHashTable(HT, tuple);
if (flag == NEED_SPLIT)
{
    hashChainLen[i]++;
    /* then call ExecHashIncreaseNumBatches() to do the real spill */
}

/* probe stage */
tuple = ReadFromFile(S[i + Bi*k]);
batchno = NewExecHashGetBucketAndBatch();
if (batchno == curbatch)
    probe and match
else
    spillToFile(tuple, batchno)
```

This solution only splits the batches that need to be split, in a lazy way. If this solution makes sense, I would like to write the real patch. Any comments? -- Thanks Hubert Zhang
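The hash-function-chain logic above can be exercised as a standalone sketch (assuming log2_nbuckets = 0 so the value itself serves as the hash value, and hard-coding the chain length from the worked example; names are illustrative, not a patch):

```c
#define INITIAL_NBATCH 4   /* #IBN from the worked example */

/* hashChainLen[i] = number of hash functions in the chain for batch i;
 * batch 3 was split twice, so its chain has length 3. */
static int hashChainLen[INITIAL_NBATCH] = {1, 1, 1, 3};

/* Sketch of the proposed NewExecHashGetBucketAndBatch: try MOD of
 * IBN, IBN*2, IBN*4, ... until the tuple no longer maps to batch i,
 * or the chain for batch i is exhausted. */
static int
new_get_batchno(unsigned int hashvalue, int i)
{
    int j;
    int batchno = i;

    for (j = 0; j < hashChainLen[i]; j++)
    {
        batchno = (int) (hashvalue & (INITIAL_NBATCH * (1u << j) - 1));
        if (batchno != i)
            return batchno;     /* settled by an earlier split level */
    }
    return batchno;             /* stays in the current batch */
}
```

Running the tuples from the worked example through it reproduces the final state: 35 stays in batch 3, 15 lands in batch 7, and 27 lands in batch 11.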
How to create named portal except cursor?
Hi all, Is there any way to create a named portal in PG other than a cursor? I tried the PostgreSQL JDBC driver and used PreparedStatement. The backend receives the `bind` and `execute` messages, but the portal name is still empty. How can I specify the portal name? -- Thanks Hubert Zhang
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
Thanks, Thomas. On Mon, Jul 8, 2019 at 6:47 AM Thomas Munro wrote: > On Mon, Feb 18, 2019 at 7:39 PM Hubert Zhang wrote: > > Based on the assumption we use smgr as hook position, hook API option1 or option2 which is better? > > Or we could find some balanced API between option1 and option2? > > > > Again comments on other better hook positions are also appreciated! > > Hi Hubert, > > The July Commitfest is now running, and this entry is in "needs > review" state. Could you please post a rebased patch? > > I have questions about how disk quotas should work and I think we'll > probably eventually want more hooks than these, but simply adding > these hooks so extensions can do whatever they want doesn't seem very > risky for core. I think it's highly likely that the hook signatures > will have to change in future releases too, but that seems OK for such > detailed internal hooks. As for your question, my first reaction was > that I preferred your option 1, because SMgrRelation seems quite > private and there are no existing examples of that object being > exposed to extensions. But on reflection, other callbacks don't take > such a mollycoddling approach. So my vote is for option 2 "just pass > all the arguments to the callback", which I understand to be the > approach of patch you have posted. > > +if (smgrcreate_hook) > +{ > +(*smgrcreate_hook)(reln, forknum, isRedo); > +} > > Usually we don't use curlies for single line if branches. > > I have rebased the patch to v4 and removed the unnecessary braces. As per your comments, option 2 is still used in patch v4. Agreed that the diskquota extension may use more hooks in future. Currently the behavior of the diskquota extension is as follows: we use the smgr hooks to detect active tables and record them in shared memory. Bgworkers of the diskquota extension read these active tables from shared memory, calculate the latest table sizes, and sum them into the size of each schema or role. 
If size of schema of role exceeds their quota limit, they will be put into a black list in shared memory. When a new query comes, ExecutorCheckPerms_hook will be used to check the black list the cancel the query if needed. -- Thanks Hubert Zhang disk_quota_hooks_v4.patch Description: Binary data
Re: accounting for memory used for BufFile during hash joins
Hi Tomas, Here is the patch; it should be compatible with your patch, and it focuses on when to grow the number of batches. On Tue, May 28, 2019 at 3:40 PM Hubert Zhang wrote: > On Sat, May 4, 2019 at 8:34 AM Tomas Vondra > wrote: > >> The root cause is that hash join treats batches as pretty much free, but >> that's not really true - we do allocate two BufFile structs per batch, >> and each BufFile is ~8kB as it includes PGAlignedBuffer. >> >> The OOM is not very surprising, because with 524288 batches it'd need >> about 8GB of memory, and the system only has 8GB RAM installed. >> >> The second patch tries to enforce work_mem more strictly. That would be >> impossible if we were to keep all the BufFile structs in memory, so >> instead it slices the batches into chunks that fit into work_mem, and >> then uses a single "overflow" file for slices currently not in memory. >> These extra slices can't be counted into work_mem, but we should need >> just very few of them. For example with work_mem=4MB the slice is 128 >> batches, so we need 128x less overflow files (compared to per-batch). >> >> > Hi Tomas > > I read your second patch which uses overflow buf files to reduce the total > number of batches. > It would solve the hash join OOM problem what you discussed above: 8K per > batch leads to batch bloating problem. > > I mentioned in another thread: > > https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com > There is another hashjoin OOM problem which disables splitting batches too > early. PG uses a flag hashtable->growEnable to determine whether to split > batches. Once one splitting failed(all the tuples are assigned to only one > batch of two split ones) The growEnable flag would be turned off forever. > > The is an opposite side of batch bloating problem. It only contains too > few batches and makes the in-memory hash table too large to fit into memory. 
> > Here is the tradeoff: one batch takes more than 8KB(8KB makes sense, due > to performance), in-memory hash table takes memory as well and splitting > batched may(not must) reduce the in-memory hash table size but introduce > more batches(and thus more memory usage 8KB*#batch). > Can we conclude that it would be worth to splitting if satisfy: > (The reduced memory of in-memory hash table) - (8KB * number of new > batches) > 0 > > So I'm considering to combine our patch with your patch to fix join OOM > problem. No matter the OOM is introduced by (the memory usage of > in-memory hash table) or (8KB * number of batches). > > nbatch_inmemory in your patch could also use the upper rule to redefine. > > What's your opinion? > > Thanks > > Hubert Zhang > -- Thanks Hubert Zhang 0001-Allow-to-continue-to-split-batch-when-tuples-become-.patch Description: Binary data
Re: accounting for memory used for BufFile during hash joins
On Sat, May 4, 2019 at 8:34 AM Tomas Vondra wrote: > The root cause is that hash join treats batches as pretty much free, but > that's not really true - we do allocate two BufFile structs per batch, > and each BufFile is ~8kB as it includes PGAlignedBuffer. > > The OOM is not very surprising, because with 524288 batches it'd need > about 8GB of memory, and the system only has 8GB RAM installed. > > The second patch tries to enforce work_mem more strictly. That would be > impossible if we were to keep all the BufFile structs in memory, so > instead it slices the batches into chunks that fit into work_mem, and > then uses a single "overflow" file for slices currently not in memory. > These extra slices can't be counted into work_mem, but we should need > just very few of them. For example with work_mem=4MB the slice is 128 > batches, so we need 128x less overflow files (compared to per-batch). > > Hi Tomas, I read your second patch, which uses overflow buf files to reduce the total number of batches. It would solve the hash join OOM problem you discussed above: ~8 kB per batch leads to the batch bloating problem. I mentioned in another thread: https://www.postgresql.org/message-id/flat/CAB0yrekv%3D6_T_eUe2kOEvWUMwufcvfd15SFmCABtYFOkxCFdfA%40mail.gmail.com There is another hashjoin OOM problem, caused by disabling batch splitting too early. PG uses a flag, hashtable->growEnable, to determine whether to split batches. Once one split fails (all the tuples are assigned to only one of the two new batches), the growEnable flag is turned off forever. This is the opposite side of the batch bloating problem: the join ends up with too few batches, making the in-memory hash table too large to fit into memory. Here is the tradeoff: each batch takes more than 8 kB (8 kB makes sense, for performance reasons), the in-memory hash table takes memory as well, and splitting batches may (but need not) reduce the in-memory hash table size while introducing more batches (and thus more memory usage, 8 kB * #batches). 
Can we conclude that splitting is worthwhile only if the following holds: (the reduced memory of the in-memory hash table) - (8 kB * number of new batches) > 0 So I'm considering combining our patch with your patch to fix the join OOM problem, no matter whether the OOM is caused by (the memory usage of the in-memory hash table) or by (8 kB * number of batches). nbatch_inmemory in your patch could also be redefined using the above rule. What's your opinion? Thanks Hubert Zhang
Re: Replace hashtable growEnable flag
Thanks Tomas. I will follow this problem on your thread. This thread could be terminated. On Thu, May 16, 2019 at 3:58 AM Tomas Vondra wrote: > On Wed, May 15, 2019 at 06:19:38PM +0800, Hubert Zhang wrote: > >Hi all, > > > >When we build hash table for a hash join node etc., we split tuples into > >different hash buckets. Since tuples could not all be held in memory. > >Postgres splits each bucket into batches, only the current batch of bucket > >is in memory while other batches are written to disk. > > > >During ExecHashTableInsert(), if the memory cost exceeds the operator > >allowed limit(hashtable->spaceAllowed), batches will be split on the fly > by > >calling ExecHashIncreaseNumBatches(). > > > >In past, if data is distributed unevenly, the split of batch may > failed(All > >the tuples falls into one split batch and the other batch is empty) Then > >Postgres will set hashtable->growEnable to false. And never expand batch > >number any more. > > > >If tuples become diverse in future, spliting batch is still valuable and > >could avoid the current batch become too big and finally OOM. > > > >To fix this, we introduce a penalty on hashtable->spaceAllowed, which is > >the threshold to determine whether to increase batch number. > >If batch split failed, we increase the penalty instead of just turn off > the > >growEnable flag. > > > >Any comments? > > > > There's already another thread discussing various issues with how hashjoin > increases the number of batches, including various issues with how/when we > disable adding more batches. > > https://commitfest.postgresql.org/23/2109/ > > In general I think you're right something like this is necessary, but I > think we may need to rethink growEnable a bit more. > > For example, the way you implemented it, after reaching the increased > limit, we just increase the number of batches just like today, and then > decide whether it actually helped. 
But that means we double the number of > BufFile entries, which uses more and more memory (because each is 8kB and > we need 1 per batch). I think in this case (after increasing the limit) we > should check whether increasing batches makes sense or not. And only do it > if it helps. Otherwise we'll double the amount of memory for BufFile(s) > and also the work_mem. That's not a good idea. > > But as I said, there are other issues discussed on the other thread. For > example we only disable the growth when all rows fall into the same batch. > But that's overly strict. > > > regards > > -- > Tomas Vondra > http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > > -- Thanks Hubert Zhang
Replace hashtable growEnable flag
Hi all, When we build a hash table for a hash join node etc., we split tuples into different hash buckets. Since the tuples cannot all be held in memory, Postgres splits each bucket into batches; only the current batch of a bucket is in memory, while the other batches are written to disk. During ExecHashTableInsert(), if the memory cost exceeds the operator's allowed limit (hashtable->spaceAllowed), batches will be split on the fly by calling ExecHashIncreaseNumBatches(). In the past, if data was distributed unevenly, a batch split could fail (all the tuples fall into one of the two new batches and the other is empty); Postgres would then set hashtable->growEnable to false and never expand the batch number again. If tuples become more diverse later, splitting batches is still valuable and could prevent the current batch from growing too big and finally causing OOM. To fix this, we introduce a penalty on hashtable->spaceAllowed, which is the threshold used to determine whether to increase the batch number. If a batch split fails, we increase the penalty instead of just turning off the growEnable flag. Any comments? -- Thanks Hubert Zhang 0001-Using-growPenalty-to-replace-growEnable-in-hashtable.patch Description: Binary data
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
Hi Andres On Sat, Feb 16, 2019 at 12:53 PM Andres Freund wrote: > Hi, > On 2019-01-30 10:26:52 +0800, Hubert Zhang wrote: > > Hi Michael, Robert > > For you question about the hook position, I want to explain more about > the > > background why we want to introduce these hooks. > > We wrote a diskquota extension < > https://github.com/greenplum-db/diskquota> > > [ ...] > > On Tue, Jan 22, 2019 at 12:08 PM Hubert Zhang wrote: > > > > > > For this particular purpose, I don't immediately see why you need a > > >> > hook in both places. If ReadBuffer is called with P_NEW, aren't we > > >> > guaranteed to end up in smgrextend()? > > >> Yes, that's a bit awkward. > > Please note that on PG lists the customary and desired style is to quote > inline in messages rather than top-quote. > > Greetings, > > Andres Freund > Thanks for your note. I will reply the above questions again. On Wed, Nov 21, 2018 at 10:48 PM Robert Haas wrote: > On Tue, Nov 20, 2018 at 2:20 AM Haozhou Wang wrote: > > We prepared a patch that includes the hook points. And such hook points > are needed for disk quota extension. > > There are two hooks. > > One is SmgrStat_hook. It's used to perform ad-hoc logic in storage when > doing smgr create/extend/truncate in general. As for disk quota extension, > this hook is used to detect active tables(new created tables, tables > extending new blocks, or tables being truncated) > > The other is BufferExtendCheckPerms_hook. It's used to perform ad-hoc > logic when buffer extend a new block. Since ReadBufferExtended is a hot > function, we call this hook only when blockNum == P_NEW. As for disk quota > extension, this hook is used to do query enforcement during the query is > loading data. > > > > Any comments are appreciated. > > +1 for adding some hooks to support this kind of thing, but I think > the names you've chosen are not very good. 
The hook name should > describe the place from which it is called, not the purpose for which > one imagines that it will be used, because somebody else might imagine > another use. Both BufferExtendCheckPerms_hook_type and > SmgrStat_hook_type are imagining that they know what the hook does - > CheckPerms in the first case and Stat in the second case. > > For this particular purpose, I don't immediately see why you need a > hook in both places. If ReadBuffer is called with P_NEW, aren't we > guaranteed to end up in smgrextend()? We have removed the hook in ReadBufferExtended and only keep the hooks in SMGR. On Wed, Dec 26, 2018 at 1:56 PM Michael Paquier wrote: > On Wed, Nov 21, 2018 at 09:47:44AM -0500, Robert Haas wrote: > > +1 for adding some hooks to support this kind of thing, but I think > > the names you've chosen are not very good. The hook name should > > describe the place from which it is called, not the purpose for which > > one imagines that it will be used, because somebody else might imagine > > another use. Both BufferExtendCheckPerms_hook_type and > > SmgrStat_hook_type are imagining that they know what the hook does - > > CheckPerms in the first case and Stat in the second case. > > I personally don't mind making Postgres more pluggable, but I don't > think that we actually need the extra ones proposed here at the layer > of smgr, as smgr is already a layer designed to call an underlying set > of APIs able to extend, unlink, etc. depending on the storage type. > > The reason to use smgr as the hook position is as follows: These hooks will be used by diskquota extension, which needs to gather the current disk usage for each schema. We want to use hooks to detect the Active Tables, and only recalculate the disk size of all the Active Tables. As you mentioned, smgr is the layer to call underlying API to extend/unlink files. So it's also the place to detect the table size change, i.e. the Active Tables. 
For example, the smgrextend hook indicates that a new block was allocated and the corresponding table needs to be treated as an Active Table. Besides, supposing we use smgr as the hook position, I want to discuss the API passed to the hook function. Take smgrextend as an example. The function interface of smgrextend looks like this: ``` void smgrextend (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync); ``` So the hook API of smgrextend could have two main options. Hook API Option 1 ``` typedef void (*smgrextend_hook_type) (RelFileNode smgr_rnode, ForkNumber forknum); ``` Hook API Option 2 ``` typedef void (*smgrextend_hook_type) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync); ``` As for Option 1: since the diskquota extension only needs the relfilenode information to detect the active tables, Option 1 just passes the RelFileNode to the hook function.
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
Hi all, Currently we add hooks in SMGR to detect which table is being modified (disk size change). Inserting rows into an existing page with free space does not allocate a new block, and thus should not mark the table as active; smgrextend is a good position to capture this behavior. We welcome suggestions on other, better hook positions! Besides, supposing we use smgr as the hook position, I want to discuss the API passed to the hook function. Take smgrextend as an example. The function interface of smgrextend looks like this: ``` void smgrextend (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync); ``` So the hook API of smgrextend could have two main options. Hook API Option 1 ``` typedef void (*smgrextend_hook_type) (RelFileNode smgr_rnode, ForkNumber forknum); ``` Hook API Option 2 ``` typedef void (*smgrextend_hook_type) (SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync); ``` As for Option 1: since the diskquota extension only needs the relfilenode information to detect the active tables, Option 1 just passes the RelFileNode to the hook function. It's clearer and carries semantic meaning. Option 2, by contrast, passes the original parameters to the hook function without any filtering. This is more general and lets different hook implementations decide how to use these parameters. Option 1 also needs some additional work to handle the smgrdounlinkall case, whose input parameter is a list of SMgrRelation; we may need to palloc a RelFileNode list and pfree it manually. The smgrdounlinkall function interface: ``` smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo, char *relstorages) ``` Based on the assumption that we use smgr as the hook position, which is better: hook API Option 1 or Option 2? Or could we find some balanced API between them? Again, comments on other, better hook positions are appreciated! 
Thanks Hubert On Wed, Jan 30, 2019 at 10:26 AM Hubert Zhang wrote: > Hi Michael, Robert > For you question about the hook position, I want to explain more about the > background why we want to introduce these hooks. > We wrote a diskquota extension <https://github.com/greenplum-db/diskquota> > for Postgresql(which is inspired by Heikki's pg_quota > <https://github.com/hlinnaka/pg_quota>). Diskquota extension is used to > control the disk usage in Postgresql in a fine-grained way, which means: > 1. You could set disk quota limit at schema level or role level. > 2. A background worker will gather the current disk usage for each > schema/role in realtime. > 3. A background worker will generate the blacklist for schema/role whose > quota limit is exceeded. > 4. New transaction want to insert data into the schema/role in the > blacklist will be cancelled. > > In step 2, gathering the current disk usage for each schema needs to sum > disk size of all the tables in this schema. This is a time consuming > operation. We want to use hooks in SMGR to detect the Active Table, and > only recalculate the disk size of all the Active Tables. > For example, the smgrextend hook indicates that you allocate a new block > and the table need to be treated as Active Table. > > Do you have some better hook positions recommend to solve the above user > case? > Thanks in advance. > > Hubert > > > > > > On Tue, Jan 22, 2019 at 12:08 PM Hubert Zhang wrote: > >> > For this particular purpose, I don't immediately see why you need a >>> > hook in both places. If ReadBuffer is called with P_NEW, aren't we >>> > guaranteed to end up in smgrextend()? >>> Yes, that's a bit awkward. >> >> >> Hi Michael, we revisit the ReadBuffer hook and remove it in the latest >> patch. >> ReadBuffer hook is original used to do enforcement(e.g. out of diskquota >> limit) when query is loading data. >> We plan to put the enforcement work of running query to separate >> diskquota worker process. 
>> Let worker process to detect the backends to be cancelled and send SIGINT >> to these backends. >> So there is no need for ReadBuffer hook anymore. >> >> Our patch currently only contains smgr related hooks to catch the file >> change and get the Active Table list for diskquota extension. >> >> Thanks Hubert. >> >> >> On Mon, Jan 7, 2019 at 6:56 PM Haozhou Wang wrote: >> >>> Thanks very much for your comments. >>> >>> To the best of my knowledge, smgr is a layer that abstract the storage >>> operations. Therefore, it is a good place to control or collect information >>> the storage operations without touching the physical storage layer. >>> Moreover, smgr is coming with actual disk IO operation (not consider the >>> OS cache) for postgres. So we do not need to worry about the buffer management in postgres.
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
Hi Michael, Robert, For your question about the hook position, I want to explain more of the background on why we want to introduce these hooks. We wrote a diskquota extension <https://github.com/greenplum-db/diskquota> for Postgresql (inspired by Heikki's pg_quota <https://github.com/hlinnaka/pg_quota>). The diskquota extension is used to control disk usage in Postgresql in a fine-grained way, which means: 1. You can set a disk quota limit at the schema level or role level. 2. A background worker gathers the current disk usage for each schema/role in real time. 3. A background worker generates a blacklist of schemas/roles whose quota limit is exceeded. 4. New transactions that want to insert data into a schema/role on the blacklist are cancelled. In step 2, gathering the current disk usage for each schema requires summing the disk size of all the tables in that schema, which is a time-consuming operation. We want to use hooks in SMGR to detect the Active Tables, and only recalculate the disk size of those Active Tables. For example, the smgrextend hook indicates that a new block was allocated and the table needs to be treated as an Active Table. Do you have better hook positions to recommend for the above use case? Thanks in advance. Hubert On Tue, Jan 22, 2019 at 12:08 PM Hubert Zhang wrote: > > For this particular purpose, I don't immediately see why you need a >> > hook in both places. If ReadBuffer is called with P_NEW, aren't we >> > guaranteed to end up in smgrextend()? >> Yes, that's a bit awkward. > > Hi Michael, we revisit the ReadBuffer hook and remove it in the latest > patch. > ReadBuffer hook is original used to do enforcement(e.g. out of diskquota > limit) when query is loading data. > We plan to put the enforcement work of running query to separate diskquota > worker process. > Let worker process to detect the backends to be cancelled and send SIGINT > to these backends. > So there is no need for ReadBuffer hook anymore. 
> > Our patch currently only contains smgr related hooks to catch the file > change and get the Active Table list for diskquota extension. > > Thanks Hubert. > > > On Mon, Jan 7, 2019 at 6:56 PM Haozhou Wang wrote: > >> Thanks very much for your comments. >> >> To the best of my knowledge, smgr is a layer that abstract the storage >> operations. Therefore, it is a good place to control or collect information >> the storage operations without touching the physical storage layer. >> Moreover, smgr is coming with actual disk IO operation (not consider the >> OS cache) for postgres. So we do not need to worry about the buffer >> management in postgres. >> It will make the purpose of hook is pure: a hook for actual disk IO. >> >> Regards, >> Haozhou >> >> On Wed, Dec 26, 2018 at 1:56 PM Michael Paquier >> wrote: >> >>> On Wed, Nov 21, 2018 at 09:47:44AM -0500, Robert Haas wrote: >>> > +1 for adding some hooks to support this kind of thing, but I think >>> > the names you've chosen are not very good. The hook name should >>> > describe the place from which it is called, not the purpose for which >>> > one imagines that it will be used, because somebody else might imagine >>> > another use. Both BufferExtendCheckPerms_hook_type and >>> > SmgrStat_hook_type are imagining that they know what the hook does - >>> > CheckPerms in the first case and Stat in the second case. >>> >>> I personally don't mind making Postgres more pluggable, but I don't >>> think that we actually need the extra ones proposed here at the layer >>> of smgr, as smgr is already a layer designed to call an underlying set >>> of APIs able to extend, unlink, etc. depending on the storage type. >>> >>> > For this particular purpose, I don't immediately see why you need a >>> > hook in both places. If ReadBuffer is called with P_NEW, aren't we >>> > guaranteed to end up in smgrextend()? >>> >>> Yes, that's a bit awkward. >>> -- >>> Michael >> >> > > -- > Thanks > > Hubert Zhang > -- Thanks Hubert Zhang
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
> > > For this particular purpose, I don't immediately see why you need a > > hook in both places. If ReadBuffer is called with P_NEW, aren't we > > guaranteed to end up in smgrextend()? > Yes, that's a bit awkward. Hi Michael, we revisited the ReadBuffer hook and removed it in the latest patch. The ReadBuffer hook was originally used to do enforcement (e.g. over the diskquota limit) while a query is loading data. We plan to move the enforcement work for running queries to a separate diskquota worker process, which will detect the backends to be cancelled and send SIGINT to them. So there is no need for the ReadBuffer hook anymore. Our patch currently only contains smgr-related hooks to catch file changes and get the Active Table list for the diskquota extension. Thanks, Hubert. On Mon, Jan 7, 2019 at 6:56 PM Haozhou Wang wrote: > Thanks very much for your comments. > > To the best of my knowledge, smgr is a layer that abstract the storage > operations. Therefore, it is a good place to control or collect information > the storage operations without touching the physical storage layer. > Moreover, smgr is coming with actual disk IO operation (not consider the > OS cache) for postgres. So we do not need to worry about the buffer > management in postgres. > It will make the purpose of hook is pure: a hook for actual disk IO. > > Regards, > Haozhou > > On Wed, Dec 26, 2018 at 1:56 PM Michael Paquier > wrote: > >> On Wed, Nov 21, 2018 at 09:47:44AM -0500, Robert Haas wrote: >> > +1 for adding some hooks to support this kind of thing, but I think >> > the names you've chosen are not very good. The hook name should >> > describe the place from which it is called, not the purpose for which >> > one imagines that it will be used, because somebody else might imagine >> > another use. Both BufferExtendCheckPerms_hook_type and >> > SmgrStat_hook_type are imagining that they know what the hook does - >> > CheckPerms in the first case and Stat in the second case. 
>> >> I personally don't mind making Postgres more pluggable, but I don't >> think that we actually need the extra ones proposed here at the layer >> of smgr, as smgr is already a layer designed to call an underlying set >> of APIs able to extend, unlink, etc. depending on the storage type. >> >> > For this particular purpose, I don't immediately see why you need a >> > hook in both places. If ReadBuffer is called with P_NEW, aren't we >> > guaranteed to end up in smgrextend()? >> >> Yes, that's a bit awkward. >> -- >> Michael > > -- Thanks Hubert Zhang disk_quota_hooks_v3.patch Description: Binary data
Discussion: Fast DB/Schema/Table disk size check in Postgresql
an extension. Thanks in advance! -- Thanks Hubert Zhang
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
> > For this particular purpose, I don't immediately see why you need a > hook in both places. If ReadBuffer is called with P_NEW, aren't we > guaranteed to end up in smgrextend()? For the use case in diskquota: BufferExtendCheckPerms_hook is used to do dynamic query enforcement, while the smgr-related hooks are used to detect the relfilenode oids of active tables and write them into shared memory (which is used to accelerate refreshing the diskquota model). The reason we don't use the smgr_extend hook instead of the ReadBuffer hook for enforcement is twofold: 1. For enforcement, we don't want to affect the performance of insert queries, but a hook in smgr_extend would first need to convert the relfilenode to a reloid, which requires an index scan. 2. Using a hook in ReadBuffer instead of smgr_extend avoids enforcement during 'cluster relation' operations. For example, 'vacuum full table' will first cluster and create a new table, and then delete the old one. Because the disk usage first grows and then shrinks, if the quota limit is reached, vacuum full would fail (but in fact we want vacuum full to reduce disk usage). Using a hook in ReadBuffer is one solution to this problem; of course, there are others. But this is one of the reasons we use BufferExtendCheckPerms_hook for enforcement at the current stage. On Thu, Nov 22, 2018 at 7:26 PM Haozhou Wang wrote: > Thank you very much for your review. > We refactored our patch with new names and comments. > > For ReadBufferExtended hook, yes, Readbuffer with P_NEW will then call > smgrextend. > > But in smgrextend, we cannot get the oid of a relation, and it will take > some time to get the oid via smgrrelation. > We would like to add a hook just before the smgrextend to get the oid and > avoid use RelidByRelfilenode(). > > New patch is attached in the attachment. > Thank a lot! 
> > Regards, > Haozhou > > > On Wed, Nov 21, 2018 at 10:48 PM Robert Haas > wrote: > >> On Tue, Nov 20, 2018 at 2:20 AM Haozhou Wang wrote: >> > We prepared a patch that includes the hook points. And such hook points >> are needed for disk quota extension. >> > There are two hooks. >> > One is SmgrStat_hook. It's used to perform ad-hoc logic in storage when >> doing smgr create/extend/truncate in general. As for disk quota extension, >> this hook is used to detect active tables(new created tables, tables >> extending new blocks, or tables being truncated) >> > The other is BufferExtendCheckPerms_hook. It's used to perform ad-hoc >> logic when buffer extend a new block. Since ReadBufferExtended is a hot >> function, we call this hook only when blockNum == P_NEW. As for disk quota >> extension, this hook is used to do query enforcement during the query is >> loading data. >> > >> > Any comments are appreciated. >> >> +1 for adding some hooks to support this kind of thing, but I think >> the names you've chosen are not very good. The hook name should >> describe the place from which it is called, not the purpose for which >> one imagines that it will be used, because somebody else might imagine >> another use. Both BufferExtendCheckPerms_hook_type and >> SmgrStat_hook_type are imagining that they know what the hook does - >> CheckPerms in the first case and Stat in the second case. >> >> For this particular purpose, I don't immediately see why you need a >> hook in both places. If ReadBuffer is called with P_NEW, aren't we >> guaranteed to end up in smgrextend()? >> >> -- >> Robert Haas >> EnterpriseDB: http://www.enterprisedb.com >> The Enterprise PostgreSQL Company >> > > > -- Thanks Hubert Zhang
Re: Control your disk usage in PG: Introduction to Disk Quota Extension
Thanks, Tomas. Yes, we want to add the hooks to the postgres repo. But before that, we plan to first get some feedback from the community about the diskquota extension's implementation and usage. Later, we'll update our license and submit the hooks to a CF. Thanks Hubert On Wed, Nov 14, 2018 at 3:54 AM Tomas Vondra wrote: > On Tue, 2018-11-13 at 16:47 +0800, Hubert Zhang wrote: > > Hi all, > > > > We implement disk quota feature on Postgresql as an extension(link: > > https://github.com/greenplum-db/diskquota), > > If you are interested, try and use it to limit the amount of disk > > space that > > a schema or a role can use in Postgresql. > > Any feedback or question are appreciated. > > > > It's not clear to me what's the goal of this thread? I understand what > quotas are about, but are you sharing it because (a) it's a useful > extension, (b) you propose adding a couple of new hooks (and keep the > extension separate) or (c) you propose adding both the hooks and the > extension (into contrib)? > > I assume it's either (b) and (c), in which case you should add it to > 2019-01 CF: https://commitfest.postgresql.org/21/ > > In either case, it might be useful to add a LICENSE to the github > repository, it's not clear what's the situation in this respect. That > probably matters especially for (b), because for (c) it'd end up with > PostgreSQL license automatically. > > regards > > -- > Tomas Vondra http://www.2ndQuadrant.com > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > > -- Thanks Hubert Zhang
Control your disk usage in PG: Introduction to Disk Quota Extension
disk quota stats periodically diskquota.naptime = 2 # restart database to load preload library. pg_ctl restart 1. Create diskquota extension in monitored database. create extension diskquota; 1. Reload database configuration # reset monitored database list in postgresql.conf diskquota.monitor_databases = 'postgres, postgres2' # reload configuration pg_ctl reload <https://github.com/greenplum-db/diskquota/blob/master/README.md#usage>Usage 1. Set/update/delete schema quota limit using diskquota.set_schema_quota create schema s1; select diskquota.set_schema_quota('s1', '1 MB'); set search_path to s1; create table a(i int); # insert small data succeeded insert into a select generate_series(1,100); # insert large data failed insert into a select generate_series(1,1000); # insert small data failed insert into a select generate_series(1,100); # delete quota configuration select diskquota.set_schema_quota('s1', '-1'); # insert small data succeed select pg_sleep(5); insert into a select generate_series(1,100); reset search_path; 2. Set/update/delete role quota limit using diskquota.set_role_quota create role u1 nologin; create table b (i int); alter table b owner to u1; select diskquota.set_role_quota('u1', '1 MB'); # insert small data succeeded insert into b select generate_series(1,100); # insert large data failed insert into b select generate_series(1,1000); # insert small data failed insert into b select generate_series(1,100); # delete quota configuration select diskquota.set_role_quota('u1', '-1'); # insert small data succeed select pg_sleep(5); insert into a select generate_series(1,100); reset search_path; 3. Show schema quota limit and current usage select * from diskquota.show_schema_quota_view; <https://github.com/greenplum-db/diskquota/blob/master/README.md#test>Test Run regression tests. 
cd contrib/diskquota; make installcheck

Benchmark & Performance Test <https://github.com/greenplum-db/diskquota/blob/master/README.md#benchmark--performence-test>

Cost of diskquota worker <https://github.com/greenplum-db/diskquota/blob/master/README.md#cost-of-diskquota-worker>

During each refresh interval, the disk quota worker needs to refresh the disk quota model. It takes less than 100ms with 100K user tables and no active tables, and less than 200ms with 100K user tables and 1K active tables.

Impact on OLTP queries <https://github.com/greenplum-db/diskquota/blob/master/README.md#impact-on-oltp-queries>

We ran OLTP queries to measure the impact of enabling the diskquota feature. The number of tables ranges from 2k to 10k, each connection inserts 100 rows into each table, and the number of parallel connections ranges from 5 to 25. The number of active tables is around 1k.

Without diskquota enabled (seconds):

connections |     2k |     4k |     6k |     8k |     10k
          5 |  4.002 | 11.356 | 18.460 | 28.591 |  41.123
         10 |  4.832 | 11.988 | 21.113 | 32.829 |  45.832
         15 |  6.238 | 16.896 | 28.722 | 45.375 |  64.642
         20 |  8.036 | 21.711 | 38.499 | 61.763 |  87.875
         25 |  9.909 | 27.175 | 47.996 | 75.688 | 106.648

With diskquota enabled (seconds):

connections |     2k |     4k |     6k |     8k |     10k
          5 |  4.135 | 10.641 | 18.776 | 28.804 |  41.740
         10 |  4.773 | 12.407 | 22.351 | 34.243 |  47.568
         15 |  6.355 | 17.305 | 30.941 | 46.967 |  66.216
         20 |  9.451 | 22.231 | 40.645 | 61.758 |  88.309
         25 | 10.096 | 26.844 | 48.910 | 76.537 | 108.025

The performance difference with and without diskquota enabled is less than 2-3% in most cases, so there is no significant performance downgrade when diskquota is enabled.

Notes <https://github.com/greenplum-db/diskquota/blob/master/README.md#notes>

1. Dropping a database with diskquota enabled. If the DBA enables diskquota monitoring on a database, the diskquota worker process keeps a connection to that database. The DBA needs to first remove the database from diskquota.monitor_databases in postgresql.conf and reload the configuration by calling pg_ctl reload; then the database can be dropped successfully.

2. Temp tables.
Diskquota supports limiting the disk usage of temp tables as well, but schema quotas and role quotas treat them differently. For a role quota, diskquota treats a temp table the same as a normal table and counts its size toward its owner's quota. For a schema quota, a temp table lives under the per-backend namespace 'pg_temp_<backend_id>', so its size is not counted toward the current schema's quota. -- Thanks Hubert Zhang, Haozhou Wang, Hao Wu, Jack WU
Re: Is there way to detect uncommitted 'new table' in pg_class?
Thanks On Thu, Nov 1, 2018 at 8:38 AM Michael Paquier wrote: > On Wed, Oct 31, 2018 at 01:30:52PM -0400, Robert Haas wrote: > > In theory, at least, you could write C code to scan the catalog tables > > with SnapshotDirty to find the catalog entries, but I don't think that > > helps a whole lot. You couldn't necessarily rely on those catalog > > entries to be in a consistent state, and even if they were, they might > > depend on committed types or functions or similar whose definitions > > your backend can't see. Moreover, the creating backend will have an > > AccessExclusiveLock on the table -- if you write C code to bypass that > > and read the data anyway, then you will probably destabilize the > > entire system for complicated reasons that I don't feel like > > explaining right now. > > One take here is that we cannot give any guarantee that a single DDL > will create only one consistent version of the tuple added in system > catalogs. In those cases a new version is made visible by using > CommandCounterIncrement() so as the follow-up processing can see it. > > > You should try very hard to find some way of solving this problem that > > doesn't require reading data from a table that hasn't been committed > > yet, because you are almost certainly not going to be able to make > > that work reliably even if you are willing to write code in C. > > +1. > -- > Michael > -- Thanks Hubert Zhang
Is there way to detect uncommitted 'new table' in pg_class?
Hi all, In PG, READ UNCOMMITTED is treated as READ COMMITTED, but I have a requirement to read dirty tables. Is there a way to detect a table which was created in another, uncommitted transaction?

T1: BEGIN; create table a(i int);
T2: select * from pg_class where relname='a';  -- could this return table a?

-- Thanks Hubert Zhang
Re: Is there any way to request unique lwlock inside a background worker in PG9.4?
Thanks a lot. On Wed, Oct 17, 2018 at 11:21 PM Andres Freund wrote: > Hi, > > On 2018-10-17 23:11:26 +0800, Hubert Zhang wrote: > > The section "Share Memory and LWLocks" describe the AddinShmemInitLock > which > > is used to protect the ShmemInitStruct() when backend workers initialize > > their shm. My requirement is to how to protect the shm access within the > > bgworkers(not at init stage). This lock should be bgworkers specific. > > Rereead the page please. After you RequestAddinLWLocks() during > initialization, you can use LWLockAssign() to get an lwlock. Which you > then can use as is your pleasure. > > Greetings, > > Andres Freund > -- Thanks Hubert Zhang
Re: Is there any way to request unique lwlock inside a background worker in PG9.4?
Hi Amit, The section "Shared Memory and LWLocks" describes the AddinShmemInitLock, which is used to protect ShmemInitStruct() when backend workers initialize their shared memory. My requirement is how to protect shared-memory access within the bgworkers (not at the init stage); this lock should be bgworker-specific. On Wed, Oct 17, 2018 at 7:51 PM Amit Kapila wrote: > On Wed, Oct 17, 2018 at 3:49 PM Hubert Zhang wrote: > > > > Hi all, > > > > I want to init SHM in a background worker, which is supported in PG9.4. > Also I need to use lwlock to protect the share memory inside the worker > code. > > > > RequestNamedLWLockTranche is the way to handle it, but it's not > supported in PG 9.4, is there any way I could request an unique lock inside > worker init code in PG 9.4? > > > > You might want to look at section "Shared Memory and LWLocks" in the docs > [1]. > > [1] - https://www.postgresql.org/docs/9.4/static/xfunc-c.html > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Thanks Hubert Zhang
Is there any way to request unique lwlock inside a background worker in PG9.4?
Hi all, I want to init SHM in a background worker, which is supported in PG 9.4. I also need to use an lwlock to protect the shared memory inside the worker code. RequestNamedLWLockTranche is the way to handle this, but it's not supported in PG 9.4; is there any way I could request a unique lwlock inside the worker init code in PG 9.4? -- Thanks Hubert Zhang
Re: Proposal for disk quota feature
> > The quotas or object limits, resource limits are pretty useful and > necessary, but I don't see these like new type of objects, it is much more > some property of current objects. Because we have one syntax for this > purpose I prefer it. SCHEMA and TABLE are OK for me, but as I mentioned before, ROLE is a special case for ALTER ... SET at this moment. TABLE and SCHEMA are both database-level: pg_class and pg_namespace both reside in a single database. But ROLE is cluster-level; roles don't belong to a database. At first glance, ALTER ROLE XXX SET disk_quota = xxx would mean setting the quota for the user on all databases, but in our first-stage design a ROLE's quota is bound to a specific database: e.g. role Jack could have a 10GB quota on database A and a 2GB quota on database B. The SQL syntax is not hard to change, and I don't think it should block the main design of the disk quota feature. Are there any comments on the design and architecture? If not, we'll submit our patch first to drive more discussion. On Sat, Sep 22, 2018 at 3:03 PM Pavel Stehule wrote: > > > so 22. 9. 2018 v 8:48 odesílatel Hubert Zhang napsal: > >> But it looks like redundant to current GUC configuration and limits >> >> what do you mean by current GUC configuration? Is that the general block >> number limit in your patch? If yes, the difference between GUC and >> pg_diskquota catalog is that pg_diskquota will store different quota limit >> for the different role, schema or table instead of a single GUC value. >> > > storage is not relevant in this moment. > > I don't see to consistent to sets some limits via SET command, or ALTER X > SET, and some other with CREATE QUOTA ON. > > The quotas or object limits, resource limits are pretty useful and > necessary, but I don't see these like new type of objects, it is much more > some property of current objects. Because we have one syntax for this > purpose I prefer it. 
Because is not good to have two syntaxes for similar > purpose. > > So instead CREATE DISC QUATA ON SCHEMA xxx some value I prefer > > ALTER SCHEMA xxx SET disc_quota = xxx; > > The functionality is +/- same. But ALTER XX SET was introduce first, and I > don't feel comfortable to have any new syntax for similar purpose > > Regards > > Pavel > > > > > >> >> On Sat, Sep 22, 2018 at 11:17 AM Pavel Stehule >> wrote: >> >>> >>> >>> pá 21. 9. 2018 v 16:21 odesílatel Hubert Zhang >>> napsal: >>> >>>> just fast reaction - why QUOTA object? >>>>> Isn't ALTER SET enough? >>>>> Some like >>>>> ALTER TABLE a1 SET quote = 1MB; >>>>> ALTER USER ... >>>>> ALTER SCHEMA .. >>>>> New DDL commans looks like too hard hammer . >>>> >>>> >>>> It's an option. Prefer to consider quota setting store together: >>>> CREATE DISK QUOTA way is more nature to store quota setting in a >>>> separate pg_diskquota catalog >>>> While ALTER SET way is more close to store quota setting in pg_class, >>>> pg_role, pg_namespace. etc in an integrated way. >>>> (Note that here I mean nature/close is not must, ALTER SET could also >>>> store in pg_diskquota and vice versa.) >>>> >>> >>> I have not a problem with new special table for storing this >>> information. But it looks like redundant to current GUC configuration and >>> limits. Can be messy do some work with ALTER ROLE, and some work via CREATE >>> QUOTE. >>> >>> Regards >>> >>> Pavel >>> >>> >>>> Here are some differences I can think of: >>>> 1 pg_role is a global catalog, not per database level. It's harder to >>>> tracker the user's disk usage in the whole clusters(considering 1000+ >>>> databases). So the semantic of CREATE DISK QUOTA ON USER is limited: it >>>> only tracks the user's disk usage inside the current database. >>>> 2 using separate pg_diskquota could add more field except for quota >>>> limit without adding too many fields in pg_class, e.g. red zone to give the >>>> user a warning or the current disk usage of the db objects. 
>>>> >>>> On Fri, Sep 21, 2018 at 8:01 PM Pavel Stehule >>>> wrote: >>>> >>>>> >>>>> >>>>> pá 21. 9. 2018 v 13:32 odesílatel Hubert Zhang >>>>> napsal: >>>>> >>>>>> >>>>>> >>>>>> >
Re: Proposal for disk quota feature
> > But it looks like redundant to current GUC configuration and limits what do you mean by current GUC configuration? Is that the general block number limit in your patch? If yes, the difference between GUC and pg_diskquota catalog is that pg_diskquota will store different quota limit for the different role, schema or table instead of a single GUC value. On Sat, Sep 22, 2018 at 11:17 AM Pavel Stehule wrote: > > > pá 21. 9. 2018 v 16:21 odesílatel Hubert Zhang napsal: > >> just fast reaction - why QUOTA object? >>> Isn't ALTER SET enough? >>> Some like >>> ALTER TABLE a1 SET quote = 1MB; >>> ALTER USER ... >>> ALTER SCHEMA .. >>> New DDL commans looks like too hard hammer . >> >> >> It's an option. Prefer to consider quota setting store together: >> CREATE DISK QUOTA way is more nature to store quota setting in a separate >> pg_diskquota catalog >> While ALTER SET way is more close to store quota setting in pg_class, >> pg_role, pg_namespace. etc in an integrated way. >> (Note that here I mean nature/close is not must, ALTER SET could also >> store in pg_diskquota and vice versa.) >> > > I have not a problem with new special table for storing this information. > But it looks like redundant to current GUC configuration and limits. Can be > messy do some work with ALTER ROLE, and some work via CREATE QUOTE. > > Regards > > Pavel > > >> Here are some differences I can think of: >> 1 pg_role is a global catalog, not per database level. It's harder to >> tracker the user's disk usage in the whole clusters(considering 1000+ >> databases). So the semantic of CREATE DISK QUOTA ON USER is limited: it >> only tracks the user's disk usage inside the current database. >> 2 using separate pg_diskquota could add more field except for quota limit >> without adding too many fields in pg_class, e.g. red zone to give the user >> a warning or the current disk usage of the db objects. >> >> On Fri, Sep 21, 2018 at 8:01 PM Pavel Stehule >> wrote: >> >>> >>> >>> pá 21. 9. 
2018 v 13:32 odesílatel Hubert Zhang >>> napsal: >>> >>>> >>>> >>>> >>>> >>>> *Hi all,We redesign disk quota feature based on the comments from Pavel >>>> Stehule and Chapman Flack. Here are the new design.OverviewBasically, disk >>>> quota feature is used to support multi-tenancy environment, different level >>>> of database objects could be set a quota limit to avoid over use of disk >>>> space. A common case could be as follows: DBA could enable disk quota on a >>>> specified database list. DBA could set disk quota limit for >>>> tables/schemas/roles in these databases. Separate disk quota worker process >>>> will monitor the disk usage for these objects and detect the objects which >>>> exceed their quota limit. Queries loading data into these “out of disk >>>> quota” tables/schemas/roles will be cancelled.We are currently working at >>>> init implementation stage. We would like to propose our idea firstly and >>>> get feedbacks from community to do quick iteration.SQL Syntax (How to use >>>> disk quota)1 Specify the databases with disk quota enabled in GUC >>>> “diskquota_databases” in postgresql.conf and restart the database.2 DBA >>>> could set disk quota limit for table/schema/role.CREATE DISK QUOTA tablea1 >>>> ON TABLE a1 with (quota = ‘1MB’);CREATE DISK QUOTA roleu1 ON USER u1 with >>>> (quota = ‘1GB’);CREATE DISK QUOTA schemas1 ON SCHEMA s1 with (quota = >>>> ‘3MB’);* >>>> >>> >>> just fast reaction - why QUOTA object? >>> >>> Isn't ALTER SET enough? >>> >>> Some like >>> >>> ALTER TABLE a1 SET quote = 1MB; >>> ALTER USER ... >>> ALTER SCHEMA .. >>> >>> New DDL commans looks like too hard hammer . 
>>> >>> >>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> *3 Simulate a schema out of quota limit case: suppose table a1 and >>>> table a2 are both under schema s1.INSERT INTO a1 SELECT >>>> generate_series(1,1000);INSERT INTO a2 SELECT >>>> generate_series(1,300);SELECT pg_sleep(5)INSERT INTO a1 SELECT >>>> generate_series(1,1000);ERROR: sc
Re: Proposal for disk quota feature
> > just fast reaction - why QUOTA object? > Isn't ALTER SET enough? > Some like > ALTER TABLE a1 SET quote = 1MB; > ALTER USER ... > ALTER SCHEMA .. > New DDL commans looks like too hard hammer . It's an option. My preference is about where the quota settings are stored: the CREATE DISK QUOTA approach maps naturally to storing quota settings in a separate pg_diskquota catalog, while the ALTER ... SET approach is closer to storing quota settings in pg_class, pg_role, pg_namespace etc. in an integrated way. (Note that neither mapping is a must: ALTER ... SET could also store into pg_diskquota, and vice versa.) Here are some differences I can think of: 1. pg_role is a global catalog, not per-database, so it's harder to track a user's disk usage across the whole cluster (considering 1000+ databases). The semantics of CREATE DISK QUOTA ON USER are therefore limited: it only tracks the user's disk usage inside the current database. 2. A separate pg_diskquota could carry more fields than just the quota limit without adding too many fields to pg_class, e.g. a red zone to give the user a warning, or the current disk usage of the db objects. On Fri, Sep 21, 2018 at 8:01 PM Pavel Stehule wrote: > > > pá 21. 9. 2018 v 13:32 odesílatel Hubert Zhang napsal: > >> >> >> >> >> *Hi all,We redesign disk quota feature based on the comments from Pavel >> Stehule and Chapman Flack. Here are the new design.OverviewBasically, disk >> quota feature is used to support multi-tenancy environment, different level >> of database objects could be set a quota limit to avoid over use of disk >> space. A common case could be as follows: DBA could enable disk quota on a >> specified database list. DBA could set disk quota limit for >> tables/schemas/roles in these databases. Separate disk quota worker process >> will monitor the disk usage for these objects and detect the objects which >> exceed their quota limit. 
Queries loading data into these “out of disk >> quota” tables/schemas/roles will be cancelled.We are currently working at >> init implementation stage. We would like to propose our idea firstly and >> get feedbacks from community to do quick iteration.SQL Syntax (How to use >> disk quota)1 Specify the databases with disk quota enabled in GUC >> “diskquota_databases” in postgresql.conf and restart the database.2 DBA >> could set disk quota limit for table/schema/role.CREATE DISK QUOTA tablea1 >> ON TABLE a1 with (quota = ‘1MB’);CREATE DISK QUOTA roleu1 ON USER u1 with >> (quota = ‘1GB’);CREATE DISK QUOTA schemas1 ON SCHEMA s1 with (quota = >> ‘3MB’);* >> > > just fast reaction - why QUOTA object? > > Isn't ALTER SET enough? > > Some like > > ALTER TABLE a1 SET quote = 1MB; > ALTER USER ... > ALTER SCHEMA .. > > New DDL commans looks like too hard hammer . > > > >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> *3 Simulate a schema out of quota limit case: suppose table a1 and table >> a2 are both under schema s1.INSERT INTO a1 SELECT >> generate_series(1,1000);INSERT INTO a2 SELECT >> generate_series(1,300);SELECT pg_sleep(5)INSERT INTO a1 SELECT >> generate_series(1,1000);ERROR: schema's disk space quota exceededDROP >> TABLE a2;SELECT pg_sleep(5)INSERT INTO a1 SELECT >> generate_series(1,1000);INSERT 0 1000ArchitectureDisk quota has the >> following components.1. Quota Setting Store is where the disk quota setting >> to be stored and accessed. We plan to use catalog table pg_diskquota to >> store these information. pg_diskquota is >> like:CATALOG(pg_diskquota,6122,DiskQuotaRelationId){ NameData quotaname; /* >> diskquota name */int16 quotatype; /* diskquota type name */ Oid >> quotatargetoid; /* diskquota target db object oid*/ int32 quotalimit; /* >> diskquota size limit in MB*/ int32 quotaredzone; /* diskquota redzone in >> MB*/} FormData_pg_diskquota;2. Quota Change Detector is the monitor of size >> change of database objects. 
We plan to use stat collector to detect the >> ‘active’ table list at initial stage. But stat collector has some >> limitation on finding the active table which is in a running transaction. >> Details see TODO section.3. Quota Size Checker is where to calculate the >> size and compare with quota limit for database objects. According to >> Pavel’s comment, autovacuum launcher and worker process could be a good >> reference to disk quota. So we plan to use a disk quota launcher daemon >> process and several disk quota worker process to finish this work. Launcher >> process is responsible for starting worker process based on a user defined >> database
Re: Proposal for disk quota feature
% cpu usage and will be finished within 50ms.

Todo/Limitation

Before we propose our patch, we plan to enhance it with the following ideas:

1. Set the list of databases with disk quota enabled dynamically, without restarting the database. Since we have the disk quota launcher process, it could detect changes to the 'diskquota_databases' list and start/stop the corresponding disk quota worker processes.

2. Enforcement while a query is running. Consider the case where 10MB of quota is left but the next query will insert 10GB of data: the current enforcement design will still allow this query to execute. This is a limitation of 'active' table detection being driven by the stat collector, since a postgres backend only sends table stat information to the collector when the transaction ends. We need a new way to detect an 'active' table even while it is being modified inside a running transaction.

3. Monitor an unlimited number of databases. Currently we cap the number of disk quota worker processes at 10 to reduce the impact on normal workloads. But what if we want to monitor the disk quota of more than 10 databases? Our solution is to let the disk quota launcher manage a queue of databases to be monitored; a disk quota worker process consumes the queue and refreshes the disk usage/quota for its database. After some period, the worker returns the database to the queue and fetches the next database from the head of the queue. This period determines the delay in detecting disk quota changes. To implement this, we need to allow a subprocess of postmaster to rebind to another database instead of the one bound in InitPostgres().

4. Support active table detection for VACUUM FULL and VACUUM ANALYZE, which are currently not tracked by the stat collector.

Thanks to Heikki, Pavel Stehule, Chapman Flack for the earlier comments on the disk quota feature. 
Any comments on how to improve disk quota feature are appreciated.* On Mon, Sep 3, 2018 at 12:05 PM, Pavel Stehule wrote: > > > 2018-09-03 3:49 GMT+02:00 Hubert Zhang : > >> Thanks Pavel. >> Your patch did enforcement on storage level(md.c or we could also use >> smgr_extend). It's straight forward. >> But I prefer to implement disk_quota as a feature with following >> objectives: >> 1 set/alter disk quota setting on different database objects, e.g. user, >> database, schema etc. not only a general GUC, but we could set separate >> quota limit for a specific objects. >> 2 enforcement operator should work at two positions: before query is >> running and when query is running. The latter one's implementation maybe >> similar to your patch. >> > > The patch was just example. The resource quotes should be more complex - > per partition, table, schema, database, user - so GUC are possible, but not > very user friendly. > > Our case is specific, but not too much. The servers are used for > multidimensional analyses - and some tables can grow too fast (COPY, INSERT > SELECT). We need to solve limits immediately. The implementation is simple, > so I did it. Same implementation on database level, or schema level needs > some more locks, so it will not be too effective. The resource management > can be complex very complex, and I expect so it will be hard work. > > Regards > > Pavel > > >> On Sun, Sep 2, 2018 at 8:44 PM, Pavel Stehule >> wrote: >> >>> Hi >>> >>> 2018-09-02 14:18 GMT+02:00 Hubert Zhang : >>> >>>> Thanks Chapman. >>>> @Pavel, could you please explain more about your second suggestion >>>> "implement >>>> some quotas on storage level?" >>>> >>> >>> See attached patch - it is very simple - and good enough for our >>> purposes. 
>>> >>> Regards >>> >>> Pavel >>> >>> >>> >>>> We will not keep the long-lived processes attach to all databases(just >>>> like you mentioned servers with thousands of databases) >>>> And you are right, we could share ideas with autovacuum process, fork >>>> worker processes in need. >>>> "autovacuum checks for tables that have had a large number of >>>> inserted, updated or deleted tuples. These checks use the statistics >>>> collection facility" >>>> diskquota process is similar to autovacuum at caring about insert, but >>>> the difference is that it also care about vucuum full, truncate and drop. >>>> While update and delete may not be interested since no file change happens. >>>> So a separate diskquota process is preferred. >>>> >>>> So if we implemented disk quota as a full native feature, and in the >>>> first initial vers
How to get active table within a transaction.
Hi all, I have a requirement to get the active table list in a child process of postmaster. We are already able to get the active table list after the corresponding transaction ends by using the stat collector: each postgres process gathers table change information locally, but only sends the stat info to the collector after the transaction ends (when the backend becomes idle). As an enhancement, we also want to detect an active table while the transaction inserting into it is still in progress. Delay is acceptable. Are there any existing ways in PG to support this? -- Thanks Hubert Zhang
Re: Proposal for disk quota feature
Thanks Pavel. Your patch does enforcement at the storage level (md.c, or we could also use smgr_extend). It's straightforward. But I prefer to implement disk_quota as a feature with the following objectives: 1. Set/alter disk quota settings on different database objects, e.g. user, database, schema etc.: not only a general GUC, but separate quota limits for specific objects. 2. The enforcement operator should work at two positions: before a query runs and while a query is running. The latter's implementation may be similar to your patch. On Sun, Sep 2, 2018 at 8:44 PM, Pavel Stehule wrote: > Hi > > 2018-09-02 14:18 GMT+02:00 Hubert Zhang : > >> Thanks Chapman. >> @Pavel, could you please explain more about your second suggestion >> "implement >> some quotas on storage level?" >> > > See attached patch - it is very simple - and good enough for our purposes. > > Regards > > Pavel > > > >> We will not keep the long-lived processes attach to all databases(just >> like you mentioned servers with thousands of databases) >> And you are right, we could share ideas with autovacuum process, fork >> worker processes in need. >> "autovacuum checks for tables that have had a large number of >> inserted, updated or deleted tuples. These checks use the statistics >> collection facility" >> diskquota process is similar to autovacuum at caring about insert, but >> the difference is that it also care about vucuum full, truncate and drop. >> While update and delete may not be interested since no file change happens. >> So a separate diskquota process is preferred. >> >> So if we implemented disk quota as a full native feature, and in the >> first initial version I prefer to implement the following features: >> 1 Fork diskquota launcher process under Postmaster serverloop, which is >> long-lived. >> 2 Diskquota launcher process is responsible for creating diskquota >> worker process for every database. >> 3 DIskquota setting is stored in a separate catalog table for each >> database. 
>> 4 Initialization stage, Diskquota launcher process creates diskquota worker >> process for all the databases(traverse like autovacuum). Worker process >> calculates disk usage of db objects and their diskquota setting. If any >> db object exceeds its quota limit, put them into the blacklist in the >> shared memory, which will later be used by enforcement operator. Worker >> process exits when works are done. >> 5 Running stage, Diskquota launcher process creates diskquota worker >> process for the database with a large number of insert, copy, truncate, >> drop etc. or create disk quota statement. Worker process updates the file >> size for db objects containing the result relation, and compare with the >> diskquota setting. Again, if exceeds quota limit, put them into blacklist, >> remove from blacklist vice versa. Worker process exits when works are >> done and a GUC could control the frequency of worker process restart to a >> specific database. As you know, this GUC also controls the delay when we do >> enforcement. >> 6 Enforcement. When postgres backend executes queries, check the >> blacklist in shared memory to determine whether the query is allowed(before >> execute) or need rollback(is executing)? >> >> If we implemented disk quota as an extension, we could just use >> background worker to start diskquota launcher process and use >> RegisterDynamicBackgroundWorker() to fork child diskquota worker >> processes by the launcher process as suggested by @Chapman. >> Diskquota setting could be stored in user table in a separate schema for >> each database(Schema and table created by create extension statement) just >> like what Heikki has done in pg_quota project. But in this case, we need to >> create extension for each database before diskquota worker process can be >> set up for that database. >> >> Any comments on the above design and which is preferred, native feature >> or extension as the POC? 
>> >> >> -- Hubert >> >> >> >> On Fri, Aug 31, 2018 at 3:32 AM, Pavel Stehule >> wrote: >> >>> >>> >>> 2018-08-30 16:22 GMT+02:00 Chapman Flack : >>> >>>> On 08/30/2018 09:57 AM, Hubert Zhang wrote: >>>> >>>> > 2 Keep one worker process for each database. But using a parent/global >>>> > quota worker process to manage the lifecycle of database level worker >>>> > processes. It could handle the newly created database(avoid restart >>>> > database) and save resource w
Re: Proposal for disk quota feature
Thanks Chapman. @Pavel, could you please explain more about your second suggestion to "implement some quotas on storage level"? We will not keep long-lived processes attached to all databases (just like the servers with thousands of databases you mentioned). And you are right, we could share ideas with the autovacuum process and fork worker processes as needed. "autovacuum checks for tables that have had a large number of inserted, updated or deleted tuples. These checks use the statistics collection facility" — the diskquota process is similar to autovacuum in caring about inserts, but the difference is that it also cares about VACUUM FULL, TRUNCATE and DROP, while UPDATE and DELETE may not be of interest since no file-size change happens. So a separate diskquota process is preferred.

So if we implemented disk quota as a full native feature, in the first initial version I prefer to implement the following:

1. Fork a diskquota launcher process under the postmaster's server loop, which is long-lived.
2. The diskquota launcher process is responsible for creating a diskquota worker process for every database.
3. Diskquota settings are stored in a separate catalog table for each database.
4. Initialization stage: the diskquota launcher process creates diskquota worker processes for all the databases (traversing like autovacuum). Each worker process calculates the disk usage of db objects and their diskquota settings. If any db object exceeds its quota limit, it is put into the blacklist in shared memory, which will later be used by the enforcement operator. The worker process exits when its work is done.
5. Running stage: the diskquota launcher process creates a diskquota worker process for any database with a large number of INSERT, COPY, TRUNCATE, DROP etc., or a create-disk-quota statement. The worker process updates the file sizes for db objects containing the result relation and compares them with the diskquota settings. Again, objects exceeding their quota limit are put into the blacklist, and removed from it vice versa. 
The worker process exits when its work is done, and a GUC could control how frequently a worker process is restarted for a specific database. As you know, this GUC also controls the delay in enforcement.
6. Enforcement: when a postgres backend executes a query, it checks the blacklist in shared memory to determine whether the query is allowed (before execution) or needs to be rolled back (while executing).

If we implemented disk quota as an extension instead, we could use a background worker to start the diskquota launcher process, and the launcher process could fork child diskquota worker processes with RegisterDynamicBackgroundWorker(), as suggested by @Chapman. Diskquota settings could be stored in a user table in a separate schema for each database (schema and table created by the CREATE EXTENSION statement), just like what Heikki has done in the pg_quota project. But in this case, we would need to create the extension in each database before a diskquota worker process can be set up for that database.

Any comments on the above design, and which is preferred as the POC, native feature or extension? -- Hubert On Fri, Aug 31, 2018 at 3:32 AM, Pavel Stehule wrote: > > > 2018-08-30 16:22 GMT+02:00 Chapman Flack : > >> On 08/30/2018 09:57 AM, Hubert Zhang wrote: >> >> > 2 Keep one worker process for each database. But using a parent/global >> > quota worker process to manage the lifecycle of database level worker >> > processes. It could handle the newly created database(avoid restart >> > database) and save resource when a database is not used. But this needs >> to >> > change worker process to be hierarchical. Postmaster becomes the >> grandfather >> > of database level worker processes in this case. >> >> I am using background workers this way in 9.5 at $work. >> >> In my case, one worker lives forever, wakes up on a set period, and >> starts a short-lived worker for every database, waiting for each >> one before starting the next. >> >> It was straightforward to implement. 
Looking back over the code, >> I see the global worker assigns its own PID to worker.bgw_notify_pid >> of each of its children, and also obtains a handle for each child >> from RegisterDynamicBackgroundWorker(). >> >> I imagine the global quota worker would prefer to start workers >> for every database and then just wait for notifications from any >> of them, but that seems equally straightforward at first glance. >> > > There are servers with thousands databases. Worker per database is not > good idea. > > It should to share ideas, code with autovacuum process. > > Not sure, how to effective implementation based on bg workers can be. On > servers with large set of databases, large set of tables it can identify > too big table too late. > > Isn't better to implement some quotas on storage level? > > Regards > > Pavel > > > >> -Chap >> >> > -- Thanks Hubert Zhang
Proposal for disk quota feature
Hi all, We want to introduce disk quota feature into Postgres. *Why disk quota* *In a multi-tenant environment, there is a requirement to limit the disk quota that database/schema/table can be written or a user can consume for different organizations.* *Meanwhile, other databases such as Oracle, Teradata, DB2 have already supported disk quota feature.* *Heikki has already implemented disk quota feature in Postgres as an extension pg_quota <https://github.com/hlinnaka/pg_quota>. We plan to enhance disk quota feature based on Heikki's implementation.* *Scope* The scope of disk quota feature is to limit the disk usage of database/schema/table objects and users. Here table means heap table, ao table, index table, partition table and associated table( toast table, visible table, large object etc.). Schema disk quota is the disk quota of all the tables in this schema. Database disk quota is the disk quota of all the tables in this database. User's quota is the size of all the tables whose owner are this user. Out of Scope: Note that spill files, xlogs, clogs and logs are not considered for database object level disk quota at this stage. *Design* We propose disk quota with the following components: 1. Quota Setting Store is where the disk quota setting to be stored and accessed. DBA or object owner uses SQL queries to configure the disk quota for each database objects. *2. Quota Change Detector is the monitor of size change of database objects in Postgres. It will write change information to shared memory to notify Quota Size Checker. The implementation of Change Detector could be hooks in smgr_extend/smgr_unlink/smgr_truncate when modifying the size of a heap table. The hooks will write to shared memory in a batched way to reduce the impact on OLTP performance.3. Quota Size Checker is implemented as a worker process. It maintains the current disk usage for each database objects and users, and compare them with settings in Quota Setting Store. 
If it detects the disk usage hit the quota redzone(either upon or below), it will notify Quota Enforcement Operator. 4. Quota Enforcement Operator has two roles: one is to check the disk quota permission before queries are executed(QueryBeforeRun Check), the other is to cancel the running queries when it reaches the disk quota limit dynamically(QueryRunning Check). Quota Enforcement Operator uses shared memory to store the enforcement information to guarantee a quick check. * *To implement the right proof of concept, we want to receive feedback from the community from the following aspects: * Q1. User Interface: when setting a disk quota, Option 1 is to use *insert into quota.config(user1, 10G)* Option 2 is to use UDF *select set_quota("role","user1",10G)* Option 3 is to use native SQL syntax *create disk quota on ROLE user1 10G, or create disk quota on SCHEMA s1 25G;* Q2. Quota Setting Store using user table or catalog? Option 1 is to create a schema called quota for each database and write quota settings into quota.config table, only DBA could modify it. This corresponds to Q1.option1 Option 2 is to store quota settings into the catalog. For Schema and Table write them to database level catalog. For database or user, write them to either database level or global catalog. I personally prefer Q1's option3 and Q2's option2, since it makes disk quota more like a native feature. We could support the quota worker process implementation as an extension for now, but in the long term, disk quota is like a fundamental feature of a database and should be a native feature just like other databases. So store quota conf into catalog and supply native syntax is better. Open Problem We prepare to implement Quota Size Checker as a worker process. Worker process needs to connect to a database to build the disk usage map and quota map(e.g. 
a user/shcema's disk usage on a given database) But one worker process can only bind to one database(InitPostgres(dbname)) at the initialization stage. It results in we need a separate worker process for each database. The solution to this problem is not straightforward, here are some ideas: 1 To make worker process could retrieve and cache information from all the databases. As Tom Lane pointed out that it needs to flush all the database specific thing, like relcache syscache etc. 2 Keep one worker process for each database. But using a parent/global quota worker process to manage the lifecycle of database level worker processes. It could handle the newly created database(avoid restart database) and save resource when a database is not used. But this needs to change worker process to be hierarchical. Postmaster becomes the grandfather of database level worker processes in this case. Any better ideas on it? -- Thanks Hubert Zhang
Is child process of postmaster able to access all the databases?
Hello all. background worker can use SPI to read a database, but it can call BackgroundWorkerInitializeConnection(dbname) only once. I wonder if there is a way to let a child process of postmaster to access all the databases one by one? -- Thanks Hubert Zhang
Re: Considering signal handling in plpython again
Hi Heikki, Not working on it now, you can go ahead. On Fri, Jun 22, 2018 at 12:56 AM, Heikki Linnakangas wrote: > Hi Hubert, > > Are you working on this, or should I pick this up? Would be nice to get > this done as soon as v12 development begins. > > - Heikki > -- Thanks Hubert Zhang
Secured and customizable PLPython and PLR on Postgresql
Hi all, As you know PLPython and PLR are untrusted language, which means only DBA could create these UDFs(very inconvenient). Moreover it's also hard to supply user specific Python/R env for different data scientists. We are working on an open source project to make PLPython and PLR trusted and customizable. The project is called PLContainer, which use Docker Container as the sandbox to avoid malicious user breaking Postgresql database or even the whole host machine. Now there is a basic version which could support PLContainer on PG9.6 (Thanks krait007 and markwort for the contribution). We still have a lot of issues to make it production ready and share with more peoples. [Github umbrella project](https://github.com/greenplum-db/plcontainer/projects/1) If you are interested in it, feel free to try it. Your suggestion and contribution will be appreciated. -- Thanks Hubert Zhang
Re: Considering signal handling in plpython again
There are remaining two problems 1. Do we need to consider when to delete the extension hook or it's not necessary? As the destroy function _PG_fini doesn't work, I cannot find a good place to reset to hook gracefully. I tested the drop language plpythonu statement which will not remove the python shared library in the current session, So it seems to be safe to leave the cancel_handler_hook not be reset. How about other extensions, for example plr. Does the "drop extension" statement will not remove the loaded shared library in the process either? -- Another idea is to register the hook at the beginning of plpython_call_handler and unregister the hook at the end of plpython_call_handler. 2. Do we need to use explicit hook list(List *cancel_hook_list) instead of implicit cancel_hook(which relies on the extension to link the cancel_hook inside their code e.g. prev_hook = cancel_hook; cancel_hook=my_hook; void my_hook(){mywork(); (*prev_hook)();} )? I didn't find any explicit hook list in PG code base, is that a good practice? -- Hubert On Mon, May 14, 2018 at 6:40 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote: > On 14/05/18 10:56, Hubert Zhang wrote: > >> For nested SPI case, one option is to turn off the bool variable when >> entering the SPI function(PLy_spi_prepare, PLy_spi_execute, PLy_cursor >> etc.) >> and turn on the bool variable again when exiting the SPI function. >> > > Yeah, that seems reasonable. > > - Heikki > -- Thanks Hubert Zhang
Re: Considering signal handling in plpython again
For nested SPI case, one option is to turn off the bool variable when entering the SPI function(PLy_spi_prepare, PLy_spi_execute, PLy_cursor etc.) and turn on the bool variable again when exiting the SPI function. If it's OK, we could follow this way to update Heikki's patch. --Hubert On Fri, May 11, 2018 at 9:28 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote: > > > On 11 May 2018 10:01:56 EEST, Hubert Zhang <hzh...@pivotal.io> wrote: > >2. Add a flag in hook function to indicate whether to call > >Py_AddPendingCall. > >This is straightforward.(I prefer it) > > Yeah, that's what I had in mind, too. A global bool variable that's set > when you enter libpython, and cleared on return. Need to handle nesting, > i.e if a PL/python function runs a slow query with SPI, and cancellation > happens during that. And the case that the SPI query calls another > PL/python function. > > - Heikki > -- Thanks Hubert Zhang
Re: Considering signal handling in plpython again
Thanks Heikki and Robert for your comments. I reviewed Heikki's patch and let's enhance it. As Heikki mentioned, there is a problem when no Python code is being executed. I tested it in the following case "select pysleep();" and then type ctrl-c, query cancelled successfully.(Patch works:)) "select * from a, b, c, d;" and then type ctrl-c, query cancelled successfully. "select pysleep();" It will terminate immediately(without type ctrl-c) due to the Py_AddPendingCall is registered on the last query.(The problem Heikki said) To fix this problem, we need to let query like "select * from a, b, c, d;" doesn't call Py_AddPendingCall. There are at least 3 ways. 1. Add a flag in hook function to indicate whether to call the hook function. In current patch the hook function will do two things: a. call Py_AddPendingCall; b. call the previous hook(prev_cancel_pending_hook). So this method will introduce side effect on miss the previous hook. To enhance it, we need to change the hook in postgres.c to hook array. And let hook function only do one thing. 2. Add a flag in hook function to indicate whether to call Py_AddPendingCall. This is straightforward.(I prefer it) 3. Delete the hook after it's been used. This is related to the lifecycle of signal hooks. In current patch the handler hook is registered inside PLy_initialize() which will be called for every plpython_call_handler(). I prefer to just register hook once and in _PG_init() for each extension. If follow this way, delete hook is not needed. Any comments? On Thu, May 10, 2018 at 10:50 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote: > On 10/05/18 09:32, Hubert Zhang wrote: > >> Hi all, >> >> I want to support canceling for a plpython query which may be a busy loop. >> >> I found some discussions on pgsql-hackers 2 years ago. Below is the link. 
>> >> https://www.postgresql.org/message-id/CAFYwGJ3+Xg7EcL2nU-MxX >> 6p+o6c895pm3myz-b+9n9dffeh...@mail.gmail.com >> >> Mario wrote a patch to fix this problem at that time >> *https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc >> 45e1d84fc46aa7c3ab0344c3 >> <https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc >> 45e1d84fc46aa7c3ab0344c3>* >> <https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc >> 45e1d84fc46aa7c3ab0344c3> >> >> The main logic is to register a new signal handler for SIGINT/SIGTERM >> and link the old signal handler in the chain. >> >> static void PLy_python_interruption_handler() >> { >> PyErr_SetString(PyExc_RuntimeError, "test except"); >> return NULL; >> } >> static void >> PLy_handle_interrupt(int sig) >> { >> // custom interruption >> int added = Py_AddPendingCall(PLy_python_interruption_handler, NULL); >> if (coreIntHandler) { >> (*coreIntHandler)(sig); >> } >> } >> >> Does anyone have some comments on this patch? >> > > PostgreSQL assumes to have control of all the signals. Although I don't > foresee any changes in this area any time soon, there's no guarantee that > overriding the SIGINT/SIGTERM will do what you want in the future. Also, > catching SIGINT/SIGTERM still won't react to recovery conflict interrupts. > > In that old thread, I think the conclusion was that we should provide a > hook in the backend for this, rather than override the signal handler > directly. We could then call the hook whenever InterruptPending is set. > No-one got around to write a patch to do that, though. > > As for me, I think handler function should call PyErr_SetInterrupt() >> instead of PyErr_SetString(PyExc_RuntimeError, "test except"); >> > > Hmm. 
I tested that, but because the Python's default SIGINT handler is not > installed, PyErr_SetInterrupt() will actually throw an error: > > TypeError: 'NoneType' object is not callable > > I came up with the attached patch, which is similar to Mario's, but it > introduces a new "hook" for this. > > One little problem remains: The Py_AddPendingCall() call is made > unconditionally in the signal handler, even if no Python code is currently > being executed. The pending call is queued up until the next time you run a > PL/Python function, which could be long after the original statement was > canceled. We need some extra checks to only raise the Python exception, if > the Python interpreter is currently active. > > - Heikki > -- Thanks Hubert Zhang
Considering signal handling in plpython again
Hi all, I want to support canceling for a plpython query which may be a busy loop. I found some discussions on pgsql-hackers 2 years ago. Below is the link. https://www.postgresql.org/message-id/cafywgj3+xg7ecl2nu-mxx6p+o6c895pm3myz-b+9n9dffeh...@mail.gmail.com Mario wrote a patch to fix this problem at that time *https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc45e1d84fc46aa7c3ab0344c3 <https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc45e1d84fc46aa7c3ab0344c3>* <https://github.com/CartoDB/postgres/commit/d587d8a6e4f035cc45e1d84fc46aa7c3ab0344c3> The main logic is to register a new signal handler for SIGINT/SIGTERM and link the old signal handler in the chain. static void PLy_python_interruption_handler() { PyErr_SetString(PyExc_RuntimeError, "test except"); return NULL; } static void PLy_handle_interrupt(int sig) { // custom interruption int added = Py_AddPendingCall(PLy_python_interruption_handler, NULL); if (coreIntHandler) { (*coreIntHandler)(sig); } } Does anyone have some comments on this patch? As for me, I think handler function should call PyErr_SetInterrupt() instead of PyErr_SetString(PyExc_RuntimeError, "test except"); -- Thanks Hubert Zhang