Re: [HACKERS] Printing bitmap objects in the debugger
On Wed, Sep 14, 2016 at 5:31 PM, Pavan Deolasee wrote:
> On Wed, Sep 14, 2016 at 3:46 PM, Pavan Deolasee wrote:
>>
>> lately I'm using the LLDB debugger (which probably does not have something
>> equivalent),
>
> And I was so clueless about lldb's powerful scripting interface. For
> example, you can write something like this in bms_utils.py:
>
> import lldb
>
> def print_bms_members (bms):
>     words = bms.GetChildMemberWithName("words")
>     nwords = int(bms.GetChildMemberWithName("nwords").GetValue())
>
>     ret = 'nwords = {0} bitmap: '.format(nwords,)
>     for i in range(0, nwords):
>         ret += hex(int(words.GetChildAtIndex(i, lldb.eNoDynamicValues,
>             True).GetValue()))
>
>     return ret

Thanks a lot for digging into it.

> And then do this while attached to the lldb debugger:
>
> Process 99659 stopped
> * thread #1: tid = 0x59ba69, 0x0001090b012f
> postgres`bms_add_member(a=0x7fe60a0351f8, x=10) + 15 at bitmapset.c:673,
> queue = 'com.apple.main-thread', stop reason = breakpoint 1.1
> frame #0: 0x0001090b012f
> postgres`bms_add_member(a=0x7fe60a0351f8, x=10) + 15 at bitmapset.c:673
>    670      int         wordnum,
>    671                  bitnum;
>    672
> -> 673      if (x < 0)
>    674          elog(ERROR, "negative bitmapset member not allowed");
>    675      if (a == NULL)
>    676          return bms_make_singleton(x);
> (lldb) script
> Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D.
> >>> from bms_utils import *
> >>> bms = lldb.frame.FindVariable("a")
> >>> print print_bms_members(bms)
> nwords = 1 bitmap: 0x200

I can get that kind of output with "p *bms" and "p/x (or p/b) *bms->words@(bms->nwords)" in gdb. But I can certainly extend the script you wrote above to print more meaningful output similar to outBitmapset(). But then this would be specific to lldb. GDB too seems to have a similar Python interface (https://sourceware.org/gdb/wiki/PythonGdbTutorial), so I can probably use the above script, with some modifications, with GDB as well.
A Python script will be easier to maintain than a patch that needs to be applied and compiled. That said, I am not sure if every debugger supported on every platform we support has these features. Or maybe developers work only on those platforms which have such powerful debuggers, so it's OK. Every debugger usually has a much easier way to call a function and print its output, so having a function like the one I have in the patch makes things easy for all the debuggers, and maybe for developers not familiar with Python.

> The complete API reference is available here
> http://lldb.llvm.org/python_reference/index.html
>
> Looks like an interesting SoC project to write useful lldb/gdb scripts to
> print internal structures for ease of debugging :-)

+1, if we can include something like that in the repository so as to avoid every developer maintaining a script of his/her own.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
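[Editorial note: for reference, the core decoding loop such a pretty-printer needs — turning the Bitmapset's words[] array into a list of set member numbers, similar to what outBitmapset() prints — can be sketched in plain Python. The 32-bit BITS_PER_BITMAPWORD value and the function name are illustrative assumptions; a real gdb/lldb script would read the fields through the debugger API shown above rather than taking a plain list.]

```python
BITS_PER_BITMAPWORD = 32  # assumed; matches bitmapword on typical builds

def bms_members(words):
    """Given the words[] array of a Bitmapset as plain integers,
    return the set member numbers, lowest first."""
    members = []
    for wordnum, word in enumerate(words):
        for bitnum in range(BITS_PER_BITMAPWORD):
            if word & (1 << bitnum):
                members.append(wordnum * BITS_PER_BITMAPWORD + bitnum)
    return members

print(bms_members([0x200]))     # -> [9]
print(bms_members([0b101, 1]))  # -> [0, 2, 32]
```

A debugger script would call something like this after extracting nwords and each words[i] via SBValue (lldb) or gdb.Value (gdb), then format the result however outBitmapset() would.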
Re: [HACKERS] Hash Indexes
On Thu, Sep 15, 2016 at 4:44 AM, Jeff Janes wrote:
> On Tue, May 10, 2016 at 5:09 AM, Amit Kapila wrote:
>>
>> Although I don't think it is a very good idea to take any performance
>> data with a WIP patch, still I couldn't resist myself from doing so, and
>> below are the performance numbers. To get the performance data, I have
>> dropped the primary key constraint on pgbench_accounts and created a hash
>> index on the aid column as below.
>>
>> alter table pgbench_accounts drop constraint pgbench_accounts_pkey;
>> create index pgbench_accounts_pkey on pgbench_accounts using hash(aid);
>
> To be rigorously fair, you should probably replace the btree primary key
> with a non-unique btree index and use that in the btree comparison case. I
> don't know how much difference that would make, probably none at all for a
> read-only case.
>
>> Below data is for a read-only pgbench test and is a median of 3 5-min runs.
>> The performance tests are executed on a POWER8 machine.
>
> With pgbench -S where everything fits in shared_buffers and the number of
> cores I have at my disposal, I am mostly benchmarking interprocess
> communication between pgbench and the backend. I am impressed that you can
> detect any difference at all.
>
> For this type of thing, I like to create a server-side function for use in
> benchmarking:
>
> create or replace function pgbench_query(scale integer, size integer)
> RETURNS integer AS $$
> DECLARE
>     sum integer default 0;
>     amount integer;
>     account_id integer;
> BEGIN
>     FOR i IN 1..size LOOP
>         account_id = 1 + floor(random() * scale);
>         SELECT abalance INTO STRICT amount FROM pgbench_accounts
>             WHERE aid = account_id;
>         sum := sum + amount;
>     END LOOP;
>     return sum;
> END $$ LANGUAGE plpgsql;
>
> And then run using a command like this:
>
> pgbench -f <(echo 'select pgbench_query(40,1000)') -c$j -j$j -T 300
>
> Where the first argument ('40', here) must be manually set to the same value
> as the scale-factor.
> With 8 cores and 8 clients, the values I get are, for btree, hash-head,
> hash-concurrent, hash-concurrent-cache, respectively:
>
> 598.2
> 577.4
> 668.7
> 664.6
>
> (each transaction involves 1000 select statements)
>
> So I do see that the concurrency patch is quite an improvement. The cache
> patch does not produce a further improvement, which was somewhat surprising
> to me (I thought that that patch would really shine in a read-write
> workload, but I expected at least improvement in read only)

To see the benefit from the cache meta page patch, you might want to test with more clients than the number of cores; at least that is what the data by Mithun [1] indicates, or probably on a somewhat larger machine.

[1] - https://www.postgresql.org/message-id/CAD__OugX0aOa7qopz3d-nbBAoVmvSmdFJOX4mv5tFRpijqH47A%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Thu, Sep 15, 2016 at 4:04 AM, Jeff Janes wrote:
> On Tue, Sep 13, 2016 at 9:31 AM, Jeff Janes wrote:
>>
>> ===
>>
>> +Vacuum acquires cleanup lock on bucket to remove the dead tuples and or tuples
>> +that are moved due to split. The need for cleanup lock to remove dead tuples
>> +is to ensure that scans' returns correct results. Scan that returns multiple
>> +tuples from the same bucket page always restart the scan from the previous
>> +offset number from which it has returned last tuple.
>>
>> Perhaps it would be better to teach scans to restart anywhere on the page,
>> than to force more cleanup locks to be taken?
>
> Commenting on one of my own questions:
>
> This won't work when the vacuum removes the tuple which an existing scan is
> currently examining and thus will be used to re-find its position when it
> realizes it is not visible and so takes up the scan again.
>
> The index tuples in a page are stored sorted just by hash value, not by the
> combination of (hash value, tid). If they were sorted by both, we could
> re-find our position even if the tuple had been removed, because we would
> know to start at the slot adjacent to where the missing tuple would be were
> it not removed. But unless we are willing to break pg_upgrade, there is no
> feasible way to change that now.

I think it is possible without breaking pg_upgrade, if we match all items of a page at once (and save them as a local copy), rather than matching item-by-item as we do now. We are already doing something similar for btree; refer to the explanation of BTScanPosItem and BTScanPosData in nbtree.h.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
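[Editorial note: the re-find idea Jeff describes — resuming at the slot adjacent to where a removed tuple would have been — only works if the page order is a total order. A small standalone sketch (plain Python, with a hypothetical (hash, tid) tuple representation) of why (hash value, tid) ordering makes this possible where hash-only ordering does not:]

```python
import bisect

def refind_position(page_items, last_returned):
    """page_items: list of (hash_value, tid) pairs in sorted order.
    Return the index at which to resume scanning after last_returned,
    even if last_returned itself has since been removed by vacuum."""
    # bisect_right finds the slot just past where last_returned would
    # sit in the total order, whether or not it is still present.
    return bisect.bisect_right(page_items, last_returned)

# Page sorted by (hash, tid); resume after (5, 102):
page = [(3, 900), (5, 101), (5, 102), (5, 103), (8, 200)]
print(refind_position(page, (5, 102)))  # -> 3

# After vacuum removes (5, 102), the same call still resumes correctly
# at the adjacent slot, i.e. at (5, 103):
page_after_vacuum = [(3, 900), (5, 101), (5, 103), (8, 200)]
print(refind_position(page_after_vacuum, (5, 102)))  # -> 2
```

With hash-only ordering, the three tid entries for hash 5 are interchangeable, so there is no adjacent slot to resume from once the remembered tuple is gone — which is exactly why the page-at-a-time copy (as btree does with BTScanPosData) is the alternative that avoids both the ordering change and the extra cleanup locks.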
Re: [HACKERS] Error running custom plugin: “output plugins have to declare the _PG_output_plugin_init symbol”
On Wed, Sep 14, 2016 at 4:03 PM, valeriof wrote:
> Hi, I'm kind of new to Postgres and I'm trying to create a custom output
> plugin for logical replication (Postgres is version 9.5.4 and I'm building
> the project on a Windows 8/64-bit machine - the same machine where the db is
> installed).
> I started from the code of the sample test_decoding, and I was trying to
> simply rebuild it under a new name and install it into Postgres to see if
> the module works. Once done, I will start modifying the code.
>
> My project is built in Visual Studio 2013 and the only steps I took were to
> copy the built assembly under the Postgres lib folder. When I run the command:
>
> postgres=# SELECT * FROM
> pg_create_logical_replication_slot('regression_slot', 'my_decoding');
>
> I get an error saying that "output plugins have to declare the
> _PG_output_plugin_init symbol".

The error comes from LoadOutputPlugin(), when the call to load_external_function() fails. load_external_function() tries to search for the given function in the given file. The file name is the same as the plugin name. So it may be that it doesn't find a file with the my_decoding library in the installation. If the test_decoding plugin is working in your case, please check whether you can find the my_decoding library in the same location. If not, that's the problem.

> In my code, I have the function declared as extern:
>
> extern void _PG_init(void);
> extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);
>
> and the body of the function is defined somewhere below it.
>
> The actual code is just a copy of the test_decoding sample:
> https://github.com/postgres/postgres/blob/REL9_5_STABLE/contrib/test_decoding/test_decoding.c
>
> I'm not sure if the deployment process is ok (just copying the dll) or if
> there is some other step to take. Can anyone shed some light?

It's hard to tell what's wrong exactly without seeing the changes you have made.
But it looks like, while copying the test_decoding directory, you have forgotten to replace some instance(s) of test_decoding with my_decoding.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] make async slave to wait for lsn to be replayed
On Thu, Sep 1, 2016 at 2:16 AM, Ivan Kartyshov wrote:
> Hi hackers,
>
> A few days ago I finished my work on the WAITLSN statement utility, so I’d
> like to share it.
> [...]
> Your feedback is welcome!
>
> [waitlsn_10dev.patch]

Hi Ivan,

Thanks for working on this. Here are some general thoughts on the feature, and an initial review.

+1 for this feature. Explicitly waiting for a given commit to be applied is one of several approaches to achieve "causal consistency" for reads on replica nodes, and I think it will be very useful if combined with a convenient way to get the values to wait for when you run COMMIT. This could be used either by applications directly, or by middleware that somehow keeps track of dependencies between transactions and inserts waits.

I liked the way Heikki Linnakangas imagined this feature[1]:

    BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;

... or perhaps it could be spelled like this:

    BEGIN [isolation stuff] WAIT FOR COMMIT TOKEN ... TIMEOUT ...;

That allows waiting only at the start of a transaction, whereas your idea of making a utility command would allow a single READ COMMITTED transaction to wait multiple times for transactions it has heard about through side channels, which may be useful. Perhaps we could support the same syntax in a standalone statement inside a transaction OR as part of a BEGIN ... statement. Being able to do it as part of BEGIN means that you can use this feature for single-snapshot transactions, ie REPEATABLE READ and SERIALIZABLE (of course you can't use SERIALIZABLE on hot standbys yet, but that'll be fixed one day). Otherwise you'd be waiting for the LSN in the middle of your transaction but not be able to see the result because you don't take a new snapshot. Or maybe it's enough to use a standalone WAIT ... statement inside a REPEATABLE READ or SERIALIZABLE transaction as long as it's the first statement, and it should be an error to do so any time later?
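[Editorial note: the causal-consistency pattern under discussion — commit on the primary, hand an opaque token to the client, wait on the standby until that token has been replayed before reading — can be sketched abstractly. This is a toy Python simulation, not the patch's implementation: the token is just an LSN-like integer, and all names are illustrative.]

```python
import threading

class Standby:
    """Toy model of a standby that applies WAL and lets readers wait
    until a given commit token has been replayed."""
    def __init__(self):
        self.last_replayed = 0
        self.cond = threading.Condition()

    def replay(self, token):
        with self.cond:
            self.last_replayed = max(self.last_replayed, token)
            self.cond.notify_all()  # wake waiters, like WaitLSNSetLatch()

    def wait_for_token(self, token, timeout_s):
        """Return True once the token is visible, False on timeout."""
        with self.cond:
            return self.cond.wait_for(
                lambda: self.last_replayed >= token, timeout=timeout_s)

standby = Standby()
# Replay of commit token 1234 arrives a moment later, on another thread:
threading.Timer(0.05, standby.replay, args=(1234,)).start()
print(standby.wait_for_token(1234, timeout_s=2))    # -> True
print(standby.wait_for_token(9999, timeout_s=0.01)) # -> False (timed out)
```

The point of the "opaque token" argument is that the client only ever passes the value back to wait_for_token; nothing in the client needs to know whether it is an LSN, an XID, or a future CSN.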
I think working in terms of LSNs or XIDs explicitly is not a good idea: encouraging clients to think in terms of anything other than opaque 'commit tokens' seems like a bad idea because it limits future changes. For example, there is ongoing discussion about introducing CSNs (commit sequence numbers), and there are some related concepts lurking in the SSI code; maybe we'd want to use those one day. Do you think it would make sense to have a concept of a commit token that is a non-analysable string as far as clients are concerned, so that clients are not encouraged to do anything at all with them except use them in a WAIT FOR COMMIT TOKEN statement?

INITIAL FEEDBACK ON THE PATCH

I didn't get as far as testing or detailed review because it has some obvious bugs and compiler warnings which I figured we should talk about first, and I also have some higher-level questions about the design.

gram.y:12882:15: error: assignment makes pointer from integer without a cast [-Werror=int-conversion]
    n->delay = $3;

It looks like struct WaitLSNStmt accidentally has 'delay' as a pointer to int. Perhaps you want an int? Maybe it would be useful to include the units (millisecond, ms) in the name?

waitlsn.c: In function 'WLDisownLatch':
waitlsn.c:82:2: error: suggest parentheses around assignment used as truth value [-Werror=parentheses]
    if (MyBackendId = state->backend_maxid)
        ^~

Pretty sure you want == here.

waitlsn.c: In function 'WaitLSNUtility':
waitlsn.c:153:17: error: initialization makes integer from pointer without a cast [-Werror=int-conversion]
    int tdelay = delay;
                 ^

Another place where I think you wanted an int but used a pointer to int? To fix that warning you need tdelay = *delay, but I think delay should really not be taken by pointer at all.
@@ -6922,6 +6923,11 @@ StartupXLOG(void)
+	/*
+	 * After update lastReplayedEndRecPtr set Latches in SHMEM array
+	 */
+	WaitLSNSetLatch();
+

I think you should try to do this only after commit records are replayed, not after every record. Only commit records can make transactions visible, and the raison d'être for this feature is to let users wait for transactions they know about to become visible. You probably can't do it directly in xact_redo_commit though, because at that point XLogCtl->lastReplayedEndRecPtr hasn't been updated yet, so a backend that wakes up might not see that it has advanced and go back to sleep. It is updated in the StartupXLOG loop after the redo function runs. That is the reason why WalRcvForceReply() is called from there rather than in xact_redo_commit, to implement remote_apply for replication. Perhaps you need something similar?

+	tdelay -= (GetCurrentTimestamp() - timer);

You can't do arithmetic with TimestampTz like this. Depending on the configure option --disable-integer-datetimes (which controls the macro HAVE_INT64_TIMESTAMP), it may be a floating-point number of seconds since the epoch, or an integer number of microseconds since the epoch. It looks
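[Editorial note: to illustrate the arithmetic problem being pointed out, the remaining delay must come from an explicit unit conversion, not from subtracting raw timestamp values whose units depend on build options. A standalone sketch of the safe pattern, using integer microsecond timestamps as in the HAVE_INT64_TIMESTAMP representation (plain Python; names are illustrative):]

```python
USECS_PER_MSEC = 1000

def remaining_delay_ms(delay_ms, start_usec, now_usec):
    """Return the milliseconds still left to wait, clamped at zero,
    given a timeout in ms and two microsecond-resolution timestamps."""
    elapsed_ms = (now_usec - start_usec) // USECS_PER_MSEC
    return max(delay_ms - elapsed_ms, 0)

# Waiting 5000 ms, 1.2 s already elapsed:
print(remaining_delay_ms(5000, 10_000_000, 11_200_000))  # -> 3800
# Timer overrun: clamp to zero rather than going negative.
print(remaining_delay_ms(5000, 10_000_000, 16_000_000))  # -> 0
```

In the actual C code one would use the backend's timestamp helpers (e.g. a TimestampDifference-style call) rather than open-coded subtraction, precisely so the representation stays hidden.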
Re: [HACKERS] Logical Replication WIP
On 14 September 2016 at 04:56, Petr Jelinek wrote:
> Not sure what you mean by negotiation. Why would that be needed? You know
> the server version when you connect, and when you know that you also know
> what capabilities that version of Postgres has. If you send an unrecognized
> option you get a corresponding error.

Right, because we can rely on the server version = the logical replication version now. All good.

-- 
Craig Ringer
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH v2] Add overflow checks to money type input function
On 9/9/16 3:19 AM, Fabien COELHO wrote:
>> I have updated the patch with additional tests and comments per your
>> review. Final(?) version attached.
>
> Applied on head, make check ok. No more comments on the code, which does
> what it says. I'm fine with this patch.

Pushed, thanks.

-- 
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
On Thu, Sep 15, 2016 at 9:44 AM, Peter Eisentraut wrote:
> On 9/12/16 11:16 PM, Michael Paquier wrote:
>>> I don't think tar file output in pg_basebackup needs to be fsynced.
>>> It's just a backup file like what pg_dump produces, and we don't fsync
>>> that either. The point of this change is to leave a data directory in
>>> a synced state equivalent to what initdb leaves behind.
>> Here I don't agree. The point of the patch is to make sure that what
>> gets generated by pg_basebackup is consistent on disk, for all the
>> formats. It seems weird to me that we could say that the plain format
>> makes things consistent while the tar format may cause some files to
>> be lost in case of power outage. The docs make it clear I think:
>> +By default, pg_basebackup will wait for all files
>> +to be written safely to disk.
>> But perhaps this should directly mention that this is done for all the
>> formats?
>
> That doesn't really explain why we fsync.

Data durability, particularly on ext4, as has been discussed a couple of months back [1]. In case of a crash it could be perfectly possible to lose files, hence we need to be sure that the files themselves are flushed, as well as their parent directory, to prevent any problems. I think that we should protect users' backups as much as we can, in the range we can.

> If we think that all
> "important" files should be fsynced, why aren't we doing it in pg_dump?

pg_dump should do it where it can; see thread [2]. I am tackling problems one at a time, and that's also a reason why I would like us to have a common set of routines in src/common or src/fe_utils to help improve this handling.

> Or psql, or server-side copy? Similarly, there is no straightforward
> mechanism by which you can unpack the tar file generated by
> pg_basebackup and get the unpacked directory fsynced properly. (I
> suppose initdb -S should be recommended.)

Yes, those are cases that you cannot cover.
Imagine for example pg_basebackup's tar or pg_dump data sent to stdout. There is nothing we can actually do for all those cases. However, what we can do is give a set of options making it possible to get consistent backups for users.

[1] Silent data loss on ext4: https://www.postgresql.org/message-id/56583bdd.9060...@2ndquadrant.com
[2] Data durability: https://www.postgresql.org/message-id/20160327233033.gd20...@awork2.anarazel.de

-- 
Michael
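[Editorial note: the durable-write pattern Michael describes — flush the file itself and then its parent directory so that the directory entry also survives a crash — looks like this on a POSIX system. This is a standalone Python sketch for illustration; PostgreSQL's frontend code does the equivalent in C with its own fsync helper routines.]

```python
import os
import tempfile

def durable_write(path, data):
    """Write data to path and fsync both the file and its parent
    directory, so neither the contents nor the name->inode entry is
    lost if power fails just after the call returns."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # flush file contents to stable storage
    finally:
        os.close(fd)
    # Also flush the parent directory: the new directory entry lives
    # there, and on ext4 it is not durable until the directory itself
    # is fsynced.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)

tmpdir = tempfile.mkdtemp()
durable_write(os.path.join(tmpdir, "backup_piece.tar"), b"dummy tar contents")
```

This is exactly the step that cannot be performed when the output goes to stdout: there is no file descriptor, and no known parent directory, for the receiving end.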
Re: [HACKERS] Printing bitmap objects in the debugger
On 2016/09/15 0:04, Tom Lane wrote:
> Alvaro Herrera writes:
>> I don't understand. Why don't you just use "call pprint(the bitmapset)"
>> in the debugger?
>
> Bitmapsets aren't Nodes, so pprint doesn't work directly on them.
> I usually find that I can pprint some node containing the value(s)
> I'm interested in, but maybe that isn't working for Ashutosh's
> particular case.

There are many loose (ie, not inside any Node) Relids variables within the optimizer code. Perhaps Ashutosh ended up needing to look at those a lot.

Thanks,
Amit
Re: [HACKERS] pg_basebackup, pg_receivexlog and data durability (was: silent data loss with ext4 / all current versions)
On 9/12/16 11:16 PM, Michael Paquier wrote:
>> I don't think tar file output in pg_basebackup needs to be fsynced.
>> It's just a backup file like what pg_dump produces, and we don't fsync
>> that either. The point of this change is to leave a data directory in
>> a synced state equivalent to what initdb leaves behind.
> Here I don't agree. The point of the patch is to make sure that what
> gets generated by pg_basebackup is consistent on disk, for all the
> formats. It seems weird to me that we could say that the plain format
> makes things consistent while the tar format may cause some files to
> be lost in case of power outage. The docs make it clear I think:
> +By default, pg_basebackup will wait for all files
> +to be written safely to disk.
> But perhaps this should directly mention that this is done for all the
> formats?

That doesn't really explain why we fsync. If we think that all "important" files should be fsynced, why aren't we doing it in pg_dump? Or psql, or server-side copy? Similarly, there is no straightforward mechanism by which you can unpack the tar file generated by pg_basebackup and get the unpacked directory fsynced properly. (I suppose initdb -S should be recommended.)

-- 
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] pg_basebackup wish list
On Thu, Sep 15, 2016 at 9:26 AM, Peter Eisentraut wrote:
> On 9/13/16 7:24 PM, Michael Paquier wrote:
>> PostgresNode.pm had better use the new --noclean option in its calls,
>> the new default is not useful for debugging.
>
> We don't do it for initdb either. Is that a problem?

Right. In case of failure, it may prove useful for debugging if we used it.

-- 
Michael
Re: [HACKERS] pg_basebackup wish list
On 9/13/16 7:24 PM, Michael Paquier wrote:
> PostgresNode.pm had better use the new --noclean option in its calls,
> the new default is not useful for debugging.

We don't do it for initdb either. Is that a problem?

-- 
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] OpenSSL 1.1 breaks configure and more
On 09/15/2016 02:03 AM, Andreas Karlsson wrote:
> On 09/12/2016 06:51 PM, Heikki Linnakangas wrote:
>> Changes since last version:
>>
>> * Added more error checks to the my_BIO_s_socket() function. Check for
>> NULL result from malloc(). Check the return code of BIO_meth_set_*()
>> functions; looking at OpenSSL sources, they always succeed, but all the
>> test/example programs that come with OpenSSL do check them.
>>
>> * Use BIO_get_new_index() to get the index number for the wrapper BIO.
>>
>> * Also call BIO_meth_set_puts(). It was missing in previous patch versions.
>>
>> * Fixed src/test/ssl test suite to also work with OpenSSL 1.1.0.
>>
>> * Changed all references (in existing code) to SSLEAY_VERSION_NUMBER
>> into OPENSSL_VERSION_NUMBER, for consistency.
>>
>> * Squashed all into one patch.
>>
>> I intend to apply this to all supported branches, so please have a look!
>> This is now against REL9_6_STABLE, but there should be little difference
>> between branches in the code that this touches.
>
> This patch no longer seems to apply to head after the removed support of
> 0.9.6. Is that intentional?

Never mind. I just failed at reading.

Now for a review: It looks generally good, but I think I saw one error. In fe-secure-openssl.c your code still calls SSL_library_init() in OpenSSL 1.1. I think it should be enough to just call OPENSSL_init_ssl(OPENSSL_INIT_LOAD_CONFIG, NULL) like you do in be-secure.

Andreas
Re: [HACKERS] palloc() too large on pg_buffercache with large shared_buffers
> On Wed, Sep 14, 2016 at 12:13 AM, Kouhei Kaigai wrote:
>> It looks to me pg_buffercache tries to allocate more than 1GB using
>> palloc(), when shared_buffers is more than 256GB.
>>
>> # show shared_buffers;
>>  shared_buffers
>>  280GB
>> (1 row)
>>
>> # SELECT buffers, d.datname, coalesce(c.relname, '???')
>>     FROM (SELECT count(*) buffers, reldatabase, relfilenode
>>             FROM pg_buffercache group by reldatabase, relfilenode) b
>>     LEFT JOIN pg_database d ON d.oid = b.reldatabase
>>     LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database
>>                                       WHERE datname = current_database())
>>                          AND b.relfilenode = pg_relation_filenode(c.oid)
>>     ORDER BY buffers desc;
>> ERROR: invalid memory alloc request size 1174405120
>>
>> It is a situation to use MemoryContextAllocHuge(), instead of palloc().
>> Also, it may need a back patching?
>
> I guess so. Although it's not very desirable for it to use that much
> memory, I suppose if you have a terabyte of shared_buffers you
> probably have 4GB of memory on top of that to show what they contain.

Exactly. I found this problem when someone asked me why shared_buffers=280GB is slower than shared_buffers=128MB to scan a 350GB table. As I expected, most of the shared buffers were not in use, and it also reduced the amount of free memory usable for the page cache.

Thanks,
-- 
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei
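[Editorial note: the failing request size is easy to check arithmetically — pg_buffercache snapshots one fixed-size record per shared buffer, so the allocation grows linearly with shared_buffers. A back-of-the-envelope sketch in plain Python; the 32-byte per-buffer record size is an inference from the reported error (the exact size depends on the pg_buffercache version):]

```python
def snapshot_alloc_size(shared_buffers_bytes, block_size=8192,
                        record_size=32):
    """Bytes pg_buffercache would request to snapshot every buffer."""
    nbuffers = shared_buffers_bytes // block_size
    return nbuffers * record_size

GB = 1024 ** 3
PALLOC_LIMIT = GB - 1  # plain palloc() rejects requests of 1GB or more

for sb_gb in (128, 256, 280):
    size = snapshot_alloc_size(sb_gb * GB)
    print(sb_gb, size, size > PALLOC_LIMIT)
# With a 32-byte record, 280GB of shared_buffers gives exactly
# 1,174,405,120 bytes -- matching the ERROR above -- and the request
# crosses the 1GB palloc() limit right at shared_buffers = 256GB,
# matching the report.
```

This is why the suggested fix is MemoryContextAllocHuge(), which bypasses the 1GB sanity limit that ordinary palloc() enforces.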
Re: [HACKERS] OpenSSL 1.1 breaks configure and more
On 09/12/2016 06:51 PM, Heikki Linnakangas wrote:
> Changes since last version:
>
> * Added more error checks to the my_BIO_s_socket() function. Check for
> NULL result from malloc(). Check the return code of BIO_meth_set_*()
> functions; looking at OpenSSL sources, they always succeed, but all the
> test/example programs that come with OpenSSL do check them.
>
> * Use BIO_get_new_index() to get the index number for the wrapper BIO.
>
> * Also call BIO_meth_set_puts(). It was missing in previous patch versions.
>
> * Fixed src/test/ssl test suite to also work with OpenSSL 1.1.0.
>
> * Changed all references (in existing code) to SSLEAY_VERSION_NUMBER
> into OPENSSL_VERSION_NUMBER, for consistency.
>
> * Squashed all into one patch.
>
> I intend to apply this to all supported branches, so please have a look!
> This is now against REL9_6_STABLE, but there should be little difference
> between branches in the code that this touches.

This patch no longer seems to apply to head after the removed support of 0.9.6. Is that intentional?

Andreas
Re: [HACKERS] Tracking timezone abbreviation removals in the IANA tz database
Robert Haas writes:
> I suggest that if you (or someone else who understands it, if there is
> any such person) were willing to either improve the existing
> documentation or maybe even write a whole new chapter on how we do
> time zone handling, it would be most welcome.

Well, I'll put that on my to-do list, but it's a very long list. And TBH, I thought the SGML docs around this were adequate already, so I might not be in the best position to rewrite them to satisfy you.

			regards, tom lane
Re: [HACKERS] Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)
Andres Freund writes:
> On 2016-09-12 19:35:22 -0400, Tom Lane wrote:
>> Anyway I'll draft a prototype and then we can compare.
> Ok, cool.

Here's a draft patch that is just meant to investigate what the planner changes might look like if we do it in the pipelined-result way. Accordingly, I didn't touch the executor, but just had it emit regular Result nodes for SRF-execution steps. However, the SRFs are all guaranteed to appear at top level of their respective tlists, so that those Results could be replaced with something that works like nodeFunctionscan.

A difficulty with this restriction is that if you have a query like "select f1, generate_series(1,2) / 10 from tab" then you end up with both a SRF-executing Result and a separate scalar-projection Result above it, because the division-by-ten has to happen in a separate plan level. The planner's notions about the cost of Result make it think that this is quite expensive --- mainly because the upper Result will be iterated once per SRF output row, so that you get hit with cpu_tuple_cost per output row. And that in turn leads it to change plans in one or two cases in the regression tests.

Maybe that's fine. I'm worried though that it's right that this will be unduly expensive. So I'm kind of tempted to define the SRF-executing node as acting more like, say, Agg or WindowFunc, in that it has a list of SRFs to execute and then it has the ability to project a scalar tlist on top of those results. That would likely save some cycles at execution, and it would also eliminate a few of the planner warts seen below, like the rule about not pushing a new scalar tlist down onto a SRF-executing Result. I'd have to rewrite split_pathtarget_at_srfs(), because it'd be implementing quite different rules about how to refactor targetlists, but that's not a big problem.

On the whole I'm pretty pleased with this approach, at least from the point of view of the planner.
The net addition of planner code is smaller than what you had, and though I'm no doubt biased, I think this version is much cleaner. Also, though this patch doesn't address exactly how we might do it, it's fairly clear that it'd be possible to allow FDWs and CustomScans to implement SRF execution, eg pushing a SRF down to a foreign server, in a reasonably straightforward extension of the existing upper-pathification hooks. If we go with the lateral function RTE approach, that's going to be somewhere between hard and impossible. So I think we should continue investigating this way of doing things.

I'll try to take a look at the executor end of it tomorrow. However I'm leaving Friday for a week's vacation, and may not have anything to show before that.

			regards, tom lane

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 7e092d7..9052273 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outProjectionPath(StringInfo str, const
*** 1817,1822 ****
--- 1817,1823 ----
  	WRITE_NODE_FIELD(subpath);
  	WRITE_BOOL_FIELD(dummypp);
+ 	WRITE_BOOL_FIELD(srfpp);
  }
  
  static void
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 47158f6..7c59c3d 100644
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
*** create_projection_plan(PlannerInfo *root
*** 1421,1428 ****
  	Plan	   *subplan;
  	List	   *tlist;
  
! 	/* Since we intend to project, we don't need to constrain child tlist */
! 	subplan = create_plan_recurse(root, best_path->subpath, 0);
  
  	tlist = build_path_tlist(root, &best_path->path);
--- 1421,1441 ----
  	Plan	   *subplan;
  	List	   *tlist;
  
! 	/*
! 	 * XXX Possibly-temporary hack: if the subpath is a dummy ResultPath,
! 	 * don't bother with it, just make a Result with no input. This avoids an
! 	 * extra Result plan node when doing "SELECT srf()". Depending on what we
! 	 * decide about the desired plan structure for SRF-expanding nodes, this
! 	 * optimization might have to go away, and in any case it'll probably look
! 	 * a good bit different.
! 	 */
! 	if (IsA(best_path->subpath, ResultPath) &&
! 		((ResultPath *) best_path->subpath)->path.pathtarget->exprs == NIL &&
! 		((ResultPath *) best_path->subpath)->quals == NIL)
! 		subplan = NULL;
! 	else
! 		/* Since we intend to project, we don't need to constrain child tlist */
! 		subplan = create_plan_recurse(root, best_path->subpath, 0);
  
  	tlist = build_path_tlist(root, &best_path->path);
*** create_projection_plan(PlannerInfo *root
*** 1441,1448 ****
  	 * creation, but that would add expense to creating Paths we might end up
  	 * not using.)
  	 */
! 	if (is_projection_capable_path(best_path->subpath) ||
! 		tlist_same_exprs(tlist, subplan->targetlist))
  	{
  		/* Don't need a separate Result, just assign tlist to subplan */
  		plan = subplan;
--- 1454,1462 ----
  	 * creation, but that would
Re: [HACKERS] Tracking timezone abbreviation removals in the IANA tz database
On Fri, Sep 2, 2016 at 8:55 AM, Tom Lane wrote: > I had thought that this wouldn't really affect us because PG's > interpretations of zone abbreviations are driven by the info in > timezone/tznames/ rather than the IANA files in timezone/data/. > I forgot however that the "dynamic abbreviation" code relies on > being able to find matching abbreviations in the IANA data. > For example, timezone/tznames/Default lists > > NOVST Asia/Novosibirsk # Novosibirsk Summer Time (obsolete) > > but that zone abbreviation now fails entirely: > > # select '2016-09-02 08:00:00 NOVST'::timestamptz; > ERROR: time zone abbreviation "novst" is not used in time zone > "Asia/Novosibirsk" > > (This is also the cause of bug #14307.) > > So the question is what to do about it. > > An easy answer is to just delete such entries from the timezone/tznames/ > files altogether. If we believe the IANA crew's assumption that nobody > uses these abbreviations in the real world, then that seems like it > removes useless maintenance overhead for little cost. I'm a little > worried though that somebody might have followed IANA's lead and actually > started using these abbreviations, in which case we'd get complaints. > > Another possibility is to keep these entries but get rid of the dynamic > zone aliases, reducing them to plain numeric UTC offsets. Probably the > value to use would be whatever was shown as the most recent active value > in the IANA list ... although that's open to interpretation. For > instance, according to the above we'd likely define NOVT as UTC+7, > but then logically NOVST ought to be UTC+8, which doesn't match up > with the fact that when it actually last appeared in the IANA data > it meant UTC+7. So that sounds like it'd be a bit of a mess involving > some judgment calls. > > In the end, part of the reason we've got these files is so that users > can make their own decisions about abbreviation meanings. 
So the > ultimate answer to any complaints is going to be "if you think NOVT > means thus-and-such then put in an entry that says so". > > So I'm leaning to the just-remove-it answer for any deleted abbreviation > that relies on a dynamic definition. Names that never had more than one > UTC offset can remain in the tznames list. > > Comments? I tried to understand these emails today and mostly failed. I think this stuff needs to be better documented. src/timezones/tznames/README pretty much presumes that you more or less know what's going on: # This directory contains files with timezone sets for PostgreSQL. The problem # is that time zone abbreviations are not unique throughout the world and you # might find out that a time zone abbreviation in the `Default' set collides # with the one you wanted to use. This can be fixed by selecting a timezone # set that defines the abbreviation the way you want it. There might already # be a file here that serves your needs. If not, you can create your own. Of course, it doesn't bother to define the term 'timezone set', a phrase that is not used in any other file in the source tree. Eventually, after some poking around, I figured out that this is described in the documentation under "B.3. Date/Time Configuration Files", but I didn't notice that when I searched the documentation for it. I only figured it out after I actually looked at the contents of "src/timezone/tznames/Default" and saw that it recommended looking at the "Date/Time Support", which I had done, but which I gave up on doing pretty quickly because there was no section whose title included the phrase "time zone". Once I read that section, I ran across this: # The offset is the equivalent offset in seconds from UTC, positive being east # from Greenwich and negative being west. For example, -18000 would be five # hours west of Greenwich, or North American east coast standard time. 
D # indicates that the zone name represents local daylight-savings time rather # than standard time. Alternatively, a time_zone_name can be given, in which # case that time zone definition is consulted, and the abbreviation's meaning # in that zone is used. I dimly understand that "the abbreviation's meaning in that zone" is referring to an EST vs. EDT type of distinction, but a novice reader of this documentation will probably not understand that from this language. Similarly, your email mentions "dynamic zone aliases" but: [rhaas pgsql]$ git grep -i 'dynamic zone alias' [rhaas pgsql]$ git grep -i 'dynamic zone' [rhaas pgsql]$ git grep -i 'zone alias' [rhaas pgsql]$ I'm guessing that's referring to the same thing as the above documentation paragraph, wherein a "time_zone_name" is given, but it doesn't really refer to the time zone name, but rather the DST or non-DST configuration of that time zone. But I'm not sure I'm right about that. Another thing that's not very well-documented is the use of our own tzdata vs. the system's tzdata. There is a paragraph on it in the installation
Re: [HACKERS] Logical Replication WIP
On 14/09/16 21:53, Andres Freund wrote: Hi, On 2016-09-14 21:17:42 +0200, Petr Jelinek wrote: +/* + * Gather Relations based on the provided RangeVar list. + * The gathered tables are locked in access share lock mode. + */ Why access share? Shouldn't we make this ShareUpdateExclusive or similar, to prevent schema changes? Hm, I thought AccessShare would be enough to prevent schema changes that matter to us (which is basically just drop afaik). Doesn't e.g. dropping an index matter as well? Drop of primary key matters I guess. + if (strcmp($1, "replicate_insert") == 0) + $$ = makeDefElem("replicate_insert", + (Node *)makeInteger(TRUE), @1); + else if (strcmp($1, "noreplicate_insert") == 0) + $$ = makeDefElem("replicate_insert", + (Node *)makeInteger(FALSE), @1); + else if (strcmp($1, "replicate_update") == 0) + $$ = makeDefElem("replicate_update", + (Node *)makeInteger(TRUE), @1); + else if (strcmp($1, "noreplicate_update") == 0) + $$ = makeDefElem("replicate_update", + (Node *)makeInteger(FALSE), @1); + else if (strcmp($1, "replicate_delete") == 0) + $$ = makeDefElem("replicate_delete", + (Node *)makeInteger(TRUE), @1); + else if (strcmp($1, "noreplicate_delete") == 0) + $$ = makeDefElem("replicate_delete", + (Node *)makeInteger(FALSE), @1); + else + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), +errmsg("unrecognized publication option \"%s\"", $1), + parser_errposition(@1))); + } + ; I'm kind of inclined to do this checking at execution (or transform) time instead. That allows extensions to add options, and handle them in utility hooks. That's an interesting point; I prefer the parsing to be done in gram.y, but it might be worth moving it for extensibility. Although there are so far other barriers for that. Citus uses the lack of such check for COPY to implement copy over its distributed tables for example. So there's some benefit. Yeah I am not saying that I am fundamentally against it, I am just saying it won't help all that much probably. 
+ check_subscription_permissions(); + + rel = heap_open(SubscriptionRelationId, RowExclusiveLock); + + /* Check if name is used */ + subid = GetSysCacheOid2(SUBSCRIPTIONNAME, MyDatabaseId, + CStringGetDatum(stmt->subname)); + if (OidIsValid(subid)) + { + ereport(ERROR, + (errcode(ERRCODE_DUPLICATE_OBJECT), +errmsg("subscription \"%s\" already exists", + stmt->subname))); + } + + /* Parse and check options. */ + parse_subscription_options(stmt->options, _given, , + , ); + + /* TODO: improve error messages here. */ + if (conninfo == NULL) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), +errmsg("connection not specified"))); Probably also makes sense to parse the conninfo here to verify it looks sane. Although that's fairly annoying to do, because the relevant code is libpq :( Well the connection is eventually used (in later patches) so maybe that's not a problem. Well, it's nicer if it's immediately parsed, before doing complex and expensive stuff, especially if that happens outside of the transaction. Maybe; it's not too hard to add another function to libpqwalreceiver I guess. diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index 65230e2..f3d54c8 100644 --- a/src/backend/nodes/copyfuncs.c +++
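The early sanity check Andres asks for would in practice go through libpq's PQconninfoParse(). As a rough illustration of the shape of that check, here is a minimal Python sketch of validating a conninfo string as space-separated key=value pairs; the function name `validate_conninfo` and the error texts are invented for this example, and real code should rely on libpq rather than reimplementing the syntax:

```python
import shlex

def validate_conninfo(conninfo):
    # Minimal sketch: a libpq-style conninfo string is a sequence of
    # space-separated key=value pairs, where values may be quoted.
    # This only checks the overall shape, the way an early call to
    # PQconninfoParse() would; it does not validate key names.
    if not conninfo or not conninfo.strip():
        raise ValueError("connection not specified")
    opts = {}
    for token in shlex.split(conninfo):
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError("invalid conninfo fragment: %r" % token)
        opts[key] = value
    return opts
```

Running this against an obviously malformed string fails immediately, which is the point of doing the parse before any expensive setup work.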
Re: [HACKERS] Logical Replication WIP
On 14/09/16 18:21, Andres Freund wrote: (continuing, uh, a bit happier) On 2016-09-09 00:59:26 +0200, Petr Jelinek wrote: +/* + * Relcache invalidation callback for our relation map cache. + */ +static void +logicalreprelmap_invalidate_cb(Datum arg, Oid reloid) +{ + LogicalRepRelMapEntry *entry; + + /* Just to be sure. */ + if (LogicalRepRelMap == NULL) + return; + + if (reloid != InvalidOid) + { + HASH_SEQ_STATUS status; + + hash_seq_init(, LogicalRepRelMap); + + /* TODO, use inverse lookup hastable? */ *hashtable + while ((entry = (LogicalRepRelMapEntry *) hash_seq_search()) != NULL) + { + if (entry->reloid == reloid) + entry->reloid = InvalidOid; can't we break here? Probably. +/* + * Initialize the relation map cache. + */ +static void +remoterelmap_init(void) +{ + HASHCTL ctl; + + /* Make sure we've initialized CacheMemoryContext. */ + if (CacheMemoryContext == NULL) + CreateCacheMemoryContext(); + + /* Initialize the hash table. */ + MemSet(, 0, sizeof(ctl)); + ctl.keysize = sizeof(uint32); + ctl.entrysize = sizeof(LogicalRepRelMapEntry); + ctl.hcxt = CacheMemoryContext; Wonder if this (and similar code earlier) should try to do everything in a sub-context of CacheMemoryContext instead. That'd make some issues easier to track down. Sure. don't see why not. +/* + * Open the local relation associated with the remote one. + */ +static LogicalRepRelMapEntry * +logicalreprel_open(uint32 remoteid, LOCKMODE lockmode) +{ + LogicalRepRelMapEntry *entry; + boolfound; + + if (LogicalRepRelMap == NULL) + remoterelmap_init(); + + /* Search for existing entry. */ + entry = hash_search(LogicalRepRelMap, (void *) , + HASH_FIND, ); + + if (!found) + elog(FATAL, "cache lookup failed for remote relation %u", +remoteid); + + /* Need to update the local cache? 
*/ + if (!OidIsValid(entry->reloid)) + { + Oid nspid; + Oid relid; + int i; + TupleDesc desc; + LogicalRepRelation *remoterel; + + remoterel = >remoterel; + + nspid = LookupExplicitNamespace(remoterel->nspname, false); + if (!OidIsValid(nspid)) + ereport(FATAL, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), +errmsg("the logical replication target %s not found", + quote_qualified_identifier(remoterel->nspname, remoterel->relname; + relid = get_relname_relid(remoterel->relname, nspid); + if (!OidIsValid(relid)) + ereport(FATAL, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), +errmsg("the logical replication target %s not found", + quote_qualified_identifier(remoterel->nspname, + remoterel->relname; + + entry->rel = heap_open(relid, lockmode); This seems rather racy. I think this really instead needs something akin to RangeVarGetRelidExtended(). Maybe, I am not sure if it really matters here given how it's used, but I can change that. +/* + * Executor state preparation for evaluation of constraint expressions, + * indexes and triggers. + * + * This is based on similar code in copy.c + */ +static EState * +create_estate_for_relation(LogicalRepRelMapEntry *rel) +{ + EState *estate; + ResultRelInfo *resultRelInfo; + RangeTblEntry *rte; + + estate = CreateExecutorState(); + + rte = makeNode(RangeTblEntry); + rte->rtekind = RTE_RELATION; + rte->relid = RelationGetRelid(rel->rel); + rte->relkind = rel->rel->rd_rel->relkind; + estate->es_range_table = list_make1(rte); + + resultRelInfo = makeNode(ResultRelInfo); + InitResultRelInfo(resultRelInfo, rel->rel, 1, 0); + + estate->es_result_relations = resultRelInfo; + estate->es_num_result_relations = 1; + estate->es_result_relation_info = resultRelInfo; + + /* Triggers might need a slot */ + if (resultRelInfo->ri_TrigDesc) + estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate); + + return estate; +} Ugh, we do this
Re: [HACKERS] Hash Indexes
On Tue, May 10, 2016 at 5:09 AM, Amit Kapila wrote: > > > Although, I don't think it is a very good idea to take any performance > data with WIP patch, still I couldn't resist myself from doing so and below > are the performance numbers. To get the performance data, I have dropped > the primary key constraint on pgbench_accounts and created a hash index on > aid column as below. > > alter table pgbench_accounts drop constraint pgbench_accounts_pkey; > create index pgbench_accounts_pkey on pgbench_accounts using hash(aid); > To be rigorously fair, you should probably replace the btree primary key with a non-unique btree index and use that in the btree comparison case. I don't know how much difference that would make, probably none at all for a read-only case. > > > Below data is for read-only pgbench test and is a median of 3 5-min runs. > The performance tests are executed on a power-8 m/c. > With pgbench -S where everything fits in shared_buffers and the number of cores I have at my disposal, I am mostly benchmarking interprocess communication between pgbench and the backend. I am impressed that you can detect any difference at all. For this type of thing, I like to create a server side function for use in benchmarking: create or replace function pgbench_query(scale integer, size integer) RETURNS integer AS $$ DECLARE sum integer default 0; amount integer; account_id integer; BEGIN FOR i IN 1..size LOOP account_id=1+floor(random()*scale); SELECT abalance into strict amount FROM pgbench_accounts WHERE aid = account_id; sum := sum + amount; END LOOP; return sum; END $$ LANGUAGE plpgsql; And then run using a command like this: pgbench -f <(echo 'select pgbench_query(40,1000)') -c$j -j$j -T 300 Where the first argument ('40', here) must be manually set to the same value as the scale-factor. 
With 8 cores and 8 clients, the values I get are, for btree, hash-head, hash-concurrent, hash-concurrent-cache, respectively: 598.2 577.4 668.7 664.6 (each transaction involves 1000 select statements) So I do see that the concurrency patch is quite an improvement. The cache patch does not produce a further improvement, which was somewhat surprising to me (I thought that that patch would really shine in a read-write workload, but I expected at least some improvement in read only). I've run this with 128MB shared_buffers and scale factor 40. Not everything fits in shared_buffers, but it quite easily fits in RAM, and there is no meaningful IO caused by the benchmark. Cheers, Jeff
Re: [HACKERS] kqueue
On Thu, Sep 15, 2016 at 10:48 AM, Keith Fiske wrote: > Thomas Munro brought up in #postgresql on freenode needing someone to test a > patch on a larger FreeBSD server. I've got a pretty decent machine (3.1Ghz > Quad Core Xeon E3-1220V3, 16GB ECC RAM, ZFS mirror on WD Red HDD) so offered > to give it a try. > > Bench setup was: > pgbench -i -s 100 -d postgres > > I ran this against 96rc1 instead of HEAD like most of the others in this > thread seem to have done. Not sure if that makes a difference and can re-run > if needed. > With higher concurrency, this seems to cause decreased performance. You can > tell which of the runs is the kqueue patch by looking at the path to > pgbench. Thanks Keith. So to summarise, you saw no change with 1 client, but with 4 clients you saw a significant drop in performance (~93K TPS -> ~80K TPS), and a smaller drop for 64 clients (~72K TPS -> ~68K TPS). These results seem to be a nail in the coffin for this patch for now. Thanks to everyone who tested. I might be back in a later commitfest if I can figure out why and how to fix it. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] kqueue
On Wed, Sep 14, 2016 at 9:09 AM, Matteo Beccati wrote: > Hi, > > On 14/09/2016 00:06, Tom Lane wrote: > >> I'm inclined to think the kqueue patch is worth applying just on the >> grounds that it makes things better on OS X and doesn't seem to hurt >> on FreeBSD. Whether anyone would ever get to the point of seeing >> intra-kernel contention on these platforms is hard to predict, but >> we'd be ahead of the curve if so. >> >> It would be good for someone else to reproduce my results though. >> For one thing, 5%-ish is not that far above the noise level; maybe >> what I'm measuring here is just good luck from relocation of critical >> loops into more cache-line-friendly locations. >> > > FWIW, I've tested HEAD vs patch on a 2-cpu low end NetBSD 7.0 i386 machine. > > HEAD: 1890/1935/1889 tps > kqueue: 1905/1957/1932 tps > > no weird surprises, and basically no differences either. > > > Cheers > -- > Matteo Beccati > > Development & Consulting - http://www.beccati.com/ Thomas Munro brought up in #postgresql on freenode needing someone to test a patch on a larger FreeBSD server. I've got a pretty decent machine (3.1Ghz Quad Core Xeon E3-1220V3, 16GB ECC RAM, ZFS mirror on WD Red HDD) so offered to give it a try. Bench setup was: pgbench -i -s 100 -d postgres I ran this against 96rc1 instead of HEAD like most of the others in this thread seem to have done. Not sure if that makes a difference and can re-run if needed. With higher concurrency, this seems to cause decreased performance. You can tell which of the runs is the kqueue patch by looking at the path to pgbench. SINGLE PROCESS [keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1_kqueue/bin/pgbench -T 60 -j 1 -c 1 -M prepared -S postgres -p 5496 starting vacuum...end. 
transaction type: scaling factor: 100 query mode: prepared number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 1547387 latency average: 0.039 ms tps = 25789.750236 (including connections establishing) tps = 25791.018293 (excluding connections establishing) [keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1_kqueue/bin/pgbench -T 60 -j 1 -c 1 -M prepared -S postgres -p 5496 starting vacuum...end. transaction type: scaling factor: 100 query mode: prepared number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 1549442 latency average: 0.039 ms tps = 25823.981255 (including connections establishing) tps = 25825.189871 (excluding connections establishing) [keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1_kqueue/bin/pgbench -T 60 -j 1 -c 1 -M prepared -S postgres -p 5496 starting vacuum...end. transaction type: scaling factor: 100 query mode: prepared number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 1547936 latency average: 0.039 ms tps = 25798.572583 (including connections establishing) tps = 25799.917170 (excluding connections establishing) [keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1/bin/pgbench -T 60 -j 1 -c 1 -M prepared -S postgres -p 5496 starting vacuum...end. transaction type: scaling factor: 100 query mode: prepared number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 1520722 latency average: 0.039 ms tps = 25343.122533 (including connections establishing) tps = 25344.357116 (excluding connections establishing) [keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1/bin/pgbench -T 60 -j 1 -c 1 -M prepared -S postgres -p 5496~ starting vacuum...end. 
transaction type: scaling factor: 100 query mode: prepared number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 1549282 latency average: 0.039 ms tps = 25821.107595 (including connections establishing) tps = 25822.407310 (excluding connections establishing) [keith@corpus /tank/pgdata]$ /home/keith/pgsql96rc1/bin/pgbench -T 60 -j 1 -c 1 -M prepared -S postgres -p 5496~ starting vacuum...end. transaction type: scaling factor: 100 query mode: prepared number of clients: 1 number of threads: 1 duration: 60 s number of transactions actually processed: 1541907 latency average: 0.039 ms tps = 25698.025983 (including connections establishing) tps = 25699.270663 (excluding connections establishing) FOUR /home/keith/pgsql96rc1_kqueue/bin/pgbench -T 60 -j 4 -c 4 -M prepared -S postgres -p 5496 starting vacuum...end. transaction type: scaling factor: 100 query mode: prepared number of clients: 4 number of threads: 4 duration: 60 s number of transactions actually processed: 4282185 latency average: 0.056 ms tps = 71369.146931 (including connections establishing) tps = 71372.646243 (excluding connections establishing) [keith@corpus ~/postgresql-9.6rc1_kqueue]$ /home/keith/pgsql96rc1_kqueue/bin/pgbench -T 60 -j 4 -c 4 -M prepared -S postgres -p 5496 starting vacuum...end. transaction type: scaling factor: 100 query mode: prepared number of clients: 4 number of threads: 4 duration: 60
Re: [HACKERS] PATCH: Avoid use of __attribute__ when building with old Sun compiler versions
On Fri, Sep 2, 2016 at 11:55 AM, Andy Grundman wrote: > The use of __attribute__ is currently enabled by checking for > __SUNPRO_C. However, this compiler only added support for > __attribute__ keywords in version 5.10 [1]. This patch adds a version > check to only enable this for supported versions [2]. > > I have tested this with Sun C 5.8 on a Solaris 9 SPARC system, > building 9.5.4 (+OpenSSL 1.1.0 patches). Wow, so we'd need this if building on a compiler released before June 2007. Is this the only change that's needed to build with the older compiler? Can you set up a buildfarm member for this platform? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On Tue, Sep 13, 2016 at 9:31 AM, Jeff Janes wrote: > === > > +Vacuum acquires cleanup lock on bucket to remove the dead tuples and or > tuples > +that are moved due to split. The need for cleanup lock to remove dead > tuples > +is to ensure that scans' returns correct results. Scan that returns > multiple > +tuples from the same bucket page always restart the scan from the previous > +offset number from which it has returned last tuple. > > Perhaps it would be better to teach scans to restart anywhere on the page, > than to force more cleanup locks to be taken? > Commenting on one of my own questions: This won't work when the vacuum removes the tuple which an existing scan is currently examining and thus will be used to re-find its position when it realizes it is not visible and so takes up the scan again. The index tuples in a page are stored sorted just by hash value, not by the combination of (hash value, tid). If they were sorted by both, we could re-find our position even if the tuple had been removed, because we would know to start at the slot adjacent to where the missing tuple would be were it not removed. But unless we are willing to break pg_upgrade, there is no feasible way to change that now. Cheers, Jeff
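Jeff's point about re-finding a scan position can be illustrated with a toy model. If the tuples on a page were kept sorted by (hash value, tid), a binary search for the last-returned pair lands on the correct adjacent slot even after VACUUM has removed that exact tuple. This is a hypothetical sketch, not PostgreSQL's page layout code:

```python
import bisect

def refind_position(page, last_hash, last_tid):
    # Resume just past the (hash, tid) pair the scan last returned.
    # Because the page is sorted by the full pair, this works even if
    # that tuple has been removed in the meantime: we land on the slot
    # adjacent to where it would have been.
    return bisect.bisect_right(page, (last_hash, last_tid))

# A toy page, sorted by (hash value, tid).
page = [(10, 1), (10, 3), (20, 2), (20, 5), (30, 4)]

pos = refind_position(page, 20, 2)        # next slot after our tuple
page.remove((20, 2))                      # VACUUM removes it meanwhile
pos_after = refind_position(page, 20, 2)  # still resumes at (20, 5)
```

With hash-only ordering, all tuples with hash 20 are interchangeable as far as the sort order goes, so no such unambiguous re-find is possible — which is why the scan must instead hold its position via the cleanup-lock protocol.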
Re: [HACKERS] [BUGS] BUG #14244: wrong suffix for pg_size_pretty()
On 15/09/16 03:45, Robert Haas wrote: On Wed, Sep 14, 2016 at 5:22 AM, Thomas Berger wrote: Today, i found the time to read all the mails in this thread, and i think i have to explain, why we decided to open a bug for this behavior. On Tuesday, 23. August 2016, 13:30:29 Robert Haas wrote: J. Random User: I'm having a problem! Mailing List: Gee, how big are your tables? J. Random User: Here's some pg_size_pretty output. Mailing List: Gosh, we don't know what that means, what do you have this obscure GUC set to? J. Random User: Maybe I'll just give up on SQL and use MongoDB. In fact, we had it just the other way around. One of our most critical databases had some extreme bloat. Some of our internal customers were very confused about the sizes reported by the database. This confusion has led to wrong decisions. (And a long discussion about the choice of DBMS btw) I think there is a point missing in this whole discussion, or i just didn't see it: Yeah, the behavior of "kB" is defined in the "postgresql.conf" documentation. But no _user_ reads this. There is no link or hint in the documentation of "pg_size_pretty()" [1]. Interesting. I think that our documentation should only describe the way we use unit suffixes in one central place, but other places (like pg_size_pretty) could link to that central place. I don't believe that there is any general unanimity among users or developers about the question of which suffixes denote units denominated in units of 2^10 bytes vs. 10^3 bytes. About once a year, somebody makes an argument that we're doing it wrong, but the evidence that I've seen is very mixed. So when people say that there is only one right way to do this and we are not in compliance with that one right way, I guess I just don't believe it. Not everybody likes the way we do it, but I am fairly sure that if we change it, we'll make some currently-unhappy people happy and some currently-happy people unhappy. 
And the people who don't care but wanted to preserve backward compatibility will all be in the latter camp. However, that is not to say that the documentation couldn't be better. Well, I started programming in 1968, and was taught that 1 kilobyte was 1024 (2^10) bytes. I object to Johnny-come-latelies who try and insist it is 1000. As regards 'kB' vs 'KB', I'm not too worried either way - I think consistency is more important. Cheers, Gavin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
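The convention under debate — pg_size_pretty's suffixes being binary (2^10) multiples, so "1 kB" means 1024 bytes — can be sketched in a few lines. This is a simplified illustration only: the real C function uses different rounding thresholds (it switches units around 10×1024 of the smaller unit), so the exact cutoffs here are assumptions:

```python
def size_pretty(size):
    # Sketch of pg_size_pretty's unit convention: suffixes denote
    # binary (2^10) multiples, so "1 kB" is 1024 bytes -- the
    # interpretation this thread is debating.  Thresholds are
    # simplified relative to the real function's rounding rules.
    for limit, unit in ((1 << 40, "TB"), (1 << 30, "GB"),
                        (1 << 20, "MB"), (1 << 10, "kB")):
        if size >= limit:
            return "%d %s" % (round(size / limit), unit)
    return "%d bytes" % size
```

Under this convention size_pretty(1024) reads "1 kB", whereas under a decimal (10^3) convention 1024 bytes would be "1.0 kB" with kB meaning 1000 — which is exactly the ambiguity the thread says the documentation should spell out in one central place.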
Re: [HACKERS] Choosing parallel_degree
On Wed, Sep 14, 2016 at 3:54 PM, Simon Riggs wrote: >> I do think this comment is confusing: >> >> + *This value is not locked by the transaction, so this value may >> + *be changed while a SELECT that has used these values for planning >> + *is still executing. >> >> I don't know what it means for "this value" to be locked, or not >> locked, by the transaction. Basically, I have no idea what this is >> trying to explain. > > You're quoting that without context from the line above, which is > "get_tablespace_io_concurrency" Sure, but it doesn't make any sense to talk about tablespace_io_concurrency being locked by a transaction. At least not that I can see. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] What is the posix_memalign() equivalent for the PostgreSQL?
On Fri, Sep 2, 2016 at 1:17 PM, Andres Freund wrote: > On 2016-09-02 13:05:37 -0400, Tom Lane wrote: >> Anderson Carniel writes: >> > If not, according to your experience, is there a >> > significance difference between the performance of the O_DIRECT or not? >> >> AFAIK, nobody's really bothered to measure whether that would be useful >> for Postgres. The results would probably be quite platform-specific >> anyway. > > I've played with patches to make postgres use O_DIRECT. On linux, it's > rather beneficial for some workloads (fits into memory), but it also > works really badly for some others, because our IO code isn't > intelligent enough. We pretty much rely on write() being nearly > instantaneous when done by normal backends (during buffer replacement), > we rely on readahead, we rely on the kernel to stopgap some bad > replacement decisions we're making. So, suppose we changed the world so that backends don't write dirty buffers, or at least not normally. If they need to perform a buffer eviction, they first check the freelist, then run the clock sweep. The clock sweep puts clean buffers on the freelist and dirty buffers on a to-be-cleaned list. A background process writes buffers on the to-be-cleaned list and then adds them to the freelist afterward if the usage count hasn't been bumped meanwhile. As in Amit's bgreclaimer patch, we have a target size for the freelist, with a low watermark and a high watermark. When we drop below the low watermark, the background processes run the clock sweep and write from the to-be-cleaned list to try to populate it; when we surge above the high watermark, they go back to sleep. Further, suppose we also create a prefetch system, maybe based on the synchronous scan machinery. It preemptively pulls data into shared_buffers if an ongoing scan will need it soon. 
Or maybe don't base it on the synchronous scan machinery, but instead just have a queue that lets backends throw prefetch requests over the wall; when the queue wraps, old requests are discarded. A background process - or perhaps one per tablespace or something like that - pull the data in. Neither of those things seems that hard. And if we could do those things and make them work, then maybe we could offer direct I/O as an option. We'd still lose heavily in the case where our buffer eviction decisions are poor, but that'd probably spur some improvement in that area, which IMHO would be a good thing. I personally think direct I/O would be a really good thing, not least because O_ATOMIC is designed to allow MySQL to avoid double buffering, their alternative to full page writes. But we can't use it because it requires O_DIRECT. The savings are probably massive. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
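The freelist-with-watermarks design sketched above can be modelled in a few lines. This is a toy illustration of the proposal, not PostgreSQL code — the class and method names (`BufferPool`, `background_sweep`, `evict`) are invented, and the real design would have the sweep run in a separate background process:

```python
from collections import deque

class BufferPool:
    # Toy model of the proposed design: backends take clean buffers
    # from a freelist; a clock sweep gives usage-counted buffers a
    # second chance, moves clean zero-usage buffers to the freelist,
    # and queues dirty ones on a to-be-cleaned list for a background
    # writer.  The sweep refills the freelist up to a high watermark
    # and is triggered when it drops to a low watermark.
    def __init__(self, nbuffers, low=2, high=4):
        self.dirty = [False] * nbuffers
        self.usage = [0] * nbuffers
        self.hand = 0
        self.freelist = deque()
        self.to_clean = deque()
        self.low, self.high = low, high

    def background_sweep(self):
        while len(self.freelist) < self.high:
            i, self.hand = self.hand, (self.hand + 1) % len(self.dirty)
            if self.usage[i] > 0:
                self.usage[i] -= 1         # second-chance decrement
            elif self.dirty[i]:
                self.to_clean.append(i)    # background writer flushes later
            else:
                self.freelist.append(i)    # clean and cold: reusable now

    def evict(self):
        # A backend needing a victim buffer never writes anything
        # itself; it just takes the head of the freelist, poking the
        # sweep if the list has drained to the low watermark.
        if len(self.freelist) <= self.low:
            self.background_sweep()
        return self.freelist.popleft()
```

In the model, an eviction with an empty freelist triggers one sweep pass that separates dirty buffers (queued for the writer) from clean ones (immediately reusable), which is the property that would let O_DIRECT work without backends stalling on their own write() calls.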
Re: [HACKERS] pageinspect: Hash index support
On Tue, Aug 30, 2016 at 10:06 AM, Alvaro Herrera wrote: > Jesper Pedersen wrote: > > Hi, > > > > Attached is a patch that adds support for hash indexes in pageinspect. > > > > The functionality (hash_metap, hash_page_stats and hash_page_items) > follows > > the B-tree functions. > > I suggest that pageinspect functions are more convenient to use via the > get_raw_page interface, that is, instead of reading the buffer > themselves, the buffer is handed over from elsewhere and they receive it > as bytea. This enables other use cases such as grabbing a page from one > server and examining it in another one. > I've never needed to do that, but it does sound like it might be useful. But it is also annoying to have to do that when you want to examine a current server. Could we use overloading, so that it can be used in either way? For hash_page_items, the 'data' is always a 32 bit integer, isn't it? (unlike other pageinspect features, where the data could be any user-defined data type). If so, I'd rather see it in plain hex (without the spaces, without the trailing zeros) + /* page type (flags) */ + if (opaque->hasho_flag & LH_UNUSED_PAGE) + stat->type = 'u'; This can never be true, because LH_UNUSED_PAGE is zero. You should make this be the fall-through case, and have LH_META_PAGE be explicitly tested for. Cheers, Jeff
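Jeff's bug report hinges on LH_UNUSED_PAGE being defined as 0, so a bitwise AND against it can never be true. A quick sketch of the corrected classification logic (flag values match the definitions in access/hash.h; the `page_type` function itself is illustrative, not the patch's code):

```python
# Hash page flag bits as defined in PostgreSQL's access/hash.h.
LH_UNUSED_PAGE   = 0        # zero: cannot be tested with a bitwise AND
LH_OVERFLOW_PAGE = 1 << 0
LH_BUCKET_PAGE   = 1 << 1
LH_BITMAP_PAGE   = 1 << 2
LH_META_PAGE     = 1 << 3

def page_type(flags):
    # Test the nonzero flags explicitly; "unused" is the fall-through
    # case when no flag bit is set, exactly as Jeff suggests.
    if flags & LH_META_PAGE:
        return 'm'
    if flags & LH_BUCKET_PAGE:
        return 'b'
    if flags & LH_OVERFLOW_PAGE:
        return 'o'
    if flags & LH_BITMAP_PAGE:
        return 'i'
    return 'u'
```

The broken original (`if flags & LH_UNUSED_PAGE: type = 'u'`) is equivalent to `if flags & 0:`, which is always false — hence the instruction to flip unused into the default branch.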
Re: [HACKERS] New SQL counter statistics view (pg_stat_sql)
On Fri, Sep 2, 2016 at 2:33 AM, Haribabu Kommi wrote: > On Wed, Aug 31, 2016 at 3:19 AM, Alvaro Herrera > wrote: >> Haribabu Kommi wrote: >> >>> Apart from the above, here are the following list of command tags that >>> are generated in the code, I took only the first word of the command tag >>> just to see how many categories present. The number indicates the >>> subset of operations or number of types it is used. Like create table, >>> create function and etc. >> >> Sounds about right. I suppose all those cases that you aggregated here >> would expand to full tags in the actual code. I furthermore suppose >> that some of these could be ignored, such as the transaction ones and >> things like load, lock, move, fetch, discard, deallocate (maybe lump >> them all together under "other", or some other rough categorization, as >> Tom suggests). > > Following is the pg_stat_sql view with the SQL categories that I considered important. The rest of them will be shown under the others category. >
> postgres=# \d pg_stat_sql
>             View "pg_catalog.pg_stat_sql"
>      Column      |           Type           | Modifiers
> -----------------+--------------------------+-----------
>  inserts         | bigint                   |
>  deletes         | bigint                   |
>  updates         | bigint                   |
>  selects         | bigint                   |
>  declare_cursors | bigint                   |
>  closes          | bigint                   |
>  creates         | bigint                   |
>  drops           | bigint                   |
>  alters          | bigint                   |
>  imports         | bigint                   |
>  truncates       | bigint                   |
>  copies          | bigint                   |
>  grants          | bigint                   |
>  revokes         | bigint                   |
>  clusters        | bigint                   |
>  vacuums         | bigint                   |
>  analyzes        | bigint                   |
>  refreshs        | bigint                   |
>  locks           | bigint                   |
>  checkpoints     | bigint                   |
>  reindexes       | bigint                   |
>  deallocates     | bigint                   |
>  others          | bigint                   |
>  stats_reset     | timestamp with time zone |
>
> If any additions/deletions, I can accommodate them.
I think it is not a good idea to make the command names used here the plural forms of the command tags. Instead of "inserts", "updates", "imports", etc. just use "INSERT", "UPDATE", "IMPORT". That's simpler and less error prone - e.g. 
you won't end up with things like "refreshs", which is not a word.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
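To make the tag-versus-plural point concrete, here is a small sketch (in Python rather than the patch's C, and with an invented TRACKED set and OTHERS bucket that are illustrative assumptions, not the patch's actual list) of counting statements by the first word of their command tag, which sidesteps hand-maintained plurals like "refreshs" entirely:

```python
from collections import Counter

# Hypothetical subset of tracked first words; the real view would track more.
TRACKED = {"INSERT", "UPDATE", "DELETE", "SELECT", "CREATE", "DROP",
           "ALTER", "TRUNCATE", "COPY", "GRANT", "REVOKE", "VACUUM",
           "ANALYZE", "REFRESH", "CHECKPOINT", "REINDEX"}

def classify(command_tag: str) -> str:
    """Reduce a full command tag (e.g. "CREATE TABLE") to its first word,
    mirroring the aggregation described above; untracked tags fall into
    an OTHERS bucket."""
    first_word = command_tag.split()[0].upper()
    return first_word if first_word in TRACKED else "OTHERS"

def count_commands(tags):
    """Tally a stream of command tags into per-category counters."""
    return dict(Counter(classify(t) for t in tags))
```

With this scheme the counter names are the command tags themselves, so no pluralization step can go wrong: `classify("REFRESH MATERIALIZED VIEW")` is just `"REFRESH"`.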
Re: [HACKERS] Logical Replication WIP
Hi,

On 2016-09-14 21:17:42 +0200, Petr Jelinek wrote:
> > > +/*
> > > + * Gather Relations based on the provided RangeVar list.
> > > + * The gathered tables are locked in access share lock mode.
> > > + */
> >
> > Why access share? Shouldn't we make this ShareUpdateExclusive or
> > similar, to prevent schema changes?
>
> Hm, I thought AccessShare would be enough to prevent schema changes that
> matter to us (which is basically just drop afaik).

Doesn't e.g. dropping an index matter as well?

> > > +					if (strcmp($1, "replicate_insert") == 0)
> > > +						$$ = makeDefElem("replicate_insert",
> > > +										 (Node *) makeInteger(TRUE), @1);
> > > +					else if (strcmp($1, "noreplicate_insert") == 0)
> > > +						$$ = makeDefElem("replicate_insert",
> > > +										 (Node *) makeInteger(FALSE), @1);
> > > +					else if (strcmp($1, "replicate_update") == 0)
> > > +						$$ = makeDefElem("replicate_update",
> > > +										 (Node *) makeInteger(TRUE), @1);
> > > +					else if (strcmp($1, "noreplicate_update") == 0)
> > > +						$$ = makeDefElem("replicate_update",
> > > +										 (Node *) makeInteger(FALSE), @1);
> > > +					else if (strcmp($1, "replicate_delete") == 0)
> > > +						$$ = makeDefElem("replicate_delete",
> > > +										 (Node *) makeInteger(TRUE), @1);
> > > +					else if (strcmp($1, "noreplicate_delete") == 0)
> > > +						$$ = makeDefElem("replicate_delete",
> > > +										 (Node *) makeInteger(FALSE), @1);
> > > +					else
> > > +						ereport(ERROR,
> > > +								(errcode(ERRCODE_SYNTAX_ERROR),
> > > +								 errmsg("unrecognized publication option \"%s\"", $1),
> > > +								 parser_errposition(@1)));
> > > +				}
> > > +			;
> >
> > I'm kind of inclined to do this checking at execution (or transform)
> > time instead. That allows extensions to add options, and handle them in
> > utility hooks.
>
> That's an interesting point. I prefer the parsing to be done in gram.y,
> but it might be worth moving it for extensibility. Although there are
> so far other barriers for that.

Citus uses the lack of such a check for COPY to implement copy over its
distributed tables, for example. So there's some benefit.

> > > +	check_subscription_permissions();
> > > +
> > > +	rel = heap_open(SubscriptionRelationId, RowExclusiveLock);
> > > +
> > > +	/* Check if name is used */
> > > +	subid = GetSysCacheOid2(SUBSCRIPTIONNAME, MyDatabaseId,
> > > +							CStringGetDatum(stmt->subname));
> > > +	if (OidIsValid(subid))
> > > +	{
> > > +		ereport(ERROR,
> > > +				(errcode(ERRCODE_DUPLICATE_OBJECT),
> > > +				 errmsg("subscription \"%s\" already exists",
> > > +						stmt->subname)));
> > > +	}
> > > +
> > > +	/* Parse and check options. */
> > > +	parse_subscription_options(stmt->options, _given, ,
> > > +							   , );
> > > +
> > > +	/* TODO: improve error messages here. */
> > > +	if (conninfo == NULL)
> > > +		ereport(ERROR,
> > > +				(errcode(ERRCODE_SYNTAX_ERROR),
> > > +				 errmsg("connection not specified")));
> >
> > Probably also makes sense to parse the conninfo here to verify it looks
> > sane. Although that's fairly annoying to do, because the relevant code
> > is libpq :(
>
> Well, the connection is eventually used (in later patches) so maybe
> that's not a problem.

Well, it's nicer if it's immediately parsed, before doing complex and
expensive stuff, especially if that happens outside of the transaction.

> > > diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
> > > index 65230e2..f3d54c8 100644
> > > --- a/src/backend/nodes/copyfuncs.c
> > > +++ b/src/backend/nodes/copyfuncs.c
> >
> > I
Re: [HACKERS] Choosing parallel_degree
On 14 September 2016 at 14:48, Robert Haas wrote:
> On Thu, Sep 1, 2016 at 9:39 AM, Simon Riggs wrote:
>>>>> Can I change this to a lower setting? I would have done this before
>>>>> applying the patch, but you beat me to it.
>>>>
>>>> I don't have a problem with reducing the lock level there, if we're
>>>> convinced that it's safe.
>>>
>>> I'll run up a patch, with appropriate comments.
>>
>> Attached
>
> This should really be posted on a new thread, since it changes a bunch
> of reloptions, not only parallel_workers.  I can't immediately think
> of a reason why the changes wouldn't be safe, but I've failed to fully
> apprehend all of the possible dangers multiple times previously, so we
> should try to give everyone who might have ideas about this topic a
> chance to chime in with anything we might be missing.

OK. Will post on a new thread.

> I do think this comment is confusing:
>
> + *		This value is not locked by the transaction, so this value may
> + *		be changed while a SELECT that has used these values for planning
> + *		is still executing.
>
> I don't know what it means for "this value" to be locked, or not
> locked, by the transaction.  Basically, I have no idea what this is
> trying to explain.

You're quoting that without the context of the line above, which is
"get_tablespace_io_concurrency".

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Choosing parallel_degree
On Thu, Sep 1, 2016 at 9:39 AM, Simon Riggs wrote:
>>>> Can I change this to a lower setting? I would have done this before
>>>> applying the patch, but you beat me to it.
>>>
>>> I don't have a problem with reducing the lock level there, if we're
>>> convinced that it's safe.
>>
>> I'll run up a patch, with appropriate comments.
>
> Attached

This should really be posted on a new thread, since it changes a bunch
of reloptions, not only parallel_workers.  I can't immediately think
of a reason why the changes wouldn't be safe, but I've failed to fully
apprehend all of the possible dangers multiple times previously, so we
should try to give everyone who might have ideas about this topic a
chance to chime in with anything we might be missing.

I do think this comment is confusing:

+ *		This value is not locked by the transaction, so this value may
+ *		be changed while a SELECT that has used these values for planning
+ *		is still executing.

I don't know what it means for "this value" to be locked, or not
locked, by the transaction.  Basically, I have no idea what this is
trying to explain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Comment on GatherPath.single_copy
On Thu, Sep 1, 2016 at 3:15 AM, Kyotaro HORIGUCHI wrote:
> At Wed, 31 Aug 2016 07:26:22 -0400, Tom Lane wrote in
> <5934.1472642...@sss.pgh.pa.us>
>> Robert Haas writes:
>>> On Tue, Aug 30, 2016 at 6:38 PM, Tom Lane wrote:
>>>> Robert, could you fix the documentation for that field so it's
>>>> intelligible?
>>
>>> Uh, maybe.  The trick, as you've already noted, is finding something
>>> better.  Maybe this:
>>
>>> -    bool        single_copy;    /* path must not be executed >1x */
>>> +    bool        single_copy;    /* don't execute path in multiple
>>> processes */
>>
>> OK by me.
>>
>> 			regards, tom lane
>
> Me too, thanks.

OK, changed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Logical Replication WIP
On 14/09/16 20:50, Andres Freund wrote:
> On 2016-09-14 13:20:02 -0500, Peter Eisentraut wrote:
>> On 9/14/16 11:21 AM, Andres Freund wrote:
>>>> +	ExecInsert(NULL, /* mtstate is only used for onconflict handling
>>>> which we don't support atm */
>>>> +			   remoteslot,
>>>> +			   remoteslot,
>>>> +			   NIL,
>>>> +			   ONCONFLICT_NONE,
>>>> +			   estate,
>>>> +			   false);
>>>
>>> I have *severe* doubts about just using the (newly) exposed functions
>>> 1:1 here.
>>
>> It is a valid concern, but what is the alternative?  ExecInsert() and
>> the others appear to do exactly the right things that are required.
>
> They're actually a lot more heavyweight than what's required. If you
> e.g. do a large COPY on the source side, we create a single executor
> state (if at all), and then insert the rows using lower level
> routines. And that's *vastly* faster than going through all the setup
> costs here for each row.
>
>> Are your concerns mainly philosophical about calling into internal
>> executor code, or do you have technical concerns that this will not do
>> the right thing in some cases?
>
> Well, not about it being wrong in the sense of returning wrong results,
> but wrong in the sense of not even remotely being able to keep up in
> common cases.

I'd say in the common case they will. I don't plan to use these forever,
btw, but it's simplest to just use them in v1 IMHO instead of trying to
reinvent new versions of these that perform better but also behave
correctly (in terms of triggers and stuff, for example).

-- 
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Logical Replication WIP
On 14/09/16 00:48, Andres Freund wrote: First read through the current version. Hence no real architectural comments. Hi, Thanks for looking! On 2016-09-09 00:59:26 +0200, Petr Jelinek wrote: diff --git a/src/backend/commands/publicationcmds.c b/src/backend/commands/publicationcmds.c new file mode 100644 index 000..e0c719d --- /dev/null +++ b/src/backend/commands/publicationcmds.c @@ -0,0 +1,761 @@ +/*- + * + * publicationcmds.c + * publication manipulation + * + * Copyright (c) 2015, PostgreSQL Global Development Group + * + * IDENTIFICATION + * publicationcmds.c Not that I'm a fan of this line in the first place, but usually it does include the path. Yes, I don't bother with it in WIP version though, because this way I won't forget to change it when it's getting close to ready if there were renames. +static void +check_replication_permissions(void) +{ + if (!superuser() && !has_rolreplication(GetUserId())) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), +(errmsg("must be superuser or replication role to manipulate publications"; +} Do we want to require owner privileges for replication roles? I'd say no, but want to raise the question. No, we might want to invent some publish role for which we will so that we can do logical replication with higher granularity but for replication role it does not make sense. And I think the higher granularity ACLs is something for followup patches. 
+ObjectAddress +CreatePublication(CreatePublicationStmt *stmt) +{ + Relationrel; + ObjectAddress myself; + Oid puboid; + boolnulls[Natts_pg_publication]; + Datum values[Natts_pg_publication]; + HeapTuple tup; + boolreplicate_insert_given; + boolreplicate_update_given; + boolreplicate_delete_given; + boolreplicate_insert; + boolreplicate_update; + boolreplicate_delete; + + check_replication_permissions(); + + rel = heap_open(PublicationRelationId, RowExclusiveLock); + + /* Check if name is used */ + puboid = GetSysCacheOid1(PUBLICATIONNAME, CStringGetDatum(stmt->pubname)); + if (OidIsValid(puboid)) + { + ereport(ERROR, + (errcode(ERRCODE_DUPLICATE_OBJECT), +errmsg("publication \"%s\" already exists", + stmt->pubname))); + } + + /* Form a tuple. */ + memset(values, 0, sizeof(values)); + memset(nulls, false, sizeof(nulls)); + + values[Anum_pg_publication_pubname - 1] = + DirectFunctionCall1(namein, CStringGetDatum(stmt->pubname)); + + parse_publication_options(stmt->options, + _insert_given, _insert, + _update_given, _update, + _delete_given, _delete); + + values[Anum_pg_publication_puballtables - 1] = + BoolGetDatum(stmt->for_all_tables); + values[Anum_pg_publication_pubreplins - 1] = + BoolGetDatum(replicate_insert); + values[Anum_pg_publication_pubreplupd - 1] = + BoolGetDatum(replicate_update); + values[Anum_pg_publication_pubrepldel - 1] = + BoolGetDatum(replicate_delete); + + tup = heap_form_tuple(RelationGetDescr(rel), values, nulls); + + /* Insert tuple into catalog. */ + puboid = simple_heap_insert(rel, tup); + CatalogUpdateIndexes(rel, tup); + heap_freetuple(tup); + + ObjectAddressSet(myself, PublicationRelationId, puboid); + + /* Make the changes visible. 
*/ + CommandCounterIncrement(); + + if (stmt->tables) + { + List *rels; + + Assert(list_length(stmt->tables) > 0); + + rels = GatherTableList(stmt->tables); + PublicationAddTables(puboid, rels, true, NULL); + CloseTables(rels); + } + else if (stmt->for_all_tables || stmt->schema) + { + List *rels; + + rels = GatherTables(stmt->schema); + PublicationAddTables(puboid, rels, true, NULL); + CloseTables(rels); + } Isn't this (and ALTER) racy? What happens if tables are concurrently created? This session wouldn't necessarily see the tables, and other sessions won't see for_all_tables/schema. Evaluating for_all_tables/all_in_schema when the publication is used, would solve that problem. Well, yes it is. It's technically not problem for all_in_schema as that's just
Re: [HACKERS] Hash Indexes
Hi,

On 09/14/2016 07:24 AM, Amit Kapila wrote:

On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen wrote:

On 09/13/2016 07:26 AM, Amit Kapila wrote:

Attached, new version of patch which contains the fix for problem
reported on write-ahead-log of hash index thread [1].

I have been testing the patch in various scenarios, and it has a
positive performance impact in some cases. This is especially seen in
cases where the values of the indexed column are unique - SELECTs can
see a 40-60% benefit over a similar query using b-tree.

Here, I think it is better if we have the data comparing the situation
of hash index with respect to HEAD as well. What I mean to say is that
you are claiming that after the hash index improvements SELECT workload
is 40-60% better, but where do we stand as of HEAD?

The tests I have done are with a copy of a production database using the
same queries sent with a b-tree index for the primary key, and the same
with a hash index. Those are seeing a speed-up of the mentioned 40-60%
in execution time - some involve JOINs. The largest of those tables is
390Mb with a CHAR() based primary key. UPDATE also sees an improvement.

Can you explain this more? Is it more compared to HEAD or more as
compared to Btree? Isn't this contradictory to what the test in the mail
below shows?

Same thing here - where the fields involving the hash index aren't
updated.

In cases where the indexed column value isn't unique, it takes a long
time to build the index due to the overflow page creation.
Also in cases where the index column is updated with a high number of clients, ala -- ddl.sql -- CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val; CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id); CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val); ANALYZE; -- test.sql -- \set id random(1,10) \set val random(0,10) BEGIN; UPDATE test SET val = :val WHERE id = :id; COMMIT; w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for hash, and 10062 tps for b-tree). Thanks for doing the tests. Have you applied both concurrent index and cache the meta page patch for these tests? So from above tests, we can say that after these set of patches read-only workloads will be significantly improved even better than btree in quite-a-few useful cases. Agreed. However when the indexed column is updated, there is still a large gap as compare to btree (what about the case when the indexed column is not updated in read-write transaction as in our pgbench read-write transactions, by any chance did you ran any such test?). I have done a run to look at the concurrency / TPS aspect of the implementation - to try something different than Mark's work on testing the pgbench setup. With definitions as above, with SELECT as -- select.sql -- \set id random(1,10) BEGIN; SELECT * FROM test WHERE id = :id; COMMIT; and UPDATE/Indexed with an index on 'val', and finally UPDATE/Nonindexed w/o one. [1] [2] [3] is new_hash - old_hash is the existing hash implementation on master. btree is master too. Machine is a 28C/56T with 256Gb RAM with 2 x RAID10 SSD for data + wal. Clients ran with -M prepared. 
[1] https://www.postgresql.org/message-id/caa4ek1+erbp+7mdkkahjzwq_dtdkocbpt7lswfwcqvuhbxz...@mail.gmail.com [2] https://www.postgresql.org/message-id/CAD__OujvYghFX_XVkgRcJH4VcEbfJNSxySd9x=1wp5vylvk...@mail.gmail.com [3] https://www.postgresql.org/message-id/caa4ek1juyr_ab7bxfnsg5+jqhiwgklkgacfk9bfd4mlffk6...@mail.gmail.com Don't know if you find this useful due to the small number of rows, but let me know if there are other tests I can run, f.ex. bump the number of rows. I think we need to focus on improving cases where index columns are updated, but it is better to do that work as a separate patch. Ok. Best regards, Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical Replication WIP
On 2016-09-14 13:20:02 -0500, Peter Eisentraut wrote:
> On 9/14/16 11:21 AM, Andres Freund wrote:
> >> +	ExecInsert(NULL, /* mtstate is only used for onconflict handling which
> >> we don't support atm */
> >> +			   remoteslot,
> >> +			   remoteslot,
> >> +			   NIL,
> >> +			   ONCONFLICT_NONE,
> >> +			   estate,
> >> +			   false);
> > I have *severe* doubts about just using the (newly) exposed functions
> > 1:1 here.
>
> It is a valid concern, but what is the alternative?  ExecInsert() and
> the others appear to do exactly the right things that are required.

They're actually a lot more heavyweight than what's required. If you
e.g. do a large COPY on the source side, we create a single executor
state (if at all), and then insert the rows using lower level
routines. And that's *vastly* faster than going through all the setup
costs here for each row.

> Are your concerns mainly philosophical about calling into internal
> executor code, or do you have technical concerns that this will not do
> the right thing in some cases?

Well, not about it being wrong in the sense of returning wrong results,
but wrong in the sense of not even remotely being able to keep up in
common cases.

Andres
Re: [HACKERS] WAL consistency check facility
On Tue, Sep 13, 2016 at 9:04 PM, Michael Paquier wrote:
> It seems to me that you need to think about the way to document things
> properly first, with for example:
> - Have a first documentation patch that explains what a resource
> manager for WAL is, and what the types available are, with a nice table.
> - Add documentation in your patch to explain what the benefits of
> using this facility are. The main purpose is testing, but there are
> also mentions upthread of users that would like to get that into
> production, assuming that the overhead is minimal.

So, I don't think that this patch should be required to document all of
the currently-undocumented stuff that somebody might want to know that
is related to this patch.  It should be enough to document the patch
itself.  One paragraph in config.sgml in the usual format should be
fine.  Maybe two paragraphs.  We do need to list the resource managers,
but that can just be something like this:

    The default value for this setting is off.  To check all records
    written to the write-ahead log, set this parameter to all.  To
    check only some records, specify a comma-separated list of resource
    managers.  The resource managers which are currently supported are
    heap, btree, hash, BLAH, and BLAH.

If somebody wants to write some user-facing documentation of the
write-ahead log format, great.  That could certainly be very helpful
for people who are running pg_xlogdump.  But I don't think that stuff
goes in this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Logical Replication WIP
On 9/14/16 11:21 AM, Andres Freund wrote: >> +ExecInsert(NULL, /* mtstate is only used for onconflict handling which >> we don't support atm */ >> > + remoteslot, >> > + remoteslot, >> > + NIL, >> > + ONCONFLICT_NONE, >> > + estate, >> > + false); > I have *severe* doubts about just using the (newly) exposed functions > 1:1 here. It is a valid concern, but what is the alternative? ExecInsert() and the others appear to do exactly the right things that are required. Are your concerns mainly philosophical about calling into internal executor code, or do you have technical concerns that this will not do the right thing in some cases? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL consistency check facility
On Thu, Sep 8, 2016 at 1:20 PM, Michael Paquier wrote:
>> 2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in
>> the REVMAP page. This happens only for two cases. I'm not sure what the
>> reason can be.
>
> Hm? This smells like a block reference bug. What are the cases you are
> referring to?

Following is the only case where the backup page stored in the WAL
record and the current page after redo are not consistent.

Test: BRIN, using gmake check.

Master:
STATEMENT: VACUUM brintest;
LOG: INSERT @ 0/59E1E0F8: - BRIN/UPDATE+INIT: heapBlk 100 pagesPerRange
1 old offnum 11, new offnum 1

Standby:
LOG: REDO @ 0/59E1B500; LSN 0/59E1E0F8: prev 0/59E17578; xid 0; len 14;
blkref #0: rel 1663/16384/30556, blk 12; blkref #1: rel
1663/16384/30556, blk 1; blkref #2: rel 1663/16384/30556, blk 2 -
BRIN/UPDATE+INIT: heapBlk 100 pagesPerRange 1 old offnum 11, new offnum 1
WARNING: Inconsistent page (at byte 26) found, rel 1663/16384/30556,
forknum 0, blkno 1
CONTEXT: xlog redo at 0/59E1B500 for BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1

Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 1:23 PM, Alvaro Herrera wrote:
> Robert Haas wrote:
>> Actually, I think that probably *is* worthwhile, specifically because
>> it might let us avoid multiple index scans in cases where we currently
>> require them.  Right now, our default maintenance_work_mem value is
>> 64MB, which is enough to hold a little over ten million tuples.  It's
>> also large enough to hold a bitmap for a 14GB table.  So right now if
>> you deleted, say, 100 tuples per page you would end up with one index
>> vacuum cycle for every ~100,000 pages = 800MB, whereas switching to
>> the bitmap representation for such cases would require only one index
>> vacuum cycle for every 14GB, more than an order of magnitude
>> improvement!
>
> Yeah, this sounds worthwhile.  If we switch to the more compact
> in-memory representation close to the point where we figure the TID
> array is not going to fit in m_w_m, then we're saving some number of
> additional index scans, and I'm pretty sure that the time to transform
> from array to bitmap is going to be more than paid back by the I/O
> savings.

Yes, that seems pretty clear.  The indexes can be arbitrarily large and
there can be arbitrarily many of them, so we could save a LOT of I/O.

> One thing not quite clear to me is how do we create the bitmap
> representation starting from the array representation in midflight
> without using twice as much memory transiently.  Are we going to write
> the array to a temp file, free the array memory, then fill the bitmap by
> reading the array from disk?

I was just thinking about this exact problem while I was out to
lunch.[1]  I wonder if we could do something like this:

1. Allocate an array large enough for one pointer per gigabyte of the
underlying relation.

2. Allocate 64MB, or the remaining amount of maintenance_work_mem if
it's less, to store TIDs.

3. At the beginning of each 1GB chunk, add a pointer to the first free
byte in the slab allocated in step 2 to the array allocated in step 1.
Write a header word that identifies this as a TID list (rather than a bitmap) and leave space for a TID count; then, write the TIDs afterwards. Continue doing this until one of the following things happens: (a) we reach the end of the 1GB chunk - if that happens, restart step 3 for the next chunk; (b) we fill the chunk - see step 4, or (c) we write more TIDs for the chunk than the space being used for TIDs now exceeds the space needed for a bitmap - see step 5. 4. When we fill up one of the slabs allocated in step 2, allocate a new one and move the tuples for the current 1GB chunk to the beginning of the new slab using memmove(). This is wasteful of both CPU time and memory, but I think it's not that bad. The maximum possible waste is less than 10%, and many allocators have more overhead than that. We could reduce the waste by using, say, 256MB chunks rather than 1GB chunks. If no new slab can be allocated because maintenance_work_mem is completely exhausted (or the remaining space isn't enough for the TIDs that would need to be moved immediately), then stop and do an index vacuum cycle. 5. When we write a large enough number of TIDs for 1GB chunk that the bitmap would be smaller, check whether sufficient maintenance_work_mem remains to allocate a bitmap for 1GB chunk (~4.5MB). If not, never mind; continue with step 3 as if the bitmap representation did not exist. If so, allocate space for a bitmap, move all of the TIDs for the current chunk into it, and update the array allocated in step 1 to point to it. Then, finish scanning the current 1GB chunk, updating that bitmap rather than inserting TIDs into the slab. Rewind our pointer into the slab to where it was at the beginning of the current 1GB chunk, so that the memory we consumed for TIDs can be reused now that those TIDs have been transferred to a bitmap. 
If, earlier in the current 1GB chunk, we did a memmove-to-next-slab operation as described in step 4, this "rewind" might move our pointer back into the previous slab, in which case we can free the now-empty slab. (The next 1GB segment might have few enough TIDs that they will fit into the leftover space in the previous slab.) With this algorithm, we never exceed maintenance_work_mem, not even transiently. When memory is no longer sufficient to convert to the bitmap representation without bursting above maintenance_work_mem, we simply don't perform the conversion. Also, we do very little memory copying. An alternative I considered was to do a separate allocation for each 1GB chunk rather than carving the dead-tuple space out of slabs. But the problem with that is that you'll have to start those out small (in case you don't find many dead tuples) and then grow them, which means reallocating, which is bad both because it can burst above maintenance_work_mem while the repalloc is in process and also because you have to keep copying the data from the old chunk to the new, bigger chunk.
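For what it's worth, the list-versus-bitmap switchover at the heart of steps 3-5 can be sketched independently of the slab bookkeeping. This is Python as pseudocode, not the proposed C implementation: the size constants are back-of-the-envelope assumptions, and a Python set stands in for the real bit array.

```python
# Illustrative size assumptions (not the patch's actual layout):
TID_BYTES = 6             # sizeof(ItemPointerData)
OFFSETS_PER_PAGE = 291    # roughly the max heap offsets on an 8kB page
PAGES_PER_CHUNK = 131072  # 8kB pages in a 1GB chunk of the relation

# A full per-chunk bitmap needs one bit per (page, offset) slot: ~4.5MB.
BITMAP_BYTES = PAGES_PER_CHUNK * OFFSETS_PER_PAGE // 8

class ChunkDeadTids:
    """Per-1GB-chunk dead-TID store: starts as a plain TID list and
    converts to a bitmap once the list would use more memory than the
    bitmap would (step 5 above)."""

    def __init__(self, bitmap_bytes=BITMAP_BYTES):
        self.bitmap_bytes = bitmap_bytes
        self.tids = []       # TID-list representation: (page, offset)
        self.bitmap = None   # bitmap representation, once converted

    def add(self, page, offset):
        if self.bitmap is not None:
            self.bitmap.add((page, offset))
            return
        self.tids.append((page, offset))
        if len(self.tids) * TID_BYTES > self.bitmap_bytes:
            # The list now costs more than a bitmap: convert, then drop
            # the list so its memory can be reused for later TIDs.
            self.bitmap = set(self.tids)
            self.tids = None

    def contains(self, page, offset):
        store = self.bitmap if self.bitmap is not None else self.tids
        return (page, offset) in store
```

A real implementation would additionally refuse to convert when not enough maintenance_work_mem remains for the ~4.5MB bitmap, exactly as step 5 describes; the sketch omits that bookkeeping.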
Re: [HACKERS] GiST: interpretation of NaN from penalty function
Andrew Borodin writes:
> Currently GiST treats NaN penalty as zero penalty; in terms of a
> generalized tree this means "perfect fit". I think that this situation
> should be considered "worst fit" instead.

On what basis?  It seems hard to me to make any principled argument
here.  Certainly, "NaN means infinity", as your patch proposes, has no
more basis to it than "NaN means zero".  If the penalty function doesn't
like that interpretation, it shouldn't return NaN.

			regards, tom lane
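Part of why NaN penalties need an explicit convention at all is IEEE 754: every ordered comparison involving NaN is false, so a naive minimum-penalty scan treats a NaN candidate as "unbeatable" or "never chosen" depending purely on where it appears. A small illustration (Python floats follow the same IEEE 754 semantics as C doubles; `best_penalty_index` is a hypothetical stand-in for a GiST choose-subtree loop, not actual PostgreSQL code):

```python
nan = float("nan")

# All ordered comparisons with NaN are false, and NaN != NaN:
assert not (nan < 1.0) and not (nan > 1.0) and not (nan == nan)

def best_penalty_index(penalties):
    """Naive 'keep the smallest penalty' scan. Whether a NaN entry wins
    depends only on its position: 'x < nan' and 'nan < x' are both
    always false, so NaN in first place is never displaced (acting like
    zero penalty), while NaN elsewhere is never chosen (acting like
    infinite penalty)."""
    best_idx = 0
    for i in range(1, len(penalties)):
        if penalties[i] < penalties[best_idx]:
            best_idx = i
    return best_idx
```

Either convention ("NaN means zero" or "NaN means infinity") is defensible; what the comparison semantics rule out is leaving NaN to the accident of iteration order.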
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
Pavan Deolasee writes:
> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera wrote:
>> One thing not quite clear to me is how do we create the bitmap
>> representation starting from the array representation in midflight
>> without using twice as much memory transiently.  Are we going to write
>> the array to a temp file, free the array memory, then fill the bitmap by
>> reading the array from disk?

> We could do that.

People who are vacuuming because they are out of disk space will be
very, very unhappy with that solution.

			regards, tom lane
Re: [HACKERS] Tuplesort merge pre-reading
Addressed all your comments one way or another, new patch attached.
Comments on some specific points below:

On 09/12/2016 01:13 AM, Peter Geoghegan wrote:
> Other things I noticed:
>
> * You should probably point out that typically, access to batch memory
> will still be sequential, despite your block-based scheme. The
> preloading will now more or less make that the normal case. Any
> fragmentation will now be essentially in memory, not on disk, which is
> a big win.

That's not true: the "buffers" in batch memory are not accessed
sequentially. When we pull the next tuple from a tape, we store it in
the next free buffer. Usually, that buffer was used to hold the
previous tuple that was returned from gettuple(), and was just added to
the free list. It's still quite cache-friendly, though, because we only
need a small number of slots (one for each tape).

> * I think you should move "bool *mergeactive; /* active input run
> source? */" within Tuplesortstate to be next to the other batch memory
> stuff. No point in having separate merge and batch "sections" there
> anymore.

Hmm. I think I prefer to keep the memory management stuff in a separate
section. While it's true that it's currently only used during merging,
it's not hard to imagine using it when building the initial runs, for
example. Except for replacement selection, the pattern for building the
runs is: add a bunch of tuples, sort, flush them out. It would be
straightforward to use one large chunk of memory to hold all the
tuples. I'm not going to do that now, but I think keeping the memory
management stuff separate from merge-related state makes sense.

> * I think that you need to comment on why state->tuplecontext is not
> used for batch memory now. It is still useful, for multiple merge
> passes, but the situation has notably changed for it.

Hmm. We don't actually use it after the initial runs at all anymore. I
added a call to destroy it in mergeruns().
Now that we use the batch memory buffers for allocations < 1 kB (I
pulled that number out of a hat, BTW), and we only need one allocation
per tape (plus one), there's not much risk of fragmentation.

On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan wrote:
> * Doesn't this code need to call MemoryContextAllocHuge() rather than
> palloc()?:
>
> @@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
>  		Assert(lt->frozen);
>  		datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
>  	}
> +
> +	/* Allocate a read buffer */
> +	if (lt->buffer)
> +		pfree(lt->buffer);
> +	lt->buffer = palloc(lt->read_buffer_size);
> +	lt->buffer_size = lt->read_buffer_size;
>
> Of course, when you do that you're going to have to make the new
> "buffer_size" and "read_buffer_size" fields of type "Size" (or,
> possibly, "int64", to match tuplesort.c's own buffer sizing variables
> ever since Noah added MaxAllocSizeHuge). Ditto for the existing "pos"
> and "nbytes" fields next to "buffer_size" within the struct
> LogicalTape, I think. ISTM that logtape.c blocknums can remain of type
> "long", though, since that reflects an existing hardly-relevant
> limitation that you're not making any worse.

True. I fixed that by putting a MaxAllocSize cap on the buffer size
instead. I doubt that doing > 1 GB of read-ahead of a single tape will
do any good. I wonder if we should actually have a smaller cap there.
Even 1 GB seems excessive. Might be better to start the merging sooner,
rather than wait for the read of 1 GB to complete. The point of the OS
readahead is that the OS will do that for us, in the background. And
other processes might have better use for the memory anyway.

> * It couldn't hurt to make this code paranoid about LACKMEM() becoming
> true, which will cause havoc (we saw this recently in 9.6; a patch of
> mine to fix that just went in):
>
> +	/*
> +	 * Use all the spare memory we have available for read buffers. Divide the
> +	 * memory evenly among all the tapes.
+*/ + avail_blocks = state->availMem / BLCKSZ; + per_tape = avail_blocks / maxTapes; + cutoff = avail_blocks % maxTapes; + if (per_tape == 0) + { + per_tape = 1; + cutoff = 0; + } + for (tapenum = 0; tapenum < maxTapes; tapenum++) + { + LogicalTapeAssignReadBufferSize(state->tapeset, tapenum, + (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ); + } In other words, we really don't want availMem to become < 0, since it's int64, but a derived value is passed to LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now, if LACKMEM() did happen it would be a bug anyway, but I recommend defensive code also be added. Per grow_memtuples(), "We need to be sure that we do not cause LACKMEM to become true, else the space management algorithm will go nuts". Let's be sure that we get that right, since, as we saw recently, especially since grow_memtuples() will not actually have the
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera wrote: > > > One thing not quite clear to me is how do we create the bitmap > representation starting from the array representation in midflight > without using twice as much memory transiently. Are we going to write > the array to a temp file, free the array memory, then fill the bitmap by > reading the array from disk? > We could do that. Or maybe compress the TID array when half of m_w_m is consumed, and do this repeatedly with the remaining memory. For example, if we start with 1GB memory, we decide to compress at 512MB. Say that results in 300MB for the bitmap. We then continue to accumulate TIDs and do another round of fold-up when another 350MB is consumed. I think we should maintain a per-offset count of the number of dead tuples to choose the optimal bitmap size (that needs an overflow region). We can also track how many blocks or block ranges have at least one dead tuple to know if it's worthwhile to have some sort of indirection. Together that can tell us how much compression can be achieved and allow us to choose the optimal representation. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On 14 September 2016 at 11:19, Pavan Deolasee wrote: >> In >> theory we could even start with the list of TIDs and switch to the >> bitmap if the TID list becomes larger than the bitmap would have been, >> but I don't know if it's worth the effort. >> > > Yes, that works too. Or maybe even better because we already know the > bitmap size requirements, definitely for the tuples collected so far. We > might need to maintain some more stats to further optimise the > representation, but that seems like unnecessary detailing at this point. That sounds best to me... build the simple representation, but as we do, maintain stats to show to what extent that set of tuples is compressible. When we hit the limit on memory we can then selectively compress chunks to stay within memory, starting with the most compressible chunks. I think we should use the chunking approach Robert suggests, though mainly because that allows us to consider how parallel VACUUM should work - writing the chunks to shmem. That would also allow us to apply a single global limit for vacuum memory rather than an allocation per VACUUM. We can then scan multiple indexes at once in parallel, all accessing the shmem data structure. We should also find the compression is better when we consider chunks rather than the whole data structure at once. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
Robert Haas wrote: > Actually, I think that probably *is* worthwhile, specifically because > it might let us avoid multiple index scans in cases where we currently > require them. Right now, our default maintenance_work_mem value is > 64MB, which is enough to hold a little over ten million tuples. It's > also large enough to hold a bitmap for a 14GB table. So right now if > you deleted, say, 100 tuples per page you would end up with an index > vacuum cycle for every ~100,000 pages = 800MB, whereas switching to > the bitmap representation for such cases would require only one index > vacuum cycle for every 14GB, more than an order of magnitude > improvement! Yeah, this sounds worthwhile. If we switch to the more compact in-memory representation close to the point where we figure the TID array is not going to fit in m_w_m, then we're saving some number of additional index scans, and I'm pretty sure that the time to transform from array to bitmap is going to be more than paid back by the I/O savings. One thing not quite clear to me is how do we create the bitmap representation starting from the array representation in midflight without using twice as much memory transiently. Are we going to write the array to a temp file, free the array memory, then fill the bitmap by reading the array from disk? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] GiST: interpretation of NaN from penalty function
Hi hackers! Currently GiST treats a NaN penalty as zero penalty; in terms of a generalized tree this means "perfect fit". I think this situation should be considered "worst fit" instead. Here is a patch to highlight the place in the code. I could not construct a test to generate a bad tree which would be fixed by this patch. There are not many cases in which you get NaN. None of them can be a result of usual additions and multiplications of real values. Do I miss something? Is there any case when NaN should be considered a good fit? Greg Stark was talking about this in BANLkTi=d+bpps1cm4yc8kukhj63hwj4...@mail.gmail.com but that topic didn't go far (due to triangles). I'm currently messing with floats in penalties, very close to NaNs, and I think this question can be settled. Regards, Andrey Borodin. diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c index fac166d..0a62561 100644 --- a/src/backend/access/gist/gistutil.c +++ b/src/backend/access/gist/gistutil.c @@ -675,8 +675,14 @@ gistpenalty(GISTSTATE *giststate, int attno, PointerGetDatum(orig), PointerGetDatum(add), PointerGetDatum(&penalty)); - /* disallow negative or NaN penalty */ - if (isnan(penalty) || penalty < 0.0) + /* +* disallow negative or NaN penalty: +* NaN is considered float infinity - i.e. the most inappropriate fit; +* a negative value is considered a penalty-less fit +*/ + if (isnan(penalty)) + penalty = get_float4_infinity(); + if (penalty < 0.0) penalty = 0.0; } else if (isNullOrig && isNullAdd) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GiST penalty functions [PoC]
Here is v6 of cube penalty patch. There was a warning about unused edge function under systems without __STDC_IEC_559__ defined, this patch fixes it. Regards, Andrey Borodin. diff --git a/contrib/cube/cube.c b/contrib/cube/cube.c index 3feddef..ad868ac 100644 --- a/contrib/cube/cube.c +++ b/contrib/cube/cube.c @@ -18,6 +18,7 @@ #include "cubedata.h" + PG_MODULE_MAGIC; /* @@ -27,6 +28,15 @@ PG_MODULE_MAGIC; #define ARRNELEMS(x) ArrayGetNItems( ARR_NDIM(x), ARR_DIMS(x)) /* + * If IEEE754 floats are fully supported we use the advanced penalty, + * which takes into account cube volume and perimeter and tends to favor + * small nodes over big ones. For more info see the g_cube_penalty implementation. + */ +#ifdef __STDC_IEC_559__ +#define ADVANCED_PENALTY +#endif + +/* ** Input/Output routines */ PG_FUNCTION_INFO_V1(cube_in); @@ -91,14 +101,18 @@ PG_FUNCTION_INFO_V1(cube_enlarge); /* ** For internal use only */ -int32 cube_cmp_v0(NDBOX *a, NDBOX *b); -bool cube_contains_v0(NDBOX *a, NDBOX *b); -bool cube_overlap_v0(NDBOX *a, NDBOX *b); -NDBOX *cube_union_v0(NDBOX *a, NDBOX *b); -void rt_cube_size(NDBOX *a, double *sz); -NDBOX *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep); -bool g_cube_leaf_consistent(NDBOX *key, NDBOX *query, StrategyNumber strategy); -bool g_cube_internal_consistent(NDBOX *key, NDBOX *query, StrategyNumber strategy); +static int32 cube_cmp_v0(NDBOX *a, NDBOX *b); +static bool cube_contains_v0(NDBOX *a, NDBOX *b); +static bool cube_overlap_v0(NDBOX *a, NDBOX *b); +static NDBOX *cube_union_v0(NDBOX *a, NDBOX *b); +#ifdef ADVANCED_PENALTY +static float pack_float(float value, int realm); +static void rt_cube_edge(NDBOX *a, double *sz); +#endif +static void rt_cube_size(NDBOX *a, double *sz); +static NDBOX *g_cube_binary_union(NDBOX *r1, NDBOX *r2, int *sizep); +static bool g_cube_leaf_consistent(NDBOX *key, NDBOX *query, StrategyNumber strategy); +static bool g_cube_internal_consistent(NDBOX *key, NDBOX *query, StrategyNumber strategy); /* ** Auxiliary funxtions @@ 
-419,7 +433,36 @@ g_cube_decompress(PG_FUNCTION_ARGS) } PG_RETURN_POINTER(entry); } +/* + * Function to pack floats of different realms. + * This function serves to pack bit flags inside the float type. + * The resulting value can be from one of four different "realms". + * Every value from realm 3 is greater than any value from realms 2, 1 and 0. + * Every value from realm 2 is less than every value from realm 3 and greater + * than any value from realms 1 and 0, and so on. Values from the same realm + * lose two bits of precision. This technique is possible due to the floating + * point number specification according to IEEE 754: exponent bits are highest + * (excluding the sign bit, but here the penalty is always positive). If float a is + * greater than float b, integer A with the same bit representation as a is greater + * than integer B with the same bits as b. + */ +#ifdef ADVANCED_PENALTY +static float +pack_float(const float value, const int realm) +{ + union { +float f; +struct { unsigned value:31, sign:1; } vbits; +struct { unsigned value:29, realm:2, sign:1; } rbits; + } a; + + a.f = value; + a.rbits.value = a.vbits.value >> 2; + a.rbits.realm = realm; + return a.f; +} +#endif /* ** The GiST Penalty method for boxes @@ -441,9 +484,42 @@ g_cube_penalty(PG_FUNCTION_ARGS) rt_cube_size(DatumGetNDBOX(origentry->key), &tmp2); *result = (float) (tmp1 - tmp2); - /* -* fprintf(stderr, "penalty\n"); fprintf(stderr, "\t%g\n", *result); -*/ +#ifdef ADVANCED_PENALTY + /* Realm tricks are used only in case of IEEE754 support (IEC 60559) */ + + /* REALM 0: No extension is required, volume is zero, return edge */ + /* REALM 1: No extension is required, return nonzero volume */ + /* REALM 2: Volume extension is zero, return nonzero edge extension */ + /* REALM 3: Volume extension is nonzero, return it */ + + if( *result == 0 ) + { + double tmp3 = tmp1; /* remember entry volume */ + rt_cube_edge(ud, &tmp1); + rt_cube_edge(DatumGetNDBOX(origentry->key), &tmp2); + *result = (float) (tmp1 - tmp2); + if( *result == 0
) + { + if( tmp3 != 0 ) + { + *result = pack_float(tmp3, 1); /* REALM 1 */ + } + else + { + *result = pack_float(tmp1, 0); /* REALM 0 */ + } + } + else + { + *result = pack_float(*result, 2); /* REALM 2 */ + } + } + else + { +
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas wrote: > For instance, one idea to grow memory usage incrementally would be to > store dead tuple information separately for each 1GB segment of the > relation. So we have an array of dead-tuple-representation objects, > one for every 1GB of the relation. If there are no dead tuples in a > given 1GB segment, then this pointer can just be NULL. Otherwise, it > can point to either the bitmap representation (which will take ~4.5MB) > or it can point to an array of TIDs (which will take 6 bytes/TID). > That could handle an awfully wide variety of usage patterns > efficiently; it's basically never worse than what we're doing today, > and when the dead tuple density is high for any portion of the > relation it's a lot better. If you compress the list into a bitmap a posteriori, you know the number of tuples per page, so you could encode the bitmap even more efficiently. It's not a bad idea, one that can be slapped on top of the multiarray patch - when closing a segment, it can be decided whether to turn it into a bitmap or not. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical Replication WIP
(continuing, uh, a bit happier) On 2016-09-09 00:59:26 +0200, Petr Jelinek wrote: > +/* > + * Relcache invalidation callback for our relation map cache. > + */ > +static void > +logicalreprelmap_invalidate_cb(Datum arg, Oid reloid) > +{ > + LogicalRepRelMapEntry *entry; > + > + /* Just to be sure. */ > + if (LogicalRepRelMap == NULL) > + return; > + > + if (reloid != InvalidOid) > + { > + HASH_SEQ_STATUS status; > + > + hash_seq_init(&status, LogicalRepRelMap); > + > + /* TODO, use inverse lookup hastable? */ *hashtable > + while ((entry = (LogicalRepRelMapEntry *) > hash_seq_search(&status)) != NULL) > + { > + if (entry->reloid == reloid) > + entry->reloid = InvalidOid; can't we break here? > +/* > + * Initialize the relation map cache. > + */ > +static void > +remoterelmap_init(void) > +{ > + HASHCTL ctl; > + > + /* Make sure we've initialized CacheMemoryContext. */ > + if (CacheMemoryContext == NULL) > + CreateCacheMemoryContext(); > + > + /* Initialize the hash table. */ > + MemSet(&ctl, 0, sizeof(ctl)); > + ctl.keysize = sizeof(uint32); > + ctl.entrysize = sizeof(LogicalRepRelMapEntry); > + ctl.hcxt = CacheMemoryContext; Wonder if this (and similar code earlier) should try to do everything in a sub-context of CacheMemoryContext instead. That'd make some issues easier to track down.
*/ > + if (!OidIsValid(entry->reloid)) > + { > + Oid nspid; > + Oid relid; > + int i; > + TupleDesc desc; > + LogicalRepRelation *remoterel; > + > + remoterel = >remoterel; > + > + nspid = LookupExplicitNamespace(remoterel->nspname, false); > + if (!OidIsValid(nspid)) > + ereport(FATAL, > + > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("the logical replication target > %s not found", > + > quote_qualified_identifier(remoterel->nspname, remoterel->relname; > + relid = get_relname_relid(remoterel->relname, nspid); > + if (!OidIsValid(relid)) > + ereport(FATAL, > + > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("the logical replication target > %s not found", > + > quote_qualified_identifier(remoterel->nspname, > + >remoterel->relname; > + > + entry->rel = heap_open(relid, lockmode); This seems rather racy. I think this really instead needs something akin to RangeVarGetRelidExtended(). > +/* > + * Executor state preparation for evaluation of constraint expressions, > + * indexes and triggers. > + * > + * This is based on similar code in copy.c > + */ > +static EState * > +create_estate_for_relation(LogicalRepRelMapEntry *rel) > +{ > + EState *estate; > + ResultRelInfo *resultRelInfo; > + RangeTblEntry *rte; > + > + estate = CreateExecutorState(); > + > + rte = makeNode(RangeTblEntry); > + rte->rtekind = RTE_RELATION; > + rte->relid = RelationGetRelid(rel->rel); > + rte->relkind = rel->rel->rd_rel->relkind; > + estate->es_range_table = list_make1(rte); > + > + resultRelInfo = makeNode(ResultRelInfo); > + InitResultRelInfo(resultRelInfo, rel->rel, 1, 0); > + > + estate->es_result_relations = resultRelInfo; > + estate->es_num_result_relations = 1; > + estate->es_result_relation_info = resultRelInfo; > + > + /* Triggers might need a slot */ > + if (resultRelInfo->ri_TrigDesc) > + estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate); > + > + return estate; > +} Ugh, we do this for every single change? That's pretty darn heavy. 
> +/* > + * Check if the local
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 8:47 PM, Robert Haas wrote: > > > I am kind of doubtful about this whole line of investigation because > we're basically trying pretty hard to fix something that I'm not sure > is broken. I do agree that, all other things being equal, the TID > lookups will probably be faster with a bitmap than with a binary > search, but maybe not if the table is large and the number of dead > TIDs is small, because cache efficiency is pretty important. But even > if it's always faster, does TID lookup speed even really matter to > overall VACUUM performance? Claudio's early results suggest that it > might, but maybe that's just a question of some optimization that > hasn't been done yet. Yeah, I wouldn't worry only about lookup speedup, but if it does speed up, that's a bonus. But the bitmaps seem to win even for memory consumption. As theory and experiments both show, at a 10% dead-tuple ratio, bitmaps will win handsomely. In > theory we could even start with the list of TIDs and switch to the > bitmap if the TID list becomes larger than the bitmap would have been, > but I don't know if it's worth the effort. > > Yes, that works too. Or maybe even better because we already know the bitmap size requirements, definitely for the tuples collected so far. We might need to maintain some more stats to further optimise the representation, but that seems like unnecessary detailing at this point. > > On the other hand, if we switch to the bitmap as the ONLY possible > representation, we will lose badly when there are scattered updates - > e.g. 1 deleted tuple every 10 pages. Sure. I never suggested that. What I'd suggested is to switch back to the array representation once we realise bitmaps are not going to work. But I see it's probably better the other way round. > So it seems like we probably > want to have both options.
One tricky part is figuring out how we > switch between them when memory gets tight; we have to avoid bursting > above our memory limit while making the switch. Yes, I was thinking about this problem. Some modelling will be necessary to ensure that we don't go (much) beyond the maintenance_work_mem while switching representation, which probably means you need to do that earlier than necessary. > For instance, one idea to grow memory usage incrementally would be to > store dead tuple information separately for each 1GB segment of the > relation. So we have an array of dead-tuple-representation objects, > one for every 1GB of the relation. If there are no dead tuples in a > given 1GB segment, then this pointer can just be NULL. Otherwise, it > can point to either the bitmap representation (which will take ~4.5MB) > or it can point to an array of TIDs (which will take 6 bytes/TID). > That could handle an awfully wide variety of usage patterns > efficiently; it's basically never worse than what we're doing today, > and when the dead tuple density is high for any portion of the > relation it's a lot better. > > Yes, seems like a good idea. Another idea that I had in mind is to use some sort of indirection map where the bitmap for every block or set of blocks will either be recorded or not, depending on whether a bit is set for the range. If the bitmap exists, the indirection map will give out the offset into the larger bitmap area. Seems similar to what you described. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Sep 14, 2016 at 8:59 PM, Robert Haas wrote: > Sure, but you're testing at *really* high client counts here. Almost > nobody is going to benefit from a 5% improvement at 256 clients. I agree with your point, but here we need to consider one more thing: on head we are gaining ~30% with both approaches. So for comparing these two patches we can consider:

A. Other workloads (one can be as below) -> load on CLogControlLock at commit (exclusive mode) + load on CLogControlLock at transaction status (shared mode). I think we can mix (savepoints + updates).
B. Simplicity of the patch (if both perform almost equally in all practical scenarios).
C. Base it on whichever algorithm seems the winner.

I will try to test these patches with other workloads... > You > need to test 64 clients and 32 clients and 16 clients and 8 clients > and see what happens there. Those cases are a lot more likely than > these stratospheric client counts. I tested with 64 clients as well:
1. On head we are gaining ~15% with both the patches.
2. But group lock vs. granular lock is almost the same.
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Sep 14, 2016 5:18 PM, "Robert Haas" wrote: > > On Wed, Sep 14, 2016 at 8:16 AM, Pavan Deolasee > wrote: > > Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per page > > bitmap size. That's about 36 bytes for an 8K page. IOW if on average there > > are 6 or more dead tuples per page, bitmap will outperform the current > > representation, assuming max allocation for bitmap. If we can use additional > > estimates to restrict the size to a somewhat more conservative value and then > > keep an overflow area, then probably the break-even happens even earlier than > > that. I hope this gives us a good starting point, but let me know if you > > think it's still a wrong approach to pursue. > > Well, it's certainly a bigger change. I think the big concern is that > the amount of memory now becomes fixed based on the table size. So > one problem is that you have to figure out what you're going to do if > the bitmap doesn't fit in maintenance_work_mem. A related problem is > that it might fit but use more memory than before, which could cause > problems for some people. Now on the other hand it could also use > less memory for some people, and that would be good. > > I am kind of doubtful about this whole line of investigation because > we're basically trying pretty hard to fix something that I'm not sure > is broken. I do agree that, all other things being equal, the TID > lookups will probably be faster with a bitmap than with a binary > search, but maybe not if the table is large and the number of dead > TIDs is small, because cache efficiency is pretty important. But even > if it's always faster, does TID lookup speed even really matter to > overall VACUUM performance? Claudio's early results suggest that it > might, but maybe that's just a question of some optimization that > hasn't been done yet. 
> > I'm fairly sure that our number one priority should be to minimize the > number of cases where we need to do multiple scans of the indexes to > stay within maintenance_work_mem. If we're satisfied we've met that > goal, then within that we should try to make VACUUM as fast as > possible with as little memory usage as possible. I'm not 100% sure I > know how to get there, or how much work it's worth expending. In > theory we could even start with the list of TIDs and switch to the > bitmap if the TID list becomes larger than the bitmap would have been, > but I don't know if it's worth the effort. > > /me thinks a bit. > > Actually, I think that probably *is* worthwhile, specifically because > it might let us avoid multiple index scans in cases where we currently > require them. Right now, our default maintenance_work_mem value is > 64MB, which is enough to hold a little over ten million tuples. It's > also large enough to hold a bitmap for a 14GB table. So right now if > you deleted, say, 100 tuples per page you would end up with an index > vacuum cycle for every ~100,000 pages = 800MB, whereas switching to > the bitmap representation for such cases would require only one index > vacuum cycle for every 14GB, more than an order of magnitude > improvement! > > On the other hand, if we switch to the bitmap as the ONLY possible > representation, we will lose badly when there are scattered updates - > e.g. 1 deleted tuple every 10 pages. So it seems like we probably > want to have both options. One tricky part is figuring out how we > switch between them when memory gets tight; we have to avoid bursting > above our memory limit while making the switch. And even if our > memory limit is very high, we want to avoid using memory gratuitously; > I think we should try to grow memory usage incrementally with either > representation. 
> > For instance, one idea to grow memory usage incrementally would be to > store dead tuple information separately for each 1GB segment of the > relation. So we have an array of dead-tuple-representation objects, > one for every 1GB of the relation. If there are no dead tuples in a > given 1GB segment, then this pointer can just be NULL. Otherwise, it > can point to either the bitmap representation (which will take ~4.5MB) > or it can point to an array of TIDs (which will take 6 bytes/TID). > That could handle an awfully wide variety of usage patterns > efficiently; it's basically never worse than what we're doing today, > and when the dead tuple density is high for any portion of the > relation it's a lot better. > > -- > Robert Haas > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers I'd say it's an idea worth pursuing. It's the base idea behind roaring bitmaps, arguably the best overall compressed bitmap implementation.
Re: [HACKERS] [BUGS] BUG #14244: wrong suffix for pg_size_pretty()
Robert Haas writes: > Interesting. I think that our documentation should only describe the > way we use unit suffixes in one central place, but other places (like > pg_size_pretty) could link to that central place. > I don't believe that there is any general unanimity among users or > developers about the question of which suffixes denote units > denominated in units of 2^10 bytes vs. 10^3 bytes. About once a year, > somebody makes an argument that we're doing it wrong, but the evidence > that I've seen is very mixed. So when people say that there is only > one right way to do this and we are not in compliance with that one > right way, I guess I just don't believe it. Not everybody likes the > way we do it, but I am fairly sure that if we change it, we'll make > some currently-unhappy people happy and some currently-happy people > unhappy. And the people who don't care but wanted to preserve > backward compatibility will all be in the latter camp. That's about my position too: I cannot see that changing this is going to make things better to a degree that would justify breaking backwards compatibility. > However, that is not to say that the documentation couldn't be better. +1; your idea above seems sound. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] BUG #14244: wrong suffix for pg_size_pretty()
On Wed, Sep 14, 2016 at 5:22 AM, Thomas Berger wrote: > Today, i found the time to read all the mails in this thread, and i think i > have to explain, why we decided to open a bug for this behavior. > > On Tuesday, 23. August 2016, 13:30:29 Robert Haas wrote: >> J. Random User: I'm having a problem! >> Mailing List: Gee, how big are your tables? >> J. Random User: Here's some pg_size_pretty output. >> Mailing List: Gosh, we don't know what that means, what do you have >> this obscure GUC set to? >> J. Random User: Maybe I'll just give up on SQL and use MongoDB. > > In fact, we had just the other way around. One of our most critical databases > had some extreme bloat. > Some of our internal customers were very confused about the sizes reported by > the database. > This confusion has led to wrong decisions. (And a long discussion about the > choice of DBMS btw) > > I think there is a point missing in this whole discussion, or i just didn't > see it: > > Yeah, the behavior of "kB" is defined in the "postgresql.conf" documentation. > But no _user_ reads this. There is no link or hint in the documentation of > "pg_size_pretty()" [1]. Interesting. I think that our documentation should only describe the way we use unit suffixes in one central place, but other places (like pg_size_pretty) could link to that central place. I don't believe that there is any general unanimity among users or developers about the question of which suffixes denote units denominated in units of 2^10 bytes vs. 10^3 bytes. About once a year, somebody makes an argument that we're doing it wrong, but the evidence that I've seen is very mixed. So when people say that there is only one right way to do this and we are not in compliance with that one right way, I guess I just don't believe it. Not everybody likes the way we do it, but I am fairly sure that if we change it, we'll make some currently-unhappy people happy and some currently-happy people unhappy. 
And the people who don't care but wanted to preserve backward compatibility will all be in the latter camp. However, that is not to say that the documentation couldn't be better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] palloc() too large on pg_buffercache with large shared_buffers
On Wed, Sep 14, 2016 at 12:13 AM, Kouhei Kaigai wrote: > It looks to me like pg_buffercache tries to allocate more than 1GB using > palloc(), when shared_buffers is more than 256GB. > > # show shared_buffers ; > shared_buffers > > 280GB > (1 row) > > # SELECT buffers, d.datname, coalesce(c.relname, '???') > FROM (SELECT count(*) buffers, reldatabase, relfilenode > FROM pg_buffercache group by reldatabase, relfilenode) b >LEFT JOIN pg_database d ON d.oid = b.reldatabase >LEFT JOIN pg_class c ON d.oid = (SELECT oid FROM pg_database > WHERE datname = current_database()) >AND b.relfilenode = pg_relation_filenode(c.oid) >ORDER BY buffers desc; > ERROR: invalid memory alloc request size 1174405120 > > It is a situation to use MemoryContextAllocHuge() instead of palloc(). > Also, it may need back-patching? I guess so. Although it's not very desirable for it to use that much memory, I suppose if you have a terabyte of shared_buffers you probably have 4GB of memory on top of that to show what they contain. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Sep 14, 2016 at 12:55 AM, Dilip Kumar wrote: > 2. Results > ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql > scale factor: 300 >
> Clients   head(tps)   grouplock(tps)   granular(tps)
> -------   ---------   --------------   -------------
> 128       29367       39326            37421
> 180       29777       37810            36469
> 256       28523       37418            35882
>
> grouplock --> 1) Group mode to reduce CLOGControlLock contention > granular --> 2) Use granular locking model > > I will test with 3rd approach also, whenever I get time. > > 3. Summary: > 1. I can see on head we are gaining almost ~30 % performance at higher > client count (128 and beyond). > 2. group lock is ~5% better compared to granular lock. Sure, but you're testing at *really* high client counts here. Almost nobody is going to benefit from a 5% improvement at 256 clients. You need to test 64 clients and 32 clients and 16 clients and 8 clients and see what happens there. Those cases are a lot more likely than these stratospheric client counts. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas wrote: > > I am kind of doubtful about this whole line of investigation because > we're basically trying pretty hard to fix something that I'm not sure > is broken. I do agree that, all other things being equal, the TID > lookups will probably be faster with a bitmap than with a binary > search, but maybe not if the table is large and the number of dead > TIDs is small, because cache efficiency is pretty important. But even > if it's always faster, does TID lookup speed even really matter to > overall VACUUM performance? Claudio's early results suggest that it > might, but maybe that's just a question of some optimization that > hasn't been done yet. FYI, the reported impact was on CPU time, not runtime. There was no significant difference in runtime (real time), because my test is heavily I/O bound. I tested with a few small tables and there was no significant difference either, but small tables don't stress the array lookup anyway so that's expected. But on the assumption that some systems may be CPU bound during vacuum (particularly those able to do more than 300-400MB/s sequential I/O), in those cases the increased or decreased cost of lazy_tid_reaped will directly correlate to runtime. It's just none of my systems, which all run on Amazon and are heavily bandwidth-constrained (the fastest I/O subsystem I can get my hands on does 200MB/s). -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Push down more full joins in postgres_fdw
On Tue, Sep 13, 2016 at 11:38 PM, Ashutosh Bapat wrote: > On Tue, Sep 13, 2016 at 10:28 PM, Robert Haas wrote: >> On Tue, Sep 6, 2016 at 9:07 AM, Ashutosh Bapat >> wrote: >>> That's not true with the alias information. As long as we detect which >>> relations need subqueries, their RTIs are enough to create unique aliases >>> e.g. if a relation involves RTIs 1, 2, 3 corresponding subquery can have >>> alias r123 without conflicting with any other alias. >> >> What if RTI 123 is also used in the query? > > Good catch. I actually meant some combination of 1, 2 and 3, which is > unique for a join between r1, r2 and r3. How about r1_2_3 or > r1_r2_r3? Sure, something like that can work, but if you have a big enough join the identifier might get too long. I'm not sure why it wouldn't work to just use the lowest RTI involved in the join, though; the others won't appear anywhere else at that query level. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
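The aliasing schemes under discussion can be sketched in a few lines (a plain-Python illustration with a hypothetical helper name, not postgres_fdw code):

```python
def subquery_aliases(rtis):
    """Three candidate alias schemes for a pushed-down join subquery."""
    rtis = sorted(rtis)
    ambiguous = "r" + "".join(map(str, rtis))  # "r123" - collides with a plain RTI 123
    joined = "r" + "_".join(map(str, rtis))    # "r1_2_3" - unique, but long for wide joins
    lowest = "r%d" % rtis[0]                   # Robert's suggestion: lowest RTI alone,
                                               # since the member RTIs appear nowhere
                                               # else at the same query level
    return ambiguous, joined, lowest

print(subquery_aliases({1, 2, 3}))  # ('r123', 'r1_2_3', 'r1')
```

The joined form illustrates the length concern: for a join over many relations it grows linearly with the number of RTIs, while the lowest-RTI form stays short.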
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 8:16 AM, Pavan Deolasee wrote: > Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per page > bitmap size. That's about 36 bytes for 8K page. IOW if on an average there > are 6 or more dead tuples per page, bitmap will outperform the current > representation, assuming max allocation for bitmap. If we can use additional > estimates to restrict the size to somewhat more conservative value and then > keep overflow area, then probably the break-even happens even earlier than > that. I hope this gives us a good starting point, but let me know if you > think it's still a wrong approach to pursue. Well, it's certainly a bigger change. I think the big concern is that the amount of memory now becomes fixed based on the table size. So one problem is that you have to figure out what you're going to do if the bitmap doesn't fit in maintenance_work_mem. A related problem is that it might fit but use more memory than before, which could cause problems for some people. Now on the other hand it could also use less memory for some people, and that would be good. I am kind of doubtful about this whole line of investigation because we're basically trying pretty hard to fix something that I'm not sure is broken. I do agree that, all other things being equal, the TID lookups will probably be faster with a bitmap than with a binary search, but maybe not if the table is large and the number of dead TIDs is small, because cache efficiency is pretty important. But even if it's always faster, does TID lookup speed even really matter to overall VACUUM performance? Claudio's early results suggest that it might, but maybe that's just a question of some optimization that hasn't been done yet. I'm fairly sure that our number one priority should be to minimize the number of cases where we need to do multiple scans of the indexes to stay within maintenance_work_mem.
If we're satisfied we've met that goal, then within that we should try to make VACUUM as fast as possible with as little memory usage as possible. I'm not 100% sure I know how to get there, or how much work it's worth expending. In theory we could even start with the list of TIDs and switch to the bitmap if the TID list becomes larger than the bitmap would have been, but I don't know if it's worth the effort. /me thinks a bit. Actually, I think that probably *is* worthwhile, specifically because it might let us avoid multiple index scans in cases where we currently require them. Right now, our default maintenance_work_mem value is 64MB, which is enough to hold a little over ten million tuples. It's also large enough to hold a bitmap for a 14GB table. So right now if you deleted, say, 100 tuples per page you would end up with an index vacuum cycle for every ~100,000 pages = 800MB, whereas switching to the bitmap representation for such cases would require only one index vacuum cycle for every 14GB, more than an order of magnitude improvement! On the other hand, if we switch to the bitmap as the ONLY possible representation, we will lose badly when there are scattered updates - e.g. 1 deleted tuple every 10 pages. So it seems like we probably want to have both options. One tricky part is figuring out how we switch between them when memory gets tight; we have to avoid bursting above our memory limit while making the switch. And even if our memory limit is very high, we want to avoid using memory gratuitously; I think we should try to grow memory usage incrementally with either representation. For instance, one idea to grow memory usage incrementally would be to store dead tuple information separately for each 1GB segment of the relation. So we have an array of dead-tuple-representation objects, one for every 1GB of the relation. If there are no dead tuples in a given 1GB segment, then this pointer can just be NULL.
Otherwise, it can point to either the bitmap representation (which will take ~4.5MB) or it can point to an array of TIDs (which will take 6 bytes/TID). That could handle an awfully wide variety of usage patterns efficiently; it's basically never worse than what we're doing today, and when the dead tuple density is high for any portion of the relation it's a lot better. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
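The figures in this proposal check out against a 6-byte TID and a ~36-byte per-page bitmap (a rough sketch; the exact constants vary slightly with build options):

```python
TID_BYTES = 6                # ItemPointerData: 4-byte block + 2-byte offset
BITMAP_BYTES_PER_PAGE = 36   # ~MaxHeapTuplesPerPage/8 for the default 8K page
mwm = 64 * 1024**2           # default maintenance_work_mem, 64MB

tids_held = mwm // TID_BYTES                  # "a little over ten million tuples"
pages_covered = mwm // BITMAP_BYTES_PER_PAGE  # pages a 64MB bitmap can cover
gb_covered = pages_covered * 8192 / 1024**3   # ~14.2 -> "a bitmap for a 14GB table"
seg_bitmap = (1024**3 // 8192) * BITMAP_BYTES_PER_PAGE  # per-1GB-segment bitmap
print(tids_held, round(gb_covered, 1), seg_bitmap)      # 11184810 14.2 4718592
```

The last number is the "~4.5MB" bitmap per 1GB segment: 131,072 pages times 36 bytes is exactly 4.5MiB.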
Re: [HACKERS] Printing bitmap objects in the debugger
Tom Lane wrote: > Ashutosh Bapat writes: > > While working on partition-wise join, I had to examine Relids objects > > many times. Printing the Bitmapset::words[] in binary format and then > > inferring the relids takes time and is error prone. > > FWIW, I generally rely on pprint() to look at planner data structures > from the debugger. Also: http://blog.pgaddict.com/posts/making-debugging-with-gdb-a-bit-easier This may not address the issue directly, but it's probably very helpful. -- Álvaro Herrera   https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Printing bitmap objects in the debugger
Ashutosh Bapat writes: > While working on partition-wise join, I had to examine Relids objects > many times. Printing the Bitmapset::words[] in binary format and then > inferring the relids takes time and is error prone. FWIW, I generally rely on pprint() to look at planner data structures from the debugger. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Printing bitmap objects in the debugger
Alvaro Herrera writes: > I don't understand. Why don't you just use "call pprint(the bitmapset)" > in the debugger? Bitmapsets aren't Nodes, so pprint doesn't work directly on them. I usually find that I can pprint some node containing the value(s) I'm interested in, but maybe that isn't working for Ashutosh's particular case. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: Covering + unique indexes.
One more update. I added ORDER BY clause to regression tests. It was done as a separate bugfix patch by Tom Lane some time ago, but it definitely should be included into the patch. -- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c index d4f9090..99735ce 100644 --- a/contrib/dblink/dblink.c +++ b/contrib/dblink/dblink.c @@ -100,7 +100,7 @@ static remoteConn *getConnectionByName(const char *name); static HTAB *createConnHash(void); static void createNewConnection(const char *name, remoteConn *rconn); static void deleteConnection(const char *name); -static char **get_pkey_attnames(Relation rel, int16 *numatts); +static char **get_pkey_attnames(Relation rel, int16 *indnkeyatts); static char **get_text_array_contents(ArrayType *array, int *numitems); static char *get_sql_insert(Relation rel, int *pkattnums, int pknumatts, char **src_pkattvals, char **tgt_pkattvals); static char *get_sql_delete(Relation rel, int *pkattnums, int pknumatts, char **tgt_pkattvals); @@ -1483,7 +1483,7 @@ PG_FUNCTION_INFO_V1(dblink_get_pkey); Datum dblink_get_pkey(PG_FUNCTION_ARGS) { - int16 numatts; + int16 indnkeyatts; char **results; FuncCallContext *funcctx; int32 call_cntr; @@ -1509,7 +1509,7 @@ dblink_get_pkey(PG_FUNCTION_ARGS) rel = get_rel_from_relname(PG_GETARG_TEXT_P(0), AccessShareLock, ACL_SELECT); /* get the array of attnums */ - results = get_pkey_attnames(rel, ); + results = get_pkey_attnames(rel, ); relation_close(rel, AccessShareLock); @@ -1529,9 +1529,9 @@ dblink_get_pkey(PG_FUNCTION_ARGS) attinmeta = TupleDescGetAttInMetadata(tupdesc); funcctx->attinmeta = attinmeta; - if ((results != NULL) && (numatts > 0)) + if ((results != NULL) && (indnkeyatts > 0)) { - funcctx->max_calls = numatts; + funcctx->max_calls = indnkeyatts; /* got results, keep track of them */ funcctx->user_fctx = results; @@ -2021,10 +2021,10 @@ dblink_fdw_validator(PG_FUNCTION_ARGS) * 
get_pkey_attnames * * Get the primary key attnames for the given relation. - * Return NULL, and set numatts = 0, if no primary key exists. + * Return NULL, and set indnkeyatts = 0, if no primary key exists. */ static char ** -get_pkey_attnames(Relation rel, int16 *numatts) +get_pkey_attnames(Relation rel, int16 *indnkeyatts) { Relation indexRelation; ScanKeyData skey; @@ -2034,8 +2034,8 @@ get_pkey_attnames(Relation rel, int16 *numatts) char **result = NULL; TupleDesc tupdesc; - /* initialize numatts to 0 in case no primary key exists */ - *numatts = 0; + /* initialize indnkeyatts to 0 in case no primary key exists */ + *indnkeyatts = 0; tupdesc = rel->rd_att; @@ -2056,12 +2056,12 @@ get_pkey_attnames(Relation rel, int16 *numatts) /* we're only interested if it is the primary key */ if (index->indisprimary) { - *numatts = index->indnatts; - if (*numatts > 0) + *indnkeyatts = index->indnkeyatts; + if (*indnkeyatts > 0) { -result = (char **) palloc(*numatts * sizeof(char *)); +result = (char **) palloc(*indnkeyatts * sizeof(char *)); -for (i = 0; i < *numatts; i++) +for (i = 0; i < *indnkeyatts; i++) result[i] = SPI_fname(tupdesc, index->indkey.values[i]); } break; diff --git a/contrib/tcn/tcn.c b/contrib/tcn/tcn.c index 7352b29..c024cf3 100644 --- a/contrib/tcn/tcn.c +++ b/contrib/tcn/tcn.c @@ -138,9 +138,9 @@ triggered_change_notification(PG_FUNCTION_ARGS) /* we're only interested if it is the primary key and valid */ if (index->indisprimary && IndexIsValid(index)) { - int numatts = index->indnatts; + int indnkeyatts = index->indnkeyatts; - if (numatts > 0) + if (indnkeyatts > 0) { int i; @@ -150,7 +150,7 @@ triggered_change_notification(PG_FUNCTION_ARGS) appendStringInfoCharMacro(payload, ','); appendStringInfoCharMacro(payload, operation); -for (i = 0; i < numatts; i++) +for (i = 0; i < indnkeyatts; i++) { int colno = index->indkey.values[i]; diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml index 322d8d6..0983c93 100644 --- 
a/doc/src/sgml/catalogs.sgml +++ b/doc/src/sgml/catalogs.sgml @@ -3564,6 +3564,14 @@ pg_class.relnatts) + + indnkeyatts + int2 + + The number of key columns in the index. "Key columns" are ordinary + index columns in contrast with "included" columns. + + indisunique bool diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml index 40f201b..92bef4a 100644 --- a/doc/src/sgml/indexam.sgml +++ b/doc/src/sgml/indexam.sgml @@ -110,6 +110,8 @@ typedef struct IndexAmRoutine boolamclusterable; /* does AM handle predicate locks? */ boolampredlocks; +/* does AM support columns included with clause INCLUDING? */ +boolamcaninclude; /* type of data stored in index, or
Re: [HACKERS] Printing bitmap objects in the debugger
Ashutosh Bapat wrote: > Hi All, > While working on partition-wise join, I had to examine Relids objects > many times. Printing the Bitmapset::words[] in binary format and then > inferring the relids takes time and is error prone. Instead I wrote a > function bms_to_char() which allocates a StringInfo, calls > outBitmapset() to decode Bitmapset as a set of integers and returns > the string. In order to examine a Relids object all one has to do is > execute 'p bms_to_char(bms_object) under gdb. I don't understand. Why don't you just use "call pprint(the bitmapset)" in the debugger? -- Álvaro Herrera   https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
Rahila Syed writes: >> Looking at the other variables hooks, they already emit errors and can >> deny the effect of a change corresponding to a new value, without >> informing the caller. Why would autocommit be different? > These instances where psql_error occurs inside hooks the command is > successful and the value supplied by user is reinterpreted to some other > value as user had supplied an unrecognisable value. > With psql_error_on_autocommit patch what was intended was to make > the command unsuccessful and keep the previous setting of autocommit. > Hence having it inside autocommit_hook did not seem appropriate to me. Nonetheless, asking all callers of SetVariable to deal with such cases is entirely unmaintainable/unacceptable. Have you considered expanding the API for hook functions? I'm not really sure why we didn't provide a way for the hooks to reject a setting to begin with. Actually, it would make a lot more sense UI-wise if attempting to assign a non-boolean value to a boolean variable resulted in an error and no change to the variable, instead of what happens now. Anyway, I'm not very thrilled with the idea that AUTOCOMMIT is so special that it should have a different behavior than any other built-in psql variable. If we make them all throw errors and refuse to change to bad values, that would be consistent and defensible IMO. But having AUTOCOMMIT alone act that way is not a feature, it's a wart. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
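The two hook behaviors being contrasted above can be modeled in a few lines (a plain-Python sketch of the semantics, not psql's actual C hook signatures):

```python
def echo_hook(newval):
    """Model of psql's current hooks: unrecognized values are warned about
    and reinterpreted, but the assignment itself never fails."""
    valid = {"queries", "errors", "all", "none"}
    return newval if newval in valid else "none"  # variable still displays newval

def autocommit_hook(newval, current):
    """Model of the expanded API Tom suggests: a hook may reject the change,
    leaving the variable at its previous value."""
    if newval.lower() not in {"on", "off", "true", "false", "1", "0"}:
        return False, current   # refuse: keep the old setting
    return True, newval

print(echo_hook("FOOBAR"))              # 'none' (yet \set shows ECHO = 'FOOBAR')
print(autocommit_hook("FOOBAR", "on"))  # (False, 'on') - change rejected
```

The second function is the behavior Tom argues should apply uniformly to all boolean psql variables, rather than to AUTOCOMMIT alone.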
Re: [HACKERS] condition variables
On Tue, Sep 13, 2016 at 10:55 PM, Peter Geoghegan wrote: > On Thu, Aug 11, 2016 at 2:47 PM, Robert Haas wrote: >> Another approach to the problem is to use a latch wait loop. That >> almost works. Interrupts can be serviced, and you can recheck shared >> memory to see whether the condition for proceeding is satisfied after >> each iteration of the loop. There's only one problem: when you do >> something that might cause the condition to be satisfied for other >> waiting backends, you need to set their latch - but you don't have an >> easy way to know exactly which processes are waiting, so how do you >> call SetLatch? I originally thought of adding a function like >> SetAllLatches(ParallelContext *) and maybe that can work, but then I >> had what I think is a better idea, which is to introduce a notion of >> condition variables. > > I don't see a CF entry for this. Are you planning to work on this > again soon, Robert? > > I have an eye on this patch due to my work on parallel CREATE INDEX. > It would be nice to have some rough idea of when you intend to commit > this. I basically figured I would commit it when and if it became clear that it'd get good use in some other patch which was on the road to being committed. I don't think it needs much work, just the assurance that it will get some use. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] sequences and pg_upgrade
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: tested, passed Documentation: tested, failed Thank you for the patch. As I see there are no objections in the discussion, all the patches look clear. Could you clarify, please, why do we dump sequence data in schema-only mode? + if (dopt.schemaOnly && dopt.sequence_data) + getSequenceData(, tblinfo, numTables, dopt.oids); Example: postgres=# create table t(i serial, data text); postgres=# insert into t(data) values ('aaa'); pg_dump -d postgres --sequence-data --schema-only > ../reviews/dump_pg Then restore it into newdb and add a new value. newdb=# insert into t(data) values ('aaa'); INSERT 0 1 newdb=# select * from t;
 i | data
---+------
 2 | aaa
I'm not an experienced user, but I thought that while doing dump/restore of schema of database we reset all the data. Why should the table in a newly created (via pg_restore) database have a non-default sequence value? I also did some other tests and all of them were passed. One more thing to do is the documentation for the new option. You should update the help() function in pg_dump.c and also add some notes to pg_dump.sgml and probably to pgupgrade.sgml. The new status of this patch is: Waiting on Author -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Surprising behaviour of \set AUTOCOMMIT ON
Hello, > Looking at the other variables hooks, they already emit errors and can > deny the effect of a change corresponding to a new value, without > informing the caller. Why would autocommit be different? The only type of psql_error inside hooks is as follows, psql_error("unrecognized value \"%s\" for \"%s\"; assuming \"%s\"\n", newval, "ECHO", "none"); These instances where psql_error occurs inside hooks the command is successful and the value supplied by user is reinterpreted to some other value as user had supplied an unrecognisable value. With psql_error_on_autocommit patch what was intended was to make the command unsuccessful and keep the previous setting of autocommit. Hence having it inside autocommit_hook did not seem appropriate to me. But as pointed out by you, the other way of setting autocommit, i.e. *SELECT 'on' as "AUTOCOMMIT" \gset*, will not be handled by the patch. So I will change the patch to have the check in the autocommit_hook instead. This will mean if *\set AUTOCOMMIT ON* or *SELECT 'on' as "AUTOCOMMIT" \gset * is run inside a transaction, it will be effective after the current transaction has ended. An appropriate message will be displayed notifying this to the user, and the user need not rerun the set AUTOCOMMIT command. Thank you, Rahila Syed On Thu, Sep 8, 2016 at 9:55 PM, Daniel Verite wrote: > Rahila Syed wrote: > > > However, including the check here will require returning the status > > of command from this hook. i.e if we throw error inside this > > hook we will need to return false indicating the value has not changed. > > Looking at the other variables hooks, they already emit errors and can > deny the effect of a change corresponding to a new value, without > informing the caller. Why would autocommit be different? > > For example echo_hook does this: > > /* ...if the value is in (queries,errors,all,none) then > assign pset.echo accordingly ...
*/ > else > { > psql_error("unrecognized value \"%s\" for \"%s\"; assuming \"%s\"\n", > newval, "ECHO", "none"); > pset.echo = PSQL_ECHO_NONE; > } > > > If the user issues \set ECHO FOOBAR, then it produces the above error > message and makes the same internal change as if \set ECHO none > had been issued. > > But, the value of the variable as seen by the user is still FOOBAR: > > \set > [...other variables...] > ECHO = 'FOOBAR' > > The proposed patch doesn't follow that behavior, as it denies > a new value for AUTOCOMMIT. You might argue that it's better, > but on the other side, there are two reasons against it: > > - it's not consistent with the other uses of \set that affect psql, > such as ECHO, ECHO_HIDDEN, ON_ERROR_ROLLBACK, > COMP_KEYWORD_CASE... and even AUTOCOMMIT as > \set AUTOCOMMIT FOOBAR is not denied, just reinterpreted. > > - it doesn't address the case of another method than \set being used > to set the variable, as in : SELECT 'on' as "AUTOCOMMIT" \gset > whereas the hook would work in that case. > > Best regards, > -- > Daniel Vérité > PostgreSQL-powered mailer: http://www.manitou-mail.org > Twitter: @DanielVerite >
Re: [HACKERS] kqueue
Hi, On 14/09/2016 00:06, Tom Lane wrote: I'm inclined to think the kqueue patch is worth applying just on the grounds that it makes things better on OS X and doesn't seem to hurt on FreeBSD. Whether anyone would ever get to the point of seeing intra-kernel contention on these platforms is hard to predict, but we'd be ahead of the curve if so. It would be good for someone else to reproduce my results though. For one thing, 5%-ish is not that far above the noise level; maybe what I'm measuring here is just good luck from relocation of critical loops into more cache-line-friendly locations. FWIW, I've tested HEAD vs patch on a 2-cpu low end NetBSD 7.0 i386 machine. HEAD: 1890/1935/1889 tps kqueue: 1905/1957/1932 tps no weird surprises, and basically no differences either. Cheers -- Matteo Beccati Development & Consulting - http://www.beccati.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 5:32 PM, Robert Haas wrote: > On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee > wrote: > > Another interesting bit about these small tables is that the largest used > > offset for these tables never went beyond 291 which is the value of > > MaxHeapTuplesPerPage. I don't know if there is something that prevents > > inserting more than MaxHeapTuplesPerPage offsets per heap page and I > don't > > know at this point if this gives us upper limit for bits per page (may > be it > > does). > > From PageAddItemExtended: > > /* Reject placing items beyond heap boundary, if heap */ > if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage) > { > elog(WARNING, "can't put more than MaxHeapTuplesPerPage items > in a heap page"); > return InvalidOffsetNumber; > } > > Also see the comment where MaxHeapTuplesPerPage is defined: > > * Note: with HOT, there could theoretically be more line pointers (not > actual > * tuples) than this on a heap page. However we constrain the number of > line > * pointers to this anyway, to avoid excessive line-pointer bloat and not > * require increases in the size of work arrays. > > Ah, thanks. So MaxHeapTuplesPerPage sets the upper boundary for the per page bitmap size. That's about 36 bytes for an 8K page. IOW if on an average there are 6 or more dead tuples per page, bitmap will outperform the current representation, assuming max allocation for bitmap. If we can use additional estimates to restrict the size to somewhat more conservative value and then keep overflow area, then probably the break-even happens even earlier than that. I hope this gives us a good starting point, but let me know if you think it's still a wrong approach to pursue. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
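The break-even Pavan describes can be sketched numerically (constants as stated in the thread; ceil(291/8) is actually 37 bytes, which Pavan rounds to "about 36"):

```python
import math

MAX_HEAP_TUPLES_PER_PAGE = 291  # for the default 8K page
TID_BYTES = 6                   # current representation: 6 bytes per dead TID

bitmap_bytes = math.ceil(MAX_HEAP_TUPLES_PER_PAGE / 8)  # one bit per offset
break_even = bitmap_bytes / TID_BYTES                   # dead tuples per page
print(bitmap_bytes, round(break_even, 1))               # 37 6.2
```

So with a full-size per-page bitmap, roughly 6 or more dead tuples per page is the point where the bitmap becomes smaller than the sorted TID array.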
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 5:45 AM, Pavan Deolasee wrote: > Another interesting bit about these small tables is that the largest used > offset for these tables never went beyond 291 which is the value of > MaxHeapTuplesPerPage. I don't know if there is something that prevents > inserting more than MaxHeapTuplesPerPage offsets per heap page and I don't > know at this point if this gives us upper limit for bits per page (may be it > does). From PageAddItemExtended: /* Reject placing items beyond heap boundary, if heap */ if ((flags & PAI_IS_HEAP) != 0 && offsetNumber > MaxHeapTuplesPerPage) { elog(WARNING, "can't put more than MaxHeapTuplesPerPage items in a heap page"); return InvalidOffsetNumber; } Also see the comment where MaxHeapTuplesPerPage is defined: * Note: with HOT, there could theoretically be more line pointers (not actual * tuples) than this on a heap page. However we constrain the number of line * pointers to this anyway, to avoid excessive line-pointer bloat and not * require increases in the size of work arrays. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Printing bitmap objects in the debugger
On Wed, Sep 14, 2016 at 3:46 PM, Pavan Deolasee wrote: > > > lately I'm using LVM debugger (which probably does not have something > equivalent), > And I was so clueless about lldb's powerful scripting interface. For example, you can write something like this in bms_utils.py: import lldb def print_bms_members (bms): words = bms.GetChildMemberWithName("words") nwords = int(bms.GetChildMemberWithName("nwords").GetValue()) ret = 'nwords = {0} bitmap: '.format(nwords,) for i in range(0, nwords): ret += hex(int(words.GetChildAtIndex(0, lldb.eNoDynamicValues, True).GetValue())) return ret And then do this while attached to lldb debugger: Process 99659 stopped * thread #1: tid = 0x59ba69, 0x0001090b012f postgres`bms_add_member(a=0x7fe60a0351f8, x=10) + 15 at bitmapset.c:673, queue = 'com.apple.main-thread', stop reason = breakpoint 1.1 frame #0: 0x0001090b012f postgres`bms_add_member(a=0x7fe60a0351f8, x=10) + 15 at bitmapset.c:673 670 int wordnum, 671 bitnum; 672 -> 673 if (x < 0) 674 elog(ERROR, "negative bitmapset member not allowed"); 675 if (a == NULL) 676 return bms_make_singleton(x); (lldb) script Python Interactive Interpreter. To exit, type 'quit()', 'exit()' or Ctrl-D. >>> from bms_utils import * >>> bms = lldb.frame.FindVariable ("a") >>> print print_bms_members(bms) nwords = 1 bitmap: 0x200 The complete API reference is available here http://lldb.llvm.org/python_reference/index.html Looks like an interesting SoC project to write useful lldb/gdb scripts to print internal structures for ease of debugging :-) Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
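Independent of which debugger is used, the words[] decoding itself is simple. A hedged pure-Python helper (the word size is assumed to be 64 bits here; PostgreSQL's BITS_PER_BITMAPWORD is 32 or 64 depending on the build):

```python
def bms_members(words, bits_per_word=64):
    """Decode a Bitmapset words[] array into the set of member integers,
    the same mapping outBitmapset() prints."""
    members = set()
    for wordnum, word in enumerate(words):
        for bit in range(bits_per_word):
            if word & (1 << bit):
                members.add(wordnum * bits_per_word + bit)
    return members

print(bms_members([0x200]))  # {9} - decodes the 'nwords = 1 bitmap: 0x200' above
```

A function like this is the kind of thing the proposed lldb/gdb scripts would wrap, turning raw hex words into the set-of-relids view the thread is asking for.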
Re: [HACKERS] Write Ahead Logging for Hash Indexes
On Wed, Sep 14, 2016 at 4:36 PM, Ashutosh Sharma wrote: > Hi All, > > Below is the backtrace for the issue reported in my earlier mail [1]. > From the callstack it looks like we are trying to release lock on a > meta page twice in _hash_expandtable(). > Thanks for the call stack. I think the below code in the patch is the culprit. Here we have already released the meta page lock and then again on failure, we are trying to release it. _hash_expandtable() { .. /* Release the metapage lock, before completing the split. */ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK); .. if (!buf_nblkno) { _hash_relbuf(rel, buf_oblkno); goto fail; } .. fail: /* We didn't write the metapage, so just drop lock */ _hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK); } This is a problem with the concurrent hash index patch. I will send the fix in the next version of the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
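The control-flow bug Amit describes can be modeled in a few lines (a plain-Python analogue, not the PostgreSQL LWLock code): the metapage lock is dropped before completing the split, and the failure path then drops it again, which is what trips the assertion inside LWLockRelease in the backtrace.

```python
class Lock:
    """Toy lock that, like LWLockRelease, errors on release-when-not-held."""
    def __init__(self):
        self.held = False
    def acquire(self):
        self.held = True
    def release(self):
        if not self.held:
            raise RuntimeError("lock is not held")
        self.held = False

meta = Lock()
meta.acquire()
meta.release()           # released before completing the split (HASH_NOLOCK)
buf_nblkno_ok = False    # model: allocating the new bucket page failed
try:
    if not buf_nblkno_ok:
        meta.release()   # "fail:" path releases the metapage lock again
except RuntimeError as e:
    print("double release:", e)
```

The fix, as Amit notes, is to make the failure path skip the second release when the lock has already been dropped.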
Re: [HACKERS] Hash Indexes
On Wed, Sep 14, 2016 at 12:29 AM, Jesper Pedersen wrote: > On 09/13/2016 07:26 AM, Amit Kapila wrote: >> >> Attached, new version of patch which contains the fix for problem >> reported on write-ahead-log of hash index thread [1]. >> > > I have been testing patch in various scenarios, and it has a positive > performance impact in some cases. > > This is especially seen in cases where the values of the indexed column are > unique - SELECTs can see a 40-60% benefit over a similar query using b-tree. > Here, I think it is better if we have the data comparing the situation of hash index with respect to HEAD as well. What I mean to say is that you are claiming that after the hash index improvements SELECT workload is 40-60% better, but where do we stand as of HEAD? > UPDATE also sees an improvement. > Can you explain this more? Is it more compared to HEAD or more as compared to Btree? Isn't this contradictory to what the test in the below mail shows? > In cases where the indexed column value isn't unique, it takes a long time > to build the index due to the overflow page creation. > > Also in cases where the index column is updated with a high number of > clients, a la > > -- ddl.sql -- > CREATE TABLE test AS SELECT generate_series(1, 10) AS id, 0 AS val; > CREATE INDEX IF NOT EXISTS idx_id ON test USING hash (id); > CREATE INDEX IF NOT EXISTS idx_val ON test USING hash (val); > ANALYZE; > > -- test.sql -- > \set id random(1,10) > \set val random(0,10) > BEGIN; > UPDATE test SET val = :val WHERE id = :id; > COMMIT; > > w/ 100 clients - it takes longer than the b-tree counterpart (2921 tps for > hash, and 10062 tps for b-tree). > Thanks for doing the tests. Have you applied both the concurrent index and cache-the-meta-page patches for these tests? So from the above tests, we can say that after these sets of patches read-only workloads will be significantly improved, even better than btree in quite a few useful cases.
However when the indexed column is updated, there is still a large gap as compare to btree (what about the case when the indexed column is not updated in read-write transaction as in our pgbench read-write transactions, by any chance did you ran any such test?). I think we need to focus on improving cases where index columns are updated, but it is better to do that work as a separate patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi All,

Below is the backtrace for the issue reported in my earlier mail [1]. From the call stack it looks like we are trying to release the lock on a meta page twice in _hash_expandtable().

(gdb) bt
#0  0x007b01cf in LWLockRelease (lock=0x7f55f59d0570) at lwlock.c:1799
#1  0x0078990c in LockBuffer (buffer=2, mode=0) at bufmgr.c:3540
#2  0x004a5d3c in _hash_chgbufaccess (rel=0x7f5605b608c0, buf=2, from_access=1, to_access=-1) at hashpage.c:331
#3  0x004a722d in _hash_expandtable (rel=0x7f5605b608c0, metabuf=2) at hashpage.c:995
#4  0x004a316a in _hash_doinsert (rel=0x7f5605b608c0, itup=0x2ba5030) at hashinsert.c:313
#5  0x004a0d85 in hashinsert (rel=0x7f5605b608c0, values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", ht_ctid=0x2c2e4f4, heapRel=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at hash.c:248
#6  0x004c5a16 in index_insert (indexRelation=0x7f5605b608c0, values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", heap_t_ctid=0x2c2e4f4, heapRelation=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at indexam.c:204
#7  0x0063f2d5 in ExecInsertIndexTuples (slot=0x2b9a2c0, tupleid=0x2c2e4f4, estate=0x2b99c00, noDupErr=0 '\000', specConflict=0x0, arbiterIndexes=0x0) at execIndexing.c:388
#8  0x0066a1fc in ExecInsert (mtstate=0x2b99e50, slot=0x2b9a2c0, planSlot=0x2b9a2c0, arbiterIndexes=0x0, onconflict=ONCONFLICT_NONE, estate=0x2b99c00, canSetTag=1 '\001') at nodeModifyTable.c:481
#9  0x0066b841 in ExecModifyTable (node=0x2b99e50) at nodeModifyTable.c:1496
#10 0x00645e51 in ExecProcNode (node=0x2b99e50) at execProcnode.c:396
#11 0x006424d9 in ExecutePlan (estate=0x2b99c00, planstate=0x2b99e50, use_parallel_mode=0 '\000', operation=CMD_INSERT, sendTuples=0 '\000', numberTuples=0, direction=ForwardScanDirection, dest=0xd73c20) at execMain.c:1567
#12 0x0064087a in standard_ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at execMain.c:338
#13 0x7f55fe590605 in pgss_ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#14 0x00640751 in ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at execMain.c:284
#15 0x007c331e in ProcessQuery (plan=0x2b873f0, sourceText=0x2b89bd0 "insert into foo (index) select $1 from generate_series(1,1)", params=0x2b89c20, dest=0xd73c20, completionTag=0x7ffdf540a3f0 "") at pquery.c:187
#16 0x007c4a0c in PortalRunMulti (portal=0x2ae7930, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xd73c20, altdest=0xd73c20, completionTag=0x7ffdf540a3f0 "") at pquery.c:1303
#17 0x007c4055 in PortalRun (portal=0x2ae7930, count=9223372036854775807, isTopLevel=1 '\001', dest=0x2b5dc90, altdest=0x2b5dc90, completionTag=0x7ffdf540a3f0 "") at pquery.c:815
#18 0x007bfb45 in exec_execute_message (portal_name=0x2b5d880 "", max_rows=9223372036854775807) at postgres.c:1977
#19 0x007c25c7 in PostgresMain (argc=1, argv=0x2b05df8, dbname=0x2aca3c0 "postgres", username=0x2b05de0 "edb") at postgres.c:4133
#20 0x00744d8f in BackendRun (port=0x2aefa60) at postmaster.c:4260
#21 0x00744523 in BackendStartup (port=0x2aefa60) at postmaster.c:3934
#22 0x00740f9c in ServerLoop () at postmaster.c:1691
#23 0x00740623 in PostmasterMain (argc=4, argv=0x2ac81f0) at postmaster.c:1299
#24 0x006940fe in main (argc=4, argv=0x2ac81f0) at main.c:228

Please let me know if you need any further inputs.

[1]- https://www.postgresql.org/message-id/CAE9k0Pmxh-4NAr4GjzDDFHdBKDrKy2FV-Z%2B2Tp8vb2Kmxu%3D6zg%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 14, 2016 at 2:45 PM, Ashutosh Sharma wrote:
> Hi All,
>
> I am getting the following error when running the test script shared by
> Jeff [1]. The error is observed upon executing the test script for
> around 3-4 hrs.
> > 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR: lock > buffer_content 1 is not held > 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT: insert into > foo (index) select $1 from generate_series(1,1) > > 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR: lock > buffer_content 10 is not held > 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT: insert > into foo (index) select $1 from generate_series(1,1) > > 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR: lock > buffer_content 10 is not held > 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT: insert > into foo (index) select $1 from generate_series(1,1) > > [1]- > https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com > > Please note that i am performing the test on latest patch for > concurrent hash index and WAL log in hash index shared by Amit > yesterday. > > > With Regards, > Ashutosh Sharma > EnterpriseDB:
Re: [HACKERS] pgbench - allow to store select results into variables
Hi Fabien, On 2016/09/13 17:41, Fabien COELHO wrote: > > Hello Amit, > >> [...] >> There still seems to be a change in behavior of the -r option due to the >> patch. Consider the following example: >> >> select a from a where a = 1 \; >> select a+1 from a where a = 1; >> ... >> - statement latencies in milliseconds: >> 2.889 select a from a where a = 1 ; > > vs > >> select a from a where a = 1 \; \into a >> select a+1 from a where a = 1; \into b >> ... >> 2.093 select a from a where a = 1 ; select a+1 from a where a = 1; > > Yep. > > Note that there is a small logical conumdrum in this argument: As the > script is not the same, especially as into was not possible before, > strictly speaking there is no behavior "change". Sure, scripts are not the same but it seemed like showing the whole compound query whereas previously only part of it was shown may be an implementation artifact of \into. > This said, what you suggest can be done. > > After giving it some thought, I suggest that it is not needed nor > desirable. If you want to achieve the initial effect, you just have to put > the "into a" on the next line: > > select a from a where a = 1 \; > \into a > select a+1 from a where a = 1; \into b > > Then you would get the -r cut at the end of the compound command. Thus the > current version gives full control of what will appear in the summary. If > I change "\into xxx\n" to mean "also cut here", then there is less control > on when the cut occurs when into is used. So it means that position of \into affects where a compound command gets cut for -r display. I was just wondering if that was unintentional. 
>> One more thing I observed which I am not sure if it's a fault of this
>> patch is illustrated below:
>>
>> $ cat /tmp/into.sql
>> \;
>> select a from a where a = 1 \;
>> select a+1 from a where a = 1;
>>
>> $ pgbench -r -n -t 1 -f /tmp/into.sql postgres
>>
>> - statement latencies in milliseconds:
>> 2.349 ;
>>
>> Note that the compound select statement is nowhere to be seen in the
>> latencies output. The output remains the same even if I use the \into's.
>> What seems to be going on is that the empty statement on the first line
>> (\;) is the only part kept of the compound statement spanning lines 1-3.
>
> Yes.
>
> This is really the (debatable) current behavior, and is not affected by
> the patch. The "-r" summary takes the first line of the command, whatever
> it is. In your example the first line is "\;", so you get what you asked
> for, even if it looks rather strange, obviously.

Yep, perhaps it's strange to write a script like that anyway. :)

+bool sql_command_in_progress = false;

>> [...]
>> I understood that it refers to what you explain here. But to me it
>> sounded like the name is referring to the progress of *execution* of a SQL
>> command whereas the code in question is simply expecting to finish
>> *parsing* the SQL command using the next lines.
>
> Ok. I changed it to "sql_command_lexing_in_progress".
>
>>> The attached patch takes into account all your comments but:
>>> - comment about discarded results...
>>> - the sql_command_in_progress variable name change
>>> - the longer message on into at the start of a script
>>
>> The patch seems fine without these, although please consider the concern I
>> raised with regard to the -r option above.
>
> I have considered it. As the legitimate behavior you suggested can be
> achieved just by putting the into on the next line, ISTM that the current
> proposition gives more control than doing a mandatory cut when into is used.
>
> Attached is a new version with the boolean renaming.

Thanks.
> The other thing I have considered is whether to implemented a "\gset" > syntax, as suggested by Pavel and Tom. Bar the aesthetic, the main issue I > have with it is that it does not work with compound commands, and what I > want is to get the values out of compound commands... because of my focus > on latency... so basically "\gset" does not do the job I want... Now I > recognize that other people would like it, so probably I'll do it anyway > in another patch. So, psql's \gset does not work as desired for compound commands (viz. it is only able to save the result of the last sub-command). You need to use \gset with each sub-command separately if no result should be discarded. Because of undesirable latency characteristics of sending multiple commands, you want to be able to modify compound command handling such that every sub-command's result could be saved. In that regard, it's good that pgbench does not use PQexec() which only returns the result of the last sub-command if a compound command was issued through it. pgbench's doCustom() currently discards all results by issuing discard_response(). You propose to change things such that results are intercepted in a new function read_response(), values assigned to intovars corresponding to each sub-command, and then discarded. The intovars
[HACKERS] Error running custom plugin: “output plugins have to declare the _PG_output_plugin_init symbol”
Hi,

I'm kind of new to Postgres and I'm trying to create a custom output plugin for logical replication (Postgres is version 9.5.4, and I'm building the project on a Windows 8 64-bit machine, the same machine where the db is installed). I started from the code of the sample test_decoding, and I was trying to simply rebuild it under a new name and install it into Postgres to see if the module works. Once done, I will start modifying the code. My project is built in Visual Studio 2013, and the only step I took was to copy the built assembly under the Postgres lib folder. When I run the command:

postgres=# SELECT * FROM pg_create_logical_replication_slot('regression_slot', 'my_decoding');

I get an error saying that "output plugins have to declare the _PG_output_plugin_init symbol". In my code, I have the functions declared as extern:

extern void _PG_init(void);
extern void _PG_output_plugin_init(OutputPluginCallbacks *cb);

and the body of the function is defined somewhere below it. The actual code is just a copy of the test_decoding sample: https://github.com/postgres/postgres/blob/REL9_5_STABLE/contrib/test_decoding/test_decoding.c

I'm not sure if the deployment process is ok (just copying the dll) or if there is some other step to take. Can anyone shed some light? thanks
Re: [HACKERS] Printing bitmap objects in the debugger
On Wed, Sep 14, 2016 at 3:43 PM, Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:
> Hi All,
> While working on partition-wise join, I had to examine Relids objects
> many times. Printing the Bitmapset::words[] in binary format and then
> inferring the relids takes time and is error prone. Instead I wrote a
> function bms_to_char() which allocates a StringInfo, calls
> outBitmapset() to decode the Bitmapset as a set of integers and returns
> the string. In order to examine a Relids object, all one has to do is
> execute 'p bms_to_char(bms_object)' under gdb.

Can we not do this as gdb macros? My knowledge is rusty in this area, and lately I'm using the LLDB debugger (which probably does not have something equivalent), but I believe gdb allows you to write useful macros. As a bonus, you can then use them even for inspecting core files.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [PATCH] SortSupport for macaddr type
On 26/08/2016 19:44, Brandur wrote: > Hello, > Hello, > I've attached a patch to add SortSupport for Postgres' macaddr which has the > effect of improving the performance of sorting operations for the type. The > strategy that I employ is very similar to that for UUID, which is to create > abbreviated keys by packing as many bytes from the MAC address as possible > into > Datums, and then performing fast unsigned integer comparisons while sorting. > > I ran some informal local benchmarks, and for cardinality greater than 100k > rows, I see a speed up on `CREATE INDEX` of roughly 25% to 65%. (For those > interested, I put a few more numbers into a small report here [2].) > That's a nice improvement! > Admittedly, this is not quite as useful as speeding up sorting on a more > common > data type like TEXT or UUID, but the change still seems like a useful > performance improvement. I largely wrote it as an exercise to familiarize > myself with the Postgres codebase. > > I'll add an entry into the current commitfest as suggested by the Postgres > Wiki > and follow up here with a link. > > Thanks, and if anyone has feedback or other thoughts, let me know! > I just reviewed your patch. It applies and compiles cleanly, and the abbrev feature works as intended. There's not much to say since this is heavily inspired on the uuid SortSupport. The only really specific part is in the abbrev_converter function, and I don't see any issue with it. I have a few trivial comments: * you used macaddr_cmp_internal() function name, for uuid the same function is named uuid_internal_cmp(). Using the same naming pattern is probably better. * the function comment on macaddr_abbrev_convert() doesn't mention specific little-endian handling * "There will be two bytes of zero padding on the least significant end" "least significant bits" would be better * This patch will trigger quite a lot modifications after a pgindent run. 
Could you try to run pgindent on mac.c before sending an updated patch?

Best regards.

--
Julien Rouhaud
http://dalibo.com - http://dalibo.org
[HACKERS] Printing bitmap objects in the debugger
Hi All,
While working on partition-wise join, I had to examine Relids objects many times. Printing the Bitmapset::words[] in binary format and then inferring the relids takes time and is error prone. Instead I wrote a function bms_to_char() which allocates a StringInfo, calls outBitmapset() to decode the Bitmapset as a set of integers and returns the string. In order to examine a Relids object, all one has to do is execute 'p bms_to_char(bms_object)' under gdb.

Is there a way this can be included in the code? If it's available in the code, developers don't have to apply the patch and compile it for debugging.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 29b7712..b4cae11 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -93,20 +93,22 @@ outNode(str, node->fldname))
 
 /* Write a bitmapset field */
 #define WRITE_BITMAPSET_FIELD(fldname) \
 	(appendStringInfo(str, " :" CppAsString(fldname) " "), \
 	 _outBitmapset(str, node->fldname))
 
 #define booltostr(x) ((x) ? "true" : "false")
 
+char *bms_to_char(const Bitmapset *bms);
+
 /*
  * _outToken
  *	  Convert an ordinary string (eg, an identifier) into a form that
  *	  will be decoded back to a plain token by read.c's functions.
  *
  *	  If a null or empty string is given, it is encoded as "<>".
  */
 static void
 _outToken(StringInfo str, const char *s)
@@ -204,20 +206,31 @@ _outBitmapset(StringInfo str, const Bitmapset *bms)
 }
 
 /* for use by extensions which define extensible nodes */
 void
 outBitmapset(StringInfo str, const Bitmapset *bms)
 {
 	_outBitmapset(str, bms);
 }
 
 /*
+ * TODO: remove, used for debugging through gdb.
+ */
+char *
+bms_to_char(const Bitmapset *bms)
+{
+	StringInfo str = makeStringInfo();
+	outBitmapset(str, bms);
+	return str->data;
+}
+
+/*
  * Print the value of a Datum given its type.
  */
 void
 outDatum(StringInfo str, Datum value, int typlen, bool typbyval)
 {
 	Size		length,
 				i;
 	char	   *s;
 
 	length = datumGetSize(value, typbyval, typlen);
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
On Wed, Sep 14, 2016 at 8:47 AM, Pavan Deolasee wrote:
>
> Sawada-san offered to reimplement the patch based on what I proposed
> upthread. In the new scheme of things, we will allocate a fixed-size bitmap
> of length 2D bits per page, where D is the average page density of live + dead
> tuples. (The rationale behind multiplying D by a factor of 2 is to consider
> the worst-case scenario where every tuple also has an LP_REDIRECT line
> pointer.) The value of D in most real-world, large tables should not go
> much beyond, say, 100, assuming 80-byte-wide tuples and 8K blocksize. That
> translates to about 25 bytes/page. So all TIDs with offset less than 2D can
> be represented by a single bit. We augment this with an overflow area to track
> tuples which fall outside this limit. I believe this area will be small,
> say 10% of the total allocation.

So I cooked up the attached patch to track the number of live/dead tuples found at offsets 1 to MaxOffsetNumber. The idea was to see how many tuples actually go beyond the threshold of 2D offsets per page. Note that I am proposing to track 2D offsets via the bitmap and the rest via a regular TID array. So I ran a pgbench test for 2 hrs with scale factor 500; the autovacuum scale factor was set to 0.1 or 10%.
Some interesting bits:

postgres=# select relname, n_tup_ins, n_tup_upd, n_tup_hot_upd, n_live_tup, n_dead_tup, pg_relation_size(relid)/8192 as relsize, (n_live_tup+n_dead_tup)/(pg_relation_size(relid)/8192) as density from pg_stat_user_tables;

     relname      | n_tup_ins | n_tup_upd | n_tup_hot_upd | n_live_tup | n_dead_tup | relsize | density
------------------+-----------+-----------+---------------+------------+------------+---------+---------
 pgbench_tellers  |      5000 |  95860289 |      87701578 |       5000 |          0 |    3493 |       1
 pgbench_branches |       500 |  95860289 |      94158081 |        967 |          0 |    1544 |       0
 pgbench_accounts |      5000 |  95860289 |      93062567 |   51911657 |    3617465 |  865635 |      64
 pgbench_history  |  95860289 |         0 |             0 |   95258548 |          0 |  610598 |     156
(4 rows)

The smaller tellers and branches tables bloat so much that the density as computed by live + dead tuples falls close to 1 tuple/page. So for such tables, the idea of 2D bits/page will fail miserably. But I think these tables are the worst representatives, and I would be extremely surprised if we ever find a very large table bloated so much. But even then, this probably tells us that we can't rely solely on the density measure.

Another interesting bit about these small tables is that the largest used offset never went beyond 291, which is the value of MaxHeapTuplesPerPage. I don't know if there is something that prevents inserting more than MaxHeapTuplesPerPage offsets per heap page, and I don't know at this point if this gives us an upper limit for bits per page (maybe it does).

For the pgbench_accounts table, the maximum offset used was 121, though the bulk of the used offsets were at the start of the page (see attached graph). Now the test did not create enough dead tuples to trigger autovacuum on pgbench_accounts, so I ran a manual vacuum at the end. (There are about 5% dead tuples in the table by the time the test finished.)

postgres=# VACUUM VERBOSE pgbench_accounts;
INFO: vacuuming "public.pgbench_accounts"
INFO: scanned index "pgbench_accounts_pkey" to remove 2797722 row versions
DETAIL: CPU 0.00s/9.39u sec elapsed 9.39 sec.
INFO: "pgbench_accounts": removed 2797722 row versions in 865399 pages
DETAIL: CPU 0.10s/7.01u sec elapsed 7.11 sec.
INFO: index "pgbench_accounts_pkey" now contains 5000 row versions in 137099 pages
DETAIL: 2797722 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "pgbench_accounts": found 852487 removable, 5000 nonremovable row versions in 865635 out of 865635 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 802256 unused item pointers.
Skipped 0 pages due to buffer pins.
0 pages are entirely empty.
CPU 0.73s/27.20u sec elapsed 27.98 sec.

[attached graph: tuple count at each offset (offnum:all_tuples:dead_tuples)]

For 2797722 dead line pointers, the current representation would have used 2797722 x 6 = 16786332 bytes of memory. The most optimal bitmap would have used 121 bits/page x 865399 pages = 13089159 bytes, whereas if we had provisioned 2D bits/page, assuming D = 64 based on the above calculation, we would have used 13846384 bytes of memory. This is about 18% less than the current representation. Of course, we would have allocated some space for the overflow region, which will make the difference smaller/negligible. But the bitmaps would be extremely cheap to look up during index scans. Now maybe I got lucky, maybe I did not run the tests long enough (though I believe that may have worked in favour of the bitmap), maybe mostly HOT-updated tables are not good candidates for
Re: [HACKERS] [BUGS] BUG #14244: wrong suffix for pg_size_pretty()
Today, I found the time to read all the mails in this thread, and I think I have to explain why we decided to open a bug for this behavior.

On Tuesday, 23 August 2016, 13:30:29, Robert Haas wrote:
> J. Random User: I'm having a problem!
> Mailing List: Gee, how big are your tables?
> J. Random User: Here's some pg_size_pretty output.
> Mailing List: Gosh, we don't know what that means, what do you have
> this obscure GUC set to?
> J. Random User: Maybe I'll just give up on SQL and use MongoDB.

In fact, we had it just the other way around. One of our most critical databases had some extreme bloat. Some of our internal customers were very confused about the sizes reported by the database. This confusion led to wrong decisions. (And a long discussion about the choice of DBMS, btw.)

I think there is a point missing in this whole discussion, or I just didn't see it: yes, the behavior of "kB" is defined in the "postgresql.conf" documentation. But no _user_ reads this. There is no link or hint in the documentation of "pg_size_pretty()" [1].

[1] https://www.postgresql.org/docs/9.5/static/functions-admin.html#FUNCTIONS-ADMIN-DBSIZE

--
Thomas Berger
PostgreSQL DBA, Database Operations
1&1 Telecommunication SE | Ernst-Frey-Straße 10 | 76135 Karlsruhe | Germany
Phone: +49 721 91374-6566
E-Mail: thomas.ber...@1und1.de | Web: www.1und1.de
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi All, I am getting following error when running the test script shared by Jeff -[1] . The error is observed upon executing the test script for around 3-4 hrs. 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR: lock buffer_content 1 is not held 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT: insert into foo (index) select $1 from generate_series(1,1) 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR: lock buffer_content 10 is not held 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT: insert into foo (index) select $1 from generate_series(1,1) 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR: lock buffer_content 10 is not held 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT: insert into foo (index) select $1 from generate_series(1,1) [1]- https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com Please note that i am performing the test on latest patch for concurrent hash index and WAL log in hash index shared by Amit yesterday. With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersenwrote: > On 09/13/2016 07:41 AM, Amit Kapila wrote: >>> >>> README: >>> +in_complete split flag. The reader algorithm works correctly, as it >>> will >>> scan >>> >>> What flag ? >>> >> >> in-complete-split flag which indicates that split has to be finished >> for that particular bucket. The value of these flags are >> LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for new and old >> bucket respectively. It is set in hasho_flag in special area of page. >> I have slightly expanded the definition in README, but not sure if it >> is good idea to mention flags defined in hash.h. Let me know, if still >> it is unclear or you want some thing additional to be added in README. >> > > I think it is better now. 
> >>> hashxlog.c: >>> >>> hash_xlog_move_page_contents >>> hash_xlog_squeeze_page >>> >>> Both have "bukcetbuf" (-> "bucketbuf"), and >>> >>> + if (BufferIsValid(bukcetbuf)); >>> >>> -> >>> >>> + if (BufferIsValid(bucketbuf)) >>> >>> and indent following line: >>> >>> LockBufferForCleanup(bukcetbuf); >>> >>> hash_xlog_delete >>> >>> has the "if" issue too. >>> >> >> Fixed all the above cosmetic issues. >> > > I meant there is an extra ';' on the "if" statements: > > + if (BufferIsValid(bukcetbuf)); <-- > > in hash_xlog_move_page_contents, hash_xlog_squeeze_page and > hash_xlog_delete. > > > Best regards, > Jesper > > > > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WAL consistency check facility
On Wed, Sep 14, 2016 at 11:31 AM, Michael Paquier wrote:
> On Wed, Sep 14, 2016 at 2:56 PM, Kuntal Ghosh wrote:
>> Master
>> ---
>> - If the wal_consistency check is enabled or needs_backup is set in
>> XLogRecordAssemble(), we do a fpw.
>> - If a fpw is to be done, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
>> which in turn sets the has_image flag while decoding the record.
>> - If a fpw needs to be restored during redo, i.e., needs_backup is true,
>> then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.
>
> Here that should be if wal_consistency is true, no?

Nope. I'll try to explain using some pseudo-code:

XLogRecordAssemble()
{
    include_image = needs_backup || wal_consistency[rmid];
    if (include_image)
    {
        set XLogRecordBlockHeader.fork_flags |= BKPBLOCK_HAS_IMAGE;
        if (needs_backup)
            set XLogRecordBlockImageHeader.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
    }
    ...
}

XLogReadBufferForRedoExtended()
{
    ...
    if (XLogRecHasBlockImage() && XLogRecHasBlockImageForRedo())
    {
        RestoreBlockImage();
        return BLK_RESTORED;
    }
    ...
}

checkConsistency()
{
    if (wal_consistency[rmid] && !XLogRecHasBlockImage())
        throw error;
    ...
}

* XLogRecHasBlockImageForRedo() checks whether bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.

For a backup image, any of the following is possible:
1. consistency should be checked.
2. the page should be restored.
3. both 1 and 2.

The consistency check can be controlled by a GUC parameter. But the standby must be told whether an image should be restored; for that, we have used the new flag. Suggestions?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Declarative partitioning - another take
I have continued testing declarative partitioning with the latest patch and have some more observations, given below.

Observation 1: Getting an overlap error with START ... EXCLUSIVE in range partitioning.

create table test_range_bound (a int) partition by range(a);

-- creating a partition to contain records {1,2,3,4}; by default 1 is inclusive and 5 is exclusive
create table test_range_bound_p1 partition of test_range_bound for values start (1) end (5);

-- now trying to create a partition with an explicitly exclusive start to contain records {5,6,7}:
-- with START (4) EXCLUSIVE the range should be 5 to 8, but we get a partition overlap error.
create table test_range_bound_p2 partition of test_range_bound for values start (4) EXCLUSIVE end (8);
ERROR: partition "test_range_bound_p2" would overlap partition "test_range_bound_p1"

Observation 2: able to create a sub-partition outside the range set for the main table, making it impossible to insert data that satisfies any of the partitions.

create table test_subpart (c1 int) partition by range (c1);
create table test_subpart_p1 partition of test_subpart for values start (1) end (100) inclusive partition by range (c1);
create table test_subpart_p1_sub1 partition of test_subpart_p1 for values start (101) end (200);

\d+ test_subpart
                      Table "public.test_subpart"
 Column |  Type   | Modifiers | Storage | Stats target | Description
--------+---------+-----------+---------+--------------+-------------
 c1     | integer |           | plain   |              |
Partition Key: RANGE (c1)
Partitions: test_subpart_p1 FOR VALUES START (1) END (100) INCLUSIVE

\d+ test_subpart_p1
                     Table "public.test_subpart_p1"
 Column |  Type   | Modifiers | Storage | Stats target | Description
--------+---------+-----------+---------+--------------+-------------
 c1     | integer |           | plain   |              |
Partition Of: test_subpart FOR VALUES START (1) END (100) INCLUSIVE
Partition Key: RANGE (c1)
Partitions: test_subpart_p1_sub1 FOR VALUES START (101) END (200)

insert into test_subpart values (50);
ERROR: no partition of relation "test_subpart_p1" found for row
DETAIL: Failing row contains (50).

insert into test_subpart values (150);
ERROR: no partition of relation "test_subpart" found for row
DETAIL: Failing row contains (150).

Observation 3: Getting "cache lookup failed" when selecting from a list-partitioned table containing an array.

CREATE TABLE test_array (i int, j int[], k text[]) PARTITION BY LIST (j);
CREATE TABLE test_array_p1 PARTITION OF test_array FOR VALUES IN ('{1}');
CREATE TABLE test_array_p2 PARTITION OF test_array FOR VALUES IN ('{2,2}');
INSERT INTO test_array (i,j[1],k[1]) VALUES (1,1,1);
INSERT INTO test_array (i,j[1],j[2],k[1]) VALUES (2,2,2,2);

postgres=# SELECT tableoid::regclass,* FROM test_array_p1;
   tableoid    | i |  j  |  k
---------------+---+-----+-----
 test_array_p1 | 1 | {1} | {1}
(1 row)

postgres=# SELECT tableoid::regclass,* FROM test_array_p2;
   tableoid    | i |   j   |  k
---------------+---+-------+-----
 test_array_p2 | 2 | {2,2} | {2}
(1 row)

postgres=# SELECT tableoid::regclass,* FROM test_array;
ERROR: cache lookup failed for type 0

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation

On Fri, Sep 9, 2016 at 2:25 PM, Amit Langote wrote:
> On 2016/09/06 22:04, Amit Langote wrote:
> > Will fix.
>
> Here is an updated set of patches.
>
> In addition to fixing a couple of bugs reported by Ashutosh and Rajkumar,
> there are a few major changes:
>
> * change the way individual partition bounds are represented internally
>   and the way a collection of partition bounds associated with a
>   partitioned table is exposed to other modules. Especially list
>   partition bounds, which are manipulated more efficiently as discussed
>   at [1].
>
> * \d partitioned_table now shows partition count and \d+ lists partition
>   names and their bounds as follows:
>
> \d t6
>        Table "public.t6"
>  Column |       Type        | Modifiers
> --------+-------------------+-----------
>  a      | integer           |
>  b      | character varying |
> Partition Key: LIST (a)
> Number of partitions: 3 (Use \d+ to list them.)
> \d+ t6
>                            Table "public.t6"
>  Column |       Type        | Modifiers | Storage  | Stats target | Description
> --------+-------------------+-----------+----------+--------------+-------------
>  a      | integer           |           | plain    |              |
>  b      | character varying |           | extended |              |
> Partition Key: LIST (a)
> Partitions: t6_p1 FOR VALUES IN (1, 2, NULL),
>             t6_p2 FOR VALUES IN (4, 5),
>             t6_p3 FOR VALUES IN (3, 6)
>
> \d+ p
>                  Table "public.p"
>  Column | Type | Modifiers | Storage | Stats target | Description
>
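For what it's worth, the two range-bound observations can be modeled by normalizing each bound spec to a half-open integer interval. This is only an illustrative sketch (the helper functions here are made up, not code from the partitioning patch), but it shows why START (4) EXCLUSIVE END (8) should not overlap START (1) END (5), and why the sub-partition's range falls entirely outside its parent's:

```python
# Illustrative sketch only: these helpers model the bound specs in the
# report above; they are not code from the partitioning patch.

def normalize(start, start_inclusive, end, end_inclusive):
    """Map a range bound spec to the half-open integer interval [lo, hi)."""
    lo = start if start_inclusive else start + 1
    hi = end + 1 if end_inclusive else end
    return lo, hi

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

# Observation 1: START (1) END (5) vs START (4) EXCLUSIVE END (8)
p1 = normalize(1, True, 5, False)   # -> (1, 5), i.e. {1,2,3,4}
p2 = normalize(4, False, 8, False)  # -> (5, 8), i.e. {5,6,7}
print(overlaps(p1, p2))             # False: no overlap, so the error looks wrong

# Observation 2: parent END (100) INCLUSIVE vs child START (101) END (200)
parent = normalize(1, True, 100, True)    # -> (1, 101)
child = normalize(101, True, 200, False)  # -> (101, 200)
print(contains(parent, child))            # False: child lies outside the parent
```

Under this model, observation 1 is a false overlap and observation 2 is a missing containment check.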
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
On Wed, Sep 14, 2016 at 10:25 AM, Dilip Kumar wrote:
> I have tested performance with approach 1 and approach 2.
>
> 1. Transaction (script.sql): I have used the transaction below to run my
> benchmark. We can argue that this may not be an ideal workload, but I
> tested this to put more load on ClogControlLock during transaction
> commit.
>
> ---
> \set aid random (1,3000)
> \set tid random (1,3000)
>
> BEGIN;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> SAVEPOINT s1;
> SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
> SAVEPOINT s2;
> SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
> END;
> ---
>
> 2. Results
> ./pgbench -c $threads -j $threads -T 10 -M prepared postgres -f script.sql
> scale factor: 300
>
> Clients   head(tps)   grouplock(tps)   granular(tps)
> -------   ---------   --------------   -------------
> 128       29367       39326            37421
> 180       29777       37810            36469
> 256       28523       37418            35882
>
> grouplock --> 1) Group mode to reduce CLOGControlLock contention
> granular  --> 2) Use granular locking model
>
> I will test with the 3rd approach also, whenever I get time.
>
> 3. Summary:
> 1. Compared to head, we are gaining almost ~30% performance at higher
> client counts (128 and beyond).
> 2. Group lock is ~5% better compared to granular lock.

Forgot to mention that this test is on unlogged tables.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
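As a quick sanity check, the summary percentages can be recomputed from the TPS table above (this just redoes the arithmetic on the quoted figures; it does not re-run the benchmark):

```python
# Arithmetic check of the summary against the TPS table above.
# Figures are copied from the quoted results; nothing is re-measured.

results = {
    # clients: (head, grouplock, granular)
    128: (29367, 39326, 37421),
    180: (29777, 37810, 36469),
    256: (28523, 37418, 35882),
}

for clients, (head, grouplock, granular) in sorted(results.items()):
    gain_over_head = 100.0 * (grouplock - head) / head
    gain_over_granular = 100.0 * (grouplock - granular) / granular
    print(f"{clients} clients: grouplock +{gain_over_head:.1f}% vs head, "
          f"+{gain_over_granular:.1f}% vs granular")
```

This matches the summary: roughly 27-34% over head and 4-5% over granular across the client counts.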
Re: [HACKERS] Proposal: speeding up GIN build with parallel workers
On 01/17/2016 10:03 PM, Jeff Janes wrote:

On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan wrote:

On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan wrote:

I have a draft implementation which divides the whole process between N parallel workers, see the patch attached. Instead of a full scan of the relation, I give each worker a range of blocks to read.

I am currently working on a patch that allows B-Tree index builds to be performed in parallel. I think I'm a week or two away from posting it. Even without parallelism, wouldn't it be better if GIN indexes were built using tuplesort? I know way, way less about the gin AM than the nbtree AM, but I imagine that a prominent cost for GIN index builds is constructing the main B-Tree (the one that's constructed over key values) itself. Couldn't tuplesort.c be adapted to cover this case? That would be much faster in general, particularly with the recent addition of abbreviated keys, while also leaving a clear path forward to performing the build in parallel.

I think it would take a lot of changes to tuplesort to make this an almost-always win. In the general case each GIN key occurs in many tuples, and the in-memory rbtree is good at compressing the TID list for each key to maximize the amount of data that can be in memory (although perhaps it could be even better, as it doesn't use varbyte encoding on the TID list the way the posting lists on disk do -- it could do so in the bulk-build case, where it receives the TIDs in order, but not feasibly in the pending-list clean-up case, where it doesn't get them in order).

When I was testing building an index on a column of text identifiers where there were a couple million identifiers occurring a few dozen times each, building it as a gin index (using btree_gin so that plain text columns could be indexed) was quite a bit faster than building it as a regular btree index using tuplesort.
I didn't really investigate this difference, but I assume it was due to the better memory usage leading to less I/O.

I've been wondering about this too, using tuplesort to build GIN indexes, for a long time. Surely the highly-optimized quicksort+merge code in tuplesort.c is faster than maintaining a red-black tree? Or so I thought.

I wrote a quick prototype of using tuplesort.c for GIN index build. I tested it with a 500 MB table of integer arrays, so that the sort fit completely in memory. It's a lot slower than the current code. It turns out that eliminating the duplicates early is really, really important. So we want to keep the red-black tree, to eliminate the duplicates. Or add that capability to tuplesort.c, which might speed up Sort+Unique and aggregates in general, but that's a big effort.

However, I still wonder if we shouldn't use a merge approach when the tree doesn't fit in memory, like tuplesort.c does. Currently, when the tree is full, we flush it out to the index by inserting all the entries. That can get quite expensive, I/O-wise. It also generates more WAL, compared to writing each page only once. If we flushed the tree to a tape instead, then we could perhaps use the machinery that Peter's parallel B-tree patch is adding to tuplesort.c to merge the tapes. I'm not sure if that works out, but I think it's worth some experimentation.

But I do think this with aggregation would be worth it despite the amount of work involved. In addition to automatically benefiting from parallel sorts and any other future sort improvements, I think it would also generate better indexes.
I have a theory that a problem with very large gin indexes is that repeatedly building maintenance_work_mem worth of rbtree entries and then merging them to disk yields highly fragmented btrees, in which logically adjacent key space does not occupy physically nearby leaf pages, which can then cause problems either with access that follows a pattern (like range scans, except gin indexes can't really do those anyway) or with further bulk operations.

Yeah, there's that too.

- Heikki
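The duplicate-elimination point above is easy to see with a toy model (illustrative Python only, not the GIN or tuplesort code): when each key occurs in many tuples, an rbtree-style structure that collapses duplicates into a per-key posting list holds far fewer entries than a sort fed every (key, TID) pair:

```python
# Toy model of why early duplicate elimination matters for GIN builds.
# Not PostgreSQL code: sizes and shapes are invented for illustration.

from collections import defaultdict

n_keys = 1000       # distinct index keys
tids_per_key = 50   # each key occurs in many heap tuples

pairs = [(k, tid) for k in range(n_keys) for tid in range(tids_per_key)]

# tuplesort-style: keep every (key, tid) pair, sort, group afterwards
sorted_pairs = sorted(pairs)
print(len(sorted_pairs))   # 50000 items held before any grouping

# rbtree-style: collapse duplicates on insertion into key -> posting list
tree = defaultdict(list)
for k, tid in pairs:
    tree[k].append(tid)
print(len(tree))           # 1000 entries, one per distinct key
```

The 50x reduction here is exactly the kind of in-memory compression the red-black tree buys before anything is spilled to disk, on top of any encoding of the posting lists themselves.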
Re: [HACKERS] kqueue
On Wed, Sep 14, 2016 at 3:32 PM, Tom Lane wrote:
> Michael Paquier writes:
>> From an OSX laptop with -S, -c 1 and -M prepared (9 runs, removed the
>> three best and three worst):
>> - HEAD: 9356/9343/9369
>> - HEAD + patch: 9433/9413/9461.071168
>> This laptop has a lot of I/O overhead... Still, there is a slight
>> improvement here as well. Looking at the progress report, per-second
>> TPS gets more frequently into 9500~9600 TPS with the patch. So
>> at least I am seeing something.
>
> Which OSX version exactly?

El Capitan, 10.11.6. With -s 20 (300MB) and 1GB of shared_buffers, so that everything is in memory. Actually, re-running the tests now with no VMs around and no apps, I am getting close to 9650~9700 TPS with the patch, and 9300~9400 TPS on HEAD, so that's unlikely to be only noise.

--
Michael
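A back-of-the-envelope check on those figures (using only the TPS ranges quoted above) shows the gap between the patched and unpatched runs:

```python
# Rough bounds on the kqueue patch gain, from the quoted TPS ranges:
# patched 9650~9700 TPS vs HEAD 9300~9400 TPS. Just arithmetic on the
# numbers in the message, not a new measurement.

head_lo, head_hi = 9300, 9400
patch_lo, patch_hi = 9650, 9700

worst_case_gain = 100.0 * (patch_lo - head_hi) / head_hi  # ranges closest
best_case_gain = 100.0 * (patch_hi - head_lo) / head_lo   # ranges farthest

print(f"gain between {worst_case_gain:.1f}% and {best_case_gain:.1f}%")
```

Even in the worst case the patched range sits a couple of percent above HEAD's, consistent with "that's unlikely to be only noise".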
Re: [HACKERS] kqueue
Michael Paquier writes:
> From an OSX laptop with -S, -c 1 and -M prepared (9 runs, removed the
> three best and three worst):
> - HEAD: 9356/9343/9369
> - HEAD + patch: 9433/9413/9461.071168
> This laptop has a lot of I/O overhead... Still, there is a slight
> improvement here as well. Looking at the progress report, per-second
> TPS gets more frequently into 9500~9600 TPS with the patch. So
> at least I am seeing something.

Which OSX version exactly?

regards, tom lane
Re: [HACKERS] WAL consistency check facility
On Wed, Sep 14, 2016 at 2:56 PM, Kuntal Ghosh wrote:
> Master
> ---
> - If the wal_consistency check is enabled or needs_backup is set in
> XLogRecordAssemble(), we do an FPW.
> - If an FPW is to be done, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
> which in turn sets the has_image flag while decoding the record.
> - If an FPW needs to be restored during redo, i.e., needs_backup is true,
> then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.

Here that should be if wal_consistency is true, no?

> Standby
> ---
> - In XLogReadBufferForRedoExtended(), if both XLogRecHasBlockImage() and
> XLogRecHasBlockImageForRedo() (added by me*) return true, we restore the
> backup image.
> - In checkConsistency, we only check if XLogRecHasBlockImage() returns true
> when the wal_consistency check is enabled for this rmid.

My guess would have been that you do not need to check anymore for
wal_consistency in checkConsistency, making the GUC value only used on the
master node.

> *XLogRecHasBlockImageForRedo() checks whether bimg_info is set with
> BKPIMAGE_IS_REQUIRED_FOR_REDO.

Yes, that's more or less what you should have.

--
Michael
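To make the flag handling above concrete, here is a hypothetical sketch of the logic as Kuntal describes it. The numeric flag values are invented for illustration; only the flag and function names come from the thread, and note the open question about whether the redo-required condition should instead key off wal_consistency:

```python
# Hypothetical sketch of the described flag logic; flag values are made up.
# Kuntal's version sets BKPIMAGE_IS_REQUIRED_FOR_REDO only when needs_backup
# is true, so an image attached purely for the consistency check is carried
# in WAL but not replayed.

BKPBLOCK_HAS_IMAGE = 0x10             # invented value: block carries an FPW
BKPIMAGE_IS_REQUIRED_FOR_REDO = 0x04  # invented value: image must be replayed

def assemble_flags(needs_backup, wal_consistency):
    """Master: attach an FPW if either condition holds; mark it as required
    for redo only when it is a genuine backup image."""
    fork_flags, bimg_info = 0, 0
    if needs_backup or wal_consistency:
        fork_flags |= BKPBLOCK_HAS_IMAGE
    if needs_backup:
        bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO
    return fork_flags, bimg_info

def restore_image_at_redo(fork_flags, bimg_info):
    """Standby: restore only images flagged as required for redo."""
    return bool(fork_flags & BKPBLOCK_HAS_IMAGE) and \
           bool(bimg_info & BKPIMAGE_IS_REQUIRED_FOR_REDO)

# An image attached purely for the consistency check is not replayed:
ff, bi = assemble_flags(needs_backup=False, wal_consistency=True)
print(restore_image_at_redo(ff, bi))  # False
```

Under this scheme the standby can still locate the image for the consistency check via the HAS_IMAGE bit, while redo ignores it, which is the distinction the two new macros are meant to express.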