Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load
On Oct24, 2011, at 01:27 , Simon Riggs wrote: FATAL: could not access status of transaction 21110784 which, in pg_subtrans, is the first xid on a new subtrans page. So we have missed zeroing a page. pg_control shows ... Latest checkpoint's oldestActiveXID: 2111 which shows quite clearly that the pg_control file is later than it should be. But shouldn't pg_control be largely irrelevant in a hot backup scenario? Most (all?) of the information contained therein should be overwritten with the contents of the checkpoint referenced by the backup label, shouldn't it? best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
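The arithmetic behind that FATAL is easy to check. Assuming the default BLCKSZ of 8192 and four bytes of parent-xid stored per transaction in pg_subtrans (both assumptions about a typical build), each subtrans page covers 2048 xids, and 21110784 lands exactly on a page boundary — so the error fires precisely when recovery touches a page that was never zeroed. A minimal sketch of that page arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed constants: BLCKSZ = 8192, 4 bytes of parent-xid per transaction,
 * giving 2048 subtrans entries per page. */
#define BLCKSZ 8192
#define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / 4)

/* Which subtrans page holds this xid? */
static uint32_t subtrans_page_of(uint32_t xid)
{
    return xid / SUBTRANS_XACTS_PER_PAGE;
}

/* An xid on a page boundary is the first xid of a new page; that page
 * must have been zeroed before anyone may read it. */
static int is_first_xid_on_page(uint32_t xid)
{
    return xid % SUBTRANS_XACTS_PER_PAGE == 0;
}
```

Under those assumed constants, xid 21110784 is the first xid of page 10308, matching the error message.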
Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load
On Oct23, 2011, at 22:48 , Daniel Farina wrote: It doesn't seem meaningful for StartupCLOG (or, indeed, any of the hot-standby path functionality) to be called before that code is executed, but it is anyway right now. I think the idea is to check that the CLOG part which recovery *won't* overwrite is consistent (or rather, given the simplicity of the check, at least accessible). Heikki said the following somewhere else in this thread when I suggested something similar to your proposal: There are pretty clear rules on what state clog can be in. When you launch postmaster in a standby: * Any clog preceding the nextXid from the checkpoint record we start recovery from must either be valid, or the clog file must be missing altogether (which can happen when it was vacuumed away while the backup was in progress - if the clog is still needed at the end of the backup it must not be missing, of course). * Any clog following nextXid can be garbled or missing. Recovery will overwrite any clog after nextXid from the WAL, but not the clog before it. I think Simon's theory that we're starting recovery from the wrong place, i.e. should start with an earlier WAL location, is probably correct. The question is, why? best regards, Florian Pflug
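Heikki's two rules reduce to a simple predicate. The sketch below is illustrative only — the enum and function names are invented, and the real check in StartupCLOG works at page granularity against actual SLRU files rather than against an abstract segment state:

```c
#include <assert.h>
#include <stdint.h>

/* Possible observed states of a clog segment when the standby starts. */
typedef enum { SEG_VALID, SEG_MISSING, SEG_GARBLED } SegState;

/* Is the observed state of the clog holding 'xid' acceptable, given that
 * recovery starts from a checkpoint whose nextXid is 'next_xid'?
 * - At or after nextXid: anything goes; recovery rewrites it from WAL.
 * - Before nextXid: it must be valid, or missing (vacuumed away during
 *   the backup). Garbled clog there means a broken base backup. */
static int clog_state_ok(uint32_t xid, uint32_t next_xid, SegState st)
{
    if (xid >= next_xid)
        return 1;
    return st == SEG_VALID || st == SEG_MISSING;
}
```

The observed failure corresponds to the last case: clog/subtrans before nextXid that is neither valid nor cleanly missing.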
Re: [HACKERS] Silent failure with invalid hba_file setting
On 19 October 2011 05:20, Tom Lane t...@sss.pgh.pa.us wrote: Peter Eisentraut pete...@gmx.net writes: I wouldn't add extra special checks for that. It might not be completely unreasonable to have a standby that no one can connect to, for example. Well, you couldn't monitor its state then, so I don't find that example very convincing. But if you were intent on having that, you could easily set up a pg_hba file containing only reject entries. I hadn't noticed you'd committed some changes around this until now. Thanks. -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message
On Mon, Oct 24, 2011 at 10:05 AM, Tom Lane t...@sss.pgh.pa.us wrote: as it seems to me that any client that is paranoid enough to care about sync rep had better already be handling the case of a connection loss during commit. Agreed, but that is a problem that by definition we can't help with. Also, the issue with connection loss is that you really can't know whether your transaction got committed without reconnecting and looking for evidence. There is no reason at all to inject such uncertainty into the cancel-SyncRepWaitForLSN case. We know the transaction got committed. I disagree. The whole point of synchronous replication is that the user is worried about the case where the primary goes away just after the commit is acknowledged to the client. Consider the following scenario: Someone has determined that the master can't be reached from 90% of the corporate Internet, but the synchronous standby, which is naturally on another network, still has connectivity. So they log into the master and perform a fast shutdown. When they reconnect, the connection pooler (or other mechanism) redirects their connection to the standby, which has since been promoted. ISTM that the client had darn well better go search for hard evidence about the transaction state. But I think that throwing an ERROR is likely to cause a LOT of client breakage, even if you have some special (human-invisible?) flag that indicates that you don't really mean it. If we must do something other than simulating a server disconnect, letting the command completion message go through and annotating it with a NOTICE or WARNING seems preferable. I think you're thinking narrowly of the SyncRepWaitForLSN case. What I'm trying to point out is that there's a boatload of post-commit code which is capable of sometimes throwing errors, and that's not ever going to go away completely. It might be that it'd work to deal with this by reducing the reported strength of all such cases from ERROR to WARNING. 
Not sure that that's a good idea, but it might work. It's hard to be sure that a systematic approach will work. For example, if we can't nuke a memory context for some reason, it wouldn't be utterly crazy to just ignore the problem and try to soldier on. We've probably leaked some memory, but oh well. If we've failed to release a heavyweight lock we had better call LockReleaseAll() somehow, but the details of what gets sent to the client are negotiable and a WARNING is probably fine. On the other hand, if we experienced some failure that affects our ability to make the transaction globally visible (like we wrote the commit record but then fail trying to acquire ProcArrayLock to clear our xmin), it's hard to believe that anything other than PANIC is enough. Because of that and similar cases elsewhere, including for example inside the lock manager, I've long been feeling grumpy about this: /* Ensure we will have room to remember the lock */ if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS) elog(ERROR, "too many LWLocks taken"); It seems to me that the idea that the abort path is going to be able to recover from that situation is wildly optimistic. Fortunately, our coding practices are good enough that I think it never happens anyway, but if it does it should surely PANIC. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
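To make the quoted check concrete, here is a toy model of that bookkeeping (the array size, names, and return-code convention are invented for illustration; the real code lives in lwlock.c). The point of the argument is that when this guard trips, the abort path would itself need to take LWLocks, so an ERROR cannot safely recover — the sketch simply returns failure where the real code would elog:

```c
#include <assert.h>

/* Illustrative limit; PostgreSQL's actual MAX_SIMUL_LWLOCKS differs. */
#define MAX_SIMUL_LWLOCKS 100

static int held_lwlocks[MAX_SIMUL_LWLOCKS];
static int num_held_lwlocks = 0;

/* Returns 1 on success; 0 where the real code reports
 * "too many LWLocks taken" -- the case Robert argues should PANIC. */
static int lwlock_acquire(int lockid)
{
    if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
        return 0;
    held_lwlocks[num_held_lwlocks++] = lockid;
    return 1;
}

/* Abort-path cleanup: forget everything we hold. */
static void lwlock_release_all(void)
{
    num_held_lwlocks = 0;
}
```

Once the fixed array is full there is no room even to record the locks that error recovery itself would acquire, which is why ERROR-level recovery from this state is "wildly optimistic".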
Re: [HACKERS] Updated version of pg_receivexlog
On Mon, Oct 24, 2011 at 16:12, Jaime Casanova ja...@2ndquadrant.com wrote: On Mon, Oct 24, 2011 at 7:40 AM, Magnus Hagander mag...@hagander.net wrote: synchronous_standby_names='*' is prone to such confusion in general, but it seems that it's particularly surprising if a running pg_basebackup lets a commit in synchronous replication proceed. Maybe we just need a warning in the docs. I think we should advise that synchronous_standby_names='*' is dangerous in general, and cite this as one reason for that. Hmm. I think this is common enough that we want to make sure we avoid it in code. Could we pass a parameter from the client indicating to the master that it refuses to be a sync slave? An optional keyword to the START_REPLICATION command, perhaps? can't you execute set synchronous_commit to off/local for this connection? This is a walsender connection, it doesn't take SQL. Plus it's the receiving end, and SET synchronous_commit is for the sending end. That said, we are reasonably safe in the current implementation, because it always sets the flush location to InvalidXLogRecPtr, so it will not be considered as a sync slave. Should we ever start accepting write as the point to sync against, the problem will show up, of course. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)
Hi! On Mon, Oct 17, 2011 at 12:38 PM, Jeff Davis pg...@j-davis.com wrote: I started implementing subtype_diff, and I noticed that it requires defining an extra function for each range type. Previously, the numeric types could just use a cast, which was convenient for user-defined range types. If you have any other ideas to make that cleaner, please let me know. Otherwise I'll just finish implementing subtype_diff. I think implementing subtype_diff for each datatype is OK. We could implement some universal function based on the minus operator and casting to double precision, but such a solution might be unacceptable in both predictability (the operator and casting function might not do what we expect) and performance. I'm beginning to think that we should just allow the user to specify their own gist_penalty function. Specifying just the subtype_diff doesn't save much time, and it can only be limiting. Additionally, it's harder for users to understand the purpose of the function. If we allow the user to specify their own gist_penalty function, then such a function would have to deal with: 1) GiST-specific data structures such as GISTENTRY. 2) Decomposing ranges using range_deserialize. 3) Infinities, which we could handle in general penalty functions. That's why I prefer to implement subtype_diff. -- With best regards, Alexander Korotkov.
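To illustrate why subtype_diff alone is enough for a generic penalty: the core can compute the penalty as the amount by which the stored range's bounds must be stretched, invoking subtype_diff only on bound values. A hedged sketch with finite float8 bounds (names invented; the real implementation must also handle GISTENTRY decoding, range_deserialize, and infinite bounds — which is exactly Alexander's point about keeping that in core):

```c
#include <assert.h>
#include <math.h>

/* Simplified range: finite float8 bounds only, inclusive semantics ignored. */
typedef struct { double lower, upper; } Range;

/* subtype_diff for a float8 range type: plain subtraction. */
static double float8_subtype_diff(double a, double b)
{
    return a - b;
}

/* Generic GiST penalty: how much must the stored range 'orig' grow to
 * absorb the incoming range 'newr'?  Only subtype_diff is needed. */
static double range_penalty(Range orig, Range newr)
{
    double p = 0.0;
    if (newr.upper > orig.upper)
        p += float8_subtype_diff(newr.upper, orig.upper);
    if (newr.lower < orig.lower)
        p += float8_subtype_diff(orig.lower, newr.lower);
    return p;
}
```

A range already contained in the stored one costs nothing; the further a new range sticks out past either bound, the higher the penalty.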
Re: [HACKERS] ALTER TABLE ONLY ...DROP CONSTRAINT is broken in HEAD.
On Mon, Sep 12, 2011 at 11:53 AM, Alexey Klyukin al...@commandprompt.com wrote: This works in 9.1, but not in HEAD: CREATE TABLE parent(id INTEGER, CONSTRAINT id_check CHECK(id > 1)); CREATE TABLE child() INHERITS(parent); ALTER TABLE ONLY parent DROP CONSTRAINT id_check; I'm getting: ERROR: relation 16456 has non-inherited constraint id_check where 16456 is the oid of the child table. It seems that the pg_constraint scan at ATExecDropConstraint (tablecmds.c:6751) is re-reading those tuples that were updated in the previous iterations of this scan, at least that's what I've observed in gdb. I'm not sure how to fix this yet. Woops, seems this got overlooked. It's been fixed, though: see commit c0f03aae0469e758964faac0fb741685170c39a5. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant
On Fri, Sep 23, 2011 at 4:05 PM, Dan McGee d...@archlinux.org wrote: On Fri, Sep 23, 2011 at 2:49 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Sep 19, 2011 at 4:36 PM, Dan McGee d...@archlinux.org wrote: [ patch ] I suppose it's Tom who really needs to comment on this, but I'm not too enthusiastic about this approach. Duplicating the Linux kernel calculation into our code means that we could drift if the formula changes again. Why would the formula ever change? This seems like a different way of saying you don't appreciate the hacky approach, which I can understand completely. However, it simply doesn't make sense for them to change this formula, as it scales the old -17 to 16 range to the new -1000 to 1000 range. Those endpoints won't be changing unless a third method is introduced, in which case this whole thing is moot and we'd need to fix it yet again. I like Tom's previous suggestion (I think) of allowing both constants to be defined - if they are, then we try oom_score_adj first and fall back to oom_adj if that fails. For bonus points, we could have the postmaster stat() its own oom_score_adj file before forking and set a global variable to indicate the results. That way we'd only ever need to test once per postmaster startup (at least until someone figures out a way to swap out the running kernel without stopping the server...!). This would be fine, it just seems unreasonably complicated, not to mention unnecessary. I might as well point this out [1]. I don't think a soul out there has built without defining this to 0 (if they define it at all), and not even that many people are using it. Is it all that bad of an idea to just force it to 0 for both knobs and be done with it? Did we do anything about this? Anyone else have an opinion on what ought to be done? 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
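The try-new-then-fall-back idea under discussion can be sketched in a few lines. This is not the patch itself — the return convention and probing on every call (rather than stat()-ing once at postmaster start, as Robert suggests) are simplifications; only the /proc paths are real Linux interfaces:

```c
#include <stdio.h>

/* Try the new Linux knob first, then fall back to the deprecated one.
 * Returns 2 if /proc/self/oom_score_adj was written (scale -1000..1000),
 * 1 if /proc/self/oom_adj was written (scale -17..16),
 * 0 if neither exists (e.g. a non-Linux system). */
static int set_oom_adjustment(int score_adj, int old_adj)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (f)
    {
        fprintf(f, "%d\n", score_adj);
        fclose(f);
        return 2;
    }
    f = fopen("/proc/self/oom_adj", "w");
    if (f)
    {
        fprintf(f, "%d\n", old_adj);
        fclose(f);
        return 1;
    }
    return 0;
}
```

The caller supplies a value on each scale, sidestepping the contested question of duplicating the kernel's conversion formula — which is essentially the "force it to 0 for both knobs" position.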
Re: [HACKERS] Online base backup from the hot-standby
On 24.10.2011 15:29, Fujii Masao wrote: +listitem + para + Copy the pg_control file from the cluster directory to the global + sub-directory of the backup. For example: + programlisting + cp $PGDATA/global/pg_control /mnt/server/backupdir/global + /programlisting + /para +/listitem Why is this step required? The control file is overwritten by information from the backup_label anyway, no? +listitem + para + Again connect to the database as a superuser, and execute + functionpg_stop_backup/. This terminates the backup mode, but does not + perform a switch to the next WAL segment, create a backup history file and + wait for all required WAL segments to be archived, + unlike that during normal processing. + /para +/listitem How do you ensure that all the required WAL segments have been archived, then? + /orderedlist +/para + +para + You cannot use the applicationpg_basebackup/ tool to take the backup + from the standby. +/para Why not? We have cascading replication now. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message
On Sun, Oct 23, 2011 at 6:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Well, there is a general problem that anything which throws an ERROR too late in the commit path is Evil; and sync rep makes that worse to the extent that it adds more stuff late in the commit path, but it didn't invent the problem. BTW, it strikes me that if we want to do something about that, it ought to be possible; but it has to be built into error handling, not a localized hack for sync rep. Consider a design along these lines: we invent a global flag that gets set at some appropriate point in RecordTransactionCommit (probably right where we exit the commit critical section) and is not cleared until we send a suitable message to the client --- I think either command-complete or an error message would qualify, but that would have to be analyzed more carefully than I've done so far. If elog.c is told to send an error message while this flag is set, then it does something special to inform the client that this was a post-commit error and the xact is in fact committed. My inclination for the something special would be to add a new error message field, but that could be difficult for clients to examine depending on what sort of driver infrastructure they're dealing with. You could also imagine emitting a separate NOTICE or WARNING message, which is analogous to the current hack in SyncRepWaitForLSN, but seems pretty ugly because it requires clients to re-associate that event with the later error message. (But it might be worth doing anyway for human users, even if we provide a different flag mechanism that is intended for program consumption.) Or maybe we could override the SQLSTATE with some special value. Or something else. Given infrastructure like this, it would be reasonable for SyncRepWaitForLSN to just throw an ERROR if it gets an interrupt, instead of trying to kluge its own solution. 
I actually think that emitting a NOTICE or WARNING and then slamming the connection shut is quite elegant, as it seems to me that any client that is paranoid enough to care about sync rep had better already be handling the case of a connection loss during commit. I realize that handling query cancellation in a somewhat different way is a wart, though, and I'm not necessarily opposed to changing the behavior. But I think that throwing an ERROR is likely to cause a LOT of client breakage, even if you have some special (human-invisible?) flag that indicates that you don't really mean it. If we must do something other than simulating a server disconnect, letting the command completion message go through and annotating it with a NOTICE or WARNING seems preferable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
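Tom's proposed mechanism boils down to a small amount of state. In this sketch the names and the "[COMMITTED]" tag are invented — Tom actually floats a new error-message field or a special SQLSTATE as the flagging mechanism — but the flow is his: set a flag where RecordTransactionCommit leaves its critical section, clear it once the client has been told anything definitive, and tag any error reported in between:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Tom's proposed global flag: the commit is durable, but the client
 * has not yet received command-complete or an error message. */
static int xact_committed_unreported = 0;

/* Set where RecordTransactionCommit exits its critical section. */
static void record_transaction_commit(void)
{
    xact_committed_unreported = 1;
}

/* Cleared when command-complete is sent to the client. */
static void report_command_complete(void)
{
    xact_committed_unreported = 0;
}

/* elog-style reporting: tag the message if the commit already hit disk,
 * so the client knows the xact committed despite the error. */
static void report_error(const char *msg, char *out, size_t outlen)
{
    if (xact_committed_unreported)
        snprintf(out, outlen, "ERROR [COMMITTED]: %s", msg);
    else
        snprintf(out, outlen, "ERROR: %s", msg);
    xact_committed_unreported = 0;  /* the client has now been told */
}
```

With infrastructure like this, SyncRepWaitForLSN could simply throw an ERROR on interrupt and let the generic machinery flag it as post-commit, instead of its current special-case WARNING-plus-disconnect.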
Re: [HACKERS] Online base backup from the hot-standby
On 24.10.2011 15:29, Fujii Masao wrote: In your patch, FPW is always WAL-logged at startup even when FPW has not been changed since last shutdown. I don't think that's required. I changed the recovery code so that it keeps track of the last FPW indicated by WAL record. Then, at end of startup, if that FPW is equal to the FPW specified in postgresql.conf (which means that FPW has not been changed since last shutdown or crash), WAL-logging of FPW is skipped. This change prevents unnecessary WAL-logging. Thought? One problem with this whole FPW-tracking is that pg_lesslog makes it fail. I'm not sure what we need to do about that - maybe just add a warning to the docs. But it leaves a bit of a bad feeling in my mouth. Usually we try to make features work orthogonally, without dependencies on other settings. Now this feature requires that full_page_writes is turned on in the master, and also that you don't use pg_lesslog to compress the WAL segments, or your base backup might be corrupt. The procedure to take a backup from the standby seems more complicated than taking it on the master - there are more steps to follow. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [9.1] unusable for large views
On Mon, Oct 24, 2011 at 4:57 AM, Omar Bettin o.bet...@informaticaindustriale.it wrote: I have tried 9.1.1 win64 version and when I am trying to declare a cursor for a very large view (lot of joins and aggregate functions), postgres is using around 3GB of memory and the query never returns. Hmm. A 59-table join is pretty enormous. I wish we had a better way to handle these kinds of queries. Odds are good that the join order doesn't matter much, and in an ideal world we would be able to notice that and just use some simple heuristic to pick a tolerably good one. As it is, I am a bit surprised to hear that GEQO isn't bailing you out. Can you EXPLAIN a query against that view, or does even that wipe out? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
R: [HACKERS] [9.1] unusable for large views (SOLVED)
...sorry guys... it was a misconfiguration of the database. 9.1.1 is working well; it is ~4% faster than 9.0.5 for the same query. Thanks a lot. Regards -Original Message- From: Pavel Stehule [mailto:pavel.steh...@gmail.com] Sent: Monday, October 24, 2011 12:13 To: Omar Bettin Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [9.1] unusable for large views Hello please, send a result of explain analyze on 9.1.1 and older please, use http://explain.depesz.com/ Regards Pavel Stehule 2011/10/24 Omar Bettin o.bet...@informaticaindustriale.it: Hello, I have tried the 9.1.1 win64 version, and when I try to declare a cursor for a very large view (lots of joins and aggregate functions), postgres uses around 3GB of memory and the query never returns. Same problem when selecting from the view without a cursor. The same query worked fine from 8.3.3 to 9.0.5. Should I change some configuration params to get the same behavior as in previous versions? Tried on Win2008 Server R2 64-bit 8GB RAM, also on Win7 64-bit 8GB RAM, default postgresql.conf. Regards, The view (!) 
CREATE OR REPLACE VIEW articoli_rip_ven20110227_view AS
 SELECT a.pk_id, a.articolo, a.articolo_f, a.origine,
    ( SELECT ditte.attivita FROM ditte WHERE ditte.partitario = a.cod_fornit) AS dummy_1,
    a.descrizion, a.taglia, a.peso_spec, a.giacenza_colli, a.giacenza,
    rim.colli AS rimanenza_colli, rim.quantita AS rimanenza_qta,
    rimass.colli AS rimanenza_ass_colli, rimass.quantita AS rimanenza_ass_qta,
    COALESCE(b20.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b21.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b22.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b23.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b24.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b25.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b26.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b27.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b28.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b29.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b30.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b31.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b32.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b33.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b34.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b35.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b36.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b37.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b38.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b39.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b40.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b41.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b42.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b43.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b44.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b45.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b46.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b47.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b48.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b49.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b50.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b51.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b52.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b53.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b54.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b55.quantita, NULL::numeric, 0::numeric)::double precision +
    COALESCE(b56.quantita, NULL::numeric, 0::numeric)::double precision +
    (COALESCE(rim.quantita, NULL::numeric, 0::numeric) +
     COALESCE(rimass.quantita, NULL::numeric, 0::numeric))::double precision -
    (COALESCE(b1.quantita, NULL::numeric, 0::numeric)::double precision +
     COALESCE(b2.quantita, NULL::numeric, 0::numeric)::double precision +
     COALESCE(b3.quantita, NULL::numeric, 0::numeric)::double precision +
     COALESCE(b4.quantita, NULL::numeric, 0::numeric)::double precision +
     COALESCE(b5.quantita, NULL::numeric, 0::numeric)::double precision +
     COALESCE(b6.quantita, NULL::numeric, 0::numeric)::double precision +
     COALESCE(b7.quantita, NULL::numeric, 0::numeric)::double precision +
     COALESCE(b8.quantita, NULL::numeric, 0::numeric)::double precision +
     COALESCE(b9.quantita, NULL::numeric, 0::numeric)::double precision +
     COALESCE(b10.quantita, NULL::numeric, 0::numeric)::double
Re: [HACKERS] autovacuum and orphaned large objects
On Mon, Oct 24, 2011 at 10:25 AM, Euler Taveira de Oliveira eu...@timbira.com wrote: On 24-10-2011 10:57, Robert Haas wrote: I think the main reason why vacuumlo is a contrib module rather than in core is that it is just a heuristic, and it might not be what everyone wants to do. You could store a bunch of large objects in the database and use the returned OIDs to generate links that you email to users, and then when the user clicks on the link we retrieve the corresponding LO and send it to the user over HTTP. In that design, there are no tables in the database at all, yet the large objects aren't orphaned. Wow, what a strange method to solve a problem, and one that could bloat your database. No, I'm not suggesting that we forbid it. The proposed method could clean up orphaned LOs in 95% (if not 99%) of the use cases. I've never heard of anyone using LOs the way you describe. It seems strange that someone would distribute an OID number but not store a reference to it in the same database. Yes, it is a possibility but ... I guess we could make it an optional behavior, but once you go that far then you have to wonder whether what's really needed here is a general-purpose task scheduler. I mean, the autovacuum launcher's idea about how often to vacuum the database won't necessarily match the user's idea of how often they want to vacuum away large objects - and if the user is doing something funky (like storing arrays of large object OIDs, or inexplicably storing them using numeric or int8) then putting it in the backend removes a considerable amount of flexibility. Another case where vacuumlo will fall over is if you have a very, very large table with an OID column, but with lots of duplicate values so that the number of OIDs actually referenced is much smaller. You might end up doing a table scan on the large table every time this logic kicks in, and that might suck. 
I'm sort of unexcited about the idea of doing a lot of engineering around this; it seems to me that the only reasons we still have a separate large object facility rather than just letting everyone go through regular tables with toastable columns are (1) the size limit is 2GB rather than 1GB and (2) you can read and write parts of objects rather than the whole thing. If we're going to do some more engineering here, I'd rather set our sights a little higher. Complaints I often hear about the large object machinery include (1) 2GB is still not enough, (2) 4 billion large objects is not enough, (3) the performance is inadequate, particularly with large numbers of large objects from possibly-unrelated subsystems slammed into a single table, and (4) it would be nice to be able to do partial reads and writes on any toastable field, not just large objects. I'm not saying that the problem you're complaining about isn't worth fixing in the abstract, and if it seemed like a nice, clean fix I'd be all in favor, but I just don't think it's going to be very simple, and for the amount of work involved I'd rather get a little bit more bang for the buck. Of course, you don't have to agree with me on any of this; I'm just giving you my take on it. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
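For reference, the heuristic vacuumlo implements is essentially a set difference, and Robert's caveats fall straight out of it: any reference kept in an array, in an int8/numeric column, or outside the database entirely is invisible to the "referenced" side and its object gets wrongly classified as an orphan. A toy version of the core computation (vacuumlo itself does this in SQL against pg_largeobject and the catalogs, not with nested loops):

```c
#include <assert.h>
#include <stdint.h>

/* Orphans = (all LO OIDs) minus (OIDs referenced by oid columns in user
 * tables).  Quadratic for clarity; returns the number of orphans written
 * into 'orphans' (caller provides room for at least nall entries). */
static int find_orphans(const uint32_t *all, int nall,
                        const uint32_t *referenced, int nref,
                        uint32_t *orphans)
{
    int n = 0;
    for (int i = 0; i < nall; i++)
    {
        int seen = 0;
        for (int j = 0; j < nref; j++)
        {
            if (all[i] == referenced[j])
            {
                seen = 1;
                break;
            }
        }
        if (!seen)
            orphans[n++] = all[i];  /* candidate for lo_unlink() */
    }
    return n;
}
```

Building the "referenced" set is also where the cost goes: it requires scanning every table with an OID column, which is the table-scan concern raised above.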
R: [HACKERS] [9.1] unusable for large views (SOLVED)
Hi Tom, For about two hours I have been trying to report that the problem has been solved, but I do not see my messages in the mailing list... Anyway, the problem was a bad installation of the database (pgsql functions). 9.1.1 is working well; it is 4% to 8% faster than 9.0.5. Thanks a lot to everyone. Regards, Omar -Original Message- From: Tom Lane [mailto:t...@sss.pgh.pa.us] Sent: Monday, October 24, 2011 16:46 To: Omar Bettin Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [9.1] unusable for large views Omar Bettin o.bet...@informaticaindustriale.it writes: I have tried 9.1.1 win64 version and when I am trying to declare a cursor for a very large view (lot of joins and aggregate functions), postgres is using around 3GB of memory and the query never returns. Could we see a self-contained test case? I'm not about to try to reverse-engineer the schema that goes with such a monster query. http://wiki.postgresql.org/wiki/Guide_to_reporting_problems regards, tom lane
Re: [HACKERS] Online base backup from the hot-standby
On Mon, Oct 24, 2011 at 11:33 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 24.10.2011 15:29, Fujii Masao wrote: In your patch, FPW is always WAL-logged at startup even when FPW has not been changed since last shutdown. I don't think that's required. I changed the recovery code so that it keeps track of last FPW indicated by WAL record. Then, at end of startup, if that FPW is equal to FPW specified in postgresql.conf (which means that FPW has not been changed since last shutdown or crash), WAL-logging of FPW is skipped. This change prevents unnecessary WAL-logging. Thought? One problem with this whole FPW-tracking is that pg_lesslog makes it fail. I'm not sure what we need to do about that - maybe just add a warning to the docs. But it leaves a bit bad feeling in my mouth. Usually we try to make features work orthogonally, without dependencies to other settings. Now this feature requires that full_page_writes is turned on in the master, and also that you don't use pg_lesslog to compress the WAL segments or your base backup might be corrupt. The procedure to take a backup from the standby seems more complicated than taking it on the master - there are more steps to follow. Doing it on the master isn't as easy as I'd like it to be, either. But it's not really clear how to make it simpler. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [9.1] unusable for large views
Omar Bettin o.bet...@informaticaindustriale.it writes: I have tried 9.1.1 win64 version and when I am trying to declare a cursor for a very large view (lot of joins and aggregate functions), postgres is using around 3GB of memory and the query never returns. Could we see a self-contained test case? I'm not about to try to reverse-engineer the schema that goes with such a monster query. http://wiki.postgresql.org/wiki/Guide_to_reporting_problems regards, tom lane
Re: [HACKERS] autovacuum and orphaned large objects
Euler Taveira de Oliveira eu...@timbira.com writes: The main point of autovacuum is maintenance tasks. Currently, it executes VACUUM and ANALYZE commands. I want to propose that we incorporate vacuumlo functionality into it. I'm not terribly thrilled with that because (a) large objects seem like mostly a legacy feature from here, and (b) it's hard to see how to implement it without imposing overhead on everybody, whether they use LOs or not. This is especially problematic if you're proposing that cleanup triggers not be required. regards, tom lane
Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message
Robert Haas robertmh...@gmail.com writes: On Sun, Oct 23, 2011 at 6:34 PM, Tom Lane t...@sss.pgh.pa.us wrote: BTW, it strikes me that if we want to do something about that, it ought to be possible; but it has to be built into error handling, not a localized hack for sync rep. I actually think that emitting a NOTICE or WARNING and then slamming the connection shut is quite elegant, No, it's a horrid crock, whose only saving grace is that it was implementable with two or three lines localized to SyncRepWaitForLSN ... or at least, we thought that until Fujii-san started pointing out the bugs in it. It's not convenient for clients at all, it does not fit well into the backend structure (which is the reason for the bugs), and it forces session termination unnecessarily, or at least it would if we'd been consistent and applied the method to query-cancel as well. as it seems to me that any client that is paranoid enough to care about sync rep had better already be handling the case of a connection loss during commit. Agreed, but that is a problem that by definition we can't help with. Also, the issue with connection loss is that you really can't know whether your transaction got committed without reconnecting and looking for evidence. There is no reason at all to inject such uncertainty into the cancel-SyncRepWaitForLSN case. We know the transaction got committed, and there's no reason to make the client guess about that, nor to make it parse WARNING messages for which we didn't even get the assignment of a unique SQLSTATE right (thus making the problem insoluble anyhow). But I think that throwing an ERROR is likely to cause a LOT of client breakage, even if you have some special (human-invisible?) flag that indicates that you don't really mean it. If we must do something other than simulating a server disconnect, letting the command completion message go through and annotating it with a NOTICE or WARNING seems preferable. 
I think you're thinking narrowly of the SyncRepWaitForLSN case. What I'm trying to point out is that there's a boatload of post-commit code which is capable of sometimes throwing errors, and that's not ever going to go away completely. It might be that it'd work to deal with this by reducing the reported strength of all such cases from ERROR to WARNING. Not sure that that's a good idea, but it might work. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updated version of pg_receivexlog
On Mon, Oct 24, 2011 at 13:46, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

+ /*
+  * Looks like an xlog file. Parse it's position.
+  */
+ if (sscanf(dirent->d_name, "%08X%08X%08X", &tli, &log, &seg) != 3)
+ {
+     fprintf(stderr, _("%s: could not parse xlog filename \"%s\"\n"),
+             progname, dirent->d_name);
+     disconnect_and_exit(1);
+ }
+ log *= XLOG_SEG_SIZE;

s/it's/its/ That multiplication by XLOG_SEG_SIZE could overflow, if logid is very high. It seems completely unnecessary, anyway. How do you mean completely unnecessary? We'd have to change the points that use it to divide by XLOG_SEG_SIZE otherwise, no? That might be a way to get around the overflow, but I'm not sure that's what you mean? s/IDENFITY_SYSTEM/IDENTIFY_SYSTEM/ (two occurrences) Oops. In pg_basebackup, it would be a good sanity check to verify that the systemid returned by IDENTIFY_SYSTEM in the main connection and the WAL-streaming connection match. Just to be sure that some connection pooler didn't hijack one of the connections and point it to a different server. And better check timelineid too while you're at it. That's a good idea. Will fix. How does this interact with synchronous replication? If a base backup that streams WAL is in progress, and you have synchronous_standby_names set to '*', I believe the in-progress backup will count as a standby for that purpose. That might give a false sense of security. Ah yes. Did not think of that. Yes, it will have this problem. synchronous_standby_names='*' is prone to such confusion in general, but it seems that it's particularly surprising if a running pg_basebackup lets a commit in synchronous replication proceed. Maybe we just need a warning in the docs. I think we should advise that synchronous_standby_names='*' is dangerous in general, and cite this as one reason for that. Hmm. I think this is common enough that we want to make sure we avoid it in code.
Could we pass a parameter from the client indicating to the master that it refuses to be a sync slave? An optional keyword to the START_REPLICATION command, perhaps? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] So, is COUNT(*) fast now?
On Sun, Oct 23, 2011 at 7:01 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Fri, Oct 21, 2011 at 12:52 PM, Robert Haas robertmh...@gmail.com wrote: Also, this line is kind of expensive: if (!visibilitymap_test(scandesc->heapRelation, ItemPointerGetBlockNumber(tid), &node->ioss_VMBuffer)) Around 2%. But I don't see any way to avoid that, or even make it cheaper. Could we cache by ItemPointerGetBlockNumber(tid) the results of those tests, for groups of tids on the same index page? How useful this would be would depend on how well-clustered the table and index are. I thought about that, but the existing code is so ridiculously cheap that it's hard to believe a caching layer would save enough to pay for itself. I mean, I presume that the cost attributed to that line has to be associated with either (a) one of the pointer dereferences, (b) the expense of evaluating ItemPointerGetBlockNumber(), (c) setting up the function call, or perhaps (d) overhead incurred as a result of branch mis-prediction. The actual time spent *inside* visibilitymap_test will be attributed to that function, not this one. If you add up the time for this line and visibilitymap_test(), it's like 10% of the runtime, which seems pretty significant. But it's 10% of the runtime that is spent doing basically a handful of arithmetic operations and then reading a byte from shared memory. It's astonishing to find that so expensive on a test with just one backend running. If you stick some kind of cache in there, it's going to involve adding a branch that isn't there now, and I think that's going to be pricey given how hot this code apparently is. Also, I'm not sure it's safe. Suppose that we lock the index page, return a tuple, check the visibility map, and find the page all visible. Another transaction comes along, adds a tuple to the index page, and clears the visibility map bit. We then go back, relock the index page, and return another tuple.
We'd better notice that the visibility map bit has now been cleared, or we're in trouble. I wonder if it might help to create some method for the index to return all the matching keys on a single index page in one call. If you're dealing with an MVCC snapshot, any new tuples added after the start of the scan can't be visible anyway. That would save the overhead of unlocking and relocking the buffer once per tuple, and probably some overhead associated with validating and decoding the possibly-changed page contents each time. If you did it that way, it would also be safe to do what you're proposing - if a bunch of the tuples on the index page are also on the same heap page, you could do one visibility map check for all of them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updated version of pg_receivexlog
On Mon, Oct 24, 2011 at 7:40 AM, Magnus Hagander mag...@hagander.net wrote: synchronous_standby_names='*' is prone to such confusion in general, but it seems that it's particularly surprising if a running pg_basebackup lets a commit in synchronous replication proceed. Maybe we just need a warning in the docs. I think we should advise that synchronous_standby_names='*' is dangerous in general, and cite this as one reason for that. Hmm. I think this is common enough that we want to make sure we avoid it in code. Could we pass a parameter from the client indicating to the master that it refuses to be a sync slave? An optional keyword to the START_REPLICATION command, perhaps? Can't you execute SET synchronous_commit TO off/local for this connection? -- Jaime Casanova www.2ndQuadrant.com Professional PostgreSQL: Soporte 24x7 y capacitación -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Updated version of pg_receivexlog
+ /*
+  * Looks like an xlog file. Parse it's position.
+  */
+ if (sscanf(dirent->d_name, "%08X%08X%08X", &tli, &log, &seg) != 3)
+ {
+     fprintf(stderr, _("%s: could not parse xlog filename \"%s\"\n"),
+             progname, dirent->d_name);
+     disconnect_and_exit(1);
+ }
+ log *= XLOG_SEG_SIZE;

s/it's/its/ That multiplication by XLOG_SEG_SIZE could overflow, if logid is very high. It seems completely unnecessary, anyway. s/IDENFITY_SYSTEM/IDENTIFY_SYSTEM/ (two occurrences) In pg_basebackup, it would be a good sanity check to verify that the systemid returned by IDENTIFY_SYSTEM in the main connection and the WAL-streaming connection match. Just to be sure that some connection pooler didn't hijack one of the connections and point it to a different server. And better check timelineid too while you're at it. How does this interact with synchronous replication? If a base backup that streams WAL is in progress, and you have synchronous_standby_names set to '*', I believe the in-progress backup will count as a standby for that purpose. That might give a false sense of security. synchronous_standby_names='*' is prone to such confusion in general, but it seems that it's particularly surprising if a running pg_basebackup lets a commit in synchronous replication proceed. Maybe we just need a warning in the docs. I think we should advise that synchronous_standby_names='*' is dangerous in general, and cite this as one reason for that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] EXECUTE tab completion
On Mon, Oct 24, 2011 at 01:26, Tom Lane t...@sss.pgh.pa.us wrote: Andreas Karlsson andr...@proxel.se writes: Thanks for cleaning up the code to some sanity, I should have done so myself when I noticed the problem. A new version is attached. Committed with minor adjustments --- I didn't see any need to make this wait for the next commitfest. Thanks - I was planning to pick that one up along with my TABLE patch, but was too busy with pgconf.eu over the past couple of weeks.. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] loss of transactions in streaming replication
On Fri, Oct 21, 2011 at 12:01 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Oct 20, 2011 at 9:51 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, Oct 20, 2011 at 1:05 AM, Robert Haas robertmh...@gmail.com wrote: OK, so this is an artifact of the changes to make libpq communication bidirectional. But I'm still confused about where the error is coming from. In your OP, you wrote "In 9.2dev and 9.1, when walreceiver detects an error while sending data to WAL stream, it always emits ERROR even if there are data available in the receive buffer." So that implied to me that this is only going to trigger if you have a shutdown together with an awkwardly-timed error. But your scenario for reproducing this problem doesn't seem to involve an error. Yes, my scenario doesn't cause any real error. My original description was misleading. The following would be closer to the truth: "In 9.2dev and 9.1, when walreceiver detects the termination of the replication connection while sending data to the WAL stream, it always emits ERROR even if there are data available in the receive buffer." Ah, OK. I think I now agree that this is a bug and that we should fix and back-patch. Is the patch that I posted before well-formed enough to be adopted? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autovacuum and orphaned large objects
On Mon, Oct 24, 2011 at 12:56 AM, Euler Taveira de Oliveira eu...@timbira.com wrote: The main point of autovacuum is maintenance tasks. Currently, it executes VACUUM and ANALYZE commands. I want to propose that we incorporate vacuumlo functionality into it. For dealing with large objects (LO), we have the lo contrib module, which helps with LO maintenance but has some limitations (it does not handle the DROP TABLE and TRUNCATE cases), and vacuumlo, which does an excellent job but has to be executed outside the DBMS. The proposal is to clean up LOs when autovacuum triggers VACUUM; the LO cleanup routine would start after(?) the VACUUM command. In the near future I want to propose that orphaned LOs be cleaned up by VACUUM, but that's a story for another thread... Comments? I think the main reason why vacuumlo is a contrib module rather than in core is that it is just a heuristic, and it might not be what everyone wants to do. You could store a bunch of large objects in the database and use the returned OIDs to generate links that you email to users, and then when the user clicks on the link we retrieve the corresponding LO and send it to the user over HTTP. In that design, there are no tables in the database at all, yet the large objects aren't orphaned. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [9.1] unusable for large views
On 24/10/11 10:57, Omar Bettin wrote: [monster query] I see that your problem is already solved, but incidentally I'm working on a join order planning module and I'm looking for real-life examples of humongous queries like that to benchmark against. Any chance you could share the schema, or at least part of it, that goes with this query? Or perhaps you have more of these queries? Cheers, Jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Unreproducible bug in snapshot import code
Hi All, I have not been able to reproduce this error, but wanted to report it in case it might be useful. Commit-id: 0f39d5050dc0dce99258381f33f1832c437aff85 Configure options: --prefix=/mnt/storage/gurjeet/dev/builds//master/db --enable-debug --enable-cassert CFLAGS=-O0 --enable-depend --enable-thread-safety --with-openssl Used VPATH to build postgres. Attached files: terminal1.txt terminal2.txt config.log Starting from line 89 of terminal1.txt we see this snippet:

postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02C8-1';
ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
postgres=# rollback;
ROLLBACK

As you can see, SET TRANSACTION SNAPSHOT was the first statement in that transaction, and yet the ERROR message says that it is not. That snapshot id was generated in another session (terminal2.txt), and was generated outside any transaction; that's the only peculiar thing I can say about that snapshot-id. As is evident from the rest of the lines in terminal1.txt, I have not been able to reproduce this error again; I tried even with a clean build of the sources. I have taken a look at the code and everything tells me that the error should have been: ERROR:  a snapshot-importing transaction must have isolation level SERIALIZABLE or REPEATABLE READ Maybe it was not a clean build the first time, and the code may have wrongly linked with previously compiled .o files. Regards, -- Gurjeet Singh EnterpriseDB Corporation The Enterprise PostgreSQL Company

psql (9.2devel)
Type "help" for help.
postgres=# begin transaction;
BEGIN
postgres=# set iso
postgres=# set trans
transaction             transaction_deferrable  transaction_isolation
transaction_read_only   transform_null_equals
postgres=# set transaction isolation level read only;
ERROR:  syntax error at or near "only"
LINE 1: set transaction isolation level read only;
                                             ^
postgres=# set transaction isolation level readonly;
ERROR:  syntax error at or near "readonly"
LINE 1: set transaction isolation level readonly;
                                        ^
postgres=# set transaction ISOLATION LEVEL READ
postgres=# set transaction ISOLATION LEVEL READ
REPEATABLE    SERIALIZABLE
postgres=# set transaction ISOLATION LEVEL READ
COMMITTED     UNCOMMITTED
postgres=# set transaction ISOLATION LEVEL READ committed;
ERROR:  current transaction is aborted, commands ignored until end of transaction block
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL READ committed;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# select 1;
 ?column?
----------
        1
(1 row)

postgres=# set transaction ISOLATION LEVEL READ committed;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL serializable;;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# select 1;
 ?column?
----------
        1
(1 row)

postgres=# set transaction ISOLATION LEVEL serializable;;
ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
postgres=# rollback;
ROLLBACK
(reverse-i-search)`be': select count(*) from clustermem^Crs ;
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# select 1;
 ?column?
----------
        1
(1 row)

postgres=# set transaction ISOLATION LEVEL repeatable read;
ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
postgres=# set transaction ISOLATION LEVEL ;
READ          REPEATABLE    SERIALIZABLE
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02C8-1';
ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02C9-1';
ERROR:  a snapshot-importing transaction must have isolation level SERIALIZABLE or REPEATABLE READ
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# set transaction snapshot '02C9-1';
SET
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02CA-1';
ERROR:  a snapshot-importing transaction must have isolation level SERIALIZABLE or REPEATABLE READ
postgres=# rollack;
ERROR:  syntax error at or near "rollack"
LINE 1: rollack;
        ^
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=#
Re: [HACKERS] [9.1] unusable for large views (SOLVED)
...sorry guys... it was a bad configuration of the database. 9.1.1 is working well; it is 4% to 8% faster than 9.0.5. Thanks a lot. Regards Omar P.S. attached EXPLAIN Hmm. A 59-table join is pretty enormous. And it is not the biggest; basically, they are delivery notes for one day, seen vertically. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TABLE tab completion
On Mon, Sep 26, 2011 at 10:37, Magnus Hagander mag...@hagander.net wrote: On Sun, Sep 25, 2011 at 15:06, Dean Rasheed dean.a.rash...@gmail.com wrote: On 24 September 2011 11:59, Magnus Hagander mag...@hagander.net wrote: TABLE tab completion in psql only completes to tables, not views. But the TABLE command works fine for both tables and views (and also sequences). Seems we should just complete it to relations and not tables - or can anyone see a particular reason why we shouldn't? Doesn't that mean that DROP TABLE tab would offer up views as well as tables, which would be incorrect? Meh - you are correct, of course. I guess that's why we have code review :-) So - not a one-liner, but how about something like this? (Happy to have someone point out a neater way of doing it, not entirely fluent in how we do the tab completion..) Rebased on top of the changes Tom made to the infrastructure, and applied. -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [9.1] unusable for large views (SOLVED)
2011/10/24 Omar Bettin o.bet...@informaticaindustriale.it: ...sorry guys... it was a bad configuration of the database. 9.1.1 is working well; it is 4% to 8% faster than 9.0.5. Thanks a lot. Regards Omar P.S. attached EXPLAIN The attachment is missing. Pavel Hmm. A 59-table join is pretty enormous. And it is not the biggest; basically, they are delivery notes for one day, seen vertically. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [9.1] unusable for large views
Hello, please send the result of EXPLAIN ANALYZE on 9.1.1 and older; please use http://explain.depesz.com/ Regards Pavel Stehule 2011/10/24 Omar Bettin o.bet...@informaticaindustriale.it: Hello, I have tried the 9.1.1 win64 version, and when I am trying to declare a cursor for a very large view (lots of joins and aggregate functions), postgres is using around 3GB of memory and the query never returns. Same problem selecting from the view without a cursor. The same query worked fine from 8.3.3 to 9.0.5. Should I change some configuration params to have the same behavior as previous versions? Tried on Win2008 Server R2 64bit 8GB RAM, also on Win7 64bit 8GB RAM, default postgresql.conf. Regards, The view (!) CREATE OR REPLACE VIEW articoli_rip_ven20110227_view AS SELECT a.pk_id, a.articolo, a.articolo_f, a.origine, ( SELECT ditte.attivita FROM ditte WHERE ditte.partitario = a.cod_fornit) AS dummy_1, a.descrizion, a.taglia, a.peso_spec, a.giacenza_colli, a.giacenza, rim.colli AS rimanenza_colli, rim.quantita AS rimanenza_qta, rimass.colli AS rimanenza_ass_colli, rimass.quantita AS rimanenza_ass_qta, COALESCE(b20.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b21.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b22.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b23.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b24.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b25.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b26.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b27.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b28.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b29.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b30.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b31.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b32.quantita, NULL::numeric, 0::numeric)::double
precision + COALESCE(b33.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b34.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b35.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b36.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b37.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b38.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b39.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b40.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b41.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b42.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b43.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b44.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b45.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b46.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b47.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b48.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b49.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b50.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b51.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b52.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b53.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b54.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b55.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b56.quantita, NULL::numeric, 0::numeric)::double precision + (COALESCE(rim.quantita, NULL::numeric, 0::numeric) + COALESCE(rimass.quantita, NULL::numeric, 0::numeric))::double precision - (COALESCE(b1.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b2.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b3.quantita, NULL::numeric, 
0::numeric)::double precision + COALESCE(b4.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b5.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b6.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b7.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b8.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b9.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b10.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b11.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b12.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b13.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b14.quantita, NULL::numeric, 0::numeric)::double precision + COALESCE(b15.quantita, NULL::numeric, 0::numeric)::double precision +
Re: [HACKERS] Unreproducible bug in snapshot import code
Gurjeet Singh singh.gurj...@gmail.com writes: Starting from line 89 of terminal1.txt we see this snippet: postgres=# rollback; ROLLBACK postgres=# begin TRANSACTION ; BEGIN postgres=# set transaction snapshot '02C8-1'; ERROR: SET TRANSACTION SNAPSHOT must be called before any query postgres=# rollback; ROLLBACK As is evident from the rest of the lines in terminal1.txt, I have not been able to reproduce this error again; I tried even with a clean build of the sources. Given the weird capitalization of the BEGIN command, I'm guessing that you used tab-completion to enter it. I wonder if this could have been affected by the execution of some query in support of tab completion? I couldn't reproduce it either on the basis of that guess, though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Range Types - typo + NULL string constructor
On 17.10.2011 01:09, Jeff Davis wrote: On Sat, 2011-10-15 at 01:46 +0300, Heikki Linnakangas wrote: * The binary i/o format includes the length of the lower and upper bounds twice, once explicitly in range_send, and a second time within the send-function of the subtype. Seems wasteful. Any ideas how to fix that? How else do I know how much the underlying send function will consume? Oh, never mind. I was misreading the code, it's not sending the length twice. * range_constructor_internal - I think it would be better to move the logic to figure out the arguments into the callers. Done. The comment above range_constructor0() is now outdated. * The gist support functions frequently call range_deserialize(), which does catalog lookups. Isn't that horrendously expensive? Yes, it was. I have introduced a cached structure that avoids syscache lookups when it's the same range as the last lookup (the common case). Hmm, I don't think that's safe. After Oid wraparound, a range type oid might get reused for some other range type, and the cache would return stale values. Extremely unlikely to happen by accident, but could be exploited by an attacker. * What exactly is the canonical function supposed to return? It's not clear what format one should choose as the canonical format when writing a custom range type. And even for the built-in types, it would be good to explain which types use which canonical format (I saw the discussions on that, so I understand that might still be subject to change). The canonical function is just supposed to return a new range such that two equal values will have equal representations. I have listed the built-in discrete range types and their canonical form. As far as describing what a custom range type is supposed to use for the canonical form, I don't know which part is exactly unclear. There aren't too many rules to defining one -- the only guideline is that ranges of equal value going in should be put into one canonical representation. Ok.
The name canonical certainly hints at that, but it would be good to explicitly state that guideline. As the text stands, it would seem that a canonical function that maps [1,7] to [1,8), and also vice versa, [1,8) to [1,7], would be valid. That would be pretty silly, but it would be good to say something like: "The canonical output for two values that are equal, like [1,7] and [1,8), must be equal. It doesn't matter which representation you choose to be the canonical one, as long as two equal values with different formattings are always mapped to the same value with the same formatting." -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unreproducible bug in snapshot import code
Excerpts from Gurjeet Singh's message of lun oct 24 13:55:44 -0300 2011: Starting from line 89 of terminal1.txt we see this snippet: postgres=# rollback; ROLLBACK postgres=# begin TRANSACTION ; BEGIN postgres=# set transaction snapshot '02C8-1'; ERROR: SET TRANSACTION SNAPSHOT must be called before any query postgres=# rollback; ROLLBACK As you can see the SET TRANSACTION SNAPSHOT was the first statement in that transaction, and yet the ERROR message says that it is not. Maybe the tab-completion feature issued a query before the set transaction command. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autovacuum and orphaned large objects
On 24-10-2011 10:57, Robert Haas wrote: I think the main reason why vacuumlo is a contrib module rather than in core is that it is just a heuristic, and it might not be what everyone wants to do. You could store a bunch of large objects in the database and use the returned OIDs to generate links that you email to users, and then when the user clicks on the link we retrieve the corresponding LO and send it to the user over HTTP. In that design, there are no tables in the database at all, yet the large objects aren't orphaned. Wow, what a strange method to solve a problem, and one that could bloat your database. No, I'm not suggesting that we forbid it. The proposed method could clean up orphaned LOs in 95% (if not 99%) of the use cases. I've never heard of someone using LOs the way you describe. It seems strange that someone would distribute an OID number but not store its reference in the same database. Yes, it is a possibility, but ... -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Monster query
Hi, Since the data belong to third parties, I will prepare a database suitable for the purpose. In any case, the compressed backup will be around 20 MB. Regards, Omar Bettin -----Original Message----- From: Jan Urbański [mailto:wulc...@wulczer.org] Sent: Monday, October 24, 2011 18:56 To: Omar Bettin Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] [9.1] unusable for large views On 24/10/11 10:57, Omar Bettin wrote: [monster query] I see that your problem is already solved, but incidentally I'm working on a join order planning module and I'm looking for real-life examples of humongous queries like that to benchmark against. Any chance you could share the schema, or at least part of it, that goes with this query? Or perhaps you have more of these queries? Cheers, Jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant
Robert Haas robertmh...@gmail.com writes: [ oom_score_adj business ] Did we do anything about this? Anyone else have an opinion on what ought to be done? I held off doing anything because it didn't seem like we had consensus. OTOH, it may well be that it's not important enough to demand real consensus, and he who does the work gets to make the choices. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] autovacuum and orphaned large objects
On 24-10-2011 11:36, Tom Lane wrote: Euler Taveira de Oliveira eu...@timbira.com writes: The main point of autovacuum is maintenance tasks. Currently, it executes VACUUM and ANALYZE commands. I want to propose that we incorporate vacuumlo functionality into it. I'm not terribly thrilled with that because (a) large objects seem like mostly a legacy feature from here, and Right, but there isn't a solution for column values larger than 1 GB besides LOs. (b) it's hard to see how to implement it without imposing overhead on everybody, whether they use LOs or not. This is especially problematic if you're proposing that cleanup triggers not be required. I was thinking about starting the LO cleanup after autovacuum finishes the VACUUM command (so no trigger, no new mechanism). As for the overhead imposed, it would only execute the cleanup code on tables that have oid/lo columns (this information would be collected when autovacuum collects table information). -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Monster query
On 24/10/11 19:16, Omar Bettin wrote: Hi, Since the data are of third parties, will prepare a database suitable for the purpose. In any case, the compressed backup will be around 20 MB. If you are able to prepare a database dump that doesn't contain private data, it would be awesome. If it's 20 MB please just post a link to the archive. Thank you very much! Jan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message
Tom Lane t...@sss.pgh.pa.us wrote: It might be that it'd work to deal with this by reducing the reported strength of all such cases from ERROR to WARNING. Not sure that that's a good idea, but it might work. Throwing an error on commit of a transaction when its work has actually been persisted seems as bad as failing to persist the work of a transaction when the commit appears successful. If we know we have met the persistence guarantees and some problem occurs after that, then it seems to me we should issue a warning and indicate success. If we don't know whether the work will be persisted, it seems to me we can block until we find out (if that's feasible) or break the connection without indicating one way or the other. The work of the transaction must either all be persisted or not (I'm assuming this isn't a problem here), and any user must be prepared to deal with a broken connection before response on a commit attempt. I don't know whether changing all failures beyond the persistence point in a commit to warnings is sufficient, but I think it is required for anything which doesn't break the connection. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Separating bgwriter and checkpointer
On 19.10.2011 17:58, Simon Riggs wrote: On Wed, Oct 19, 2011 at 3:29 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Oct 19, 2011 at 9:45 PM, Robert Haas robertmh...@gmail.com wrote: I don't really see any reason to break the monitoring view just because we did some internal refactoring. I'd rather have backward compatibility. Fair enough. The patch doesn't change any documentation, but at least the description of pg_stat_bgwriter seems to need to be changed. Thanks for the review. Will follow up on suggestions. The patch looks sane, it's mostly just moving existing code around, but there's one thing that's been bothering me about this whole idea from the get-go: if the bgwriter and checkpointer are two different processes, whenever the bgwriter writes out a page it needs to send an fsync-request to the checkpointer. We avoided that when both functions were performed by the same process, but now we have to send and absorb an fsync-request message for every single write() that happens in the system, except for those done at checkpoints. Isn't that very expensive? Does it make the fsync-request queue a bottleneck on some workloads? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant
On Mon, Oct 24, 2011 at 1:17 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: [ oom_score_adj business ] Did we do anything about this? Anyone else have an opinion on what ought to be done? I held off doing anything because it didn't seem like we had consensus. OTOH, it may well be that it's not important enough to demand real consensus, and he who does the work gets to make the choices. Half a loaf is better than none. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Inserting heap tuples in bulk in COPY
On 25.09.2011 16:03, Dean Rasheed wrote: On 25 September 2011 09:43, Kohei KaiGai kai...@kaigai.gr.jp wrote: Hi Heikki, I checked your patch, then I have a comment and two questions here. 2011/9/14 Heikki Linnakangas heikki.linnakan...@enterprisedb.com: Attached is a new version of the patch. It is now complete, including WAL replay code. Hi, I had a quick look at the patch as well and spotted an oversight: the multi-insert code branch fails to invoke AFTER ROW triggers. Thanks! Here's an updated version of the patch, fixing that, and all the other issues pointed out this far. I extracted the code that sets oid and tuple headers, and invokes the toaster, into a new function that's shared by heap_insert() and heap_multi_insert(). Tom objected to merging heap_insert() and heap_multi_insert() into one complicated function, and I think he was right on that, but sharing this code to prepare a tuple still makes sense. IMHO it makes heap_insert() slightly more readable too. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 24,29 ****
--- 24,30 ----
   *		heap_getnext	- retrieve next tuple in scan
   *		heap_fetch		- retrieve tuple with given tid
   *		heap_insert		- insert tuple into a relation
+  *		heap_multi_insert - insert multiple tuples into a relation
   *		heap_delete		- delete a tuple from a relation
   *		heap_update		- replace a tuple in a relation with another tuple
   *		heap_markpos	- mark scan position
***************
*** 79,84 **** static HeapScanDesc heap_beginscan_internal(Relation relation,
--- 80,87 ----
  					int nkeys, ScanKey key,
  					bool allow_strat, bool allow_sync,
  					bool is_bitmapscan);
+ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
+ 					TransactionId xid, CommandId cid, int options);
  static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  					ItemPointerData from, Buffer newbuf, HeapTuple newtup,
  					bool all_visible_cleared, bool new_all_visible_cleared);
***************
*** 1866,1920 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
  	Buffer		vmbuffer = InvalidBuffer;
  	bool		all_visible_cleared = false;
  
- 	if (relation->rd_rel->relhasoids)
- 	{
- #ifdef NOT_USED
- 		/* this is redundant with an Assert in HeapTupleSetOid */
- 		Assert(tup->t_data->t_infomask & HEAP_HASOID);
- #endif
- 
- 		/*
- 		 * If the object id of this tuple has already been assigned, trust the
- 		 * caller.  There are a couple of ways this can happen.  At initial db
- 		 * creation, the backend program sets oids for tuples.  When we define
- 		 * an index, we set the oid.  Finally, in the future, we may allow
- 		 * users to set their own object ids in order to support a persistent
- 		 * object store (objects need to contain pointers to one another).
- 		 */
- 		if (!OidIsValid(HeapTupleGetOid(tup)))
- 			HeapTupleSetOid(tup, GetNewOid(relation));
- 	}
- 	else
- 	{
- 		/* check there is not space for an OID */
- 		Assert(!(tup->t_data->t_infomask & HEAP_HASOID));
- 	}
- 
- 	tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
- 	tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
- 	tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
- 	HeapTupleHeaderSetXmin(tup->t_data, xid);
- 	HeapTupleHeaderSetCmin(tup->t_data, cid);
- 	HeapTupleHeaderSetXmax(tup->t_data, 0);		/* for cleanliness */
- 	tup->t_tableOid = RelationGetRelid(relation);
- 
  	/*
! 	 * If the new tuple is too big for storage or contains already toasted
! 	 * out-of-line attributes from some other relation, invoke the toaster.
  	 *
  	 * Note: below this point, heaptup is the data we actually intend to store
  	 * into the relation; tup is the caller's original untoasted data.
  	 */
! 	if (relation->rd_rel->relkind != RELKIND_RELATION)
! 	{
! 		/* toast table entries should never be recursively toasted */
! 		Assert(!HeapTupleHasExternal(tup));
! 		heaptup = tup;
! 	}
! 	else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
! 		heaptup = toast_insert_or_update(relation, tup, NULL, options);
! 	else
! 		heaptup = tup;
  
  	/*
  	 * We're about to do the actual insert -- but check for conflict first,
--- 1869,1882 ----
  	Buffer		vmbuffer = InvalidBuffer;
  	bool		all_visible_cleared = false;
  
  	/*
! 	 * Fill in tuple header fields, assign an OID, and toast the tuple if
! 	 * necessary.
  	 *
  	 * Note: below this point, heaptup is the data we actually intend to store
  	 * into the relation; tup is the caller's original untoasted data.
  	 */
! 	heaptup = heap_prepare_insert(relation, tup, xid, cid, options);
  
  	/*
  	 * We're about to do the actual insert -- but check for conflict first,
***************
*** 2035,2041 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
  	 */
  	CacheInvalidateHeapTuple(relation, heaptup, NULL);
  
! 	pgstat_count_heap_insert(relation);
  
  	/*
  	 * If heaptup is a private copy, release it.  Don't
Re: [HACKERS] So, is COUNT(*) fast now?
Tom Lane t...@sss.pgh.pa.us wrote: I had wondered whether it'd be worth optimizing that along the lines of slot_getallattrs(). But most indexes probably have only one column, or anyway not enough to make for a useful savings. From a heavily-used production database:

cir=> select indnatts, count(*) from pg_index group by indnatts order by indnatts;
 indnatts | count
----------+-------
        1 |   200
        2 |   684
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)

This includes system table and TOAST table indexes (which seem to have two columns). There are over 400 user tables, each of which has a primary key, so most primary keys in our database are more than one column. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Unreproducible bug in snapshot import code
On Mon, Oct 24, 2011 at 1:08 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Gurjeet Singh's message of lun oct 24 13:55:44 -0300 2011: Starting from line 89 of terminal1.txt we see this snippet:

postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02C8-1';
ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
postgres=# rollback;
ROLLBACK

As you can see, the SET TRANSACTION SNAPSHOT was the first statement in that transaction, and yet the ERROR message says that it is not. Maybe the tab-completion feature issued a query before the set transaction command. I have tried reproducing the bug starting from 1 and 2 transactions before the one shown in the snippet, and I used tab-completion to get the same screen output as terminal1.txt, and yet it's not reproducible. Regards, -- Gurjeet Singh EnterpriseDB Corporation The Enterprise PostgreSQL Company
Re: [HACKERS] So, is COUNT(*) fast now?
Kevin Grittner kevin.gritt...@wicourts.gov writes: Tom Lane t...@sss.pgh.pa.us wrote: I had wondered whether it'd be worth optimizing that along the lines of slot_getallattrs(). But most indexes probably have only one column, or anyway not enough to make for a useful savings. From a heavily-used production database:

cir=> select indnatts, count(*) from pg_index group by indnatts order by indnatts;
 indnatts | count
----------+-------
        1 |   200
        2 |   684
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)

This includes system table and TOAST table indexes (which seem to have two columns). Yeah, TOAST indexes are 2-column. It would be best to exclude those from your counts, since it seems pretty unlikely that anyone will care how fast nodeIndexonlyscan.c is for scans on toast tables. There are over 400 user tables, each of which has a primary key, so most primary keys in our database are more than one column. It doesn't look to me like the mean is above 2 (unless you have many fewer toast tables than I suspect), so trying to optimize many-column cases isn't going to help. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] (PATCH) Adding CORRESPONDING to Set Operations
On Wed, October 19, 2011 15:01, Kerem Kat wrote: Adding CORRESPONDING to Set Operations Initial patch, filename: corresponding_clause_v2.patch I had a quick look at the behaviour of this patch. Btw, the examples in your email were typoed (one select is missing): SELECT 1 a, 2 b, 3 c UNION CORRESPONDING 4 b, 5 d, 6 c, 7 f; should be: SELECT 1 a, 2 b, 3 c UNION CORRESPONDING select 4 b, 5 d, 6 c, 7 f; and SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(b) 4 b, 5 d, 6 c, 7 f; should be: SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(b) select 4 b, 5 d, 6 c, 7 f; But there is also a small bug, I think: the order in the CORRESPONDING BY list should be followed, according to the standard (foundation, p. 408): 2) If corresponding column list is specified, then let SL be a select list of those column names explicitly appearing in the corresponding column list in the order that these column names appear in the corresponding column list. Every column name in the corresponding column list shall be a column name of both T1 and T2. That would make this wrong, I think:

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(c,b) select 5 d, 6 c, 7 f, 4 b ;
 b | c
---+---
 2 | 3
 4 | 6
(2 rows)

i.e., I think it should show columns in the order c, b (and not b, c); the order of the CORRESPONDING BY phrase. (but maybe I'm misreading the text of the standard; I find it often difficult to follow) Thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] So, is COUNT(*) fast now?
Tom Lane t...@sss.pgh.pa.us wrote: Yeah, TOAST indexes are 2-column. It would be best to exclude those from your counts, since it seems pretty unlikely that anyone will care how fast nodeIndexonlyscan.c is for scans on toast tables. User indexes (excluding toast):

 indnatts | count
----------+-------
        1 |   200
        2 |   222
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)

System indexes (excluding toast):

 indnatts | count
----------+-------
        1 |    46
        2 |    24
        3 |     9
        4 |     5
(4 rows)

It doesn't look to me like the mean is above 2 (unless you have many fewer toast tables than I suspect), so trying to optimize many-column cases isn't going to help. The mean is 2.4 (give or take a little depending on whether you include system tables). I have no idea where the optimization becomes worthwhile, but the assertion that most indexes probably have a single column worried me. I'm sure there are databases where that is true (especially for those who insist on adding a meaningless surrogate key column to every table), but there are many where it isn't true. I would guess that our average of 2.4 is higher than most, though. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] So, is COUNT(*) fast now?
Copy/paste problems -- the first set includes the system tables except for toast. User tables would be the difference between the results below. Sorry. -Kevin Kevin Grittner kevin.gritt...@wicourts.gov wrote: Tom Lane t...@sss.pgh.pa.us wrote: Yeah, TOAST indexes are 2-column. It would be best to exclude those from your counts, since it seems pretty unlikely that anyone will care how fast nodeIndexonlyscan.c is for scans on toast tables. User indexes (excluding toast):

 indnatts | count
----------+-------
        1 |   200
        2 |   222
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)

System indexes (excluding toast):

 indnatts | count
----------+-------
        1 |    46
        2 |    24
        3 |     9
        4 |     5
(4 rows)

It doesn't look to me like the mean is above 2 (unless you have many fewer toast tables than I suspect), so trying to optimize many-column cases isn't going to help. The mean is 2.4 (give or take a little depending on whether you include system tables). I have no idea where the optimization becomes worthwhile, but the assertion that most indexes probably have a single column worried me. I'm sure there are databases where that is true (especially for those who insist on adding a meaningless surrogate key column to every table), but there are many where it isn't true. I would guess that our average of 2.4 is higher than most, though. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] So, is COUNT(*) fast now?
On Mon, Oct 24, 2011 at 2:15 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: It doesn't look to me like the mean is above 2 (unless you have many fewer toast tables than I suspect), so trying to optimize many-column cases isn't going to help. The mean is 2.4 (give or take a little depending on whether you include system tables). I have no idea where the optimization becomes worthwhile, but the assertion that most indexes probably have a single column worried me. I'm sure there are databases where that is true (especially for those who insist on adding a meaningless surrogate key column to every table), but there are many where it isn't true. I would guess that our average of 2.4 is higher than most, though. Keep in mind that the existence of index-only scans is going to encourage people to try to create covering indexes on columns that aren't indexed today. It's not clear how much of a win that will be; for certain workloads, HOT might make it backfire spectacularly. But even though Tom's statement that most indexes are one column might be a slight exaggeration, I suspect it probably is true that the optimizations he's talking about for large numbers of columns won't produce any material benefit even for a 3 or 4 column index. Which makes me think maybe we should focus our efforts elsewhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] So, is COUNT(*) fast now?
Robert Haas robertmh...@gmail.com writes: But even though Tom's statement that most indexes are one column might be a slight exaggeration, I suspect it probably is true that the optimizations he's talking about for large numbers of columns won't produce any material benefit even for a 3 or 4 column index. Which makes me think maybe we should focus our efforts elsewhere. Right. If we thought the average was something like ten, it might be worth pursuing optimizations similar to slot_getallattrs. If it's around two or three, almost certainly not. Your point about people trying to create wider indexes to exploit index-only scans is an interesting one, but I think it's premature to optimize on the basis of hypotheses about what people might do in future. Not sure about your other idea of returning multiple tuples per amgettuple call. The trouble with that is that it will add complexity (and hence cycles) at the nodeIndexscan level, because now nodeIndexscan will have to buffer those tuples, keep track of whether it's fetching forward or backward, etc etc. Plus another layer of the same in indexam.c (index_getnext etc). I'm not at all convinced that it's likely to be a net win. I wonder how trustworthy the measure of the visibilitymap_test call site as a consumer of cycles really is. I've frequently noticed that oprofile blames remarkably large fractions of the runtime on individual statements that appear to be quite trivial. I'm not sure if that represents real hardware-level effects such as cache line switching, or whether it's just measurement artifacts. Keep in mind that sampling-based measurements are always subject to sampling artifacts. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] So, is COUNT(*) fast now?
On 10/24/11 12:35 PM, Tom Lane wrote: Your point about people trying to create wider indexes to exploit index-only scans is an interesting one, but I think it's premature to optimize on the basis of hypotheses about what people might do in future. I don't think that this is hypothetical at all. I know *I'll* be doing it, and we can expect users who are familiar with MySQL and Oracle to do it as well. No, it won't be the majority of our users, who are using ORMs and thus don't really think about indexing at all. But it will be a significant number of users who are performance-sensitive ... such as most or all of our data warehousing users. Mind you, we're pretty much talking exclusively about users whose tables don't fit in memory ... usually tables which are 10X or more the size of memory. One case which is going to be critical to test is the join table, i.e. the table which supports many-to-many joins and consists only of keys from the respective two other tables. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] So, is COUNT(*) fast now?
On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane t...@sss.pgh.pa.us wrote: Your point about people trying to create wider indexes to exploit index-only scans is an interesting one, but I think it's premature to optimize on the basis of hypotheses about what people might do in future. Well, I don't think it's too much of a stretch to guess that people will try to use covering indexes; that's common practice on other products and a frequent, heartfelt request from people with large (non-memory-resident) databases they want to migrate to PostgreSQL. Exactly to what degree they'll do that, and how well it will work, is another question. But I have little doubt that it will be tried. Not sure about your other idea of returning multiple tuples per amgettuple call. The trouble with that is that it will add complexity (and hence cycles) at the nodeIndexscan level, because now nodeIndexscan will have to buffer those tuples, keep track of whether it's fetching forward or backward, etc etc. Plus another layer of the same in indexam.c (index_getnext etc). I'm not at all convinced that it's likely to be a net win. I definitely agree that you don't want two layers of caching, but I don't see why we'd need that. I wasn't thinking of changing index_getnext() at all, but rather adding a new API that fills a buffer (or a pair of buffers, maybe) with index tuples and heap TIDs. It should spit them out in the same order that multiple index_getnext() calls would have done and leave the scan position wherever it would have ended up after a number of index_getnext_tid() calls equal to the number of TIDs returned. Any user of the API (and it might be just nodeIndexonlyscan.c) would just need to keep track of the number of tuples returned and the number consumed to date. This actually gets into a wider architectural discussion, which is whether the fact that the whole executor (modulo bitmap scans and a few other special cases) operates on one tuple at a time is a good design... 
but my brain hurts just thinking about that. I wonder how trustworthy the measure of the visibilitymap_test call site as a consumer of cycles really is. I've frequently noticed that oprofile blames remarkably large fractions of the runtime on individual statements that appear to be quite trivial. I'm not sure if that represents real hardware-level effects such as cache line switching, or whether it's just measurement artifacts. Keep in mind that sampling-based measurements are always subject to sampling artifacts. I'm not sure either. I guess we could try short-circuiting visibilitymap_test and see what that does to performance (let's leave correct answers out of it). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] So, is COUNT(*) fast now?
Josh Berkus j...@agliodbs.com wrote: On 10/24/11 12:35 PM, Tom Lane wrote: Your point about people trying to create wider indexes to exploit index-only scans is an interesting one, but I think it's premature to optimize on the basis of hypotheses about what people might do in future. I don't think that this is hypothetical at all. I know *I'll* be doing it, and we can expect users who are familiar with MySQL and Oracle to do it as well. And Sybase, and MS SQL Server. And others, most likely. We've never gotten around to narrowing the indexes to which we added extra columns to overcome performance problems through covering index techniques when we were using Sybase, so they're already here. :-) One case which is going to be critical to test is the join table, i.e. the table which supports many-to-many joins and consists only of keys from the respective two other tables. Yeah, that is an important use of covering indexes for us. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] So, is COUNT(*) fast now?
Robert Haas robertmh...@gmail.com writes: On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane t...@sss.pgh.pa.us wrote: I wonder how trustworthy the measure of the visibilitymap_test call site as a consumer of cycles really is. I'm not sure either. I guess we could try short-circuiting visibilitymap_test and see what that does to performance (let's leave correct answers out of it). That would conflate the cost of the call with the cost of the function. Maybe you could try manually inlining the visibility test? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Idea: Always consistent in-database cache using SSI mechanisms
Hackers, After Heikki's talk at PGConf.EU about SSI, an idea came to my mind. Couldn't the predicate locking implementation in SSI be used for in-database cache invalidation? It could be possible to implement an in-database cache which would acquire predicate locks like SSI transactions do. In case of any conflict with other transactions, the corresponding cache entry would be invalidated. Therefore, it might be possible to get the acceleration of caching without the risk of inconsistent answers. Actually, I don't understand SSI in detail, so I probably mess things up. Does my idea make any sense? -- With best regards, Alexander Korotkov.
Re: [HACKERS] Idea: Always consistent in-database cache using SSI mechanisms
Alexander Korotkov aekorot...@gmail.com wrote: Couldn't the predicate locking implementation in SSI be used for in-database cache invalidation? It would not necessarily be limited to *in-database* caches. The main thing would be to design a good API to the predicate locking portion of SSI, which I think is about 80% of the SSI code. Dan and I both have an interest in such further use, and there have been others who have talked about potential uses for the non-blocking predicate locking. I think the API would need to be based around a listen/notify model. It could be possible to implement an in-database cache which would acquire predicate locks like SSI transactions do. In case of any conflict with other transactions, the corresponding cache entry would be invalidated. Therefore, it might be possible to get the acceleration of caching without the risk of inconsistent answers. I had not thought of that potential use. At first glance, I think it has possibilities, but only if the above-mentioned API was formalized *and* there was some way to configure a cluster for serializable transactions only. Long-range, I have hopes for both. Actually, I don't understand SSI in detail, so I probably mess things up. Does my idea make any sense? Sure! Besides having the available development time, I think the biggest obstacle is having enough plausible use cases for predicate lock access to do a good job defining the API. While we made some effort to keep the predicate locking and serializable behavior separate in the implementation, it wasn't clear where the bright line was, so there is bound to be some rearrangement needed when we figure that out. The more ideas we have in front of us on how predicate locks might be useful, the better the API design is likely to be. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSI implementation question
Tom Lane <t...@sss.pgh.pa.us> wrote:

> I don't understand the SSI code well enough to tell if this is
> sufficient or not, so I hope you guys will take a closer look at the
> issue when you have time.

I will definitely give it a look. Right now we have a perfect storm of time demands due to some recently-passed legislation combined with the need to wrap up some Annual Plan items. Within a few weeks I should find time for a more careful review and testing.

-Kevin
Re: [HACKERS] Idea: Always consistent in-database cache using SSI mechanisms
On Tue, Oct 25, 2011 at 1:46 AM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:

>> Couldn't the predicate locking implementation in SSI be used for
>> in-database cache invalidation?
>
> It would not necessarily be limited to *in-database* caches. The main
> thing would be to design a good API to the predicate locking portion
> of SSI, which I think is about 80% of the SSI code. Dan and I both
> have an interest in such further use, and there have been others who
> have talked about potential uses for the non-blocking predicate
> locking. I think the API would need to be based around a
> listen/notify model.
>
>> It might be possible to implement an in-database cache which acquires
>> predicate locks the way SSI transactions do. On any conflict with
>> another transaction, the corresponding cache entry would be
>> invalidated. Thus, it might be possible to get the acceleration of
>> caching without the risk of inconsistent answers.
>
> I had not thought of that potential use. At first glance, I think it
> has possibilities, but only if the above-mentioned API was formalized
> *and* there was some way to configure a cluster for serializable
> transactions only. Long-range, I have hopes for both.

Sure, it would be rather better to implement that through an API.

>> Actually, I don't understand SSI in detail, so I'm probably mixing
>> things up. Does my idea make any sense?
>
> Sure! Besides having the available development time, I think the
> biggest obstacle is having enough plausible use cases for predicate
> lock access to do a good job of defining the API. While we made some
> effort to keep the predicate locking and serializable behavior
> separate in the implementation, it wasn't clear where the bright line
> was, so there is bound to be some rearrangement needed when we figure
> that out. The more ideas we have in front of us on how predicate
> locks might be useful, the better the API design is likely to be.

Thanks for the feedback on my idea. I'll share ideas about more possible uses of that API if I have any.
-- With best regards, Alexander Korotkov.
Re: [HACKERS] Online base backup from the hot-standby
Thanks for the review!

On Tue, Oct 25, 2011 at 12:24 AM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
> On 24.10.2011 15:29, Fujii Masao wrote:
>> + <listitem>
>> +  <para>
>> +   Copy the pg_control file from the cluster directory to the global
>> +   sub-directory of the backup. For example:
>> + <programlisting>
>> + cp $PGDATA/global/pg_control /mnt/server/backupdir/global
>> + </programlisting>
>> +  </para>
>> + </listitem>
>
> Why is this step required? The control file is overwritten by
> information from the backup_label anyway, no?

Yes, when recovery starts, the control file is overwritten. But before that, we retrieve the minimum recovery point from the control file; it's then used as the backup end location. During recovery, pg_stop_backup() cannot write an end-of-backup record, so in a standby-only backup another way to retrieve the backup end location (instead of an end-of-backup record) is required. Ishiduka-san used the control file for that, according to your suggestion ;)
http://archives.postgresql.org/pgsql-hackers/2011-05/msg01405.php

>> + <listitem>
>> +  <para>
>> +   Again connect to the database as a superuser, and execute
>> +   <function>pg_stop_backup</>. This terminates the backup mode, but does not
>> +   perform a switch to the next WAL segment, create a backup history file and
>> +   wait for all required WAL segments to be archived,
>> +   unlike that during normal processing.
>> +  </para>
>> + </listitem>
>
> How do you ensure that all the required WAL segments have been
> archived, then?

The patch doesn't provide any capability to ensure that; IOW, it assumes that's the user's responsibility. If users want to ensure it, they need to calculate the backup start and end WAL files from the results of pg_start_backup() and pg_stop_backup() respectively, and wait until those files have appeared in the archive. Also, if a required WAL file has not been archived yet, the user might need to execute pg_switch_xlog() on the master.
If we change pg_stop_backup() so that, even during recovery, it waits until all required WAL files have been archived, we would need to WAL-log the completion of WAL archiving on the master. That would enable the standby to check whether the specified WAL files have been archived. Should we change the patch in this way? But even if we did, you might still need to execute pg_switch_xlog() on the master, and pg_stop_backup() might keep waiting indefinitely if the master makes no progress.

>> + </orderedlist>
>> + </para>
>> +
>> + <para>
>> +  You cannot use the <application>pg_basebackup</> tool to take the backup
>> +  from the standby.
>> + </para>
>
> Why not? We have cascading replication now.

Because no one has implemented that feature yet. Yes, we have cascading replication, but without the standby-only backup patch, pg_basebackup cannot execute do_pg_start_backup() and do_pg_stop_backup() during recovery. So the patch that Ishiduka-san proposed can be seen as the first step toward extending pg_basebackup so that it can take a backup from the standby.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
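The "calculate the backup start and end WAL files" step above can be sketched as follows. This is a hypothetical helper for a backup script, not part of PostgreSQL; it assumes the default 16MB WAL segment size and the 24-hex-digit segment naming (timeline, xlogid, segment), so a script can poll the archive directory for the segments containing the LSNs returned by pg_start_backup() and pg_stop_backup().

```python
# Hypothetical helper (not part of PostgreSQL): map an LSN such as
# '0/9000028' to the name of the WAL segment file that contains it,
# assuming the default 16MB segment size.

WAL_SEG_SIZE = 16 * 1024 * 1024  # 16MB WAL segments

def wal_file_name(lsn, timeline=1):
    """Return the WAL segment file name containing the given LSN."""
    logid, offset = (int(part, 16) for part in lsn.split('/'))
    segno = offset // WAL_SEG_SIZE
    # File name is timeline + xlogid + segment, each 8 hex digits.
    return '%08X%08X%08X' % (timeline, logid, segno)

# e.g. for a value returned by pg_start_backup():
print(wal_file_name('0/9000028'))   # 000000010000000000000009
```

A script would compute this for both the start and end LSNs and wait until both files (and everything between them) appear in the archive, executing pg_switch_xlog() on the master if the last one never shows up.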
Re: [HACKERS] Separating bgwriter and checkpointer
On Mon, Oct 24, 2011 at 11:40 AM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
> The patch looks sane, it's mostly just moving existing code around, but
> there's one thing that's been bothering me about this whole idea from
> the get-go: If the bgwriter and checkpointer are two different
> processes, whenever the bgwriter writes out a page it needs to send an
> fsync request to the checkpointer. We avoided that when both functions
> were performed by the same process, but now we have to send and absorb
> an fsync-request message for every single write() that happens in the
> system, except for those done at checkpoints. Isn't that very
> expensive? Does it make the fsync-request queue a bottleneck on some
> workloads?

That is a reasonable question and one I considered. I did some benchmarking earlier to see the overhead of that. Basically, it's very small, much smaller than I thought. The benefit of allowing the bgwriter to continue working during long fsyncs easily outweighs the cost of the extra fsync requests. Both the overhead and the benefit occur at the same time, so the overhead is always covered.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
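One reason the per-write() cost stays small is that fsync requests are deduplicated per file segment before the checkpointer acts on them. The following is a toy model of that queue, not the actual PostgreSQL implementation (which uses a shared-memory array and absorbs requests in the checkpointer); it only illustrates the compaction effect.

```python
# Toy model (NOT the real PostgreSQL code) of the fsync-request queue
# between the bgwriter and the checkpointer. The key point: requests are
# deduplicated per (relfilenode, segment), so thousands of write()s to
# the same file segment collapse into a single pending fsync.

class FsyncRequestQueue:
    def __init__(self):
        self.pending = set()   # set of (relfilenode, segment) pairs

    def forward_fsync_request(self, relfilenode, segment):
        # Called by the bgwriter after each write(); duplicates are absorbed.
        self.pending.add((relfilenode, segment))

    def absorb_and_fsync(self):
        # Called by the checkpointer at checkpoint time.
        requests, self.pending = self.pending, set()
        return sorted(requests)   # in reality: fsync() each file segment

queue = FsyncRequestQueue()
for block in range(10000):        # 10000 writes spanning two segments...
    queue.forward_fsync_request(relfilenode=16384, segment=block // 8192)
print(len(queue.absorb_and_fsync()))   # ...become just 2 fsync requests
```

The residual cost Heikki asks about is the queue traffic itself, not the fsyncs, which matches Simon's observation that the measured overhead is small.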
Re: [HACKERS] Online base backup from the hot-standby
On Tue, Oct 25, 2011 at 12:33 AM, Heikki Linnakangas <heikki.linnakan...@enterprisedb.com> wrote:
> One problem with this whole FPW-tracking is that pg_lesslog makes it
> fail. I'm not sure what we need to do about that - maybe just add a
> warning to the docs. But it leaves a bit of a bad feeling in my mouth.
> Usually we try to make features work orthogonally, without dependencies
> on other settings. Now this feature requires that full_page_writes is
> turned on in the master, and also that you don't use pg_lesslog to
> compress the WAL segments, or your base backup might be corrupt.

Right, pg_lesslog users cannot use the documented procedure. They need to follow a more complex one:

1. Execute pg_start_backup() on the master, and save its return value.
2. Wait until the backup starting checkpoint record has been replayed in the standby. You can do this by comparing the return value of pg_start_backup() with pg_last_replay_location().
3. Do the documented standby-only backup procedure.
4. Execute pg_stop_backup() on the master.

This is complicated, but I'm not sure how we can simplify it. Anyway, we can document this procedure for pg_lesslog users. Should we?

> The procedure to take a backup from the standby seems more complicated
> than taking it on the master - there are more steps to follow.

Extending pg_basebackup so that it can take a backup from the standby would simplify the procedure to a certain extent, I think. Though a user would still need to enable FPW in the master and must not use pg_lesslog.

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
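The wait in step 2 comes down to comparing two LSNs. A minimal sketch, assuming the textual 'xlogid/offset' format that pg_start_backup() and pg_last_replay_location() return; the polling around the comparison is schematic, not real client code:

```python
# Hypothetical sketch of step 2's wait condition: has the standby's
# pg_last_replay_location() passed the LSN returned by pg_start_backup()
# on the master? Only the LSN comparison is real logic.

def parse_lsn(lsn):
    """Convert an LSN like '0/2000020' into a comparable integer."""
    logid, offset = (int(part, 16) for part in lsn.split('/'))
    return (logid << 32) | offset

def replay_has_reached(start_lsn, last_replay_lsn):
    # True once the standby has replayed up to (or past) the backup
    # starting location, so the standby-only backup may begin.
    return parse_lsn(last_replay_lsn) >= parse_lsn(start_lsn)

# e.g. the master's pg_start_backup() returned '0/2000020':
print(replay_has_reached('0/2000020', '0/1FFFFFF'))  # False: keep waiting
print(replay_has_reached('0/2000020', '0/2000058'))  # True: start the copy
```

A real script would run the two queries over separate connections to the master and the standby and sleep between polls.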