Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-24 Thread Florian Pflug
On Oct24, 2011, at 01:27 , Simon Riggs wrote:
 FATAL:  could not access status of transaction 21110784
 which, in pg_subtrans, is the first xid on a new subtrans page. So we
 have missed zeroing a page.
 
 pg_control shows ... Latest checkpoint's oldestActiveXID:  2111
 which shows quite clearly that the pg_control file is later than it should be.

But shouldn't pg_control be largely irrelevant in a hot backup scenario? Most
(all?) of the information contained therein should be overwritten with the
contents of the checkpoint referenced by the backup label, shouldn't it?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Backup with rsync fails at pg_clog if under load

2011-10-24 Thread Florian Pflug
On Oct23, 2011, at 22:48 , Daniel Farina wrote:
 It doesn't seem meaningful for StartupCLOG (or, indeed, any of the
 hot-standby path functionality) to be called before that code is
 executed, but it is anyway right now.

I think the idea is to check that the CLOG part which recovery *won't*
overwrite is consistent (or rather, given the simplicity of the check,
at least accessible).

Heikki said the following somewhere else in this thread when I suggested
something similar to your proposal:

 There are pretty clear rules on what state clog can be in. When you launch 
 postmaster in a standby:
 
 * Any clog preceding the nextXid from the checkpoint record we start 
 recovery from, must either be valid, or the clog file must be missing 
 altogether (which can happen when it was vacuumed away while the backup was in 
 progress - if the clog is still needed at the end of the backup it must not be 
 missing, of course).
 * Any clog following nextXid can be garbled or missing.
 
 Recovery will overwrite any clog after nextXid from the WAL, but not the 
 clog before it.

I think Simon's theory that we're starting recovery from the wrong place,
i.e. should start with an earlier WAL location, is probably correct. The
question is, why?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Silent failure with invalid hba_file setting

2011-10-24 Thread Thom Brown
On 19 October 2011 05:20, Tom Lane t...@sss.pgh.pa.us wrote:
 Peter Eisentraut pete...@gmx.net writes:
 I wouldn't add extra special checks for that.  It might not be
 completely unreasonable to have a standby that no one can connect to,
 for example.

 Well, you couldn't monitor its state then, so I don't find that example
 very convincing.  But if you were intent on having that, you could
 easily set up a pg_hba file containing only reject entries.

I hadn't noticed you'd committed some changes around this until now. Thanks.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 10:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 as it seems to me that any
 client that is paranoid enough to care about sync rep had better
 already be handling the case of a connection loss during commit.

 Agreed, but that is a problem that by definition we can't help with.
 Also, the issue with connection loss is that you really can't know
 whether your transaction got committed without reconnecting and looking
 for evidence.  There is no reason at all to inject such uncertainty
 into the cancel-SyncRepWaitForLSN case.  We know the transaction got
 committed,

I disagree.  The whole point of synchronous replication is that the
user is worried about the case where the primary goes away just after
the commit is acknowledged to the client.  Consider the following
scenario: Someone has determined that it can't be reached from 90% of
the corporate Internet, but the synchronous standby, which is
naturally on another network, still has connectivity.  So they log
into the master and perform a fast shutdown.  When they reconnect, the
connection pooler (or other mechanism) redirects their connection to
the standby, which has since been promoted.  ISTM that the client had
darn well better go search for hard evidence about the transaction
state.

 But I think that throwing an ERROR is likely to cause a LOT of client
 breakage, even if you have some special (human-invisible?) flag that
 indicates that you don't really mean it.  If we must do something
 other than simulating a server disconnect, letting the command
 completion message go through and annotating it with a NOTICE or
 WARNING seems preferable.

 I think you're thinking narrowly of the SyncRepWaitForLSN case.  What
 I'm trying to point out is that there's a boatload of post-commit code
 which is capable of sometimes throwing errors, and that's not ever
 going to go away completely.

 It might be that it'd work to deal with this by reducing the reported
 strength of all such cases from ERROR to WARNING.  Not sure that that's
 a good idea, but it might work.

It's hard to be sure that a systematic approach will work.  For
example, if we can't nuke a memory context for some reason, it
wouldn't be utterly crazy to just ignore the problem and try to
soldier on.  We've probably leaked some memory, but oh well.  If we've
failed to release a heavyweight lock we had better call
LockReleaseAll() somehow, but the details of what gets sent to the
client are negotiable and a WARNING is probably fine.  On the other
hand, if we experienced some failure that affects our ability to make
the transaction globally visible (like we wrote the commit record but
then fail trying to acquire ProcArrayLock to clear our xmin), it's
hard to believe that anything other than PANIC is enough.

Because of that and similar cases elsewhere, including for example
inside the lock manager, I've long been feeling grumpy about this:

/* Ensure we will have room to remember the lock */
if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
    elog(ERROR, "too many LWLocks taken");

It seems to me that the idea that the abort path is going to be able
to recover from that situation is wildly optimistic.  Fortunately, our
coding practices are good enough that I think it never happens anyway,
but if it does it should surely PANIC.
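
To make that concrete, the change being argued for is just a severity bump on
the snippet quoted above (a sketch, not a proposed patch):

    /* Ensure we will have room to remember the lock */
    if (num_held_lwlocks >= MAX_SIMUL_LWLOCKS)
        elog(PANIC, "too many LWLocks taken");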

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updated version of pg_receivexlog

2011-10-24 Thread Magnus Hagander
On Mon, Oct 24, 2011 at 16:12, Jaime Casanova ja...@2ndquadrant.com wrote:
 On Mon, Oct 24, 2011 at 7:40 AM, Magnus Hagander mag...@hagander.net wrote:

 synchronous_standby_names='*' is prone to such confusion in general, but it
 seems that it's particularly surprising if a running pg_basebackup lets a
 commit in synchronous replication to proceed. Maybe we just need a warning
 in the docs. I think we should advise that synchronous_standby_names='*' is
 dangerous in general, and cite this as one reason for that.

 Hmm. I think this is common enough that we want to make sure we avoid
 it in code.

 Could we pass a parameter from the client indicating to the master
 that it refuses to be a sync slave? An optional keyword to the
 START_REPLICATION command, perhaps?


 can't you execute set synchronous_commit to off/local for this connection?

This is a walsender connection, it doesn't take SQL. Plus it's the
receiving end, and SET sync_commit is for the sending end.

That said, we are reasonably safe with the current implementation, because
it always sets the flush location to InvalidXLogRecPtr, so it will not be
considered as a sync slave. Should we ever start accepting the write location
as the point to sync against, the problem will show up, of course.
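
Roughly, that safety comes from the fact that a walsender is only eligible to
be the synchronous standby once it has reported a valid flush position; a
simplified sketch of that eligibility test (not the actual syncrep.c code,
field names follow the walsender shared state):

    if (walsnd->pid != 0 &&
        walsnd->sync_standby_priority > 0 &&
        !XLogRecPtrIsInvalid(walsnd->flush))
    {
        /* candidate for synchronous standby */
    }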

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: GiST for range types (was Re: [HACKERS] Range Types - typo + NULL string constructor)

2011-10-24 Thread Alexander Korotkov
Hi!

On Mon, Oct 17, 2011 at 12:38 PM, Jeff Davis pg...@j-davis.com wrote:

  I started implementing subtype_diff, and I noticed that it requires
  defining an extra function for each range type. Previously, the numeric
  types could just use a cast, which was convenient for user-defined range
  types.
 
  If you have any other ideas to make that cleaner, please let me know.
  Otherwise I'll just finish implementing subtype_diff.

I think implementing subtype_diff for each datatype is OK.  We could
implement some universal function based on the minus operator and a cast to
double precision, but such a solution might be unacceptable in terms of both
predictability (the operator and casting function might not do what we
expect) and performance.

I'm beginning to think that we should just allow the user to specify
 their own gist_penalty function. Specifying just the subtype_diff
 doesn't save much time, and it can only be limiting. Additionally, it's
 harder for users to understand the purpose of the function.

If we allow the user to specify their own gist_penalty function, then such a
function would have to deal with:
1) GiST-specific data structures such as GISTENTRY.
2) Decomposing ranges using range_deserialize.
3) Infinities, which we could otherwise handle in the general penalty function.
That's why I prefer to implement subtype_diff.
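
For illustration, a per-type subtype_diff for a float8-based range can be as
small as this (a sketch following the usual fmgr conventions; the function
name is only illustrative):

    #include "postgres.h"
    #include "fmgr.h"

    Datum
    float8range_subdiff(PG_FUNCTION_ARGS)
    {
        float8  v1 = PG_GETARG_FLOAT8(0);
        float8  v2 = PG_GETARG_FLOAT8(1);

        /* distance between two subtype values, as a double, for the penalty */
        PG_RETURN_FLOAT8(v1 - v2);
    }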

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] ALTER TABLE ONLY ...DROP CONSTRAINT is broken in HEAD.

2011-10-24 Thread Robert Haas
On Mon, Sep 12, 2011 at 11:53 AM, Alexey Klyukin
al...@commandprompt.com wrote:
 This works in 9.1, but not in HEAD:

 CREATE TABLE parent(id INTEGER, CONSTRAINT id_check CHECK(id1));
 CREATE TABLE child() INHERITS(parent);

 ALTER TABLE ONLY parent DROP CONSTRAINT id_check;

 I'm getting:
 ERROR:  relation 16456 has non-inherited constraint "id_check"
 where 16456 is the oid of the child table.

 It seems that the pg_constraint scan at ATExecDropConstraint 
 (tablecmds.c:6751) is re-reading those tuples that were updated in the 
 previous iterations of this scan, at least that's what I've observed in gdb. 
 I'm not sure how to fix this yet.

Woops, seems this got overlooked.  It's been fixed, though: see commit
c0f03aae0469e758964faac0fb741685170c39a5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant

2011-10-24 Thread Robert Haas
On Fri, Sep 23, 2011 at 4:05 PM, Dan McGee d...@archlinux.org wrote:
 On Fri, Sep 23, 2011 at 2:49 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Sep 19, 2011 at 4:36 PM, Dan McGee d...@archlinux.org wrote:
 [ patch ]

 I suppose it's Tom who really needs to comment on this, but I'm not
 too enthusiastic about this approach.  Duplicating the Linux kernel
 calculation into our code means that we could drift if the formula
 changes again.
 Why would the formula ever change? This seems like a different excuse
 way of really saying you don't appreciate the hacky approach, which I
 can understand completely. However, it simply doesn't make sense for
 them to change this formula, as it scales the -17 to 16 old range to
 the new -1000 to 1000 range. Those endpoints won't be changing unless
 a third method is introduced, in which case this whole thing is moot
 and we'd need to fix it yet again.

 I like Tom's previous suggestion (I think) of allowing both constants
 to be defined - if they are, then we try oom_score_adj first and fall
 back to oom_adj if that fails.  For bonus points, we could have
 postmaster stat() its own oom_score_adj file before forking and set a
 global variable to indicate the results.  That way we'd only ever need
 to test once per postmaster startup (at least until someone figures
 out a way to swap out the running kernel without stopping the
 server...!).
 This would be fine, it just seems unreasonably complicated, not to
 mention unnecessary. I might as well point this out [1]. I don't think
 a soul out there has built without defining this to 0 (if they define
 it at all), and not even that many people are using it. Is it all that
 bad of an idea to just force it to 0 for both knobs and be done with
 it?
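
(For reference, the old-to-new scaling described above mirrors the kernel's
documented conversion between the two knobs; this is only a sketch, not code
from the patch:)

    /* Map a legacy oom_adj value (-17..15) onto the oom_score_adj scale
     * (-1000..1000); -17, the "disable" value, maps to -1000 exactly. */
    static int
    oom_adj_to_score_adj(int oom_adj)
    {
        return oom_adj * 1000 / 17;
    }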

Did we do anything about this?  Anyone else have an opinion on what
ought to be done?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-24 Thread Heikki Linnakangas

On 24.10.2011 15:29, Fujii Masao wrote:

+<listitem>
+ <para>
+  Copy the pg_control file from the cluster directory to the global
+  sub-directory of the backup. For example:
+ <programlisting>
+ cp $PGDATA/global/pg_control /mnt/server/backupdir/global
+ </programlisting>
+ </para>
+</listitem>


Why is this step required? The control file is overwritten by 
information from the backup_label anyway, no?



+<listitem>
+ <para>
+  Again connect to the database as a superuser, and execute
+  <function>pg_stop_backup</>. This terminates the backup mode, but does not
+  perform a switch to the next WAL segment, create a backup history file and
+  wait for all required WAL segments to be archived,
+  unlike that during normal processing.
+ </para>
+</listitem>


How do you ensure that all the required WAL segments have been archived, 
then?



+   </orderedlist>
+</para>
+
+<para>
+ You cannot use the <application>pg_basebackup</> tool to take the backup
+ from the standby.
+</para>


Why not? We have cascading replication now.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message

2011-10-24 Thread Robert Haas
On Sun, Oct 23, 2011 at 6:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Well, there is a general problem that anything which throws an ERROR
 too late in the commit path is Evil; and sync rep makes that worse to
 the extent that it adds more stuff late in the commit path, but it
 didn't invent the problem.

 BTW, it strikes me that if we want to do something about that, it ought
 to be possible; but it has to be built into error handling, not a
 localized hack for sync rep.

 Consider a design along these lines: we invent a global flag that gets
 set at some appropriate point in RecordTransactionCommit (probably right
 where we exit the commit critical section) and is not cleared until we
 send a suitable message to the client --- I think either
 command-complete or an error message would qualify, but that would have
 to be analyzed more carefully than I've done so far.  If elog.c is told
 to send an error message while this flag is set, then it does something
 special to inform the client that this was a post-commit error and the
 xact is in fact committed.

 My inclination for the something special would be to add a new error
 message field, but that could be difficult for clients to examine
 depending on what sort of driver infrastructure they're dealing with.
 You could also imagine emitting a separate NOTICE or WARNING message,
 which is analogous to the current hack in SyncRepWaitForLSN, but seems
 pretty ugly because it requires clients to re-associate that event with
 the later error message.  (But it might be worth doing anyway for human
 users, even if we provide a different flag mechanism that is intended
 for program consumption.)  Or maybe we could override the SQLSTATE with
 some special value.  Or something else.

 Given infrastructure like this, it would be reasonable for
 SyncRepWaitForLSN to just throw an ERROR if it gets an interrupt,
 instead of trying to kluge its own solution.
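
(For concreteness, a rough sketch of the flag-based scheme described above;
all names here are hypothetical, not an actual patch:)

    /* Set once the commit record is safely down; cleared when any message
     * (command-complete or error) has been sent to the client. */
    static bool post_commit_pending = false;

    void
    MarkCommitDurable(void)
    {
        /* called from RecordTransactionCommit(), on leaving the critical section */
        post_commit_pending = true;
    }

    bool
    ErrorIsPostCommit(void)
    {
        /* elog.c could consult this to flag an outgoing error as post-commit */
        return post_commit_pending;
    }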

I actually think that emitting a NOTICE or WARNING and then slamming
the connection shut is quite elegant, as it seems to me that any
client that is paranoid enough to care about sync rep had better
already be handling the case of a connection loss during commit.  I
realize that handling query cancellation in a somewhat different way
is a wart, though, and I'm not necessarily opposed to changing the
behavior.

But I think that throwing an ERROR is likely to cause a LOT of client
breakage, even if you have some special (human-invisible?) flag that
indicates that you don't really mean it.  If we must do something
other than simulating a server disconnect, letting the command
completion message go through and annotating it with a NOTICE or
WARNING seems preferable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-24 Thread Heikki Linnakangas

On 24.10.2011 15:29, Fujii Masao wrote:

In your patch, FPW is always WAL-logged at startup even when FPW has
not been changed since last shutdown. I don't think that's required.
I changed the recovery code so that it keeps track of last FPW indicated
by WAL record. Then, at end of startup, if that FPW is equal to FPW
specified in postgresql.conf (which means that FPW has not been changed
since last shutdown or crash), WAL-logging of FPW is skipped. This change
prevents unnecessary WAL-logging. Thought?
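
(A sketch of the skip being described, with hypothetical names; the real patch
may structure this differently:)

    /* At end of recovery: only WAL-log the full_page_writes setting if it
     * differs from the last value recovery saw in the WAL. */
    static void
    MaybeLogFullPageWrites(bool configuredFpw, bool lastFpwSeenInWAL)
    {
        if (configuredFpw != lastFpwSeenInWAL)
            LogFullPageWritesChange(configuredFpw);     /* hypothetical */
    }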


One problem with this whole FPW-tracking is that pg_lesslog makes it 
fail. I'm not sure what we need to do about that - maybe just add a 
warning to the docs. But it leaves a bit of a bad feeling in my mouth. 
Usually we try to make features work orthogonally, without dependencies 
to other settings. Now this feature requires that full_page_writes is 
turned on in the master, and also that you don't use pg_lesslog to 
compress the WAL segments or your base backup might be corrupt. The 
procedure to take a backup from the standby seems more complicated than 
taking it on the master - there are more steps to follow.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.1] unusable for large views

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 4:57 AM, Omar Bettin
o.bet...@informaticaindustriale.it wrote:
 I have tried 9.1.1 win64 version and when I am trying to declare a cursor
 for a very large view (lot of joins and aggregate functions),

 postgres is using around 3GB of memory and the query never returns.

Hmm.  A 59-table join is pretty enormous.

I wish we had a better way to handle these kinds of queries.  Odds are
good that the join order doesn't matter much, and in an ideal world we
would be able to notice that and just use some simple heuristic to
pick a tolerably good one.  As it is, I am a bit surprised to hear
that GEQO isn't bailing you out.

Can you EXPLAIN a query against that view, or does even that wipe out?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


R: [HACKERS] [9.1] unusable for large views (SOLVED)

2011-10-24 Thread Omar Bettin
...sorry guys...

It was a misconfiguration of the database.

9.1.1 is working well.

It is ~4% faster than 9.0.5 for the same query.

Thanks a lot.

Regards




-----Original Message-----
From: Pavel Stehule [mailto:pavel.steh...@gmail.com]
Sent: Monday, 24 October 2011 12:13
To: Omar Bettin
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [9.1] unusable for large views

Hello

please, send a result of explain analyze on 9.1.1 and older

please, use http://explain.depesz.com/

Regards

Pavel Stehule

2011/10/24 Omar Bettin o.bet...@informaticaindustriale.it:
 Hello,



 I have tried 9.1.1 win64 version and when I am trying to declare a cursor
 for a very large view (lot of joins and aggregate functions),

 postgres is using around 3GB of memory and the query never returns.



 Same problem selecting from the view without a cursor.



 Same query worked fine from 8.3.3 to 9.0.5.



 Should I change some configuration params to have the same behavior as
 previous versions?







 Tried on Win2008 server R2  64bit 8GB RAM.

 also on Win7 64bit 8GB RAM.

 default postgresql.conf



 Regards,





 The view (!)



 CREATE OR REPLACE VIEW articoli_rip_ven20110227_view AS

  SELECT a.pk_id, a.articolo, a.articolo_f, a.origine, ( SELECT
 ditte.attivita

FROM ditte

   WHERE ditte.partitario = a.cod_fornit) AS dummy_1, a.descrizion,
 a.taglia, a.peso_spec, a.giacenza_colli, a.giacenza, rim.colli AS
 rimanenza_colli, rim.quantita AS rimanenza_qta, rimass.colli AS
 rimanenza_ass_colli, rimass.quantita AS rimanenza_ass_qta,
 COALESCE(b20.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b21.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b22.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b23.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b24.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b25.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b26.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b27.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b28.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b29.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b30.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b31.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b32.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b33.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b34.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b35.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b36.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b37.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b38.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b39.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b40.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b41.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b42.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b43.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b44.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b45.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b46.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b47.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b48.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b49.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b50.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b51.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b52.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b53.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b54.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b55.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b56.quantita, NULL::numeric, 0::numeric)::double precision +
 (COALESCE(rim.quantita, NULL::numeric, 0::numeric) +
 COALESCE(rimass.quantita, NULL::numeric, 0::numeric))::double precision -
 (COALESCE(b1.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b2.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b3.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b4.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b5.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b6.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b7.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b8.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b9.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b10.quantita, NULL::numeric, 0::numeric)::double 

Re: [HACKERS] autovacuum and orphaned large objects

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 10:25 AM, Euler Taveira de Oliveira
eu...@timbira.com wrote:
 On 24-10-2011 10:57, Robert Haas wrote:

 I think the main reason why vacuumlo is a contrib module rather than
 in core is that it is just a heuristic, and it might not be what
 everyone wants to do.  You could store a bunch of large objects in the
 database and use the returned OIDs to generate links that you email to
 users, and then when the user clicks on the link we retrieve the
 corresponding LO and send it to the user over HTTP.  In that design,
 there are no tables in the database at all, yet the large objects
 aren't orphaned.

 Wow, what a strange method to solve a problem and possibly bloat your
 database. No, I'm not suggesting that we forbid it. The proposed method
 could clean up orphaned LOs in 95% (if not 99%) of the use cases.

 I've never heard of someone using LOs the way you describe. It seems strange
 that someone would distribute an OID number but not store a reference to it
 in the same database. Yes, it is a possibility, but ...

I guess we could make it an optional behavior, but once you go that
far then you have to wonder whether what's really needed here is a
general-purpose task scheduler.  I mean, the autovacuum launcher's
idea about how often to vacuum the database won't necessarily match
the user's idea of how often they want to vacuum away large objects -
and if the user is doing something funky (like storing arrays of large
object OIDs, or inexplicably storing them using numeric or int8) then
putting it in the backend removes a considerable amount of
flexibility.  Another case where vacuumlo will fall over is if you
have a very, very large table with an OID column, but with lots of
duplicate values so that the number of OIDs actually referenced is
much smaller.  You might end up doing a table scan on the large table
every time this logic kicks in, and that might suck.

I'm sort of unexcited about the idea of doing a lot of engineering
around this; it seems to me that the only reasons we still have a
separate large object facility rather than just letting everyone go
through regular tables with toastable columns are (1) the size limit
is 2GB rather than 1GB and (2) you can read and write parts of objects
rather than the whole thing.  If we're going to do some more
engineering here, I'd rather set our sights a little higher.
Complaints I often hear about the large object machinery include (1)
2GB is still not enough, (2) 4 billion large objects is not enough,
(3) the performance is inadequate, particularly with large numbers of
large objects from possibly-unrelated subsystems slammed into a single
table, and (4) it would be nice to be able to do partial reads and writes
on any toastable field, not just large objects.  I'm not saying that
the problem you're complaining about isn't worth fixing in the
abstract, and if it seemed like a nice, clean fix I'd be all in favor,
but I just don't think it's going to be very simple, and for the
amount of work involved I'd rather get a little bit more bang for the
buck.

Of course, you don't have to agree with me on any of this; I'm just
giving you my take on it.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


R: [HACKERS] [9.1] unusable for large views (SOLVED)

2011-10-24 Thread Omar Bettin
Hi Tom,

...for about two hours I have been trying to say that the problem has been
solved, but I do not see my messages on the mailing list...

Anyway,
the problem was a bad installation of the database (pgsql functions).

9.1.1 is working well.

It is 4% to 8% faster than 9.0.5.

Thanks a lot to everyone.

Regards,

Omar


-----Original Message-----
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
Sent: Monday, 24 October 2011 16:46
To: Omar Bettin
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [9.1] unusable for large views

Omar Bettin o.bet...@informaticaindustriale.it writes:
 I have tried 9.1.1 win64 version and when I am trying to declare a cursor
 for a very large view (lot of joins and aggregate functions),

 postgres is using around 3GB of memory and the query never returns.

Could we see a self-contained test case?  I'm not about to try to
reverse-engineer the schema that goes with such a monster query.

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 11:33 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 24.10.2011 15:29, Fujii Masao wrote:

 In your patch, FPW is always WAL-logged at startup even when FPW has
 not been changed since last shutdown. I don't think that's required.
 I changed the recovery code so that it keeps track of last FPW indicated
 by WAL record. Then, at end of startup, if that FPW is equal to FPW
 specified in postgresql.conf (which means that FPW has not been changed
 since last shutdown or crash), WAL-logging of FPW is skipped. This change
 prevents unnecessary WAL-logging. Thought?

 One problem with this whole FPW-tracking is that pg_lesslog makes it fail.
 I'm not sure what we need to do about that - maybe just add a warning to the
 docs. But it leaves a bit of a bad feeling in my mouth. Usually we try to make
 features work orthogonally, without dependencies to other settings. Now this
 feature requires that full_page_writes is turned on in the master, and also
 that you don't use pg_lesslog to compress the WAL segments or your base
 backup might be corrupt. The procedure to take a backup from the standby
 seems more complicated than taking it on the master - there are more steps
 to follow.

Doing it on the master isn't as easy as I'd like it to be, either.

But it's not really clear how to make it simpler.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.1] unusable for large views

2011-10-24 Thread Tom Lane
Omar Bettin o.bet...@informaticaindustriale.it writes:
 I have tried 9.1.1 win64 version and when I am trying to declare a cursor
 for a very large view (lot of joins and aggregate functions),

 postgres is using around 3GB of memory and the query never returns.

Could we see a self-contained test case?  I'm not about to try to
reverse-engineer the schema that goes with such a monster query.

http://wiki.postgresql.org/wiki/Guide_to_reporting_problems

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum and orphaned large objects

2011-10-24 Thread Tom Lane
Euler Taveira de Oliveira eu...@timbira.com writes:
 The main point of autovacuum is maintenance tasks. Currently, it executes 
 VACUUM and ANALYZE commands. I want to propose that we incorporate vacuumlo 
 functionality into it.

I'm not terribly thrilled with that because (a) large objects seem like
mostly a legacy feature from here, and (b) it's hard to see how to
implement it without imposing overhead on everybody, whether they use
LOs or not.  This is especially problematic if you're proposing that
cleanup triggers not be required.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message

2011-10-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Oct 23, 2011 at 6:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, it strikes me that if we want to do something about that, it ought
 to be possible; but it has to be built into error handling, not a
 localized hack for sync rep.

 I actually think that emitting a NOTICE or WARNING and then slamming
 the connection shut is quite elegant,

No, it's a horrid crock, whose only saving grace is that it was
implementable with two or three lines localized to SyncRepWaitForLSN
... or at least, we thought that until Fujii-san started pointing out
the bugs in it.  It's not convenient for clients at all, it does not fit
well into the backend structure (which is the reason for the bugs), and
it forces session termination unnecessarily, or at least it would if
we'd been consistent and applied the method to query-cancel as well.

 as it seems to me that any
 client that is paranoid enough to care about sync rep had better
 already be handling the case of a connection loss during commit.

Agreed, but that is a problem that by definition we can't help with.
Also, the issue with connection loss is that you really can't know
whether your transaction got committed without reconnecting and looking
for evidence.  There is no reason at all to inject such uncertainty
into the cancel-SyncRepWaitForLSN case.  We know the transaction got
committed, and there's no reason to make the client guess about that,
nor to make it parse WARNING messages for which we didn't even get the
assignment of a unique SQLSTATE right (thus making the problem
insoluble anyhow).

 But I think that throwing an ERROR is likely to cause a LOT of client
 breakage, even if you have some special (human-invisible?) flag that
 indicates that you don't really mean it.  If we must do something
 other than simulating a server disconnect, letting the command
 completion message go through and annotating it with a NOTICE or
 WARNING seems preferable.

I think you're thinking narrowly of the SyncRepWaitForLSN case.  What
I'm trying to point out is that there's a boatload of post-commit code
which is capable of sometimes throwing errors, and that's not ever
going to go away completely.

It might be that it'd work to deal with this by reducing the reported
strength of all such cases from ERROR to WARNING.  Not sure that that's
a good idea, but it might work.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updated version of pg_receivexlog

2011-10-24 Thread Magnus Hagander
On Mon, Oct 24, 2011 at 13:46, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 +               /*
 +                * Looks like an xlog file. Parse it's position.

 s/it's/its/

 +                */
 +               if (sscanf(dirent->d_name, "%08X%08X%08X", &tli, &log,
 &seg) != 3)
 +               {
 +                       fprintf(stderr, _("%s: could not parse xlog
 filename \"%s\"\n"),
 +                                       progname, dirent->d_name);
 +                       disconnect_and_exit(1);
 +               }
 +               log *= XLOG_SEG_SIZE;

 That multiplication by XLOG_SEG_SIZE could overflow, if logid is very high.
 It seems completely unnecessary, anyway,

How do you mean completely unnecessary? We'd have to change the points
that use it to divide by XLOG_SEG_SIZE otherwise, no? That might be a
way to get around the overflow, but I'm not sure that's what you mean?
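
(One overflow-free shape that could take, sketched here with hypothetical
variable names: keep the log and segment numbers separate and compare them as
a pair, instead of scaling log by XLOG_SEG_SIZE.)

    uint32  high_log = 0;
    uint32  high_seg = 0;

    /* ... for each file name parsed into (log, seg) ... */
    if (log > high_log || (log == high_log && seg > high_seg))
    {
        high_log = log;
        high_seg = seg;
    }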


 s/IDENFITY_SYSTEM/IDENTIFY_SYSTEM/ (two occurrences)

Oops.


 In pg_basebackup, it would be a good sanity check to check that the systemid
 returned by IDENTIFY_SYSTEM in the main connection and the WAL-streaming
 connection match. Just to be sure that some connection pooler didn't hijack
 one of the connections and point to a different server. And better check
 timelineid too while you're at it.

That's a good idea. Will fix.


 How does this interact with synchronous replication? If a base backup that
 streams WAL is in progress, and you have synchronous_standby_names set to
 '*', I believe the in-progress backup will count as a standby for that
 purpose. That might give a false sense of security.

Ah yes. Did not think of that. Yes, it will have this problem.


 synchronous_standby_names='*' is prone to such confusion in general, but it
 seems that it's particularly surprising if a running pg_basebackup lets a
 commit in synchronous replication to proceed. Maybe we just need a warning
 in the docs. I think we should advise that synchronous_standby_names='*' is
 dangerous in general, and cite this as one reason for that.

Hmm. I think this is common enough that we want to make sure we avoid
it in code.

Could we pass a parameter from the client indicating to the master
that it refuses to be a sync slave? An optional keyword to the
START_REPLICATION command, perhaps?


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Robert Haas
On Sun, Oct 23, 2011 at 7:01 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Oct 21, 2011 at 12:52 PM, Robert Haas robertmh...@gmail.com wrote:

 Also, this line is kind of expensive:

        if (!visibilitymap_test(scandesc->heapRelation,
                                ItemPointerGetBlockNumber(tid),
                                &node->ioss_VMBuffer))

 Around 2%.  But I don't see any way to avoid that, or even make it cheaper.

 Could we cache by ItemPointerGetBlockNumber(tid) the results of those
 tests, for groups of tids on the same index page?

 How useful this would be would depend on how well-clustered the table
 and index are.

I thought about that, but the existing code is so ridiculously cheap
that it's hard to believe a caching layer would save enough to pay for
itself.  I mean, I presume that the cost attributed to that line has
to be associated with either (a) one of the pointer deferences, (b)
the expense of evaluating ItemPointerGetBlockNumber(), (c) setting up
the function call, or perhaps (d) overhead incurred as a result of
branch mis-prediction.  The actual time spent *inside*
visibilitymap_test will be attributed to that function, not this one.

If you add up the time for this line and visibilitymap_test(), it's
like 10% of the runtime, which seems pretty significant.  But it's 10%
of the runtime that is spent on basically a handful of arithmetic
operations and then reading a byte from shared memory.  It's
astonishing to find that so expensive on a test with just one backend
running.  If you stick some kind of cache in there, it's going to
involve adding a branch that isn't there now, and I think that's going
to be pricey given how hot this code apparently is.

Also, I'm not sure it's safe.  Suppose that we lock the index page,
return a tuple, check the visibility map, and find the page all
visible.  Another transaction comes along, adds a tuple to the index
page, and clears the visibility map bit.  We then go back, relock the
index page, and return another tuple.  We'd better notice that the
visibility map bit has now been cleared, or we're in trouble.

I wonder if it might help to create some method for the index to
return all the matching keys on a single index page in one call.   If
you're dealing with an MVCC snapshot, any new tuples added after the
start of the scan can't be visible anyway.  That would save the
overhead of unlocking and relocking the buffer once per tuple, and
probably some overhead associated with validating and decoding the
possibly-changed page contents each time.  If you did it that way, it
would also be safe to do what you're proposing - if a bunch of the
tuples on the index page are also on the same heap page, you could do
one visibility map check for all of them.
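
If one did go that route, the per-heap-block cache could be as simple as this
(a sketch; the Last* fields are hypothetical additions to the scan state):

    BlockNumber blkno = ItemPointerGetBlockNumber(tid);

    if (blkno != node->ioss_LastVMBlock)        /* hypothetical cache fields */
    {
        node->ioss_LastVMResult =
            visibilitymap_test(scandesc->heapRelation, blkno,
                               &node->ioss_VMBuffer);
        node->ioss_LastVMBlock = blkno;
    }

    if (!node->ioss_LastVMResult)
    {
        /* not known all-visible: fetch the heap tuple to check visibility */
    }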

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updated version of pg_receivexlog

2011-10-24 Thread Jaime Casanova
On Mon, Oct 24, 2011 at 7:40 AM, Magnus Hagander mag...@hagander.net wrote:

 synchronous_standby_names='*' is prone to such confusion in general, but it
 seems that it's particularly surprising if a running pg_basebackup lets a
 commit in synchronous replication to proceed. Maybe we just need a warning
 in the docs. I think we should advise that synchronous_standby_names='*' is
 dangerous in general, and cite this as one reason for that.

 Hmm. I think this is common enough that we want to make sure we avoid
 it in code.

 Could we pass a parameter from the client indicating to the master
 that it refuses to be a sync slave? An optional keyword to the
 START_REPLICATION command, perhaps?


can't you execute set synchronous_commit to off/local for this connection?

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Updated version of pg_receivexlog

2011-10-24 Thread Heikki Linnakangas

+   /*
+* Looks like an xlog file. Parse it's position.


s/it's/its/


+*/
+   if (sscanf(dirent->d_name, "%08X%08X%08X", &tli, &log, &seg) != 3)
+   {
+   fprintf(stderr, _("%s: could not parse xlog filename \"%s\"\n"),
+   progname, dirent->d_name);
+   disconnect_and_exit(1);
+   }
+   log *= XLOG_SEG_SIZE;


That multiplication by XLOG_SEG_SIZE could overflow, if logid is very 
high. It seems completely unnecessary, anyway,


s/IDENFITY_SYSTEM/IDENTIFY_SYSTEM/ (two occurrences)

In pg_basebackup, it would be a good sanity check to check that the 
systemid returned by IDENTIFY_SYSTEM in the main connection and the 
WAL-streaming connection match. Just to be sure that some connection 
pooler didn't hijack one of the connections and point to a different 
server. And better check timelineid too while you're at it.


How does this interact with synchronous replication? If a base backup 
that streams WAL is in progress, and you have synchronous_standby_names 
set to '*', I believe the in-progress backup will count as a standby for 
that purpose. That might give a false sense of security. 
synchronous_standby_names='*' is prone to such confusion in general, but 
it seems that it's particularly surprising if a running pg_basebackup 
lets a commit in synchronous replication to proceed. Maybe we just need 
a warning in the docs. I think we should advise that 
synchronous_standby_names='*' is dangerous in general, and cite this as 
one reason for that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] EXECUTE tab completion

2011-10-24 Thread Magnus Hagander
On Mon, Oct 24, 2011 at 01:26, Tom Lane t...@sss.pgh.pa.us wrote:
 Andreas Karlsson andr...@proxel.se writes:
 Thanks for cleaning up the code to some sanity, I should have done so
 myself when I noticed the problem.

 A new version is attached.

 Committed with minor adjustments --- I didn't see any need to make this
 wait for the next commitfest.

Thanks - I was planning to pick that one up along with my TABLE patch,
but was too busy with pgconf.eu over the past couple of weeks..

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] loss of transactions in streaming replication

2011-10-24 Thread Fujii Masao
On Fri, Oct 21, 2011 at 12:01 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Oct 20, 2011 at 9:51 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Oct 20, 2011 at 1:05 AM, Robert Haas robertmh...@gmail.com wrote:
 OK, so this is an artifact of the changes to make libpq communication
 bidirectional.  But I'm still confused about where the error is coming
 from.  In your OP, you wrote In 9.2dev and 9.1, when walreceiver
 detects an error while sending data to WAL stream, it always emits
 ERROR even if there are data available in the receive buffer.  So
 that implied to me that this is only going to trigger if you have a
 shutdown together with an awkwardly-timed error.  But your scenario
 for reproducing this problem doesn't seem to involve an error.

 Yes, my scenario doesn't cause any real error. My original description was
 misleading. The following would be closer to the truth:

     In 9.2dev and 9.1, when walreceiver detects the termination of the
     replication connection while sending data to the WAL stream, it always
     emits ERROR even if there are data available in the receive buffer.

 Ah, OK.  I think I now agree that this is a bug and that we should fix
 and back-patch.

Is the patch that I posted before well-formed enough to be adopted?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum and orphaned large objects

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 12:56 AM, Euler Taveira de Oliveira
eu...@timbira.com wrote:
 The main point of autovacuum is maintenance tasks. Currently, it executes
 VACUUM and ANALYZE commands. I want to propose that we incorporate vacuumlo
 functionality into it. While dealing with large objects (LO), we have the lo
 contrib module, which helps with LO maintenance but has some limitations (it
 does not handle the DROP TABLE and TRUNCATE cases), and vacuumlo, which does
 an excellent job but has to be executed outside the DBMS. The proposal is to
 clean up LOs when autovacuum triggers VACUUM; the LO cleanup routine would
 start after(?) the VACUUM command.

 In the near future I want to propose that orphaned LOs be cleaned up by VACUUM,
 but that's a story for another thread...

 Comments?

I think the main reason why vacuumlo is a contrib module rather than
in core is that it is just a heuristic, and it might not be what
everyone wants to do.  You could store a bunch of large objects in the
database and use the returned OIDs to generate links that you email to
users, and then when the user clicks on the link we retrieve the
corresponding LO and send it to the user over HTTP.  In that design,
there are no tables in the database at all, yet the large objects
aren't orphaned.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.1] unusable for large views

2011-10-24 Thread Jan Urbański
On 24/10/11 10:57, Omar Bettin wrote:
 
 [monster query]

I see that your problem is already solved, but incidentally I'm working
on a join order planning module and I'm looking for real-life examples
of humongous queries like that to benchmark against.

Any chance you could share the schema, or at least part of it, that goes
with this query? Or perhaps you have more of these queries?

Cheers,
Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unreproducible bug in snapshot import code

2011-10-24 Thread Gurjeet Singh
Hi All,

I have not been able to reproduce this error, but wanted to report this
in case it might be useful.

Commit -id: 0f39d5050dc0dce99258381f33f1832c437aff85

Configure options:
--prefix=/mnt/storage/gurjeet/dev/builds//master/db --enable-debug
--enable-cassert CFLAGS=-O0 --enable-depend --enable-thread-safety
--with-openssl

Used VPATH to build postgres.

Attached files: terminal1.txt termonal2.txt config.log

Starting from line 89 of terminal1.txt we see this snippet:

postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02C8-1';
ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
postgres=# rollback;
ROLLBACK


As you can see the SET TRANSACTION SNAPSHOT was the first statement in
that transaction, and yet the ERROR message says that it is not. That
snapshot id was generated in another session (terminal2.txt), and was
generated outside any transaction; that's the only peculiar thing I can say
about that snapshot-id.

As is evident from the rest of the lines in terminal1.txt, I have not
been able to reproduce this error again; I tried even with a clean build of
the sources.

I have taken a look at the code and everything tells me that the error
should have been:

ERROR:  a snapshot-importing transaction must have isolation level
SERIALIZABLE or REPEATABLE READ

Maybe it was not a clean build the first time, and the code may have
wrongly linked with previously compiled .o files.

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
psql (9.2devel)
Type "help" for help.

postgres=# begin transaction;
BEGIN
postgres=# set iso

postgres=# set trans
transaction transaction_deferrable  transaction_isolation   
transaction_read_only   transform_null_equals   
postgres=# set transaction isolation level read only;
ERROR:  syntax error at or near only
LINE 1: set transaction isolation level read only;
 ^
postgres=# set transaction isolation level readonly;
ERROR:  syntax error at or near readonly
LINE 1: set transaction isolation level readonly;
^
postgres=# set transaction 
ISOLATION LEVEL  READ 
postgres=# set transaction ISOLATION LEVEL 
READ  REPEATABLESERIALIZABLE  
postgres=# set transaction ISOLATION LEVEL READ 
COMMITTEDUNCOMMITTED  
postgres=# set transaction ISOLATION LEVEL READ committed;
ERROR:  current transaction is aborted, commands ignored until end of 
transaction block
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL READ committed;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# select 1;
 ?column? 
--
1
(1 row)

postgres=# set transaction ISOLATION LEVEL READ committed;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL serializable;;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# select 1;
 ?column? 
--
1
(1 row)

postgres=# set transaction ISOLATION LEVEL serializable;;
ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
postgres=# rollback;
ROLLBACK
(reverse-i-search)`be': select count(*) from clustermem^Crs ;
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# rollback;
ROLLBACK
postgres=# begin transaction;
BEGIN
postgres=# select 1;
 ?column? 
--
1
(1 row)

postgres=# set transaction ISOLATION LEVEL repeatable read;
ERROR:  SET TRANSACTION ISOLATION LEVEL must be called before any query
postgres=# set transaction ISOLATION LEVEL ;
READ  REPEATABLESERIALIZABLE  
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02C8-1';
ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02C9-1';
ERROR:  a snapshot-importing transaction must have isolation level SERIALIZABLE 
or REPEATABLE READ
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# set transaction snapshot '02C9-1';
SET
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction snapshot '02CA-1';
ERROR:  a snapshot-importing transaction must have isolation level SERIALIZABLE 
or REPEATABLE READ
postgres=# rollack;
ERROR:  syntax error at or near rollack
LINE 1: rollack;
^
postgres=# rollback;
ROLLBACK
postgres=# begin TRANSACTION ;
BEGIN
postgres=# set transaction ISOLATION LEVEL repeatable read;
SET
postgres=# 

R: [HACKERS] [9.1] unusable for large views (SOLVED)

2011-10-24 Thread Omar Bettin
...sorry guys...

It was a bad configuration of the database.

9.1.1 is working well.

It is 4% to 8% faster than 9.0.5.

Thanks a lot.

Regards

Omar

P.s.
attached EXPLAIN


Hmm.  A 59-table join is pretty enormous

and it is not the biggest one; basically it is the delivery notes for one day,
viewed vertically.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TABLE tab completion

2011-10-24 Thread Magnus Hagander
On Mon, Sep 26, 2011 at 10:37, Magnus Hagander mag...@hagander.net wrote:
 On Sun, Sep 25, 2011 at 15:06, Dean Rasheed dean.a.rash...@gmail.com wrote:
 On 24 September 2011 11:59, Magnus Hagander mag...@hagander.net wrote:
 TABLE tab completion in psql only completes to tables, not views. but
 the TABLE command works fine for both tables and views (and also
 sequences).

 Seems we should just complete it to relations and not tables - or can
 anyone see a particular reason why we shouldn't?


 Doesn't that mean that DROP TABLE tab would offer up views as well
 as tables, which would be incorrect?

 Meh - you are correct, of course. I guess that's why we have code review :-)

 So - not a oneliner, but how about something like this?

 (Happy to have someone point out a neater way of doing it, not
 entirely fluent in how we do the tab completion..)

Rebased on top of the changes Tom made to the infrastructure, and applied.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.1] unusable for large views (SOLVED)

2011-10-24 Thread Pavel Stehule
2011/10/24 Omar Bettin o.bet...@informaticaindustriale.it:
 ...sorry guys...

 It was a bad configuration of the database.

 9.1.1 is working well.

 It is 4% to 8% faster than 9.0.5.

 Thanks a lot.

 Regards

 Omar

 P.s.
 attached EXPLAIN


attachment is missing

Pavel



Hmm.  A 59-table join is pretty enormous

 and it is not the biggest one; basically it is the delivery notes for one day,
 viewed vertically.


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.1] unusable for large views

2011-10-24 Thread Pavel Stehule
Hello

please, send a result of explain analyze on 9.1.1 and older

please, use http://explain.depesz.com/
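
For example (the view name is taken from the earlier mail; on releases older
than 9.0, plain EXPLAIN ANALYZE without the parenthesized options does the job):

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM articoli_rip_ven20110227_view;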

Regards

Pavel Stehule

2011/10/24 Omar Bettin o.bet...@informaticaindustriale.it:
 Hello,



 I have tried the 9.1.1 win64 version, and when I try to declare a cursor
 for a very large view (lots of joins and aggregate functions),

 postgres uses around 3 GB of memory and the query never returns.



 Same problem when selecting from the view without a cursor.



 Same query worked fine from 8.3.3 to 9.0.5.



 Should I change some configuration params to have the same behavior as
 previous versions?







 Tried on Win2008 server R2  64bit 8GB RAM.

 also on Win7 64bit 8GB RAM.

 default postgresql.conf



 Regards,





 The view (!)



 CREATE OR REPLACE VIEW articoli_rip_ven20110227_view AS

  SELECT a.pk_id, a.articolo, a.articolo_f, a.origine, ( SELECT
 ditte.attivita

    FROM ditte

   WHERE ditte.partitario = a.cod_fornit) AS dummy_1, a.descrizion,
 a.taglia, a.peso_spec, a.giacenza_colli, a.giacenza, rim.colli AS
 rimanenza_colli, rim.quantita AS rimanenza_qta, rimass.colli AS
 rimanenza_ass_colli, rimass.quantita AS rimanenza_ass_qta,
 COALESCE(b20.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b21.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b22.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b23.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b24.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b25.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b26.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b27.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b28.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b29.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b30.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b31.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b32.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b33.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b34.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b35.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b36.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b37.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b38.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b39.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b40.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b41.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b42.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b43.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b44.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b45.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b46.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b47.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b48.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b49.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b50.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b51.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b52.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b53.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b54.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b55.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b56.quantita, NULL::numeric, 0::numeric)::double precision +
 (COALESCE(rim.quantita, NULL::numeric, 0::numeric) +
 COALESCE(rimass.quantita, NULL::numeric, 0::numeric))::double precision -
 (COALESCE(b1.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b2.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b3.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b4.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b5.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b6.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b7.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b8.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b9.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b10.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b11.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b12.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b13.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b14.quantita, NULL::numeric, 0::numeric)::double precision +
 COALESCE(b15.quantita, NULL::numeric, 0::numeric)::double precision +
 

Re: [HACKERS] Unreproducible bug in snapshot import code

2011-10-24 Thread Tom Lane
Gurjeet Singh singh.gurj...@gmail.com writes:
 Starting from line 89 of terminal1.txt we see this snippet:

 postgres=# rollback;
 ROLLBACK
 postgres=# begin TRANSACTION ;
 BEGIN
 postgres=# set transaction snapshot '02C8-1';
 ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
 postgres=# rollback;
 ROLLBACK

 As is evident from the rest of the lines in terminal1.txt, I have not
 been able to reproduce this error again; I tried even with a clean build of
 the sources.

Given the weird capitalization of the BEGIN command, I'm guessing that
you used tab-completion to enter it.  I wonder if this could have been
affected by the execution of some query in support of tab completion?
I couldn't reproduce it either on the basis of that guess, though.
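
For reference, the intended usage pattern of the snapshot-import feature being
exercised here looks roughly like this (the snapshot identifier below is
illustrative, not a real one):

-- session 1: publish a snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();      -- returns an identifier such as '000003A1-1'

-- session 2: adopt it; this must be the transaction's first statement
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '000003A1-1';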

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types - typo + NULL string constructor

2011-10-24 Thread Heikki Linnakangas

On 17.10.2011 01:09, Jeff Davis wrote:

On Sat, 2011-10-15 at 01:46 +0300, Heikki Linnakangas wrote:

* The binary i/o format includes the length of the lower and upper
bounds twice, once explicitly in range_send, and second time within the
send-function of the subtype. Seems wasteful.


Any ideas how to fix that? How else do I know how much the underlying
send function will consume?


Oh, never mind. I was misreading the code, it's not sending the length 
twice.



* range_constructor_internal - I think it would be better to move logic
to figure out the the arguments into the callers.


Done.


The comment above range_constructor0() is now outdated.


* The gist support functions frequently call range_deserialize(), which
does catalog lookups. Isn't that horrendously expensive?


Yes, it was. I have introduced a cached structure that avoids syscache
lookups when it's the same range as the last lookup (the common case).


Hmm, I don't think that's safe. After Oid wraparound, a range type oid 
might get reused for some other range type, and the cache would return 
stale values. Extremely unlikely to happen by accident, but could be 
exploited by an attacker.



* What exactly is canonical function supposed to return? It's not clear
what format one should choose as the canonical format when writing a
custom range type. And even for the built-in types, it would be good to
explain which types use which canonical format (I saw the discussions on
that, so I understand that might still be subject to change).


The canonical function is just supposed to return a new range such that
two equal values will have equal representations. I have listed the
built-in discrete range types and their canonical form.

As far as describing what a custom range type is supposed to use for the
canonical form, I don't know which part is exactly unclear. There aren't
too many rules to defining one -- the only guideline is that ranges of
equal value going in should be put in one canonical representation.


Ok. The name canonical certainly hints at that, but it would be good 
to explicitly state that guideline. As the text stands, it would seem 
that a canonical function that maps [1,7] to [1,8), and also vice 
versa, [1,8) to [1,7], would be valid. That would be pretty silly, 
but it would be good to say something like "The canonical output for two 
values that are equal, like [1,7] and [1,8), must be equal. It doesn't 
matter which representation you choose to be the canonical one, as long 
as two equal values with different formattings are always mapped to the 
same value with same formatting".
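
For instance, with the patch's built-in int4range, whose canonical form uses
'[)' bounds, the two spellings below denote the same value and print
identically (a quick illustration):

SELECT int4range(1, 7, '[]');                          -- displays as [1,8)
SELECT int4range(1, 7, '[]') = int4range(1, 8, '[)');  -- true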


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unreproducible bug in snapshot import code

2011-10-24 Thread Alvaro Herrera

Excerpts from Gurjeet Singh's message of lun oct 24 13:55:44 -0300 2011:

 Starting from line 89 of terminal1.txt we see this snippet:
 
 postgres=# rollback;
 ROLLBACK
 postgres=# begin TRANSACTION ;
 BEGIN
 postgres=# set transaction snapshot '02C8-1';
 ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
 postgres=# rollback;
 ROLLBACK
 
 
 As you can see the SET TRANSACTION SNAPSHOT was the first statement in
 that transaction, and yet the ERROR message says that it is not.

Maybe the tab-completion feature issued a query before the set
transaction command.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum and orphaned large objects

2011-10-24 Thread Euler Taveira de Oliveira

On 24-10-2011 10:57, Robert Haas wrote:

I think the main reason why vacuumlo is a contrib module rather than
in core is that it is just a heuristic, and it might not be what
everyone wants to do.  You could store a bunch of large objects in the
database and use the returned OIDs to generate links that you email to
users, and then when the user clicks on the link we retrieve the
corresponding LO and send it to the user over HTTP.  In that design,
there are no tables in the database at all, yet the large objects
aren't orphaned.

Wow, what a strange method to solve a problem and possibly bloat your 
database. No, I'm not suggesting that we forbid it. The proposed method could 
clean up orphaned LOs in 95% (if not 99%) of the use cases.


I've never heard of anyone using LOs the way you describe. It seems strange that 
someone would hand out an OID number but not store a reference to it in 
the same database. Yes, it is a possibility, but ...



--
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Monster query

2011-10-24 Thread Omar Bettin
Hi,

Since the data belongs to third parties, I will prepare a database suitable for the 
purpose.
In any case, the compressed backup will be around 20 MB.

Regards,

Omar Bettin


-----Original Message-----
From: Jan Urbański [mailto:wulc...@wulczer.org] 
Sent: Monday, October 24, 2011 18:56
To: Omar Bettin
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [9.1] unusable for large views

On 24/10/11 10:57, Omar Bettin wrote:
 
 [monster query]

I see that your problem is already solved, but incidentally I'm working on a 
join order planning module and I'm looking for real-life examples of humongous 
queries like that to benchmark against them.

Any chance you could share the schema, or at least part of it, that goes with 
this query? Or perhaps you have more of these queries?

Cheers,
Jan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant

2011-10-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 [ oom_score_adj business ]

 Did we do anything about this?  Anyone else have an opinion on what
 ought to be done?

I held off doing anything because it didn't seem like we had consensus.
OTOH, it may well be that it's not important enough to demand real
consensus, and he who does the work gets to make the choices.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autovacuum and orphaned large objects

2011-10-24 Thread Euler Taveira de Oliveira

On 24-10-2011 11:36, Tom Lane wrote:

Euler Taveira de Oliveira eu...@timbira.com  writes:

The main point of autovacuum is maintenance tasks. Currently, it executes
VACUUM and ANALYZE commands. I want to propose that we incorporate vacuumlo
functionality into it.


I'm not terribly thrilled with that because (a) large objects seem like
mostly a legacy feature from here, and


Right, but there isn't a solution for > 1 GB column data besides LO.


(b) it's hard to see how to
implement it without imposing overhead on everybody, whether they use
LOs or not.  This is especially problematic if you're proposing that
cleanup triggers not be required.

I was thinking about starting the LO cleanup after autovacuum finishes the 
VACUUM command (so no trigger, no new mechanism). And about the overhead 
imposed, it will only execute the cleanup code in the tables that have oid/lo 
columns (this information will be collected when the autovacuum collects table 
information).
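
To make that concrete, here is a rough SQL sketch of the vacuumlo-style
cleanup being discussed (the documents/content names are hypothetical; a real
implementation would visit every oid/lo column it finds):

-- Unlink every large object not referenced from the hypothetical
-- documents.content column.
SELECT lo_unlink(m.oid)
FROM pg_largeobject_metadata m
WHERE NOT EXISTS (SELECT 1 FROM documents d WHERE d.content = m.oid);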



--
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Monster query

2011-10-24 Thread Jan Urbański
On 24/10/11 19:16, Omar Bettin wrote:
 Hi,
 
 Since the data are of third parties, will prepare a database suitable for the 
 purpose.
 In any case, the compressed backup will be around 20 MB.

If you are able to prepare a database dump that doesn't contain private
data, it would be awesome.

If it's 20 MB please just post a link to the archive.

Thank you very much!
Jan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] termination of backend waiting for sync rep generates a junk log message

2011-10-24 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 It might be that it'd work to deal with this by reducing the
 reported strength of all such cases from ERROR to WARNING.  Not
 sure that that's a good idea, but it might work.
 
Throwing an error on commit of a transaction when its work has
actually been persisted seems as bad as failing to persist the work
of a transaction when the commit appears successful.  If we know we
have met the persistence guarantees and some problem occurs after
that, then it seems to me we should issue a warning and indicate
success.  If we don't know whether the work will be persisted, it
seems to me we can block until we find out (if that's feasible) or
break the connection without indicating one way or the other.  The
work of the transaction must either all be persisted or not (I'm
assuming this isn't a problem here), and any user must be prepared
to deal with a broken connection before response on a commit
attempt.
 
I don't know whether changing all failures beyond the persistence
point in a commit to warnings is sufficient, but I think it is
required for anything which doesn't break the connection.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Separating bgwriter and checkpointer

2011-10-24 Thread Heikki Linnakangas

On 19.10.2011 17:58, Simon Riggs wrote:

On Wed, Oct 19, 2011 at 3:29 PM, Fujii Masao masao.fu...@gmail.com  wrote:

On Wed, Oct 19, 2011 at 9:45 PM, Robert Haas robertmh...@gmail.com  wrote:

I don't really see any reason to break the monitoring view just
because we did some internal refactoring.  I'd rather have backward
compatibility.


Fair enough.

The patch doesn't change any documentation, but at least the description
of pg_stat_bgwriter seems to need to be changed.


Thanks for the review.

Will follow up on suggestions.


The patch looks sane, it's mostly just moving existing code around, but 
there's one thing that's been bothering me about this whole idea from 
the get-go:


If the bgwriter and checkpointer are two different processes, whenever 
bgwriter writes out a page it needs to send an fsync-request to the 
checkpointer. We avoided that when both functions were performed by the 
same process, but now we have to send and absorb a fsync-request message 
for every single write() that happens in the system, except for those 
done at checkpoints. Isn't that very expensive? Does it make the 
fsync-request queue a bottleneck on some workloads?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Use new oom_score_adj without a new compile-time constant

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 1:17 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 [ oom_score_adj business ]

 Did we do anything about this?  Anyone else have an opinion on what
 ought to be done?

 I held off doing anything because it didn't seem like we had consensus.
 OTOH, it may well be that it's not important enough to demand real
 consensus, and he who does the work gets to make the choices.

Half a loaf is better than none.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inserting heap tuples in bulk in COPY

2011-10-24 Thread Heikki Linnakangas

On 25.09.2011 16:03, Dean Rasheed wrote:

On 25 September 2011 09:43, Kohei KaiGai kai...@kaigai.gr.jp  wrote:

Hi Heikki,

I checked your patch, then I have a comment and two questions here.

2011/9/14 Heikki Linnakangas heikki.linnakan...@enterprisedb.com:


Attached is a new version of the patch. It is now complete, including WAL
replay code.


Hi,

I had a quick look at the patch as well and spotted an oversight: the
multi-insert code branch fails to invoke AFTER ROW triggers.


Thanks! Here's an updated version of the patch, fixing that, and all the 
other issues pointed out this far.


I extracted the code that sets oid and tuple headers, and invokes the 
toaster, into a new function that's shared by heap_insert() and 
heap_multi_insert(). Tom objected to merging heap_insert() and 
heap_multi_insert() into one complicated function, and I think he was 
right on that, but sharing this code to prepare a tuple still makes 
sense. IMHO it makes heap_insert() slightly more readable too.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***
*** 24,29 
--- 24,30 
   *		heap_getnext	- retrieve next tuple in scan
   *		heap_fetch		- retrieve tuple with given tid
   *		heap_insert		- insert tuple into a relation
+  *		heap_multi_insert - insert multiple tuples into a relation
   *		heap_delete		- delete a tuple from a relation
   *		heap_update		- replace a tuple in a relation with another tuple
   *		heap_markpos	- mark scan position
***
*** 79,84  static HeapScanDesc heap_beginscan_internal(Relation relation,
--- 80,87 
  		int nkeys, ScanKey key,
  		bool allow_strat, bool allow_sync,
  		bool is_bitmapscan);
+ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
+ 	TransactionId xid, CommandId cid, int options);
  static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  ItemPointerData from, Buffer newbuf, HeapTuple newtup,
  bool all_visible_cleared, bool new_all_visible_cleared);
***
*** 1866,1920  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
  	Buffer		vmbuffer = InvalidBuffer;
  	bool		all_visible_cleared = false;
  
- 	if (relation->rd_rel->relhasoids)
- 	{
- #ifdef NOT_USED
- 		/* this is redundant with an Assert in HeapTupleSetOid */
- 		Assert(tup->t_data->t_infomask & HEAP_HASOID);
- #endif
- 
- 		/*
- 		 * If the object id of this tuple has already been assigned, trust the
- 		 * caller.	There are a couple of ways this can happen.  At initial db
- 		 * creation, the backend program sets oids for tuples. When we define
- 		 * an index, we set the oid.  Finally, in the future, we may allow
- 		 * users to set their own object ids in order to support a persistent
- 		 * object store (objects need to contain pointers to one another).
- 		 */
- 		if (!OidIsValid(HeapTupleGetOid(tup)))
- 			HeapTupleSetOid(tup, GetNewOid(relation));
- 	}
- 	else
- 	{
- 		/* check there is not space for an OID */
- 		Assert(!(tup->t_data->t_infomask & HEAP_HASOID));
- 	}
- 
- 	tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
- 	tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
- 	tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
- 	HeapTupleHeaderSetXmin(tup->t_data, xid);
- 	HeapTupleHeaderSetCmin(tup->t_data, cid);
- 	HeapTupleHeaderSetXmax(tup->t_data, 0);		/* for cleanliness */
- 	tup->t_tableOid = RelationGetRelid(relation);
- 
  	/*
! 	 * If the new tuple is too big for storage or contains already toasted
! 	 * out-of-line attributes from some other relation, invoke the toaster.
  	 *
  	 * Note: below this point, heaptup is the data we actually intend to store
  	 * into the relation; tup is the caller's original untoasted data.
  	 */
! 	if (relation->rd_rel->relkind != RELKIND_RELATION)
! 	{
! 		/* toast table entries should never be recursively toasted */
! 		Assert(!HeapTupleHasExternal(tup));
! 		heaptup = tup;
! 	}
! 	else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
! 		heaptup = toast_insert_or_update(relation, tup, NULL, options);
! 	else
! 		heaptup = tup;
  
  	/*
  	 * We're about to do the actual insert -- but check for conflict first,
--- 1869,1882 
  	Buffer		vmbuffer = InvalidBuffer;
  	bool		all_visible_cleared = false;
  
  	/*
! 	 * Fill in tuple header fields, assign an OID, and toast the tuple if
! 	 * necessary.
  	 *
  	 * Note: below this point, heaptup is the data we actually intend to store
  	 * into the relation; tup is the caller's original untoasted data.
  	 */
! 	heaptup = heap_prepare_insert(relation, tup, xid, cid, options);
  
  	/*
  	 * We're about to do the actual insert -- but check for conflict first,
***
*** 2035,2041  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
  	 */
  	CacheInvalidateHeapTuple(relation, heaptup, NULL);
  
! 	pgstat_count_heap_insert(relation);
  
  	/*
  	 * If heaptup is a private copy, release it.  Don't 

Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 I had wondered whether it'd be worth optimizing that along the
 lines of slot_getallattrs().  But most indexes probably have only
 one column, or anyway not enough to make for a useful savings.
 
From a heavily-used production database:
 
cir=> select indnatts, count(*) from pg_index group by indnatts
order by indnatts;
 indnatts | count 
----------+-------
        1 |   200
        2 |   684
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)
 
This includes system table and TOAST table indexes (which seem to
have two columns).  There are over 400 user tables, each of which
has a primary key, so most primary keys in our database are more
than one column.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unreproducible bug in snapshot import code

2011-10-24 Thread Gurjeet Singh
On Mon, Oct 24, 2011 at 1:08 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:


 Excerpts from Gurjeet Singh's message of lun oct 24 13:55:44 -0300 2011:

  Starting from line 89 of terminal1.txt we see this snippet:
 
  postgres=# rollback;
  ROLLBACK
  postgres=# begin TRANSACTION ;
  BEGIN
  postgres=# set transaction snapshot '02C8-1';
  ERROR:  SET TRANSACTION SNAPSHOT must be called before any query
  postgres=# rollback;
  ROLLBACK
 
 
  As you can see the SET TRANSACTION SNAPSHOT was the first statement
 in
  that transaction, and yet the ERROR message says that it is not.

 Maybe the tab-completion feature issued a query before the set
 transaction command.


I have tried reproducing the bug starting from 1 and 2 transactions before
the one shown in the snippet, and I used tab-completion to get the same
screen output as terminal1.txt, and yet it's not reproducible.

Regards,
-- 
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Tom Lane t...@sss.pgh.pa.us wrote:
 I had wondered whether it'd be worth optimizing that along the
 lines of slot_getallattrs().  But most indexes probably have only
 one column, or anyway not enough to make for a useful savings.
 
 From a heavily-used production database:
 
 cir=> select indnatts, count(*) from pg_index group by indnatts
 order by indnatts;
  indnatts | count 
 ----------+-------
         1 |   200
         2 |   684
         3 |   155
         4 |    76
         5 |    43
         6 |    13
         7 |     2
         9 |     1
 (8 rows)
 
 This includes system table and TOAST table indexes (which seem to
 have two columns).

Yeah, TOAST indexes are 2-column.  It would be best to exclude those
from your counts, since it seems pretty unlikely that anyone will care
how fast nodeIndexonlyscan.c is for scans on toast tables.
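
For what it's worth, one way to redo that count while leaving the TOAST
indexes out (just one way to slice it, not the only one) would be:

SELECT i.indnatts, count(*)
FROM pg_index i
     JOIN pg_class c ON c.oid = i.indexrelid
     JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname <> 'pg_toast'
GROUP BY i.indnatts
ORDER BY i.indnatts;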

 There are over 400 user tables, each of which
 has a primary key, so most primary keys in our database are more
 than one column.

It doesn't look to me like the mean is above 2 (unless you have many
fewer toast tables than I suspect), so trying to optimize many-column
cases isn't going to help.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] (PATCH) Adding CORRESPONDING to Set Operations

2011-10-24 Thread Erik Rijkers
On Wed, October 19, 2011 15:01, Kerem Kat wrote:
 Adding CORRESPONDING to Set Operations
 Initial patch, filename: corresponding_clause_v2.patch

I had a quick look at the behaviour of this patch.

Btw, the examples in your email were typoed (one select is missing):

 SELECT 1 a, 2 b, 3 c UNION CORRESPONDING 4 b, 5 d, 6 c, 7 f;
should be:
  SELECT 1 a, 2 b, 3 c UNION CORRESPONDING select 4 b, 5 d, 6 c, 7 f;

and

 SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(b) 4 b, 5 d, 6 c, 7 f;
should be:
  SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(b) select 4 b, 5 d, 6 c, 7 f;




But there is also a small bug, I think: the order in the CORRESPONDING BY list 
should be followed,
according to the standard (foundation, p. 408):

2) If corresponding column list is specified, then let SL be a select list 
of those column
names explicitly appearing in the corresponding column list in the order 
that these
column names appear in the corresponding column list. Every column name 
in the
corresponding column list shall be a column name of both T1 and T2.

That would make this wrong, I think:

SELECT 1 a, 2 b, 3 c UNION CORRESPONDING BY(c,b) select 5 d, 6 c, 7 f, 4 b ;

 b | c
---+---
 2 | 3
 4 | 6
(2 rows)

i.e., I think it should show columns in the order c, b (and not b, c); the 
order of the
CORRESPONDING BY phrase.

(but maybe I'm misreading the text of the standard; I find it often difficult 
to follow)


Thanks,


Erik Rijkers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 Yeah, TOAST indexes are 2-column.  It would be best to exclude
 those from your counts, since it seems pretty unlikely that anyone
 will care how fast nodeIndexonlyscan.c is for scans on toast
 tables.
 
User indexes (excluding toast):
 
 indnatts | count 
----------+-------
        1 |   200
        2 |   222
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)
 
System indexes (excluding toast):
 
 indnatts | count 
----------+-------
        1 |    46
        2 |    24
        3 |     9
        4 |     5
(4 rows)
 
 It doesn't look to me like the mean is above 2 (unless you have
 many fewer toast tables than I suspect), so trying to optimize
 many-column cases isn't going to help.
 
The mean is 2.4 (give or take a little depending on whether you
include system tables).  I have no idea where the optimization
becomes worthwhile, but the assertion that most indexes probably
have a single column worried me.  I'm sure there are databases where
that is true (especially for those who insist on adding a
meaningless surrogate key column to every table), but there are many
where it isn't true.  I would guess that our average of 2.4 is
higher than most, though.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Copy/paste problems -- the first set includes the system tables
except for toast.  User tables would be the difference between the
results below.  Sorry.
 
-Kevin
 
 
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
  
 Yeah, TOAST indexes are 2-column.  It would be best to exclude
 those from your counts, since it seems pretty unlikely that
 anyone will care how fast nodeIndexonlyscan.c is for scans on
 toast tables.
  
 User indexes (excluding toast):
  
  indnatts | count 
 ----------+-------
         1 |   200
         2 |   222
         3 |   155
         4 |    76
         5 |    43
         6 |    13
         7 |     2
         9 |     1
 (8 rows)
  
 System indexes (excluding toast):
  
  indnatts | count 
 ----------+-------
         1 |    46
         2 |    24
         3 |     9
         4 |     5
 (4 rows)
  
 It doesn't look to me like the mean is above 2 (unless you have
 many fewer toast tables than I suspect), so trying to optimize
 many-column cases isn't going to help.
  
 The mean is 2.4 (give or take a little depending on whether you
 include system tables).  I have no idea where the optimization
 becomes worthwhile, but the assertion that most indexes probably
 have a single column worried me.  I'm sure there are databases
 where that is true (especially for those who insist on adding a
 meaningless surrogate key column to every table), but there are
 many where it isn't true.  I would guess that our average of 2.4
 is higher than most, though.
  
 -Kevin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 2:15 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 It doesn't look to me like the mean is above 2 (unless you have
 many fewer toast tables than I suspect), so trying to optimize
 many-column cases isn't going to help.

 The mean is 2.4 (give or take a little depending on whether you
 include system tables).  I have no idea where the optimization
 becomes worthwhile, but the assertion that most indexes probably
 have a single column worried me.  I'm sure there are databases where
 that is true (especially for those who insist on adding a
 meaningless surrogate key column to every table), but there are many
 where it isn't true.  I would guess that our average of 2.4 is
 higher than most, though.

Keep in mind that the existence of index-only scans is going to
encourage people to try to create covering indexes on columns that
aren't indexed today.  It's not clear how much of a win that will be;
for certain workloads, HOT might make it backfire spectacularly.

But even though Tom's statement that most indexes are one column might
be a slight exaggeration, I suspect it probably is true that the
optimizations he's talking about for large numbers of columns won't
produce any material benefit even for a 3 or 4 column index.  Which
makes me think maybe we should focus our efforts elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 But even though Tom's statement that most indexes are one column might
 be a slight exaggeration, I suspect it probably is true that the
 optimizations he's talking about for large numbers of columns won't
 produce any material benefit even for a 3 or 4 column index.  Which
 makes me think maybe we should focus our efforts elsewhere.

Right.  If we thought the average was something like ten, it might be
worth pursuing optimizations similar to slot_getallattrs.  If it's
around two or three, almost certainly not.

Your point about people trying to create wider indexes to exploit
index-only scans is an interesting one, but I think it's premature to
optimize on the basis of hypotheses about what people might do in
future.

Not sure about your other idea of returning multiple tuples per
amgettuple call.  The trouble with that is that it will add complexity
(and hence cycles) at the nodeIndexscan level, because now nodeIndexscan
will have to buffer those tuples, keep track of whether it's fetching
forward or backward, etc etc.  Plus another layer of the same in
indexam.c (index_getnext etc).  I'm not at all convinced that it's
likely to be a net win.

I wonder how trustworthy the measure of the visibilitymap_test call site
as a consumer of cycles really is.  I've frequently noticed that
oprofile blames remarkably large fractions of the runtime on individual
statements that appear to be quite trivial.  I'm not sure if that
represents real hardware-level effects such as cache line switching,
or whether it's just measurement artifacts.  Keep in mind that
sampling-based measurements are always subject to sampling artifacts.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Josh Berkus
On 10/24/11 12:35 PM, Tom Lane wrote:
 Your point about people trying to create wider indexes to exploit
 index-only scans is an interesting one, but I think it's premature to
 optimize on the basis of hypotheses about what people might do in
 future.

I don't think that this is hypothetical at all.  I know *I'll* be doing
it, and we can expect users who are familiar with MySQL and Oracle to do
it as well.

No, it won't be the majority of our users, who are using ORMs and thus
don't really think about indexing at all.  But it will be a significant
number of users who are performance-sensitive ... such as most or all of
our data warehousing users.

Mind you, we're pretty much talking exclusively about users whose tables
don't fit in memory ... usually tables which are 10X or more the size of
memory.

One case which is going to be critical to test is the join table, i.e.
the table which supports many-to-many joins and consists only of keys
from the respective two other tables.
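
As a hypothetical sketch of that case: every index on such a join table
already covers all of its columns, so index-only scans could serve lookups
from either side without heap visits, visibility map permitting:

CREATE TABLE film_actor (
    film_id  integer NOT NULL,
    actor_id integer NOT NULL,
    PRIMARY KEY (film_id, actor_id)          -- covers lookups by film
);
CREATE INDEX film_actor_actor_film_idx
    ON film_actor (actor_id, film_id);       -- covers lookups by actor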

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Robert Haas
On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Your point about people trying to create wider indexes to exploit
 index-only scans is an interesting one, but I think it's premature to
 optimize on the basis of hypotheses about what people might do in
 future.

Well, I don't think it's too much of a stretch to guess that people
will try to use covering indexes; that's common practice on other
products and a not-uncommon heartfelt request from
people with large (non-memory-resident) databases they want to migrate
to PostgreSQL.  Exactly to what degree they'll do that, and how well
it will work, is another question.  But I have little doubt that it
will be tried.

 Not sure about your other idea of returning multiple tuples per
 amgettuple call.  The trouble with that is that it will add complexity
 (and hence cycles) at the nodeIndexscan level, because now nodeIndexscan
 will have to buffer those tuples, keep track of whether it's fetching
 forward or backward, etc etc.  Plus another layer of the same in
 indexam.c (index_getnext etc).  I'm not at all convinced that it's
 likely to be a net win.

I definitely agree that you don't want two layers of caching, but I
don't see why we'd need that.  I wasn't thinking of changing
index_getnext() at all, but rather adding a new API that fills a
buffer (or a pair of buffers, maybe) with index tuples and heap TIDs.
It should spit them out in the same order that multiple
index_getnext() calls would have done and leave the scan position
wherever it would have ended up after a number of index_getnext_tid()
calls equal to the number of TIDs returned.  Any user of the API (and
it might be just nodeIndexonlyscan.c) would just need to keep track of
the number of tuples returned and the number consumed to date.

This actually gets into a wider architectural discussion, which is
whether the fact that the whole executor (modulo bitmap scans and a
few other special cases) operates on one tuple at a time is a good
design...  but my brain hurts just thinking about that.

 I wonder how trustworthy the measure of the visibilitymap_test call site
 as a consumer of cycles really is.  I've frequently noticed that
 oprofile blames remarkably large fractions of the runtime on individual
 statements that appear to be quite trivial.  I'm not sure if that
 represents real hardware-level effects such as cache line switching,
 or whether it's just measurement artifacts.  Keep in mind that
 sampling-based measurements are always subject to sampling artifacts.

I'm not sure either.  I guess we could try short-circuiting
visibilitymap_test and see what that does to performance (let's leave
correct answers out of it).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Kevin Grittner
Josh Berkus j...@agliodbs.com wrote:
 
 On 10/24/11 12:35 PM, Tom Lane wrote:
 Your point about people trying to create wider indexes to exploit
 index-only scans is an interesting one, but I think it's
 premature to optimize on the basis of hypotheses about what
 people might do in future.
 
 I don't think that this is hypothetical at all.  I know *I'll* be
 doing it, and we can expect users who are familiar with MySQL and
 Oracle to do it as well.
 
And Sybase, and MS SQL Server.  And others, most likely.  We've
never gotten around to narrowing the indexes to which we added extra
columns to overcome performance problems through covering index
techniques when we were using Sybase, so they're already here.  :-)
 
 One case which is going to be critical to test is the join
 table, i.e. the table which supports many-to-many joins and
 consists only of keys from the respective two other tables.
 
Yeah, that is an important use of covering indexes for us.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] So, is COUNT(*) fast now?

2011-10-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wonder how trustworthy the measure of the visibilitymap_test call site
 as a consumer of cycles really is.

 I'm not sure either.  I guess we could try short-circuiting
 visibilitymap_test and see what that does to performance (let's leave
 correct answers out of it).

That would conflate the cost of the call with the cost of the function.
Maybe you could try manually inlining the visibility test?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Idea: Always consistent in-database cache using SSI mechanisms

2011-10-24 Thread Alexander Korotkov
Hackers,

After Heikki's talk on PgConf.EU about SSI, an idea came to my mind.
Couldn't the predicate locking implementation in SSI be used for in-database
cache invalidation?
It could be possible to implement an in-database cache which acquires
predicate locks like SSI transactions do. In case of any conflict with other
transactions, the corresponding cache entry is invalidated. Therefore, it might be
possible to get the acceleration of caching without the risk of inconsistent
answers.
Actually, I don't understand SSI in detail, so I probably mess up things.
Does my idea make any sense?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Idea: Always consistent in-database cache using SSI mechanisms

2011-10-24 Thread Kevin Grittner
Alexander Korotkov aekorot...@gmail.com wrote:
 
 Coundn't be predicate locking implementation in SSI be used for
 in-database cache invalidation.
 
It would not necessarily be limited to *in-database* caches.  The
main thing would be to design a good API to the predicate locking
portion of SSI, which I think is about 80% of the SSI code.  Dan and
I both have an interest in such further use, and there have been
others who have talked about potential uses for the non-blocking
predicate locking.  I think the API would need to be based around a
listen/notify model.
 
 It could be possible to implement in-database cache which will
 acquire predicate locks like SSI transactions. In case of any
 conflich with other transactions corresponding cache invalidates.
 Therefore, it might be possible to get acceleration of caching
 without risk of inconsistent answers.
 
I had not thought of that potential use.  At first glance, I think
it has possibilities, but only if the above-mentioned API was
formalized *and* there was some way to configure a cluster for
serializable transactions only.  Long-range, I have hopes for
both.
 
 Actually, I don't understand SSI in details. So, probably I mess
 up things. Does my idea any matter?
 
Sure!  Besides having the available development time, I think the
biggest obstacle is having enough plausible use cases for predicate
lock access to do a good job defining the API.  While we made some
effort to keep the predicate locking and serializable behavior
separate in the implementation, it wasn't clear where the bright
line was, so there is bound to be some rearrangement needed when we
figure that out.  The more ideas we have in front of us on how
predicate locks might be useful, the better the API design is likely
to be.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI implementation question

2011-10-24 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 I don't understand the SSI code well enough to tell if this is
 sufficient or not, so I hope you guys will take a closer look at
 the issue when you have time.
 
I will definitely give it a look.  Right now we have a perfect
storm of time demands due to some recently-passed legislation
combined with the need to wrap up some Annual Plan items.  Within
a few weeks I should find time for a more careful review and
testing.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Idea: Always consistent in-database cache using SSI mechanisms

2011-10-24 Thread Alexander Korotkov
On Tue, Oct 25, 2011 at 1:46 AM, Kevin Grittner kevin.gritt...@wicourts.gov
 wrote:

 Alexander Korotkov aekorot...@gmail.com wrote:

  Coundn't be predicate locking implementation in SSI be used for
  in-database cache invalidation.

 It would not necessarily be limited to *in-database* caches.  The
 main thing would be to design a good API to the predicate locking
 portion of SSI, which I think is about 80% of the SSI code.  Dan and
 I both have an interest in such further use, and there have been
 others who have talked about potential uses for the non-blocking
 predicate locking.  I think the API would need to be based around a
 listen/notify model.

  It could be possible to implement in-database cache which will
  acquire predicate locks like SSI transactions. In case of any
  conflich with other transactions corresponding cache invalidates.
  Therefore, it might be possible to get acceleration of caching
  without risk of inconsistent answers.

 I had not thought of that potential use.  At first glance, I think
 it has possibilities, but only if the above-mentioned API was
 formalized *and* there was some way to configure a cluster for
 serializable transactions only.  Long-range, I have hopes for
 both.


Sure, it would be rather better to implement that through API.

  Actually, I don't understand SSI in details. So, probably I mess
  up things. Does my idea any matter?

 Sure!  Besides having the available development time, I think the
 biggest obstacle is having enough plausible use cases for predicate
 lock access to do a good job defining the API.  While we made some
 effort to keep the predicate locking and serializable behavior
 separate in the implementation, it wasn't clear where the bright
 line was, so there is bound to be some rearrangement needed when we
 figure that out.  The more ideas we have in front of us on how
 predicate locks might be useful, the better the API design is likely
 to be.


Thanks for feedback on my idea. I'll share ideas about more possible usage
of that API if I have any.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Online base backup from the hot-standby

2011-10-24 Thread Fujii Masao
Thanks for the review!

On Tue, Oct 25, 2011 at 12:24 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 24.10.2011 15:29, Fujii Masao wrote:

 +    listitem
 +     para
 +      Copy the pg_control file from the cluster directory to the global
 +      sub-directory of the backup. For example:
 + programlisting
 + cp $PGDATA/global/pg_control /mnt/server/backupdir/global
 + /programlisting
 +     /para
 +    /listitem

 Why is this step required? The control file is overwritten by information
 from the backup_label anyway, no?

Yes, when recovery starts, the control file is overwritten. But before that,
we retrieve the minimum recovery point from the control file. Then it's used
as the backup end location.

During recovery, pg_stop_backup() cannot write an end-of-backup record.
So, in a standby-only backup, another way to retrieve the backup end location
(instead of an end-of-backup record) is required. Ishiduka-san used the
control file for that, following your suggestion ;)
http://archives.postgresql.org/pgsql-hackers/2011-05/msg01405.php

 +    listitem
 +     para
 +      Again connect to the database as a superuser, and execute
 +      functionpg_stop_backup/. This terminates the backup mode, but
 does not
 +      perform a switch to the next WAL segment, create a backup history
 file and
 +      wait for all required WAL segments to be archived,
 +      unlike that during normal processing.
 +     /para
 +    /listitem

 How do you ensure that all the required WAL segments have been archived,
 then?

The patch doesn't provide any capability to ensure that, IOW assumes that's
a user responsibility. If a user wants to ensure that, he/she needs to calculate
the backup start and end WAL files from the result of pg_start_backup()
and pg_stop_backup() respectively, and needs to wait until those files have
appeared in the archive. Also if the required WAL file has not been archived
yet, a user might need to execute pg_switch_xlog() in the master.
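
For example, the WAL segment that has to be present in the archive can be
computed on the master from the location pg_stop_backup() returned (the
location below is illustrative):

-- On the master; '0/9000020' stands for whatever pg_stop_backup() returned.
SELECT pg_xlogfile_name('0/9000020');
-- If that segment has not reached the archive yet, pg_switch_xlog() on the
-- master forces the current segment out so it can be archived.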

If we change pg_stop_backup() so that, even during recovery, it waits until
all required WAL files have been archived, we would need to WAL-log
the completion of WAL archiving in the master. This enables the standby to
check whether specified WAL files have been archived. Should we change
the patch in this way? But even with that change, you might still need to execute
pg_switch_xlog() in the master, and pg_stop_backup() might keep
waiting forever if the master is not generating any new WAL.

 +   /orderedlist
 +    /para
 +
 +    para
 +     You cannot use the applicationpg_basebackup/ tool to take the
 backup
 +     from the standby.
 +    /para

 Why not? We have cascading replication now.

Because no one has implemented that feature.

Yeah, we have cascading replication, but without adopting the standby-only
backup patch, pg_basebackup cannot execute do_pg_start_backup() and
do_pg_stop_backup() during recovery. So the patch that
Ishiduka-san proposed can be seen as the first step toward extending
pg_basebackup so that it can take a backup from the standby.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Separating bgwriter and checkpointer

2011-10-24 Thread Simon Riggs
On Mon, Oct 24, 2011 at 11:40 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 The patch looks sane, it's mostly just moving existing code around, but
 there's one thing that's been bothering me about this whole idea from the
 get-go:

 If the bgwriter and checkpointer are two different processes, whenever
 bgwriter writes out a page it needs to send an fsync-request to the
 checkpointer. We avoided that when both functions were performed by the same
 process, but now we have to send and absorb a fsync-request message for
 every single write() that happens in the system, except for those done at
 checkpoints. Isn't that very expensive? Does it make the fsync-request queue
 a bottleneck on some workloads?

That is a reasonable question and one I considered.

I did some benchmarking earlier to see the overhead of that.
Basically, it's very small, much, much smaller than I thought.

The benefit of allowing the bgwriter to continue working during long
fsyncs easily outweighs the cost of sending more fsync requests. Both of
those overheads/problems occur at the same time, so the extra
overhead is always covered.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-10-24 Thread Fujii Masao
On Tue, Oct 25, 2011 at 12:33 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 One problem with this whole FPW-tracking is that pg_lesslog makes it fail.
 I'm not sure what we need to do about that - maybe just add a warning to the
 docs. But it leaves a bit bad feeling in my mouth. Usually we try to make
 features work orthogonally, without dependencies to other settings. Now this
 feature requires that full_page_writes is turned on in the master, and also
 that you don't use pg_lesslog to compress the WAL segments or your base
 backup might be corrupt.

Right, pg_lesslog users cannot use the documented procedure. They need to
follow a more complex one:

1. Execute pg_start_backup() in the master, and save its return value.
2. Wait until the backup starting checkpoint record has been replayed
in the standby. You can do this by comparing the return value of
pg_start_backup() with pg_last_replay_location().
3. Do the documented standby-only backup procedure.
4. Execute pg_stop_backup() in the master.

This is complicated, but I'm not sure how we can simplify it. Anyway, we can
document this procedure for pg_lesslog users. Should we?

 The procedure to take a backup from the standby
 seems more complicated than taking it on the master - there are more steps
 to follow.

Extending pg_basebackup so that it can take a backup from the standby would
make the procedure simpler to a certain extent, I think. Though a user
still needs
to enable FPW in the master and must not use pg_lesslog.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers