Re: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL

2012-11-11 Thread Greg Smith

On 11/9/12 11:59 PM, Amit kapila wrote:


Please let me know if there are any objections or problems with the above
method of implementation; otherwise I can go ahead and prepare the patch for
the coming CF.


It may be the case that the locking scheme Robert described is the best 
approach here.  It seems kind of heavy to me though.  I suspect that 
some more thinking about it might come up with something better.


Regardless, exactly how to prevent two concurrent processes from writing 
the same file feels like the last step in the small roadmap for what 
this feature needs.  If you wanted to work on it more, I'd suggest 
breaking it into chunks in this order:


1) Change to add scanning a .conf directory in the default configuration 
using include-dir.  This is a quick fix.  I predict most of the 
headaches around it will end up being for packagers rather than the core 
code to deal with.


You could submit this as a small thing to be evaluated on its own.  How 
it's done is going to be controversial.  Might as well get that fighting 
focused on a sample implementation as soon as possible.
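
For illustration only, the kind of one-line default being talked about here 
might look something like this in the stock postgresql.conf (the directive 
spelling and the directory name are assumptions, not anything settled):

    # scan an additional directory for configuration snippets
    include_dir = 'conf.d'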


2) Get familiar with navigating the GUC data and figuring out what, 
exactly, needs to be written out.  This should include something that 
navigates the settings as they appear after a RESET, ignoring per-user or 
per-session changes when figuring out what goes there.  It seems 
inevitable that some amount of validation against the source 
information--what pg_settings labels source, sourcefile, and 
sourceline--will be needed.  An example is the suggestion Magnus made for 
confirming that the include-dir is still active before writing something there.
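
For reference, that source information is already exposed by pg_settings; a 
query along these lines (run as a superuser, so that sourcefile and 
sourceline are populated) shows where each file-derived setting came from:

    SELECT name, setting, source, sourcefile, sourceline
      FROM pg_settings
     WHERE source = 'configuration file';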


3) Add the function to write a new file out.  Work out some test cases 
for that to confirm the logic and error checking in the previous step 
all works.


I'd next submit what you've got for (2) and (3) to review at this point, 
before complicating things further with the locking parts.


4) Make the file write atomic and able to work when multiple users try 
it at once.  You have to reach here successfully before the details 
around file locking come into play.  I wouldn't even bother aiming for 
that part in a first patch.  It's obviously a solvable problem in a 
number of ways.  You need a rock-solid way to figure out what to write 
there before that solution is useful though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-11 Thread Greg Smith

On 11/12/12 12:55 AM, Jesper Krogh wrote:

I'd just like some rough guard against
hardware/OS related data corruption, and that is more likely to hit
data blocks constantly flying in and out of the system.


I get that.  I think that some of the design ideas floating around since 
this feature was first proposed have been innovating in the hope of 
finding a clever halfway point here.  Ideally we'd be able to get online 
checksum conversion up and running easily, reliably, and without adding 
a lot of code.  I have given up on that now though.


The approach of doing a heavy per-table conversion with more state 
information than we'd like seems unavoidable, if you want to do it right 
and allow people to (slowly but surely) reach a trustworthy state.  I 
think we should stop searching for a clever way around it and just slog 
through doing it.  I've resigned myself to that now, and recently set 
aside a good block of time to beat my head against that particular wall 
over the next couple of months.



But I totally agree that the scheme described, integrating it into an
autovacuum process, would be very close to ideal, even on a database like
the one I'm running.


I am sadly all too familiar with how challenging it is to keep a 2TB 
PostgreSQL database running reliably.  One of my recent catch phrases 
for talks is "if you have a big Postgres database, you also have a 
vacuum problem".  I think it's unreasonable to consider online 
conversion solutions that don't recognize that and don't allow 
coordinating the work with the challenges of vacuuming larger systems too.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-11 Thread Jesper Krogh

On 12/11/12 05:55, Greg Smith wrote:
The only guarantee I see that we can give for online upgrades is that 
after a VACUUM CHECKSUM sweep is done, and every page is known to both 
have a valid checksum on it and have its checksum bits set, *then* any 
page that doesn't have both set bits and a matching checksum is 
garbage.  Until reaching that point, any old data is suspect.  The 
idea of operating in a "we'll convert on write but never convert old 
pages" mode can't offer any useful guarantees about data integrity 
that I can see.  As you say, on that path you never gain the ability 
to tell pages that were checksummed but have since been corrupted from 
ones that were corrupt all along.


You're right about that, but I'd just like some rough guard against
hardware/OS related data corruption, and that is more likely to hit
data blocks constantly flying in and out of the system.

I'm currently running a 2TB+ database, and the capability to just see
some kind of corruption earlier rather than later is a major benefit by
itself. Currently corruption can go undetected if it just happens to hit
data-only parts of the database.

But I totally agree that the scheme described, integrating it into an
autovacuum process, would be very close to ideal, even on a database
like the one I'm running.

--
Jesper




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP checksums patch

2012-11-11 Thread Greg Smith

On 11/9/12 6:14 PM, Jeff Davis wrote:

On Mon, 2012-11-05 at 12:19 -0500, Robert Haas wrote:

Yeah.  I definitely think that we could shed an enormous amount of
complexity by deciding that this is, for now, an option that can only
be selected at initdb time.  That would remove approximately 85% of
everything I've ever disliked about this patch - without, I think,
precluding the possibility of improving things later.


That's certainly true, but it introduces one large problem: upgrading
would not work, which (in the past few releases) we've treated as a
major showstopper for many features.


If you have pages that were written with one datetime storage format, 
and you create a cluster using the other one, that will fail.  If 
checksums are done as an initdb-time option, I see "checksummed" as a 
block-level change of the same sort, and the precedent for not supporting 
it is defensible.  pg_upgrade will need to check for a mismatch--the new 
cluster has checksums turned on, the old one doesn't--and abort if that 
happens.  That can be lumped in with the other pg_controldata tests 
easily enough.
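
A sketch of roughly what that check could look like on the pg_upgrade side 
follows; the function and field names here are illustrative assumptions, 
not pg_upgrade's actual code:

    /* Sketch only: names are illustrative, not the real pg_upgrade API. */
    static void
    check_for_checksum_mismatch(ControlData *oldctrl, ControlData *newctrl)
    {
        if (oldctrl->data_checksums != newctrl->data_checksums)
            pg_log(PG_FATAL,
                   "old and new clusters have different data checksum settings\n");
    }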


What I think this argues for, though, is being precise about naming what 
goes into pg_controldata.  Let's say the initial commit target is an 
initdb time switch, but later finer grained ones are expected to be 
available.  I think the output and source code labels on the initdb 
controldata part should refer to this as something like "checksums 
available" then.  The word "enabled" could be misleading when there's 
finer grained enabling coming later.  We don't want people to run 
pg_controldata, see "checksums:  enabled/on", and later discover they're 
not fully operational in that cluster yet.  Saying "checksums: 
available" suggests there's somewhere else that should be checked, to 
tell if they're available on a given database/table or not.


The sort of text I'm thinking of for the manual and/or warning is:

"You can't use pg_upgrade to migrate from a cluster where checksums are 
not available to one where they are.  This limitation may be removed by 
a future version."


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-11 Thread Greg Smith

On 11/11/12 2:56 PM, Jeff Davis wrote:

We could have a separate utility, pg_checksums, that can
alter the state and/or do an offline verification. And initdb would take
an option that would start everything out fully protected with
checksums.


Adding an initdb option to start out with everything checksummed seems 
an uncontroversial good first thing to have available.  It seems like a 
proper 9.3 target to aim at even if per-table upgrading gets bogged down 
in details.  I have an argument below that the area between initdb and 
per-table upgrades is fundamentally uncertain and therefore not worth 
chasing after, based on reasons you already started to outline.  There's 
not much useful middle ground there.


Won't a pg_checksums program just grow until it looks like a limited 
version of vacuum though?  It's going to iterate over most of the table; 
it needs the same cost controls as autovacuum (and to respect the load 
of concurrent autovacuum work) to keep I/O under control; and those cost 
control values might change if there's a SIGHUP to reload parameters. 
It looks so much like vacuum that I think there needs to be a really 
compelling reason to split it into something new.  Why can't this be yet 
another autovacuum worker that does its thing?


> In order to get to the fully-protected state, you still need to
> somehow make sure that all of the old data is checksummed.
>
> And the "fully protected" state is important in my opinion, because
> otherwise we aren't protected against corrupt page headers that say
> they have no checksum (even when it really should have a checksum).

I think it's useful to step back for a minute and consider the larger 
uncertainty an existing relation has, which amplifies just how ugly this 
situation is.  The best guarantee I think online checksumming can offer 
is to tell the user "after transaction id X, all new data in relation R 
is known to be checksummed".  Unless you do this at initdb time, any 
conversion case is going to have the possibility that a page is 
corrupted before you get to it--whether you're adding the checksum as 
part of a "let's add them while we're writing anyway" page update or the 
conversion tool is hitting it.


That's why I don't think anyone will find online conversion really 
useful until they've done a full sweep updating the old pages.  And if 
you accept that, a flexible checksum upgrade utility, one that co-exists 
with autovacuum activity costs, becomes a must.


One of the really common cases I was expecting here is that conversions 
are done by kicking off a slow background VACUUM CHECKSUM job that might 
run in pieces.  I was thinking of an approach like this:


-Initialize a last_checked_block value for each table
-Loop:
--Grab the next block after the last checked one
--When on the last block of the relation, grab an exclusive lock to
protect against race conditions with extension
--If it's marked as checksummed and the checksum matches, skip it
--Otherwise, add a checksum and write it out
--When that succeeds, update last_checked_block
--If that was the last block, save some state saying the whole table is
checksummed
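
To make the shape of that loop concrete, here is a rough C sketch; the 
helpers marked hypothetical (get/save_last_checked_block, 
page_is_checksummed, checksum_matches, set_page_checksum, 
mark_relation_fully_checksummed) don't exist anywhere, they just stand in 
for the state tracking described above:

    /* Sketch only -- hypothetical helpers, not existing backend functions. */
    void
    checksum_relation(Relation rel)
    {
        BlockNumber blkno = get_last_checked_block(rel);    /* hypothetical */
        BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

        for (; blkno < nblocks; blkno++)
        {
            Buffer  buf;
            bool    last = (blkno == nblocks - 1);

            /* on the final block, lock out concurrent relation extension */
            if (last)
                LockRelationForExtension(rel, ExclusiveLock);

            buf = ReadBuffer(rel, blkno);
            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

            if (!page_is_checksummed(buf) || !checksum_matches(buf))
            {
                set_page_checksum(buf);                     /* hypothetical */
                MarkBufferDirty(buf);
            }

            UnlockReleaseBuffer(buf);

            if (last)
                UnlockRelationForExtension(rel, ExclusiveLock);

            save_last_checked_block(rel, blkno);            /* hypothetical */
        }

        mark_relation_fully_checksummed(rel);               /* hypothetical */
    }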


With that logic, there is at least a forward-moving pointer that removes 
the uncertainty around whether pages have been updated or not.  It will 
keep going usefully if interrupted too.  One obvious way this can 
fail is if:


1) A late page in the relation is updated and a checksummed page written
2) The page is corrupted such that the "is this checksummed?" bits are 
not consistent anymore, along with other damage to it

3) The conversion process gets to this page eventually
4) The corruption of (2) isn't detected

But I think that this possibility--that a page might get quietly 
corrupted after being checked once, but still in the middle of checking a 
relation--is both impossible to remove and a red herring.  How do we 
know that this page of the relation wasn't corrupted on disk before we 
even started?  We don't, and we can't.


The only guarantee I see that we can give for online upgrades is that 
after a VACUUM CHECKSUM sweep is done, and every page is known to both 
have a valid checksum on it and have its checksum bits set, *then* any 
page that doesn't have both set bits and a matching checksum is garbage. 
 Until reaching that point, any old data is suspect.  The idea of 
operating in a "we'll convert on write but never convert old pages" mode 
can't offer any useful guarantees about data integrity that I can 
see.  As you say, on that path you never gain the ability to tell pages 
that were checksummed but have since been corrupted from ones that were 
corrupt all along.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doc patch "only relevant" -> "relevant only"

2012-11-11 Thread Peter Eisentraut
On Tue, 2012-10-16 at 22:24 -0500, Karl O. Pinc wrote:
> In a number of places the docs read "only relevant",
> this patch reverses this to read "relevant only".

committed




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Matthew Gerber
On Sun, Nov 11, 2012 at 8:27 PM, Tom Lane  wrote:

> Noah Misch  writes:
> > So, I can reproduce the lower threshold, but the exception type does not
> agree
> > with the one Matthew observed.
>
> I finally got around to looking at the link you provided about error
> 0xC0000409, and realized that I'd been completely confusing it with
> stack overflow --- but actually, it's a report that something scribbled
> past the end of a finite-size local-variable array.  So I now think that
> Matthew's stumbled across two completely independent bugs, and we've
> fixed only one of them.  The 0xC0000409 error is something else, and
> possibly a lot worse since it could conceivably be a security issue.
>
> It still seems likely that the actual location of the bug is either
> in PostGIS or in the GIST index code, but without the ability to
> reproduce the failure it's awfully hard to find it.  Matthew, could
> you try a bit harder to find a self-contained test case that produces
> that error?
>
> regards, tom lane
>

Sure, it might take me a while to find time but I'll keep it on my list.

Matt


Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Tom Lane
Noah Misch  writes:
> So, I can reproduce the lower threshold, but the exception type does not agree
> with the one Matthew observed.

I finally got around to looking at the link you provided about error
0xC0000409, and realized that I'd been completely confusing it with
stack overflow --- but actually, it's a report that something scribbled
past the end of a finite-size local-variable array.  So I now think that
Matthew's stumbled across two completely independent bugs, and we've
fixed only one of them.  The 0xC0000409 error is something else, and
possibly a lot worse since it could conceivably be a security issue.

It still seems likely that the actual location of the bug is either
in PostGIS or in the GIST index code, but without the ability to
reproduce the failure it's awfully hard to find it.  Matthew, could
you try a bit harder to find a self-contained test case that produces
that error?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-11 Thread Andres Freund
On 2012-11-10 16:24:22 -0500, Tom Lane wrote:
> Simon Riggs  writes:
> >> One thing that seems a bit annoying is the use of zero-based backup
> >> block indexes in RestoreBackupBlock, when most (not all) of the callers
> >> are referencing macros with one-based names (XLR_BKP_BLOCK_1 etc).
> >> That's a bug waiting to happen.  We could address it by changing
> >> RestoreBackupBlock to accept a one-based argument, but what I'm thinking
> >> of doing instead is getting rid of the one-based convention entirely;
> >> that is, remove XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4 and instead have
> >> all code use the XLR_SET_BKP_BLOCK() macro, which is zero-based.  One
> >> advantage of doing that is that it'll force inspection of all code
> >> related to this.
>
> > I wouldn't do that in a back branch, but I can see why it's a good idea.
> 
> If any of this stuff were getting used by external modules, changing it
> would be problematic ... but fortunately, it isn't, because we lack
> support for plug-in rmgrs.  So I'm not worried about back-patching the
> change, and would prefer to keep the 9.x branches in sync.

XLR_BKP_BLOCK_* might be used by things like pg_lesslog and it's surely
used by xlogdump. Not sure if either is worth that much attention, but
it seems worth noting that such a change will break stuff.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-11 Thread Andrew Dunstan


On 11/11/2012 05:52 PM, Jeff Davis wrote:

On Sun, 2012-11-11 at 21:20 +0100, Pavel Stehule wrote:

I don't think a GUC is good for this purpose, but I don't like
single-purpose statements either.

What do you think about enhancing the ALTER DATABASE statement,
something like

ALTER DATABASE name ENABLE CHECKSUMS and ALTER DATABASE name DISABLE CHECKSUMS

Per-database does sound easier than per-table. I'd have to think about
how that would affect shared catalogs though.

For now, I'm leaning toward an offline utility to turn checksums on or
off, called pg_checksums. It could do so lazily (just flip a switch to
"enabling" in pg_control), or it could do so eagerly and turn it into a
fully-protected instance.

For the first patch, it might just be an initdb-time option for
simplicity.



+1

I haven't followed this too closely, but I did wonder several days ago 
why this wasn't being made an initdb-time decision.


cheers

andrew




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] review: pgbench - aggregation of info written into log

2012-11-11 Thread Tomas Vondra
Hi,

attached is a v4 of the patch. There are not many changes, mostly some
simple tidying up, except for the handling of Windows.

After a bit more investigation, it seems to me that we can't really get
the same behavior as in other systems - basically the timestamp is
unavailable so we can't log the interval start. So it seems to me the
best we can do is to disable this option on Windows (and this is done in
the patch). So when you try to use "--aggregate-interval" on Win it will
complain it's an unknown option.

Now that I think of it, there might be a better solution that would not
mimic the Linux/Unix behaviour exactly - we do have elapsed time since
the start of the benchmark, so maybe we should use this instead of the
timestamp? I mean on systems with reasonable gettimeofday we'd get

1345828501 5601 1542744 483552416 61 2573
1345828503 7884 1979812 565806736 60 1479
1345828505 7208 1979422 567277552 59 1391
1345828507 7685 1980268 569784714 60 1398
1345828509 7073 1979779 573489941 236 1411

but on Windows we'd get

0 5601 1542744 483552416 61 2573
2 7884 1979812 565806736 60 1479
4 7208 1979422 567277552 59 1391
6 7685 1980268 569784714 60 1398
8 7073 1979779 573489941 236 1411

i.e. the first column is "seconds since start of the test". Hmmm, and
maybe we could call 'gettimeofday' at the beginning, to get the
timestamp of the test start (AFAIK the issue on Windows is the
resolution, but it should be enough). Then we could add it up with the
elapsed time and we'd get the same output as on Linux.

And maybe this could be done in regular pgbench execution too? But I
really need help from someone who knows Windows and can test this.
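
Just to illustrate the idea (a sketch only, not tested on Windows): take the
wall-clock time once at startup and derive each log line's timestamp from
the high-resolution elapsed time, e.g.

    time_t      start_wall = time(NULL);   /* wall-clock anchor, taken once */
    instr_time  start;

    INSTR_TIME_SET_CURRENT(start);

    /* ... later, when emitting an aggregated log line ... */
    instr_time  now;

    INSTR_TIME_SET_CURRENT(now);
    INSTR_TIME_SUBTRACT(now, start);
    fprintf(logfile, "%ld ...",
            (long) start_wall + (long) INSTR_TIME_GET_DOUBLE(now));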

Comments regarding Pavel's reviews are below:

On 2.10.2012 20:08, Pavel Stehule wrote:
> Hello
> 
> I did review of pgbench patch
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00737.php
> 
> * this patch is cleanly patched
> 
> * I had a problem with doc
> 
> make[1]: Entering directory `/home/pavel/src/postgresql/doc/src/sgml'
> openjade  -wall -wno-unused-param -wno-empty -wfully-tagged -D . -D .
> -c /usr/share/sgml/docbook/dsssl-stylesheets/catalog -d stylesheet.dsl
> -t sgml -i output-html -V html-index postgres.sgml
> openjade:pgbench.sgml:767:8:E: document type does not allow element
> "TITLE" here; missing one of "ABSTRACT", "AUTHORBLURB", "MSGSET",
> "CALLOUTLIST", "ITEMIZEDLIST", "ORDEREDLIST", "SEGMENTEDLIST",
> "VARIABLELIST", "CAUTION", "IMPORTANT", "NOTE", "TIP", "WARNING",
> "FORMALPARA", "BLOCKQUOTE", "EQUATION", "EXAMPLE", "FIGURE", "TABLE",
> "PROCEDURE", "SIDEBAR", "QANDASET", "REFSECT3" start-tag
> make[1]: *** [HTML.index] Error 1
> make[1]: *** Deleting file `HTML.index'
> make[1]: Leaving directory `/home/pavel/src/postgresql/doc/src/sgml

Hmmm, I've never gotten the docs to build at all; all I get is a heap of
SGML/DocBook related issues. Can you investigate a bit more where the
issue is? I don't remember messing with the docs in a way that might
cause this ... mostly copy'n'paste edits.

> * pgbench is compiled without warnings
> 
> * there is a few issues in source code
> 
> +
> + /* should we aggregate the results or not? */
> + if (use_log_agg)
> + {
> +
> + /* are we still in the same interval? if yes, accumulate the
> +  * values (print them otherwise) */
> + if (agg->start_time + use_log_agg >= INSTR_TIME_GET_DOUBLE(now))
> + {
> +

Errr, so what are the issues here?

> 
> * a real time range for aggregation is dynamic (pgbench is not
> realtime application), so I am not sure if following constraint has
> sense
> 
> + if ((duration > 0) && (use_log_agg > 0) && (duration % use_log_agg != 0)) {
> + fprintf(stderr, "duration (%d) must be a multiple of aggregation interval (%d)\n", duration, use_log_agg);
> + exit(1);
> + }

I'm not sure what the issue might be here either. If the test duration
is set (using the "-T" option), then this kicks in and says that the
duration should be a multiple of the aggregation interval. Seems like a
sensible assumption to me. Or do you think this check should be removed?

> * it doesn't flush last aggregated data (it is mentioned by Tomas:
> "Also, I've realized the last interval may not be logged at all - I'll
> take a look into this in the next version of the patch."). And it can
> be significant for higher values of "use_log_agg"

Yes, and I'm still not sure how to fix this in practice. But I do have
this on my TODO.

> 
> * a name of variable "use_log_agg" is not good - maybe "log_agg_interval" ?

I've changed it to agg_interval.

regards
Tomas
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 090c210..2093edc 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.

Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-11 Thread Tom Lane
[ this time *with* the patch ]

Attached is a complete draft patch against HEAD for this issue.  I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some
bug of this type.  In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.)  Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes.  There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s).  If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action.  This needs to be
looked at closer by somebody who's more up on the GIST code than I am.
The attached patch doesn't try very hard to make the GIST functions
safe, it just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already.  This is an index corruption bug not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's "merge lists" case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed.  As is, this patch uses a cleanup
lock, but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
anything that emits XLOG_GIST_PAGE_DELETE.

regards, tom lane



binVAAMKoREwG.bin
Description: restore-backup-blocks-individually-2.patch.gz

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inadequate thought about buffer locking during hot standby replay

2012-11-11 Thread Tom Lane
Attached is a complete draft patch against HEAD for this issue.  I found
a depressingly large number of pre-existing bugs:

Practically all WAL record types that touch multiple pages have some
bug of this type.  In addition to btree_xlog_split, I found that
heap_xlog_update, ginRedoDeletePage, spgRedoAddLeaf, spgRedoMoveLeafs,
spgRedoAddNode, spgRedoSplitTuple, and spgRedoPickSplit fail to hold
locks as required to make their updates safe for concurrent queries.
(I'm not totally sure about ginRedoDeletePage, but the original action
definitely locks the pages simultaneously, and it's not clear that it's
safe not to.)  Most of these are okay in cases without any full-page
images, but could fail if the wrong subset of the pages-to-be-touched
were processed by RestoreBkpBlocks.  Some had bugs even without that :-(

Worse than that, GIST WAL replay seems to be a total disaster from this
standpoint, and I'm not convinced that it's even fixable without
incompatible WAL changes.  There are several cases where index
insertion operations generate more than one WAL record while holding
exclusive lock on buffer(s).  If the lock continuity is actually
necessary to prevent concurrent queries from seeing inconsistent index
states, then we have a big problem, because WAL replay isn't designed to
hold any buffer lock for more than one WAL action.  This needs to be
looked at closer by somebody who's more up on the GIST code than I am.
The attached patch doesn't try very hard to make the GIST functions
safe, it just transposes them to the new API.

I also found that spgRedoAddNode could fail to update the parent
downlink at all, if the parent tuple is in the same page as either the
old or new split tuple --- it would get fooled by the LSN having been
advanced already.  This is an index corruption bug not just a
transient-failure problem.

Also, ginHeapTupleFastInsert's "merge lists" case fails to mark the old
tail page as a candidate for a full-page image; in the worst case this
could result in torn-page corruption.

I already pointed out the inconsistency in heap_xlog_freeze about
whether a cleanup lock is needed.  As is, this patch uses a cleanup
lock, but I suspect that a regular lock is sufficient --- comments?

Also, gistRedoPageDeleteRecord seems to be dead code?  I don't see
anything that emits XLOG_GIST_PAGE_DELETE.

regards, tom lane

#application/octet-stream [restore-backup-blocks-individually-2.patch.gz] 
/home/tgl/pgsql/restore-backup-blocks-individually-2.patch.gz


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-11 Thread Jeff Davis
On Sun, 2012-11-11 at 21:20 +0100, Pavel Stehule wrote:
> I don't think a GUC is good for this purpose, but I don't like
> single-purpose statements either.
> 
> What do you think about enhancing the ALTER DATABASE statement,
> something like
> 
> ALTER DATABASE name ENABLE CHECKSUMS and ALTER DATABASE name DISABLE CHECKSUMS

Per-database does sound easier than per-table. I'd have to think about
how that would affect shared catalogs though.

For now, I'm leaning toward an offline utility to turn checksums on or
off, called pg_checksums. It could do so lazily (just flip a switch to
"enabling" in pg_control), or it could do so eagerly and turn it into a
fully-protected instance.

For the first patch, it might just be an initdb-time option for
simplicity.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-11 Thread Pavel Stehule
Hello

>
>> > Does it make sense to store this information in pg_control? That doesn't
>> > require adding any new file, and it has the benefit that it's already
>> > checksummed. It's available during recovery and can be made available
>> > pretty easily in the places where we write data.
>> >
>> > And the next question is what commands to add to change state. Ideas:
>> >
>> >CHECKSUMS ENABLE; -- set state to "Enabling"
>> >CHECKSUMS DISABLE; -- set state to "Off"
>>
>> Don't like this, please make it a GUC.
>
> I'll see if you have ideas about how to resolve the problems with a GUC
> that I mentioned above. But if not, then what about using a utility,
> perhaps called pg_checksums? That way we wouldn't need new syntax.

I don't think a GUC is good for this purpose, but I don't like
single-purpose statements either.

What do you think about enhancing the ALTER DATABASE statement,
something like

ALTER DATABASE name ENABLE CHECKSUMS and ALTER DATABASE name DISABLE CHECKSUMS

Regards

Pavel


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Matthew Gerber
On Sun, Nov 11, 2012 at 2:43 PM, Noah Misch  wrote:

> On Sun, Nov 11, 2012 at 10:10:31AM -0500, Matthew Gerber wrote:
> > > > Matthew Gerber  writes:
> > > > >> Here is the command that was executing when the 0xC0000409
> exception
> > > was
> > > > >> raised:
> > > > >> INSERT INTO places
> (bounding_box,country,full_name,id,name,type,url)
> > > > >> VALUES
> > > > >> (st_transform_null(ST_GeometryFromText('POLYGON((-97.034085
> > > > >> 32.771786,-97.034085 32.953966,-96.888789 32.953966,-96.888789
> > > > >> 32.771786,-97.034085 32.771786))',4326),26918),'United
> > > States','Irving,
> > > > >> TX','dce44ec49eb788f5','Irving','city','
> > > > >> http://api.twitter.com/1/geo/id/dce44ec49eb788f5.json'),
>
> > A slight update on this:  previously, my insert code involved a long
> > "SELECT ... UNION ALL ... SELECT ... UNION ALL ..." command. If this
> > command was too long, I would get a stack depth exception thrown back to
> my
> > client application. I reduced the length of the command to eliminate the
> > client-side exceptions, but on some occasions I would still get the
> > 0xC0000409 crash on the server side. I have eliminated the long "SELECT
> ...
> > UNION ALL ... " command, and I now get no errors on either side. So it
> > seems like long commands like this were actually causing the server-side
> > crashes. The proper behavior would seem to be throwing the exception back
> > to the client application instead of crashing the server.
>
> Above, you quoted an INSERT ... VALUES of two rows.  Have you observed an
> exception-0xC0000409 crash with an INSERT ... VALUES query, or only with an
> "INSERT ... SELECT ... thousands of UNION" query?
>

Every time the server crashed with 0xC0000409, the log reported that it was
running the simple INSERT command (two rows) that I started this thread
with. However, this didn't make any sense to me given the simplicity of the
INSERT command and the fact that the error indicated a stack overflow. So I
removed the long "SELECT ... UNION ALL ..." command since it seemed more
relevant to the error, and the process has been running continuously for a
few days now.

To answer your question directly:  I was seeing the server crash when using
the simple INSERT and long "SELECT ... UNION ..." (these commands are
issued independently at different points in the program). Now my program is
only using the simple INSERT, and the crashes are gone.

Hope this helps...

Matt


Re: [HACKERS] Enabling Checksums

2012-11-11 Thread Jeff Davis
On Fri, 2012-11-09 at 09:57 -0800, Josh Berkus wrote:
> Huh?  Why would a GUC not make sense?  How else would you make sure that
> checksums where on when you started the system?

If we stored the information in pg_control, you could check with
pg_controldata. We could have a separate utility, pg_checksums, that can
alter the state and/or do an offline verification. And initdb would take
an option that would start everything out fully protected with
checksums.

The problem with a GUC is that checksums aren't really something you can
change by just changing the variable and restarting, unless you are only
using checksums opportunistically (only write checksums when a page is
dirtied and only verify a checksum if the header indicates that it's
present).
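
For what it's worth, the "opportunistic" behaviour described above amounts
to something like the following on the read side (a sketch with made-up
helper names, not the real bufpage API):

    /* Sketch only: helper names are hypothetical. */
    if (page_claims_checksum(page))         /* header flag says checksum set */
    {
        if (!page_checksum_matches(page, blkno))
            ereport(ERROR,
                    (errmsg("checksum failure in block %u", blkno)));
    }
    /*
     * Pages without the flag are simply trusted, which is the weakness: a
     * corrupted header can silently turn verification off.
     */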

There are also usability issues. If someone has a fully-protected
instance, and turns the GUC off, and starts the server, they'll lose the
"fully-protected" status on the first write, and have to re-read all the
data to get back to fully protected. That just doesn't seem right to me.

> Well, large databases would tend to be stuck permanently in "Enabling",
> because the user would never vacuum old cold partitions in order to
> checksum them.  So we need to be prepared for this to be the end state
> for a lot of databases.

That may be true, but if that's the case, it's more like a 3-bit
checksum than a 16-bit checksum, because of the page-header corruption
problem. I don't know of any way to give those users more than that,
which won't be good enough for the set-at-initdb time users.

> In fact, we'd need three settings for the checksum GUC:
> 
> OFF -- don't checksum anything, equal to state (1) above
> 
> WRITES -- checksum pages which are being written anyway, but ignore
> tables which aren't touched.  Permanent "Enabling" state.
> 
> ALL -- checksum everything you can.  particularly, autovacuum would
> checksum any table which was not already checksummed at the next vacuum
> of that table.  Goal is to get to state 3 above.

That's slightly more eager, but it's basically the same as the WRITES
state. In order to get to the fully-protected state, you still need to
somehow make sure that all of the old data is checksummed.

And the "fully protected" state is important in my opinion, because
otherwise we aren't protected against corrupt page headers that say they
have no checksum (even when it really should have a checksum).

> > Does it make sense to store this information in pg_control? That doesn't
> > require adding any new file, and it has the benefit that it's already
> > checksummed. It's available during recovery and can be made available
> > pretty easily in the places where we write data.
> > 
> > And the next question is what commands to add to change state. Ideas:
> > 
> >CHECKSUMS ENABLE; -- set state to "Enabling"
> >CHECKSUMS DISABLE; -- set state to "Off"
> 
> Don't like this, please make it a GUC.

I'll see if you have ideas about how to resolve the problems with a GUC
that I mentioned above. But if not, then what about using a utility,
perhaps called pg_checksums? That way we wouldn't need new syntax.

> As there's no such thing as system-wide vacuum, we're going to have to
> track whether a table is "fully checksummed" in the system catalogs.

It seems like this is going down the road of per-table checksums. I'm
not opposed to that, but that has a low chance of making 9.3.

Let's try to do something simpler now that leaves open the possibility
of more flexibility later. I'm inclined to agree with Robert that the
first patch should probably be an initdb-time option. Then, we can allow
a lazy mode (like your WRITES state) and an eager offline check with a
pg_checksums utility. Then we can work towards per-table checksums,
control via VACUUM, protecting the SLRU, treating zero pages as invalid,
protecting temp files (which can be a GUC), replication integration,
etc.

> Hmmm, better to have a 2nd GUC:
> 
> checksum_fail_action = WARNING | ERROR
> 
> ... since some people want the write or read to fail, and others just
> want to see it in the logs.

Checksums don't introduce new failure modes on writes, only on reads.

And for reads, I think we have a problem doing anything less than an
ERROR. If we allow the read to succeed, we either risk a crash (or
silently corrupting other buffers in shared memory), or we have to put a
zero page in its place. But we already have the zero_damaged_pages
option, which I think is better because reading corrupt data is only
useful for data recovery efforts.

> So, thinking about it, state (3) is never the state of an entire
> installation; it's always the state of individual tables.

That contradicts the idea of using a GUC then. It would make more sense
to have extra syntax or extra VACUUM modes to accomplish that per-table.

Unfortunately, I'm worried that the per-table approach will not be
completed by 9.3. Do you see something about my proposal that makes it
harder to get where we want to go in th

Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Noah Misch
On Sun, Nov 11, 2012 at 10:10:31AM -0500, Matthew Gerber wrote:
> > > Matthew Gerber  writes:
> > > >> Here is the command that was executing when the 0xC0000409 exception
> > was
> > > >> raised:
> > > >> INSERT INTO places (bounding_box,country,full_name,id,name,type,url)
> > > >> VALUES
> > > >> (st_transform_null(ST_GeometryFromText('POLYGON((-97.034085
> > > >> 32.771786,-97.034085 32.953966,-96.888789 32.953966,-96.888789
> > > >> 32.771786,-97.034085 32.771786))',4326),26918),'United
> > States','Irving,
> > > >> TX','dce44ec49eb788f5','Irving','city','
> > > >> http://api.twitter.com/1/geo/id/dce44ec49eb788f5.json'),

> A slight update on this:  previously, my insert code involved a long
> "SELECT ... UNION ALL ... SELECT ... UNION ALL ..." command. If this
> command was too long, I would get a stack depth exception thrown back to my
> client application. I reduced the length of the command to eliminate the
> client-side exceptions, but on some occasions I would still get the
> 0xC0000409 crash on the server side. I have eliminated the long "SELECT ...
> UNION ALL ... " command, and I now get no errors on either side. So it
> seems like long commands like this were actually causing the server-side
> crashes. The proper behavior would seem to be throwing the exception back
> to the client application instead of crashing the server.

Above, you quoted an INSERT ... VALUES of two rows.  Have you observed an
exception-0xC0000409 crash with an INSERT ... VALUES query, or only with an
"INSERT ... SELECT ... thousands of UNION" query?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: optimized DROP of multiple tables within a transaction

2012-11-11 Thread Tomas Vondra
Hi,

I've prepared a slightly updated patch, based on the previous review.
See it attached.

On 18.10.2012 04:28, 花田 茂 wrote:
> Hi Tomas,
>
> On 2012/10/17, at 20:45, Tomas Vondra  wrote:
>>
>> Dne 17.10.2012 12:34, Shigeru HANADA napsal:
>>> Performance test
>>> 
>>> I tested 1000 tables case (each is copy of pgbench_branches with 
>>> 100000 rows) on 1GB shared_buffers server. Please note that I 
>>> tested on MacBook air, i.e. storage is not HDD but SSD. Here is 
>>> the test procedure:
>>>
>>> 1) loop 1000 times
>>> 1-1) create copy table of pgbench_accounts as accounts$i
>>> 1-2) load 100000 rows
>>> 1-3) add primary key
>>> 1-4) select all rows to cache pages in shared buffer
>>> 2) BEGIN
>>> 3) loop 1000 times
>>> 3-1) DROP TABLE accounts$i
>>> 4) COMMIT
>>
>> I don't think the 'load rows' and 'select all rows' is really 
>> necessary. And AFAIK sequential scans use small circular buffer
>> not to pollute sharedbuffers, so I'd guess the pages are not cached
>> in shared buffers anyway. Have you verified that, e.g. by
>> pg_buffercache?
>
> Oops, you're right. I omitted 1-3 and 1-4 in actual measurement, but
> IMO loading data is necessary to fill the shared buffer up, because
> # of buffers which are deleted during COMMIT is major factor of this
> patch. And, yes, I verified that all shared buffers are used after
> all loading have been finished.

I don't think it's an important factor, as the original code does this:

  for (i = 0; i < NBuffers; i++)
  {
volatile BufferDesc *bufHdr = &BufferDescriptors[i];
...
  }

in the DropRelFileNodeAllBuffers. That loops through all shared buffers
no matter what, so IMHO the performance in this case depends on the
total size of the shared buffers. But maybe I'm missing something important.

 Our system creates a lot of "working tables" (even 100.000)
 and we need to perform garbage collection (dropping obsolete 
 tables) regularly. This often took ~1 hour, because we're using
 big AWS instances with lots of RAM (which tends to be slower
 than RAM on bare hw). After applying this patch and dropping
 tables in groups of 100, the gc runs in less than 4 minutes
 (i.e. a 15x speed-up).
>>>
>>> Hm, my environment seems very different from yours. Could you
>>> show the setting of shared_buffers in your environment? I'd like
>>> to make my test environment as similar as possible to yours.
>>
>> We're using m2.4xlarge instances (70G of RAM) with 10GB shared
>> buffers.
>
> Thank you, it's more huge than I expected. I'm not sure whether my
> boss allows me to use such rich environment... :(

I've done a simple benchmark on my laptop with 2GB shared buffers, it's
attached in the drop-test.py (it's a bit messy, but it works).

It does four things:

1) creates N tables
2) drops them one by one
3) creates N tables
4) drops them in batches (batch = one transaction)

To run the script, simply specify the number of tables you want to work
with (say 10000), the size of the batch (e.g. 100) and a connection string
(e.g. 'host=localhost dbname=mydb').

With those parameters, I got these numbers on the laptop:

  creating 10000 tables
all tables created in 3.298694 seconds
  dropping 10000 tables one by one ...
all tables dropped in 32.692478 seconds
  creating 10000 tables
all tables created in 3.458178 seconds
  dropping 10000 tables in batches by 100...
all tables dropped in 3.28268 seconds

So it's 33 seconds vs. 3.3 seconds, i.e. 10x speedup. On AWS we usually
get larger differences, as we use larger shared buffers and the memory
is significantly slower there.

Regarding the other comments:

> * As Robert commented, this patch adds DropRelFileNodeAllBuffersList 
> by copying code from DropRelFileNodeAllBuffers. Please refactor it
> to avoid code duplication.

Yes, I've merged the two functions, i.e. I've removed the original
DropRelFileNodeAllBuffers and used the name for the new one (taking
array instead of single relnode). Then I've modified the existing calls
to to use

DropRelFileNodeAllBuffers(&relnode, 1)

instead of the original one

DropRelFileNodeAllBuffers(relnode)

Maybe this should be done for smgrdounlink/smgrdounlinkall too.

> * In smgrDoPendingDeletes, you free srels explicitly. Can't we leave
>  them to memory context stuff? Even if it is required, at least pfree 
> must be called in the case nrels == 0 too.

Hmmm, right. Not sure which choice is better, so for now I've moved the
pfree out of the 'if' block, so it's executed for 'nrels==0' too.

> * In smgrDoPendingDeletes, the buffer srels is expanded in every 
> iteration. This seems a bit inefficient. How about doubling the 
> capacity when used it up? This requires additional variable, but 
> reduces repalloc call considerably.

Done, although I haven't seen any significant speed improvement.
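
For reference, the kind of doubling growth being described is sketched
below; the variable names are illustrative rather than taken from the patch:

    /* grow the array geometrically instead of repalloc'ing every iteration */
    if (nrels >= maxrels)
    {
        maxrels *= 2;
        srels = (RelFileNodeBackend *)
            repalloc(srels, maxrels * sizeof(RelFileNodeBackend));
    }
    srels[nrels++] = rnode;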

> * Just an idea, but if we can remove entries for local relations from
> rnodes array before buffer loop in DropRelFileNodeAllBuffersList, 
> following 

Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Noah Misch
On Sun, Nov 11, 2012 at 12:22:24PM -0500, Tom Lane wrote:
> perl -e 'print "SELECT 1 a, 2 b, 3 c\n"; print "UNION ALL SELECT 1 a, 2 b, 3 c\n" foreach (1..8200);' | psql
> 
> On the machine I tried this on, it works up to about 8200 and then fails
> in the way I'd expect:
> 
> ERROR:  stack depth limit exceeded
> HINT:  Increase the configuration parameter "max_stack_depth" (currently 
> 2048kB), after ensuring the platform's stack depth limit is adequate.
> 
> But then when I cranked it up to 80000, kaboom:
> 
> connection to server was lost

I tried this test case on Windows Server 2008 (x64).  It hit max_stack_depth
at 9000 UNIONs and crashed at 10000.  When I run it under a debugger, the
debugger reports exception 0xC00000FD (STATUS_STACK_OVERFLOW).  Run normally,
the server log reports exception 0xC0000005 (STATUS_ACCESS_VIOLATION).

> Inspection of the core dump shows transformSetOperationTree is the
> problem --- it's recursing but lacks a check_stack_depth test.
> So that's easy to fix, but I wonder why the critical depth limit seems
> to be so much less on your machine.  I get the expected error up to
> about 65000 UNION ALLs --- why is yours crashing at a tenth of that?

So, I can reproduce the lower threshold, but the exception type does not agree
with the one Matthew observed.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2012-11-11 Thread Jeff Davis
On Sat, 2012-11-10 at 14:46 +0100, Florian Pflug wrote:
> > The bit indicating that a checksum is present may be lost due to
> > corruption.
> 
> Though that concern mostly goes away if instead of a separate bit we use a
> special checksum value, say 0xDEAD, to indicate that the page isn't
> checksummed, no?

Right. But then we have an upgrade impact to set the checksum to 0xDEAD
on all existing pages, which seems to eliminate most of the possible
reason for it.

Also, we'd need to tweak the algorithm to make sure that it never landed
on that magic value. So if we think we might want this in the future, we
should reserve that magic value now.
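
As a sketch of what "never land on the magic value" could mean in practice
(the helper name is hypothetical, and 0xDEAD is just the value floated in
this thread):

    /* Remap the computed value so 0xDEAD stays reserved for "no checksum". */
    uint16 csum = compute_page_checksum(page, blkno);    /* hypothetical */

    if (csum == 0xDEAD)
        csum = 0xDEAE;      /* any fixed substitute keeps 0xDEAD reserved */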

But I can't think of many reasons for it, unless we expect people to be
turning checksums on and off repeatedly.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Tom Lane
Matthew Gerber  writes:
> Interesting. I really have no idea why mine seemed to fail so much sooner.
> I recalled my 5k-7k figure from memory, so I might be off on that, but
> probably not by an order of magnitude. In any case, it sounds like you know
> how to fix the problem. Should I file this as a bug report or will you take
> care of it from here?

I will fix the crash I'm seeing; I'm just not 100% convinced it's the
same crash you're seeing.  It'd be useful if some other folk would poke
at similar examples on Windows to see if there's something unexpected
happening with stack depth checks there.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Matthew Gerber
On Sun, Nov 11, 2012 at 12:22 PM, Tom Lane  wrote:

> Matthew Gerber  writes:
> > On Sun, Nov 11, 2012 at 11:19 AM, Tom Lane  wrote:
> >> How long is "long"?
>
> > I was seeing queries with around 5000-7000  "UNION ALL" statements.
>
> Hm.  I experimented with test queries created like so:
>
> perl -e 'print "SELECT 1 a, 2 b, 3 c\n"; print "UNION ALL SELECT 1 a, 2 b, 3 c\n" foreach (1..8200);' | psql
>
> On the machine I tried this on, it works up to about 8200 and then fails
> in the way I'd expect:
>
> ERROR:  stack depth limit exceeded
> HINT:  Increase the configuration parameter "max_stack_depth" (currently
> 2048kB), after ensuring the platform's stack depth limit is adequate.
>
> But then when I cranked it up to 80000, kaboom:
>
> connection to server was lost
>
> Inspection of the core dump shows transformSetOperationTree is the
> problem --- it's recursing but lacks a check_stack_depth test.
> So that's easy to fix, but I wonder why the critical depth limit seems
> to be so much less on your machine.  I get the expected error up to
> about 65000 UNION ALLs --- why is yours crashing at a tenth of that?
>

Tom,

Interesting. I really have no idea why mine seemed to fail so much sooner.
I recalled my 5k-7k figure from memory, so I might be off on that, but
probably not by an order of magnitude. In any case, it sounds like you know
how to fix the problem. Should I file this as a bug report or will you take
care of it from here?

Best,
Matt


Re: [HACKERS] too much pgbench init output

2012-11-11 Thread Tomas Vondra
On 23.10.2012 18:21, Robert Haas wrote:
> On Tue, Oct 23, 2012 at 12:02 PM, Alvaro Herrera
>  wrote:
>> Tomas Vondra wrote:
>>
>>> I've been thinking about this a bit more, and do propose to use an
>>> option that determines "logging step" i.e. number of items (either
>>> directly or as a percentage) between log lines.
>>>
>>> The attached patch defines a new option "--logging-step" that accepts
>>> either integers or percents. For example if you want to print a line
>>> each 1000 lines, you can to this
>>>
>>>   $ pgbench -i -s 1000 --logging-step 1000 testdb
>>
>> I find it hard to get excited about having to specify a command line
>> argument to tweak this.  Would it work to have it emit messages
>> depending on elapsed time and log scale of tuples emitted?  So for
>> example emit the first message after 5 seconds or 100k tuples, then back
>> off until (say) 15 seconds have lapsed and 1M tuples, etc?  The idea is
>> to make it verbose enough to keep a human satisfied with what he sees,
>> but not flood the terminal with pointless updates.  (I think printing
>> the ETA might be nice as well, not sure).
> 
> I like this idea.  One of the times when the more verbose output is
> really useful is when you expect it to run fast but then it turns out
> that for some reason it runs really slow.  If you make the output too
> terse, then you end up not really knowing what's going on.  Having it
> give an update at least every 5 seconds would be a nice way to give
> the user a heads-up if things aren't going as planned, without
> cluttering the normal case.

I've prepared a patch along these lines. The attached version uses only
elapsed time to decide when to print the log messages, so now it prints
a message every 5 seconds no matter what, along with an estimate of the
remaining time.

I've removed the config option, although it might be useful to specify
the interval?

I'm not entirely sure how the 'log scale of tuples' should work - for
example when the 15 second time limit is reached, should it be reset
back to the previous step (5 seconds) to give more detailed info, or
should it be kept at 15 seconds?

Tomas

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 090c210..3cbb564 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -135,6 +135,11 @@ int    unlogged_tables = 0;
 double sample_rate = 0.0;
 
 /*
+ * logging steps (seconds between log messages)
+ */
+int    log_step_seconds = 5;
+
+/*
  * tablespace selection
  */
 char  *tablespace = NULL;
@@ -1362,6 +1367,11 @@ init(bool is_no_vacuum)
char    sql[256];
int i;
 
+   /* used to track elapsed time and estimate of the remaining time */
+   instr_time  start, diff;
+   double elapsed_sec, remaining_sec;
+   int log_interval = 1;
+
if ((con = doConnect()) == NULL)
exit(1);
 
@@ -1430,6 +1440,8 @@ init(bool is_no_vacuum)
}
PQclear(res);
 
+   INSTR_TIME_SET_CURRENT(start);
+
for (i = 0; i < naccounts * scale; i++)
{
int j = i + 1;
@@ -1441,10 +1453,28 @@ init(bool is_no_vacuum)
exit(1);
}
 
-   if (j % 100000 == 0)
-   fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
-   j, naccounts * scale,
-   (int) (((int64) j * 100) / (naccounts * scale)));
+   /* let's not call the timing for each row, but only each 100 rows */
+   if (j % 100 == 0 || j == scale * naccounts)
+   {
+   INSTR_TIME_SET_CURRENT(diff);
+   INSTR_TIME_SUBTRACT(diff, start);
+
+   elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+   remaining_sec = (scale * naccounts - j) * elapsed_sec / j;
+
+   /* have we reached the next interval? */
+   if (elapsed_sec >= log_interval * log_step_seconds) {
+
+   fprintf(stderr, "%d of %d tuples (%d%%) done (elapsed %.2f s, remaining %.2f s).\n",
+   j, naccounts * scale,
+   (int) (((int64) j * 100) / (naccounts * scale)), elapsed_sec, remaining_sec);
+
+   /* skip to the next interval */
+   log_interval = (int) ceil(elapsed_sec / log_step_seconds);
+
+   }
+   }
+
}
if (PQputline(con, "\\.\n"))
{

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Tom Lane
Matthew Gerber  writes:
> On Sun, Nov 11, 2012 at 11:19 AM, Tom Lane  wrote:
>> How long is "long"?

> I was seeing queries with around 5000-7000  "UNION ALL" statements.

Hm.  I experimented with test queries created like so:

perl -e 'print "SELECT 1 a, 2 b, 3 c\n"; print "UNION ALL SELECT 1 a, 2 b, 3 c\n" foreach (1..8200);' | psql

On the machine I tried this on, it works up to about 8200 and then fails
in the way I'd expect:

ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth" (currently 
2048kB), after ensuring the platform's stack depth limit is adequate.

But then when I cranked it up to 80000, kaboom:

connection to server was lost

Inspection of the core dump shows transformSetOperationTree is the
problem --- it's recursing but lacks a check_stack_depth test.
So that's easy to fix, but I wonder why the critical depth limit seems
to be so much less on your machine.  I get the expected error up to
about 65000 UNION ALLs --- why is yours crashing at a tenth of that?
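
The kind of fix being described would presumably just add a stack depth
check at the top of the recursive function, along these lines (a simplified
sketch, not the committed patch; the signature is reproduced from memory):

    static Node *
    transformSetOperationTree(ParseState *pstate, SelectStmt *stmt,
                              bool isTopLevel, List **targetlist)
    {
        /* fail cleanly instead of overrunning the stack on deep nestings */
        check_stack_depth();

        /* ... existing recursive processing ... */
    }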

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Matthew Gerber
On Sun, Nov 11, 2012 at 11:19 AM, Tom Lane  wrote:

> Matthew Gerber  writes:
> > On Sun, Nov 11, 2012 at 12:23 AM, Noah Misch  wrote:
> >> It signifies scribbling past the end of the frame's local variables:
> >> http://msdn.microsoft.com/en-us/library/8dbf701c.aspx
>
> > A slight update on this:  previously, my insert code involved a long
> > "SELECT ... UNION ALL ... SELECT ... UNION ALL ..." command.
>
> How long is "long"?
>

I was seeing queries with around 5000-7000 "UNION ALL" statements.

Matt


Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Tom Lane
Matthew Gerber  writes:
> On Sun, Nov 11, 2012 at 12:23 AM, Noah Misch  wrote:
>> It signifies scribbling past the end of the frame's local variables:
>> http://msdn.microsoft.com/en-us/library/8dbf701c.aspx

> A slight update on this:  previously, my insert code involved a long
> "SELECT ... UNION ALL ... SELECT ... UNION ALL ..." command.

How long is "long"?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unresolved error 0xC0000409 on Windows Server

2012-11-11 Thread Matthew Gerber
On Sun, Nov 11, 2012 at 12:23 AM, Noah Misch  wrote:

> On Sun, Nov 04, 2012 at 02:30:38PM -0500, Tom Lane wrote:
> > Matthew Gerber  writes:
> > >> Here is the command that was executing when the 0xC0000409 exception
> was
> > >> raised:
> > >> INSERT INTO places (bounding_box,country,full_name,id,name,type,url)
> > >> VALUES
> > >> (st_transform_null(ST_GeometryFromText('POLYGON((-97.034085
> > >> 32.771786,-97.034085 32.953966,-96.888789 32.953966,-96.888789
> > >> 32.771786,-97.034085 32.771786))',4326),26918),'United
> States','Irving,
> > >> TX','dce44ec49eb788f5','Irving','city','
> > >> http://api.twitter.com/1/geo/id/dce44ec49eb788f5.json'),
> >
> > > Assuming that 0xC0000409 does actually represent a stack-overrun
> > condition,
>
> It signifies scribbling past the end of the frame's local variables:
> http://msdn.microsoft.com/en-us/library/8dbf701c.aspx
>

A slight update on this: previously, my insert code involved a long
"SELECT ... UNION ALL ... SELECT ... UNION ALL ..." command. If this
command was too long, a stack depth exception was thrown back to my
client application. I reduced the length of the command to eliminate the
client-side exceptions, but on some occasions I would still get the
0xC0000409 crash on the server side. I have since eliminated the long
"SELECT ... UNION ALL ..." command entirely, and now I get no errors on
either side. So it seems that commands of this length really were causing
the server-side crashes. The proper behavior would be to throw the exception
back to the client application instead of crashing the server.
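
For anyone working around this on the client side before a server fix is
available: one way to keep each statement small is to push the rows through a
prepared statement inside a single transaction instead of concatenating
thousands of UNION ALL branches. This is only an illustration, not my actual
code - the connection string, the simplified places(id, name) table and the
row count are invented:

/* build with e.g.:  cc batch_insert.c -I$(pg_config --includedir) -lpq */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		PQfinish(conn);
		return 1;
	}

	PQclear(PQexec(conn, "BEGIN"));

	/* one small statement, executed many times, instead of one huge one */
	res = PQprepare(conn, "ins_place",
					"INSERT INTO places (id, name) VALUES ($1, $2)", 2, NULL);
	if (PQresultStatus(res) != PGRES_COMMAND_OK)
	{
		fprintf(stderr, "prepare failed: %s", PQerrorMessage(conn));
		PQclear(res);
		PQfinish(conn);
		return 1;
	}
	PQclear(res);

	for (int i = 0; i < 7000; i++)
	{
		char		idbuf[32];
		const char *values[2];

		snprintf(idbuf, sizeof(idbuf), "%d", i);
		values[0] = idbuf;
		values[1] = "example name";

		res = PQexecPrepared(conn, "ins_place", 2, values, NULL, NULL, 0);
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
		{
			fprintf(stderr, "insert failed: %s", PQerrorMessage(conn));
			PQclear(res);
			PQclear(PQexec(conn, "ROLLBACK"));
			PQfinish(conn);
			return 1;
		}
		PQclear(res);
	}

	PQclear(PQexec(conn, "COMMIT"));
	PQfinish(conn);
	return 0;
}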

Hope this helps...

Matt


Re: [HACKERS] Failing SSL connection due to weird interaction with openssl

2012-11-11 Thread Lars Kanis
On 06.11.2012 21:40, Robert Haas wrote:
> On Tue, Oct 23, 2012 at 4:09 AM, Lars Kanis  wrote:
>> While investigating a ruby-pg issue [1], we noticed that a libpq SSL
>> connection can fail, if the running application uses OpenSSL for other work,
>> too. Root cause is the thread local error queue of OpenSSL, that is used to
>> transmit textual error messages to the application after a failed crypto
>> operation. In case that the application leaves errors on the queue, the
>> communication to the PostgreSQL server can fail with a message left from the
>> previous failed OpenSSL operation, in particular when using non-blocking
>> operations on the socket. This issue with openssl is quite old now - see
>> [3].
>>
>> For [1] it turned out that the issue is subdivided into these three parts:
>> 1. the ruby-openssl binding does not clear the thread local error queue of
>> OpenSSL after a certificate verify
>> 2. OpenSSL makes use of a shared error queue for different crypto contexts.
>> 3. libpq does not ensure a cleared error queue when doing SSL_* calls
>>
>> To 1: Remaining messages on the error queue can generally lead to failing
>> operations, later on. I'd talk to the ruby-openssl developers, to discuss
>> how we can avoid any remaining messages on the queue.
>>
>> To 2: SSL_get_error() inspects the shared error queue under some conditions.
>> It's maybe poor API design, but it's documented behaviour [2]. So we
>> certainly have to get along with it.
>>
>> To 3: To make libpq independent to a previous error state, the error queue
>> might be cleared with a call to ERR_clear_error() prior
>> SSL_connect/read/write as in the attached trivial patch. This would make
>> libpq robust against other uses of openssl within the application.
>>
>> What do you think about clearing the OpenSSL error queue in libpq in that
>> way?
> Shouldn't it be the job of whatever code is consuming the error to
> clear the error queue afterwards?
>
Yes, of course. I already filed a bug against ruby-openssl some weeks ago [1].

But IMHO libpq should be changed too, for the following reasons:

1. The behavior of libpq isn't consistent: blocking calls are already
agnostic to leftover errors in the OpenSSL queue, but non-blocking calls
are not. This is an OpenSSL quirk that gets exposed through the libpq
API this way.

2. libpq reports the wrong errors. Instead of something explicit like
"Remaining errors in openssl error queue. libpq requires a clear error
queue in order to work correctly.", it reports arbitrary foreign errors
that may or may not have anything to do with libpq's own communication.
The documentation for SSL_get_error(3) is pretty unambiguous about the
need to clear the error queue first (see the sketch below).

3. The sensitivity of libpq to the error queue can lead to bugs that
are hard to track down, like this one [2], because a libpq error leads
the developer to look for a bug related to the database connection,
although the issue is in a very different part of the code.
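
To make (2) and (3) concrete, the proposed change boils down to clearing the
queue right before the SSL_* call, so that SSL_get_error() can only ever see
errors produced by that call. A sketch against plain OpenSSL (the wrapper
name is made up; the trivial patch mentioned above does the equivalent inside
libpq itself):

#include <openssl/ssl.h>
#include <openssl/err.h>

/*
 * Sketch only: clear whatever other code left on the thread-local OpenSSL
 * error queue before calling into SSL_*, so that a subsequent
 * SSL_get_error() reflects this call and not some unrelated earlier failure.
 */
int
ssl_read_with_clean_queue(SSL *ssl, void *buf, int len)
{
	int		n;

	ERR_clear_error();		/* drop stale entries from other OpenSSL users */
	n = SSL_read(ssl, buf, len);
	if (n <= 0)
	{
		int		err = SSL_get_error(ssl, n);

		/* handle SSL_ERROR_WANT_READ / WANT_WRITE / SYSCALL / ... here */
		(void) err;
	}
	return n;
}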

Regards,
Lars

[1] http://bugs.ruby-lang.org/issues/7215
[2]
https://bitbucket.org/ged/ruby-pg/issue/142/async_exec-over-ssl-connection-can-fail-on




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Patch to compute Max LSN of Data Pages

2012-11-11 Thread Noah Misch
On Sun, Nov 11, 2012 at 02:18:11AM -0300, Alvaro Herrera wrote:
> Amit kapila wrote:
> > On Saturday, November 10, 2012 10:19 PM Noah Misch wrote:
> > > This patch is now marked Returned with Feedback in the CF, but I see no
> > > on-list feedback.  Did some review happen?
> > 
> > No review happened for this patch.
> > It was returned due to a slight confusion: it was thought to be the same as
> > "Patch for option in pg_resetxlog for restore from WAL files"
> > (https://commitfest.postgresql.org/action/patch_view?id=897 ), for which
> > Heikki has given some comments.
> 
> Oops, sorry, my mistake.  Please reopen it as needing review.

Done.

> I will
> move all the open patches to CF3, unless someone beats me to it.  I
> probably won't be able to get anything done tomorrow, so if somebody has
> a boring Sunday I would appreciate the help.

Likewise.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers