[HACKERS] pg9.4b1: unhelpful error message when creating a collation

2014-05-25 Thread Fabien COELHO


  sh> lsb_release -d
  Description: Ubuntu 14.04 LTS
  Codename: trusty

  sh> uname -a
  Linux sto 3.13.0-24-generic #47-Ubuntu SMP Fri May 2 23:30:00 UTC 2014 x86_64 
x86_64 x86_64 GNU/Linux

  sh> locale -a
  C
  C.UTF-8
  en_AG
  en_AG.utf8
  en_AU.utf8
  en_BW.utf8
  en_CA.utf8
  en_DK.utf8
  en_GB.utf8
  en_HK.utf8
  en_IE.utf8
  en_IN
  en_IN.utf8
  en_NG
  en_NG.utf8
  en_NZ.utf8
  en_PH.utf8
  en_SG.utf8
  en_US.utf8
  en_ZA.utf8
  en_ZM
  en_ZM.utf8
  en_ZW.utf8
  fr_FR.utf8
  POSIX

# with postgresql 9.4 installed from apt.postgresql.org.

  sql> SELECT VERSION();
  PostgreSQL 9.4beta1 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 
4.8.2-19ubuntu1) 4.8.2, 64-bit

  sql> \dOS+
  +- same list as above with locale

  sql> SHOW lc_messages;
  en_US.UTF-8

# stuff which is working as expected:

  sql> CREATE COLLATION test0 (locale='C');
  CREATE COLLATION

  sql> CREATE COLLATION test1 (locale='en_US.UTF-8');
  CREATE COLLATION

  sql> CREATE COLLATION test2 (locale='en_US.UTF8');
  CREATE COLLATION

  sql> CREATE COLLATION test3 (locale='Foo');
  ERROR:  could not create locale test3: No such file or directory
  DETAIL:  The operating system could not find any locale data for the locale name 
Foo.

# stuff which is incoherent:

  sql> CREATE COLLATION french (locale='fr_FR.utf8');
  ERROR:  could not create locale "fr_FR.utf8": Success

The collation creation fails; I am not sure why yet. However, the "error 
... Success" message is especially unhelpful.


Raising client_min_messages does not add more information.

I could not get the same error with 9.3.4.

--
Fabien.




Re: [HACKERS] Priority table or Cache table

2014-05-25 Thread Hannu Krosing
On 05/20/2014 01:46 PM, Fujii Masao wrote:
 On Mon, Mar 17, 2014 at 1:16 PM, Haribabu Kommi
 kommi.harib...@gmail.com wrote:
 ...
 I Implemented a proof of concept patch to see whether the buffer pool
 split can improve the performance or not.

 Summary of the changes:
 1. The priority buffers are allocated as continuous to the shared buffers.
 2. Added new reloption parameter called buffer_pool to specify the
 buffer_pool user wants the table to use.
 I'm not sure if storing the information of priority table into
 database is good
 because this means that it's replicated to the standby and the same table
 will be treated with high priority even in the standby server. I can imagine
 some users want to set different tables as high priority ones in master and
 standby.
There might be a possibility to override this in postgresql.conf for
optimising what you described, but for most uses it is best to be in
the database, at least to get started.

Cheers

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ





[HACKERS] Index-only scans for GIST

2014-05-25 Thread Anastasia Lubennikova
Hi, hackers!
Here are the first results of my work on the GSoC project "Index-only
scans for GIST".

1. The current version of my project code is in my forked repository:
https://github.com/lubennikovaav/postgres/tree/indexonlygist2
The patch is attached.
- This version supports only single-column indexes
- The fetch() method is implemented only for the box opclass (because it's trivial)

2. I tested index-only scans with the SQL script box_test.sql,
and they are noticeably faster (results are in box_test_out).

I'll be glad to get your feedback about this feature.

-- 
Best regards,
Lubennikova Anastasia


indexonlygist_2.0.patch
Description: Binary data


box_test.sql
Description: Binary data


box_test_out.out
Description: Binary data



Re: [HACKERS] Window function optimisation, allow pushdowns of items matching PARTITION BY clauses

2014-05-25 Thread Thomas Mayer

Hello David,

Sorry for the late response. I will try out your changes from a user's 
point of view in mid-June. However, I can't do a trustworthy code review, as 
I'm not an experienced Postgres hacker (yet).


Best Regards
Thomas


On 14.04.2014 13:19, David Rowley wrote:

On 14 April 2014 03:31, Tom Lane t...@sss.pgh.pa.us wrote:

David Rowley dgrow...@gmail.com writes:
  On this thread
 
http://www.postgresql.org/message-id/52c6f712.6040...@student.kit.edu there
  was some discussion around allowing push downs of quals that
happen to be
  in every window clause of the sub query. I've quickly put
together a patch
  which does this (see attached)

I think you should have check_output_expressions deal with this,
instead.
Compare the existing test there for non-DISTINCT output columns.


I've moved the code there and it seems like a much better place for it.
Thanks for the tip.

  Oh and I know that my function
var_exists_in_all_query_partition_by_clauses
  has no business in allpaths.c, I'll move it out as soon as I find
a better
  home for it.

I might be wrong, but I think you could just embed that search loop in
check_output_expressions, and it wouldn't be too ugly.


I've changed the helper function to take the TargetEntry now the same
as targetIsInSortList does for the hasDistinctOn test and I renamed the
helper function to targetExistsInAllQueryPartitionByClauses. It seems
much better, but there's probably a bit of room for improving on the
names some more.

I've included the updated patch with some regression tests.
The final test I added checks that an unused window which would, if used,
cause the pushdown not to take place, still stops the pushdown from
happening... I know you both talked about this case in the other thread,
but I've done nothing for it, as Tom mentioned that this should likely be
fixed elsewhere, if it's even worth worrying about at all. I'm not quite
sure I needed to include this test, but I did anyway;
it can be removed if it is deemed unneeded.

Per Thomas' comments, I added a couple of tests that ensure that the
order of the columns in the PARTITION BY clause doesn't change the
behaviour. Looking at the actual implementation makes this test
seem a bit unneeded, but I thought that the tests should not assume
anything about the implementation, so from that point of view
the extra test seems like a good idea.

For now I can't think of much else to do for the patch, but I'd really
appreciate it, Thomas, if you could put it through its paces if you can
find any time for it, or maybe just give my regression tests a once-over
to see if you can think of any other cases that should be covered.

Regards

David Rowley


--
==
Thomas Mayer
Durlacher Allee 61
D-76131 Karlsruhe
Telefon: +49-721-2081661
Fax: +49-721-72380001
Mobil:   +49-174-2152332
E-Mail:  thomas.ma...@student.kit.edu
===





[HACKERS] pg_recvlogical not accepting -I to specify start LSN position

2014-05-25 Thread Michael Paquier
Hi all,

As written in the subject, pg_recvlogical does not accept the option -I,
though it should:
$ pg_recvlogical -I 0/0
pg_recvlogical: invalid option -- I
Try "pg_recvlogical --help" for more information.
$ pg_recvlogical --help | grep \-I
  -I, --startpos=PTR where in an existing slot should the streaming start
The attached patch corrects that, and at the same time reshuffles the option
letters parsed by getopt_long into alphabetical order.

Regards,
-- 
Michael
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 651cc40..33678d9 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -647,7 +647,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "f:F:nvd:h:o:p:U:wWP:s:S:",
+	while ((c = getopt_long(argc, argv, "d:f:F:h:I:no:p:P:s:S:U:vwW",
 			long_options, option_index)) != -1)
 	{
 		switch (c)



Re: [HACKERS] 9.4 btree index corruption

2014-05-25 Thread Heikki Linnakangas

On 05/21/2014 10:22 PM, Jeff Janes wrote:

Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
with foreign keys, I found some btree index corruption.

28807 VACUUM 2014-05-21 15:33:46.878 PDT:ERROR:  right sibling 4044 of
block 460 is not next child 23513 of block 1264 in index foo_p_id_idx
28807 VACUUM 2014-05-21 15:33:46.878 PDT:STATEMENT:  VACUUM;

It took ~8 hours on 8 cores to encounter this problem.  This is a single
occurrence, it has not yet been reproduced.

I don't know that the partial-writes, or the crash recovery, or the foreign
key, parts of this test are important--it could be a more generic problem
that only happened to be observed here.  Nor do I know yet if it occurs in
9_3_STABLE.

Below is the testing harness and the data directory (massively bloated at
3.7GB once uncompressed).  It is currently in wrap-around shutdown, but
that is the effect of persistent vacuum failures, not the cause of them.
  You can restart the data directory and it will repeat the above sibling
error once autovac kicks in.  I don't know if the bloat is due to the
vacuum failure or if it was already in process before the failures started.
  I've cranked up the logging on that front for future efforts.

I'm using some fast-forward code on the xid consumption so that freezing
occurs more often, and some people have expressed reservations that the
code might be imperfect, and I can't rule that out as the cause (but I've
never traced any other problems back to that code).  But it did make it
through 4 complete wraps before this problem was encountered, so if that is
the problem it must be probabilistic rather than deterministic.

https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing


I downloaded the data directory and investigated. I got this message 
when I started it up:


20392  2014-05-25 05:51:37.835 PDT:ERROR:  right sibling 4044 of block 
460 is not next child 23513 of block 86458 in index foo_p_id_idx
20392  2014-05-25 05:51:37.835 PDT:CONTEXT:  automatic vacuum of table 
jjanes.public.foo


Interestingly, it's complaining about parent page 86458, while yours 
claimed it was 1264. I don't know why; perhaps a huge number of 
insertions happened after that error, causing the parent level pages to 
be split, moving the downlinks it complains about to the right. Did you 
continue running the test after that error occurred?


This is what the tree looks like around those pages:

Level 1:
+-------------+    +-------------+    +-------------+
| Blk 1264    |    | Blk 160180  |    | Blk 86458   |
|             |    |             |    |             |
| Downlinks:  | -> | Downlinks:  | -> | Downlinks:  |
| ...         |    | ...         |    |  1269       |
|             |    |             |    |   460       |
|             |    |             |    | 23513       |
+-------------+    +-------------+    +-------------+


Leaf level:
+-------------+    +-------------+    +-------------+    +-------------+
| Blk 1269    |    | Blk 460     |    | Blk 4044    |    | Blk 23513   |
|             | -> |             | -> | HALF_DEAD   | -> |             |
| (314 items) |    | (empty)     |    |             |    | (212 items) |
+-------------+    +-------------+    +-------------+    +-------------+

Leaf block 4044 is marked as half-dead, and there is no downlink to it 
on the parent block, 86458. That is normal, when a vacuum is interrupted 
during page deletion (in 9.4). The next vacuum ought to continue it. 
However, because block 460 is empty, the next vacuum that comes along 
will try to start the page deletion process on that block. Page deletion 
does a cross-check at the parent level, checking that the next downlink 
on the parent page points to the right sibling of the page we're about 
to delete. However, because the right sibling is already marked as 
half-dead, and the downlink pointing to it has already been removed, the 
cross-check fails.


So, the sanity check is a bit overzealous. It's normal that the right 
sibling has no downlink, if it's marked as half-dead. The trivial fix is 
to just remove the cross-check, but I'd rather not do that because 
generally-speaking more sanity checks are good.


I think we could fetch the right sibling page to see if it's marked as 
half-dead, and if it is, just give up on deleting the current page. This 
vacuum will eventually get to the right sibling page and finish the 
deletion, and the next vacuum will then be able to delete the empty 
page. Or we could finish the deletion of the right page immediately, but 
I'd like to keep the rarely-occurring corner-case codepaths to the minimum.
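
Roughly what I have in mind for the page-deletion path, as a sketch only
(this is not the attached patch; the helper name is made up, and rel /
rightsib stand for the index relation and the block number of the empty
page's right sibling, which the caller already knows at that point):

/*
 * Sketch: return true if the right sibling of the page we are about to
 * delete is already half-dead.  In that case, skip deleting this page
 * for now and let a later vacuum finish the job.
 */
static bool
right_sibling_is_half_dead(Relation rel, BlockNumber rightsib)
{
	Buffer		rbuf = _bt_getbuf(rel, rightsib, BT_READ);
	Page		rpage = BufferGetPage(rbuf);
	BTPageOpaque ropaque = (BTPageOpaque) PageGetSpecialPointer(rpage);
	bool		half_dead = P_ISHALFDEAD(ropaque);

	_bt_relbuf(rel, rbuf);
	return half_dead;
}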



With the attached patch, I was able to get past that error, but when 
VACUUM reaches the end, I got this:


jjanes=# vacuum foo;
ERROR:  database is not accepting commands to avoid wraparound data loss 
in database jjanes

HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

[HACKERS] Could not finish anti-wraparound VACUUM when stop limit is reached

2014-05-25 Thread Heikki Linnakangas
While debugging the B-tree bug that Jeff Janes reported 
(http://www.postgresql.org/message-id/CAMkU=1y=vwf07ay+cpqk_7fpihrctmssv9y99sbghitkxpb...@mail.gmail.com), 
a new issue came up:


If you reach the xidStopLimit, and try to run VACUUM, it fails with error:

jjanes=# vacuum;
ERROR:  database is not accepting commands to avoid wraparound data loss 
in database jjanes

HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

The backtrace looks like this:

#0  errstart (elevel=20, filename=0x9590a0 "varsup.c", lineno=120,
    funcname=0x9593f0 <__func__.10455> "GetNewTransactionId", domain=0x0) at elog.c:249
#1  0x004f7c14 in GetNewTransactionId (isSubXact=0 '\000') at varsup.c:115
#2  0x004f86db in AssignTransactionId (s=0xd62900 <TopTransactionStateData>)
    at xact.c:510
#3  0x004f84a4 in GetCurrentTransactionId () at xact.c:382
#4  0x0062dc1c in vac_truncate_clog (frozenXID=1493663893, minMulti=1)
    at vacuum.c:909
#5  0x0062dc06 in vac_update_datfrozenxid () at vacuum.c:888
#6  0x0062cdf6 in vacuum (vacstmt=0x29e05e0, relid=0, do_toast=1 '\001',
    bstrategy=0x2a5cc38, for_wraparound=0 '\000', isTopLevel=1 '\001')
    at vacuum.c:294
#7  0x007a3c55 in standard_ProcessUtility (parsetree=0x29e05e0,
    queryString=0x29dfbf8 "vacuum ;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
    dest=0x29e0988, completionTag=0x7fff9411a490 "") at utility.c:645

So, vac_truncate_clog() tries to get a new transaction ID, which fails 
because we've already reached the stop-limit. vac_truncate_clog() 
doesn't really need a new XID to be assigned, though, it only uses it to 
compare against datfrozenxid to see if wrap-around has already happened, 
so it could use ReadNewTransactionId() instead.
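
In other words, roughly this (a sketch only, assuming the usual
access/transam.h declarations; the helper name and variable names here
are not the actual ones in vacuum.c):

/*
 * Check for wraparound against the current nextXid without forcing a
 * new XID to be assigned to this (possibly otherwise read-only) vacuum.
 */
static bool
frozenxid_already_wrapped(TransactionId frozenXID)
{
	TransactionId nextXid = ReadNewTransactionId();

	return TransactionIdPrecedes(nextXid, frozenXID);
}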



Jeff's database seems to have wrapped around already, because after 
fixing the above, I get this:


jjanes=# vacuum;
WARNING:  some databases have not been vacuumed in over 2 billion 
transactions

DETAIL:  You might have already suffered transaction-wraparound data loss.
VACUUM

We do not truncate clog when wraparound has already happened, so we 
never get past that point. Jeff advanced the XID counter aggressively with 
some custom C code, so hitting the actual wrap-around is a case of "don't 
do that". Still, the case is quite peculiar: pg_controldata says 
that nextXid is 4/1593661139. The oldest datfrozenxid is equal to that, 
1593661139. So ISTM he managed to not just wrap around, but execute 2 
billion more transactions after the wraparound and reach datfrozenxid 
again. I'm not sure how that happened.


- Heikki




Re: [HACKERS] pg9.4b1: unhelpful error message when creating a collation

2014-05-25 Thread Andres Freund
Hi,

On 2014-05-25 09:17:24 +0200, Fabien COELHO wrote:
   sql CREATE COLLATION french (locale='fr_FR.utf8');
   ERROR:  could not create locale fr_FR.utf8: Success
 
 The collation creation fails, not sure why yet. However, the error ..
 success message is especially unhelpful.

This seems to be a glibc bug. If a nonexistent locale has already been
asked for, errno is set to 0 instead of something sensible.

Using a debugger it's possible to see that in an earlier setlocale()
call errno is set correctly, but newlocale() then fails without setting
errno. The culprit seems to be some caching in glibc's _nl_find_locale().

I am not entirely sure why it's not reliably triggered < 9.4.

Searching for that error turned up:
https://sourceware.org/bugzilla/show_bug.cgi?id=14247
https://bugzilla.redhat.com/show_bug.cgi?id=827510

The latter by Tom Lane ;). Unfortunately not much seems to have happened
since.

Except badgering the glibc guys a bit, the only thing I can think of is
to fudge errno = 0 into errno = ENOENT. Not nice.
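
For concreteness, the fudge would look roughly like this around the
backend's newlocale() call (illustrative sketch only; the wrapper name is
made up, not actual backend code):

#define _GNU_SOURCE             /* locale_t / newlocale() on glibc */
#include <errno.h>
#include <locale.h>

/*
 * If newlocale() fails but the glibc bug described above left errno at 0,
 * substitute ENOENT so the eventual "could not create locale ...: %m"
 * message is at least plausible.
 */
static locale_t
newlocale_fudging_errno(int category_mask, const char *locale)
{
	locale_t	result;

	errno = 0;
	result = newlocale(category_mask, locale, (locale_t) 0);
	if (result == (locale_t) 0 && errno == 0)
		errno = ENOENT;
	return result;
}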

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] [9.5] possible fast path for pinning a page multiple times

2014-05-25 Thread Jeff Davis
Context:

A patch from a while ago was rejected:

http://www.postgresql.org/message-id/1369886097.23418.0.camel@jdavis

Most of the objection seemed to be that extra page pins might happen in
some circumstances, such as this one mentioned by Heikki:

http://www.postgresql.org/message-id/50fd11c5.1030...@vmware.com

That was a single-threaded case, but did represent additional pins being
acquired, which could add up to contention over the
BufMappingPartitionLock if there is other traffic on that lock
partition.

Idea:

Let's say we have a routine PinBufferTag, that's like PinBuffer but it
takes an additional BufferTag argument. When it locks the buffer header,
it would also compare the argument to the buffer's tag, and if they
don't match, return a status indicating that it's the wrong buffer and
don't pin it. In other words, it pins the buffer only if it's the right
one.

Then, we can just have a backend-local cache that maps BufferTag to
buffer ID. If it's missing an entry, or if the entry is wrong, then it
just proceeds with the normal BufferAlloc path. But if the cache holds
the right value, then we completely bypass the BufMappingPartitionLock
while getting the pin.
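
To make that concrete, the routine could have roughly this shape (a sketch
against the 9.4-era buffer header layout, not compile-tested; a real
version would also do the usual PrivateRefCount / resource owner
bookkeeping):

/*
 * Pin the buffer only if it still contains the page identified by *tag.
 * Returns false if the tag no longer matches, in which case the caller
 * falls back to the regular BufferAlloc() path.
 */
static bool
PinBufferTag(volatile BufferDesc *buf, const BufferTag *tag)
{
	bool		ok;

	LockBufHdr(buf);
	if ((buf->flags & BM_VALID) && BUFFERTAGS_EQUAL(buf->tag, *tag))
	{
		buf->refcount++;		/* bump shared refcount; local bookkeeping omitted */
		ok = true;
	}
	else
		ok = false;				/* stale backend-local cache entry */
	UnlockBufHdr(buf);

	return ok;
}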

Before I do too much performance testing of this, is it a correct
approach? It seems too easy.

Regards,
Jeff Davis






Re: [HACKERS] Could not finish anti-wraparound VACUUM when stop limit is reached

2014-05-25 Thread Andres Freund
On 2014-05-25 11:40:09 -0400, Heikki Linnakangas wrote:
 So, vac_truncate_clog() tries to get a new transaction ID, which fails
 because we've already reached the stop-limit. vac_truncate_clog() doesn't
 really need a new XID to be assigned, though, it only uses it to compare
 against datfrozenxid to see if wrap-around has already happened, so it could
 use ReadNewTransactionId() instead.

Right. But IIRC we need one in the vicinity anyway to write new
pg_database et al rows?

 Jeff's database seems to have wrapped around already, because after fixing
 the above, I get this:
 
 jjanes=# vacuum;
 WARNING:  some databases have not been vacuumed in over 2 billion
 transactions
 DETAIL:  You might have already suffered transaction-wraparound data loss.
 VACUUM
 
 We do not truncate clog when wraparound has already happened, so we never
 get past that point. Jeff advanced XID counter aggressively with some custom
 C code, so hitting the actual wrap-around is a case of don't do that.
 Still, the case is quite peculiar: pg_controldata says that nextXid is
 4/1593661139. The oldest datfrozenxid is equal to that, 1593661139. So ISTM
 he managed to not just wrap around, but execute 2 billion more transactions
 after the wraparound and reach datfrozenxid again. I'm not sure how that
 happened.

I think that may be explained by Jeff's changes to varsup.c to use up
xids more rapidly. If JJ_xid is >= 100 it'll possibly jump right over
xidStopLimit.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] assertion in 9.4 with wal_level=logical

2014-05-25 Thread Andres Freund
On 2014-04-18 11:50:55 -0300, Alvaro Herrera wrote:
 Robert Haas wrote:
  On Thu, Apr 17, 2014 at 10:47 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
   It's this (older) assertion in HeapTupleHeaderGetCmax():
  
   
   Assert(TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tup)));
  
   That can allocate memory if xmax is a multixact... Does anybody have a
   better idea to solve this than adding a CritSectionCount == 0 && in
   there?
  
  Blech.  Isn't that just nerfing the assertion?
 
 Well, that's exactly the point.  Most of the time,
 HeapTupleHeaderGetCmax gets called in a non-critical section, and we
 want to run the assertion in that case.  But it's not huge trouble if
 the assertion is not run in the rare case where HeapTupleHeaderGetCmax
 is being used to write a Xlog record.
 
 It's a bit painful that HeapTupleHeaderGetUpdateXid allocates memory,
 but to fix that we would have to remove all allocations from
 GetMultiXactIdMembers which doesn't sound feasible.

Since nobody seemed to have a better idea I've proceeded in doing
so... Not pretty.
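The weakened check amounts to roughly this shape (one way to write it,
shown here only for illustration; the guard makes the allocating
HeapTupleHeaderGetUpdateXid() call be skipped inside critical sections):

/*
 * Don't run the check inside a critical section:
 * HeapTupleHeaderGetUpdateXid() can allocate memory when xmax is a
 * multixact.
 */
Assert(CritSectionCount > 0 ||
	   TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetUpdateXid(tup)));
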
I've verified that the assertion could be triggered before, but not
after by doing something like:
S1:
CREATE TABLE a();
BEGIN;
SELECT * FROM pg_class WHERE relname = 'a' FOR SHARE;

S2:
BEGIN;
SELECT * FROM pg_class WHERE relname = 'a' FOR SHARE;
COMMIT;

S1:
ALTER TABLE a RENAME to b;
ALTER TABLE b RENAME to a; -- this triggered the assertion

That seems a tad too obscure for its own isolationtester test.

Thanks for the report, Steve!

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] PATCH: pgbench / int64 instead of int for xact count

2014-05-25 Thread Tomas Vondra
Hi,

I've been running a few longer pgbench tests (~week), and I've run into
this:

transaction type: SELECT only
scaling factor: 1250
query mode: simple
number of clients: 32
number of threads: 4
duration: 605000 s
number of transactions actually processed: -1785047856
latency average: -10.846 ms
tps = -2950.492090 (including connections establishing)
tps = -2950.492325 (excluding connections establishing)

The instance was doing ~10k tps for a week, which caused overflow of the
int counter, used to track number of transactions. Hence the negative
values.

I think we've reached the time when hardware capable of doing this is
pretty common (SSDs, ...), so I think it's time to switch the counter to
int64.

Tomas
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 6cc06d7..f5b3f60 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -239,7 +239,7 @@ typedef struct
 typedef struct
 {
 	instr_time	conn_time;
-	int			xacts;
+	int64		xacts;
 	int64		latencies;
 	int64		sqlats;
 	int64		throttle_lag;
@@ -2180,7 +2180,7 @@ process_builtin(char *tb)
 
 /* print out results */
 static void
-printResults(int ttype, int normal_xacts, int nclients,
+printResults(int ttype, int64 normal_xacts, int nclients,
 			 TState *threads, int nthreads,
 			 instr_time total_time, instr_time conn_total_time,
 			 int64 total_latencies, int64 total_sqlats,
@@ -2213,13 +2213,13 @@ printResults(int ttype, int normal_xacts, int nclients,
 	if (duration <= 0)
 	{
 		printf("number of transactions per client: %d\n", nxacts);
-		printf("number of transactions actually processed: %d/%d\n",
+		printf("number of transactions actually processed: %ld/%d\n",
 			   normal_xacts, nxacts * nclients);
 	}
 	else
 	{
 		printf("duration: %d s\n", duration);
-		printf("number of transactions actually processed: %d\n",
+		printf("number of transactions actually processed: %ld\n",
 			   normal_xacts);
 	}
 
@@ -2359,7 +2359,7 @@ main(int argc, char **argv)
 	instr_time	start_time;		/* start up time */
 	instr_time	total_time;
 	instr_time	conn_total_time;
-	int			total_xacts = 0;
+	int64		total_xacts = 0;
 	int64		total_latencies = 0;
 	int64		total_sqlats = 0;
 	int64		throttle_lag = 0;



Re: [HACKERS] pg_recvlogical not accepting -I to specify start LSN position

2014-05-25 Thread Andres Freund
Hi,

On 2014-05-25 22:35:24 +0900, Michael Paquier wrote:
 As written in subject, pg_recvlogical does not work properly with
 option -I but it should:
 $ pg_recvlogical -I 0/0
 pg_recvlogical: invalid option -- I
 Try pg_recvlogical --help for more information.
 $ pg_recvlogical --help | grep \-I
   -I, --startpos=PTR where in an existing slot should the streaming start

Good catch. Apparently only the long argument version ever worked...

 Attached patch corrects that, reshuffling at the same time the option
 letters parsed with getopt_long in alphabetical order.

Hm. Not a big fan of this in isolation. In the attached patch I've
reordered the options to all be ordered alphabetically, but only inside
the section they are in --help.

What do you think?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 85f54dafd13238ff831d1c66cddeaf071ad60708 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sun, 25 May 2014 18:47:05 +0200
Subject: [PATCH] Fix pg_recvlogical to accept the documented -I instead only
 --startpos.

The bug was caused by omitting 'I:' from the short argument list to
getopt_long(). To make similar bugs in the future less likely reorder
options in --help, long and short option lists to be in the same,
alphabetical within groups, order.

Report and fix by Michael Paquier, some additional reordering by me.
---
 src/bin/pg_basebackup/pg_recvlogical.c | 50 +-
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 651cc40..cbbba9e 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -68,6 +68,8 @@ usage(void)
 	printf(_("  %s [OPTION]...\n"), progname);
 	printf(_("\nOptions:\n"));
 	printf(_("  -f, --file=FILE        receive log into this file. - for stdout\n"));
+	printf(_("  -F  --fsync-interval=SECS\n"
+			 "                 frequency of syncs to the output file (default: %d)\n"), (fsync_interval / 1000));
 	printf(_("  -n, --no-loop          do not loop on connection lost\n"));
 	printf(_("  -v, --verbose          output verbose messages\n"));
 	printf(_("  -V, --version          output version information, then exit\n"));
@@ -80,8 +82,7 @@ usage(void)
 	printf(_("  -w, --no-password      never prompt for password\n"));
 	printf(_("  -W, --password         force password prompt (should happen automatically)\n"));
 	printf(_("\nReplication options:\n"));
-	printf(_("  -F  --fsync-interval=SECS\n"
-			 "                 frequency of syncs to the output file (default: %d)\n"), (fsync_interval / 1000));
+	printf(_("  -I, --startpos=PTR     where in an existing slot should the streaming start\n"));
 	printf(_("  -o, --option=NAME[=VALUE]\n"
 			 "                 specify option NAME with optional value VALUE, to be passed\n"
 			 "                 to the output plugin\n"));
@@ -89,7 +90,6 @@ usage(void)
 	printf(_("  -s, --status-interval=SECS\n"
 			 "                 time between status packets sent to server (default: %d)\n"), (standby_message_timeout / 1000));
 	printf(_("  -S, --slot=SLOT        use existing replication slot SLOT instead of starting a new one\n"));
-	printf(_("  -I, --startpos=PTR     where in an existing slot should the streaming start\n"));
 	printf(_("\nAction to be performed:\n"));
 	printf(_("  --create               create a new replication slot (for the slotname see --slot)\n"));
 	printf(_("  --start                start streaming in a replication slot (for the slotname see --slot)\n"));
@@ -600,6 +600,7 @@ main(int argc, char **argv)
 	static struct option long_options[] = {
 /* general options */
 		{"file", required_argument, NULL, 'f'},
+		{"fsync-interval", required_argument, NULL, 'F'},
 		{"no-loop", no_argument, NULL, 'n'},
 		{"verbose", no_argument, NULL, 'v'},
 		{"version", no_argument, NULL, 'V'},
@@ -612,12 +613,11 @@ main(int argc, char **argv)
 		{"no-password", no_argument, NULL, 'w'},
 		{"password", no_argument, NULL, 'W'},
 /* replication options */
+		{"startpos", required_argument, NULL, 'I'},
 		{"option", required_argument, NULL, 'o'},
 		{"plugin", required_argument, NULL, 'P'},
 		{"status-interval", required_argument, NULL, 's'},
-		{"fsync-interval", required_argument, NULL, 'F'},
 		{"slot", required_argument, NULL, 'S'},
-		{"startpos", required_argument, NULL, 'I'},
 /* action */
 		{"create", no_argument, NULL, 1},
 		{"start", no_argument, NULL, 2},
@@ -647,7 +647,7 @@ main(int argc, char **argv)
 		}
 	}
 
-	while ((c = getopt_long(argc, argv, "f:F:nvd:h:o:p:U:wWP:s:S:",
+	while ((c = getopt_long(argc, argv, "f:F:nvd:h:p:U:wWI:o:P:s:S:",
 			long_options, option_index)) != -1)
 	{
 		switch (c)
@@ -656,6 +656,15 @@ main(int argc, char **argv)
 			case 'f':
 outfile = pg_strdup(optarg);
 break;
+			case 'F':
+fsync_interval = atoi(optarg) * 1000;
+if 

Re: [HACKERS] PATCH: pgbench / int64 instead of int for xact count

2014-05-25 Thread Andres Freund
Hi,

On 2014-05-25 18:05:03 +0200, Tomas Vondra wrote:
 I've been running a few longer pgbench tests (~week), and I've run into
 this:

 number of transactions actually processed: -1785047856
 latency average: -10.846 ms
 tps = -2950.492090 (including connections establishing)
 tps = -2950.492325 (excluding connections establishing)
 
 The instance was doing ~10k tps for a week, which caused overflow of the
 int counter, used to track number of transactions. Hence the negative
 values.

 I think we've reached the time when hardeare capable of doing this is
 pretty common (SSDs, ...), so I think it's time to switch the counter to
 int64.

Especially when it's perfectly possible to do 500k read only
transactions a second...

   printf(number of transactions per client: %d\n, nxacts);
 - printf(number of transactions actually processed: %d/%d\n,
 + printf(number of transactions actually processed: %ld/%d\n,
  normal_xacts, nxacts * nclients);

That's not right though. On windows a long (indicated by the %l) is only
4 bytes wide. Check INT64_FORMAT. That's generated by configure/platform
template files and should always be correct.
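
For illustration, the pattern it expects is string concatenation around
the macro, roughly like this (a minimal sketch; int64 and INT64_FORMAT
come in via postgres_fe.h / c.h, which pgbench already includes, and the
helper name is made up):

static void
print_xacts(int64 normal_xacts)
{
	/* INT64_FORMAT expands to the platform's printf format for int64 */
	printf("number of transactions actually processed: " INT64_FORMAT "\n",
		   normal_xacts);
}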

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Sending out a request for more buildfarm animals?

2014-05-25 Thread Andres Freund
On 2014-05-25 01:02:25 +0200, Tomas Vondra wrote:
 On 14.5.2014 15:17, Andres Freund wrote:
 The cache invalidation bug was apparently fixed, but we're still getting
 failures (see for example markhor):
 
 http://www.pgbuildfarm.org/cgi-bin/show_history.pl?nm=markhor&br=HEAD
 
 I see there's a transaction (COMMIT+BEGIN) - is this caused by the
 extremely long runtimes?

Yes, that's the reason. Normally the test doesn't trigger autovacuum at
all, but if it's running for a *long* time it can. I haven't yet figured
out a good way to deal with that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] [GSoC] Clustering in MADlib - status update

2014-05-25 Thread Maxence Ahlouche
Hi,

Here is my first report. You can also find it on my Gitlab [0].
Week 1 - 2014/05/25

For this first week, I have written a test script that generates some
simple datasets, and produces an image containing the output of the MADlib
clustering algorithms.

This script can be called like this:

./clustering_test.py new ds0 -n 8 # generates a dataset called ds0
with 8 clusters
./clustering_test.py query ds0 -o output.png # outputs the result of
the clustering algorithms applied to ds0 in output.png

See ./clustering_test.py -h for all the available options.

An example of output can be found here [1].

Of course, I will keep improving this test script, as it is still far from
perfect; but for now, it does approximately what I want.

For next week, I'll start working on the implementation of k-medoids in
MADlib. As a reminder, according to the timeline I suggested for the
project, this step must be done on May 30. Depending on the problems I will
face (mostly lack of knowledge of the codebase, I guess), this might not be
finished on time, but it should be done a few days later (by the end of
next week, hopefully).

Attached is the patch containing everything I have done this week, though
the git log might be more convenient to read.

Regards,

Maxence A.

[0] http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst
[1]
http://git.viod.eu/viod/gsoc_2014/blob/master/clustering_test/example_dataset.png

-- 
Maxence Ahlouche
06 06 66 97 00
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 000..97de20e
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+***/__pycache__/
+**.png
\ No newline at end of file
diff --git a/autogen_results.py b/autogen_results.py
deleted file mode 100755
index 033c309..000
--- a/autogen_results.py
+++ /dev/null
@@ -1,6 +0,0 @@
-#!/usr/bin/python
-
-import os
-
-while(True):
-    os.system("./k-means_test.py --regen -o results/$(date | md5sum | cut -d ' ' -f 1).png")
diff --git a/clustering_test.py b/clustering_test.py
deleted file mode 100755
index 2afc0d1..000
--- a/clustering_test.py
+++ /dev/null
@@ -1,35 +0,0 @@
-#!/usr/bin/env python3
-
-import argparse
-import psycopg2 as pg
-import sys
-
-
-class DatabaseConnection():
-    db_name = 'madlib'
-    user = 'madlib'
-    host = 'localhost'
-    port = 5432
-    table_name = 'tmp_points'
-    field_name = 'coords'
-
-    def __init__(self):
-        self.conn = pg.connect(database=self.db_name, user=self.user, host=self.host, port=5432)
-        self.cur = self.conn.cursor()
-        self.cur.execute('DROP TABLE IF EXISTS %s CASCADE;' % self.table_name)
-        self.cur.execute('CREATE TABLE %s (id SERIAL PRIMARY KEY, coords INT[]);' % self.table_name)
-        self.conn.commit()
-
-    def __del__(self):
-        self.cur.close()
-        self.conn.close()
-
-
-def main(args):
-    parser = argparse.ArgumentParser(description='Visualize output of the clustering algorithms provided by MADlib, in PostgreSQL.')
-    parser.add_argument('-n', metavar='number of clusters', type=int)
-
-    dc = DatabaseConnection()
-
-if __name__ == '__main__':
-    main(sys.argv[1:])
diff --git a/clustering_test/autogen_results.py b/clustering_test/autogen_results.py
new file mode 100755
index 000..033c309
--- /dev/null
+++ b/clustering_test/autogen_results.py
@@ -0,0 +1,6 @@
+#!/usr/bin/python
+
+import os
+
+while(True):
+    os.system("./k-means_test.py --regen -o results/$(date | md5sum | cut -d ' ' -f 1).png")
diff --git a/clustering_test/clustering_test.py b/clustering_test/clustering_test.py
new file mode 100755
index 000..248b5cf
--- /dev/null
+++ b/clustering_test/clustering_test.py
@@ -0,0 +1,63 @@
+#!/usr/bin/env python3
+
+import argparse
+
+import database as db
+import dataset_generator as ds
+import visualiser as vs
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(
+        description='Visualize output of the clustering algorithms provided by '
+        'MADlib, in PostgreSQL. You should start by adding a dataset. You need '
+        'a PostgreSQL running.')
+    subparsers = parser.add_subparsers(help='subparsers help', dest='action')
+
+    parser_dataset = subparsers.add_parser('new', help='generate a dataset')
+    parser_dataset.add_argument(
+        'dataset_name',
+        help='the name of the dataset to create',
+    )
+    parser_dataset.add_argument(
+        '-n',
+        '--nb_clusters',
+        type=int,
+        help='the number of clusters composing the new dataset. Defaults to a '
+        'random value between 2 and 10.',
+    )
+    parser_dataset.add_argument(
+        '-d',
+        '--distribution',
+        choices=ds.gen_cluster.keys(),
+        help='the distribution of the points in the clusters. Defaults to '
+        'uniform.',
+    )
+
+    parser_query = subparsers.add_parser('query', help='apply clustering algorithms on a dataset')
+    parser_query.add_argument(
+

Re: [HACKERS] pg_upgrade fails: Mismatch of relation OID in database 8.4 - 9.3

2014-05-25 Thread Andres Freund
Hi,

On 2014-05-23 08:23:57 -0600, Jeff Ross wrote:
 UDB=# \x
 Expanded display is on.
 UDB=# SELECT attrelid::regclass, attname, attnum, attlen, *
 FROM pg_attribute
 WHERE attrelid = 'masterairportlist'::regclass
 ORDER BY attnum ASC;
 UDB=#
 [ RECORD 1 ]-+--
 ...

A quick sum over the returned values seems to indicate that it's too
large to not have a toast table. Adding up attlen and atttypmod gives a
value of 1283. Considering that there additionally has to be VARHDRSZ
space for the varlena header and that the maximum storage size for a
varchar(n) is n * pg_encoding_max_length(GetDatabaseEncoding()) (usually
4) this seems to indicate that bad things[tm] have been done to the
database.
I suggest you write a script that does a 'ALTER TABLE $tbl ADD COLUMN
toastme text; ALTER TABLE $tbl DROP COLUMN toastme' for all tables.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg9.4b1: unhelpful error message when creating a collation

2014-05-25 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-05-25 09:17:24 +0200, Fabien COELHO wrote:
 sql CREATE COLLATION french (locale='fr_FR.utf8');
 ERROR:  could not create locale fr_FR.utf8: Success

 This seems to be a glibc bug. If a nonexistant locale has already been
 asked for errno is set to 0 instead of something sensible.
 Searching for that error turned up:
 https://sourceware.org/bugzilla/show_bug.cgi?id=14247
 https://bugzilla.redhat.com/show_bug.cgi?id=827510
 The latter by Tom Lane ;). Unfortunately not much seems to have happened
 since.

Yeah.  Red Hat seems to be carrying a patch (the RH bugzilla says it
was fixed in Fedora 17, and I see correct behavior in current RHEL6)
but if the sourceware bug reflects the state of affairs upstream then
other distros may still be suffering from the bug.

 Except badgering the glibc guys a bit, the only thing I can think of is
 to fudge errno = 0 into errno = ENOENT. Not nice.

Yeah, I don't especially care for adding such a workaround.  If the
consequences were worse than a wrong error message, then maybe ---
but adding the workaround might itself sometimes mean a wrong error
message, so it's not clear we'd be making much forward progress.

Better to pester the glibc boys to fix it properly.  In the meantime,
you might have some luck asking the Ubuntu packagers to copy Red Hat's
fix.

regards, tom lane




Re: [HACKERS] [9.5] possible fast path for pinning a page multiple times

2014-05-25 Thread Tom Lane
Jeff Davis pg...@j-davis.com writes:
 Idea:

 Let's say we have a routine PinBufferTag, that's like PinBuffer but it
 takes an additional BufferTag argument. When it locks the buffer header,
 it would also compare the argument to the buffer's tag, and if they
 don't match, return a status indicating that it's the wrong buffer and
 don't pin it. In other words, it pins the buffer only if it's the right
 one.

 Then, we can just have a backend-local cache that maps BufferTag to
 buffer ID. If it's missing an entry, or if the entry is wrong, then it
 just proceeds with the normal BufferAlloc path. But if the cache holds
 the right value, then we completely bypass the BufMappingPartitionLock
 while getting the pin.

 Before I do too much performance testing of this, is it a correct
 approach? It seems too easy.

Don't see why it wouldn't work.  I think you'd need to test buffer status
not just the contents of the tag field, but that's easy enough.

Bigger issues are how large we'd want to allow the local cache to get
(and what to do to keep it within that) and how much performance we
really gain.  But those are tuning questions not correctness questions.

regards, tom lane




Re: [HACKERS] PATCH: pgbench / int64 instead of int for xact count

2014-05-25 Thread Tomas Vondra
On 25.5.2014 19:05, Andres Freund wrote:
  printf(number of transactions per client: %d\n, nxacts);
 -printf(number of transactions actually processed: %d/%d\n,
 +printf(number of transactions actually processed: %ld/%d\n,
 normal_xacts, nxacts * nclients);
 
 That's not right though. On windows a long (indicated by the %l) is only
 4 bytes wide. Check INT64_FORMAT. That's generated by configure/platform
 template files and should always be correct.

Oh, right. v2 of the patch attached.

Tomas
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 6cc06d7..2f586da 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -239,7 +239,7 @@ typedef struct
 typedef struct
 {
 	instr_time	conn_time;
-	int			xacts;
+	int64		xacts;
 	int64		latencies;
 	int64		sqlats;
 	int64		throttle_lag;
@@ -2180,7 +2180,7 @@ process_builtin(char *tb)
 
 /* print out results */
 static void
-printResults(int ttype, int normal_xacts, int nclients,
+printResults(int ttype, int64 normal_xacts, int nclients,
 			 TState *threads, int nthreads,
 			 instr_time total_time, instr_time conn_total_time,
 			 int64 total_latencies, int64 total_sqlats,
@@ -2213,13 +2213,13 @@ printResults(int ttype, int normal_xacts, int nclients,
 	if (duration <= 0)
 	{
 		printf("number of transactions per client: %d\n", nxacts);
-		printf("number of transactions actually processed: %d/%d\n",
+		printf("number of transactions actually processed: " INT64_FORMAT "/%d\n",
 			   normal_xacts, nxacts * nclients);
 	}
 	else
 	{
 		printf("duration: %d s\n", duration);
-		printf("number of transactions actually processed: %d\n",
+		printf("number of transactions actually processed: " INT64_FORMAT "\n",
 			   normal_xacts);
 	}
 
@@ -2359,7 +2359,7 @@ main(int argc, char **argv)
 	instr_time	start_time;		/* start up time */
 	instr_time	total_time;
 	instr_time	conn_total_time;
-	int			total_xacts = 0;
+	int64		total_xacts = 0;
 	int64		total_latencies = 0;
 	int64		total_sqlats = 0;
 	int64		throttle_lag = 0;



Re: [HACKERS] PATCH: pgbench / int64 instead of int for xact count

2014-05-25 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 On 25.5.2014 19:05, Andres Freund wrote:
 That's not right though. On windows a long (indicated by the %l) is only
 4 bytes wide. Check INT64_FORMAT. That's generated by configure/platform
 template files and should always be correct.

 Oh, right. v2 of the patch attached.

Hm ... if normal_xacts can overflow an int, shouldn't we expect that
the product nxacts * nclients could?

regards, tom lane




Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-25 Thread Tom Lane
Matteo Beccati p...@beccati.com writes:
 here's the latest version of my uuid changes patch, according to
 proposal (2) from Tom in the thread about OSSP-UUID[1].

Hmm ... this is not actually what I had in mind.  Unless I'm misreading
the patch, this nukes the uuid-ossp extension entirely in favor of a
new extension "uuid" (providing the same SQL functions with a different
underlying implementation).  I don't think this works from the standpoint
of providing compatibility for users of the existing extension.
In particular, it'd break pg_upgrade (because of the change of the .so
library name) as well as straight pg_dump upgrades (which would expect
CREATE EXTENSION "uuid-ossp" to work).  Not to mention application code
that might expect CREATE EXTENSION "uuid-ossp" to still work.

Another objection is that for people for whom OSSP uuid still works fine,
this is forcing them to adopt a new implementation whose compatibility is
as yet unproven.

What I'd rather do is preserve contrib/uuid-ossp with the same extension
and .so name, but with two configure options that select different
underlying implementations.

In the long run it'd be nice to migrate away from the uuid-ossp
extension name, mostly because of the poorly-chosen use of a dash in the
name.  But I'm not sure how we do that without breaking backwards
compatibility, and anyway it's an entirely cosmetic thing that we can
worry about later.

Anyhow, doing it like that seems like it ought to be a pretty
straightforward refactoring of your patch.  I could pursue that,
or you can.

regards, tom lane




Re: [HACKERS] PATCH: pgbench / int64 instead of int for xact count

2014-05-25 Thread Tomas Vondra
On 25.5.2014 20:32, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 On 25.5.2014 19:05, Andres Freund wrote:
 That's not right though. On windows a long (indicated by the %l) is only
 4 bytes wide. Check INT64_FORMAT. That's generated by configure/platform
 template files and should always be correct.
 
 Oh, right. v2 of the patch attached.
 
 Hm ... if normal_xacts can overflow an int, shouldn't we expect that
 the product nxacts * nclients could?

Maybe, but I saw that as a separate thing, mostly unrelated to the
'hardware got so fast it can overflow' problem. Because nxacts is the
parameter we use to define the duration, if it didn't overflow in the
past, it won't overflow today.

OTOH, thanks to hardware improvements we tend to use more clients /
higher number of transactions, so fixing this seems like a good idea.

Should we change the type of nxacts to int64 (thus allowing '-t' with
64-bit integers), or just the overflow in the printf call? I don't find
it very practical to use -t with values not fitting into 32-bits (the -T
seems better to do that), so I'm inclined to just fix the printf. Patch
v3 attached.

Tomas
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 6cc06d7..cf71c73 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -239,7 +239,7 @@ typedef struct
 typedef struct
 {
 	instr_time	conn_time;
-	int			xacts;
+	int64		xacts;
 	int64		latencies;
 	int64		sqlats;
 	int64		throttle_lag;
@@ -2180,7 +2180,7 @@ process_builtin(char *tb)
 
 /* print out results */
 static void
-printResults(int ttype, int normal_xacts, int nclients,
+printResults(int ttype, int64 normal_xacts, int nclients,
 			 TState *threads, int nthreads,
 			 instr_time total_time, instr_time conn_total_time,
 			 int64 total_latencies, int64 total_sqlats,
@@ -2213,13 +2213,13 @@ printResults(int ttype, int normal_xacts, int nclients,
 	if (duration <= 0)
 	{
 		printf("number of transactions per client: %d\n", nxacts);
-		printf("number of transactions actually processed: %d/%d\n",
-			   normal_xacts, nxacts * nclients);
+		printf("number of transactions actually processed: " INT64_FORMAT "/" INT64_FORMAT "\n",
+			   normal_xacts, (int64)nxacts * nclients);
 	}
 	else
 	{
 		printf("duration: %d s\n", duration);
-		printf("number of transactions actually processed: %d\n",
+		printf("number of transactions actually processed: " INT64_FORMAT "\n",
 			   normal_xacts);
 	}
 
@@ -2359,7 +2359,7 @@ main(int argc, char **argv)
 	instr_time	start_time;		/* start up time */
 	instr_time	total_time;
 	instr_time	conn_total_time;
-	int			total_xacts = 0;
+	int64		total_xacts = 0;
 	int64		total_latencies = 0;
 	int64		total_sqlats = 0;
 	int64		throttle_lag = 0;



Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-25 Thread David Fetter
On Fri, May 23, 2014 at 10:08:06PM +0200, Ronan Dunklau wrote:
 Hello,
 
 Since my last proposal didn't get any strong rebuttal, please find attached a 
 more complete version of the IMPORT FOREIGN SCHEMA statement.

Thanks!

Please send future patches to this thread so people can track them
in their mail.

 I tried to follow the SQL-MED specification as closely as possible.
 
 This adds discoverability to foreign servers. The structure of the statement
 as I understand it is simple enough:
 
 IMPORT FOREIGN SCHEMA remote_schema FROM SERVER some_server [ (LIMIT TO | 
 EXCEPT) table_list ] INTO local_schema.
 
 The import_foreign_schema patch adds the infrastructure, and a new FDW 
 routine: 
 
 typedef List *(*ImportForeignSchema_function) (ForeignServer *server, 
 ImportForeignSchemaStmt * parsetree);
 
 This routine must return a list of CreateForeignTableStmt mirroring whatever 
 tables were found on the remote side, which will then be executed.
 
 The import_foreign_schema_postgres_fdw patch proposes an implementation of 
 this API for postgres_fdw. It will import a foreign schema using the right 
 types as well as nullable information.

In the case of PostgreSQL, the right types are obvious until there's
a user-defined one.  What do you plan to do in that case?

 Regarding documentation, I don't really know where it should have been put. 
 If 
 I missed something, let me know and I'll try to correct it.

It's not exactly something you missed, but I need to bring it up
anyway before we go too far.  The standard missed two crucial concepts
when this part of it was written:

1.  No single per-database-type universal type mapping can be correct.

People will have differing goals for type mapping, and writing a whole
new FDW for each of those goals is, to put it mildly, wasteful.  I
will illustrate with a concrete and common example.

MySQL's datetime type encourages usages which PostgreSQL's
corresponding type, timestamptz, simply disallows, namely '0000-00-00
00:00:00' as its idea of UNKNOWN or NULL.

One way PostgreSQL's mapping could work is to map it to TEXT, which
would preserve the strings exactly and be in some sense an identity
map.  It would also make the type somewhat useless in its original
intended form.

Another one would map the type to a composite data type
mysql_datetime(tstz timestamptz, is_wacky boolean) which would
capture, for example, ('2014-04-01 00:00:00+00', false) for the UTC
start of April Fools' Day this year, and (NULL, true) for '0000-00-00
00:00:00'.

There are doubtless others, and there is no principled way to assign
any one of them as universally correct.

This brings me to the next crucial concept the standard missed:

2.  The correct mapping may not be the identity, and furthermore, the
inbound and outbound mappings might in general not be mere inversions
of each other.

MySQL (no aspersions intended) again provides a concrete example with
its unsigned integer types.  Yes, it's possible to create a domain
over INT8 which simulates UINT4, a domain over NUMERIC which simulates
UINT8, etc., but again this process's correctness depends on
circumstances.

To address these problems, I propose the following:

- We make type mappings settable at the level of:
- FDW
- Instance (a.k.a. cluster)
- Database
- Schema
- Table
- Column

using the existing ALTER command and some way of spelling out how
a remote type maps to a local type.

This would consist of:
- The remote type
- The local type to which it maps
- The inbound transformation (default identity)
- The outbound transformation (default identity)

At any given level, the remote type would need to be unique.  To
communicate this to the system, we either invent new syntax, with
all the hazards attendant thereto, or we could use JSON or similar
serialization.

ALTER FOREIGN TABLE foo
ADD TYPE MAPPING
FROM datetime
TO TEXT
WITH (
INBOUND TRANSFORMATION IDENTITY,
OUTBOUND TRANSFORMATION IDENTITY
) /* Ugh!!! */

vs.

ALTER FOREIGN TABLE foo ADD (mapping '{
    "datetime": "text",
    "inbound": "IDENTITY",
    "outbound": "IDENTITY"
}')

Each FDW would have some set of default mappings and some way to
override them as above.

What say?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] PATCH: pgbench / int64 instead of int for xact count

2014-05-25 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 Should we change the type of nxacts to int64 (thus allowing '-t' with
 64-bit integers), or just the overflow in the printf call? I don't find
 it very practical to use -t with values not fitting into 32-bits (the -T
 seems better to do that), so I'm inclined to just fix the printf.

I concur: I don't see people setting -t higher than INT_MAX, but a -t
times -c product exceeding INT_MAX seems more plausible.

 Patch v3 attached.

Looks good, pushed.

regards, tom lane




Re: [HACKERS] Sending out a request for more buildfarm animals?

2014-05-25 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-05-25 01:02:25 +0200, Tomas Vondra wrote:
 The cache invalidation bug was apparently fixed, but we're still getting
 failures (see for example markhor):
 http://www.pgbuildfarm.org/cgi-bin/show_history.pl?nm=markhor&br=HEAD
 I see there's a transaction (COMMIT+BEGIN) - is this caused by the
 extremely long runtimes?

 Yes, that's the reason. Normally the test doesn't trigger autovacuum at
 all, but if it's running for a *long* time it can. I haven't yet figured
 out a good way to deal with that.

Any way to make the test print only WAL entries arising from the
foreground transaction?

If an autovac run can trigger this failure, then I would think it would
happen sometimes, probabilistically, even when the test runtime wasn't
all that long.  That would be very unhappy-making, eg for packagers
who would like build runs to reliably work the first time.  So I think
this is important even without the desire to run CLOBBER_CACHE regression
tests.

Another idea is to provide a variant expected file, but that seems
a bit fragile: if you can get one extra transaction, why not two?

regards, tom lane




Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-25 Thread Ronan Dunklau
On Sunday, 25 May 2014 at 12:41:18, David Fetter wrote:
 On Fri, May 23, 2014 at 10:08:06PM +0200, Ronan Dunklau wrote:
  Hello,
  
  Since my last proposal didn't get any strong rebuttal, please find
  attached a more complete version of the IMPORT FOREIGN SCHEMA statement.
 
 Thanks!
 
 Please to send future patches to this thread so people can track them
 in their mail.

I'll do.

I didn't for the previous one because it was a few months ago, and no patch 
had been added to the commit fest.

 
  I tried to follow the SQL-MED specification as closely as possible.
  
  This adds discoverability to foreign servers. The structure of the
  statement as I understand it is simple enough:
  
  IMPORT FOREIGN SCHEMA remote_schema FROM SERVER some_server [ (LIMIT TO |
  EXCEPT) table_list ] INTO local_schema.
  
  The import_foreign_schema patch adds the infrastructure, and a new FDW
  routine:
  
  typedef List *(*ImportForeignSchema_function) (ForeignServer *server,
  ImportForeignSchemaStmt * parsetree);
  
  This routine must return a list of CreateForeignTableStmt mirroring
  whatever tables were found on the remote side, which will then be
  executed.
  
  The import_foreign_schema_postgres_fdw patch proposes an implementation of
  this API for postgres_fdw. It will import a foreign schema using the right
  types as well as nullable information.
 
 In the case of PostgreSQL, the right types are obvious until there's
 a user-defined one.  What do you plan to do in that case ?
 

The current implementation fetches the types as regtype, and when receiving a 
custom type, two things can happen:

 - the type is defined locally: everything will work as expected
 - the type is not defined locally: the conversion function will fail, and 
raise an error of the form: ERROR:  type schema.typname does not exist

Should I add that to the regression test suite?

  Regarding documentation, I don't really know where it should have been
  put. If I missed something, let me know and I'll try to correct it.
 
 It's not exactly something you missed, but I need to bring it up
 anyway before we go too far.  The standard missed two crucial concepts
 when this part of it was written:
 
 1.  No single per-database-type universal type mapping can be correct.
 
 People will have differing goals for type mapping, and writing a whole
 new FDW for each of those goals is, to put it mildly, wasteful.  I
 will illustrate with a concrete and common example.
 
 MySQL's datetime type encourages usages which PostgreSQL's
 corresponding type, timestamptz, simply disallows, namely '0000-00-00
 00:00:00' as its idea of UNKNOWN or NULL.
 
 One way PostgreSQL's mapping could work is to map it to TEXT, which
 would preserve the strings exactly and be in some sense an identity
 map.  It would also make the type somewhat useless in its original
 intended form.
 
 Another one would be to map the type to a composite data type
 mysql_datetime(tstz timestamptz, is_wacky boolean) which would
 capture, for example, ('2014-04-01 00:00:00+00', false) for the UTC
 start of April Fools' Day this year, and (NULL, true) for '0000-00-00
 00:00:00'.
 
 There are doubtless others, and there is no principled way to assign
 any one of them as universally correct.
 
 This brings me to the next crucial concept the standard missed:
 
 2.  The correct mapping may not be the identity, and furthermore, the
 inbound and outbound mappings might in general not be mere inversions
 of each other.
 
 MySQL (no aspersions intended) again provides a concrete example with
 its unsigned integer types.  Yes, it's possible to create a domain
 over INT8 which simulates UINT4, a domain over NUMERIC which simulates
 UINT8, etc., but again this process's correctness depends on
 circumstances.
 
 To address these problems, I propose the following:
 
 - We make type mappings settable at the level of:
 - FDW
 - Instance (a.k.a. cluster)
 - Database
 - Schema
 - Table
 - Column
 
 using the existing ALTER command and some way of spelling out how
 a remote type maps to a local type.
 
 This would consist of:
 - The remote type
 - The local type to which it maps
 - The inbound transformation (default identity)
 - The outbound transformation (default identity)
 
 At any given level, the remote type would need to be unique.  To
 communicate this to the system, we either invent new syntax, with
 all the hazards attendant thereto, or we could use JSON or similar
 serialization.
 
 ALTER FOREIGN TABLE foo
 ADD TYPE MAPPING
 FROM datetime
 TO TEXT
 WITH (
 INBOUND TRANSFORMATION IDENTITY,
 OUTBOUND TRANSFORMATION IDENTITY
 ) /* Ugh!!! */
 
 vs.
 
 ALTER FOREIGN TABLE foo ADD (mapping '{
 datetime: text,
 inbound: IDENTITY,
 outbound: IDENTITY
 }')
 
 Each FDW would have some set of default mappings and some way to
  override them as above.

Re: [HACKERS] 9.4 btree index corruption

2014-05-25 Thread Jeff Janes
On Sun, May 25, 2014 at 7:16 AM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 On 05/21/2014 10:22 PM, Jeff Janes wrote:

 Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
 with foreign keys, I found some btree index corruption.

 ...


 https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing


 I downloaded the data directory and investigated. I got this message when
 I started it up:

 20392  2014-05-25 05:51:37.835 PDT:ERROR:  right sibling 4044 of block 460
 is not next child 23513 of block 86458 in index foo_p_id_idx
 20392  2014-05-25 05:51:37.835 PDT:CONTEXT:  automatic vacuum of table
 jjanes.public.foo

 Interestingly, it's complaining about parent page 86458, while yours
 claimed it was 1264. I don't know why; perhaps a huge number of insertions
 happened after that error, causing the parent level pages to be split,
 moving the downlinks it complains about to the right. Did you continue
 running the test after that error occurred?


Yes, I didn't set it up to abort upon vacuum errors, so it continued to run
until it reached the 1,000,000 wraparound limit.

Between when the vacuums started failing and when it reached the wraparound
limit, it seemed to run normally (other than the increasing bloat and
non-advancement of frozenxid).

I only noticed the ERROR when I was looking through the log file in
postmortem.  Looking back, I see 10 errors with 1264, then it switched to
86458.  I didn't notice the change and reported only the first ERROR
message.

I'll apply your patch and see what happens, but 90 further hours of testing
gave no more occurrences of the error so I'm afraid that not seeing errors
is not much evidence that the error was fixed.  So I'll have to rely on
your knowledge of the code.



 This is what the tree looks like around those pages:

 Level 1:
 +-------------+      +-------------+      +-------------+
 | Blk 1264    |      | Blk 160180  |      | Blk 86458   |
 |             |      |             |      |             |
 | Downlinks:  |  ->  | Downlinks:  |  ->  | Downlinks:  |
 | ...         |      | ...         |      |  1269       |
 |             |      |             |      |   460       |
 |             |      |             |      | 23513       |
 +-------------+      +-------------+      +-------------+



Is this done with the pageinspect extension?


With the attached patch, I was able to get past that error, but when VACUUM
 reaches the end, I got this:

 jjanes=# vacuum foo;
 ERROR:  database is not accepting commands to avoid wraparound data loss
 in database jjanes
 HINT:  Stop the postmaster and vacuum that database in single-user mode.
 You might also need to commit or roll back old prepared transactions.

 That seems like an unrelated issue, I'll start a new thread about that.


Yeah, that's expected.  Since vacuum failures don't abort the test, it ran
on until something else stopped it.



 Many thanks for the testing, Jeff!


You're welcome.  If only I was as good at fixing things as at breaking
them.

Cheers,

Jeff


[HACKERS] Spreading full-page writes

2014-05-25 Thread Heikki Linnakangas
Here's an idea I tried to explain to Andres and Simon at the pub last 
night, on how to reduce the spikes in the amount of WAL written at the
beginning of a checkpoint that full-page writes cause. I'm just writing
this down for the sake of the archives; I'm not planning to work on this 
myself.



When you are replaying a WAL record that lies between the Redo-pointer 
of a checkpoint and the checkpoint record itself, there are two 
possibilities:


a) You started WAL replay at that checkpoint's Redo-pointer.

b) You started WAL replay at some earlier checkpoint, and are already in 
a consistent state.


In case b), you wouldn't need to replay any full-page images, normal 
differential WAL records would be enough. In case a), you do, and you 
won't be consistent until replaying all the WAL up to the checkpoint record.


We can exploit those properties to spread out the spike. When you modify 
a page and you're about to write a WAL record, check if the page has the 
BM_CHECKPOINT_NEEDED flag set. If it does, compare the LSN of the page 
against the *previous* checkpoint's redo-pointer, instead of the one
that's currently in progress. If no full-page image is required based on
that comparison, IOW if the page was modified and a full-page image was 
already written after the earlier checkpoint, write a normal WAL record 
without full-page image and set a new flag in the buffer header 
(BM_NEEDS_FPW). Also set a new flag on the WAL record, XLR_FPW_SKIPPED.


When checkpointer (or any other backend that needs to evict a buffer) is 
about to flush a page from the buffer cache that has the BM_NEEDS_FPW 
flag set, write a new WAL record, containing a full-page-image of the 
page, before flushing the page.
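
A rough C sketch of that insert-time decision (BM_NEEDS_FPW, XLR_FPW_SKIPPED
and the previous-checkpoint redo pointer are names from this proposal and do
not exist in the tree; BM_CHECKPOINT_NEEDED does):

#include "postgres.h"
#include "access/xlogdefs.h"

/*
 * Placeholder definitions so this sketch stands alone.  In the server,
 * BM_CHECKPOINT_NEEDED lives in storage/buf_internals.h; BM_NEEDS_FPW and
 * XLR_FPW_SKIPPED are proposed flags that do not exist yet.
 */
#define BM_CHECKPOINT_NEEDED	(1 << 1)	/* placeholder value */
#define BM_NEEDS_FPW			(1 << 2)	/* proposed, placeholder value */
#define XLR_FPW_SKIPPED			0x01		/* proposed, placeholder value */

/*
 * Decide whether the WAL record being assembled needs a full-page image
 * of this buffer, per the scheme described above.
 */
static bool
fpw_required(XLogRecPtr page_lsn, XLogRecPtr cur_redo_ptr,
			 XLogRecPtr prev_redo_ptr, uint32 *buf_flags, uint8 *rec_info)
{
	/* Usual rule: nothing to do if the page was already logged since redo */
	if (page_lsn > cur_redo_ptr)
		return false;

	/*
	 * A full-page image would normally be required.  If the buffer will be
	 * flushed by the in-progress checkpoint anyway, and an FPW was already
	 * written after the *previous* checkpoint, defer it: whoever flushes
	 * the buffer writes a separate FPW record, and this record is marked
	 * so that replay starting at the new checkpoint can skip it.
	 */
	if ((*buf_flags & BM_CHECKPOINT_NEEDED) != 0 &&
		page_lsn > prev_redo_ptr)
	{
		*buf_flags |= BM_NEEDS_FPW;
		*rec_info |= XLR_FPW_SKIPPED;
		return false;
	}

	return true;
}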


Here's how this works out during replay:

a) You start WAL replay from the latest checkpoint's Redo-pointer.

When you see a WAL record that's been marked with XLR_FPW_SKIPPED, don't 
replay that record at all. It's OK because we know that there will be a 
separate record containing the full-page image of the page later in the 
stream.


b) You are continuing WAL replay that started from an earlier 
checkpoint, and have already reached consistency.


When you see a WAL record that's been marked with XLR_FPW_SKIPPED, 
replay it normally. It's OK, because the flag means that the page was 
modified after the earlier checkpoint already, and hence we must have 
seen a full-page image of it already. When you see one of the WAL 
records containing a separate full-page-image, ignore it.


This scheme makes the b-case behave just as if the new checkpoint was
never started. The regular WAL records in the stream are identical to 
what they would've been if the redo-pointer pointed to the earlier 
checkpoint. And the additional FPW records are simply ignored.


In the a-case, it's not safe to replay the records marked with
XLR_FPW_SKIPPED, because they don't contain FPWs, and you have all the
usual torn-page hazards that come with that. However, the separate FPW
records that come later in the stream will fix up those pages.



Now, I'm sure there are issues with this scheme I haven't thought about, 
but I wanted to get this written down. Note this does not reduce the 
overall WAL volume - on the contrary - but it ought to reduce the spike.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.4 btree index corruption

2014-05-25 Thread Heikki Linnakangas

On 05/25/2014 05:45 PM, Jeff Janes wrote:

On Sun, May 25, 2014 at 7:16 AM, Heikki Linnakangas hlinnakan...@vmware.com

wrote:



On 05/21/2014 10:22 PM, Jeff Janes wrote:


Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
with foreign keys, I found some btree index corruption.

...





https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing



I downloaded the data directory and investigated. I got this message when
I started it up:

20392  2014-05-25 05:51:37.835 PDT:ERROR:  right sibling 4044 of block 460
is not next child 23513 of block 86458 in index foo_p_id_idx
20392  2014-05-25 05:51:37.835 PDT:CONTEXT:  automatic vacuum of table
jjanes.public.foo

Interestingly, it's complaining about parent page 86458, while yours
claimed it was 1264. I don't know why; perhaps a huge number of insertions
happened after that error, causing the parent level pages to be split,
moving the downlinks it complains about to the right. Did you continue
running the test after that error occurred?



Yes, I didn't set it up to abort upon vacuum errors, so it continued to run
until it reached the 1,000,000 wrap around limit.

Between when the vacuums started failing and when it reached wrap around
limit, it seemed to run normally (other than the increasing bloat and
non-advancement of frozenxid).

I only noticed the ERROR when I was looking through the log file in
postmortem.  Looking back, I see 10 errors with 1264, then it switched to
86458.  I didn't notice the change and reported only the first ERROR
message.


Ok, that explains the difference in the error message.


I'll apply your patch and see what happens, but 90 further hours of testing
gave no more occurrences of the error so I'm afraid that not seeing errors
is not much evidence that the error was fixed.  So I'll have to rely on
your knowledge of the code.


I've pushed the fix.


This is what the tree looks like around those pages:

Level 1:
 +-------------+      +-------------+      +-------------+
 | Blk 1264    |      | Blk 160180  |      | Blk 86458   |
 |             |      |             |      |             |
 | Downlinks:  |  ->  | Downlinks:  |  ->  | Downlinks:  |
 | ...         |      | ...         |      |  1269       |
 |             |      |             |      |   460       |
 |             |      |             |      | 23513       |
 +-------------+      +-------------+      +-------------+




Is this done with the pageinspect extension?


Yeah, with some more tooling on top of it. The elegant diagram I drew by 
hand.


Thanks again! BTW, it would be useful to keep the WAL logs for as long 
as you have disk space. I'm not sure if that would've helped in this 
case, but it can be really useful to see all the actions that were done 
to the page before something goes wrong.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Could not finish anti-wraparound VACUUM when stop limit is reached

2014-05-25 Thread Jeff Janes
On Sunday, May 25, 2014, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 While debugging the B-tree bug that Jeff Janes reported (
 http://www.postgresql.org/message-id/CAMkU=1y=VwF07Ay+Cpqk_
 7fpihrctmssv9y99sbghitkxpb...@mail.gmail.com), a new issue came up:

 If you reach the xidStopLimit, and try to run VACUUM, it fails with error:

 jjanes=# vacuum;
 ERROR:  database is not accepting commands to avoid wraparound data loss
 in database jjanes
 HINT:  Stop the postmaster and vacuum that database in single-user mode.
 You might also need to commit or roll back old prepared transactions.


This problem also afflicted me in 9.3 and 9.2 (and probably existed further
back too).  I figured it was mostly a barrier to more effective testing,
but it would be nice to have it fixed.

But I don't understand how you encountered this.   I only ran into it when
the vacuum had already been started, but not yet completed, by the time the
limit was reached.  Once it is already reached, how do you even get the
vacuum to start?  Doesn't it error out right at the beginning?

Jeff's database seems to have wrapped around already, because after fixing
 the above, I get this:


Do you have the patch to fix this?



 jjanes=# vacuum;
 WARNING:  some databases have not been vacuumed in over 2 billion
 transactions
 DETAIL:  You might have already suffered transaction-wraparound data loss.
 VACUUM


This is odd.  When I apply your patch from the other thread to fix the
vacuum, and then start up in single-user mode, I can run vacuum to
completion and re-open the database.  When I first start it up, it says it
needs to be vacuumed within 999,935 transactions.  There is no indication
that it has already suffered a wraparound, just that it was about to do so.


 We do not truncate clog when wraparound has already happened, so we never
 get past that point. Jeff advanced XID counter aggressively with some
 custom C code, so hitting the actual wrap-around is a case of don't do
 that. Still, the case is quite peculiar: pg_controldata says that nextXid
 is 4/1593661139. The oldest datfrozenxid is equal to that, 1593661139. So
 ISTM he managed to not just wrap around, but execute 2 billion more
 transactions after the wraparound and reach datfrozenxid again. I'm not
 sure how that happened.


If it had actually undergone an undetected wraparound, wouldn't data be
disappearing and appearing inappropriately? I think the testing harness
should have detected that inconsistency.

(Also, the max setting for JJ_xid during the test run was 40, so I don't
think it could have blown right past the 1,000,000 safety margin and out
the other side without triggering a shutdown).

Cheers,

Jeff


Re: [HACKERS] pg_upgrade fails: Mismatch of relation OID in database 8.4 - 9.3

2014-05-25 Thread Jeff Ross


On 5/25/14, 11:44 AM, Andres Freund wrote:


Hi,

On 2014-05-23 08:23:57 -0600, Jeff Ross wrote:

UDB=# \x
Expanded display is on.
UDB=# SELECT attrelid::regclass, attname, attnum, attlen, *
FROM pg_attribute
WHERE attrelid = 'masterairportlist'::regclass
ORDER BY attnum ASC;
UDB=#
[ RECORD 1 ]-+--
...


A quick sum over the returned values seems to indicate that it's too
large to not have a toast table. Adding up attlen and atttypmod gives a
value of 1283. Considering that there additionally has to be VARHDRSZ
space for the varlena header and that the maximum storage size for a
varchar(n) is n * pg_encoding_max_length(GetDatabaseEncoding()) (usually
4) this seems to indicate that bad things[tm] have been done to the
database.
I suggest you write a script that does a 'ALTER TABLE $tbl ADD COLUMN
toastme text; ALTER TABLE $tbl DROP COLUMN toastme' for all tables.

Greetings,

Andres Freund



Hi Andres,

Yes, that is exactly what I will do before our next test migration.  I'd
already started on the script, since I could not see any downside to
adding a column big enough to force a toast table and then dropping it,
exactly as Bruce and you suggest, especially if that will let me use
pg_upgrade rather than the traditional dump/restore.


Could a check like this be added to pg_upgrade?  Is there a downside to
adding a column big enough to force a toast table and then dropping it
for any table that is large enough to need a toast table but doesn't have one?


Jeff


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_recvlogical not accepting -I to specify start LSN position

2014-05-25 Thread Michael Paquier
On Mon, May 26, 2014 at 2:01 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2014-05-25 22:35:24 +0900, Michael Paquier wrote:
 Attached patch corrects that, reshuffling at the same time the option
 letters parsed with getopt_long in alphabetical order.

 Hm. Not a big fan of this in isolation. In the attached patch I've
 reordered the options to all be in alphabetical order, but only within
 the section they belong to in the --help output.

 What do you think?
That's fine as well. Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Replacement for OSSP-UUID for Linux and BSD

2014-05-25 Thread Matteo Beccati
Hi Tom,

thanks for the feedback.

On 25/05/2014 21:10, Tom Lane wrote:
 Matteo Beccati p...@beccati.com writes:
 here's the latest version of my uuid changes patch, according to
 proposal (2) from Tom in the thread about OSSP-UUID[1].
 
 Hmm ... this is not actually what I had in mind.  Unless I'm misreading
 the patch, this nukes the uuid-ossp extension entirely in favor of a
 new extension uuid (providing the same SQL functions with a different
 underlying implementation).  I don't think this works from the standpoint
 of providing compatibility for users of the existing extension.
 In particular, it'd break pg_upgrade (because of the change of the .so
 library name) as well as straight pg_dump upgrades (which would expect
 CREATE EXTENSION uuid-ossp to work).  Not to mention application code
 that might expect CREATE EXTENSION uuid-ossp to still work.
 
 Another objection is that for people for whom OSSP uuid still works fine,
 this is forcing them to adopt a new implementation whose compatibility is
 as yet unproven.
 
 What I'd rather do is preserve contrib/uuid-ossp with the same extension
 and .so name, but with two configure options that select different
 underlying implementations.

Sure, that makes sense. I wasn't actually sure it was OK to keep the
OSSP brand even though the extension no longer uses the OSSP library,
hence the renaming. But I do agree upgrades wouldn't be very easy if we
didn't.

 In the long run it'd be nice to migrate away from the uuid-ossp
 extension name, mostly because of the poorly-chosen use of a dash in the
 name.  But I'm not sure how we do that without breaking backwards
 compatibility, and anyway it's an entirely cosmetic thing that we can
 worry about later.
 
 Anyhow, doing it like that seems like it ought to be a pretty
 straightforward refactoring of your patch.  I could pursue that,
 or you can.

I do have a system with the ossp library installed: I'd be happy to give
it a try tomorrow morning my time, depending on my workload. I'll keep
you posted!


Cheers
-- 
Matteo Beccati

Development  Consulting - http://www.beccati.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Fwd: Typo fixes in Solution.pm, part of MSVC scripts

2014-05-25 Thread Michael Paquier
Hi all,

I noticed a couple of typos in Solution.pm, fixed by the attached patch.
They were introduced by commits cec8394 (Visual Studio 2013 support) and
0a78320 (the latest pgindent run; not actually a problem with pgindent
itself, but a misuse of commas).
The tab problems should not be backpatched to 9.3, but the use of
commas instead of semicolons should be.

Regards,
--
Michael
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index 283d399..c0d7f38 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -747,13 +747,13 @@ sub new
 	my $self  = $classname->SUPER::_new(@_);
 	bless($self, $classname);
 
-	$self->{solutionFileVersion}  = '12.00';
-	$self->{vcver}= '12.00';
-	$self->{visualStudioName} = 'Visual Studio 2013';
-	$self->{VisualStudioVersion}  = '12.0.21005.1',
-	  $self->{MinimumVisualStudioVersion} = '10.0.40219.1',
+	$self->{solutionFileVersion}= '12.00';
+	$self->{vcver}  = '12.00';
+	$self->{visualStudioName}   = 'Visual Studio 2013';
+	$self->{VisualStudioVersion}= '12.0.21005.1';
+	$self->{MinimumVisualStudioVersion} = '10.0.40219.1';
 
-	  return $self;
+	return $self;
 }
 
 sub GetAdditionalHeaders

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] IMPORT FOREIGN SCHEMA statement

2014-05-25 Thread David Fetter
On Sun, May 25, 2014 at 11:23:41PM +0200, Ronan Dunklau wrote:
 On Sunday 25 May 2014 at 12:41:18, David Fetter wrote:
  On Fri, May 23, 2014 at 10:08:06PM +0200, Ronan Dunklau wrote:
   Hello,
   
   Since my last proposal didn't get any strong rebuttal, please find
   attached a more complete version of the IMPORT FOREIGN SCHEMA statement.
  
  Thanks!
  
  Please send future patches to this thread so people can track them
  in their mail.
 
 Will do.
 
 I didn't for the previous one because it was a few months ago, and no patch 
 had been added to the commit fest.

Thanks for adding this one :)

   The import_foreign_schema_postgres_fdw patch proposes an implementation of
   this API for postgres_fdw. It will import a foreign schema using the right
   types as well as nullable information.
  
  In the case of PostgreSQL, the right types are obvious until there's
  a user-defined one.  What do you plan to do in that case ?
 
 The current implementation fetches the types as regtype, and when receiving a 
 custom type, two things can happen:
 
  - the type is defined locally: everything will work as expected
  - the type is not defined locally: the conversion function will fail, and 
 raise an error of the form: ERROR:  type schema.typname does not exist
 
 Should I add that to the regression test suite?

Yes.

In the easy case of PostgreSQL, you might also be able to establish
whether the UDT in the "already defined locally" case above is
identical to the one defined remotely, but I don't think it's possible
even in principle for non-PostgreSQL remote systems, possibly not even
for ones with non-identical architecture, PostgreSQL major version,
etc.

   Regarding documentation, I don't really know where it should have been
   put. If I missed something, let me know and I'll try to correct it.
  
  It's not exactly something you missed, but I need to bring it up
  anyway before we go too far.  The standard missed two crucial concepts
  when this part of it was written:
  
  1.  No single per-database-type universal type mapping can be correct.
  
  [snip]
  
  To address these problems, I propose the following:
  
  - We make type mappings settable at the level of:
  - FDW
  - Instance (a.k.a. cluster)
  - Database
  - Schema
  - Table
  - Column
  
  using the existing ALTER command and some way of spelling out how

Oops.  Forgot to include CREATE in the above.

  a remote type maps to a local type.
  
  This would consist of:
  - The remote type
  - The local type to which it maps
  - The inbound transformation (default identity)
  - The outbound transformation (default identity)
  
  At any given level, the remote type would need to be unique.  To
  communicate this to the system, we either invent new syntax, with
  all the hazards attendant thereto, or we could use JSON or similar
  serialization.
  
  ALTER FOREIGN TABLE foo
  ADD TYPE MAPPING
  FROM datetime
  TO TEXT
  WITH (
  INBOUND TRANSFORMATION IDENTITY,
  OUTBOUND TRANSFORMATION IDENTITY
  ) /* Ugh!!! */

"Ugh!!!" means I don't think we should do it this way.

  vs.
  
  ALTER FOREIGN TABLE foo ADD (mapping '{
  datetime: text,
  inbound: IDENTITY,
  outbound: IDENTITY
  }')
  
  Each FDW would have some set of default mappings and some way to
  override them as above.
 
 I understand your points, but I'm not really comfortable with the
 concept, unless there is something that I missed.

My poor communication ability might have a lot to do with it.  I
assure you my explanation would have been even worse if I had tried it
in French, though. :P

 We can already support this use case through FDW-specific options. Should I
 add that to postgres_fdw?

I believe the capability belongs in our FDW API with the decision of
whether to implement it up to FDW authors.  They know (or should know)
how to throw ERRCODE_FEATURE_NOT_SUPPORTED.
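
(For example, such an FDW could reject the whole feature with something like
the following; the helper name and message text are made up, but errcode()
and ERRCODE_FEATURE_NOT_SUPPORTED are the existing mechanism.)

#include "postgres.h"

/* Hypothetical helper an FDW could call from its type-mapping entry point. */
static void
my_fdw_reject_type_mappings(void)
{
	ereport(ERROR,
			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
			 errmsg("type mappings are not supported by this foreign-data wrapper")));
}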

 Additionally, I don't really see how that would be useful in the general
 case. With an in-core defined meaning of type transformation, any FDW that
 doesn't fit exactly into the model would have a hard time. For example,
 what happens if an FDW is only ever capable of returning text?

That's actually the case where it's most important to have the feature
all the way down to the column level.

 Or if a mapping can only be set at the server or FDW model because
 it depends on some connection parameter ?

ERRCODE_FEATURE_NOT_SUPPORTED.

 The bulk of the code for managing type mappings would be
 FDW-specific anyway.

The part that actually does the transformations would necessarily be
part of each FDW.  I omitted opining on whether such transformations
should be assumed to be local or remote because I can imagine cases
where only local (or only remote) could be correct.

 What you're proposing looks like a universal option, with a
 specific syntax, 

Re: [HACKERS] relaying errors from background workers

2014-05-25 Thread Amit Kapila
On Thu, May 22, 2014 at 9:51 AM, Robert Haas robertmh...@gmail.com wrote:

 Suppose a user backend starts a background worker for some purpose;
 the background worker dies with an error.  The infrastructure we have
 today is sufficient for the user backend to discover that the worker
 backend has died, but not why.  There might be an error in the server
 log, but the error information won't be transmitted back to the user
 backend in any way.  I think it would be nice to fix that.  I also
 think it would be nice to be able to relay not only errors, but also
 messages logged via ereport() or elog() at lower log levels (WARNING,
 NOTICE, INFO, DEBUG).

 The design I have in mind is to teach elog.c how to write such
 messages to a shm_mq.  This is in fact one of the major use cases I
 had in mind when I designed the shm_mq infrastructure, because it
 seems to me that almost anything we want to do in parallel is likely
 to want to do this.  Even aside from parallelism, it's not too hard to
 imagine wanting to use background workers to launch a job in the
 background and then come back later and see what happened.  If there
 was an error, you're going to want go know specifically what went
 wrong, not just that something went wrong.

 The main thing I'm not sure about is how to format the message that we
 write to the shm_mq.  One option is to try to use the good old FEBE
 protocol.  This doesn't look entirely straightforward, because
 send_message_to_frontend() assembles the message using pq_sendbyte()
 and pq_sendstring(), and then sends it out to the client using
 pq_endmessage() and pq_flush().  This infrastructure assumes that the
 connection to the frontend is a socket.  It doesn't seem impossible to
 kludge that infrastructure to be able to send to either a socket or a
 shm_mq, but I'm not sure whether it's a good idea.

I think it would be better to keep the message-assembly part similar to
the current code and only change the way the message is communicated to
the backend. That would make rethrowing the message to the client simpler.
We already have mechanisms for reporting to the client and to the server
log (send_message_to_server_log()/send_message_to_frontend()), so
devising a similar path for communicating with the user backend seems
plausible.
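
To make that concrete, here is a rough sketch (not a worked-out patch) of how
a worker could push its ErrorData into a shm_mq through the existing
emit_log_hook; the message layout and function names are invented for
illustration:

#include "postgres.h"
#include "libpq/pqformat.h"
#include "storage/shm_mq.h"
#include "utils/elog.h"

/* Assumed to have been attached by the worker to a queue its creator reads. */
static shm_mq_handle *error_mqh = NULL;
static emit_log_hook_type prev_emit_log_hook = NULL;

/*
 * emit_log_hook callback: serialize the interesting parts of the ErrorData
 * into an FEBE-ish 'E'/'N' message and push it into the shared queue.
 * (The hook fires for messages that are headed for the server log.)
 */
static void
relay_message_to_backend(ErrorData *edata)
{
	if (error_mqh != NULL)
	{
		StringInfoData buf;

		pq_beginmessage(&buf, (edata->elevel >= ERROR) ? 'E' : 'N');
		pq_sendint(&buf, edata->elevel, 4);
		pq_sendstring(&buf, edata->message ? edata->message : "");
		pq_sendstring(&buf, edata->detail ? edata->detail : "");
		pq_sendstring(&buf, edata->hint ? edata->hint : "");

		/* Block until the receiving backend has made room in the queue. */
		(void) shm_mq_send(error_mqh, buf.len, buf.data, false);
		pfree(buf.data);
	}

	if (prev_emit_log_hook)
		prev_emit_log_hook(edata);
}

/* Called from the worker's main function once error_mqh is set up. */
static void
install_error_relay(void)
{
	prev_emit_log_hook = emit_log_hook;
	emit_log_hook = relay_message_to_backend;
}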


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] pg_upgrade fails: Mismatch of relation OID in database 8.4 - 9.3

2014-05-25 Thread Andres Freund
Hi,

On 2014-05-25 16:59:24 -0600, Jeff Ross wrote:
 Could a check like this be added to pg_upgrade?
 Is there a downside to
 adding a column big enough to force a toast table and then dropping it for
 any table that is large enough to need a toast table but doesn't have one?

It can take time and permanently 'use up' an attribute number. So you'd
add costs for everyone for the few people who decided to do something
dangerous and unsupported...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.4 btree index corruption

2014-05-25 Thread Andres Freund
Hi,

On 2014-05-25 14:45:38 -0700, Jeff Janes wrote:
 You're welcome.  If only I was as good at fixing things as at breaking
 them.

At the moment there seems to be a bigger shortage of folks being good at
breaking stuff - before the release! - than of people fixing the
resulting breakage...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SQL access to database attributes

2014-05-25 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 I don't really object to doing an unlocked check for another such
 database, but I'm not convinced that additional locking to try to
 prevent a race is worth its keep.

+1 on the nannyism, and +1 to ignoring the race.

Thanks,

Stephen




Re: [HACKERS] Sending out a request for more buildfarm animals?

2014-05-25 Thread Andres Freund
On 2014-05-25 16:58:39 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-05-25 01:02:25 +0200, Tomas Vondra wrote:
  The cache invalidation bug was apparently fixed, but we're still getting
  failures (see for example markhor):
  http://www.pgbuildfarm.org/cgi-bin/show_history.pl?nm=markhorbr=HEAD
  I see there's a transaction (COMMIT+BEGIN) - is this caused by the
  extremely long runtimes?
 
  Yes, that's the reason. Normally the test doesn't trigger autovacuum at
  all, but if it's running for a *long* time it can. I haven't yet figured
  out a good way to deal with that.
 
 Any way to make the test print only WAL entries arising from the
 foreground transaction?

None that doesn't suck, so far :(. The least bad I can think of is to
just add an xinfo flag for such 'background' transaction commits. Don't
like it much...
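
(Very roughly, as a sketch only: the flag name and bit value are hypothetical,
while xl_xact_commit's xinfo field and IsAutoVacuumWorkerProcess() exist
today.)

#include "postgres.h"
#include "access/xact.h"			/* xl_xact_commit */
#include "postmaster/autovacuum.h"	/* IsAutoVacuumWorkerProcess() */

/* Hypothetical flag, using a currently unused xinfo bit. */
#define XACT_COMPLETION_BACKGROUND	0x04

/* Writer side: would run while filling in the commit record. */
static void
mark_background_commit(xl_xact_commit *xlrec)
{
	if (IsAutoVacuumWorkerProcess())
		xlrec->xinfo |= XACT_COMPLETION_BACKGROUND;
}

/* Decoding side: lets test_decoding skip such commits entirely. */
static bool
commit_is_background(const xl_xact_commit *xlrec)
{
	return (xlrec->xinfo & XACT_COMPLETION_BACKGROUND) != 0;
}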

 If an autovac run can trigger this failure, then I would think it would
 happen sometimes, probabilistically, even when the test runtime wasn't
 all that long.  That would be very unhappy-making, eg for packagers
 who would like build runs to reliably work the first time.  So I think
 this is important even without the desire to run CLOBBER_CACHE regression
 tests.

Agreed.

 Another idea is to provide a variant expected file, but that seems
 a bit fragile: if you can get one extra transaction, why not two?

Yeah, that's not going to work. Autovac's transaction could appear at
different places in the changestream. We probably don't want an expected
file listing them all.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

