Re: [HACKERS] WITH CHECK OPTION for auto-updatable views

2013-06-22 Thread Stephen Frost
Dean,

* Dean Rasheed (dean.a.rash...@gmail.com) wrote:
 Here's an updated version --- I missed the necessary update to the
 check_option column of information_schema.views.

Thanks!  This is really looking quite good, but it's a bit late and I'm
going on vacation tomorrow, so I didn't quite want to commit it yet. :)
Instead, here are a few things that I'd like to see fixed up:

I could word-smith the docs all day, most likely, but at least the
following would be nice to have cleaned up:

 - 'This is parameter may be either'

 - I don't like "This allows an existing view's ..." or "The option can be
   used on CREATE VIEW as well as ALTER VIEW."  I'd say something like:

   This parameter may be either <literal>local</> or
   <literal>cascaded</>, and is equivalent to specifying <literal>WITH [
   CASCADED | LOCAL ] CHECK OPTION</> (see below).  This option can be
   changed on existing views using <xref linkend="sql-alterview">.

 - wrt what shows up in '\h create view' and '\h alter view', I think we
   should go ahead and add in what the options are, ala EXPLAIN.  That
   avoids having to guess at it (I was trying 'with_check_option'
   initially :).

 - Supposedly, this option isn't available for RECURSIVE views, but it's
   happily accepted: 

=*# create recursive view qq (a) with (check_option = local) as select z from q;
CREATE VIEW

(same is true of ALTER VIEW on a RECURSIVE view)

 - pg_dump support is there, but it outputs the definition using the PG
   syntax instead of the SQL syntax; is there any particular reason for
   this..?  imv, we should be dumping SQL spec where we can trivially
   do so.

 - Why check_option_offset instead of simply check_option..?  We don't
   have security_barrier_offset and it seems like we should be
   consistent there.

The rest looks pretty good to me.  If you can fix the above, I'll review
again and would be happy to commit it. :)

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Stephen Frost
Noah,

* Noah Misch (n...@leadboat.com) wrote:
 This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
 a higher MaxAllocHugeSize limit of SIZE_MAX/2.  

Nice!  I've complained about this limit a few different times and just
never got around to addressing it.
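
For illustration, a minimal sketch of what the new interface buys a caller,
going by the description quoted above (start with an ordinary palloc() and
later grow the same chunk past the old 1 GiB cap with repalloc_huge(); the
call sites here are made up):

    Size    nelems = 100000;
    int64  *arr = (int64 *) palloc(nelems * sizeof(int64));

    /* ... much later, when far more room is needed ... */
    nelems = 300000000;             /* ~2.4 GB of int64, well past MaxAllocSize */
    arr = (int64 *) repalloc_huge(arr, nelems * sizeof(int64));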

 This was made easier by tuplesort growth algorithm improvements in commit
 8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come up before
 (TODO item Allow sorts to use more available memory), and Tom floated the
 idea[1] behind the approach I've used.  The next limit faced by sorts is
 INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
 150 GiB when sorting int4.

That's frustratingly small. :(

[...]
 --- 1024,1041 ----
   	 * new array elements even if no other memory were currently used.
   	 *
   	 * We do the arithmetic in float8, because otherwise the product of
 ! 	 * memtupsize and allowedMem could overflow.  Any inaccuracy in the
 ! 	 * result should be insignificant; but even if we computed a
 ! 	 * completely insane result, the checks below will prevent anything
 ! 	 * really bad from happening.
   	 */
   	double		grow_ratio;
 
   	grow_ratio = (double) state->allowedMem / (double) memNowUsed;
 ! 	if (memtupsize * grow_ratio < INT_MAX)
 ! 		newmemtupsize = (int) (memtupsize * grow_ratio);
 ! 	else
 ! 		newmemtupsize = INT_MAX;
 
   	/* We won't make any further enlargement attempts */
   	state->growmemtuples = false;

I'm not a huge fan of moving directly to INT_MAX.  Are we confident that
everything can handle that cleanly..?  I feel like it might be a bit
safer to shy a bit short of INT_MAX (say, by 1K).  Perhaps that's overly
paranoid, but there's an awful lot of callers and some loop which +2's
and then overflows would suck, eg:

int x = INT_MAX;
for (x-1; (x-1) < INT_MAX; x += 2) {
    myarray[x] = 5;
}

Also, could this be used to support hashing larger sets..?  If we change
NTUP_PER_BUCKET to one, we could end up wanting to create a hash table
larger than INT_MAX since, with 8-byte pointers, that'd only be around
134M tuples.

Haven't had a chance to review the rest, but +1 on the overall idea. :)

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Simon Riggs
On 13 May 2013 15:26, Noah Misch n...@leadboat.com wrote:
 A memory chunk allocated through the existing palloc.h interfaces is limited
 to MaxAllocSize (~1 GiB).  This is best for most callers; SET_VARSIZE() need
 not check its own 1 GiB limit, and algorithms that grow a buffer by doubling
 need not check for overflow.  However, a handful of callers are quite happy to
 navigate those hazards in exchange for the ability to allocate a larger chunk.

 This patch introduces MemoryContextAllocHuge() and repalloc_huge() that check
 a higher MaxAllocHugeSize limit of SIZE_MAX/2.  Chunks don't bother recording
 whether they were allocated as huge; one can start with palloc() and then
 repalloc_huge() to grow the value.

I like the design and think it's workable.

I'm concerned that people will accidentally use MaxAllocSize. Can we
put in a runtime warning if someone tests AllocSizeIsValid() with a
larger value?

  To demonstrate, I put this to use in
 tuplesort.c; the patch also updates tuplestore.c to keep them similar.  Here's
 the trace_sort from building the pgbench_accounts primary key at scale factor
 7500, maintenance_work_mem = '56GB'; memtuples itself consumed 17.2 GiB:

 LOG:  internal sort ended, 48603324 KB used: CPU 75.65s/305.46u sec elapsed 
 391.21 sec

 Compare:

 LOG:  external sort ended, 1832846 disk blocks used: CPU 77.45s/988.11u sec 
 elapsed 1146.05 sec

Cool.

I'd like to put in an explicit test for this somewhere. Obviously not
part of normal regression, but somewhere, at least, so we have
automated testing that we all agree on. (yes, I know we don't have
that for replication/recovery yet, but that's why I don't want to
repeat that mistake).

 This was made easier by tuplesort growth algorithm improvements in commit
 8ae35e91807508872cabd3b0e8db35fc78e194ac.  The problem has come up before
 (TODO item Allow sorts to use more available memory), and Tom floated the
 idea[1] behind the approach I've used.  The next limit faced by sorts is
 INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
 150 GiB when sorting int4.

 I have not added variants like palloc_huge() and palloc0_huge(), and I have
 not added to the frontend palloc.h interface.  There's no particular barrier
 to doing any of that.  I don't expect more than a dozen or so callers, so most
 of the variations might go unused.

 The comment at MaxAllocSize said that aset.c expects doubling the size of an
 arbitrary allocation to never overflow, but I couldn't find the code in
 question.  AllocSetAlloc() does double sizes of blocks used to aggregate small
 allocations, so maxBlockSize had better stay under SIZE_MAX/2.  Nonetheless,
 that expectation does apply to dozens of repalloc() users outside aset.c, and
 I preserved it for repalloc_huge().  64-bit builds will never notice, and I
 won't cry for the resulting 2 GiB limit on 32-bit.

Agreed. Can we document this for the relevant parameters?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Simon Riggs
On 22 June 2013 08:46, Stephen Frost sfr...@snowman.net wrote:

The next limit faced by sorts is
 INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
 150 GiB when sorting int4.

 That's frustratingly small. :(


But that has nothing to do with this patch, right? And is easily fixed, yes?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] XLogInsert scaling, revisited

2013-06-22 Thread Heikki Linnakangas

On 21.06.2013 21:55, Jeff Janes wrote:

I think I'm getting an undetected deadlock between the checkpointer and a
user process running a TRUNCATE command.

This is the checkpointer:

#0  0x003a73eeaf37 in semop () from /lib64/libc.so.6
#1  0x005ff847 in PGSemaphoreLock (sema=0x7f8c0a4eb730,
interruptOK=0 '\000') at pg_sema.c:415
#2  0x004b0abf in WaitOnSlot (upto=416178159648) at xlog.c:1775
#3  WaitXLogInsertionsToFinish (upto=416178159648) at xlog.c:2086
#4  0x004b657a in CopyXLogRecordToWAL (write_len=32, isLogSwitch=1
'\001', rdata=0x0, StartPos=<value optimized out>, EndPos=416192397312)
 at xlog.c:1389
#5  0x004b6fb2 in XLogInsert (rmid=0 '\000', info=<value optimized
out>, rdata=0x7fff0020) at xlog.c:1209
#6  0x004b7644 in RequestXLogSwitch () at xlog.c:8748


Hmm, it looks like the xlog-switch is trying to wait for itself to 
finish. The concurrent TRUNCATE is just being blocked behind the 
xlog-switch, which is stuck on itself.


I wasn't able to reproduce exactly that, but I got a PANIC by running 
pgbench and concurrently doing select pg_switch_xlog() many times in psql.


Attached is a new version that fixes at least the problem I saw. Not 
sure if it fixes what you saw, but it's worth a try. How easily can you 
reproduce that?



This is using the same testing harness as in the last round of this patch.


This one? 
http://www.postgresql.org/message-id/CAMkU=1xoa6fdyoj_4fmlqpiczr1v9gp7clnxjdhu+iggqb6...@mail.gmail.com



Is there a way for me to dump the list of held/waiting lwlocks from gdb?


You can print out the held_lwlocks array. Or to make it more friendly, 
write a function that prints it out and call that from gdb. There's no 
easy way to print out who's waiting for what that I know of.
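
For instance, a throwaway helper along these lines could be dropped into
lwlock.c (this assumes the held_lwlocks[] / num_held_lwlocks bookkeeping that
file keeps; the function itself is hypothetical) and invoked from gdb with
"call print_held_lwlocks()":

    void
    print_held_lwlocks(void)
    {
        int     i;

        /* dump the lock ids this backend currently holds to stderr */
        for (i = 0; i < num_held_lwlocks; i++)
            fprintf(stderr, "held lwlock %d: %d\n", i, (int) held_lwlocks[i]);
    }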


Thanks for the testing!

- Heikki


xloginsert-scale-24.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Stephen Frost
* Simon Riggs (si...@2ndquadrant.com) wrote:
 On 22 June 2013 08:46, Stephen Frost sfr...@snowman.net wrote:
 The next limit faced by sorts is
  INT_MAX concurrent tuples in memory, which limits helpful work_mem to about
  150 GiB when sorting int4.
 
  That's frustratingly small. :(
 
 But that has nothing to do with this patch, right? And is easily fixed, yes?

I don't know about 'easily fixed' (consider supporting a HashJoin of 2B
records) but I do agree that dealing with places in the code where we are
using an int4 to keep track of the number of objects in memory is outside
the scope of this patch.

Hopefully we are properly range-checking and limiting ourselves to only
what a given node can support, rather than solely depending on MaxAllocSize
to keep us from overflowing some int4 that we're using as an array index
or as a count of how many objects we've currently got in memory.  Still,
we'll want to consider carefully what happens with such large sets as we
add support for these huge allocations into nodes (along with the recent
change to allow 1TB work_mem, which may encourage users with systems large
enough to actually try setting it that high... :)

Thanks,

Stephen


signature.asc
Description: Digital signature


[HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Simon Riggs
Previous discussions of Hash Joins have noted that the performance
decreases when the average number of tuples per bucket increases.
O(N^2) effects are seen.

We've argued about this in many different ways, yet all of those
discussions have centred around the constant NTUP_PER_BUCKET. I
believe that was a subtle mistake, and I have a proposal.

The code in ExecChooseHashTableSize() line 460 says
 /*
  * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
  * memory is filled.
...

but the calculation then sets the number of buckets like this

 dbuckets = ceil(ntuples / NTUP_PER_BUCKET);

**This doesn't match the comment.** If we underestimate the number
of tuples and go on to fill the available memory, we end up with
an average number of tuples per bucket higher than NTUP_PER_BUCKET.
That notational confusion has been skewing the discussion.

The correct calculation that would match the objective set out in the
comment would be

 dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;

This leads us to a much more useful value of dbuckets in the case
where the estimated ntuples occupy much less space than is available.
The value is always the same or higher than before because of the
if-test that surrounds it.

Given my experience that on larger tables we end up underestimating
ndistinct by 10-100-1000 times, I don't think this change is
unwarranted.
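
To make the difference concrete, here is a sketch with made-up numbers,
assuming NTUP_PER_BUCKET is 10 and the build side was underestimated by 100x:

    /* suppose work_mem can hold 1,000,000 tuples of this width ... */
    double  hash_table_bytes = 1000000.0 * tupsize;
    /* ... but the planner only expects 10,000 of them */
    double  ntuples = 10000.0;

    /* current code: sized from the (under)estimated row count */
    dbuckets = ceil(ntuples / NTUP_PER_BUCKET);                 /* 1,000 buckets */

    /* calculation matching the comment: sized from what memory can hold */
    dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;  /* 100,000 buckets */

If the table then fills up with the full 1,000,000 tuples, the first
calculation leaves an average of 1,000 tuples per bucket; the second keeps
it at 10.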

This solves the problem in earlier discussions since we get a lower
average number of tuples per bucket and yet we also get to keep the
current NTUP_PER_BUCKET value. Everybody wins.

Comments?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


rethink_num_hash_buckets.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT

2013-06-22 Thread Cédric Villemain
On Monday, June 17, 2013 at 00:02:21, Fabien COELHO wrote:
  What activity would you expect?
  
  A patch which applies cleanly to git HEAD.  This one doesn't for me,
  although I'm not really sure why, I don't see any obvious conflicts.
 
 Please find attached a freshly generated patch against current master.

* Submission review: 
The patch is in unified format and applies on HEAD.
The patch contains documentation.  However, I believe 'AS IMPLICIT' is a PostgreSQL 
extension with special behavior, while 'AS EXPLICIT' respects the standard except 
that PostgreSQL merely adds the words 'AS EXPLICIT' (it is also the default 
in the standard). So maybe it is possible to rephrase this piece:

@@ -411,8 +427,8 @@ CREATE CAST (bigint AS int4) WITH FUNCTION int4(bigint) AS ASSIGNMENT;
    <acronym>SQL</acronym> standard,
    except that SQL does not make provisions for binary-coercible
    types or extra arguments to implementation functions.
-   <literal>AS IMPLICIT</> is a <productname>PostgreSQL</productname>
-   extension, too.
+   <literal>AS IMPLICIT</> and <literal>AS EXPLICIT</> are
+   a <productname>PostgreSQL</productname> extension, too.
   </para>
  </refsect1>

After digging in the archive and the CF: original request is at  
https://commitfest.postgresql.org/action/patch_view?id=563

* Usability review 
** Does the patch actually implement that? yes
** Do we want that? 
Back in 2012 Tom raised arguments against it, or at least did not give a clear +1. 
The patch adds nothing but a more explicit creation statement; it has gone 
untouched for 2 years without interest, so it is surely not something really 
important for PostgreSQL users. However, we already have non-standard words for 
CREATE CAST, and this new one is not very intrusive.

** Does it follow SQL spec, or the community-agreed behavior? 
It does not follow SQL spec.

** Does it include pg_dump support (if applicable)?
No, but it is probably not interesting to add that to the pg_dump output: it 
increases incompatibility with the SQL spec for no gain. The result is that the 
patch only allows CREATE CAST .. AS EXPLICIT to run without error. Then pg_dump 
won't know whether the CAST was created with the default or an 'explicit 
default'...
 
** Are there dangers? 
It seems not.

* Feature test 
** Does the feature work as advertised? Yes
** Are there corner cases the author has failed to consider?
I think not, but my skills with the parser are limited (gram.y, ...).
** Are there any assertion failures or crashes?
no

* Performance review: not relevant.

* Coding review 
The patch does not pass this test:
./check_keywords.pl gram.y ../../../src/include/parser/kwlist.h

I had to update the unreserved keyword list in order to be able to build 
PostgreSQL.

** Does it follow the project coding guidelines? yes
** Are there portability issues? no (only for SQL)
** Will it work on Windows/BSD etc?  yes
** Are the comments sufficient and accurate? Yes
** Does it do what it says, correctly? Yes
** Does it produce compiler warnings? It doesn't build as is; the patch needs an update.
** Can you make it crash? no

* Architecture review
** Is everything done in a way that fits together coherently with other 
features/modules? Yes
** Are there interdependencies that can cause problems? No.

I flag it 'returned with feedback'; please update the patch so it builds. 
Everything else is OK.
-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


signature.asc
Description: This is a digitally signed message part.


Re: [HACKERS] Hardware donation

2013-06-22 Thread Simon Riggs
On 21 June 2013 20:03, Jim Nasby j...@nasby.net wrote:

 Who can be point of contact from the community to arrange shipping, etc?

Do they need to be shipped? Can we just leave them where they are and
arrange for access and power charges to be passed to SPI? It sounds like it
would be cheaper and easier to leave them where they are, and that way they
won't get damaged in transit either. Of course, that may not be possible.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-22 Thread Andres Freund
On 2013-06-22 12:50:52 +0900, Michael Paquier wrote:
 On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  Hm. Looking at how this is currently used - I am afraid it's not
  correct... the reason RelationGetIndexList() returns a copy is that
  cache invalidations will throw away that list. And you do index_open()
  while iterating over it which will accept invalidation messages.
  Maybe it's better to try using RelationGetIndexList directly and measure
  whether that has a measurable impact.
 By looking at the comments of RelationGetIndexList:relcache.c,
 actually the method of the patch is correct because in the event of a
 shared cache invalidation, rd_indexvalid is set to 0 when the index
 list is reset, so the index list would get recomputed even in the case
 of shared mem invalidation.

The problem I see is something else. Consider code like the following:

RelationFetchIndexListIfInvalid(toastrel);
 foreach(lc, toastrel->rd_indexlist)
   toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock);

index_open calls relation_open calls LockRelationOid which does:
if (res != LOCKACQUIRE_ALREADY_HELD)
   AcceptInvalidationMessages();

So, what might happen is that you open the first index, which accepts an
invalidation message which in turn might delete the indexlist. Which
means we would likely read invalid memory if there are two indexes.
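
For comparison, here is a sketch (using the names from the snippet above) of
the pattern RelationGetIndexList() is meant to support: iterate over a private
copy, so that an invalidation accepted inside index_open() cannot free the
list under us:

    List       *indexlist = RelationGetIndexList(toastrel);    /* returns a copy */
    ListCell   *lc;
    int         i = 0;

    foreach(lc, indexlist)
        toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock);

    list_free(indexlist);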

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Andres Freund
On 2013-06-21 23:19:27 -0400, Tom Lane wrote:
 Robert Haas robertmh...@gmail.com writes:
  On Fri, Jun 21, 2013 at 5:44 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  The traditional theory has been that that would be less robust, not
  more so.  Child backends are (mostly) able to carry out queries whether
  or not the postmaster is around.
 
  I think that's the Tom Lane theory.  The Robert Haas theory is that if
  the postmaster has died, there's no reason to suppose that it hasn't
  corrupted shared memory on the way down, or that the system isn't
  otherwise heavily fuxxored in some way.
 
 Eh?  The postmaster does its level best never to touch shared memory
 (after initialization anyway).

I am not sure that will never happen - but I think the chain of argument
misses the main point. Normally we rely on the postmaster to kill off
all other backends if a backend PANICs or segfaults for all the known
reasons. As soon as there's no postmaster anymore we lose that
capability.
And *that* is scary imo. Especially as I would say the chance of getting
PANICs or segfaults increases if there's no postmaster anymore since we
might reach code branches we otherwise won't.

  True, you can't make new connections,
  but how does killing the existing children make that better?
 
  It allows you to start a new postmaster in a timely fashion, instead
  of waiting for an idle connection that may not ever terminate without
  operator intervention.

And it's not always easy to figure out which cluster those backends
belong to if there are multiple postgres instances running as the same user.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-22 Thread Michael Paquier
On Sat, Jun 22, 2013 at 10:34 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-06-22 12:50:52 +0900, Michael Paquier wrote:
 On Fri, Jun 21, 2013 at 10:47 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  Hm. Looking at how this is currently used - I am afraid it's not
  correct... the reason RelationGetIndexList() returns a copy is that
  cache invalidations will throw away that list. And you do index_open()
  while iterating over it which will accept invalidation messages.
  Maybe it's better to try using RelationGetIndexList directly and measure
  whether that has a measurable impact.
 By looking at the comments of RelationGetIndexList:relcache.c,
 actually the method of the patch is correct because in the event of a
 shared cache invalidation, rd_indexvalid is set to 0 when the index
 list is reset, so the index list would get recomputed even in the case
 of shared mem invalidation.

 The problem I see is something else. Consider code like the following:

 RelationFetchIndexListIfInvalid(toastrel);
 foreach(lc, toastrel->rd_indexlist)
toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock);

 index_open calls relation_open calls LockRelationOid which does:
 if (res != LOCKACQUIRE_ALREADY_HELD)
AcceptInvalidationMessages();

 So, what might happen is that you open the first index, which accepts an
 invalidation message which in turn might delete the indexlist. Which
 means we would likely read invalid memory if there are two indexes.
And I imagine that you have the same problem even with
RelationGetIndexList, not only RelationGetIndexListIfInvalid, because
this would appear as soon as you try to open more than one index from an
index list.
--
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-22 Thread Andres Freund
On 2013-06-22 22:45:26 +0900, Michael Paquier wrote:
 On Sat, Jun 22, 2013 at 10:34 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  On 2013-06-22 12:50:52 +0900, Michael Paquier wrote:
  By looking at the comments of RelationGetIndexList:relcache.c,
  actually the method of the patch is correct because in the event of a
  shared cache invalidation, rd_indexvalid is set to 0 when the index
  list is reset, so the index list would get recomputed even in the case
  of shared mem invalidation.
 
  The problem I see is something else. Consider code like the following:
 
  RelationFetchIndexListIfInvalid(toastrel);
  foreach(lc, toastrel->rd_indexlist)
 toastidxs[i++] = index_open(lfirst_oid(lc), RowExclusiveLock);
 
  index_open calls relation_open calls LockRelationOid which does:
  if (res != LOCKACQUIRE_ALREADY_HELD)
 AcceptInvalidationMessages();
 
  So, what might happen is that you open the first index, which accepts an
  invalidation message which in turn might delete the indexlist. Which
  means we would likely read invalid memory if there are two indexes.
 And I imagine that you have the same problem even with
 RelationGetIndexList, not only RelationGetIndexListIfInvalid, because
 this would appear as long as you try to open more than 1 index with an
 index list.

No. RelationGetIndexList() returns a copy of the list for exactly that
reason. The danger is not that we see an outdated list - we should be
protected by locks against that - but that we look at uninitialized or
reused memory.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Stephen Frost
Simon,

* Simon Riggs (si...@2ndquadrant.com) wrote:
 Previous discussions of Hash Joins have noted that the performance
 decreases when the average number of tuples per bucket increases.

Having the hash table so small that we have hash bucket collisions with
different 32bit hash values is extremely painful, however..

 The correct calculation that would match the objective set out in the
 comment would be
 
  dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;

This looks to be driving the hash table size off of "how many tuples
of this size can I fit into memory?" while ignoring how many actual
rows we have to hash.  Consider a work_mem of 1GB with a small number
of rows to actually hash - say 50.  With a tupsize of 8 bytes, we'd be
creating a hash table sized for some 13M buckets.
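
For reference, the rough arithmetic behind that 13M figure, assuming
NTUP_PER_BUCKET stays at 10:

    double  hash_table_bytes = 1024.0 * 1024.0 * 1024.0;   /* 1GB of work_mem */
    double  tupsize = 8.0;                                  /* from the example */
    double  dbuckets = (hash_table_bytes / tupsize) / 10;   /* ~13.4M buckets for 50 rows */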

 This solves the problem in earlier discussions since we get a lower
 average number of tuples per bucket and yet we also get to keep the
 current NTUP_PER_BUCKET value. Everybody wins.

I agree that this should reduce the number of tuples per bucket, but I
don't think it's doing what you expect and based on what it's doing, it
seems wasteful.

 Comments?

Based on your argument that we want to have a bucket load which is, on
average, the size of NTUP_PER_BUCKET, I have to '-1' on this.  What we
want is to size the table large enough that we never have any true
collisions (cases where different 32-bit hash values end up in the same
bucket) *for all tuples being hashed*; that includes both the side
building the hash table and the side doing the scan.  This should be
done when we can avoid batching - if we don't have enough memory to avoid
batching while having such a large table, we should consider adjusting
the hash table size down some because, in those cases, we're memory
constrained.

When we have enough memory to avoid batching, we never want to have to
check down through a bucket for a tuple which doesn't exist - we should
simply be able to look in the hash table itself, determine (with a
single fetch) that there are no entries for that hash value, and throw
the tuple away and move on to the next.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Unaccent performance

2013-06-22 Thread Andres Freund
On 2013-06-21 22:52:04 +0100, Thom Brown wrote:
  CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
   RETURNS text
   LANGUAGE sql
   IMMUTABLE
  AS $function$
  SELECT
  replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
  ;
  $function$
 

 Another test passing in a string of 10 characters gives the following
 timings:
 
 unaccent: 240619.395 ms
 myunaccent: 785.505 ms

The reason for that is that unaccent is 'stable' while your function is
'immutable', so the planner recognizes that and computes it only once
since you're always passing the same text string to it.

 Another test inserting long text strings into a text column of a table
 100,000 times, then updating another column to have that unaccented value
 using both methods:
 
 unaccent: 3867.306 ms
 myunaccent: 43611.732 ms

Whereas it cannot recognize that in this case.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Implementing incremental backup

2013-06-22 Thread Cédric Villemain
On Saturday, June 22, 2013 at 01:09:20, Jehan-Guillaume (ioguix) de Rorthais 
wrote:
 On 20/06/2013 03:25, Tatsuo Ishii wrote:
  On Wed, Jun 19, 2013 at 8:40 PM, Tatsuo Ishii is...@postgresql.org 
wrote:
  On Wed, Jun 19, 2013 at 6:20 PM, Stephen Frost sfr...@snowman.net 
wrote:
  * Claudio Freire (klaussfre...@gmail.com) wrote:
 [...]
 
  The only bottleneck here, is WAL archiving. This assumes you can
  afford WAL archiving at least to a local filesystem, and that the WAL
  compressor is able to cope with WAL bandwidth. But I have no reason to
  think you'd be able to cope with dirty-map updates anyway if you were
  saturating the WAL compressor, as the compressor is more efficient on
  amortized cost per transaction than the dirty-map approach.
  
  Thank you for detailed explanation. I will think more about this.
 
 Just for the record, I have been mulling over this idea for a bunch of
 months. I even talked about that with Dimitri Fontaine some weeks ago
 with some beers :)
 
 My idea came from a customer during a training session who explained to me
 the difference between differential and incremental backup in Oracle.
 
 My approach would have been to create a standalone tool (say
 pg_walaggregate) which takes a bunch of WAL from archives and merge them
 in a single big file, keeping only the very last version of each page
 after aggregating all their changes. The resulting file, aggregating all
 the changes from given WAL files would be the differential backup.
 
 A differential backup resulting from a bunch of WAL between W1 and Wn
 would help to recover much faster to the time of Wn than replaying all
 the WALs between W1 and Wn and saves a lot of space.
 
 I was hoping to find some time to dig around this idea, but as the
 subject rose here, then here are my 2¢!

something like that maybe :
./pg_xlogdump -b \
../data/pg_xlog/00010001 \   
../data/pg_xlog/00010005| \
grep 'backup bkp' | awk '{print ($5,$9)}'

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation


signature.asc
Description: This is a digitally signed message part.


Re: [HACKERS] Implementing incremental backup

2013-06-22 Thread Andres Freund
On 2013-06-22 15:58:35 +0200, Cédric Villemain wrote:
  A differential backup resulting from a bunch of WAL between W1 and Wn
  would help to recover much faster to the time of Wn than replaying all
  the WALs between W1 and Wn and saves a lot of space.
  
  I was hoping to find some time to dig around this idea, but as the
  subject rose here, then here are my 2¢!
 
 something like that maybe :
 ./pg_xlogdump -b \
 ../data/pg_xlog/00010001 \   
 ../data/pg_xlog/00010005| \
 grep 'backup bkp' | awk '{print ($5,$9)}'

Note that it's a bit more complex than that for a number of reasons:
* we don't log full page images for e.g. new heap pages, we just set the
  XLOG_HEAP_INIT_PAGE flag on the record
* there also are XLOG_FPI records
* How do you get a base backup as the basis to apply those to? You need
  it to be recovered exactly to a certain point...

But yes, I think something can be done in the end. I think Heikki's
pg_rewind already has quite a bit of the required logic.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Possible bug in CASE evaluation

2013-06-22 Thread Andres Freund
On 2013-06-21 16:45:28 +0200, Andres Freund wrote:
 On 2013-06-21 09:51:05 -0400, Noah Misch wrote:
  That being said, if we discover a simple-enough fix that performs well, we 
  may
  as well incorporate it.
 
 What about passing another parameter down eval_const_expressions_mutator
 (which is static, so changing the API isn't a problem) that basically
 tells us whether we actually should evaluate expressions or just perform
 the transformation?
 There's seems to be basically a couple of places where we call dangerous
 things:
 * simplify_function (via -evaluate_function-evaluate_expr) which is
   called in a bunch of places
 * evaluate_expr which is directly called in T_ArrayExpr
   T_ArrayCoerceExpr
 
 All places I've inspected so far need to deal with simplify_function
 returning NULL anyway, so that seems like a viable fix.

*Prototype* patch - that seems simple enough - attached. Opinions?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 1349f8e4a8133eebbf66753b1aa3787f88ee3e33 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Sat, 22 Jun 2013 16:53:39 +0200
Subject: [PATCH] Don't evaluate potentially unreachable parts of CASE during
 planning

---
 src/backend/optimizer/util/clauses.c | 31 +--
 src/test/regress/expected/case.out   | 13 ++---
 src/test/regress/sql/case.sql|  4 ++--
 3 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index 6d5b204..94b52f6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -63,6 +63,7 @@ typedef struct
 	List	   *active_fns;
 	Node	   *case_val;
 	bool		estimate;
+	bool		doevil;
 } eval_const_expressions_context;
 
 typedef struct
@@ -2202,6 +2203,7 @@ eval_const_expressions(PlannerInfo *root, Node *node)
 	context.active_fns = NIL;	/* nothing being recursively simplified */
 	context.case_val = NULL;	/* no CASE being examined */
 	context.estimate = false;	/* safe transformations only */
+	context.doevil = true;		/* allowed to evaluate const expressions */
 	return eval_const_expressions_mutator(node, context);
 }
 
@@ -2233,6 +2235,7 @@ estimate_expression_value(PlannerInfo *root, Node *node)
 	context.active_fns = NIL;	/* nothing being recursively simplified */
 	context.case_val = NULL;	/* no CASE being examined */
 	context.estimate = true;	/* unsafe transformations OK */
+	context.doevil = true;		/* allowed to evaluate const expressions */
 	return eval_const_expressions_mutator(node, context);
 }
 
@@ -2751,7 +2754,7 @@ eval_const_expressions_mutator(Node *node,
  * If constant argument and it's a binary-coercible or
  * immutable conversion, we can simplify it to a constant.
  */
-if (arg && IsA(arg, Const) &&
+if (context->doevil && arg && IsA(arg, Const) &&
 	(!OidIsValid(newexpr->elemfuncid) ||
 func_volatile(newexpr->elemfuncid) == PROVOLATILE_IMMUTABLE))
 	return (Node *) evaluate_expr((Expr *) newexpr,
@@ -2843,16 +2846,20 @@ eval_const_expressions_mutator(Node *node,
 CaseExpr   *caseexpr = (CaseExpr *) node;
 CaseExpr   *newcase;
 Node	   *save_case_val;
+bool	save_doevil;
 Node	   *newarg;
 List	   *newargs;
 bool		const_true_cond;
 Node	   *defresult = NULL;
 ListCell   *arg;
 
+save_doevil = context->doevil;
+
 /* Simplify the test expression, if any */
 newarg = eval_const_expressions_mutator((Node *) caseexpr->arg,
 		context);
 
+
 /* Set up for contained CaseTestExpr nodes */
 save_case_val = context->case_val;
 if (newarg && IsA(newarg, Const))
@@ -2894,6 +2901,17 @@ eval_const_expressions_mutator(Node *node,
 		const_true_cond = true;
 	}
 
+	/*
+	 * We can only evaluate expressions as long as we are sure the
+	 * expression will be reached at runtime - otherwise we
+	 * might cause spurious errors to be thrown. The first
+	 * eval_const_expression above can always evaluate
+	 * expressions (unless we are in a nested CaseExpr) since
+	 * it will always be reached at runtime.
+	 */
+	if (!const_true_cond)
+		context->doevil = false;
+
 	/* Simplify this alternative's result value */
 	caseresult = eval_const_expressions_mutator((Node *) oldcasewhen->result,
 context);
@@ -2926,6 +2944,8 @@ eval_const_expressions_mutator(Node *node,
 
context->case_val = save_case_val;
 
+context->doevil = save_doevil;
+
 /*
  * If no non-FALSE alternatives, CASE reduces to the default
  * result
@@ -2982,7 +3002,7 @@ eval_const_expressions_mutator(Node *node,
 newarray-multidims = arrayexpr-multidims;
 newarray-location = arrayexpr-location;
 
-if (all_const)
+if (all_const && context->doevil)
 	return (Node *) evaluate_expr((Expr *) newarray,
   

Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-22 Thread Bruce Momjian
On Mon, Jun 10, 2013 at 07:28:24AM +0800, Craig Ringer wrote:
 (I'm still learning the details of Pg's WAL, WAL replay and recovery, so
 the below's just my understanding):
 
 The problem is that WAL for all tablespaces is mixed together in the
 archives. If you lose your tablespace then you have to keep *all* WAL
 around and replay *all* of it again when the tablespace comes back
 online. This would be very inefficient, would require a lot of tricks to
 cope with applying WAL to a database that has an on-disk state in the
 future as far as the archives are concerned. It's not as simple as just
 replaying all WAL all over again - as I understand it, things like
 CLUSTER or TRUNCATE will result in relfilenodes not being where they're
 expected to be as far as old WAL archives are concerned. Selective
 replay would be required, and that leaves the door open to all sorts of
 new and exciting bugs in areas that'd hardly ever get tested.
 
 To solve the massive disk space explosion problem I imagine we'd have to
 have per-tablespace WAL. That'd cause a *huge* increase in fsync costs
 and loss of the rather nice property that WAL writes are nice sequential
 writes. It'd be complicated and probably cause nightmares during
 recovery, for archive-based replication, etc.
 
 The only other thing I can think of is: When a tablespace is offline,
 write WAL records to a separate tablespace recovery log as they're
 encountered. Replay this log when the tablespace comes is restored,
 before applying any other new WAL to the tablespace. This wouldn't
 affect archive-based recovery since it'd already have the records from
 the original WAL.
 
 None of these options seem exactly simple or pretty, especially given
 the additional complexities that'd be involved in allowing WAL records
 to be applied out-of-order, something that AFAIK _never_ happens at the
 moment.
 
 The key problem, of course, is that this all sounds like a lot of
 complicated work for a case that's not really supposed to happen. Right
 now, the answer is your database is unrecoverable, switch to your
 streaming warm standby and re-seed it from the standby. Not pretty, but
 at least there's the option of using a sync standby and avoiding data loss.
 
 How would you approach this?

Sorry to be replying late.  You are right that you could record/apply
WAL separately for offline tablespaces.  The problem is that you could
have logical ties from the offline tablespace to online tablespaces. 
For example, what happens if data in an online tablespace references a
primary key in an offline tablespace.  What if the system catalogs are
stored in an offline tablespace?  Right now, we allow logical bindings
across physical tablespaces.  To do what you want, you would really need
to store each database in its own tablespace.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Simon Riggs
On 22 June 2013 14:48, Stephen Frost sfr...@snowman.net wrote:

 Based on your argument that we want to have a bucket load which is, on
 average, the size of NTUP_PER_BUCKET, I have to '-1' on this.

That is not my argument. I am pointing out that the comments claim the
code does that, and yet the code does not.

So either we apply the patch to make the code fit the comment, or we
change the comment.

Tom's wish for an average of 10 tuples per bucket may be connected to
that error; I can't say. But we definitely don't end up with an
average of 10 in many cases, which is the basis for previous poor
performance reports.

 What we
 want is to size a table large enough that we never have any true
 collisions (cases where different 32-bit hash values end up in the same
 bucket) *for all tuples being hashed*, that includes both the side
 building the hash table and the side doing the scan.  This should be
 done when we can avoid batching- if we don't have enough to avoid
 batching while having such a large table, we should consider adjusting
 the hash table size down some because, in those cases, we're memory
 constrained.

 When we have enough memory to avoid batching, we never want to have to
 check down through a bucket for a tuple which doesn't exist- we should
 simply be able to look in the hash table itself, determine (with a
 single fetch) that there are no entries for that hash value, and throw
 the tuple away and move on to the next.

The bottom line is that you're hoping for perfection. If we knew
exactly how many tuples were in the hash table then we could size it
precisely for the hash join we are about to perform. We don't know
that, so we can't size it precisely. Worse than that is that we know
we badly estimate the number of buckets on a regular basis.

That gives us three choices: with a hash table sized without prior
knowledge, we either (0) continue to underallocate buckets in many
cases, or (1) potentially overallocate buckets, or (2) learn to cope
better with the variation in the number of entries. If we rule out
the first two, we have only the last one remaining.

(1) will always be a heuristic, and as you point out could itself be
an extreme setting.

So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.
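
A very rough sketch of what that post-build check could look like (the helper
and the 2x threshold are hypothetical, purely for illustration):

    static void
    maybe_rebucket(HashJoinTable hashtable)
    {
        double  avg_load = hashtable->totalTuples / (double) hashtable->nbuckets;

        /*
         * If the realized load is far above the NTUP_PER_BUCKET target, paying
         * once to rebuild the bucket array may be cheaper than walking long
         * bucket chains for every probe from the outer side.
         */
        if (avg_load > 2.0 * NTUP_PER_BUCKET)
            rebuild_buckets(hashtable);     /* hypothetical helper */
    }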

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-22 Thread Fabien COELHO


Dear Robert and Greg,


I think so.  If it doesn't get fixed now, it's not likely to get fixed
later.  And the fact that nobody understands why it's happening is
kinda worrisome...


Possibly, but I think that it is not my fault :-)

So, I spent some time at performance debugging...

First, I finally succeeded in reproducing Greg Smith's spikes on my Ubuntu 
laptop. It takes short transactions and many clients per thread for the 
spike to appear. With pgbench's standard transactions and not too many 
clients per thread it is more of a little bump, or even not there at all.


After some poking around, and pursuing various red herrings, I resorted to 
measuring the delay of calling PQfinish(), which is really the only 
special thing going on at the end of a pgbench run...


BINGO!

In his tests Greg is using one thread. Once the overall timer is exceeded, 
clients start closing their connections by calling PQfinish once their 
transactions are done. This call takes between a few us and a few ... ms. 
So if some client's closing time hits the bad figures, the transactions of 
other clients are artificially delayed by this time, and they appear to 
have a high latency, but it is really because the thread was in another 
client's PQfinish and not available to process their data. If you have 
one thread per client there is no problem, especially as the final PQfinish() 
time is not measured at all by pgbench :-) Also, the more clients, the higher 
the spike, because more of them have to be stopped and may hit the bad figures.


Here is a trace, with the simple SELECT transaction.

  sh ./pgbench --rate=14000 -T 10 -r -l -c 30 -S bench
  ...

  sh less pgbench_log.*

 # short transactions, about 0.250 ms
 ...
 20 4849 241 0 1371916583.455400
 21 4844 256 0 1371916583.455470
 22 4832 348 0 1371916583.455569
 23 4829 218 0 1371916583.455627

   ** TIMER EXCEEDED **

 25 4722 390 0 1371916583.455820
 25 done in 1560 [1,2]  # BING, 1560 us for PQfinish()
 26 4557 1969 0 1371916583.457407
 26 done in 21 [1,2]
 27 4372 1969 0 1371916583.457447
 27 done in 19 [1,2]
 28 4009 1910 0 1371916583.457486
 28 done in 1445 [1,2]  # BOUM
 2 interrupted in 1300 [0,0]# BANG
 3 interrupted in 15 [0,0]
 4 interrupted in 40 [0,0]
 5 interrupted in 203 [0,0] # boum?
 6 interrupted in 1352 [0,0]# ZIP
 7 interrupted in 18 [0,0]
 ...
 23 interrupted in 12 [0,0]

 ## the cumulated PQfinish() time above is about 6 ms which
 ## appears as an apparent latency for the next clients:

  0 4880 6521 0 1371916583.462157
  0 done in 9 [1,2]
  1 4877 6397 0 1371916583.462194
  1 done in 9 [1,2]
 24 4807 6796 0 1371916583.462217
 24 done in 9 [1,2]
 ...

Note that the bad measures also appear when there is no throttling:

  sh ./pgbench -T 10 -r -l -c 30 -S bench
  sh grep 'done.*[0-9][0-9][0-9]' pgbench_log.*
   0 done in 1974 [1,2]
   1 done in 312 [1,2]
   3 done in 2159 [1,2]
   7 done in 409 [1,2]
  11 done in 393 [1,2]
  12 done in 2212 [1,2]
  13 done in 1458 [1,2]
  ## other clients execute PQfinish in less than 100 us

This "done" is issued by my instrumented version of clientDone().

The issue also appears if I instrument pgbench from master, without 
anything from the throttling patch at all:


  sh git diff master
  diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
  index 1303217..7c5ea81 100644
  --- a/contrib/pgbench/pgbench.c
  +++ b/contrib/pgbench/pgbench.c
  @@ -869,7 +869,15 @@ clientDone(CState *st, bool ok)

   if (st->con != NULL)
   {
   +   instr_time now;
   +   int64 s0, s1;
   +   INSTR_TIME_SET_CURRENT(now);
   +   s0 = INSTR_TIME_GET_MICROSEC(now);
   PQfinish(st->con);
   +   INSTR_TIME_SET_CURRENT(now);
   +   s1 = INSTR_TIME_GET_MICROSEC(now);
   +   fprintf(stderr, "%d done in %ld [%d,%d]\n",
   +   st->id, s1-s0, st->listen, st->state);
   st->con = NULL;
  }
  return false;   /* always false */

  sh ./pgbench -T 10 -r -l -c 30 -S bench 2 x.err

  sh grep 'done.*[0-9][0-9][0-9]' x.err
  14 done in 1985 [1,2]
  16 done in 276 [1,2]
  17 done in 1418 [1,2]

So my argued conclusion is that the issue is somewhere within 
PQfinish(), possibly in interaction with what pgbench does, but is *NOT* 
related in any way to the throttling patch, as it preexists it. Greg 
stumbled upon it because he looked at latencies.


I'll submit a slightly improved v12 shortly.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-22 Thread Alvaro Herrera
Andres Freund wrote:
 On 2013-06-22 22:45:26 +0900, Michael Paquier wrote:

  And I imagine that you have the same problem even with
  RelationGetIndexList, not only RelationGetIndexListIfInvalid, because
  this would appear as long as you try to open more than 1 index with an
  index list.
 
 No. RelationGetIndexList() returns a copy of the list for exactly that
 reason. The danger is not to see an outdated list - we should be
 protected by locks against that - but looking at uninitialized or reused
 memory.

Are we doing this only to save some palloc traffic?  Could we do this
by, say, teaching list_copy() to have a special case for lists of ints
and oids that allocates all the cells in a single palloc chunk?

(This has the obvious problem that list_free no longer works, of
course.  But I think that specific problem can be easily fixed.  Not
sure if it causes more breakage elsewhere.)

Alternatively, I guess we could grab an uncopied list, then copy the
items individually into a locally allocated array, avoiding list_copy.
We'd need to iterate differently than with foreach().
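
Roughly, that second alternative, sketched with the toast-index example from
upthread (rd_indexlist is assumed to have been populated already):

    /* snapshot the OIDs before doing anything that accepts invalidations */
    int         nindexes = list_length(toastrel->rd_indexlist);
    Oid        *indexoids = (Oid *) palloc(nindexes * sizeof(Oid));
    ListCell   *lc;
    int         i = 0;

    foreach(lc, toastrel->rd_indexlist)
        indexoids[i++] = lfirst_oid(lc);

    /* rd_indexlist may be rebuilt under us from here on; the local array is safe */
    for (i = 0; i < nindexes; i++)
        toastidxs[i] = index_open(indexoids[i], RowExclusiveLock);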

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Alvaro Herrera
MauMau wrote:

 Are you suggesting simplifying the following part in ServerLoop()?
 I welcome the idea if this condition becomes simpler.  However, I
 cannot imagine how.

  if (AbortStartTime > 0 &&   /* SIGKILL only once */
   (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
   now - AbortStartTime >= 10)
  {
   SignalAllChildren(SIGKILL);
   AbortStartTime = 0;
  }

Yes, that's what I'm suggesting.  I haven't tried coding it yet.

 I thought of adding some new state of pmState for some reason (that
 might be the same as your idea).
 But I refrained from doing that, because pmState has already many
 states.  I was afraid adding a new pmState value for this bug fix
 would complicate the state management (e.g. state transition in
 PostmasterStateMachine()).  In addition, I felt PM_WAIT_BACKENDS was
 appropriate because postmaster is actually waiting for backends to
 terminate after sending SIGQUIT.  The state name is natural.

Well, a natural state name is of no use if we're using it to indicate
two different states, which I think is what would be happening here.
Basically, with your patch, PM_WAIT_BACKENDS means two different things
depending on AbortStartTime.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Andres Freund
On 2013-06-21 16:56:57 -0400, Alvaro Herrera wrote:
  What we could do to improve the robustness a bit - at least on linux -
  is to prctl(PR_SET_PDEATHSIG, SIGKILL) which would cause children to be
  killed if the postmaster goes away...
 
 This is an interesting idea (even if it has no relationship to the patch
 at hand).

Well, we could just set the deathsig to SIGKILL and exit normally -
which would be a two-liner or so.
Admittedly the cross-platform issue makes this approach not so great...
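
For reference, the Linux side really is tiny - a sketch only (error handling
and the non-Linux fallback omitted; to be run in each child right after
fork()):

    #include <sys/prctl.h>
    #include <signal.h>

    /* ask the kernel to deliver SIGKILL to us if our parent (the postmaster) exits */
    if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
        elog(FATAL, "prctl(PR_SET_PDEATHSIG) failed: %m");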

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-22 Thread Fabien COELHO


Please find attached a v12, which under timer_exceeded interrupts clients 
which are being throttled instead of waiting for the end of the 
transaction, as the transaction is not started yet.


I've also changed the log format that I used for debugging the apparent 
latency issue:


  x y z 12345677 1234 -> x y z 12345677.001234

It seems much clearer that way.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] pgbench --throttle (submission 7 - with lag measurement)

2013-06-22 Thread Fabien COELHO


Please find attached a v12, which under timer_exceeded interrupts clients 
which are being throttled instead of waiting for the end of the transaction, 
as the transaction is not started yet.


Oops, I forgot the attachment. Here it is!

--
Fabien.diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 1303217..4c9e55d 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -137,6 +137,12 @@ int			unlogged_tables = 0;
 double		sample_rate = 0.0;
 
 /*
+ * When threads are throttled to a given rate limit, this is the target delay
+ * to reach that rate in usec.  0 is the default and means no throttling.
+ */
+int64		throttle_delay = 0;
+
+/*
  * tablespace selection
  */
 char	   *tablespace = NULL;
@@ -200,11 +206,13 @@ typedef struct
 	int			listen;			/* 0 indicates that an async query has been
  * sent */
 	int			sleeping;		/* 1 indicates that the client is napping */
+	bool		throttling;		/* whether nap is for throttling */
 	int64		until;			/* napping until (usec) */
 	Variable   *variables;		/* array of variable definitions */
 	int			nvariables;
 	instr_time	txn_begin;		/* used for measuring transaction latencies */
 	instr_time	stmt_begin;		/* used for measuring statement latencies */
+	bool		throttled;  /* whether current transaction was throttled */
 	int			use_file;		/* index in sql_files for this client */
 	bool		prepared[MAX_FILES];
 } CState;
@@ -222,6 +230,10 @@ typedef struct
 	instr_time *exec_elapsed;	/* time spent executing cmds (per Command) */
 	int		   *exec_count;		/* number of cmd executions (per Command) */
 	unsigned short random_state[3];		/* separate randomness for each thread */
+int64   throttle_trigger;  /* previous/next throttling (us) */
+	int64   throttle_lag;  /* total transaction lag behind throttling */
+	int64   throttle_lag_max;  /* max transaction lag */
+
 } TState;
 
 #define INVALID_THREAD		((pthread_t) 0)
@@ -230,6 +242,8 @@ typedef struct
 {
 	instr_time	conn_time;
 	int			xacts;
+	int64   throttle_lag;
+	int64   throttle_lag_max;
 } TResult;
 
 /*
@@ -355,6 +369,8 @@ usage(void)
 		 -n   do not run VACUUM before tests\n
 		 -N   do not update tables \pgbench_tellers\ and \pgbench_branches\\n
 		 -r   report average latency per command\n
+		 -R SPEC, --rate SPEC\n
+		  target rate in transactions per second\n
 		 -s NUM   report this scale factor in output\n
 		 -S   perform SELECT-only transactions\n
 	   -t NUM   number of transactions each client runs (default: 10)\n
@@ -898,17 +914,58 @@ doCustom(TState *thread, CState *st, instr_time *conn_time, FILE *logfile, AggVa
 {
 	PGresult   *res;
 	Command   **commands;
+	bool		do_throttle = false;
 
 top:
 	commands = sql_files[st-use_file];
 
+	/* handle throttling once per transaction by inserting a sleep.
+	 * this is simpler than doing it at the end.
+	 */
+	if (throttle_delay && !st->throttled)
+	{
+		/* compute delay to approximate a Poisson distribution
+		 * 100 = 13.8 .. 0 multiplier
+		 *  10 = 11.5 .. 0
+		 *   1 =  9.2 .. 0
+		 *1000 =  6.9 .. 0
+		 * if transactions are too slow or a given wait shorter than
+		 * a transaction, the next transaction will start right away.
+		 */
+		int64 wait = (int64)
+			throttle_delay * -log(getrand(thread, 1, 1000)/1000.0);
+
+		thread->throttle_trigger += wait;
+
+		st->until = thread->throttle_trigger;
+		st->sleeping = 1;
+		st->throttling = true;
+		st->throttled = true;
+		if (debug)
+			fprintf(stderr, "client %d throttling "INT64_FORMAT" us\n",
+					st->id, wait);
+	}
+
 	if (st->sleeping)
 	{			/* are we sleeping? */
 		instr_time	now;
+		int64 now_us;
 
 		INSTR_TIME_SET_CURRENT(now);
-		if (st->until <= INSTR_TIME_GET_MICROSEC(now))
+		now_us = INSTR_TIME_GET_MICROSEC(now);
+		if (st->until <= now_us)
+		{
 			st->sleeping = 0;	/* Done sleeping, go ahead with next command */
+			if (st->throttling)
+			{
+				/* measure lag of throttled transaction */
+				int64 lag = now_us - st->until;
+				thread->throttle_lag += lag;
+				if (lag > thread->throttle_lag_max)
+					thread->throttle_lag_max = lag;
+				st->throttling = false;
+			}
+		}
 		else
 			return true;		/* Still sleeping, nothing to do here */
 	}
@@ -1037,7 +1094,7 @@ top:
 	 * This is more than we really ought to know about
 	 * instr_time
 	 */
-	fprintf(logfile, "%d %d %.0f %d %ld %ld\n",
+	fprintf(logfile, "%d %d %.0f %d %ld.%06ld\n",
 			st->id, st->cnt, usec, st->use_file,
 			(long) now.tv_sec, (long) now.tv_usec);
 #else
@@ -1046,7 +1103,7 @@ top:
 	 * On Windows, instr_time doesn't provide a timestamp
 	 * anyway
 	 */
-	fprintf(logfile, "%d %d %.0f %d 0 0\n",
+	fprintf(logfile, "%d %d %.0f %d 0.0\n",
 			st->id, st->cnt, usec, st->use_file);
 #endif
 }
@@ -1095,6 +1152,13 @@ top:
 			st->state = 0;
 			st->use_file = (int) getrand(thread, 0, num_files - 1);
 			commands = 
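
As a quick aside for readers following the hunk above: the wait computation is an
inverse-transform sample, so inter-transaction gaps are exponentially distributed
and transaction starts approximate a Poisson process at the target rate. Below is a
minimal standalone sketch of that idea; the 100 tps figure, the helper name and the
use of rand() are illustrative assumptions, not part of the patch.

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Sketch: pick a throttling wait in microseconds for a target rate in
 * transactions per second.  throttle_delay is the mean gap between
 * transaction starts; multiplying it by -log(u) for u uniform in (0,1]
 * yields exponentially distributed waits.  Clamping u to >= 1/1000, as
 * the patch does, caps a single wait at about 6.9 times the mean.
 */
static int64_t
throttle_wait_us(double rate_tps)
{
	int64_t		throttle_delay = (int64_t) (1000000.0 / rate_tps);
	double		u = (rand() % 1000 + 1) / 1000.0;

	return (int64_t) (throttle_delay * -log(u));
}

int
main(void)
{
	int64_t		total = 0;
	int			i;

	for (i = 0; i < 100000; i++)
		total += throttle_wait_us(100.0);	/* assumed 100 tps target */

	/* the average wait should come out close to the 10000 us mean */
	printf("average wait: %ld us\n", (long) (total / 100000));
	return 0;
}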

Re: [HACKERS] Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Robert Haas
On Fri, Jun 21, 2013 at 11:19 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I think that's the Tom Lane theory.  The Robert Haas theory is that if
 the postmaster has died, there's no reason to suppose that it hasn't
 corrupted shared memory on the way down, or that the system isn't
 otherwise heavily fuxxored in some way.

 Eh?  The postmaster does its level best never to touch shared memory
 (after initialization anyway).

And yet it certainly does - see pmsignal.c, for example.  Besides
which, as Andres points out, if the postmaster is dead, there is zip
for a guarantee that some OTHER backend hasn't panicked.  I think it's
just ridiculous to suppose that the system can run in any sort of
reasonable way without the postmaster.  The whole reason why we work
so hard to make sure that the postmaster doesn't die in the first
place is because we need it to clean up when things go horribly wrong.
 If that cleanup function is important, then we need a living
postmaster at all times.  If it's not important, then our extreme
paranoia about what operations the postmaster is permitted to engage
in is overblown.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Robert Haas
On Sat, Jun 22, 2013 at 9:15 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Previous discussions of Hash Joins have noted that the performance
 decreases when the average number of tuples per bucket increases.
 O(N^2) effects are seen.

 We've argued this about many different ways, yet all of those
 discussions have centred around the constant NTUP_PER_BUCKET. I
 believe that was a subtle mistake and I have a proposal.

 The code in ExecChooseHashTableSize() line 460 says
  /*
   * Set nbuckets to achieve an average bucket load of NTUP_PER_BUCKET when
   * memory is filled.
 ...

 but the calculation then sets the number of buckets like this

  dbuckets = ceil(ntuples / NTUP_PER_BUCKET);

 **This doesn't match the comment.** If we underestimate the number
 of tuples and go on to fill the available memory, we then end up with
 an average number of tuples per bucket higher than NTUP_PER_BUCKET. A
 notational confusion that has been skewing the discussion.

 The correct calculation that would match the objective set out in the
 comment would be

  dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;

 Which leads us to a much more useful value of dbuckets in the case
 where using ntuples occupies much less space than is available. This
 value is always same or higher than previously because of the if-test
 that surrounds it.

+1.  I think this is very clever.

The other thing that seems to be distorting things here is that this
function doesn't account for the memory consumed by the bucket array.
So as things stand today, changing NTUP_PER_BUCKET can't ever increase
the required number of batches.  But that's really nonsensical.
Setting NTUP_PER_BUCKET to a smaller value like 1 or even, in effect,
a value less than 1 would probably improve performance for large hash
tables, but there's no way to decide what value is too expensive
because the memory cost of changing it isn't accounted for in the
first place.
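
For a concrete sense of the gap between the two calculations, here is a tiny
sketch comparing the estimate-driven and memory-driven bucket counts; the numbers
and variable names are illustrative assumptions, not taken from
ExecChooseHashTableSize().

#include <math.h>
#include <stdio.h>

#define NTUP_PER_BUCKET 10

int
main(void)
{
	double		ntuples = 100000.0;		/* assumed planner (under)estimate */
	double		tupsize = 48.0;			/* assumed bytes per stored tuple */
	double		hash_table_bytes = 64.0 * 1024.0 * 1024.0;	/* assumed work_mem */

	/* current code: buckets sized from the tuple estimate only */
	double		dbuckets_est = ceil(ntuples / NTUP_PER_BUCKET);

	/* proposed: buckets sized from what actually fits in memory */
	double		dbuckets_mem = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;

	printf("estimate-driven buckets: %.0f\n", dbuckets_est);
	printf("memory-driven buckets:   %.0f\n", dbuckets_mem);
	return 0;
}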

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Robert Haas
On Sat, Jun 22, 2013 at 9:48 AM, Stephen Frost sfr...@snowman.net wrote:
 The correct calculation that would match the objective set out in the
 comment would be

  dbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;

 This looks to be driving the size of the hash table size off of how
 many of this size tuple can I fit into memory? and ignoring how many
 actual rows we have to hash.  Consider a work_mem of 1GB with a small
 number of rows to actually hash- say 50.  With a tupsize of 8 bytes,
 we'd be creating a hash table sized for some 13M buckets.

This is a fair point, but I still think Simon's got a good point, too.
 Letting the number of buckets ramp up when there's ample memory seems
like a broadly sensible strategy.  We might need to put a floor on the
effective load factor, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MemoryContextAllocHuge(): selectively bypassing MaxAllocSize

2013-06-22 Thread Robert Haas
On Sat, Jun 22, 2013 at 3:46 AM, Stephen Frost sfr...@snowman.net wrote:
 I'm not a huge fan of moving directly to INT_MAX.  Are we confident that
 everything can handle that cleanly..?  I feel like it might be a bit
 safer to shy a bit short of INT_MAX (say, by 1K).

Maybe it would be better to stick with INT_MAX and fix any bugs we
find.  If there are magic numbers short of INT_MAX that cause
problems, it would likely be better to find out about those problems
and adjust the relevant code, rather than trying to dodge them.  We'll
have to confront all of those problems eventually as we come to
support larger and larger sorts; I don't see much value in putting it
off.

Especially since we're early in the release cycle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT

2013-06-22 Thread Robert Haas
On Sat, Jun 22, 2013 at 9:16 AM, Cédric Villemain
ced...@2ndquadrant.com wrote:
 patch is in unified format and apply on HEAD.
 patch contains documentation, however I believe 'AS IMPLICIT' is a PostgreSQL
 extension with special behavior and 'AS EXPLICIT' respect the standard except
 that PostgreSQL adds only the expression 'AS EXPLICIT' (it is also the default
 in the standard).

I object to this patch.  This patch adds a new keyword, EXPLICIT, for
reasons that are strictly cosmetic.  Everything that you can do with
this patch can also be done without this patch.  It is not a good idea
to slow down parsing of every SQL statement we have just so that
someone can write CREATE CAST .. AS EXPLICIT.  Granted, the parsing
slowdown for just one keyword is probably not noticeable, but it's
cumulative with every new keyword we add.  Adding them to support new
features is one thing, but adding them to support purely optional
syntax is, I think, going too far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Robert Haas
On Fri, Jun 21, 2013 at 10:02 PM, MauMau maumau...@gmail.com wrote:
 I'm comfortable with 5 seconds.  We are talking about the interval between
 sending SIGQUIT to the children and then sending SIGKILL to them.  In most
 situations, the backends should terminate immediately.  However, as I said a
 few months ago, ereport() call in quickdie() can deadlock indefinitely. This
 is a PostgreSQL's bug.

So let's fix that bug.  Then we don't need this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] dump difference between 9.3 and master after upgrade

2013-06-22 Thread Andrew Dunstan


On 06/20/2013 11:16 AM, I wrote:


On 06/20/2013 10:43 AM, Robert Haas wrote:
On Tue, Jun 18, 2013 at 12:18 PM, Andrew Dunstan 
and...@dunslane.net wrote:
As I was updating my cross version upgrade tester to include support 
for the
9.3 branch, I noted this dump difference between the dump of the 
original
9.3 database and the dump of the converted database, which looks 
odd. Is it

correct?

It looks sketchy to me, but I'm not sure exactly what you're comparing.



Essentially, cross version upgrade testing runs pg_dumpall from the 
new version on the old cluster, runs pg_upgrade, and then runs 
pg_dumpall on the upgraded cluster, and compares the two outputs. This 
is what we get when the new version is HEAD and the old version is 9.3.


The reason this hasn't been caught by the standard same-version 
upgrade tests is that this module uses a more extensive cluster, which 
has had not only the core regression tests run but also all the 
contrib and pl regression tests, and this problem seems to be exposed 
by the postgres_fdw tests.


At first glance it looks to me like pg_dump in binary-upgrade mode is not 
suppressing something that it should be suppressing.





Yeah, after examination I don't see why we should output anything for 
dropped columns of a foreign table in binary upgrade mode. This looks to 
me like it's been a bug going back to 9.1, where we introduced foreign tables.


I think something like the attached should fix it, although I'm not sure 
if that's the right place for the fix.


cheers

andrew
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index ec956ad..b25b475 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -6670,7 +6670,7 @@ getTableAttrs(Archive *fout, TableInfo *tblinfo, int numTables)
  * such a column it will mistakenly get pg_attribute.attislocal set to true.)
  * However, in binary_upgrade mode, we must print all such columns anyway and
  * fix the attislocal/attisdropped state later, so as to keep control of the
- * physical column order.
+ * physical column order, unless it's a Foreign Table.
  *
  * This function exists because there are scattered nonobvious places that
  * must be kept in sync with this decision.
@@ -6679,7 +6679,7 @@ bool
 shouldPrintColumn(TableInfo *tbinfo, int colno)
 {
 	if (binary_upgrade)
-		return true;
+		return (tbinfo->relkind != RELKIND_FOREIGN_TABLE || !tbinfo->attisdropped[colno]);
 	return (tbinfo->attislocal[colno] && !tbinfo->attisdropped[colno]);
 }
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT

2013-06-22 Thread Fabien COELHO


Hello Robert,


I object to this patch.  This patch a new keyword, EXPLICIT, for
reasons that are strictly cosmetic.  Everything that you can do with
this patch can also be done without this patch.  It is not a good idea
to slow down parsing of every SQL statement we have just so that
someone can write CREATE CAST .. AS EXPLICIT.  Granted, the parsing
slowdown for just one keyword is probably not noticeable, but it's
cumulative with every new keyword we add.  Adding them to support new
features is one thing, but adding them to support purely optional
syntax is, I think, going too far.


I partly object to the objection:-)

I agree that it may induce a very small delay to the parser, however I 
*do* think that cosmetic things are important. In order to ease 
understanding, learning and memorizing a language, concepts must have 
names, syntaxes, and be orthogonal and symmetric where applicable.


In this example, there are 3 kinds of casts, all 3 have a conceptual name 
(explicit, implicit, assignment) but only two have a syntax, and the other 
one is the absence of syntax. So you have to memorize this stupid 
information (which one of the three does not have a syntax) or read the 
documentation every time to remember that explicit is the one without a 
syntax. Note also that you must state implicit explicitely, but 
explicit is told implicitely, which does not really help.


The impact is also on the documentation which is not symmetric because it 
is based on the syntax which is not, so it is simply a little harder to 
understand.


Every year I do my class about PL/pgSQL and extensions to Pg, and every 
year some students will try "as explicit" because it is logical to do so. 
I think that she is right and that it should work, instead of having to 
explain that "explicit" is implicit when dealing with Pg casts. Although 
it is my job, I would prefer to spend time explaining more interesting 
things.


From the software engineering point of view, having a syntax for all cases 
means that the developer must think about which kind of cast she really 
wants, instead of doing the default thing just because it is the default.


So in my mind the tradeoff is between people time & annoyance and a few 
machine cycles, and I have no hesitation in choosing the latter.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT

2013-06-22 Thread Fabien COELHO


Hello Cédric,


So maybe it is possible to rephrase this piece:
-   literalAS IMPLICIT/ is a productnamePostgreSQL/productname
-   extension, too.
+   literalAS IMPLICIT/ and literalAS EXPLICIT/ are
+   a productnamePostgreSQL/productname extension, too.


Ok.


Back in 2012 Tom exposed arguments against it, or at least not a clear +1.
The patch adds nothing but a more explicit creation statement; it has gone
untouched for 2 years without interest, so it is surely not something really
important for PostgreSQL users.


Elegant is important:-) See my answer to Robert's objection.


* Coding review
Patch does not pass test:
./check_keywords.pl gram.y ../../../src/include/parser/kwlist.h


Oops! That is not elegant!


I had to update the unreserved keyword list in order to be able to build
postgresql.



** Does it produce compiler warnings? Doesn't build as is. Needs a patch update.


Indeed.


I flag it 'return with feedback', please update the patch so it builds.
Everything else is ok.


Yep.

--
Fabien.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [Review] Re: [HACKERS] minor patch submission: CREATE CAST ... AS EXPLICIT

2013-06-22 Thread Fabien COELHO



I flag it 'return with feedback', please update the patch so it builds.
Everything else is ok.


Here it is.

--
Fabien.

diff --git a/doc/src/sgml/ref/create_cast.sgml b/doc/src/sgml/ref/create_cast.sgml
index 29ea298..0ace996 100644
--- a/doc/src/sgml/ref/create_cast.sgml
+++ b/doc/src/sgml/ref/create_cast.sgml
@@ -20,15 +20,15 @@
 synopsis
 CREATE CAST (replaceablesource_type/replaceable AS replaceabletarget_type/replaceable)
 WITH FUNCTION replaceablefunction_name/replaceable (replaceableargument_type/replaceable [, ...])
-[ AS ASSIGNMENT | AS IMPLICIT ]
+[ AS ASSIGNMENT | AS EXPLICIT | AS IMPLICIT ]
 
 CREATE CAST (replaceablesource_type/replaceable AS replaceabletarget_type/replaceable)
 WITHOUT FUNCTION
-[ AS ASSIGNMENT | AS IMPLICIT ]
+[ AS ASSIGNMENT | AS EXPLICIT | AS IMPLICIT ]
 
 CREATE CAST (replaceablesource_type/replaceable AS replaceabletarget_type/replaceable)
 WITH INOUT
-[ AS ASSIGNMENT | AS IMPLICIT ]
+[ AS ASSIGNMENT | AS EXPLICIT | AS IMPLICIT ]
 /synopsis
  /refsynopsisdiv
 
@@ -74,7 +74,8 @@ SELECT CAST(42 AS float8);
   /para
 
   para
-   By default, a cast can be invoked only by an explicit cast request,
+   By default, or if the cast is declared literalAS EXPLICIT/,
+   a cast can be invoked only by an explicit cast request,
that is an explicit literalCAST(replaceablex/ AS
replaceabletypename/)/literal or
replaceablex/literal::/replaceabletypename/
@@ -239,6 +240,21 @@ SELECT CAST ( 2 AS numeric ) + 4.0;
 /varlistentry
 
 varlistentry
+ termliteralAS EXPLICIT/literal/term
+
+ listitem
+  para
+   Indicates that the cast can be invoked only with an explicit
+   cast request, that is an explicit literalCAST(replaceablex/ AS
+   replaceabletypename/)/literal or
+   replaceablex/literal::/replaceabletypename/
+   construct.
+   This is the default.
+  /para
+ /listitem
+/varlistentry
+
+varlistentry
  termliteralAS IMPLICIT/literal/term
 
  listitem
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 5094226..2c0694f 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -533,7 +533,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 	DICTIONARY DISABLE_P DISCARD DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP
 
 	EACH ELSE ENABLE_P ENCODING ENCRYPTED END_P ENUM_P ESCAPE EVENT EXCEPT
-	EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN
+	EXCLUDE EXCLUDING EXCLUSIVE EXECUTE EXISTS EXPLAIN EXPLICIT
 	EXTENSION EXTERNAL EXTRACT
 
 	FALSE_P FAMILY FETCH FIRST_P FLOAT_P FOLLOWING FOR FORCE FOREIGN FORWARD
@@ -6720,6 +6720,7 @@ CreateCastStmt: CREATE CAST '(' Typename AS Typename ')'
 
 cast_context:  AS IMPLICIT_P	{ $$ = COERCION_IMPLICIT; }
 		| AS ASSIGNMENT			{ $$ = COERCION_ASSIGNMENT; }
+		| AS EXPLICIT			{ $$ = COERCION_EXPLICIT; }
 		| /*EMPTY*/{ $$ = COERCION_EXPLICIT; }
 		;
 
@@ -12723,6 +12724,7 @@ unreserved_keyword:
 			| EXCLUSIVE
 			| EXECUTE
 			| EXPLAIN
+			| EXPLICIT
 			| EXTENSION
 			| EXTERNAL
 			| FAMILY
diff --git a/src/include/parser/kwlist.h b/src/include/parser/kwlist.h
index 68a13b7..f97389b 100644
--- a/src/include/parser/kwlist.h
+++ b/src/include/parser/kwlist.h
@@ -149,6 +149,7 @@ PG_KEYWORD(exclusive, EXCLUSIVE, UNRESERVED_KEYWORD)
 PG_KEYWORD(execute, EXECUTE, UNRESERVED_KEYWORD)
 PG_KEYWORD(exists, EXISTS, COL_NAME_KEYWORD)
 PG_KEYWORD(explain, EXPLAIN, UNRESERVED_KEYWORD)
+PG_KEYWORD(explicit, EXPLICIT, UNRESERVED_KEYWORD)
 PG_KEYWORD(extension, EXTENSION, UNRESERVED_KEYWORD)
 PG_KEYWORD(external, EXTERNAL, UNRESERVED_KEYWORD)
 PG_KEYWORD(extract, EXTRACT, COL_NAME_KEYWORD)
diff --git a/src/test/regress/expected/create_cast.out b/src/test/regress/expected/create_cast.out
index 56cd86e..a8858fa 100644
--- a/src/test/regress/expected/create_cast.out
+++ b/src/test/regress/expected/create_cast.out
@@ -27,8 +27,8 @@ ERROR:  function casttestfunc(text) does not exist
 LINE 1: SELECT casttestfunc('foo'::text);
^
 HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
--- Try binary coercion cast
-CREATE CAST (text AS casttesttype) WITHOUT FUNCTION;
+-- Try binary coercion cast and verbose AS EXPLICIT
+CREATE CAST (text AS casttesttype) WITHOUT FUNCTION AS EXPLICIT;
 SELECT casttestfunc('foo'::text); -- doesn't work, as the cast is explicit
 ERROR:  function casttestfunc(text) does not exist
 LINE 1: SELECT casttestfunc('foo'::text);
@@ -54,7 +54,7 @@ SELECT 1234::int4::casttesttype; -- No cast yet, should fail
 ERROR:  cannot cast type integer to casttesttype
 LINE 1: SELECT 1234::int4::casttesttype;
  ^
-CREATE CAST (int4 AS casttesttype) WITH INOUT;
+CREATE CAST (int4 AS casttesttype) WITH INOUT; -- default AS EXPLICIT
 SELECT 1234::int4::casttesttype; -- Should work now
  casttesttype 
 --
diff --git a/src/test/regress/sql/create_cast.sql 

Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Stephen Frost
On Saturday, June 22, 2013, Simon Riggs wrote:

 On 22 June 2013 14:48, Stephen Frost sfr...@snowman.net wrote:

  Based on your argument that we want to have a bucket load which is, on
  average, the size of NTUP_PER_BUCKET, I have to '-1' on this.

 That is not my argument. I am pointing out that the comments claim the
 code does that, and yet the code does not.


To be honest, I don't read that comment quite the same way as you do.

Tom's wish for an average tuples per bucket of 10 may be connected to
 that error; I can't say. But we definitely don't end up with an
 average of 10 in many cases, which is the basis for previous poor
 performance reports.


I don't believe Tom wishes for an average of 10..  That's just what
NTUP_PER_BUCKET has always been set to.

The bottom line is that you're hoping for perfection. If we knew
 exactly how many tuples were in the hash table then we could size it
 precisely for the hash join we are about to perform. We don't know
 that, so we can't size it precisely. Worse than that is that we know
 we badly estimate the number of buckets on a regular basis.


I'm actually not trying for perfection. What I think we should start with
is to at least not shoot ourselves in the foot by cutting the bucket size
to one-tenth of our distinct tuple estimate and virtually guaranteeing lots
of true collisions which are expensive.


 That gives us three choices: if we have a hash table fixed without
 prior knowledge then we either (0) we continue to underallocate them
 in many cases or (1) potentially overallocate buckets. Or (2) we learn
 to cope better with the variation in the number of entries. If we rule
 out the first two we have only the last one remaining.

 (1) will always be a heuristic, and as you point out could itself be
 an extreme setting


A heuristic based on our estimates is a good choice, imv.  I do think we
can do better than we are now though.

So I think that (2) is the best route: Given that we know with much
 better certainty the number of rows in the scanned-relation, we should
 be able to examine our hash table after it has been built and decide
 whether it would be cheaper to rebuild the hash table with the right
 number of buckets, or continue processing with what we have now. Which
 is roughly what Heikki proposed already, in January.


I'm actually not a huge fan of this as it's certainly not cheap to do. If
it can be shown to be better than an improved heuristic then perhaps it
would work but I'm not convinced.

One other way to address having a smaller actual hash table is to come up
with another way to detect empty parts of the hash space, perhaps by using
a bitmap to represent the hash space (obviously the individual bits would
represent some chunk of the 32bit hash space, perhaps as much as 1/64'th,
allowing our bitmap to fit into a 64bit int; that may end up being too
small to actually help though). This would mean that when we did get to
scanning a bucket we would at least be much more likely of finding a match
instead of scanning it and finding nothing.
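
A rough sketch of that bitmap bookkeeping, with the 1/64 chunking and the names
here being assumptions for illustration only:

#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch: a 64-bit occupancy map over the 32-bit hash space.  Each bit
 * covers 1/64th of the space (the top 6 bits of the hash value).  Before
 * probing a bucket chain, a cheap bit test tells us whether any inner
 * tuple hashed into that chunk at all.
 */
typedef struct
{
	uint64_t	occupied;
} HashSpaceMap;

static inline void
hashmap_mark(HashSpaceMap *map, uint32_t hashvalue)
{
	map->occupied |= UINT64_C(1) << (hashvalue >> 26);
}

static inline bool
hashmap_maybe_present(const HashSpaceMap *map, uint32_t hashvalue)
{
	return (map->occupied & (UINT64_C(1) << (hashvalue >> 26))) != 0;
}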

Thanks,

Stephen


Re: [HACKERS] [PATCH] add --progress option to pgbench (submission 3)

2013-06-22 Thread Fabien COELHO


Hello Mitsumasa,

Thanks for the review.


* 2. Output format in result for more readable.

5.0 s[thread 1]: tps = 1015.576032, AverageLatency(ms) = 0.000984663
5.0 s[thread 0]: tps = 1032.580794, AverageLatency(ms) = 0.000968447
10.0 s [thread 0]: tps = 1129.591189, AverageLatency(ms) = 0.000885276
10.0 s [thread 1]: tps = 1126.267776, AverageLatency(ms) = 0.000887888


However, interest in the output format (design) differs depending on the 
person :-). If you like another format, fix it.


I think that your suggestion is too verbose, and as far as automation is 
concerned I like "cut -f 2" unix filtering and other gnuplot processing... 
but I see your point and it is a matter of taste. I'll try to propose 
something in between, if I can.



* 3. Thread name in output format is not nesessary.
I cannot understand why the thread name is displayed in each progress line. I 
think that it is not needed. I hope that the output result could be more simple 
also with a lot of threads. My image is here,



5.0 s: tps = 2030.576032, AverageLatency(ms) = 0.000984663
10.0 s : tps = 2250.591189, AverageLatency(ms) = 0.000885276


This output format is more simple and intuitive. If you need result in each 
threads, please tell us the reason.


I agree that it would be better, but only a thread has access to its own data 
if it must work with the "fork" pthread emulation, so each thread has to 
do its own report... If the fork emulation is removed and only real threads 
are used, it would be much better, and one thread would be able to report 
for everyone. The alternative is to make it a feature which does not work with 
fork emulation.


* 4. Slipping the progress time.
When I executed this patch for a long time, I found the progress time slipping. 
An image of this problem is here.


Yep. I must change the test to align on the overall start time.

I'll submit a new patch later.

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Heikki Linnakangas

On 22.06.2013 19:19, Simon Riggs wrote:

So I think that (2) is the best route: Given that we know with much
better certainty the number of rows in the scanned-relation, we should
be able to examine our hash table after it has been built and decide
whether it would be cheaper to rebuild the hash table with the right
number of buckets, or continue processing with what we have now. Which
is roughly what Heikki proposed already, in January.


Back in January, I wrote a quick patch to experiment with rehashing when 
the hash table becomes too full. It was too late to make it into 9.3 so 
I didn't pursue it further back then, but IIRC it worked. If we have the 
capability to rehash, the accuracy of the initial guess becomes much 
less important.


- Heikki
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6a2f236..32aebb9 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -223,6 +223,60 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
+static void
+Rehash(HashJoinTable hashtable)
+{
+	int nbuckets_old = hashtable->nbuckets_inmem;
+	int nbuckets_new;
+	uint32 mask;
+	int i;
+
+	if (nbuckets_old > (1 << 30))
+		return;
+
+	nbuckets_new = nbuckets_old * 2;
+	hashtable->buckets = (HashJoinTuple *)
+		repalloc(hashtable->buckets, nbuckets_new * sizeof(HashJoinTuple));
+	memset(((char *) hashtable->buckets) + nbuckets_old * sizeof(HashJoinTuple),
+		   0, nbuckets_old * sizeof(HashJoinTuple));
+	hashtable->nbuckets_inmem = nbuckets_new;
+
+	mask = nbuckets_old;
+
+	for (i = 0; i < nbuckets_old; i++)
+	{
+		int			newbucketno = i + nbuckets_old;
+		HashJoinTuple prevtuple;
+		HashJoinTuple tuple;
+
+		prevtuple = NULL;
+		tuple = hashtable->buckets[i];
+
+		while (tuple != NULL)
+		{
+			/* save link in case we delete */
+			HashJoinTuple nexttuple = tuple->next;
+
+			if ((tuple->hashvalue & mask) != 0)
+			{
+				/* move to the new bucket */
+				tuple->next = hashtable->buckets[newbucketno];
+				hashtable->buckets[newbucketno] = tuple;
+
+				if (prevtuple)
+					prevtuple->next = nexttuple;
+				else
+					hashtable->buckets[i] = nexttuple;
+			}
+			else
+			{
+				/* keep in this bucket */
+				prevtuple = tuple;
+			}
+			tuple = nexttuple;
+		}
+	}
+}
 
 /* 
  *		ExecHashTableCreate
@@ -271,6 +325,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	 */
 	hashtable = (HashJoinTable) palloc(sizeof(HashJoinTableData));
 	hashtable->nbuckets = nbuckets;
+	hashtable->nbuckets_inmem = nbuckets;
 	hashtable->log2_nbuckets = log2_nbuckets;
 	hashtable->buckets = NULL;
 	hashtable->keepNulls = keepNulls;
@@ -285,6 +340,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->nbatch_outstart = nbatch;
 	hashtable->growEnabled = true;
 	hashtable->totalTuples = 0;
+	hashtable->inmemTuples = 0;
 	hashtable->innerBatchFile = NULL;
 	hashtable->outerBatchFile = NULL;
 	hashtable->spaceUsed = 0;
@@ -612,7 +668,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
+	for (i = 0; i < hashtable->nbuckets_inmem; i++)
 	{
 		HashJoinTuple prevtuple;
 		HashJoinTuple tuple;
@@ -734,6 +790,10 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashTuple->next = hashtable->buckets[bucketno];
 		hashtable->buckets[bucketno] = hashTuple;
 
+		hashtable->inmemTuples += 1;
+		if (hashtable->inmemTuples > hashtable->nbuckets_inmem * NTUP_PER_BUCKET * 2)
+			Rehash(hashtable);
+
 		/* Account for space used, and back off if we've used too much */
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
@@ -873,18 +933,18 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 		  int *bucketno,
 		  int *batchno)
 {
-	uint32		nbuckets = (uint32) hashtable->nbuckets;
+	uint32		nbuckets_inmem = (uint32) hashtable->nbuckets_inmem;
 	uint32		nbatch = (uint32) hashtable->nbatch;
 
 	if (nbatch > 1)
 	{
 		/* we can do MOD by masking, DIV by shifting */
-		*bucketno = hashvalue & (nbuckets - 1);
+		*bucketno = hashvalue & (nbuckets_inmem - 1);
 		*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
 	}
 	else
 	{
-		*bucketno = hashvalue & (nbuckets - 1);
+		*bucketno = hashvalue & (nbuckets_inmem - 1);
 		*batchno = 0;
 	}
 }
@@ -995,7 +1055,7 @@ ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext *econtext)
 		 */
 		if (hashTuple != NULL)
 			hashTuple = hashTuple->next;
-		else if (hjstate->hj_CurBucketNo < hashtable->nbuckets)
+		else if (hjstate->hj_CurBucketNo < hashtable->nbuckets_inmem)
 		{
 			hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
 			hjstate->hj_CurBucketNo++;
@@ -1052,7 +1112,7 @@ void
 ExecHashTableReset(HashJoinTable hashtable)
 {
 	MemoryContext oldcxt;
-	int			nbuckets = hashtable->nbuckets;
+	int			nbuckets_inmem = hashtable->nbuckets_inmem;
 
 	/*
 	 * Release all the hash buckets and tuples acquired in the prior pass, and
@@ 

Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Simon Riggs
On 22 June 2013 21:40, Stephen Frost sfr...@snowman.net wrote:

 I'm actually not a huge fan of this as it's certainly not cheap to do. If it
 can be shown to be better than an improved heuristic then perhaps it would
 work but I'm not convinced.

We need two heuristics, it would seem:

* an initial heuristic to overestimate the number of buckets when we
have sufficient memory to do so

* a heuristic to determine whether it is cheaper to rebuild a dense
hash table into a better one.

Although I like Heikki's rebuild approach we can't do this every x2
overstretch. Given large underestimates exist we'll end up rehashing
5-12 times, which seems bad. Better to let the hash table build and
then re-hash once, if we can see it will be useful.

OK?
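
For what it's worth, here is a minimal sketch of that second heuristic as I read
it; the cost constants and names are assumptions, nothing here is from an actual
patch. After the build, rebuild at most once if the expected extra chain-walking
outweighs a one-off rescan of the in-memory tuples.

#include <stdbool.h>

#define NTUP_PER_BUCKET 10

/*
 * Sketch: should we rebuild the hash table exactly once after loading?
 * Compares the estimated cost of walking overlong chains for the
 * expected number of probes against the cost of re-bucketing every
 * in-memory tuple.  The two cost constants are assumed, not measured.
 */
static bool
should_rebuild_once(double ntuples_built, double nbuckets,
					double nprobes_expected)
{
	double		load = ntuples_built / nbuckets;
	double		rebuild_cost_per_tuple = 1.0;
	double		extra_probe_cost = 0.5;
	double		extra_per_probe;

	if (load <= NTUP_PER_BUCKET)
		return false;			/* already at or under the target load */

	/* extra comparisons per probe if we keep the dense table as-is */
	extra_per_probe = (load - NTUP_PER_BUCKET) * extra_probe_cost;

	return nprobes_expected * extra_per_probe >
		ntuples_built * rebuild_cost_per_tuple;
}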

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Stephen Frost
On Saturday, June 22, 2013, Heikki Linnakangas wrote:

 On 22.06.2013 19:19, Simon Riggs wrote:

 So I think that (2) is the best route: Given that we know with much
 better certainty the number of rows in the scanned-relation, we should
 be able to examine our hash table after it has been built and decide
 whether it would be cheaper to rebuild the hash table with the right
 number of buckets, or continue processing with what we have now. Which
 is roughly what Heikki proposed already, in January.


 Back in January, I wrote a quick patch to experiment with rehashing when
 the hash table becomes too full. It was too late to make it into 9.3 so I
 didn't pursue it further back then, but IIRC it worked. If we have the
 capability to rehash, the accuracy of the initial guess becomes much less
 important.


What we're hashing isn't going to change mid-way through or be updated
after we've started doing lookups against it.

Why not simply scan and queue the data and then build the hash table right
the first time?  Also, this patch doesn't appear to address dups and
therefore would rehash unnecessarily. There's no point rehashing into more
buckets if the buckets are only deep due to lots of dups. Figuring out how
many distinct values there are, in order to build the best hash table, is
actually pretty expensive compared to how quickly we can build the table
today. Lastly, this still encourages collisions due to too few buckets. If
we would simply start with more buckets outright we'd reduce the need to
rehash..

Thanks,

Stephen


Re: [HACKERS] Support for RANGE ... PRECEDING windows in OVER

2013-06-22 Thread ian link
Thanks Craig! That definitely does help. I probably still have some
questions but I think I will read through the rest of the code before
asking. Thanks again!

Ian

 Craig Ringer
 Friday, June 21, 2013 8:41 PM

 On 06/22/2013 03:30 AM, ian link wrote:

 Forgive my ignorance, but I don't entirely understand the problem. What
 does '+' and '-' refer to exactly?

 Consider 'RANGE 4.5 PRECEDING'.

 You need to be able to test whether, for the current row 'b', any given
 row 'a' is within the range (b - 4.5) < a <= b. Not 100% sure about the
 < vs <= boundaries, but that's irrelevant for the example.

 To test that, you have to be able to do two things: you have to be able
 to test whether one value is greater than another, and you have to be
 able to add or subtract a constant from one of the values.

 Right now, the b-tree access method provides information on the ordering
 operators < <= = >= > <>, which provides half the answer. But these
 don't give any concept of *distance* - you can test ordinality but not
 cardinality.

 To implement the different by 4.5 part, you have to be able to add 4.5
 to one value or subtract it from the other.

 The obvious way to do that is to look up the function that implements
 the '+' or '-' operator, and do:

 ((OPERATOR(+))(a, 4.5)) > b AND (a <= b)

 or

 ((OPERATOR(-))(b, 4.5)) < a AND (a <= b);

 The problem outlined by Tom in prior discussion about this is that
 PostgreSQL tries really hard not to assume that particular operator
 names mean particular things. Rather than knowing that + is always
 an operator that adds two values together; is transitive, symmetric and
 reflexive, PostgreSQL requires that you define an *operator class* that
 names the operator that has those properties.

 Or at least, it does for less-than, less-than-or-equals, equals,
 greater-than-or-equals, greater-than, and not-equals as part of the
 b-tree operator class, which *usually* defines these operators as
 < <= = >= > <>, but you could use any operator names you wanted if you
 really liked.

 Right now (as far as I know) there's no operator class that lets you
 identify operators for addition and subtraction in a similar way. So
 it's necessary to either add such an operator class (in which case
 support has to be added for it for every type), extend the existing
 b-tree operator class to provide the info, or blindly assume that +
 and - are always addition and subtraction.

 For an example of why such assumptions are a bad idea, consider matrix
 multiplication. Normally, a * b = b * a, but this isn't true for
 multiplication of matrices. Similarly, if someone defined a + operator
 as an alias for string concatenation (||), we'd be totally wrong to
 assume we could use that for doing range-offset windowing.

 So. Yeah. Operator classes required, unless we're going to change the
 rules and make certain operator names special in PostgreSQL, so that
 if you implement them they *must* have certain properties. This seems
 like a pretty poor reason to add such a big change.

 I hope this explanation (a) is actually correct and (b) is helpful.
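
Restating the arithmetic as code, here is a tiny sketch of the frame test that
RANGE 4.5 PRECEDING needs, in both the subtraction form and the equivalent
addition form. It is written over plain C doubles purely for illustration,
sidestepping the operator-class lookup discussed above, and it assumes the
inclusive lower bound that the explanation leaves open.

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch: is row value 'a' inside the frame "RANGE offset PRECEDING"
 * anchored at current row value 'b'?  For real datatypes PostgreSQL
 * would have to find the '+'/'-' (and ordering) operators through an
 * operator class; here we just use double arithmetic.
 */
static bool
in_range_preceding(double a, double b, double offset)
{
	return (b - offset) <= a && a <= b;	/* subtraction form */
}

static bool
in_range_preceding_add(double a, double b, double offset)
{
	return (a + offset) >= b && a <= b;	/* equivalent addition form */
}

int
main(void)
{
	printf("%d %d\n",
		   in_range_preceding(7.0, 10.0, 4.5),		/* 1: 5.5 <= 7 <= 10 */
		   in_range_preceding_add(4.0, 10.0, 4.5));	/* 0: 4 < 5.5 */
	return 0;
}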

 ian link
 Friday, June 21, 2013 12:30 PM
 Forgive my ignorance, but I don't entirely understand the problem. What
does '+' and '-' refer to exactly?
 Thanks!



 Hitoshi Harada
 Friday, June 21, 2013 4:35 AM




Re: [HACKERS] Patch for fast gin cache performance improvement

2013-06-22 Thread ian link
Looks like my community login is still not working. No rush, just wanted to
let you know. Thanks!

Ian


On Tue, Jun 18, 2013 at 11:41 AM, Ian Link i...@ilink.io wrote:


 No worries! I'll just wait until it's up again.
 Thanks
 Ian

   Kevin Grittner kgri...@ymail.com
  Tuesday, June 18, 2013 9:15 AM

 Oops -- we seem to have a problem with new community logins at the
 moment, which will hopefully be straightened out soon.  You might
 want to wait a few days if you don't already have a login.

 --
 Kevin Grittner
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
   Kevin Grittner kgri...@ymail.com
  Tuesday, June 18, 2013 9:09 AM

  Ian Link i...@ilink.io wrote:


 This patch contains a performance improvement for the fast gin
 cache.

 Our test queries improve from about 0.9 ms to 0.030 ms.

 Impressive!


 Thanks for reading and considering this patch!


 Congratulations on your first PostgreSQL patch!  To get it
 scheduled for review, please add it to this page:
 https://commitfest.postgresql.org/action/commitfest_view/open


 You will need to get a community login (if you don't already have
 one), but that is a quick and painless process.  Choose an
 appropriate topic (like Performance) and reference the message ID
 of the email to which you attached the patch.  Don't worry about
 the fields for reviewers, committer, or date closed.

 Sorry for the administrative overhead, but without it things can
 fall through the cracks.  You can find an overview of the review
 process with links to more detail here:

 http://wiki.postgresql.org/wiki/CommitFest

 Thanks for contributing!


   Ian Link i...@ilink.io
  Monday, June 17, 2013 9:42 PM

 This patch contains a performance improvement for the fast gin cache. As
 you may know, the performance of the fast gin cache decreases with its
 size. Currently, the size of the fast gin cache is tied to work_mem. The
 size of work_mem can often be quite high. The large size of work_mem is
 inappropriate for the fast gin cache size. Therefore, we created a separate
 cache size called gin_fast_limit. This global variable controls the size of
 the fast gin cache, independently of work_mem. Currently, the default
 gin_fast_limit is set to 128kB. However, that value could need tweaking.
 64kB may work better, but it's hard to say with only my single machine to
 test on.

 On my machine, this patch results in a nice speed up. Our test queries
 improve from about 0.9 ms to 0.030 ms. Please feel free to use the test
 case yourself: it should be attached. I can look into additional test cases
 (tsvectors) if anyone is interested.

 In addition to the global limit, we have provided a per-index limit:
 fast_cache_size. This per-index limit begins at -1, which means that it is
 disabled. If the user does not specify a per-index limit, the index will
 simply use the global limit.

 I would like to thank Andrew Gierth for all his help on this patch. As
 this is my first patch he was extremely helpful. The idea for this
 performance improvement was entirely his. I just did the implementation.
  Thanks for reading and considering this patch!


 Ian Link



Re: [HACKERS] A better way than tweaking NTUP_PER_BUCKET

2013-06-22 Thread Stephen Frost
On Saturday, June 22, 2013, Simon Riggs wrote:

 On 22 June 2013 21:40, Stephen Frost sfr...@snowman.net wrote:

  I'm actually not a huge fan of this as it's certainly not cheap to do.
 If it
  can be shown to be better than an improved heuristic then perhaps it
 would
  work but I'm not convinced.

 We need two heuristics, it would seem:

 * an initial heuristic to overestimate the number of buckets when we
 have sufficient memory to do so

 * a heuristic to determine whether it is cheaper to rebuild a dense
 hash table into a better one.

 Although I like Heikki's rebuild approach we can't do this every x2
 overstretch. Given large underestimates exist we'll end up rehashing
 5-12 times, which seems bad. Better to let the hash table build and
 then re-hash once, if we can see it will be useful.

 OK?


I've been thinking a bit more on your notion of simply using as much memory
as we're permitted, but maybe adjust it down based on how big the input to
the hash table is (which I think we have better stats on, and even if we
don't, we could easily keep track of how many tuples we've seen and
consider rehashing as we go).  Still doesn't really address the issue of
dups though. It may still be much larger than it should be if there's a lot
of duplicates in the input that hash into a much smaller set of buckets.

Will think on it more.

Thanks,

Stephen


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread MauMau

From: Alvaro Herrera alvhe...@2ndquadrant.com

MauMau escribió:

I thought of adding some new state of pmState for some reason (that
might be the same as your idea).
But I refrained from doing that, because pmState has already many
states.  I was afraid adding a new pmState value for this bug fix
would complicate the state management (e.g. state transition in
PostmasterStateMachine()).  In addition, I felt PM_WAIT_BACKENDS was
appropriate because postmaster is actually waiting for backends to
terminate after sending SIGQUIT.  The state name is natural.


Well, a natural state name is of no use if we're using it to indicate
two different states, which I think is what would be happening here.
Basically, with your patch, PM_WAIT_BACKENDS means two different things
depending on AbortStartTime.


I probably got your feeling.  You aren't feeling comfortable about using the 
time variable AbortStartTime as a state variable to change postmaster's 
behavior, are you?


That may be right, but I'm not sure... because in PM_WAIT_BACKENDS, as 
the name indicates, postmaster is indeed waiting for the backends to 
terminate regardless of AbortStartTime.  Apart from this, postmaster seems 
to change its behavior in the same pmState depending on other variables such 
as Shutdown and FatalError.  I'm not confident which is better, so I 
won't object if the reviewer or committer modifies the code.


Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] Revive line type

2013-06-22 Thread rui hua
Hi,

Test results are as follows:

Contents  Purpose

This patch finishes the line type and related functions that were not done
yet but are listed in the catalogs and documentation. There are no other new
features added in this patch.

Regression test cases are included in this patch, and the documentation is
also updated.


 Run

Regression tests all succeed, but several problems were found while doing
some simple tests. The updated documentation says that the points used in
the output are not necessarily the points used on input.  I understand that,
as long as they denote the same line. But when the input data represents a
horizontal or vertical line, the output is not exactly the same line; it is
another line parallel to it.
For example:


postgres=# select line('1,3,2,3');

  line

-

 [(0,-3),(1,-3)]

(1 row)



postgres=# select line('1,3,1,6');

  line

-

 [(-1,0),(-1,1)]

(1 row)


In addition, when a straight line coincides with a coordinate axis, the output
contains -0; I do not know whether that is appropriate.

postgres=# select line('0,1,0,5');

  line

-

 [(-0,0),(-0,1)]

(1 row)


A negative value appeared when using <-> to calculate the distance between two
parallel lines.


postgres=# select line('1,1,2,1') <-> line('-1,-1,-2,-1') ;

 ?column?

--

   -2


postgres=# select lseg('1,1,2,1') <-> line('-1,-1,-2,-1') ;

 ?column?

--

   -2

(1 row)


The same situation occurs in the distance calculation between a point and a
straight line.

postgres=# select point('-1,1') <-> line('-3,0,-4,0') ;

 ?column?

--

-1

(1 row)



Should the distance be a positive number?
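
For reference, the textbook point-to-line distance is non-negative by
construction, which is why the sign above looks like a bug. A minimal sketch
follows, for illustration only and not the patch's code, using the general line
A*x + B*y + C = 0.

#include <math.h>
#include <stdio.h>

/* Sketch: distance from point (px,py) to the line A*x + B*y + C = 0. */
static double
point_line_distance(double A, double B, double C, double px, double py)
{
	return fabs(A * px + B * py + C) / hypot(A, B);
}

int
main(void)
{
	/* line y = 0 (A=0, B=1, C=0), point (-1,1): distance should be 1 */
	printf("%g\n", point_line_distance(0.0, 1.0, 0.0, -1.0, 1.0));
	return 0;
}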

Other functions seem to work properly.


Performance

==

Because these functions are implemented for the first time, there is no
baseline against which to compare their performance.


 Conclusion

This patch makes the line type work, but there are some bugs. Additionally,
the function close_sl is not implemented.


With Regards,

Rui


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread MauMau

From: Robert Haas robertmh...@gmail.com

On Fri, Jun 21, 2013 at 10:02 PM, MauMau maumau...@gmail.com wrote:
I'm comfortable with 5 seconds.  We are talking about the interval 
between
sending SIGQUIT to the children and then sending SIGKILL to them.  In 
most
situations, the backends should terminate immediately.  However, as I 
said a
few months ago, ereport() call in quickdie() can deadlock indefinitely. 
This

is a PostgreSQL's bug.


So let's fix that bug.  Then we don't need this.


There are two ways to fix that bug.  You are suggesting one of the following 
two methods, aren't you?


1. (Robert-san's idea)
Upon receipt of an immediate shutdown request, postmaster sends SIGKILL to its 
children.


2. (Tom-san's idea)
Upon receipt of an immediate shutdown request, postmaster first sends SIGQUIT 
to its children, waits a while for them to terminate, then sends SIGKILL to 
them.



At first I proposed 1.  Then Tom-san suggested 2 so that the server is as 
friendly to the clients as now, by notifying them of the server shutdown.  I 
was convinced by that idea and chose 2.


Basically, I think both ideas are right.  They can solve the original 
problem.


However, re-considering the meaning of immediate shutdown, I feel 1 is a 
bit better, because trying to do something in quickdie() or somewhere does 
not match the idea of immediate. We may not have to be friendly to the 
clients at immediate shutdown.  The code gets much simpler.  In 
addition, it may be better that we similarly send SIGKILL in the backend crash 
(FatalError) case, eliminate the use of SIGQUIT, and remove quickdie() and 
other SIGQUIT handlers.


What do you think?  How should we reach consensus and proceed?

Regards
MauMau




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers