Re: [HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn

2013-06-27 Thread Andres Freund
On 2013-06-26 20:07:40 -0400, Tom Lane wrote:
  I don't see how we could trigger the conditions for EINPROGRESS on
  Windows that MSDN lists, but since we need it on unixoid systems and it's
  valid to treat the connect as partially successful on Windows, there
  seems little benefit in dropping it.
 
 I was about to argue for removing the EINPROGRESS check on Windows,
 on the grounds that this would be the same type of bug as you're
 complaining of on Linux, ie, if the call does return this error we'll
 mistakenly think the connection is in progress and go on to deliver an
 unhelpful failure message later.

 However, some more trolling of the intertubes suggests that Cygwin's
 emulation of socket() does indeed return EINPROGRESS; see for instance
 this ancient thread of ours:
 http://www.postgresql.org/message-id/flat/14855.49635.565990.716...@kryten.bedford.waii.com#14855.49635.565990.716...@kryten.bedford.waii.com
 Unless we want to distinguish Cygwin from native Windows in this code
 chunk, maybe we'd better leave well enough alone.

After some looking it gets funnier. There currently doesn't seem to be
any chance that we can actually get an EINPROGRESS at that level on
Windows. We #define connect() to pgwin32_connect(), which fudges errno
around; WSAEINPROGRESS is mapped to EINVAL there.

Also, for some reason the connect is made synchronous from the outside's
point of view by calling pgwin32_waitforsinglesocket() if connect()
returns EWOULDBLOCK...
That has been the case since
a4c40f140d23cefbf94e00283f7688c772867f1b.
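
As a reference point, here is a minimal, generic sketch (plain POSIX, not
libpq code; the function names are made up) of the non-blocking connect
pattern being discussed: EINPROGRESS/EWOULDBLOCK from connect(2) only mean
"still in progress", and the real outcome is fetched later with
getsockopt(SO_ERROR) once the socket becomes writable:

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/socket.h>

    /* start a non-blocking connect; 1 = connected, 0 = in progress, -1 = error */
    static int
    start_nonblocking_connect(int sock, const struct sockaddr *addr, socklen_t len)
    {
        fcntl(sock, F_SETFL, fcntl(sock, F_GETFL, 0) | O_NONBLOCK);

        if (connect(sock, addr, len) == 0)
            return 1;

        if (errno == EINPROGRESS || errno == EWOULDBLOCK)
            return 0;           /* poll/select for writability, then finish below */

        return -1;              /* hard failure, e.g. ECONNREFUSED */
    }

    /* call once poll()/select() reports the socket writable */
    static int
    finish_nonblocking_connect(int sock)
    {
        int         err = 0;
        socklen_t   errlen = sizeof(err);

        if (getsockopt(sock, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
            return -1;
        return err;             /* 0 on success, otherwise the saved errno */
    }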

 I still want to delete the test for SOCK_ERRNO == 0.  I traced that back
 to commit da9501bddb4dc33c031b1db6ce2133bcee7b, but I can't find
 anything in the mailing list archives to explain that.  I have an
 inquiry in to Jan to see if he can remember the reason ...

That looks strange, yes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement

2013-06-27 Thread Rushabh Lathia
Latest patch looks good to me.

Regards,
Rushabh


On Wed, Jun 26, 2013 at 11:21 PM, Pavel Stehule pavel.steh...@gmail.com wrote:

 Hello

 updated patch with some basic doc

 Regards

 Pavel


 2013/6/26 Rushabh Lathia rushabh.lat...@gmail.com:
 
 
 
  On Wed, Jun 26, 2013 at 5:11 PM, Pavel Stehule pavel.steh...@gmail.com
  wrote:
 
  2013/6/26 Rushabh Lathia rushabh.lat...@gmail.com:
  
  
  
   On Tue, Jun 25, 2013 at 11:09 AM, Pavel Stehule
   pavel.steh...@gmail.com
   wrote:
  
   2013/6/25 Rushabh Lathia rushabh.lat...@gmail.com:
   
   
   
On Tue, Jun 25, 2013 at 1:47 AM, Pavel Stehule
pavel.steh...@gmail.com
wrote:
   
Hello
   
This is a fragment of my old code from the orafce package - it is
functional, but it is written a little more generically because of the
impossibility of accessing the static variables (in elog.c)
   
static char *
dbms_utility_format_call_stack(char mode)
{
    MemoryContext oldcontext = CurrentMemoryContext;
    ErrorData *edata;
    ErrorContextCallback *econtext;
    StringInfo sinfo;

#if PG_VERSION_NUM >= 80400
    errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO, TEXTDOMAIN);
#else
    errstart(ERROR, __FILE__, __LINE__, PG_FUNCNAME_MACRO);
#endif

    MemoryContextSwitchTo(oldcontext);

    for (econtext = error_context_stack;
         econtext != NULL;
         econtext = econtext->previous)
        (*econtext->callback) (econtext->arg);

    edata = CopyErrorData();
    FlushErrorState();
   
https://github.com/orafce/orafce/blob/master/utility.c
   
2013/6/24 Rushabh Lathia rushabh.lat...@gmail.com:
 Hi,

 The use of this feature is to get the call stack for debugging without
 raising an exception. This definitely seems very useful.

 Here are my comments about the submitted patch:

 The patch applies cleanly (there are a couple of whitespace warnings,
 but those are in the test case file). make and make install went
 smoothly, and make check passed as well. I did some manual testing of
 the feature and it went smoothly.

 However, I would like to share a couple of points:

   
My main motive is to concentrate on the stack info string only. So I
didn't use the errstart function, and I didn't use CopyErrorData either.
These routines do a lot of other work.
   
   
 1) I noticed that you did the initialization of edata manually; just
 wondering, can't we directly use CopyErrorData(), which will do that job?

   
yes, we can, but in this context only the context field is interesting
for us.
   
 I found that inside CopyErrorData() there is Assert:

 Assert(CurrentMemoryContext != ErrorContext);

 And in the patch we are using ErrorContext as a short-living context;
 is that the reason for not using CopyErrorData()?
   
it was not the primary reason, but sure - this routine is not designed
for direct use from the elog module. Also, we can save one palloc call.
   
   
Hmm ok.
   
   
   


 2) To reset the stack to empty, FlushErrorState() can be used.

   
yes, it can be. I used explicit lines due to the different semantics,
although it does the same things. FlushErrorState resets some global
error state and is used from other modules; I only reset ErrorContext.
I feel there is a small difference there - so FlushErrorState is not the
best name for my purpose.
What about this modification:
   
static void
resetErrorStack(void)
{
    errordata_stack_depth = -1;
    recursion_depth = 0;

    /* Delete all data in ErrorContext */
    MemoryContextResetAndDeleteChildren(ErrorContext);
}

and then call resetErrorStack() in InvokeErrorCallbacks,

and rewrite FlushErrorState like

void
FlushErrorState(void)
{
    /* resetting the error stack is enough */
    resetErrorStack();
}
   
???
   
   
Nope. Rather than that, I would still prefer a direct call to
FlushErrorState().
   
   
   
but I can live with a direct call of FlushErrorState() - it is only a
minor change.
   
   
 3) I was thinking about the usability of the feature and wondering how
 about if we add some header to the stack, so the user can differentiate
 between the STACK and a normal NOTICE message?

 Something like:

 postgres=# select outer_outer_func(10);
 NOTICE:  - Call Stack -
 PL/pgSQL function inner_func(integer) line 4 at GET DIAGNOSTICS
 PL/pgSQL function outer_func(integer) line 3 at RETURN
 PL/pgSQL function outer_outer_func(integer) line 3 at RETURN
 CONTEXT:  PL/pgSQL function outer_func(integer) line 3 at RETURN
 PL/pgSQL function outer_outer_func(integer) line 3 at RETURN
  outer_outer_func
 --
20
 (1 row)
   
I 

Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-27 Thread Andres Freund
On 2013-06-26 21:18:49 -0700, Peter Geoghegan wrote:
 On Wed, Jun 26, 2013 at 8:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  TBH, I've always been annoyed that pg_filedump is GPL and so there's no
  way for us to just ship it in contrib.  (That stems from Red Hat
  corporate policy of a dozen years ago, but the conflict is real anyway.)
  If somebody is sufficiently excited about this topic to do something
  that's largely new anyway, I'd be in favor of starting from scratch so
  it could be put under the usual Postgres license.
 
 Heroku are interested in online verification of basebackups (i.e.
 using checksums to verify the integrity of heap files as they are
 backed up, with a view to relying less and less on logical backups). I
 am very glad that you made the page checksums stuff available to
 external utilities in commit f04216341dd1cc235e975f93ac806d9d3729a344.
 
 In the last couple of days, I haven't been able to figure out a way to
 solve the problem of torn pages in a way that isn't a complete kludge
 (with a hopefully-acceptable risk of false positives), so I've been
 operating under the assumption that anything I produce here won't be
 up to the standards of contrib.

Why not do this from a function/background worker in the backend, where
you can go via the buffer manager to avoid torn pages et al.?  If you use
a buffer strategy, the cache poisoning et al. should be controllable.
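
For concreteness, a rough sketch of what "go via the buffer manager with a
bulk-read strategy" could look like (assumptions: this runs in a backend with
the relation already opened and locked; it is not a worked-out patch).
Checksum verification then happens as each block is read in from disk, and
torn pages cannot occur:

    #include "postgres.h"
    #include "miscadmin.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    static void
    verify_relation_via_bufmgr(Relation rel)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
        BufferAccessStrategy strategy = GetAccessStrategy(BAS_BULKREAD);
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            /* raises an ERROR here if the on-disk page fails verification */
            Buffer      buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                                 RBM_NORMAL, strategy);

            ReleaseBuffer(buf);
            CHECK_FOR_INTERRUPTS();
        }

        FreeAccessStrategy(strategy);
    }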

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Review: Patch to compute Max LSN of Data Pages

2013-06-27 Thread Amit Kapila
On Thursday, June 27, 2013 11:26 AM Andres Freund wrote:
 On 2013-06-27 11:16:25 +0530, Amit Kapila wrote:
  On Wednesday, June 26, 2013 10:19 PM Fujii Masao wrote:
    On Wed, Jun 26, 2013 at 8:57 PM, Amit Kapila amit.kap...@huawei.com wrote:
    One more use case for which this utility was done is as below:
    It will be used to decide whether, on the new-standby (old-master), a full
    backup is needed from the new-master (old-standby).
    The backup is required when a data page in the old-master precedes
    the last applied LSN in the old-standby (i.e., new-master) at the moment
    of the failover.
   
That's exactly what I was afraid of. Unless I miss something, the tool is
*NOT* sufficient to do this.
   
You mean to say that if the user knows the max LSN of data pages in the
old-master and the last applied LSN in the new-master, he cannot decide
whether a full backup is needed?  It should be a straightforward decision:
skip a backup if that old-master LSN is less than the new-master one
(i.e., the last applied LSN, IOW the timeline switch LSN).
It was proposed as a use case in the mail below:
http://www.postgresql.org/message-id/CAHGQGwHyd1fY0hF0qKh0-uKDh-gcbYxMOFBYVkkh4jzji-f...@mail.gmail.com
  
   I guess he meant the commit hint bit problem.
 
  True, after reading the thread mentioned by Andres, I got the reason he
  was pointing out why it is not sufficient.
  So can it be useful if the database has checksums enabled?
 
 I think for that use case it's far more useful to work on getting
 pg_rewind, since that has a chance of working when local WAL has been
 applied that hadn't yet shipped to the other side (which is frequently
 the case).

Aren't the use cases for the two a bit different?

pg_computemaxlsn - by computing the max LSN of data pages on a
checksums-enabled database, the user can make the old-master follow the
new-master if the max LSN suggests that a full backup is not required.

pg_rewind - a tool to resynchronize old-master and new-master by copying
changed blocks from the new master.  I think more work might be needed in
case DDLs happened on the old-master after the new-master forked off.

Although for this case both have a resemblance in terms of avoiding a full
backup, I think the max-LSN tool can also be used independently.
Do you think pg_rewind can be used by default for any checksums-enabled
database to resynchronize the old-master?
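
(Restating the decision rule in code form; the names are invented and this is
not part of the patch:)

    #include "postgres.h"
    #include "access/xlogdefs.h"

    /*
     * Sketch of the rule discussed above: if any data page on the old master
     * carries an LSN at or past the point where the new master diverged, the
     * old master has local changes the new master never received, so it
     * cannot simply follow it and a full backup is needed.
     */
    static bool
    full_backup_required(XLogRecPtr max_data_page_lsn_old_master,
                         XLogRecPtr divergence_lsn_new_master)
    {
        return max_data_page_lsn_old_master >= divergence_lsn_new_master;
    }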

With Regards,
Amit Kapila.





Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-27 Thread Peter Geoghegan
On Wed, Jun 26, 2013 at 11:27 PM, Andres Freund and...@2ndquadrant.com wrote:
 Why not do this from a function/background worker in the backend, where
 you can go via the buffer manager to avoid torn pages et al.?  If you use
 a buffer strategy, the cache poisoning et al. should be controllable.

I had considered that, but thought it might be a little bit
aggressive, even with a strategy of BAS_BULKREAD. Maybe the kludge I
have in mind might not end up being that bad in practice, and would
certainly perform better than an approach that used the buffer
manager. But then, going through shared_buffers could be worth the
overhead, if only for the peace of mind of not relying on something
that is subtly broken.


-- 
Peter Geoghegan




Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-27 Thread Andres Freund
On 2013-06-26 23:42:55 -0700, Peter Geoghegan wrote:
 On Wed, Jun 26, 2013 at 11:27 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
  Why not do this from a function/background worker in the backend, where
  you can go via the buffer manager to avoid torn pages et al.?  If you use
  a buffer strategy, the cache poisoning et al. should be controllable.
 
 I had considered that, but thought it might be a little bit
 aggressive, even with a strategy of BAS_BULKREAD.

Well, you can influence the pacing yourself, you don't need to rely on
the strategy for that. I'd only use it because of the ringbuffer logic
it has to avoid trashing the cache.

 Maybe the kludge I
 have in mind might not end up being that bad in practice, and would
 certainly perform better than an approach that used the buffer
 manager.

What do you have in mind then?

 But then, going through shared_buffers could be worth the
 overhead, if only for the peace of mind of not relying on something
 that is subtly broken.

Spurious alarms quickly lead to people ignoring them, consciously or
not, so trying to take care not to go there sounds like a good idea.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] add --progress option to pgbench (submission 3)

2013-06-27 Thread KONDO Mitsumasa

Dear Fabien,

(2013/06/27 14:39), Fabien COELHO wrote:

If I show a latency at full load, that would be nclients/tps, not 1/tps.
However, I'm hoping to pass the throttling patch to pgbench, in which case the
latency to show is a little bit different because the nclients/tps would
include sleep time and does not correspond to the latency for the end user. 
Also,
under throttling it would also be useful to show the time lag behind scheduled
transactions.
All right. Of course - I trust that the features you wish for will be realized
with the best implementation.
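
(For the archives, a tiny sketch of the computation under discussion; the
variable names and output format are invented, not taken from the patch. The
point is simply that the per-interval latency reported at full load is
nclients/tps rather than 1/tps.)

    #include <stdio.h>

    /* sketch only: derive the numbers for one progress line */
    static void
    report_progress(double interval_sec, long xacts_in_interval, int nclients)
    {
        double tps = xacts_in_interval / interval_sec;
        double latency_ms = 1000.0 * nclients / tps;  /* nclients/tps, not 1/tps */

        fprintf(stderr, "progress: %.1f tps, latency %.3f ms\n", tps, latency_ms);
    }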



So I would like to know whether the throttling patch is committed and then 
update
the progress patch to take that into account.

OK! I will watch for it and use it.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center





Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-27 Thread Peter Geoghegan
On Tue, Jun 18, 2013 at 9:42 AM, Jeff Davis pg...@j-davis.com wrote:
 I'm not sure what the resolution of Alvaro's concern was, so I left the
 flag reporting the same as the previous patch.

Alvaro's concern was that the new flags added (those added by the
foreign key locks patch) do something cute with re-using multiple
other bits in an otherwise nonsensical combination to represent a
distinct state. So as written, the infoMask if statements will result
in spurious reporting of information stored in t_infomask. If you AND
some integer with HEAP_XMAX_SHR_LOCK and get something non-zero,
you'll surely also get a non-zero result with HEAP_LOCK_MASK, because
the latter flag has all the same bits set as the former (plus others,
obviously).
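
To illustrate with made-up bit values (not the real htup_details.h constants):
when one "flag" is itself a combination of other bits, a plain bitwise-AND
test fires for unrelated states too, so the masked value has to be compared
for equality instead, roughly:

    #define XMAX_KEYSHR_LOCK  0x0010    /* illustrative values only */
    #define XMAX_EXCL_LOCK    0x0040
    #define XMAX_SHR_LOCK     (XMAX_EXCL_LOCK | XMAX_KEYSHR_LOCK)
    #define XMAX_LOCK_MASK    (XMAX_SHR_LOCK | XMAX_EXCL_LOCK | XMAX_KEYSHR_LOCK)

    static int
    is_share_locked(unsigned int infomask)
    {
        /* wrong: also true for key-share-only or exclusive-only locks */
        /* return (infomask & XMAX_SHR_LOCK) != 0; */

        /* right: the whole bit combination, and only that, must be present */
        return (infomask & XMAX_LOCK_MASK) == XMAX_SHR_LOCK;
    }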


-- 
Peter Geoghegan




Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Yuri Levinsky
  Guys,
Single-core CPUs are dying even for home users; my cell phone has 4 cores.
Today's standard is a minimum of 4 cores per CPU, and tomorrow, who knows?
Parallelization is sometimes the only solution for heavy nightly jobs.
On the other hand, parallelization is very tricky and unpredictable
when it gets into users' hands.  Anyway, when you have this option (same
for hash partitioning) you are in a much better position than when you don't
have it. The question is: when may we hope to have it?

Sincerely yours,


Yuri Levinsky, DBA
Celltick Technologies Ltd., 32 Maskit St., Herzliya 46733, Israel
Mobile: +972 54 6107703, Office: +972 9 9710239; Fax: +972 9 9710222


-Original Message-
From: Markus Wanner [mailto:mar...@bluegap.ch] 
Sent: Wednesday, June 26, 2013 6:56 PM
To: Heikki Linnakangas
Cc: Kevin Grittner; Claudio Freire; Robert Haas; Bruce Momjian; Yuri
Levinsky; PostgreSQL-Dev
Subject: Re: [HACKERS] Hash partitioning.

On 06/26/2013 05:46 PM, Heikki Linnakangas wrote:
 We could also allow a large query to search a single table in
parallel.
 A seqscan would be easy to divide into N equally-sized parts that can 
 be scanned in parallel. It's more difficult for index scans, but even 
 then it might be possible at least in some limited cases.

So far reading sequentially is still faster than hopping between
different locations. Purely from the I/O perspective, that is.

For queries where the single CPU core turns into a bottle-neck and which
we want to parallelize, we should ideally still do a normal, fully
sequential scan and only fan out after the scan and distribute the
incoming pages (or even tuples) to the multiple cores to process.

Regards

Markus Wanner







Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-27 Thread Peter Geoghegan
On Thu, Jun 27, 2013 at 12:07 AM, Peter Geoghegan p...@heroku.com wrote:
 I'm not sure what the resolution of Alvaro's concern was, so I left the
 flag reporting the same as the previous patch.

 Alvaro's concern was that the new flags added (those added by the
 foreign key locks patch) do something cute with re-using multiple
 other bits in an otherwise nonsensical combination to represent a
 distinct state.

I just realized that it wasn't that you didn't understand the nature
of the problem, but that you weren't sure of a resolution. Sorry for
the noise.

-- 
Peter Geoghegan




Re: [HACKERS] refresh materialized view concurrently

2013-06-27 Thread Hitoshi Harada
On Tue, Jun 25, 2013 at 9:07 AM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Jun 21, 2013 at 5:20 AM, Hitoshi Harada umi.tan...@gmail.com
 wrote:
  If I don't miss something, the requirement for the CONCURRENTLY option
 is to
  allow simple SELECT reader to read the matview concurrently while the
 view
  is populating the new data, and INSERT/UPDATE/DELETE and SELECT FOR
  UPDATE/SHARE are still blocked.  So, I wonder why it is not possible
 just to
  acquire ExclusiveLock on the matview while populating the data and swap
 the
  relfile by taking small AccessExclusiveLock.  This lock escalation is no
  dead lock hazard, I suppose, because concurrent operation would block the
  other at the point ExclusiveLock is acquired, and ExclusiveLock conflicts
  AccessExclusiveLock.  Then you don't need the complicated SPI logic or
  unique key index dependency.

 This is no good.  One, all lock upgrades are deadlock hazards.  In
 this case, that plays out as follows: suppose that the session running
 REFRESH MATERIALIZED VIEW CONCURRENTLY also holds a lock on something
 else.  Some other process takes an AccessShareLock on the materialized
 view and then tries to take a conflicting lock on the other object.
 Kaboom, deadlock.  Granted, the chances of that happening in practice
 are small, but it IS the reason why we typically try to avoid having
 long-running operations perform lock upgrades.  Users get really
 annoyed when their DDL runs for an hour and then rolls back.


OK, that's not safe.  What I was thinking was something similar to
compare-and-swap, where the whole operation is atomic under an
AccessExclusiveLock.  What if we release the ExclusiveLock once the new
matview has been created and re-acquire an AccessExclusiveLock before trying
the swap?  Note that a matview is a little different from an index, which I
know people are talking about in the REINDEX CONCURRENTLY thread, in that the
content of a matview does not change incrementally (at least at this point)
but only changes fully, in the swapping operation of the same REFRESH
MATERIALIZED VIEW command.  The only race condition is that between releasing
the ExclusiveLock and re-acquiring the AccessExclusiveLock, someone else can
go ahead with the same operation and create another one.  If that happens,
let's abort ourselves, because I guess that's the way our transaction system
works anyway; in the case of a unique key index insertion, for example, if I
find another guy inserting the same value into the index, I wait for him to
finish his work, and if his transaction commits I give up, otherwise I go
ahead.  Maybe it's annoying if an hour-long operation finally gets aborted,
but my purpose is actually achieved by the other guy.  If the primary goal of
this feature is to let readers read the matview concurrently, it should be
OK?

Hmm, but in such cases the first guy always wins and the second guy, who may
come an hour later, loses, so we cannot get the result from the latest
command... I still wonder whether there should be some way.


 Two, until we get MVCC catalog scans, it's not safe to update any
 system catalog tuple without an AccessExclusiveLock on some locktag
 that will prevent concurrent catalog scans for that tuple.  Under
 SnapshotNow semantics, concurrent readers can fail to see that the
 object is present at all, leading to mysterious failures - especially
 if some of the object's catalog scans are seen and others are missed.


 So what I'm saying above is to take an AccessExclusiveLock when swapping the
relfile in the catalog.  This doesn't violate your statement, I suppose.  I'm
actually still skeptical about MVCC catalogs, because even if you can make
catalog lookups MVCC, the relfile on the filesystem is not MVCC.  If session 1
changes the relfilenode in pg_class, commits the transaction, and deletes the
old relfile from the filesystem, another concurrent session 2 that took a
snapshot just before session 1 made the change keeps running, tries to open
this relation, grabs the old relfilenode and opens it from the filesystem --
ERROR: relfile not found.  So everyone actually needs to see up-to-date
information that is synchronized with what the filesystem says, and that's
SnapshotNow.  In my experimental thought above about the compare-and-swap
approach, the compare phase needs to see the most recent valid information,
otherwise it never notices that someone did something new.  Since I haven't
read the whole thread, maybe this has already been discussed, but it would
help if you could clarify this concern.


Thanks,
-- 
Hitoshi Harada


Re: [HACKERS] refresh materialized view concurrently

2013-06-27 Thread Hitoshi Harada
On Wed, Jun 26, 2013 at 1:38 AM, Kevin Grittner kgri...@ymail.com wrote:

 Hitoshi Harada umi.tan...@gmail.com wrote:

  I spent a few hours to review the patch.

 Thanks!

  As far as I can tell, the overall approach is as follows.
 
  - create a new temp heap as non-concurrent does, but with
  ExclusiveLock on the matview, so that reader wouldn't be blocked

 Non-concurrent creates the heap in the matview's tablespace and
 namespace, so the temp part is different in concurrent
 generation.  This difference is why concurrent can be faster when
 few rows change.


It's still not clear to me why you need temp in concurrent and not in
non-concurrent.  If this type of operation always creates a temp table
and just swaps it with the existing one, why can't we just make it temp always?
And if performance is the only concern, is temp better than just
turning off WAL for the table, or an UNLOGGED table?


 Also, before the next step there is an ANALYZE of the temp table,
 so the planner can make good choices in the next step.

  - with this temp table and the matview, query FULL JOIN and
  extract difference between original matview and temp heap (via SPI)

 Right; into another temp table.

  - this operation requires unique index for performance reason (or
  correctness reason too)

 It is primarily for correctness in the face of duplicate rows which
 have no nulls.  Do you think the reasons need to be better
 documented with comments?


Ah, yes, even after looking at the patch I was confused about whether it was
for performance or correctness.  It's a shame we cannot refresh it concurrently
if we have duplicate rows in the matview.  I thought it would make sense to
allow it without a unique key if it were only a performance tradeoff.



 I also modified the confusing error message to something close to
 the suggestion from Robert.

 What to do about the Assert that the matview is not a system
 relation seems like material for a separate patch.  After review,
 I'm inclined to remove the test altogether, so that extensions can
 create matviews in pg_catalog.


I like this better.


 New version attached.


 Will take another look.

Thanks,
-- 
Hitoshi Harada


[HACKERS] Documentation/help for materialized and recursive views

2013-06-27 Thread Magnus Hagander
Is there a particular reason why CREATE RECURSIVE VIEW is part of the
help for CREATE VIEW, but CREATE MATERIALIZED VIEW doesn't show up
there?

I realize the technical reason (they're different man pages, and that
also controls what's in \h in psql which is where I ran into it), but
was there any particular reason to split those up in the first place?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] refresh materialized view concurrently

2013-06-27 Thread Andres Freund
On 2013-06-27 00:12:07 -0700, Hitoshi Harada wrote:
  Two, until we get MVCC catalog scans, it's not safe to update any
  system catalog tuple without an AccessExclusiveLock on some locktag
  that will prevent concurrent catalog scans for that tuple.  Under
  SnapshotNow semantics, concurrent readers can fail to see that the
  object is present at all, leading to mysterious failures - especially
  if some of the object's catalog scans are seen and others are missed.
 
 
  So what I'm saying above is to take an AccessExclusiveLock when swapping the
 relfile in the catalog.  This doesn't violate your statement, I suppose.  I'm
 actually still skeptical about MVCC catalogs, because even if you can make
 catalog lookups MVCC, the relfile on the filesystem is not MVCC.  If session 1
 changes the relfilenode in pg_class, commits the transaction, and deletes the
 old relfile from the filesystem, another concurrent session 2 that took a
 snapshot just before session 1 made the change keeps running, tries to open
 this relation, grabs the old relfilenode and opens it from the filesystem --
 ERROR: relfile not found.

We can play cute tricks akin to what CREATE INDEX CONCURRENTLY currently
does, i.e. wait for all other transactions that could have possibly seen
the old relfilenode (they must hold at least a share lock on the
relation) before dropping the actual storage.

The reason we cannot currently do that in most scenarios is that we
cannot perform transactional/mvcc updates of non-exclusively locked
objects due to the SnapshotNow problems of seeing multiple or no
versions of a row during a single scan.
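
Roughly the shape of that wait, sketched from the CREATE INDEX CONCURRENTLY
pattern (assumptions: we already hold a lock that keeps new users away from
the old relfilenode, and we are about to drop its storage; this is an
illustration, not a patch):

    #include "postgres.h"
    #include "storage/lmgr.h"
    #include "storage/lock.h"

    /*
     * Block until every transaction that could still be using the old
     * relfilenode (anything that held at least AccessShareLock on the
     * relation when we looked) has finished.
     */
    static void
    wait_for_old_relfilenode_users(Oid dbid, Oid relid)
    {
        LOCKTAG     heaplocktag;
        VirtualTransactionId *old_lockholders;

        SET_LOCKTAG_RELATION(heaplocktag, dbid, relid);

        old_lockholders = GetLockConflicts(&heaplocktag, AccessExclusiveLock);

        while (VirtualTransactionIdIsValid(*old_lockholders))
        {
            VirtualXactLock(*old_lockholders, true);    /* wait for it to end */
            old_lockholders++;
        }
    }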

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Group Commits Vs WAL Writes

2013-06-27 Thread Atri Sharma
Hi all,

I think this is a naive question.

When we do a commit, WAL buffers are written to disk. This incurs
disk latency for the required I/O.

Now, with group commits, do we see a spike in that disk write latency,
especially in the cases where the user has set wal_buffers to a high
value?


Regards,

Atri

--
Regards,

Atri
l'apprenant




Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Dean Rasheed
On 26 June 2013 22:48, Noah Misch n...@leadboat.com wrote:
 On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote:
 On 26 June 2013 19:32, Noah Misch n...@leadboat.com wrote:
  On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

  md5_agg() is well-defined and not cryptographically novel, and your use case
  is credible.  However, not every useful-sometimes function belongs in core; we
  mostly stick to functions with ubiquitous value and functions that would be
  onerous to implement externally.  (We do carry legacy stuff that wouldn't make
  the cut today.)  In my estimation, md5_agg() does not make that cut.  The
  variety of intuitive md5_agg() definitions attested upthread doesn't bode well
  for its broad applicability.  It could just as well live in an extension
  published on PGXN.  Mine is just one opinion, though; I won't be horrified if
  the community wants an md5_agg() in core.

 I disagree with this though. I see md5_agg() as a natural extension of
 the already-in-core md5() functions and underlying code, satisfying a
 well-defined use-case, and providing a checksum comparable with
 externally generated md5 sums.

 All true, but I don't see those facts justifying core inclusion.

 A quick google search reveals several people asking for something like
 this, and people recommending md5(string_agg(...)) or
 md5(string_agg(md5(...))) based solutions, which are doomed to failure
 on larger tables. So I think that there is a case for having md5_agg()
 in core as an alternative to such hacks, while having more
 sophisticated crypto functions available as extensions.

 My nine Google hits for md5(string_agg included one Stack Overflow answer,
 several mirrors of that answer, and a few posts on this thread.


I found more than that using Google. It's not uncommon for people to
use Google and then pick the first accepted answer on Stack
Overflow, in which case they might well pick one of the answers here
[1] or here [2]. Or they might find this [3]. Or if they came to our
lists they might find this [4], or deduce from this [5] that it isn't
possible.

[1] 
http://stackoverflow.com/questions/4020033/how-can-i-get-a-hash-of-an-entire-table-in-postgresql

[2] 
http://stackoverflow.com/questions/13554333/finding-out-the-hash-value-of-a-group-of-rows

[3] https://ralphholz.de/?q=node/298

[4] http://www.postgresql.org/message-id/4e5f469c.5070...@compulab.co.il

[5] 
http://www.postgresql.org/message-id/e94d85500801301153u6b976e31m89e311c7134a0...@mail.gmail.com

I'd say there are clearly people who want it, and the nature of some
of those answers suggests to me that we ought to have a better answer
in core.
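
To spell out why the aggregate form matters: md5_agg() only needs an
initialize/accumulate/finalize loop in constant memory, whereas
md5(string_agg(...)) has to materialize the whole concatenation first. A
stand-alone illustration of the streaming shape, using OpenSSL's MD5_*
functions purely as a stand-in for whatever MD5 implementation ends up being
used:

    #include <stdio.h>
    #include <string.h>
    #include <openssl/md5.h>

    int
    main(void)
    {
        const char *rows[] = {"row one\n", "row two\n", "row three\n"};
        unsigned char digest[MD5_DIGEST_LENGTH];
        char hex[2 * MD5_DIGEST_LENGTH + 1];
        MD5_CTX ctx;

        /* initialize / accumulate / finalize: constant memory, any number of rows */
        MD5_Init(&ctx);
        for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
            MD5_Update(&ctx, rows[i], strlen(rows[i]));
        MD5_Final(digest, &ctx);

        for (int i = 0; i < MD5_DIGEST_LENGTH; i++)
            sprintf(hex + 2 * i, "%02x", digest[i]);
        printf("%s\n", hex);    /* matches md5sum over the same concatenated bytes */
        return 0;
    }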


 I haven't looked at pgcrypto because, as I said, performance wasn't my
 primary criterion either, but removing the unnecessary data copy just
 seemed like good sense.

 I'll take a look at it, and then as you and Peter suggest, look to
 split the patch into two. I think I will withdraw md5_total() because
 I was never entirely happy with that anyway.

 Understood; feel free to back off from any performance improvements in which
 you didn't wish to involve yourself.

 If we do end up moving forward with md5_agg(), I note that the pgcrypto
 version already has an initialize/accumulate/finalize API division.  A patch
 importing that code largely-unchanged would be easier to verify than a patch
 restructuring the src/backend/libpq/md5.c implementation.  The two patches
 would then be a use pgcrypto md5.c in core patch and an md5_agg() patch.


I'll take a look at it, although I think there is still the potential
for bugs either way.

What I've done with libpq's md5.c is just to rearrange the internal
pieces, without touching the core algorithm code or the higher level
callers. So the most likely types of bugs are boundary/size errors. If
I'd broken any of pg_md5_hash(), pg_md5_binary() or pg_md5_crypt(),
then I'd have broken them all. It's possible to get a reasonable level
of confidence in those changes using queries like this on HEAD and
with the patch:

SELECT md5(string_agg(md5(str) || repeat(' ', i), '')) FROM (
  SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROM generate_series(0,1) i,
 generate_series(0,i) j
   GROUP BY i
) t;

and these with the patch:

SELECT md5_agg(md5(str) || repeat(' ', i)) FROM (
  SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROM generate_series(0,1) i,
 generate_series(0,i) j
   GROUP BY i
) t;

SELECT md5_agg(md5_agg || repeat(' ', i)) FROM (
  SELECT i, md5_agg(chr(32+(i+j*31)%224))
FROM generate_series(0,1) i,
 generate_series(0,i) j
   GROUP BY i
) t;

which all produce the same overall sum.

Regards,
Dean




Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Dean Rasheed
On 26 June 2013 21:46, Peter Eisentraut pete...@gmx.net wrote:
 On 6/26/13 4:04 PM, Dean Rasheed wrote:
 A quick google search reveals several people asking for something like
 this, and people recommending md5(string_agg(...)) or
 md5(string_agg(md5(...))) based solutions, which are doomed to failure
 on larger tables.

 The thread discussed several other options for checksumming tables that
 did not have the air of a "cryptographic offering", as Noah put it.


True but md5 has the advantage of being directly comparable with the
output of Unix md5sum, which would be useful if you loaded data from
external files and wanted to confirm that your import process didn't
mangle it.

Regards,
Dean




Re: [HACKERS] in-catalog Extension Scripts and Control parameters (templates?)

2013-06-27 Thread Hitoshi Harada
On Mon, Jun 24, 2013 at 4:20 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:

 Jaime Casanova ja...@2ndquadrant.com writes:
  just tried to build this one, but it doesn't apply cleanly anymore...
  especially the ColId_or_Sconst construct in gram.y

 Please find attached a new version of the patch, v7, rebased to current
 master tree and with some more cleanup. I've been using the new grammar
 entry NonReservedWord_or_Sconst, I'm not sure about that.



I played a bit with this patch.

- If I have a control file that has the same name as a template, does create
extension pick up the control file?  Is this by design?
- Though a control file is kind of global information shared among different
databases, pg_extension_template is not.  Sounds a little weird to me.
- Why do we need the with() clause even when I don't need it?
- I copied and pasted from my plv8.control file into a template script, and
MODULE_PATHNAME didn't work.  By design?
-
foo=# create template for extension ex2 version '1.0' with (requires ex1)
as $$ $$;
ERROR:  template option requires takes an argument
- create template ex2, create extension ex2, alter template ex2 rename to
ex3, create extension ex3, drop template ex3;
foo=# drop template for extension ex3 version '1.0';
ERROR:  cannot drop unrecognized object 3179 16429 0 because other objects
depend on it
- a template that is created in another template script does not appear to
depend on the parent template.
- Without a control file/template, an attempt to create a new extension gives:
foo=# create extension plv8;
ERROR:  extension plv8 has no default control template
but could it be better, like extension plv8 has no default control file or
template?
- Do we really need separate tables, pg_extension_template and
pg_extension_control?
- It looks like both tables don't have toast tables.  Some experimenting gives:
ERROR:  row is too big: size 8472, maximum size 8160

Thanks,

-- 
Hitoshi Harada


Re: [HACKERS] FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]

2013-06-27 Thread Andrew Gierth
Tom Lane said:
 Agreed, separating out the function-call-with-trailing-declaration
 syntaxes so they aren't considered in FROM and index_elem seems like
 the best compromise.

 If we do that for window function OVER clauses as well, can we make
 OVER less reserved?

Yes.

At least, I tried it with both OVER and FILTER unreserved and there
were no grammar conflicts (and I didn't have to do anything fancy to
avoid them), and it passed regression with the exception of the
changed error message for window functions in the from-clause.

So is this the final decision on how to proceed? It seems good to me,
and I can work with David to get it done.

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]

2013-06-27 Thread Pavel Stehule
Hello

2013/6/27 Andrew Gierth and...@tao11.riddles.org.uk:
 Tom Lane said:
 Agreed, separating out the function-call-with-trailing-declaration
 syntaxes so they aren't considered in FROM and index_elem seems like
 the best compromise.

 If we do that for window function OVER clauses as well, can we make
 OVER less reserved?

 Yes.

 At least, I tried it with both OVER and FILTER unreserved and there
 were no grammar conflicts (and I didn't have to do anything fancy to
 avoid them), and it passed regression with the exception of the
 changed error message for window functions in the from-clause.

 So is this the final decision on how to proceed? It seems good to me,
 and I can work with David to get it done.


Isn't it dangerous to make OVER an unreserved keyword?

Regards

Pavel

 --
 Andrew (irc:RhodiumToad)


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




Re: [HACKERS] Documentation/help for materialized and recursive views

2013-06-27 Thread Nicolas Barbier
2013/6/27 Magnus Hagander mag...@hagander.net:

 Is there a particular reason why CREATE RECURSIVE VIEW is part of the
 help for CREATE VIEW, but CREATE MATERIALIZED VIEW doesn't show up
 there?

 I realize the technical reason (they're different man pages, and that
 also controls what's in \h in psql which is where I ran into it), but
 was there any particular reason to split those up in the first place?

Normal views are an abstraction layer, while materialized views
(despite containing the word “view”) are mainly a performance tool (in
a way similar to indexes).

The functionality of materialized views will (over time) totally swamp
that of normal views, so mixing all the corresponding documentation
with the documentation for normal views probably doesn’t make things
easier for people that are only interested in normal views.

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?




Re: [HACKERS] [PATCH] Fix conversion for Decimal arguments in plpython functions

2013-06-27 Thread Szymon Guz
On 27 June 2013 05:21, Steve Singer st...@ssinger.info wrote:

 On 06/26/2013 04:47 PM, Szymon Guz wrote:





 Attached patch has all changes against trunk code.

 There is added a function for conversion from Postgres numeric to Python
 Decimal. The Decimal type is taken from cdecimal.Decimal, if it is
 available. It is an external library, quite fast, but may be not available.
 If it is not available, then decimal.Decimal will be used. It is in
 standard Python library, however it is rather slow.

 The initialization is done in the conversion function, the pointer to a
 proper Decimal constructor is stored as static variable inside the function
 and is lazy initialized.

 The documentation is updated.


 I've tested this version with python 2.7 with and without cdecimal and
 also with 3.3 that has the faster decimal performance. It seems fine.

  The v5 version of the patch makes only white-space changes to plpy_main.c;
  you should exclude those from the patch if you're making a new version (I
  have done this in the v6 version I'm attaching).



  Tests for python 2 and 3 have been added. They work only with standard
 decimal.Decimal, as the type is printed in the *.out files. I think there
 is nothing we can do with that now.



  I think we should make test_type_conversion_numeric do something that
  generates the same output in both cases, i.e.
  py.info(str(x)).  The downside of having the test fail on installs with
  cdecimal installed is much greater than any benefit we get by ensuring that
  the type is really Decimal.
  I've attached a v6 version of the patch that does this; do you agree with
  my thinking?


Hi Steve,
thanks for the changes.

Your idea about common code for decimal and cdecimal is good, however not
good enough. I like the idea of common code for decimal and cdecimal, but
we need the class name, not the value.

I've changed the code from str(x) to x.__class__.__name__ so the function
prints the class name (which is Decimal for both packages), not the value. We
need to have the class name check. The value is returned by the function
and is checked a couple of lines lower in the file.

patch is attached.

thanks,
Szymon


plpython_decimal_v7.patch
Description: Binary data



Re: [HACKERS] Bugfix and new feature for PGXS

2013-06-27 Thread Cédric Villemain
On Wednesday, June 26, 2013 16:52:01, Andrew Dunstan wrote:
 On 06/25/2013 11:29 AM, Cédric Villemain wrote:
  On Tuesday, June 25, 2013 17:18:51, Andrew Dunstan wrote:
  On 06/24/2013 07:24 PM, Cédric Villemain wrote:
  On Tuesday, June 25, 2013 00:18:26, Andrew Dunstan wrote:
  On 06/24/2013 04:02 PM, Cédric Villemain wrote:
  With extensions, we do have to set VPATH explicitly if we want to use
  VPATH (note that contribs/extensions must not care whether postgresql
  has been built with or without VPATH set). My patches try to fix
  that.
  
  No, this is exactly what I'm objecting to. I want to be able to do:
invoke_vpath_magic
standard_make_commands_as_for_non_vpath
  
  Your patches have been designed to overcome your particular issues,
  but they don't meet the general case, IMNSHO. This is why I want to
  have more discussion about how vpath builds could work for PGXS
  modules.
  
  The patch does not restrict anything; it is not supposed to lead to
  regressions.
  The assignments of VPATH and srcdir are wrong in the PGXS case, and the
  patch corrects them. You're still free to do make VPATH=/mypath ...;
  the VPATH provided won't be erased by the pgxs.mk logic.
  
  I still think this is premature.  I have spent some more time trying to
  make it work the way I think it should, so far without success. I think
  we need more discussion about how we'd like VPATH to work for PGXS
  would. There's no urgency about this - we've lived with the current
  situation for quite a while.
  
  Sure...
  I did a quick and dirty patch (I just validated it without a lot of testing),
  attached to this email, to fix json_build (at least for make and make
  install). As you're constructing targets based on commands to execute in
  the srcdir directory, and because srcdir is only set in pgxs.mk, it is
  possible to define srcdir early in the json_build/Makefile and use it.
 
 This still doesn't do what I really want, which is to be able to invoke
 make without anything special in the invocation that triggers VPATH
 processing.

I believe that is the case currently (with my patches applied); we just have to
invoke the Makefile like we invoke configure for PostgreSQL, except that we don't
need a configure stage for the contribs.

Obviously it is not exactly the same because 'configure' is a script to 
execute, but 'make' is a command with a configuration file (the Makefile)

For PostgreSQL it is:
$ mkdir build_dir
$ cd build_dir
$ path/to/source/tree/configure [options go here]
$ gmake

For contribs it is:
$ mkdir build_dir
$ cd build_dir
$ gmake -f /path/to/source/tree/Makefile [options go here]

 Here's what I did that works like I think it should.
 
 First the attached patch on top of yours.
 
 Second, I did:
 
  mkdir vpath.json_build
  cd vpath.json_build
  sh /path/to/pgsource/config/prep_buildtree ../json_build .
  ln -s ../json_build/json_build.control .

OK, this creates supposedly required directories in the build process. This 
looks a bit wrong and more work than required: not all the source directories 
are required by the build process. But I understand the point.

Second, config/prep_buildtree is not installed (by make install), so it is not 
suitable to use it with PGXS (PGXS is found with pg_config, when the postgresql 
source tree is not accessible anymore). We may add config/prep_buildtree to 
INSTALL_PROGRAM. 

The 'ln -s' can probably be copied by prep_buildtree.

 Then I created vpath.mk with these contents:
 
  ext_srcdir = ../json_build
  USE_VPATH = $(ext_srcdir)

OK.

 Finally I vpath-ized the Makefile (see attached).

I don't see how this is prettier than the other solution.

 Given all of that, I was able to do, in the vpath directory:
 
  make
  make install
  make installcheck
  make clean
 
 and it all just worked, with exactly the same make invocations as work
 in an in-tree build.

It breaks others:
  mkdir /tmp/auth_delay  cd /tmp/auth_delay
  sh /path/prep_buildtree /path/contrib/auth_delay/ .
  (no .control, it's not an extension)
  make 
  make: *** No rule to make target 'auth_delay.so', needed by 'all'.  Stop.

This works: 
  cd /tmp/auth_delay
  rm -rf *
  make -f /path/contrib/auth_delay/Makefile

PS: I exported USE_PGXS=1 because of the switch case in the makefile.

 So what's lacking to make this nice is a tool to automate the second and
 third steps above.

If the second and third steps are automated, you'll still have to invoke
them at some point. How?

  mkdir /tmp/json_build  cd /tmp/json_build 
  make
  make install
  make installcheck
  make clean

- this will never work, make won't find anything to do. 
You'll need the equivalent of configure (as we do for PostgreSQL, I agree) to 
trigger prep_buildtree:

  mkdir /tmp/json_build  cd /tmp/json_build 
  $ path/to/source/tree/configure [options go here]
  make
  make install
  make installcheck
  make clean

 Maybe there are other ways of doing this, but this 

Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Nicolas Barbier
2013/6/26 Heikki Linnakangas hlinnakan...@vmware.com:

 On 26.06.2013 16:41, Yuri Levinsky wrote:

 Heikki,
 As far as I understand the height of the btree will affect the number of
 I/Os necessary. The height of the tree does not increase linearly with
 the number of records.

 Now let's compare that with a hash partitioned table, with 1000 partitions
 and a b-tree index on every partition. [..] This is almost equivalent to
 just having a b-tree that's one level taller [..] There certainly isn't
 any difference in the number of actual I/O performed.

Imagine that there are a lot of indexes, e.g., 50. Although a lookup
(walking one index) is equally fast, an insertion must update all 50
indexes. When each index requires one extra I/O (because each index is
one level taller), that is 50 extra I/Os. In the partitioned case,
each index would require the normal smaller amount of I/Os. Choosing
which partition to use must only be done once: The result “counts” for
all indexes that are to be updated.

Additionally: Imagine that the data can be partitioned along some
column that makes sense for performance reasons (e.g., some “date”
where most accesses are concentrated on rows with more recent dates).
The other indexes will probably not have such a performance
distribution. Using those other indexes (both for look-ups and
updates) in the non-partitioned case, will therefore pull a huge
portion of each index into cache (because of the “random distribution”
of the non-date data). In the partitioned case, more cache can be
spent on the indexes that correspond to the “hot partitions.”

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?




Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Nicolas Barbier
2013/6/27 Nicolas Barbier nicolas.barb...@gmail.com:

 When each index requires one extra I/O (because each index is
 one level taller), that is 50 extra I/Os. In the partitioned case,
 each index would require the normal smaller amount of I/Os.

[..]

 Using those other indexes (both for look-ups and
 updates) in the non-partitioned case, will therefore pull a huge
 portion of each index into cache (because of the “random distribution”
 of the non-date data). In the partitioned case, more cache can be
 spent on the indexes that correspond to the “hot partitions.”

It seems that the system described by Claudio fixes this problem another way:

Claudio wrote:

 Now I just have two indices. One that indexes only hot tuples, it's
 very heavily queried and works blazingly fast, and one that indexes by
 (hotness, key).

Yuri, maybe that is something you should investigate instead of partitioning?

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?




Re: [HACKERS] checking variadic any argument in parser - should be array

2013-06-27 Thread Jeevan Chalke
Hi Pavel,

I had a look over your new patch and it looks good to me.

My review comments on patch:

1. It cleanly applies with patch -p1 command.
2. make/make install/make check were smooth.
3. My own testing didn't find any issue.
4. I did a code walk-through and I am a little bit worried or confused about the
following changes:

      if (PG_ARGISNULL(argidx))
          return NULL;

!     /*
!      * Non-null argument had better be an array.  The parser doesn't
!      * enforce this for VARIADIC ANY functions (maybe it should?), so that
!      * check uses ereport not just elog.
!      */
!     arr_typid = get_fn_expr_argtype(fcinfo->flinfo, argidx);
!     if (!OidIsValid(arr_typid))
!         elog(ERROR, "could not determine data type of concat() input");
!
!     if (!OidIsValid(get_element_type(arr_typid)))
!         ereport(ERROR,
!                 (errcode(ERRCODE_DATATYPE_MISMATCH),
!                  errmsg("VARIADIC argument must be an array")));

-     /* OK, safe to fetch the array value */
      arr = PG_GETARG_ARRAYTYPE_P(argidx);

      /*
--- 3820,3828 ----
      if (PG_ARGISNULL(argidx))
          return NULL;

!     /* Non-null argument had better be an array */
!     Assert(OidIsValid(get_element_type(get_fn_expr_argtype(fcinfo->flinfo, argidx))));

      arr = PG_GETARG_ARRAYTYPE_P(argidx);

We have moved the checking of the array type into the parser
(ParseFuncOrColumn()), which basically verifies that the argument type is
indeed an array. That is exactly the same as the second if statement in the
earlier code, which you replaced with an Assert.

However, what about the first if statement? Is it no longer required now? If
get_fn_expr_argtype() returns InvalidOid, don't you think we need to throw
an error saying "could not determine data type of concat() input"?

Well, I tried hard to trigger that code, but wasn't able to come up with any
test that fails with that error, either in the earlier version or in yours. I
therefore believe it is dead code, which you removed? Is that so?

Moreover, if get_fn_expr_argtype() does somehow return InvalidOid, we
will hit an Assert rather than an error; is that what you expect?
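
For what it's worth, the belt-and-braces variant being asked about here would
simply keep the runtime check for the InvalidOid case in front of the Assert
(same calls as in the diff above; sketch only):

    arr_typid = get_fn_expr_argtype(fcinfo->flinfo, argidx);
    if (!OidIsValid(arr_typid))
        elog(ERROR, "could not determine data type of concat() input");

    /* the parser now guarantees array-ness, so an Assert suffices for that */
    Assert(OidIsValid(get_element_type(arr_typid)));

    arr = PG_GETARG_ARRAYTYPE_P(argidx);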

5. This patch has user visibility, i.e. now we throw an error when the
user says just VARIADIC NULL, like:

select concat(variadic NULL) is NULL;

Previously that worked, but now we throw an error. Well, we are now
stricter than before about using VARIADIC + ANY, so I have no issue as
such. But I guess we need to document this user-visible change; I don't
know exactly where, though. I searched for VARIADIC and all the related
documentation says it needs an array, so nothing harmful as such - you can
ignore this review comment, but I thought it worth mentioning.

Thanks


On Thu, Jun 27, 2013 at 12:35 AM, Pavel Stehule pavel.steh...@gmail.com wrote:

 Hello

 remastered version

 Regards

 Pavel

 2013/6/26 Jeevan Chalke jeevan.cha...@enterprisedb.com:
  Hi Pavel
 
 
  On Sat, Jan 26, 2013 at 9:22 AM, Pavel Stehule pavel.steh...@gmail.com
  wrote:
 
  Hello Tom
 
  you did comment
 
  !  * Non-null argument had better be an array.
  The parser doesn't
  !  * enforce this for VARIADIC ANY functions
  (maybe it should?), so
  !  * that check uses ereport not just elog.
  !  */
 
  So I moved this check to parser.
 
   It is a little bit stricter - it requires a typed NULL instead of an
   unknown NULL, but it can report the error better and earlier
 
 
  Tom suggested that this check should be better done by parser.
  This patch tries to accomplish that.
 
  I will go review this.
 
   However, is it possible for you to rebase it on current master? I can't
   apply it using git apply, but patch -p1 succeeded with a lot of
   offset.
 
  Thanks
 
 
 
  Regards
 
  Pavel
 
 
  --
  Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
  To make changes to your subscription:
  http://www.postgresql.org/mailpref/pgsql-hackers
 
 
 
 
  --
  Jeevan B Chalke
   Senior Software Engineer, R&D
  EnterpriseDB Corporation
  The Enterprise PostgreSQL Company
 
  Phone: +91 20 30589500
 
  Website: www.enterprisedb.com
  EnterpriseDB Blog: http://blogs.enterprisedb.com/
  Follow us on Twitter: http://www.twitter.com/enterprisedb
 




-- 
Jeevan B Chalke
Senior Software Engineer, 

Re: [HACKERS] Documentation/help for materialized and recursive views

2013-06-27 Thread Magnus Hagander
On Thu, Jun 27, 2013 at 10:56 AM, Nicolas Barbier
nicolas.barb...@gmail.com wrote:
 2013/6/27 Magnus Hagander mag...@hagander.net:

 Is there a particular reason why CREATE RECURSIVE VIEW is part of the
 help for CREATE VIEW, but CREATE MATERIALIZED VIEW doesn't show up
 there?

 I realize the technical reason (they're different man pages, and that
 also controls what's in \h in psql which is where I ran into it), but
 was there any particular reason to split those up in the first place?

 Normal views are an abstraction layer, while materialized views
 (despite containing the word “view”) are mainly a performance tool (in
 a way similar to indexes).

Oh yes, I'm well aware of this of course.


 The functionality of materialized views will (over time) totally swamp
 that of normal views, so mixing all the corresponding documentation
 with the documentation for normal views probably doesn’t make things
 easier for people that are only interested in normal views.

That's a better point I think. That said, it would be very useful if
it actually showed up in \h CREATE VIEW in psql - I wonder if we
should just add the syntax to that page, and then link said future
information on a separate page somehow?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Ants Aasma
On Jun 27, 2013 12:24 PM, Nicolas Barbier nicolas.barb...@gmail.com
wrote:

 2013/6/27 Nicolas Barbier nicolas.barb...@gmail.com:

  When each index requires one extra I/O (because each index is
  one level taller), that is 50 extra I/Os. In the partitioned case,
  each index would require the normal smaller amount of I/Os.

 [..]

  Using those other indexes (both for look-ups and
  updates) in the non-partitioned case, will therefore pull a huge
  portion of each index into cache (because of the “random distribution”
  of the non-date data). In the partitioned case, more cache can be
  spent on the indexes that correspond to the “hot partitions.”

 It seems that the system described by Claudio fixes this problem another
way:

 Claudio wrote:

  Now I just have two indices. One that indexes only hot tuples, it's
  very heavily queried and works blazingly fast, and one that indexes by
  (hotness, key).

This is not really related to hash partitioning, but you can also do index
partitioning while having the tables unpartitioned. If the hotness field is
a timestamp like it often is, you can create a predicate index on (key,
tstamp) where tstamp > [some date in recent past], and replace the index
with a newer one every so often to keep the size small. This way you can
have a non-partitioned index for batch queries and a small one for the OLTP
workload. If we added the option to build indexes using an index only scan,
building the replacement index would be quite cheap.
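
As a concrete sketch of that approach (the table, column, and index names here
are invented, and the cutoff dates are arbitrary; assume tstamp is a plain
timestamp column):

    CREATE INDEX CONCURRENTLY orders_hot_idx
        ON orders (customer_id, tstamp)
        WHERE tstamp > '2013-06-01';

    -- every so often, build a replacement with a newer cutoff and swap it in
    CREATE INDEX CONCURRENTLY orders_hot_idx_new
        ON orders (customer_id, tstamp)
        WHERE tstamp > '2013-06-20';
    DROP INDEX CONCURRENTLY orders_hot_idx;
    ALTER INDEX orders_hot_idx_new RENAME TO orders_hot_idx;

The planner will only use the partial index for queries whose WHERE clause
implies the index predicate (e.g. tstamp > '2013-06-25'), so batch queries
that scan older data keep using the full, non-partial indexes.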

Regards,
Ants Aasma


Re: [HACKERS] in-catalog Extension Scripts and Control parameters (templates?)

2013-06-27 Thread Dimitri Fontaine
Hi,

Thanks a lot for your review!

Some answers here, new version of the patch with fixes by tuesday.

Hitoshi Harada umi.tan...@gmail.com writes:
 - If I have control file that has the same name as template, create
 extension picks up control file?  Is this by design?

Yes. That should allow us to answer most of Heikki's security complaint,
but isn't enough to allow us to open the feature to non-superusers if I
understand correctly.

 - Though the control file is kind of global information shared among different
 databases, pg_extension_template is not.  Sounds a little weird to me.

I think that's a limitation of the old model and we don't want to turn
templates for extensions into shared catalogs. At least that's my
understanding of the design consensus.

 - Why do we need with() clause even if I don't need it?

Will have a fresh look, thanks.

 - I copied and pasted from my plv8.control file to the template script, and
 MODULE_PATHNAME didn't work.  By design?

Yes. MODULE_PATHNAME is deprecated by the control file parameter of the
same name, and there's no reason to use it in extension templates even
if we got to support C coded extensions in them, which is not the case
now.

 foo=# create template for extension ex2 version '1.0' with (requires ex1)
 as $$ $$;
 ERROR:  template option requires takes an argument

Will see about that.

 - create template ex2, create extension ex2, alter template ex2 rename to
 ex3, create extension ex3, drop template ex3;
 foo=# drop template for extension ex3 version '1.0';
 ERROR:  cannot drop unrecognized object 3179 16429 0 because other objects
 depend on it

Well, if I'm following, you're trying to remove a non-existing object. I
guess you would prefer a better error message, right?

 - a template that is created in another template script does not appear to
 depend on the parent template.

I don't think that should be automatically the case, even if I admit I
didn't think about that case.

 - Without control file/template, attempt to create a new extension gives:
 foo=# create extension plv8;
 ERROR:  extension plv8 has no default control template
 but can it be better, like extension plv8 has no default control file or
 template?

Will rework.

 - Do we really need separate tables, pg_extension_template and
 pg_extension_control?

Yes, to be able to have the same feature as we have today with
auxiliary control files, that is, changing properties of the extension
from one version to the next.

 - Looks like both tables don't have toast tables.  Some experiment gives:
 ERROR:  row is too big: size 8472, maximum size 8160

Will fix in next version of the patch.

Thanks,
-- 
Dimitri Fontaine06 63 07 10 78
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ASYNC Privileges proposal

2013-06-27 Thread Chris Farmiloe
So I would think that if this was to go further then channels would need
to be more of a first class citizen and created explicitly, with CREATE
CHANNEL, DROP CHANNEL etc:

CREATE CHANNEL channame;
GRANT LISTEN ON CHANNEL channame TO rolename;
GRANT NOTIFY ON CHANNEL channame TO rolename;
LISTEN channame; -- OK
NOTIFY channame, 'hi'; -- OK
LISTEN ; -- exception: no channel named 
NOTIFY , 'hi'; -- exception: no channel named 

Personally I think explicitly creating channels would be beneficial; I have
hit issues where a typo in a channel name has caused a bug to go unnoticed
for a while.
But of course this would break backwards compatibility with the current
model (with implicit channel names), so unless we wanted to force everyone
to add CREATE CHANNEL statements during their upgrade, maybe there
would need to be some kind of system to work around this.

Possibly some kind of catch-all channel, that enables implicit channel
names?

GRANT LISTEN, NOTIFY ON CHANNEL * TO PUBLIC; -- enabled by default for
backwards compat
NOTIFY ; -- OK via * CHANNEL
LISTEN ; -- OK via * CHANNEL









Chris


On 18 June 2013 18:31, Josh Berkus j...@agliodbs.com wrote:


  I had a quick play to see what might be involved [attached], and would
 like to
  hear people's thoughts; good idea, bad idea, not like that! etc
 
  I question the usefulness of allowing listen/notify to be restricted to
  an entire class of users.  The granularity of this seems too broad,
  though I am not sure if allowing message to be sent to a specific user
  is easily achievable.

 Well, if we're going to have privs on LISTEN/NOTIFY at all, they should
 be on specific message bands, i.e.:

 REVOKE LISTEN ON 'cacheupdates' FROM PUBLIC;
 GRANT LISTEN ON 'cacheupdates' TO webuser;
 GRANT LISTEN ON ALL TO admin;

 I can certainly see wanting this, but it'd be a great deal more
 sophisticated than what Chris has proposed.

 --
 Josh Berkus
 PostgreSQL Experts Inc.
 http://pgexperts.com



Re: [HACKERS] in-catalog Extension Scripts and Control parameters (templates?)

2013-06-27 Thread Hitoshi Harada
On Thu, Jun 27, 2013 at 2:49 AM, Dimitri Fontaine dimi...@2ndquadrant.frwrote:

 Hi,

 Thanks a lot for your review!

 Some answers here, new version of the patch with fixes by tuesday.

 Hitoshi Harada umi.tan...@gmail.com writes:
  - create template ex2, create extension ex2, alter template ex2 rename to
  ex3, create extension ex3, drop template ex3;
  foo=# drop template for extension ex3 version '1.0';
  ERROR:  cannot drop unrecognized object 3179 16429 0 because other
 objects
  depend on it

 Well, if I'm following, you're trying to remove a non-existing object. I
 guess you would prefer a better error message, right?


Right.  unrecognized object x y z doesn't look good.


  - a template that is created in another template script does not appear
 to
  depend on the parent template.

 I don't think that should be automatically the case, even if I admit I
 didn't think about that case.


Really?  My understanding is that everything created under an extension
depends on the extension, which depends on the template.  Why is only the
template an exception?




-- 
Hitoshi Harada


Re: [HACKERS] PostgreSQL 9.3 latest dev snapshot

2013-06-27 Thread Magnus Hagander
On Tue, Jun 25, 2013 at 3:31 PM, Michael Paquier
michael.paqu...@gmail.com wrote:

 On 2013/06/25, at 22:23, Fujii Masao masao.fu...@gmail.com wrote:

 On Tue, Jun 25, 2013 at 6:33 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Tue, Jun 25, 2013 at 5:33 PM, Misa Simic misa.si...@gmail.com wrote:
 Hi,

 Where we can find latest snapshot for 9.3 version?

 We have taken latest snapshot from
 http://ftp.postgresql.org/pub/snapshot/dev/

 But it seems it is for 9.4 version...
 9.3 has moved to branch REL9_3_STABLE a couple of days ago.

 Yes. We can find the snapshot from REL9_3_STABLE git branch.
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=shortlog;h=refs/heads/REL9_3_STABLE
 Indeed, I completely forgot that you can download snapshots from 
 postgresql.org's git. Simply use that instead of the FTP server now as long 
 as 9.3 snapshots are not generated there.

In case somebody is still looking, snapshots are properly building for 9.3 now.

Those snapshots aren't identical to a download from git, as they've
gone through a make dist-prep or whatever it's called. But they're
pretty close.

However, if you're looking for a snapshot, please use the one on the
ftpsite. Generating those snapshots on the git server is slow and
expensive...


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Min value for port

2013-06-27 Thread Magnus Hagander
Is there a reason why we have set the min allowed value for port to 1,
not 1024? Given that you can't actually start postgres with a value of
< 1024, shouldn't the entry in pg_settings reference that as well?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-27 Thread MauMau

Hi, Alvaro san,

From: Alvaro Herrera alvhe...@2ndquadrant.com

MauMau escribió:
Yeah, I see that --- after removing that early exit, there are unwanted
messages.  And in fact there are some signals sent that weren't
previously sent.  Clearly we need something here: if we're in immediate
shutdown handler, don't signal anyone (because they have already been
signalled) and don't log any more messages; but the cleaning up of
postmaster's process list must still be carried out.

Would you please add that on top of the attached cleaned up version of
your patch?


I did this.  Please find attached the revised patch.  I modified 
HandleChildCrash().  I tested the immediate shutdown, and the child cleanup 
succeeded.


In addition, I added if condition at the end of the function.  This is to 
prevent resetting AbortStartTime every time one child terminates.


Regards
MauMau



reliable_immediate_shutdown-3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review: query result history in psql

2013-06-27 Thread Maciej Gajewski
Thank you for the review!


There were a few english/grammatical mistakes that I went ahead and fixed.


Thank you for that. Could you send me a patch-to-a-patch so I can
correct all the mistakes in the next release?


 Additionally, I think some of the string manipulation might be placed
 outside of the main ans.c file. I don't know if there's a better place for
 'EscapeForCopy' and 'GetEscapedLen'. Not really a big deal, just an
 organizational idea. I also changed 'EscapeForCopy' to 'EscapeAndCopy'. I
 think that better describes the functionality. 'EscapeForCopy' kind of
 implies that another function is needed to copy the string.



The 'EscapeForCopy' was meant to mean 'escape the string in the format required by
the COPY TEXT format', so 'copy' in the name refers to the escaping format,
not the action performed by the function.

They could be, indeed, placed in separate module. I'll do it.



 What does 'ans' stand for? I am not sure how it relates to the concept of
 a query history. It didn't stop my understanding of the code, but I don't
 know if a user will immediately know the meaning.


Some mathematical toolkits, like Matlab or Mathematica, automatically set a
variable called 'ans' (short for answer) containing the result of the
last operation. I was trying to emulate exactly this behaviour.



 Probably the biggest problem is that the query history list is missing a
 maximum size variable. I think this could be valuable for preventing users
 from shooting themselves in the foot. If the user is running large queries,
 they might accidentally store too much data. This probably somewhat of an
 edge-case but I believe it is worth considering. We could provide a
 sensible default limit (10 queries?) and also allow the user to change it.


I was considering such a behaviour. But since the feature is turned off by
default, I decided that whoever is using it is aware of the cost. Instead of
truncating the history automatically (which could lead to a nasty
surprise), I decided to equip the user with \ansclean, a command erasing
the history. I believe that it is better to let the user decide when the
history should be erased, instead of doing it automatically.


Finally, is it worth resetting the query history every time a user
 reconnects to the database? I can see how this might interrupt a user's
 workflow. If the user suddenly disconnects (network connection interrupted,
 etc) then they would lose their history. I think this is definitely up for
 debate. It would add more management overhead (psql options etc) and might
 just be unnecessary. However, with a sane limit to the size of the query
 history, I don't know if there would be too many drawbacks from a storage
 perspective.


The history is not erased. The history is always stored in the client's
memory. When a history item is used for the first time, a TEMPORARY table
is created in the database that stores the data server-side. When user
disconnects from the database, the session ends and all these tables are
dropped.
Tables names have to be removed from the history, so next time the item is
used, the table will be created and populated again.

I use the feature while switching often between databases, and it works
seamlessly. Actually, it's quite useful to move bits of data across
databases:
Connect to database A, run a query, connect to database B, run another
query joining local data with the results of the previous query.
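
For anyone trying to picture the mechanism, here is a rough sketch of what ends
up happening server-side (the table names and data here are invented; the real
patch generates the temporary table itself and fills it from the result kept in
client memory):

    -- first use of a history item in the current session
    CREATE TEMPORARY TABLE ans1 (id int, name text);
    INSERT INTO ans1 VALUES (1, 'foo'), (2, 'bar');

    -- later statements can join local data against it
    SELECT l.*
    FROM local_table l
    JOIN ans1 USING (id);

When the session ends, the temporary table disappears with it, which is why the
stored table name has to be forgotten and the table recreated on next use.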


 Those issues aside - I think it's a great feature! I can add the
 grammatical fixes I made whenever the final patch is ready. Or earlier,
 whatever works for you. Also, this is my first time reviewing a patch, so
 please let me know if I can improve on anything. Thanks!


This is my first submitted patch, so I can't really comment on the
process. But if you could add the author's email to CC, the message would
be much easier to spot. I replied after two days only because I missed the
message in the flood of other pgsql-hackers messages. I think I need to scan
the list more carefully...

Maciej


Re: [HACKERS] Reduce maximum error in tuples estimation after vacuum.

2013-06-27 Thread Amit Kapila
On Wednesday, June 26, 2013 7:40 AM Kyotaro HORIGUCHI wrote:
 I've recovered from messing up.
 
 snip
  Please let me have a bit of time to diagnose this.
 
 I was completely messed up and heading down the wrong path. I looked into
 the vacuum for UPDATEs, not DELETEs, so it's quite reasonable to have
 such results.
 
 The renewed test script attached shows the verbose output of vacuum
 after the deletes. I had following output from it.
 
 # I belive this runs for you..
 
 | INFO: t: found 98 removable, 110 nonremovable row
 |   versions in 6308 out of 10829 pages
 
 In such a partially-scanned case, lazy_scan_heap() tries to estimate the
 resulting num_tuples in vac_estimate_reltuples() assuming the
 uniformity of tuple density, which fails for such a strong imbalance
 made by bulk updates.
 
 Do you find any differences between what you will have and the
 following I had?

I could see the same output with your latest script; I could also reproduce
the test if I ran it with individual SQL statements.
One of the main points for reproducing the individual test was to keep autovacuum
= off.

Now I can look into it further. I have still not gone through your new
approach to calculating reltuples in detail, but I am wondering
whether there is any way the estimates can be improved with a
different calculation in vac_estimate_reltuples().

One thing I have observed that 2nd parameter is_analyze of
vac_estimate_reltuples() is currently not used.

I cannot work on it till early next week, so others are welcome to join
review.

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Marko Kreen
On Thu, Jun 27, 2013 at 11:28 AM, Dean Rasheed dean.a.rash...@gmail.com wrote:
 On 26 June 2013 21:46, Peter Eisentraut pete...@gmx.net wrote:
 On 6/26/13 4:04 PM, Dean Rasheed wrote:
 A quick google search reveals several people asking for something like
 this, and people recommending md5(string_agg(...)) or
 md5(string_agg(md5(...))) based solutions, which are doomed to failure
 on larger tables.

 The thread discussed several other options of checksumming tables that
 did not have the air of a cryptographic offering, as Noah put it.


 True but md5 has the advantage of being directly comparable with the
 output of Unix md5sum, which would be useful if you loaded data from
 external files and wanted to confirm that your import process didn't
 mangle it.

The problem with md5_agg() is that it's only useful in toy scenarios.

It's more useful to give people a script that does the same sum(hash(row))
on the dump file than to try to run MD5 on ordered rows.

Also, I don't think anybody actually cares about MD5(table-as-bytes); instead,
people want a way to check whether two tables, or a table and a dump, are the same.
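
A sketch of that order-independent style of check (the table names are
invented; hashtext() is an internal, undocumented hash function, and summing
hashes can collide, so treat this as a quick consistency check rather than a
cryptographic guarantee):

    -- run once against the original table and once against the restored copy
    SELECT count(*), sum(hashtext(t::text)) FROM original_table t;
    SELECT count(*), sum(hashtext(t::text)) FROM restored_table t;

Matching counts and sums strongly suggest - but do not prove - that the two
relations contain the same rows, and the result does not depend on row order.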

-- 
marko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] in-catalog Extension Scripts and Control parameters (templates?)

2013-06-27 Thread Dimitri Fontaine
Hitoshi Harada umi.tan...@gmail.com writes:
  - a template that is created in another template script does not appear
 to
  depend on the parent template.

 I don't think that should be automatically the case, even if I admit I
 didn't think about that case.

 Really?  My understanding is everything that is created under extension
 depends on the extension, which depends on the template.  Why only template
 is exception?

Oh sorry, I understood you meant at CREATE TEMPLATE FOR EXTENSION time
rather than at CREATE EXTENSION time, and that you were refering to
dependency as in the require control parameter.

The pg_depend entry against the extension should be there, will fix.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [9.3 doc fix] ECPG VAR command does not describe the actual specification

2013-06-27 Thread MauMau

Hello,

Looking around the 9.3 doc, I found a small, but not-insignificant error in 
the documentation.



http://www.postgresql.org/docs/9.3/static/ecpg-sql-var.html

According to the description,

EXEC SQL VAR a IS int;

is equivalent to:

Exec sql begin declare section;
int a;
exec sql end declare section;

However, ecpg var.pgc produces the error:

var.pgc:1: ERROR: variable a is not declared

On the other hand, the following code can be preprocessed correctly, 
treating the host variable a as an int type.


--
Exec sql begin declare section;
short a;
exec sql end declare section;
EXEC SQL VAR a IS int;

EXEC SQL SELECT c INTO :a FROM mytable;
--


The attached patch fixes this documentation error.  This is not specific to 
9.3, so it needs to be backported.  Could you commit this?


Regards
MauMau


ecpg_var_doc.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Min value for port

2013-06-27 Thread Peter Eisentraut
On 6/27/13 6:34 AM, Magnus Hagander wrote:
 Is there a reason why we have set the min allowed value for port to 1,
 not 1024? Given that you can't actually start postgres with a value of
 < 1024, shouldn't the entry in pg_settings reference that as well?

Are you thinking of the restriction that you need to be root to use
ports < 1024?  That restriction is not necessarily universal.  We can let
the kernel tell us at run time if it doesn't like our port.




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Robert Haas
On Wed, Jun 26, 2013 at 8:09 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Configuration Details
 O/S - Suse-11
 RAM - 128GB
 Number of Cores - 16
 Server Conf - checkpoint_segments = 300; checkpoint_timeout = 15 min,
 synchronous_commit = OFF, shared_buffers = 14GB, AutoVacuum = off
 Pgbench - Select-only, Scale factor - 1200, Time - 30 mins

          8C-8T    16C-16T    32C-32T    64C-64T
 Head     62403     101810      99516      94707
 Patch    62827     101404      99109      94744

 On 128GB RAM, if we use scale factor 1200 (database approx. 17GB) and 14GB shared
 buffers, there is no major difference.
 One of the reasons could be that there is not much swapping in shared buffers,
 as most data already fits in shared buffers.

I'd like to just back up a minute here and talk about the broader
picture here.  What are we trying to accomplish with this patch?  Last
year, I did some benchmarking on a big IBM POWER7 machine (16 cores,
64 hardware threads).  Here are the results:

http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html

Now, if you look at these results, you see something interesting.
When there aren't too many concurrent connections, the higher scale
factors are only modestly slower than the lower scale factors.  But as
the number of connections increases, the performance continues to rise
at the lower scale factors, and at the higher scale factors, this
performance stops rising and in fact drops off.  So in other words,
there's no huge *performance* problem for a working set larger than
shared_buffers, but there is a huge *scalability* problem.  Now why is
that?

As far as I can tell, the answer is that we've got a scalability
problem around BufFreelistLock.  Contention on the buffer mapping
locks may also be a problem, but all of my previous benchmarking (with
LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
in the room.  My interest in having the background writer add buffers
to the free list is basically around solving that problem.  It's a
pretty dramatic problem, as the graph above shows, and this patch
doesn't solve it.  There may be corner cases where this patch improves
things (or, equally, makes them worse) but as a general point, the
difficulty I've had reproducing your test results and the specificity
of your instructions for reproducing them suggests to me that what we
have here is not a clear improvement on general workloads.  Yet such
an improvement should exist, because there are other products in the
world that have scalable buffer managers; we currently don't.  Instead
of spending a lot of time trying to figure out whether there's a small
win in narrow cases here (and there may well be), I think we should
back up and ask why this isn't a great big win, and what we'd need to
do to *get* a great big win.  I don't see much point in tinkering
around the edges here if things are broken in the middle; things that
seem like small wins or losses now may turn out otherwise in the face
of a more comprehensive solution.

One thing that occurred to me while writing this note is that the
background writer doesn't have any compelling reason to run on a
read-only workload.  It will still run at a certain minimum rate, so
that it cycles the buffer pool every 2 minutes, if I remember
correctly.  But it won't run anywhere near fast enough to keep up with
the buffer allocation demands of 8, or 32, or 64 sessions all reading
data not all of which is in shared_buffers at top speed.  In fact,
we've had reports that the background writer isn't too effective even
on read-write workloads.  The point is - if the background writer
isn't waking up and running frequently enough, what it does when it
does wake up isn't going to matter very much.  I think we need to
spend some energy poking at that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Developer meeting photos

2013-06-27 Thread Robert Haas
On Wed, Jun 26, 2013 at 10:22 AM, Alexander Korotkov
aekorot...@gmail.com wrote:
 at the last developer meeting we missed Oleg Bartunov. So, it's not surprising
 that the photos are also missing.
 I remember that somebody took photos, but unfortunately it appears that I
 don't remember who.
 My employer, who sponsored my attendance at PGCon, wants to publish a post about
 it on the website. So, it would be very nice to have a photo of the developer
 meeting.
 Does anybody have it?

Didn't Kevin take some photos?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Developer meeting photos

2013-06-27 Thread Alexander Korotkov
On Thu, Jun 27, 2013 at 4:23 PM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Jun 26, 2013 at 10:22 AM, Alexander Korotkov
 aekorot...@gmail.com wrote:
  at the last developer meeting we missed Oleg Bartunov. So, it's not surprising
  that the photos are also missing.
  I remember that somebody took photos, but unfortunately it appears that I
  don't remember who.
  My employer, who sponsored my attendance at PGCon, wants to publish a post about
  it on the website. So, it would be very nice to have a photo of the developer
  meeting.
  Does anybody have it?

 Didn't Kevin take some photos?


I've received photos by private email. Thanks.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] XLogInsert scaling, revisited

2013-06-27 Thread Andres Freund
Hi,

On 2013-06-26 18:52:30 +0300, Heikki Linnakangas wrote:
 * Could you document the way slots prevent checkpoints from occurring
when XLogInsert rechecks for full page writes? I think it's correct -
 but not very obvious at a glance.

 There's this in the comment near the top of the file:

  * To update RedoRecPtr or fullPageWrites, one has to make sure that all
  * subsequent inserters see the new value. This is done by reserving all the
  * insertion slots before changing the value. XLogInsert reads RedoRecPtr
 and
  * fullPageWrites after grabbing a slot, so by holding all the slots
  * simultaneously, you can ensure that all subsequent inserts see the new
  * value.  Those fields change very seldom, so we prefer to be fast and
  * non-contended when they need to be read, and slow when they're changed.

 Does that explain it well enough? XLogInsert holds onto a slot while it
 rechecks for full page writes.

Yes. Earlier on that was explained in XLogInsert() - maybe point to the
documentation on top of the file? The file is too big to expect everyone
to reread the whole file in a new release...

 * The read of Insert-RedoRecPtr while rechecking whether it's out of
date now is unlocked, is that correct? And why?

 Same here, XLogInsert holds the slot while rechecking Insert-RedoRecptr.

I was wondering whether it's guaranteed that we don't read a cached
value, but I didn't think of the memory barriers due to the spinlock in
Release/AcquireSlot. I think I thought that release wouldn't acquire the
spinlock at all.

 * Have you measured whether it works to just keep as many slots as
max_backends requires around and not bothering with dynamically
allocating them to inserters?
That seems to require to keep actually waiting slots in a separate
list which very well might make that too expensive.
 
 Correctness wise the biggest problem for that probably is the exclusive
 acquisition of all slots that CreateCheckpoint() does...

 Hmm. It wouldn't be much different, each backend would still need to reserve
 its own dedicated slot, because it might be held by a backend that
 grabbed all the slots. Also, bgwriter and checkpointer need to flush the
 WAL, so they'd need slots too.

 More slots make WaitXLogInsertionsToFinish() more expensive, as it has to
 loop through all of them. IIRC some earlier pgbench tests I ran didn't show
 much difference in performance, whether there were 40 slots or 100, as long
 as there was enough of them. I can run some more tests with even more slots
 to see how it behaves.

In a very quick test I ran previously I did see the slot acquisition in
the profile when using short connections that all did some quick inserts
- which isn't surprising since they tend to all start with the same
slot so they will all fight for the first slots.

Maybe we could ameliorate that by initializing MySlotNo to
(MyProc->pgprocno % num_xloginsert_slots) when uninitialized? That won't
be completely fair since the number of procs won't usually be a multiple
of num_insert_slots, but I doubt that will be problematic.

 * What about using some sort of linked list of unused slots for
 WALInsertSlotAcquireOne? Every time we're done we put it at the *end*
of the list so it's less likely to have been grabbed by somebody else
so we can reuse it.
a) To grab a new one go to the head of the linked list spinlock it and
recheck whether it's still free. If not, restart. Otherwise, remove
from list.
b) To reuse a previously used slot
 
That way we only have to do the PGSemaphoreLock() dance if there
really aren't any free slots.

 That adds a spinlock acquisition / release into the critical path, to
 protect the linked list. I'm trying very hard to avoid serialization points
 like that.

Hm. We already have at least one spinlock in that path? Or are you
thinking of a global one protecting the linked list? If so, I think we
should be able to get away with locking individual slots.
IIRC if you never need to delete list elements you can even get away
with a lockless implementation.

 While profiling the tests I've done, finding a free slot hasn't shown up
 much. So I don't think it's a problem performance-wise as it is, and I don't
 think it would make the code simpler.

It sure wouldn't make it simpler. As I said above, I saw the slot
acquisition in a profile when using -C and a short pgbench script (that
just inserts 10 rows).

 * The queuing logic around slots seems to lack documentation. It's
complex enough to warrant that imo.

 Yeah, it's complex. I expanded the comments above XLogInsertSlot, to explain
 how xlogInsertingAt works. Did that help, or was it some other part of that
 that needs more docs?

What I don't understand so far is why we don't reset xlogInsertingAt
during SlotReleaseOne. There are places documenting that we don't do so,
but not why unless I missed it.

Do we do this only to have some plausible value for a backend that been
acquired but 

Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Andres Freund
On 2013-06-27 08:23:31 -0400, Robert Haas wrote:
 I'd like to just back up a minute here and talk about the broader
 picture here.

Sounds like a very good plan.

 So in other words,
 there's no huge *performance* problem for a working set larger than
 shared_buffers, but there is a huge *scalability* problem.  Now why is
 that?

 As far as I can tell, the answer is that we've got a scalability
 problem around BufFreelistLock.

Part of the problem is its name ;)

 Contention on the buffer mapping
 locks may also be a problem, but all of my previous benchmarking (with
 LWLOCK_STATS) suggests that BufFreelistLock is, by far, the elephant
 in the room.

Contention-wise I agree. What I have seen is that we have a huge
amount of cacheline bouncing around the buffer header spinlocks.

 My interest in having the background writer add buffers
 to the free list is basically around solving that problem.  It's a
 pretty dramatic problem, as the graph above shows, and this patch
 doesn't solve it.

 One thing that occurred to me while writing this note is that the
 background writer doesn't have any compelling reason to run on a
 read-only workload.  It will still run at a certain minimum rate, so
 that it cycles the buffer pool every 2 minutes, if I remember
 correctly.

I have previously added some ad-hoc instrumentation that printed the
number of buffers that were required (by other backends) during a
bgwriter cycle and the number of buffers that the buffer manager could
actually write out. I don't think I actually found any workload where
the bgwriter actually wrote out a relevant percentage of the necessary
pages.
Which would explain why the patch doesn't have a big benefit. The
freelist is empty most of the time, so we don't benefit from the reduced
work done under the lock.

I think the whole algorithm that guides how much the background writer
actually does, including its pacing/sleeping logic, needs to be
rewritten from scratch before we are actually able to measure the
benefit from this patch. I personally don't think there's much to
salvage from the current code.

Problems with the current code:

* doesn't manipulate the usage_count and never does anything to used
  pages. Which means it will just about never find a victim buffer in a
  busy database.
* by far not aggressive enough, touches only a few buffers ahead of the
  clock sweep.
* does not advance the clock sweep, so the individual backends will
  touch the same buffers again and transfer all the buffer spinlock
  cacheline around
* The adaption logic it has makes it so slow to adapt that it takes
  several minutes to adapt.
* ...


There's another thing we could do to noticeably improve scalability of
buffer acquiration. Currently we do a huge amount of work under the
freelist lock.
In StrategyGetBuffer:
LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
...
// check freelist, will usually be empty
...
for (;;)
{
    buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

    ++StrategyControl->nextVictimBuffer;

    LockBufHdr(buf);
    if (buf->refcount == 0)
    {
        if (buf->usage_count > 0)
        {
            buf->usage_count--;
        }
        else
        {
            /* Found a usable buffer */
            if (strategy != NULL)
                AddBufferToRing(strategy, buf);
            return buf;
        }
    }
    UnlockBufHdr(buf);
}

So, we perform the entire clock sweep until we find a single buffer we
can use, inside a *global* lock. At times we need to iterate over the
whole of shared buffers BM_MAX_USAGE_COUNT (5) times till we have pushed down all
the usage counts enough (if the database is busy it can take even
longer...).
In a busy database where usually all the usage counts are high, the next
backend will touch a lot of those buffers again, which causes massive
cache eviction & bouncing.

It seems far more sensible to only protect the clock sweep's
nextVictimBuffer with a spinlock. With some care all the rest can happen
without any global interlock.

I think even after fixing this - which we definitely should do - having
a sensible/more aggressive bgwriter moving pages onto the freelist makes
sense because then backends don't need to deal with dirty pages.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] add long options to pgbench (submission 1)

2013-06-27 Thread Robert Haas
On Tue, Jun 25, 2013 at 3:09 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 I think --quiet-log should be spelled --quiet.

 ISTM that --quiet usually means not verbose on stdout, so I added log
 because this was specific to the log output, and that there may be a need
 for a --quiet option for stdout at some time.

The output that is quieted by -q is not the log output produced by
--log; it's the regular progress output on stdout/stderr.

So I changed that, and committed this, with some further cosmetic
changes.  I made the formatting of the help message more like psql's
help message, including adjusting pgbench to start the description of
each option in the same column that psql does.  This got rid of a lot
of line breaks and IMHO makes the output of pgbench --help quite a bit
more readable.  I made stylistic adjustments to the documentation
portion of the patch as well, again to match the markup used for psql.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Markus Wanner
On 06/27/2013 11:12 AM, Nicolas Barbier wrote:
 Imagine that there are a lot of indexes, e.g., 50. Although a lookup
 (walking one index) is equally fast, an insertion must update all 50
 indexes. When each index requires one extra I/O (because each index is
 one level taller), that is 50 extra I/Os. In the partitioned case,
 each index would require the normal smaller amount of I/Os. Choosing
 which partition to use must only be done once: The result “counts” for
 all indexes that are to be updated.

I think you're underestimating the cost of partitioning. After all, the
lookup of which index to update for a given partition is a lookup in
pg_index via pg_index_indrelid_index - a btree index.

Additionally, the depth of an index doesn't directly translate to the
number of I/O writes per insert (or delete). I'd rather expect the avg.
number of I/O writes per insert into a b-tree to be reasonably close to
one - depending mostly on the number of keys per page, not depth.

 Additionally: Imagine that the data can be partitioned along some
 column that makes sense for performance reasons (e.g., some “date”
 where most accesses are concentrated on rows with more recent dates).
 The other indexes will probably not have such a performance
 distribution. Using those other indexes (both for look-ups and
 updates) in the non-partitioned case, will therefore pull a huge
 portion of each index into cache (because of the “random distribution”
 of the non-date data). In the partitioned case, more cache can be
 spent on the indexes that correspond to the “hot partitions.”

That's a valid point, yes. I'd call this index partitioning. And with
partial indices, Postgres already has something that gets pretty close,
I think. Though, I don't consider this to be related to how the tuples
of the relation are laid out on disk.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extensible external toast tuple support snappy prototype

2013-06-27 Thread Robert Haas
On Wed, Jun 19, 2013 at 3:27 AM, Andres Freund and...@2ndquadrant.com wrote:
 There will be a newer version of the patch coming today or tomorrow, so
 there's probably no point in looking at the one linked above before
 that...

This patch is marked as Ready for Committer in the CommitFest app.
But it is not clear to me where the patch is that is being proposed
for commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Min value for port

2013-06-27 Thread Magnus Hagander
On Thu, Jun 27, 2013 at 2:16 PM, Peter Eisentraut pete...@gmx.net wrote:
 On 6/27/13 6:34 AM, Magnus Hagander wrote:
  Is there a reason why we have set the min allowed value for port to 1,
  not 1024? Given that you can't actually start postgres with a value of
  < 1024, shouldn't the entry in pg_settings reference that as well?

 Are you thinking of the restriction that you need to be root to use
 ports < 1024?  That restriction is not necessarily universal.  We can let
 the kernel tell us at run time if it doesn't like our port.

Yes, that's the restriction I was talking about. It's just a bit
annoying that if you look at pg_settings.min_value it doesn't actually
tell you the truth. But yeah, I believe Windows actually lets you use
a lower port number, so it'd at least have to be #ifdef'ed for that if
we wanted to change it.


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Min value for port

2013-06-27 Thread Jan Urbański

On 27/06/13 15:11, Magnus Hagander wrote:

On Thu, Jun 27, 2013 at 2:16 PM, Peter Eisentraut pete...@gmx.net wrote:

On 6/27/13 6:34 AM, Magnus Hagander wrote:

 Is there a reason why we have set the min allowed value for port to 1,
 not 1024? Given that you can't actually start postgres with a value of
 < 1024, shouldn't the entry in pg_settings reference that as well?


 Are you thinking of the restriction that you need to be root to use
 ports < 1024?  That restriction is not necessarily universal.  We can let
 the kernel tell us at run time if it doesn't like our port.


Yes, that's the restriction I was talking about. It's just a bit
annoying that if you look at pg_settings.min_value it doesn't actually
tell you the truth. But yeah, I believe Windows actually lets you use
a lower port number, so it'd at least have to be #ifdef'ed for that if
we wanted to change it.


There's also authbind and CAP_NET_BIND_SERVICE.

Jan


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Min value for port

2013-06-27 Thread Andres Freund
On 2013-06-27 15:11:26 +0200, Magnus Hagander wrote:
 On Thu, Jun 27, 2013 at 2:16 PM, Peter Eisentraut pete...@gmx.net wrote:
  On 6/27/13 6:34 AM, Magnus Hagander wrote:
   Is there a reason why we have set the min allowed value for port to 1,
   not 1024? Given that you can't actually start postgres with a value of
   < 1024, shouldn't the entry in pg_settings reference that as well?

  Are you thinking of the restriction that you need to be root to use
  ports < 1024?  That restriction is not necessarily universal.  We can let
  the kernel tell us at run time if it doesn't like our port.
 
 Yes, that's the restriction I was talking about. It's just a bit
 annoying that if you look at pg_settings.min_value it doesn't actually
 tell you the truth. But yeah, I believe Windows actually lets you use
 a lower port number, so it'd at least have to be #ifdef'ed for that if
 we wanted to change it.

You can easily change the setting on linux as well. And you can grant
specific binaries the permission to bind to restricted ports without
being root.
I don't think the additional complexity to get a sensible value in there
is warranted.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn

2013-06-27 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-06-26 20:07:40 -0400, Tom Lane wrote:
 However, some more trolling of the intertubes suggests that Cygwin's
 emulation of socket() does indeed return EINPROGRESS; see for instance
 this ancient thread of ours:
 http://www.postgresql.org/message-id/flat/14855.49635.565990.716...@kryten.bedford.waii.com#14855.49635.565990.716...@kryten.bedford.waii.com
 Unless we want to distinguish Cygwin from native Windows in this code
 chunk, maybe we'd better leave well enough alone.

 After some looking it gets funnier. There currently doesn't seem to be
 any chance that we actually can get an EINPROGRESS at that level on
 windows. We #define connect() to pgwin32_connect() which fudges errno
 around. Where WSAEINPROGRESS is mapped to EINVAL.

That's only in the backend though: note the #define is controlled by
#ifndef FRONTEND, and we don't link backend/port/win32/socket.c into
libpq anyway.  The signal managing it's doing wouldn't work at all in
client-side programs.

But yeah, on the backend side we would definitely treat WSAEINPROGRESS
as a hard error.  OTOH, we don't use nonblocking connect (much?) in the
backend so I'm not sure whether that's a directly comparable case or
not.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Robert Haas
On Thu, Jun 27, 2013 at 9:01 AM, Andres Freund and...@2ndquadrant.com wrote:
 Contention-wise I agree. What I have seen is that we have a huge
 amount of cacheline bouncing around the buffer header spinlocks.

How did you measure that?

 I have previously added some adhoc instrumentation that printed the
 amount of buffers that were required (by other backends) during a
 bgwriter cycle and the amount of buffers that the buffer manager could
 actually write out.

I think you can see how many are needed from buffers_alloc.  No?

 I don't think I actually found any workload where
 the bgwriter actually wroute out a relevant percentage of the necessary
 pages.

Check.

 Problems with the current code:

 * doesn't manipulate the usage_count and never does anything to used
   pages. Which means it will just about never find a victim buffer in a
   busy database.

Right.  I was thinking that was part of this patch, but it isn't.  I
think we should definitely add that.  In other words, the background
writer's job should be to run the clock sweep and add buffers to the
free list.  I think we should also split the lock: a spinlock for the
freelist, and an lwlock for the clock sweep.

 * by far not aggressive enough, touches only a few buffers ahead of the
   clock sweep.

Check.  Fixing this might be a separate patch, but then again maybe
not.  The changes we're talking about here provide a natural feedback
mechanism: if we observe that the freelist is empty (or less than some
length, like 32 buffers?) set the background writer's latch, because
we know it's not keeping up.

 * does not advance the clock sweep, so the individual backends will
   touch the same buffers again and transfer all the buffer spinlock
   cacheline around

Yes, I think that should be fixed as part of this patch too.  It's
obviously connected to the point about usage counts.

 * The adaption logic it has makes it so slow to adapt that it takes
   several minutes to adapt.

Yeah.  I don't know if fixing that will fall naturally out of these
other changes or not, but I think it's a second-order concern in any
event.

 There's another thing we could do to noticeably improve scalability of
 buffer acquiration. Currently we do a huge amount of work under the
 freelist lock.
 In StrategyGetBuffer:
 LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
 ...
 // check freelist, will usually be empty
 ...
 for (;;)
 {
     buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];

     ++StrategyControl->nextVictimBuffer;

     LockBufHdr(buf);
     if (buf->refcount == 0)
     {
         if (buf->usage_count > 0)
         {
             buf->usage_count--;
         }
         else
         {
             /* Found a usable buffer */
             if (strategy != NULL)
                 AddBufferToRing(strategy, buf);
             return buf;
         }
     }
     UnlockBufHdr(buf);
 }

 So, we perform the entire clock sweep until we find a single buffer we
 can use, inside a *global* lock. At times we need to iterate over the
 whole of shared buffers BM_MAX_USAGE_COUNT (5) times till we have pushed down all
 the usage counts enough (if the database is busy it can take even
 longer...).
 In a busy database where usually all the usage counts are high, the next
 backend will touch a lot of those buffers again, which causes massive
 cache eviction & bouncing.

 It seems far more sensible to only protect the clock sweep's
 nextVictimBuffer with a spinlock. With some care all the rest can happen
 without any global interlock.

That's a lot more spinlock acquire/release cycles, but it might work
out to a win anyway.  Or it might lead to the system suffering a
horrible spinlock-induced death spiral on eviction-heavy workloads.

 I think even after fixing this - which we definitely should do - having
 a sensible/more aggressive bgwriter moving pages onto the freelist makes
 sense because then backends don't need to deal with dirty pages.

Or scanning to find evictable pages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]

2013-06-27 Thread David Fetter
On Thu, Jun 27, 2013 at 08:41:59AM +, Andrew Gierth wrote:
 Tom Lane said:
  Agreed, separating out the function-call-with-trailing-declaration
  syntaxes so they aren't considered in FROM and index_elem seems
  like the best compromise.
 
  If we do that for window function OVER clauses as well, can we
  make OVER less reserved?
 
 Yes.
 
 At least, I tried it with both OVER and FILTER unreserved and there
 were no grammar conflicts (and I didn't have to do anything fancy to
 avoid them), and it passed regression with the exception of the
 changed error message for window functions in the from-clause.
 
 So is this the final decision on how to proceed? It seems good to
 me, and I can work with David to get it done.

If this is really the direction people want to go, I'm in.  Is there
some code I can look at?

I still submit that having our reserved word ducks in a row in advance
is a saner way to go about this, and will work up a patch for that as
I have time.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add more regression tests for CREATE OPERATOR

2013-06-27 Thread Robert Haas
On Wed, Jun 26, 2013 at 3:29 AM, Szymon Guz mabew...@gmail.com wrote:
 OK, so I think this patch can be committed, I will change the status.

We have a convention that roles created by the regression tests need
to have regress or something of the sort in the name, and that they
need to be dropped by the regression tests.  The idea is that if
someone runs make installcheck against an installed server, it
should pass - even if you run it twice in succession.  And also, it
shouldn't be likely to try to create (and then drop!) a role name that
already exists.
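
A minimal sketch of that convention (the names here are invented; the point is
the regress-prefixed role name and the explicit cleanup):

    CREATE ROLE regress_createop_user;   -- unlikely to clash with an existing role
    -- ... CREATE OPERATOR tests exercising this role ...
    DROP ROLE regress_createop_user;     -- so make installcheck can be run twice

That way the tests neither trample on pre-existing roles nor leave anything
behind in the installed database.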

Setting this to Waiting on Author.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-27 Thread Andres Freund
On 2013-06-27 09:51:07 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-06-26 21:18:49 -0700, Peter Geoghegan wrote:
  Heroku are interested in online verification of basebackups (i.e.
  using checksums to verify the integrity of heap files as they are
  backed up, with a view to relying less and less on logical backups).
 
  Why not do this from a function/background worker in the backend where
  you can go via the buffer manager to avoid torn pages et al. If you use
  a buffer strategy the cache poisoning et al should be controllable.
 
 That would require having a functioning postmaster environment, which
 seems like it would make such a tool much less flexible.

I personally wouldn't trust online backups that aren't proven to be
replayable into a runnable state very much. I have seen too many cases
where that failed.

Maybe the trick is to add a recovery.conf option to make postgres replay
to the first restartpoint and then shut down. At that point you can be
sure there aren't any torn pages anymore (bugs aside).
In fact that sounds like a rather useful pg_basebackup extension...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)

2013-06-27 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-06-26 21:18:49 -0700, Peter Geoghegan wrote:
 Heroku are interested in online verification of basebackups (i.e.
 using checksums to verify the integrity of heap files as they are
 backed up, with a view to relying less and less on logical backups).

 Why not do this from a function/background worker in the backend where
 you can go via the buffer manager to avoid torn pages et al. If you use
 a buffer strategy the cache poisoning et al should be controllable.

That would require having a functioning postmaster environment, which
seems like it would make such a tool much less flexible.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-27 Thread Sawada Masahiko
On Mon, Jun 24, 2013 at 10:47 PM, Sawada Masahiko sawada.m...@gmail.com wrote:
 1. synchronous standby and make same as failback safe standby
 2. asynchronous standby and make same as failback safe standby

 in the above cases, adding a new parameter might be meaningless. But I think
 that we should handle not only cases 1 and 2 but also the following cases
 3 and 4 for DR.

To support cases 1 and 2, I'm thinking of the following two additional ideas.


We add synchronous_transfer (commit / data_flush / all). This GUC will
only affect the standbys mentioned in the list of
synchronous_standby_names.

1. If synchronous_transfer is set to commit, current synchronous
replication behavior is achieved
2. If synchronous_transfer is set to data_flush, the standbys named in
synchronous_standby_names will act as ASYNC failback safe standbys
3. If synchronous_transfer is set to all, the standbys named in
synchronous_standby_names will act as SYNC failback safe standbys

In this approach, 3 is confusing because we are actually setting up an
ASYNC standby by using the GUCs meant for sync standby setup.

-
We extend synchronous_commit so that it also accepts a value like 'all'
(this approach doesn't provide a 'synchronous_transfer' parameter).
The 'all' value means that the master waits not only for replicated WAL but
also for replicated data pages (e.g., CLOG, pg_control), and the master changes
its behavior depending on whether the standby is connected as sync or async.

1. If synchronous_commit is set to 'all' and synchronous_standby_names
is set to a standby name, the standbys named in
synchronous_standby_names will act as SYNC failback safe standbys.
2. If synchronous_commit is set to 'all' and synchronous_standby_names
is NOT set to a standby name, the standbys which are connected to the master
will act as ASYNC failback safe standbys.

One problem with not naming ASYNC standbys explicitly is that the
master has no clue which standby to wait on.
If it chooses to wait on all async standbys for failback-safety, that
can be quite detrimental, especially because async standbys can become
easily unreachable if they are on a slow link or at a remote location.

please give me feedback.

Regards,

---
Sawada Masahiko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]

2013-06-27 Thread Tom Lane
Andrew Gierth and...@tao11.riddles.org.uk writes:
 Tom Lane said:
 Agreed, separating out the function-call-with-trailing-declaration
 syntaxes so they aren't considered in FROM and index_elem seems like
 the best compromise.
 
 If we do that for window function OVER clauses as well, can we make
 OVER less reserved?

 Yes.

 At least, I tried it with both OVER and FILTER unreserved and there
 were no grammar conflicts (and I didn't have to do anything fancy to
 avoid them), and it passed regression with the exception of the
 changed error message for window functions in the from-clause.

 So is this the final decision on how to proceed? It seems good to me,
 and I can work with David to get it done.

Yeah, please submit a separate patch that just refactors the existing
grammar as above; that'll simplify reviewing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]

2013-06-27 Thread Tom Lane
Pavel Stehule pavel.steh...@gmail.com writes:
 Tom Lane said:
 If we do that for window function OVER clauses as well, can we make
 OVER less reserved?

 Isn't it dangerous to make OVER an unreserved keyword?

How so?  The worst-case scenario is that we find we have to make it more
reserved again in some future release, as a consequence of some new
randomness from the SQL committee.  That will just return us to the
status quo, in which anybody who uses OVER as a table/column name has
been broken since about 8.4.  Since we still hear of people using
releases as old as 7.2.x, I'm sure there are a few out there who would
still be helped if we could de-reserve OVER again.  (Not to mention
people migrating from other systems in which it's not a keyword.)

In any case, the general project policy has been to never make keywords
any more reserved than we absolutely have to.  If we didn't care about
this, we wouldn't be bothering with four separate categories of keywords.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add more regression tests for CREATE OPERATOR

2013-06-27 Thread Robins Tharakan
Sure Robert.
I'll update the tests and get back.

Two questions, while we're at it:

1. Any other conventions (for naming)?
2. Should I assume that all database objects that get created, need to be
dropped explicitly? Or is this point specifically about ROLES?

--
Robins Tharakan


On 27 June 2013 09:00, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Jun 26, 2013 at 3:29 AM, Szymon Guz mabew...@gmail.com wrote:
  OK, so I think this patch can be committed, I will change the status.

 We have a convention that roles created by the regression tests need
 to have regress or something of the sort in the name, and that they
 need to be dropped by the regression tests.  The idea is that if
 someone runs make installcheck against an installed server, it
 should pass - even if you run it twice in succession.  And also, it
 shouldn't be likely to try to create (and then drop!) a role name that
 already exists.

 Setting this to Waiting on Author.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



Re: [HACKERS] Add more regression tests for dbcommands

2013-06-27 Thread Robert Haas
On Wed, Jun 26, 2013 at 11:33 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Josh Berkus j...@agliodbs.com writes:
 On 06/26/2013 12:08 PM, Fabien COELHO wrote:
 I have been suggesting something upon that line in some of the reviews
 I've posted about Robins non regression tests, if they were to be
 rejected on the basis that they add a few seconds for checks. They are
 well made to test corner cases quite systematically, and I feel that it
 would be sad if they were lost.

 My thinking was that someone should add all of his new tests at once,
 and then see how much of a time difference they make.  If it's 7
 seconds, who cares?

 Making that measurement on the current set of tests doesn't seem to me
 to prove much.  I assume Robins' eventual goal is to make a significant
 improvement in the tests' code coverage across the entire backend, and
 what we see submitted now is just as much as he's been able to do yet
 in that line.  So even if the current cost is negligible, I don't think
 it'll stay that way.

Well, this is a discussion we've had before.  I'm certainly in favor
of having a bigger set of regression tests, so that people can just
run the small suite if they want to.  But I'm not keen on pushing
tests into that extended suite when they run in a fractions of a
second.  That's likely to create more hassle than it solves, since
people will forget to update the expected outputs if they don't run
those tests, and if they do run those tests, then having them in a
separate suite isn't saving anything.  The place for a separate test
suite, IMHO, is when you have tests that run for a long time
individually.

FWIW, internally to EDB, we have it broken down just about exactly
that way.  Our core set of regression tests for PPAS takes about 5
minutes to run on my laptop, and unfortunately quite a bit longer on
some older laptops.  It's a nuisance, but it's a manageable nuisance,
and it finds a lot of coding errors.  Then we have separate schedules
for test suites that contain long-running tests or have dependencies
on which compiler flags you use; most developers don't run that suite
regularly, but it gets run every night.  That keeps those annoyances
out of the foreground path of developers.  I really don't see why we
shouldn't take the same approach here.

I don't think we want as large a regression test suite for PostgreSQL
as we have for PPAS, because among other things, PostgreSQL doesn't
have a paid staff of people who can be roped into helping manage it
when needed.  But the management effort for the kind of expansion that
Robins Tharakan is proposing here is not going to be all that large,
at least if my EDB experience is any indication.  And I think it will
help find bugs.  I also think (and this is also something we've seen
internally) that part of the value of a regression suite is that it
exposes when you've changed the behavior.  The regression tests fail
and you have to change the expected outputs; fine.  But now it's much
less likely that you're going to make those kinds of changes without
*realizing* that you've changed the behavior.  I don't have a specific
example top of mind right now, but I am pretty sure there have been at
least a few cases where PostgreSQL changes have had knock-on
consequences that weren't obvious at the time they were made.  The
kind of work that Robins is doing here is not going to fix that
problem entirely, but I think it will help, and I think that has a lot
of value.

So I'd like to endorse Josh's idea: subject to appropriate review,
let's add these test cases.  Then, if it really turns out to be too
burdensome, we can take them out, or figure out a sensible way to
split the suite.  Pushing all of Robins' work into a secondary suite,
or throwing it out altogether, both seem to me to be things which will
not be to the long-term benefit of the project.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GIN improvements part 1: additional information

2013-06-27 Thread Antonin Houska

On 06/25/2013 12:03 AM, Alexander Korotkov wrote:



New revision of patch is attached. Now it includes some docs.




Hi,
I was curious about the new layout of the data page, so I spent a while 
looking into the code.

It's interesting, but I suspect 2 things are not o.k.:

* gindatapage.c:dataIsEnoughSpace() - 'i++' in the for loop should 
probably be 'j++', otherwise it loops forever


* gin_private.h:ginDataPageLeafRead() - fetch_att() is used to retrieve 
the additional info, but per the definition and comments in tupmacs.h it 
expects aligned pointer.


* gindatapage.c:ginCheckPlaceToDataPageLeaf() - the comment 'if leaf data
page' should probably be 'on a leaf data page' or so.


Regards,
Antonin Houska (Tony)


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] add long options to pgbench (submission 1)

2013-06-27 Thread Fujii Masao
On Thu, Jun 27, 2013 at 10:02 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 25, 2013 at 3:09 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 I think --quiet-log should be spelled --quiet.

 ISTM that --quiet usually means not verbose on stdout, so I added log
 because this was specific to the log output, and that there may be a need
 for a --quiet option for stdout at some time.

 The output that is quieted by -q is not the log output produced by
 --log; it's the regular progress output on stdout/stderr.

 So I changed that, and committed this, with some further cosmetic
 changes.  I made the formatting of the help message more like psql's
 help message, including adjusting pgbench to start the description of
 each option in the same column that psql does.  This got rid of a lot
 of line breaks and IMHO makes the output of pgbench --help quite a bit
 more readable.  I made stylistic adjustments to the documentation
 portion of the patch as well, again to match the markup used for psql.

In help messages:

+-s NUM, --scale NUM  scaling factor\n

This should be -s, --scale=NUM for the sake of consistency with other
options.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] refresh materialized view concurrently

2013-06-27 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:

 We can play cute tricks akin to what CREATE INDEX CONCURRENTLY currently
 does, i.e. wait for all other relations that could have possibly seen
 the old relfilenode (they must have at least a share lock on the
 relation) before dropping the actual storage.

 The reason we cannot currently do that in most scenarios is that we
 cannot perform transactional/mvcc updates of non-exclusively locked
 objects due to the SnapshotNow problems of seeing multiple or no
 versions of a row during a single scan.

Not only would that be slower than the submitted patch in cases
where only a few rows differ, but it could be waiting to swap in
that new heap for an unbounded amount of time.  I think the current
patch will play nicer with incremental maintenance than what you
suggest here.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Andres Freund
On 2013-06-27 09:50:32 -0400, Robert Haas wrote:
 On Thu, Jun 27, 2013 at 9:01 AM, Andres Freund and...@2ndquadrant.com wrote:
  Contention-wise I agree. What I have seen is that we have a huge
  amount of cacheline bouncing around the buffer header spinlocks.
 
 How did you measure that?

perf record -e cache-misses. If you want it more detailed looking at
{L1,LLC}-{load,store}{s,misses} can sometimes be helpful too.
Also, running perf stat -vvv postgres -D ... for a whole benchmark can
be useful to compare how much a change influences cache misses and such.

For very detailed analysis running something under valgrind/cachegrind
can be helpful too, but I usually find perf to be sufficient.
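
As a rough example of such a session (the PID, event list and pgbench
parameters are placeholders, not recommendations):

perf record -e cache-misses -p <backend_pid> -- sleep 30
perf report --stdio | head -30

perf stat -e cache-misses,LLC-load-misses,LLC-store-misses \
    pgbench -c 32 -j 32 -T 60 pgbench_db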

  I have previously added some adhoc instrumentation that printed the
  amount of buffers that were required (by other backends) during a
  bgwriter cycle and the amount of buffers that the buffer manager could
  actually write out.
 
 I think you can see how many are needed from buffers_alloc.  No?

Not easily correlated with bgwriter activity. If we cannot keep up
because it's 100% busy writing out buffers I don't have many problems
with that. But I don't think we often are.

  Problems with the current code:
 
  * doesn't manipulate the usage_count and never does anything to used
pages. Which means it will just about never find a victim buffer in a
busy database.
 
 Right.  I was thinking that was part of this patch, but it isn't.  I
 think we should definitely add that.  In other words, the background
 writer's job should be to run the clock sweep and add buffers to the
 free list.

We might need to split it into two for that. One process to writeout
dirty pages, one to populate the freelist.
Otherwise we will probably regularly hit the current scalability issues
because we're currently io contended. Say during a busy or even
immediate checkpoint.

  I think we should also split the lock: a spinlock for the
 freelist, and an lwlock for the clock sweep.

Yea, thought about that when writing the thing about the exclusive lock
during the clocksweep.

  * by far not aggressive enough, touches only a few buffers ahead of the
clock sweep.
 
 Check.  Fixing this might be a separate patch, but then again maybe
 not.  The changes we're talking about here provide a natural feedback
 mechanism: if we observe that the freelist is empty (or less than some
 length, like 32 buffers?) set the background writer's latch, because
 we know it's not keeping up.

Yes, that makes sense. Also provides adaptability to bursty workloads
which means we don't have too complex logic in the bgwriter for that.

  There's another thing we could do to noticeably improve scalability of
  buffer acquisition. Currently we do a huge amount of work under the
  freelist lock.
  ...
  So, we perform the entire clock sweep until we found a single buffer we
  can use inside a *global* lock. At times we need to iterate over the
  whole shared buffers BM_MAX_USAGE_COUNT (5) times till we pushed down all
  the usage counts enough (if the database is busy it can take even
  longer...).
  In a busy database where usually all the usagecounts are high the next
  backend will touch a lot of those buffers again which causes massive
  cache eviction  bouncing.
 
  It seems far more sensible to only protect the clock sweep's
  nextVictimBuffer with a spinlock. With some care all the rest can happen
  without any global interlock.
 
 That's a lot more spinlock acquire/release cycles, but it might work
 out to a win anyway.  Or it might lead to the system suffering a
 horrible spinlock-induced death spiral on eviction-heavy workloads.

I can't imagine it to be worse that what we have today. Also, nobody
requires us to only advance the clocksweep by one page, we can easily do
it say 29 pages at a time or so if we detect the lock is contended.

Alternatively it shouldn't be too hard to make it into an atomic
increment, although that requires some trickery to handle the wraparound
sanely.
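
A sketch of that idea (not actual PostgreSQL code; the names and the folding
strategy are made up for illustration):

#include <stdint.h>

/*
 * Advance the clock sweep hand with an atomic fetch-and-add instead of a
 * global lock.  The wraparound trickery: the raw counter has to be folded
 * back into [0, nbuffers) now and then, or the victim sequence jumps when
 * the counter eventually overflows.
 */
static inline uint32_t
clock_sweep_next(volatile uint32_t *hand, uint32_t nbuffers)
{
    uint32_t raw = __sync_fetch_and_add(hand, 1);
    uint32_t victim = raw % nbuffers;

    if (raw >= nbuffers)
    {
        /* Try to fold the counter back into range; only one of any racing
         * backends succeeds, which is enough. */
        __sync_bool_compare_and_swap(hand, raw + 1, victim + 1);
    }
    return victim;
}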

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Spin Lock sleep resolution

2013-06-27 Thread Robert Haas
On Wed, Jun 26, 2013 at 5:49 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Tue, Jun 18, 2013 at 12:09 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 Jeff's patch seems to somewhat alleviate the huge fall in performance I'm
 otherwise seeing without the nonlocked-test patch. With the nonlocked-test
 patch, if you squint you can see a miniscule benefit.

 I wasn't expecting much of a gain from this, just wanted to verify that
 it's not making things worse. So looks good to me.

 Hi Heikki,

 Thanks for trying out the patch.

 I see in the commitfest app it is set to Waiting on Author (but I don't
 know who maiku41 is).

 Based on the comments so far, I don't know what I should be doing on it at
 the moment, and I thought perhaps your comment above meant it should be
 ready for committer.

 If we think the patch has a risk of introducing subtle errors, then it
 probably can't be justified based on the small performance gains you saw.

 But if we think it has little risk, then I think it is justified simply
 based on cleaner code, and less confusion for people who are tracing a
 problematic process.  If we want it to do a random escalation, it should
 probably look like a random escalation to the interested observer.

I think it has little risk.  If it turns out to be worse for
performance, we can always revert it, but I expect it'll be better or
a wash, and the results so far seem to bear that out.  Another
interesting question is whether we should fool with the actual values
for minimum and maximum delays, but that's a separate and much harder
question, so I think we should just do this for now and call it good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] add long options to pgbench (submission 1)

2013-06-27 Thread Robert Haas
On Thu, Jun 27, 2013 at 10:24 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Jun 27, 2013 at 10:02 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 25, 2013 at 3:09 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 I think --quiet-log should be spelled --quiet.

 ISTM that --quiet usually means not verbose on stdout, so I added log
 because this was specific to the log output, and that there may be a need
 for a --quiet option for stdout at some time.

 The output that is quieted by -q is not the log output produced by
 --log; it's the regular progress output on stdout/stderr.

 So I changed that, and committed this, with some further cosmetic
 changes.  I made the formatting of the help message more like psql's
 help message, including adjusting pgbench to start the description of
 each option in the same column that psql does.  This got rid of a lot
 of line breaks and IMHO makes the output of pgbench --help quite a bit
 more readable.  I made stylistic adjustments to the documentation
 portion of the patch as well, again to match the markup used for psql.

 In help messages:

 +-s NUM, --scale NUM  scaling factor\n

 This should be -s, --scale=NUM for the sake of consistency with other
 options.

Woops, missed that one.  Fixed, thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-27 Thread Robert Haas
On Tue, Jun 25, 2013 at 2:27 PM, Andrew Dunstan and...@dunslane.net wrote:
 I'd like to see prizes each release for best contribution and best
 reviewer - I've thought for years something like this would be worth
 trying. Committers and core members should not be eligible - this is about
 encouraging new people.

Encouraging new people is good, but recognizing sustained, long-term
contributions is good, too.  I think we should do more of that, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: FILTER for aggregates [was Re: [HACKERS] Department of Redundancy Department: makeNode(FuncCall) division]

2013-06-27 Thread Peter Eisentraut
On 6/23/13 10:50 PM, Tom Lane wrote:
 It'd sure be interesting to know what the SQL committee's target parsing
 algorithm is.

It's whatever Oracle and IBM implement.

 Or maybe they really don't give a damn about breaking
 applications every time they invent a new reserved word?

Well, yes, I think that policy was built into the language at the very
beginning.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] refresh materialized view concurrently

2013-06-27 Thread Kevin Grittner
Hitoshi Harada umi.tan...@gmail.com wrote:
 Kevin Grittner kgri...@ymail.com wrote:
 Hitoshi Harada umi.tan...@gmail.com wrote:

 As far as I can tell, the overall approach is as follows.

 - create a new temp heap as non-concurrent does, but with
 ExclusiveLock on the matview, so that reader wouldn't be
 blocked

 Non-concurrent creates the heap in the matview's tablespace and
 namespace, so the temp part is different in concurrent
 generation.  This difference is why concurrent can be faster
 when few rows change.

 It's still not clear to me why you need temp in concurrent and
 not in non-concurrent.

Well, temp tables can be in an entirely different tablespace, so
you can't just move the heap of a temp table into place as the new
heap for the matview (as we do for a non-concurrent REFRESH) -- at
least not without potentially copying everything an extra time.

For concurrent we are modifying the existing matview heap, and the
freshly generated set of values, as well as the diff table, are
just needed, well, temporarily.  That makes the behavior of temp
tables ideal.  Not only are they not logged, they are potentially
placed on faster tablespaces, and the data written to them might
never be written to disk:

http://www.postgresql.org/docs/9.2/interactive/runtime-config-resource.html#GUC-TEMP-BUFFERS

 If this type of operations is always creating temp table and
 just swap it with existing one, why can't we just make it temp
 always?

Because of the potentially different tablespaces.

 And if the performance is the only concern, is temp better than
 just turning off WAL for the table or UNLOGGED table?

Yes, it is.

 - this operation requires unique index for performance reason
 (or correctness reason too)

 It is primarily for correctness in the face of duplicate rows
 which have no nulls.  Do you think the reasons need to be better
 documented with comments?

 Ah, yes, even after looking at patch I was confused if it was for
 performance or correctness.

This is the part of the function's comment which attempts to
explain the problem.

 * This join cannot work if there are any
 * duplicated rows in either the old or new versions, in the sense that every
 * column would compare as equal between the two rows.  It does work correctly
 * in the face of rows which have at least one NULL value, with all non-NULL
 * columns equal.  The behavior of NULLs on equality tests and on UNIQUE
 * indexes turns out to be quite convenient here; the tests we need to make
 * are consistent with default behavior.  If there is at least one UNIQUE
 * index on the materialized view, we have exactly the guarantee we need.  By
 * joining based on equality on all columns which are part of any unique
 * index, we identify the rows on which we can use UPDATE without any problem.
 * If any column is NULL in either the old or new version of a row (or both),
 * we must use DELETE and INSERT, since there could be multiple rows which are
 * NOT DISTINCT FROM each other, and we could otherwise end up with the wrong
 * number of occurrences in the updated relation.

I'm open to suggestions on better wording.

As an example of the way the full join gets into trouble with
duplicate rows when used for a diff, see this example:

test=# create table old (c1 int, c2 int);
CREATE TABLE
test=# create table new (c1 int, c2 int);
CREATE TABLE
test=# insert into old values
test-#   (1,1),(1,2),(1,2),(1,2),(1,3),(1,null),(1,null);
INSERT 0 7
test=# insert into new values
test-#   (1,1),(1,2),(1,2),(1,2),(1,2),(1,2),(1,4),(1,null),(1,null),(1,null);
INSERT 0 10

At this point it is clear that we need to add two rows with values
(1,2) and we need to wind up with one more row with values (1,null)
than we already have.  We also need to delete (1,3) and add (1,4). 
But full join logic fails to get things right for the case of
duplicate rows with no nulls:

test=# select old, new
test-#   from old
test-#   full join new on old.c1 = new.c1 and old.c2 = new.c2
test-#   where (old.*) is distinct from (new.*);
  old  |  new  
---+---
 (1,3) | 
 (1,)  | 
 (1,)  | 
   | (1,4)
   | (1,)
   | (1,)
   | (1,)
(7 rows)
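
For illustration only (this is not how the patch implements its diff), a
duplicate-aware comparison has to compare how many copies of each distinct
row exist on each side, e.g.:

select c1, c2,
       sum(case when src = 'old' then 1 else 0 end) as n_old,
       sum(case when src = 'new' then 1 else 0 end) as n_new
  from (select 'old' as src, * from old
        union all
        select 'new' as src, * from new) s
 group by c1, c2
having sum(case when src = 'old' then 1 else 0 end)
    <> sum(case when src = 'new' then 1 else 0 end);

With the tables above, that reports the extra (1,2) and (1,NULL) copies, the
(1,3) row to delete, and the (1,4) row to add, while (1,1) drops out.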

 It's a shame we cannot refresh it concurrently if we have
 duplicate rows in the matview.  I thought it would make sense to
 allow it without unique key if it as only performance tradeoffs.

Well, we *could* allow it without a unique index, but the code
would be more complex and significantly slower.  I think we would
still want to handle it the way the current patch does when a
unique index is present, even if we have a way to handle cases
where such an index is not present.  Even if I were convinced it
was worthwhile to support the more general case, I would want to
commit this patch first, and add the more complicated code as a
follow-on patch.

At this point I'm not convinced of the value of supporting
concurrent refresh of materialized views which contain duplicate
rows, nor of the benefit of allowing it to work ten 

Re: [HACKERS] Add more regression tests for CREATE OPERATOR

2013-06-27 Thread Tom Lane
Robins Tharakan thara...@gmail.com writes:
 2. Should I assume that all database objects that get created, need to be
 dropped explicitly? Or is this point specifically about ROLES?

It's about any global objects (that wouldn't get dropped by dropping the
regression database).  As far as local objects go, there are benefits to
leaving them around, particularly if they present interesting test cases
for pg_dump/pg_restore.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add more regression tests for dbcommands

2013-06-27 Thread Peter Eisentraut
On 6/27/13 10:20 AM, Robert Haas wrote:
 So I'd like to endorse Josh's idea: subject to appropriate review,
 let's add these test cases.  Then, if it really turns out to be too
 burdensome, we can take them out, or figure out a sensible way to
 split the suite.  Pushing all of Robins work into a secondary suite,
 or throwing it out altogether, both seem to me to be things which will
 not be to the long-term benefit of the project.

I agree.  If it gets too big, let's deal with it then.  But let's not
make things more complicated because they *might* get too big later.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add more regression tests for dbcommands

2013-06-27 Thread Peter Eisentraut
On 6/26/13 12:17 PM, Tom Lane wrote:
 (I like to
 point at mysql's regression tests, which take well over an hour even on
 fast machines.  If lots of tests are so helpful, why is their bug rate
 no better than ours?)

Tests are not (primarily) there to prevent bugs.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Move unused buffers to freelist

2013-06-27 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:

 I don't think I actually found any workload where the bgwriter
  actually wrote out a relevant percentage of the necessary pages.

I had one at Wisconsin Courts.  The database which we targeted with
logical replication from the 72 circuit court databases (plus a few
others) on six database connection pool with about 20 to (at peaks)
hundreds of transactions per second modifying the database (the
average transaction involving about 20 modifying statements with
potentially hundreds of affected rows), with maybe 2000 to 3000
queries per second on a 30 connection pool, wrote about one-third
each of the dirty buffers with checkpoints, background writer, and
backends needing to read a page.  I shared my numbers with Greg,
who I believe used them as one of his examples for how to tune
memory, checkpoints, and background writer, so you might want to
check with him if you want more detail.

Of course, we set bgwriter_lru_maxpages = 1000 and
bgwriter_lru_multiplier = 4, and kept shared_buffers to 2GB to hit
that.  Without the reduced shared_buffers and more aggressive
bgwriter we hit the problem with writes overwhelming the RAID
controller's cache and causing everything in the database to
freeze until it cleared some cache space.
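
For reference, those settings as a postgresql.conf excerpt (exactly the
values described above, not a general recommendation):

shared_buffers = 2GB
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 4.0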

I'm not saying this invalidates your general argument; just that
such cases do exist.  Hopefully this data point is useful.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add more regression tests for dbcommands

2013-06-27 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 On 6/26/13 12:17 PM, Tom Lane wrote:
 (I like to
 point at mysql's regression tests, which take well over an hour even on
 fast machines.  If lots of tests are so helpful, why is their bug rate
 no better than ours?)

 Tests are not (primarily) there to prevent bugs.

Really?  Pray tell, what do you think they're for?  Particularly code
coverage tests, which is what these are?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error code returned by lock_timeout

2013-06-27 Thread Fujii Masao
On Thu, Jun 27, 2013 at 2:22 PM, Boszormenyi Zoltan z...@cybertec.at wrote:
 Hi,

 I just realized that in the original incarnation of lock_timeout,
 I used ERRCODE_LOCK_NOT_AVAILABLE (to be consistent with NOWAIT)
 but the patch that was accepted into 9.3 contained ERRCODE_QUERY_CANCELED
 which is the same as for statement_timeout.

 Which would be more correct?

ISTM that ERRCODE_LOCK_NOT_AVAILABLE is more suitable...

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [9.3 doc fix] ECPG VAR command does not describe the actual specification

2013-06-27 Thread Michael Meskes
 Looking around the 9.3 doc, I found a small, but not-insignificant
 error in the documentation.

Thanks for finding and fixing. Patch committed.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-27 Thread Robert Haas
On Tue, Jun 25, 2013 at 4:28 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 The only feedback we have on how bad things are is how long it took
 the last fsync to complete, so I actually think that's a much better
 way to go than any fixed sleep - which will often be unnecessarily
 long on a well-behaved system, and which will often be far too short
 on one that's having trouble. I'm inclined to think think Kondo-san
 has got it right.

 Quite possible, I really don't know. I'm inclined to first try the simplest
 thing possible, and only make it more complicated if that's not good enough.
 Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep
 between every fsync, unless you're behind the schedule, is even simpler.

I'm pretty sure Greg Smith tried it the fixed-sleep thing before and
it didn't work that well.  I have also tried it and the resulting
behavior was unimpressive.  It makes checkpoints take a long time to
complete even when there's very little data to flush out to the OS,
which is annoying; and when things actually do get ugly, the sleeps
aren't long enough to matter.  See the timings Kondo-san posted
downthread: 100ms delays aren't going to let the system recover in any
useful way when the fsync can take 13 s for one file.  On a system
that's badly weighed down by I/O, the fsync times are often
*extremely* long - 13 s is far from the worst you can see.  You have
to give the system a meaningful time to recover from that, allowing
other processes to make meaningful progress before you hit it again,
or system performance just goes down the tubes.  Greg's test, IIRC,
used 3 s sleeps rather than your proposal of 100 ms, but it still
wasn't enough.

 In
 particular, it's easier to tie that into the checkpoint scheduler - I'm not
 sure how you'd measure progress or determine how long to sleep unless you
 assume that every fsync is the same.

I think the thing to do is assume that the fsync phase will take 10%
or so of the total checkpoint time, but then be prepared to let the
checkpoint run a bit longer if the fsyncs end up being slow.  As Greg
has pointed out during prior discussions of this, the normal scenario
when things get bad here is that there is no way in hell you're going
to fit the checkpoint into the originally planned time.  Once all of
the write caches between PostgreSQL and the spinning rust are full,
the system is in trouble and things are going to suck.  The hope is
that we can stop beating the horse while it is merely in intensive
care rather than continuing until the corpse is fully skeletized.
Fixed delays don't work because - to push an already-overdone metaphor
a bit further - we have no idea how much of a beating the horse can
take; we need something adaptive so that we respond to what actually
happens rather than making predictions that will almost certainly be
wrong a large fraction of the time.

To put this another way, when we start the fsync() phase, it often
consumes 100% of the available I/O on the machine, completely starving
every other process that might need any.  This is certainly a
deficiency in the Linux I/O scheduler, but as they seem in no hurry to
fix it we'll have to cope with it as best we can.  If you do the
fsyncs in fast succession (and 100ms separation might as well be no
separation at all), then the I/O starvation of the entire system
persists through the entire fsync phase.  If, on the other hand, you
sleep for the same amount of time the previous fsync took, then on the
average, 50% of the machine's I/O capacity will be available for all
other system activity throughout the fsync phase, rather than 0%.
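
A minimal sketch of that pacing idea (illustration only, not the actual
patch; a real implementation would also cap or chunk very long sleeps):

#include <sys/time.h>
#include <unistd.h>

/* fsync one file, then sleep for about as long as the fsync itself took,
 * leaving roughly half of the I/O capacity to other processes on average */
static void
fsync_with_adaptive_pause(int fd)
{
    struct timeval before, after;
    long elapsed_usec;

    gettimeofday(&before, NULL);
    fsync(fd);
    gettimeofday(&after, NULL);

    elapsed_usec = (after.tv_sec - before.tv_sec) * 1000000L
                 + (after.tv_usec - before.tv_usec);

    if (elapsed_usec > 0)
        usleep(elapsed_usec);
}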

Now, unfortunately, this is still not that good, because it's often
the case that all of the fsyncs except one are reasonably fast, and
there's one monster one that is very slow.  ext3 has a known bad
behavior that dumps all dirty data for the entire *filesystem* when
you fsync, which tends to create these kinds of effects.  But even on
better-behaved filesystem, like ext4, it's fairly common to have one
fsync that is painfully longer than all the others.   So even with
this patch, there are still going to be cases where the whole system
becomes unresponsive.  I don't see any way to to do better without a
better kernel API, or a better I/O scheduler, but that doesn't mean we
shouldn't do at least this much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error code returned by lock_timeout

2013-06-27 Thread Boszormenyi Zoltan

2013-06-27 17:03 keltezéssel, Fujii Masao írta:

On Thu, Jun 27, 2013 at 2:22 PM, Boszormenyi Zoltan z...@cybertec.at wrote:

Hi,

I just realized that in the original incarnation of lock_timeout,
I used ERRCODE_LOCK_NOT_AVAILABLE (to be consistent with NOWAIT)
but the patch that was accepted into 9.3 contained ERRCODE_QUERY_CANCELED
which is the same as for statement_timeout.

Which would be more correct?

ISTM that ERRCODE_LOCK_NOT_AVAILABLE is more suitable...


I think, too. Can someone commit this one-liner to master and 9.3?

Best regards,
Zoltán Böszörményi

--
--
Zoltán Böszörményi
Cybertec Schönig  Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/

--- src/backend/tcop/postgres.c~	2013-06-27 07:05:08.0 +0200
+++ src/backend/tcop/postgres.c	2013-06-27 17:10:28.040972642 +0200
@@ -2900,7 +2900,7 @@
 			DisableNotifyInterrupt();
 			DisableCatchupInterrupt();
 			ereport(ERROR,
-	(errcode(ERRCODE_QUERY_CANCELED),
+	(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
 	 errmsg("canceling statement due to lock timeout")));
 		}
 		if (get_timeout_indicator(STATEMENT_TIMEOUT, true))

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] add --progress option to pgbench (submission 3)

2013-06-27 Thread Robert Haas
On Wed, Jun 26, 2013 at 7:16 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 Here is a v4 that takes into account most of your points: The report is
 performed for all threads by thread 0, however --progress is not supported
 under thread fork emulation if there are more than one thread. The report
 time does not slip anymore.

I don't believe that to be an acceptable restriction.  We generally
require features to work on all platforms we support.  We have made
occasional compromises there, but generally only when the restriction
is fundamental to the platform rather than for developer convenience.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commits Vs WAL Writes

2013-06-27 Thread Robert Haas
On Thu, Jun 27, 2013 at 3:56 AM, Atri Sharma atri.j...@gmail.com wrote:
 When we do a commit, WAL buffers are written to the disk. This has a
 disk latency for the required I/O.

Check.

 Now, with group commits, do we see a spike in that disk write latency,
 especially in the cases where the user has set wal_buffers to a high
 value?

Well, it does take longer to fsync a larger byte range to disk than a
smaller byte range, in some cases.  But it's generally more efficient
to write one larger range than many smaller ranges, so you come out
ahead on the whole.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Documentation/help for materialized and recursive views

2013-06-27 Thread Robert Haas
On Thu, Jun 27, 2013 at 5:29 AM, Magnus Hagander mag...@hagander.net wrote:
 The functionality of materialized views will (over time) totally swamp
 that of normal views, so mixing all the corresponding documentation
 with the documentation for normal views probably doesn’t make things
 easier for people that are only interested in normal views.

 That's a better point I think. That said, it would be very useful if
 it actually showed up in \h CREATE VIEW in psql - I wonder if we
 should just add the syntax to that page, and then link said future
 information on a separate page somehow?

IMHO, it's better to keep them separate; they are very different beasts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add more regression tests for dbcommands

2013-06-27 Thread Peter Eisentraut
On 6/27/13 10:57 AM, Tom Lane wrote:
 Peter Eisentraut pete...@gmx.net writes:
 On 6/26/13 12:17 PM, Tom Lane wrote:
 (I like to
 point at mysql's regression tests, which take well over an hour even on
 fast machines.  If lots of tests are so helpful, why is their bug rate
 no better than ours?)
 
 Tests are not (primarily) there to prevent bugs.
 
 Really?  Pray tell, what do you think they're for?  Particularly code
 coverage tests, which is what these are?

Tests document the existing functionality of the code, to facilitate
refactoring. (paraphrased, Uncle Bob)

Case in point, the existing ALTER DDL code could arguably use even more
refactoring.  But no one wants to do it because it's unclear what will
break.  With the proposed set of tests (which I have not read to
completion), this could become quite a bit easier, both for the coder
and the reviewer.  But these tests probably won't detect, say, locking
bugs in such new code.  That can only be prevented by careful code
review and a clean architecture.  Perhaps, MySQL has neither. ;-)

Code coverage is not an end itself, it's a tool.

In this sense, tests prevent existing functionality being broken, which
might be classified as a bug.  But it's wrong to infer that because
system X has a lot of bugs and a lot of tests, having a lot of tests
must be useless.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reduce maximum error in tuples estimation after vacuum.

2013-06-27 Thread Robert Haas
On Thu, Jun 27, 2013 at 7:27 AM, Amit Kapila amit.kap...@huawei.com wrote:
 Now I can look into it further, I have still not gone through in detail
 about your new approach to calculate the reltuples, but I am wondering
 whether there can be anyway with which estimates can be improved with
 different calculation in vac_estimate_reltuples().

I think this is getting at the threshold question for this patch,
which is whether it's really making things better or just moving the
problems around.  I mean, I have no problem accepting that the new
algorithm is (1) reasonably cheap and (2) better in some cases.  But
if it's worse in other cases, which AFAICS hasn't been discussed, then
it's no good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Robert Haas
On Thu, Jun 27, 2013 at 7:29 AM, Marko Kreen mark...@gmail.com wrote:
 On Thu, Jun 27, 2013 at 11:28 AM, Dean Rasheed dean.a.rash...@gmail.com 
 wrote:
 On 26 June 2013 21:46, Peter Eisentraut pete...@gmx.net wrote:
 On 6/26/13 4:04 PM, Dean Rasheed wrote:
 A quick google search reveals several people asking for something like
 this, and people recommending md5(string_agg(...)) or
 md5(string_agg(md5(...))) based solutions, which are doomed to failure
 on larger tables.

 The thread discussed several other options of checksumming tables that
 did not have the air of a crytographic offering, as Noah put it.


 True but md5 has the advantage of being directly comparable with the
 output of Unix md5sum, which would be useful if you loaded data from
 external files and wanted to confirm that your import process didn't
 mangle it.

 The problem with md5_agg() is that it's only useful in toy scenarios.

 It's more useful to give people a script that does the same sum(hash(row))
 on the dump file than to try to run MD5 on ordered rows.

 Also, I don't think anybody actually cares about MD5(table-as-bytes); instead
 people want a way to check whether 2 tables, or a table and a dump, are the same.

I think you're trying to tell Dean to write the patch that you want
instead of the patch that he wants.  There are certainly other things
that could be done that some people might sometimes prefer, but that
doesn't mean what he did isn't useful.

That having been said, I basically agree with Noah: I think this would
be a useful extension (perhaps even in contrib?) but I don't think we
need to install it by default.  It's useful, but it's also narrow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] in-catalog Extension Scripts and Control parameters (templates?)

2013-06-27 Thread Robert Haas
On Thu, Jun 27, 2013 at 5:49 AM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 I think that's a limitation of the old model and we don't want to turn
 templates for extensions into being shared catalogs. At least that's my
 understanding of the design consensus.

I agree.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fixing pg_ctl with relative paths

2013-06-27 Thread Fujii Masao
On Thu, Jun 27, 2013 at 10:36 AM, Josh Kupershmidt schmi...@gmail.com wrote:
 On Wed, Jun 26, 2013 at 12:22 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Jun 26, 2013 at 2:36 PM, Hari Babu haribabu.ko...@huawei.com wrote:
 On June 26, 2013 5:02 AM Josh Kupershmidt wrote:
Thanks for the feedback. Attached is a rebased version of the patch with
 the two small issues noted fixed.

 The following description in the document of pg_ctl needs to be modified?

 restart might fail if relative paths specified were specified on
 the command-line during server start.

 Right, that caveat could go away.

 +#define DATADIR_SPEC   "\"-D\" \""
 +
 +   datadir = strstr(post_opts, DATADIR_SPEC);

 Though this is a corner case, the patch doesn't seem to handle properly the 
 case
 where -D appears as other option value, e.g., -k option value, in
 postmaster.opts
 file.

 Could I see a command-line example of what you mean?

postmaster -k -D, for example. Of course, it's really a corner case :)

Another corner case is, for example, pg_ctl -D test1 -o "-D test2",
that is, multiple -D specifications appear in the command-line.

Can we overlook these cases?

 One idea to work around that problem is to just append the specified -D
 option and value to post_opts. IOW, the -D option and value appear twice in
 post_opts. In this case, the later one is used in the end. Thoughts?

 Hrm, I think we'd have to be careful that postmaster.opts doesn't
 accumulate an additional -D specification with every restart.

Yes. Oh, I was thinking that postmaster writes only the -D specification
which it actually uses into the opts file, so that accumulation would not
happen. But that's not true: postmaster writes all the specified arguments
to the opts file.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-27 Thread Christopher Browne
On Thu, Jun 27, 2013 at 10:37 AM, Robert Haas robertmh...@gmail.com wrote:

 On Tue, Jun 25, 2013 at 2:27 PM, Andrew Dunstan and...@dunslane.net
 wrote:
  I'd like to see prizes each release for best contribution and best
  reviewer - I've thought for years something like this would be worth
  trying. Committers and core members should not be eligible - this is
 about
  encouraging new people.

 Encouraging new people is good, but recognizing sustained, long-term
 contributions is good, too.  I think we should do more of that, too.


Conforming with David Fetter's pointer to the notion that sometimes attempts
to reward can backfire, I'm not sure that it will be super-helpful to
create special
rewards.

On the other hand, to recognize reviewer contributions in places relevant to
where they take place seems pretty apropos, which could include:

a) Obviously we already capture this in the CommitFest web site (but it's
worth mentioning when trying to do a census)

b) It would be a pretty good thing to mention reviewers within commit notes;
that provides some direct trace-back as to who it was that either validated
that the change was good, or that let a bad one slip through.

c) The release notes indicate authors of changes; to have a list of
reviewers would be a fine thing.

If it requires inordinate effort to get the reviewers directly attached to
each and every change, perhaps it isn't worthwhile to go to extreme efforts
to that end.

It could be pretty satisfactory to have a simple listing, in the release
notes, of the set of reviewers.  That's a lot less bookkeeping than tracking
this for each and every change.

The statement of such a list is a public acknowledgement of those that help
assure that the quality of PostgreSQL code remains excellent. (And that may
represent a good way to sell this kudo.)

This allows organizations that are sponsoring PostgreSQL development to
have an extra metric by which *they* can recognize that their staff that
do such work are being recognized as contributors.  It seems to me that
this is way more useful than a free t-shirt or the like.


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-27 Thread Bruce Momjian
On Thu, Jun 27, 2013 at 11:50:07AM -0400, Christopher Browne wrote:
 b) It would be a pretty good thing to mention reviewers within commit notes;
 that provides some direct trace-back as to who it was that either validated
 that the change was good, or that let a bad one slip through.
 
 c) The release notes indicate authors of changes; to have a list of reviewers
 would be a fine thing.
 
 If it requires inordinate effort to get the reviewers directly attached to 
 each
 and every change, perhaps it isn't worthwhile to go to extreme efforts to that
 end.
 
 It could be pretty satisfactory to have a simple listing, in the release 
 notes,
 of the set of reviewers.  That's a lot less bookkeeping than tracking this for
 each and every change.

Adding the names to each release note item is not a problem;  the
problem is the volume of names that overwhelms the release note text.  If
we went that direction, I predict we would just remove _all_ names from
the release notes.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Min value for port

2013-06-27 Thread Christopher Browne
On Thu, Jun 27, 2013 at 9:22 AM, Andres Freund and...@2ndquadrant.comwrote:

 On 2013-06-27 15:11:26 +0200, Magnus Hagander wrote:
  On Thu, Jun 27, 2013 at 2:16 PM, Peter Eisentraut pete...@gmx.net
 wrote:
   On 6/27/13 6:34 AM, Magnus Hagander wrote:
   Is there a reason why we have set the min allowed value for port to 1,
   not 1024? Given that you can't actually start postgres with a value of
   <1024, shouldn't the entry in pg_settings reference that as well?
  
   Are you thinking of the restriction that you need to be root to use
   ports <1024?  That restriction is not necessarily universal.  We can
 let
   the kernel tell us at run time if it doesn't like our port.
 
  Yes, that's the restriction I was talking about. It's just a bit
  annoying that if you look at pg_settings.min_value it doesn't actually
  tell you the truth. But yeah, I believe Windows actually lets you use
  a lower port number, so it'd at least have to be #ifdef'ed for that if
  we wanted to change it.

 You can easily change the setting on linux as well. And you can grant
 specific binaries the permission to bind to restricted ports without
 being root.
 I don't think the additional complexity to get a sensible value in there
 is warranted.


With that large a set of local policies that can change the usual
<1024 policy, yep, I agree that it's not worth trying too hard on this
one.

And supposing something like SE-Linux can grant bindings for a particular
user/binary to access a *specific* port, that represents a model that is
pretty incompatible with the notion of a minimum value.

On the one hand, the idea of having to add a lot of platform-specific
code (which may further be specific to a framework like SE-Linux)
is not terribly appealing.

Further, if the result is something that doesn't really fit with a minimum,
is it much worth fighting with the platform localities?

Indeed, I begin to question whether indicating a minimum is actually
meaningful.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?


Re: [HACKERS] extensible external toast tuple support snappy prototype

2013-06-27 Thread Hitoshi Harada
On Thu, Jun 27, 2013 at 6:08 AM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Jun 19, 2013 at 3:27 AM, Andres Freund and...@2ndquadrant.com
 wrote:
  There will be a newer version of the patch coming today or tomorrow, so
  there's probably no point in looking at the one linked above before
  that...

 This patch is marked as Ready for Committer in the CommitFest app.
 But it is not clear to me where the patch is that is being proposed
 for commit.



 No, this conversation is about patch #1153, pluggable toast compression,
which is in Needs Review, and you may be confusing it with #1127, extensible
external toast tuple support.


-- 
Hitoshi Harada


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-27 Thread Andrew Dunstan


On 06/27/2013 12:12 PM, Tom Lane wrote:

Bruce Momjian br...@momjian.us writes:

On Thu, Jun 27, 2013 at 11:50:07AM -0400, Christopher Browne wrote:

It could be pretty satisfactory to have a simple listing, in the
release notes, of the set of reviewers.  That's a lot less
bookkeeping than tracking this for each and every change.

Adding the names to each release note item is not a problem;  the
problem is the volume of names that overwhelms the release note text.  If
we went that direction, I predict we would just remove _all_ names from
the release notes.

Yeah.  Keep in mind that the overwhelming majority of the audience for
the release notes doesn't actually give a darn who implemented what.



Maybe we should have a Kudos / Bragging rights wiki page, with a table 
something like this:


   Release | Feature Name | Principal Author(s) | Contributing Author(s) | Code Reviewer(s) | Tester(s)


Constructing it going backwards would be an interesting task :-)


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commits Vs WAL Writes

2013-06-27 Thread Peter Geoghegan
On Thu, Jun 27, 2013 at 12:56 AM, Atri Sharma atri.j...@gmail.com wrote:
 Now, with group commits, do we see a spike in that disk write latency,
 especially in the cases where the user has set wal_buffers to a high
 value?

commit_delay exists to artificially increase the window in which the
leader backend waits for more group commit followers. At higher client
counts, that isn't terribly useful because you'll naturally have
enough clients anyway, but at lower client counts, particularly where
fsyncs have high latency, it can help quite a bit. I mention this
because commit_delay is clearly intended to trade off latency for
throughput. Having said that, when I worked on commit_delay, the
average and worst-case latencies actually *improved* for the workload
in question, which consisted of lots of small write transactions. Still,
I wouldn't be surprised if you could produce a reasonable case where
latency was hurt a bit but throughput improved.
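
For reference, a minimal sketch of the two settings involved (values are
illustrative only, not recommendations; commit_delay is in microseconds,
and both settings can go in postgresql.conf or, for a superuser, be set
per session as below):

    SET commit_delay = 20;    -- leader waits up to 20us for more followers
    SET commit_siblings = 5;  -- ...but only with at least 5 other open xacts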

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hash partitioning.

2013-06-27 Thread Nicolas Barbier
2013/6/27 Markus Wanner mar...@bluegap.ch:

 On 06/27/2013 11:12 AM, Nicolas Barbier wrote:

 Imagine that there are a lot of indexes, e.g., 50. Although a lookup
 (walking one index) is equally fast, an insertion must update all 50
 indexes. When each index requires one extra I/O (because each index is
 one level taller), that is 50 extra I/Os. In the partitioned case,
 each index would require the normal smaller amount of I/Os. Choosing
 which partition to use must only be done once: The result “counts” for
 all indexes that are to be updated.

 I think you're underestimating the cost of partitioning. After all, the
 lookup of what index to update for a given partition is a lookup in
 pg_index via pg_index_indrelid_index - a btree index.

I am assuming that this (comparatively very small and super-hot) index
is cached all the time, while for the other indexes (that are
supposedly super-huge) only the top part stays cached.

I am mostly just trying to find out where Yuri’s “partitioning is
needed for super-huge tables” experience might come from, and noting
that Heikki’s argument might not be 100% valid. I think that the
“PostgreSQL-answer” to this problem is to somehow cluster the data on
the “hotness column” (so that all hot rows are close to one another,
thereby improving the efficiency of the caching of relation blocks) +
partial indexes for the hot rows (as first mentioned by Claudio; to
improve the efficiency of the caching of index blocks).
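
As a rough sketch of that approach (hypothetical table and column names,
purely illustrative):

    -- Physically co-locate the hot rows once; note that CLUSTER is a
    -- one-off reorder and is not maintained by later writes.
    CREATE INDEX accounts_hotness_idx ON accounts (is_hot, account_id);
    CLUSTER accounts USING accounts_hotness_idx;

    -- Partial index: only the hot rows pay the maintenance cost here,
    -- and the index stays small enough to remain cached.
    CREATE INDEX accounts_hot_balance_idx ON accounts (balance)
        WHERE is_hot;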

 Additionally, the depth of an index doesn't directly translate to the
 number of I/O writes per insert (or delete). I'd rather expect the avg.
 number of I/O writes per insert into a b-tree to be reasonably close to
 one - depending mostly on the number of keys per page, not depth.

My reasoning was: To determine which index block to update (typically
one in both the partitioned and non-partitioned cases), one needs to
walk the index first, which supposedly causes one additional (read)
I/O in the non-partitioned case on average, because there is one extra
level and the lower part of the index is not cached (because of the
size of the index). I think that pokes a hole in Heikki’s argument of
“it really doesn’t matter, partitioning == using one big table with
big non-partial indexes.”

Nicolas

-- 
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] add --progress option to pgbench (submission 3)

2013-06-27 Thread Fabien COELHO


Dear Robert,


Here is a v4 that takes into account most of your points: the report is
performed for all threads by thread 0; however, --progress is not supported
under thread fork emulation if there is more than one thread. The report
time does not slip anymore.


I don't believe that to be an acceptable restriction.


The pthread fork emulation is just an ugly hack to run pgbench on a host
that does not have pthreads (portable threads). I'm not sure that it
applies to any significant system, but I can assure you that it imposes
severe limitations on how to do things properly in pgbench: as there are
no threads, there is no shared memory, no locking mechanism, nothing
really. So it is hard to generate a shared report under such conditions.


My first proposal is to remove the fork emulation altogether, which would
lift many artificial limitations in pgbench and simplify the code
significantly. That would be an improvement.


Otherwise, the simplest possible adaptation, if the progress feature must
also work under fork emulation for the patch to be accepted, is to have
each process report its own current progress instead of producing a
collective sum.


Note that it is possible to implement the feature with interprocess
communication, but creating that many pipes would add a lot of complexity
to the code, and I do not think that either the code or this simple
feature deserves that.


Another option is to have each thread report its progress independently
in all implementations, which is what I did in the first version. That is
much less interesting, but it would be homogeneous, although equally
poor, across all versions.


We generally require features to work on all platforms we support.  We 
have made occasional compromises there, but generally only when the 
restriction is fundamental to the platform rather than for developer 
convenience.


I generally agree with that kind of rule, but please consider that pthread
fork emulation really means separate processes, so things that are simple
with threads become significantly more complex to implement.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn

2013-06-27 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-06-26 20:07:40 -0400, Tom Lane wrote:
 I still want to delete the test for SOCK_ERRNO == 0.  I traced that back
 to commit da9501bddb4dc33c031b1db6ce2133bcee7b, but I can't find
 anything in the mailing list archives to explain that.  I have an
 inquiry in to Jan to see if he can remember the reason ...

 That looks strange, yes.

Committed that way.  We can always undo it if Jan can defend the logic.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Kudos for Reviewers -- straw poll

2013-06-27 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 On Thu, Jun 27, 2013 at 11:50:07AM -0400, Christopher Browne wrote:
 It could be pretty satisfactory to have a simple listing, in the
 release notes, of the set of reviewers.  That's a lot less
 bookkeeping than tracking this for each and every change.

 Adding the names to each release note item is not a problem;  the
 problem is the volume of names that overwhelms the release note text.  If
 we went that direction, I predict we would just remove _all_ names from
 the release notes.

Yeah.  Keep in mind that the overwhelming majority of the audience for
the release notes doesn't actually give a darn who implemented what.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Peter Eisentraut
On 6/27/13 4:19 AM, Dean Rasheed wrote:
 I'd say there are clearly people who want it, and the nature of some
 of those answers suggests to me that we ought to have a better answer
 in core.

It's not clear what these people wanted this functionality for.  They
all wanted to analyze a table to compare with another table (or the same
table later).  Either they wanted this to detect data changes, in which
case the right tool is a checksum, not a cryptographic hash.  We already
have several checksum implementations in core, so we could expose one of
them.  Or they wanted this to protect their data from tampering, in
which case the right tool is a cryptographic hash, but Noah argues that
a sum of MD5 hashes is not cryptographically sound.  (And in any case,
we don't put cryptographic functionality into the core.)

The reason md5_agg is proposed here and in all those cited posts is
presumably because the md5() function was already there anyway.  The
md5() function is there because the md5 code was already there anyway
because of authentication.  Let's not add higher-order
already-there-anyway code. ;-)
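
(For what it's worth, the already-there-anyway workaround those posts tend
to land on looks roughly like the sketch below, with a hypothetical table
name t; forcing the ordering and building one giant string is exactly the
kind of cost a dedicated aggregate would be trying to avoid.)

    SELECT md5(string_agg(t::text, ',' ORDER BY t::text)) AS table_digest
    FROM t;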



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group Commits Vs WAL Writes

2013-06-27 Thread Atri Sharma

 commit_delay exists to artificially increase the window in which the
 leader backend waits for more group commit followers. At higher client
 counts, that isn't terribly useful because you'll naturally have
 enough clients anyway, but at lower client counts, particularly where
 fsyncs have high latency, it can help quite a bit. I mention this
 because commit_delay is clearly intended to trade off latency for
 throughput. Having said that, when I worked on commit_delay, the
 average and worst-case latencies actually *improved* for the workload
 in question, which consisted of lots of small write transactions. Still,
 I wouldn't be surprised if you could produce a reasonable case where
 latency was hurt a bit but throughput improved.

Thanks for your reply.

The logic says that latency should take a hit when commit_delay is applied,
so I am really interested in why we see an improvement instead.

Can small writes be the reason?

Regards,

Atri


--
Regards,

Atri
l'apprenant


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

