[HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read

2013-07-23 Thread Satoshi Nagayasu
Hi,

I've been working on a new pgstattuple function that allows
block sampling [1] in order to reduce block reads while
scanning a table. A PoC patch is attached.

[1] Re: [RFC] pgstattuple/pgstatindex enhancement

http://www.postgresql.org/message-id/ca+tgmoaxjhgz2c4ayfbr9muuvnhgwu4co-cthqpzrwwdtam...@mail.gmail.com

This new function, pgstattuple2(), randomly samples only 3,000 blocks
(which amounts to 24MB) from the table, and estimates
several parameters of the entire table.

The function calculates averages over the sampled blocks, estimates
the table-wide parameters (averages and standard deviations), and reports
standard errors (SE) so that the status of the table can be estimated
with a statistical approach.

And, of course, it reduces the number of physical block reads
when scanning a larger table.
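
To illustrate the kind of estimation involved (a simplified sketch, not the
code in the attached patch; the function and names below are made up), the
per-block figures from the sample are extrapolated to the whole table, and a
standard error is derived from the sample variance:

/* Simplified sketch only; not taken from the attached patch. */
#include <math.h>

typedef struct
{
    double      est_total;      /* estimated total over all blocks */
    double      se_total;       /* standard error of that estimate */
} sample_estimate;

static sample_estimate
estimate_from_blocks(const double *per_block, int nsampled, double total_blocks)
{
    double      sum = 0.0;
    double      sumsq = 0.0;
    double      mean;
    double      var;
    sample_estimate result;
    int         i;

    for (i = 0; i < nsampled; i++)
    {
        sum += per_block[i];
        sumsq += per_block[i] * per_block[i];
    }
    mean = sum / nsampled;
    /* sample variance; finite population correction ignored for simplicity */
    var = (sumsq - nsampled * mean * mean) / (nsampled - 1);

    result.est_total = mean * total_blocks;                 /* scale up */
    result.se_total = sqrt(var / nsampled) * total_blocks;
    return result;
}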

The following example shows that the new pgstattuple2 function
runs roughly 100x faster than the original pgstattuple function
while producing well-estimated results.

--
postgres=# select * from pgstattuple('pgbench_accounts');
-[ RECORD 1 ]--+---
table_len  | 1402642432
tuple_count| 1000
tuple_len  | 121000
tuple_percent  | 86.27
dead_tuple_count   | 182895
dead_tuple_len | 22130295
dead_tuple_percent | 1.58
free_space | 21012328
free_percent   | 1.5

Time: 1615.651 ms
postgres=# select * from pgstattuple2('pgbench_accounts');
NOTICE:  pgstattuple2: SE tuple_count 2376.47, tuple_len 287552.58,
dead_tuple_count 497.63, dead_tuple_len 60213.08, free_space 289752.38
-[ RECORD 1 ]--+---
table_len  | 1402642432
tuple_count| 9978074
tuple_len  | 1207347074
tuple_percent  | 86.08
dead_tuple_count   | 187315
dead_tuple_len | 22665208
dead_tuple_percent | 1.62
free_space | 23400431
free_percent   | 1.67

Time: 15.026 ms
postgres=#
--

In addition, see the attached chart showing how well pgstattuple2
estimates table status over the course of a long-running pgbench test.

I understand that pgbench generates random transactions, and those
update operations might not have much skew across the table,
so estimating table status is probably easy in this test.

However, I'm still curious to know whether it would work with
real-world workloads.

Is it worth having this? Any comment or suggestion?

Regards,
-- 
Satoshi Nagayasu sn...@uptime.jp
Uptime Technologies, LLC. http://www.uptime.jp
diff --git a/contrib/pgstattuple/pgstattuple--1.1--1.2.sql 
b/contrib/pgstattuple/pgstattuple--1.1--1.2.sql
index 2783a63..8ebec6f 100644
--- a/contrib/pgstattuple/pgstattuple--1.1--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.1--1.2.sql
@@ -37,3 +37,17 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+CREATE FUNCTION pgstattuple2(IN relname regclass,
+OUT table_len BIGINT,  -- physical table length in bytes
+OUT tuple_count BIGINT,-- number of live tuples
+OUT tuple_len BIGINT,  -- total tuples length in bytes
+OUT tuple_percent FLOAT8,  -- live tuples in %
+OUT dead_tuple_count BIGINT,   -- number of dead tuples
+OUT dead_tuple_len BIGINT, -- total dead tuples length in bytes
+OUT dead_tuple_percent FLOAT8, -- dead tuples in %
+OUT free_space BIGINT, -- free space in bytes
+OUT free_percent FLOAT8)   -- free space in %
+AS 'MODULE_PATHNAME', 'pgstattuple2'
+LANGUAGE C STRICT;
+
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql 
b/contrib/pgstattuple/pgstattuple--1.2.sql
index e5fa2f5..963eb00 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.2.sql
@@ -77,3 +77,17 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+CREATE FUNCTION pgstattuple2(IN relname regclass,
+OUT table_len BIGINT,  -- physical table length in bytes
+OUT tuple_count BIGINT,-- number of live tuples
+OUT tuple_len BIGINT,  -- total tuples length in bytes
+OUT tuple_percent FLOAT8,  -- live tuples in %
+OUT dead_tuple_count BIGINT,   -- number of dead tuples
+OUT dead_tuple_len BIGINT, -- total dead tuples length in bytes
+OUT dead_tuple_percent FLOAT8, -- dead tuples in %
+OUT free_space BIGINT, -- free space in bytes
+OUT free_percent FLOAT8)   -- free space in %
+AS 'MODULE_PATHNAME', 'pgstattuple2'
+LANGUAGE C STRICT;
+
diff --git a/contrib/pgstattuple/pgstattuple.c 
b/contrib/pgstattuple/pgstattuple.c
index 7f41ec3..c2adb75 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -41,9 +41,22 @@ PG_MODULE_MAGIC;
 
 PG_FUNCTION_INFO_V1(pgstattuple);
 PG_FUNCTION_INFO_V1(pgstattuplebyid);
+PG_FUNCTION_INFO_V1(pgstattuple2);
 
 extern Datum pgstattuple(PG_FUNCTION_ARGS);
 extern 

Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-23 Thread Peter Geoghegan
On Mon, Jul 22, 2013 at 8:48 PM, Greg Smith g...@2ndquadrant.com wrote:
 And I can't get too excited about making this as my volunteer effort when I
 consider what the resulting credit will look like.  Coding is by far the
 smallest part of work like this, first behind coming up with the design in
 the first place.  And both of those are way, way behind how long review
 benchmarking takes on something like this.  The way credit is distributed
 for this sort of feature puts coding first, design not credited at all, and
 maybe you'll see some small review credit for benchmarks.  That's completely
 backwards from the actual work ratio.  If all I'm getting out of something
 is credit, I'd at least like it to be an appropriate amount of it.

FWIW, I think that's a reasonable request.


-- 
Peter Geoghegan




[HACKERS] anchovy failing on 9.1 and earlier since using gcc 4.8

2013-07-23 Thread Andres Freund
Hi,

Anchovy has been failing on 9.1 and earlier for quite some time now:
http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=anchovy&br=REL9_1_STABLE
The commit at that date looks rather inconspicuous. Looking at the
'config' section reveals the difference between the failing
http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=anchovy&dt=2013-03-31%2023%3A23%3A01
and
http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=anchovy&dt=2013-03-30%2000%3A23%3A01

2013-03-30: config:

configure:3244: checking for C compiler version
configure:3252: gcc --version >&5
gcc (GCC) 4.7.2

2013-03-31: config:

configure:3244: checking for C compiler version
configure:3252: gcc --version >&5
gcc (GCC) 4.8.0

(note that it's using 4.8.1 these days)

So, it started failing the day it switched to gcc 4.8.

Given that anchovy builds 9.2 and master happily and given that it is
the only gcc-4.8 animal in the buildfarm it seems quite possible that
there is some misoptimization issue in 9.1 and earlier with 4.8.

I seem to recall some discussions about whether to backpatch some
hardening against compiler optimizations, but I don't remember any
specifics.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] GSOC13 proposal - extend RETURNING syntax

2013-07-23 Thread Karol Trzcionka
On 23.07.2013 06:22, David Fetter wrote:
 What problem or problems did you notice, and what did you change to
 fix them?
UPDATE ... FROM generated ERROR:  variable not found in subplan
target lists. I've added a workaround in add_vars_to_targetlist:
- if it is an AFTER reference, simply change var->varno to the base RTE (it
should always exist; the value is temporary and will change to OUTER_VAR)
- if it is a BEFORE reference, add the new var to the targetlist independently
of rel->attr_needed[attno].
Additionally, I've changed build_joinrel_tlist to ignore any BEFORE
RTEs. The regression tests are updated.
Regards,
Karol




Re: [HACKERS] Using ini file to setup replication

2013-07-23 Thread Sawada Masahiko
On Mon, Jul 22, 2013 at 10:59 PM, Greg Stark st...@mit.edu wrote:
 On Fri, Jul 19, 2013 at 2:20 PM, Samrat Revagade
 revagade.sam...@gmail.com wrote:

 for example:
 if i want to configure 2 servers then it will add 6 lines,for 3 -9, for 4-12
 setting's for particular server will be:

 considering the way of setting value to conf parameters  : guc . value

 standby_name.'AAA'
 synchronous_transfer. commit
 wal_sender_timeout.60


 I have a feeling Samrat and Sawada-san have some good use cases where
 this extra syntax could be a big step up in expressiveness. But I'm
 having a hard time figuring out exactly what they have in mind. If
 either of you could explain in more detail how the extra syntax would
 apply to your use case and how it would let you do something that you
 can't already do it might be helpful.

 I'm assuming the idea is something like having a single config file
 which can work for the server regardless of whether it's acting as the
 primary or standby and then be able to switch roles by switching a
 single flag which would control which parts of the config file were
 applied. But I'm having a hard time seeing how exactly that would
 work.
In the approach where these GUC parameters are written in postgresql.conf,
the user would have to write many extra lines in postgresql.conf per standby
server, as Samrat suggested. That will increase the size of postgresql.conf.
I think it is not good for all those parameters to be written in
postgresql.conf.
That is, I think those parameters should be written in a separate
file; e.g., we can set any of them separately using include (or
include_if_exists) in postgresql.conf.
If the included file doesn't exist, we would use the default value for each
WAL sender. That is, we give up on the ini file and instead provide a
mechanism for configuring each WAL sender as an optional override.
Of course, to support this approach we need the patch which Andres
suggested, and the server should be able to handle tokens made of
two or more identifiers separated by a dot. So we would like to help with
this approach.

Regards,


Sawada Masahiko




Re: [HACKERS] anchovy failing on 9.1 and earlier since using gcc 4.8

2013-07-23 Thread Andres Freund
On 2013-07-23 09:11:00 +0200, Andres Freund wrote:
 Hi,
 
 Anchovy has been failing on 9.1 and earlier for quite some time now:
 http://pgbuildfarm.org/cgi-bin/show_history.pl?nm=anchovy&br=REL9_1_STABLE
 The commit at that date looks rather inconspicuous. Looking at the
 'config' section reveals the difference between the failing
 http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=anchovy&dt=2013-03-31%2023%3A23%3A01
 and
 http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=anchovy&dt=2013-03-30%2000%3A23%3A01
 
 2013-03-30: config:
 
 configure:3244: checking for C compiler version
 configure:3252: gcc --version >&5
 gcc (GCC) 4.7.2
 
 2013-03-31: config:
 
 configure:3244: checking for C compiler version
 configure:3252: gcc --version >&5
 gcc (GCC) 4.8.0
 
 (note that it's using 4.8.1 these days)
 
 So, it started failing the day it switched to gcc 4.8.
 
 Given that anchovy builds 9.2 and master happily and given that it is
 the only gcc-4.8 animal in the buildfarm it seems quite possible that
 there is some misoptimization issue in 9.1 and earlier with 4.8.
 
 I seem to recall some discussions about whether to backpatch some
 hardening against compiler optimizations, but I don't remember any
 specifics.

Ah, yes. Exactly this case has been discussed some weeks ago:
http://archives.postgresql.org/message-id/14242.1365200084%40sss.pgh.pa.us

I'd personally vote for doing both (1) and (2) mentioned in that email
in the back branches. I don't think the loop handling disabled by
-fno-aggressive-loop-optimizations is the only case where the compiler can
actually use such knowledge; it very well might optimize entire code blocks
away.
And I don't think this change is the only one where the aggressive loop
optimizations might play a role, especially in the older branches.
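
To make the failure mode concrete, here is a made-up illustration (not code
from the tree) of the pattern involved: a nominally one-element array at the
end of a struct that is allocated larger and then iterated past its declared
bound, which gcc 4.8's aggressive loop optimizations are allowed to exploit:

/* Made-up illustration of the pattern, not actual backend code. */
#include <stddef.h>
#include <stdlib.h>

typedef struct
{
    int         nvalues;
    int         values[1];      /* actually nvalues entries follow */
} IntList;

static int
sum_values(const IntList *list)
{
    int         total = 0;
    int         i;

    /*
     * Based on the declared bound of values[], gcc 4.8's aggressive loop
     * optimizations may assume this loop iterates at most once and rewrite
     * or drop parts of it, even though the allocation below makes more
     * elements valid in practice.
     */
    for (i = 0; i < list->nvalues; i++)
        total += list->values[i];
    return total;
}

int
main(void)
{
    int         n = 8;
    IntList    *list = malloc(offsetof(IntList, values) + n * sizeof(int));
    int         i;

    list->nvalues = n;
    for (i = 0; i < n; i++)
        list->values[i] = i;
    return sum_values(list) == 28 ? 0 : 1;
}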

I naturally only found the mailing list entry after debugging the issue
myself...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] maintenance_work_mem and CREATE INDEX time

2013-07-23 Thread Amit Langote
On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote amitlangot...@gmail.com wrote:
 Hello,

 While understanding the effect of maintenance_work_mem on the time taken
 by CREATE INDEX, I observed that for values of
 maintenance_work_mem less than the value for which an internal sort is
 performed, the time taken by CREATE INDEX increases as
 maintenance_work_mem increases.

 My guess is that for all those values an external sort is chosen at
 some point, and the larger the value of maintenance_work_mem, the later
 the switch to the external sort is made, causing CREATE INDEX to take
 longer. That is, a smaller value of maintenance_work_mem would be
 preferable when an external sort is performed anyway. Does that make
 sense?


Upon further investigation, I found that the delay in switching to an
external sort caused by a larger value of maintenance_work_mem is
small compared to the total time of CREATE INDEX. So, plotting results for a
number of maintenance_work_mem values shows that its effect is
negligible. Are there any other parts of the external sort logic that
might make it slower with increasing values of maintenance_work_mem?
It seems the merge order and the number of tapes are related to
state->allowedMem.

Does that mean the external sort is affected by the value of
maintenance_work_mem in a way roughly similar to the above?

--
Amit Langote




Re: [HACKERS] Back branches vs. gcc 4.8.0

2013-07-23 Thread Andres Freund
On 2013-04-29 23:37:43 -0400, Peter Eisentraut wrote:
 On Sat, 2013-04-06 at 12:59 -0400, Tom Lane wrote:
  The reason I'm thinking it's a good idea is that it would expose any
  remaining places where we have nominally var-length arrays embedded in
  larger structs.  Now that I've seen the failures with gcc 4.8.0, I'm
  quite worried that there might be some more declarations like that
  which we've not identified yet, but that by chance aren't causing
  obvious failures today. 
 
 Here is a rough patch that replaces almost all occurrences of
 something[1] in a struct with FLEXIBLE_ARRAY_MEMBER.  It crashes left
 and right (because of sizeof issues, probably), but at least so far the
 compiler hasn't complained about any flexible-array members not at the
 end of the struct, which is what it did last time.  So the answer to
 your concern so far is negative.

I think this point in the cycle would be a good one to apply something
like this.
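
For readers following along, the conversion under discussion looks roughly
like the sketch below (illustrative, not an excerpt from the patch). The
sizeof issues come from the fact that the old one-element array was counted
by sizeof(), so allocation formulas written against the old layout change
meaning:

/* Illustrative sketch of the conversion, not patch code. */
#include <stddef.h>

#ifndef FLEXIBLE_ARRAY_MEMBER
/* Defined by the build system in PostgreSQL; defined here only so this
 * sketch stands alone.  Empty means a C99 flexible array member. */
#define FLEXIBLE_ARRAY_MEMBER
#endif

/* Before: pre-C99 idiom; sizeof(VarArrayOld) includes one element. */
typedef struct
{
    int         nitems;
    int         items[1];
} VarArrayOld;

/* After: sizeof(VarArrayNew) no longer includes any array elements. */
typedef struct
{
    int         nitems;
    int         items[FLEXIBLE_ARRAY_MEMBER];
} VarArrayNew;

/*
 * Allocation math therefore has to change from
 *     sizeof(VarArrayOld) + (n - 1) * sizeof(int)
 * to
 *     offsetof(VarArrayNew, items) + n * sizeof(int)
 * or the struct ends up under-allocated.
 */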

 Completing this patch will be quite a bit more debugging work.  Some
 kind of electric fence for palloc would be helpful.

Noah's recently added valgrind mode should provide this.

Do you have an updated version of this patch already? I'd be willing to
make a pass over it to check whether I find any missed updates...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] changeset generation v5-01 - Patches git tree

2013-07-23 Thread Andres Freund
On 2013-07-22 13:50:08 -0400, Robert Haas wrote:
 On Fri, Jun 14, 2013 at 6:51 PM, Andres Freund and...@2ndquadrant.com wrote:
  The git tree is at:
  git://git.postgresql.org/git/users/andresfreund/postgres.git branch 
  xlog-decoding-rebasing-cf4
  http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/xlog-decoding-rebasing-cf4
 
  On 2013-06-15 00:48:17 +0200, Andres Freund wrote:
  Overview of the attached patches:
  0001: indirect toast tuples; required but submitted independently
  0002: functions for testing; not required,
  0003: (tablespace, filenode) syscache; required
  0004: RelationMapFilenodeToOid: required, simple
  0005: pg_relation_by_filenode() function; not required but useful
  0006: Introduce InvalidCommandId: required, simple
  0007: Adjust Satisfies* interface: required, mechanical,
  0008: Allow walsender to attach to a database: required, needs review
  0009: New GetOldestXmin() parameter; required, pretty boring
  0010: Log xl_running_xact regularly in the bgwriter: required
  0011: make fsync_fname() public; required, needs to be in a different file
  0012: Relcache support for an Relation's primary key: required
  0013: Actual changeset extraction; required
  0014: Output plugin demo; not required (except for testing) but useful
  0015: Add pg_receivellog program: not required but useful
  0016: Add test_logical_decoding extension; not required, but contains
the tests for the feature. Uses 0014
  0017: Snapshot building docs; not required
 
 I've now also committed patch #7 from this series.  My earlier commit
 fulfilled the needs of patches #3, #4, and #5; and somewhat longer ago
 I committed #1.

Thanks!

 I am not entirely convinced of the necessity or
 desirability of patch #6, but as of now I haven't studied the issues
 closely.

Fair enough. It's certainly possible to work around not having it, but
it seems cleaner to introduce the notion of an invalid CommandId like we
have for transaction ids et al.
Allowing 2^32-2 instead of 2^32-1 subtransactions doesn't seem like a
problem to me ;)

 Patch #2 does not seem useful in isolation; it adds new
 regression-testing stuff but doesn't use it anywhere.

Yes. I found it useful to test stuff around making replication
synchronous or such, but while I think we should have a facility like it
in core for both logical and physical replication, I don't think this
patch is ready for prime time due to its busy looping. I've even marked
it as such above ;)
My first idea to properly implement that seems to be to reuse the
syncrep infrastructure but that doesn't look trivial.

 I doubt that any of the remaining patches (#8-#17) can be applied
 separately without understanding the shape of the whole patch set, so
 I think I, or someone else, will need to set aside more time for
 detailed review before proceeding further with this patch set.  I
 suggest that we close out the CommitFest entry for this patch set one
 way or another, as there is no way we're going to get the whole thing
 done under the auspices of CF1.

Generally agreed. The biggest chunk of the code is in #13 anyway...

Some may be applyable independently:

 0010: Log xl_running_xact regularly in the bgwriter: required
Should be useful independently since it can significantly speed up
startup of physical replicas. On many systems checkpoint_timeout
will be set to an hour, which can make the time until a standby gets
consistent quite high, since that will be the first time it sees an
xl_running_xacts record again.

 0011: make fsync_fname() public; required, needs to be in a different file
It isn't in shape for that atm, but it could be applied as an
independent infrastructure patch. And it should be easy enough to
clean it up.

 0012: Relcache support for an Relation's primary key: required
Might actually be a good idea independently as well. E.g. the
materialized key patch could use the information that there's a
candidate key around to avoid a good bit of useless work.


 I'll try to find some more time to spend on this relatively soon, but
 I think this is about as far as I can take this today.

Was pretty helpful already, so ... ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] LDAP: bugfix and deprecated OpenLDAP API

2013-07-23 Thread Albe Laurenz
Magnus Hagander wrote:
 In that case, doesn't this patch break Windows? We no longer do the
 anonymous bind on Windows, since it's now in the #ifdef HAVE_LIBLDAP.

 Don't we need to keep the ldap_simple_bind() call in the Windows case,
 or break it up so the call to ldap_sasl_bind_s() is moved outside the
 #ifdef? At least I can't find anything in the docs that indicate that
 ldap_connect() on Windows would actually call that for us - only the
 other way around?


This patch works for the Windows case, because ldap_connect performs
an anonymous bind, see
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366171%28v=vs.85%29.aspx

  If the call to ldap_connect succeeds, the client is connected
  to the LDAP server as an anonymous user. The session handle
  should be freed with a call to ldap_unbind when it is no longer required.

 I'm going to set this patch as returned with feedback for now, but
 please feel free to comment on above and possibly resubmit if
 necessary before the CF and I'll see if I can deal with it before the
 next CF anyway, as it's a bug fix.

The patch should still be good, but if we keep the deprecated
OpenLDAP API, it might be more consistent to use ldap_simple_bind_s
instead of ldap_sasl_bind_s.

If you agree, I'll change that.
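
For reference, an anonymous simple bind through the non-deprecated OpenLDAP
interface looks roughly like the sketch below (a generic illustration, not an
excerpt from the patch):

/* Generic sketch, not the patch: anonymous simple bind via the
 * non-deprecated OpenLDAP API. */
#include <ldap.h>

static int
anonymous_bind(LDAP *ld)
{
    static char empty[] = "";
    struct berval cred;

    cred.bv_val = empty;
    cred.bv_len = 0;

    /* NULL DN plus empty credentials requests an anonymous simple bind. */
    return ldap_sasl_bind_s(ld, NULL, LDAP_SASL_SIMPLE, &cred,
                            NULL, NULL, NULL);
}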

Yours,
Laurenz Albe



[HACKERS] Performance problem in PLPgSQL

2013-07-23 Thread Marc Cousin
Hi,

I've been trying to diagnose a severe performance regression we've been
having in one of our plpgsql procedures.

The example below is of course extremely simplified, and obviously not
what we are really doing in the database, but it exhibits the slowdown
between 9.1.9 and 9.2.4.

So here is the example:

create table table_test_int (col_01 int);
create table table_test_numeric (col_01 numeric);

CREATE OR REPLACE FUNCTION public.test_insert(nb integer)
 RETURNS text
 LANGUAGE plpgsql
AS $function$
DECLARE
time_start timestamp;
time_stop timestamp;
tmp_numeric numeric;
BEGIN

time_start :=clock_timestamp();
FOR i IN 1..nb LOOP
INSERT INTO table_test_int(col_01) VALUES (i);
END LOOP;
time_stop :=clock_timestamp();
RAISE NOTICE 'time for int:%',time_stop-time_start;



time_start :=clock_timestamp();
FOR i IN 1..nb LOOP
INSERT INTO table_test_numeric(col_01) VALUES (i);
END LOOP;
time_stop :=clock_timestamp();
RAISE NOTICE 'time for numeric:%',time_stop-time_start;



time_start :=clock_timestamp();
FOR i IN 1..nb LOOP
INSERT INTO table_test_numeric(col_01) VALUES (i::numeric);
END LOOP;
time_stop :=clock_timestamp();
RAISE NOTICE 'time for numeric, casted:%',time_stop-time_start;



time_start :=clock_timestamp();
FOR i IN 1..nb LOOP
tmp_numeric:=cast(i as numeric);
INSERT INTO table_test_numeric(col_01) VALUES (tmp_numeric);
END LOOP;
time_stop :=clock_timestamp();
RAISE NOTICE 'time for numeric with tmp variable:%',time_stop-time_start;



RETURN 1;
END;
$function$
;


It just inserts nb records in a loop in 4 different ways:
- Directly into an int field
- Then into a numeric field (that's where we're having problems)
- Then into the same numeric field, but with a cast (it doesn't change a
thing)
- Then with an intermediate temp variable of numeric type (which
solves the problem).


Here are the runtimes (tables were truncated beforehand):

9.1.9:
select test_insert(100);
NOTICE:  time for int:00:00:09.526009
NOTICE:  time for numeric:00:00:10.557126
NOTICE:  time for numeric, casted:00:00:10.821369
NOTICE:  time for numeric with tmp variable:00:00:10.850847


9.2.4:
select test_insert(100);
NOTICE:  time for int:00:00:09.477044
NOTICE:  time for numeric:00:00:24.757032  
NOTICE:  time for numeric, casted:00:00:24.791016  
NOTICE:  time for numeric with tmp variable:00:00:10.89332


I really don't know exactly where the problem comes from… but it's been
hurting one of our functions very badly (it has several of these static
queries with type mismatches). And of course, the problem is not limited to
numeric… text has the exact same problem.

Regards,

Marc




Re: [HACKERS] Proposal: template-ify (binary) extensions

2013-07-23 Thread Markus Wanner
On 07/16/2013 09:14 PM, I wrote:
 But okay, you're saying we *have* and *want* a guarantee that even a
 superuser cannot execute arbitrary native code via libpq (at least in
 default installs w/o extensions).

I stand corrected and have to change my position, again. For the record:

We do not have such a guarantee. Nor does it seem reasonable to want
one. On a default install, it's entirely possible for the superuser to run
arbitrary code via just libpq.

There are various ways to do it, but the simplest one I was shown is:
 - upload a DSO from the client into a large object
 - SELECT lo_export() that LO to a file on the server
 - LOAD it

There are a couple other options, so even if we let LOAD perform
permission checks (as I proposed before in this thread), the superuser
can still fiddle with function definitions. To the point that it doesn't
seem reasonable to try to protect against that.

Thus, the argument against the original proposal based on security
grounds is moot. Put another way: There already are a couple of
backdoors a superuser can use. By default. Or with plpgsql removed.

Thanks to Dimitri and Andres for patiently explaining and providing
examples.

Regards

Markus Wanner




Re: [HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read

2013-07-23 Thread Greg Smith
On 7/23/13 2:16 AM, Satoshi Nagayasu wrote:
 I've been working on new pgstattuple function to allow
 block sampling [1] in order to reduce block reads while
 scanning a table. A PoC patch is attached.

Take a look at all of the messages linked in
https://commitfest.postgresql.org/action/patch_view?id=778

Jaime and I tried to do what you're working on then, including a random
block sampling mechanism modeled on the stats_target mechanism.  We
didn't do that as part of pgstattuple though, which was a mistake.

Noah created some test cases as part of his thorough review for which we
were not computing the correct results.  Getting the results correct for all
of the various types of PostgreSQL tables and indexes ended up being
much harder than the sampling part.  See
http://www.postgresql.org/message-id/20120222052747.ge8...@tornado.leadboat.com
in particular for that.

 This new function, pgstattuple2(), samples only 3,000 blocks
 (which accounts 24MB) from the table randomly, and estimates
 several parameters of the entire table.

There should be an input parameter to the function for how much sampling
to do, and if it's possible to make its scale look like ANALYZE's,
that's helpful too.

I have a project for this summer that includes reviving this topic and
making sure it works on some real-world systems.  If you want to work on
this too, I can easily combine that project into what you're doing.

-- 
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




[HACKERS] pg_upgrade across more than two neighboring major releases

2013-07-23 Thread Pavel Raiskup
Hello all,

I don't know much about the pg_upgrade implementation, so I'm asking here
before I start trying the process once again with debugging and deeper
inspection of the code.

I tried to set up our postgresql RPM to package pg_upgrade in version
8.4.13, to automate the post-rpm-upgrade update of postgresql's data to
9.2.4.  I was unable to reach a successful result (I'm getting weird errors).

Is pg_upgrade supposed to work even across multiple major releases?  (I know
that the README.rpm-dist mentions just a single major-release step, but that
could simply reflect the particular pg_upgrade that is packaged.)

I can imagine that it may be difficult to handle everything even between two
consecutive major releases - so if this is possible, I would consider it
heartwarming :).

Thanks for your answer,
Pavel





[HACKERS] Suggestion for concurrent index creation using a single full scan operation

2013-07-23 Thread Tim Kane
Hi all,

I haven't given this a lot of thought, but it struck me that when
rebuilding tables (be it for a restore process, or some other operational
activity) - there is more often than not a need to build an index or two,
sometimes many indexes, against the same relation.

It strikes me that in order to build just one index, we probably need to
perform a full table scan (in a lot of cases).   If we are building
multiple indexes sequentially against that same table, then we're probably
performing multiple sequential scans in succession, once for each index.

Could we architect a mechanism that allowed multiple index creation
statements to execute concurrently, with all of their inputs fed directly
from a single sequential scan against the full relation?

From a language construct point of view, this may not be trivial to
implement for raw/interactive SQL - but possibly this is a candidate for
the custom format restore?

I presume this would substantially increase the memory overhead required to
build those indexes, though the performance gains may be advantageous.

Feel free to shoot holes through this :)

Apologies in advance if this is not the correct forum for suggestions..


Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2013-07-23 Thread Amit Kapila
On Tuesday, July 23, 2013 12:02 AM Greg Smith wrote:
 The v3 patch applies perfectly here now.  Attached is a spreadsheet
 with test results from two platforms, a Mac laptop and a Linux server.
 I used systems with high disk speed because that seemed like a worst
 case for this improvement.  The actual improvement for shrinking WAL
 should be even better on a system with slower disks.

You are absolutely right.
Mimicking it on our system by putting the database on a RAM filesystem shows
similar results.
 
 There are enough problems with the hundred tiny fields results that I
 think this not quite ready for commit yet.  More comments on that
 below.
   This seems close though, close enough that I am planning to follow up
 to see if the slow disk results are better.

Thanks for going the extra mile to try slower disks.

 Reviewing the wal-update-testsuite.sh test program, I think the only
 case missing that would be useful to add is ten tiny fields, one
 changed.  I think that one is interesting to highlight because it's
 what benchmark programs like pgbench do very often.
 The GUC added by the program looks like this:
 
 postgres=# show wal_update_compression_ratio ;
   wal_update_compression_ratio
 --
   25
 
 Is possible to add a setting here that disables the feature altogether?

Yes, it can be done in the following 2 ways:
1. Provide a new configuration parameter (wal_update_compression) to turn
this feature on/off.
2. At the table level, the user can be given an option to configure it.

   That always makes it easier to consider a commit, knowing people can
 roll back the change if it makes performance worse.  That would make
 performance testing easier too.  wal-update-testsuit.sh takes as long
 as
 13 minutes, it's long enough that I'd like the easier to automate
 comparison GUC disabling adds.  If that's not practical to do given the
 intrusiveness of the code, it's not really necessary.  I haven't looked
 at the change enough to be sure how hard this is.
 
 On the Mac, the only case that seems to have a slowdown now is hundred
 tiny fields, half nulled.  It would be nice to understand just what is
 going on with that one.  I got some ugly results in two short fields,
 no change too, along with a couple of other weird results, but I think
 those were testing procedure issues that can be ignored.  The pgbench
 throttle work I did recently highlights that I can't really make a Mac
 quiet/consistent for benchmarking very well.  Note that I ran all of
 the Mac tests with assertions on, to try and catch platform specific
 bugs.
 The Linux ones used the default build parameters.
 
 On Linux hundred tiny fields, half nulled was also by far the worst
 performing one, with a 30% increase in duration despite the 14% drop
 in WAL.  Exactly what's going on there really needs to be investigated
 before this seems safe to commit.  All of the hundred tiny fields
 cases seem pretty bad on Linux, with the rest of them running about a
 11% duration increase.

The main benefit of this patch is to reduce WAL volume, which improves I/O
time, but I think on faster I/O systems the calculation needed to reduce WAL
has its own overhead. I will check how to optimize that calculation, but I
think this feature should be considered with a configuration knob, as it can
improve many cases.
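
To make that overhead concrete: even a delta scheme much simpler than
whatever encoding the patch uses (the prefix/suffix comparison below is
purely illustrative) still has to walk both tuple versions on every update
before anything is written:

/* Purely illustrative delta computation; NOT the patch's encoding. */
#include <stddef.h>

typedef struct
{
    size_t      prefix_len;     /* unchanged bytes at the start */
    size_t      suffix_len;     /* unchanged bytes at the end */
    /* only new_data[prefix_len .. new_len - suffix_len) must be logged */
} TupleDelta;

static TupleDelta
compute_delta(const char *old_data, size_t old_len,
              const char *new_data, size_t new_len)
{
    TupleDelta  delta = {0, 0};
    size_t      limit = (old_len < new_len) ? old_len : new_len;

    while (delta.prefix_len < limit &&
           old_data[delta.prefix_len] == new_data[delta.prefix_len])
        delta.prefix_len++;

    while (delta.suffix_len < limit - delta.prefix_len &&
           old_data[old_len - 1 - delta.suffix_len] ==
           new_data[new_len - 1 - delta.suffix_len])
        delta.suffix_len++;

    /* This per-update scan over both versions is the CPU cost in question. */
    return delta;
}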

With Regards,
Amit Kapila.





Re: [HACKERS] improve Chinese locale performance

2013-07-23 Thread Robert Haas
On Mon, Jul 22, 2013 at 12:49 PM, Greg Stark st...@mit.edu wrote:
 On Mon, Jul 22, 2013 at 12:50 PM, Peter Eisentraut pete...@gmx.net wrote:
 I think part of the problem is that we call strcoll for each comparison,
 instead of doing strxfrm once for each datum and then just strcmp for
 each comparison.  That is effectively equivalent to what the proposal
 implements.

 Fwiw I used to be a big proponent of using strxfrm. But upon further
 analysis I realized it was a really difficult tradeoff. strxfrm saves
 potentially a lot of cpu cost but at the expense of expanding the size
 of the sort key. If the sort spills to disk or even if it's just
 memory bandwidth limited it might actually be slower than doing the
 additional cpu work of calling strcoll.

 It's hard to see how to decide in advance which way will be faster. I
 suspect strxfrm is still the better bet, especially for complex large
 character set based locales like Chinese. strcoll might still win by a
 large margin on simple mostly-ascii character sets.

The storage blow-up on systems I've tested is on the order of 10x.
That's possibly fine if the data still fits in memory, but it pretty
much sucks if it makes your sort spill to disk, which seems like a
likely outcome in many cases.

But I don't have much trouble believing the OP's contention that he's
coded a locale-specific version that is faster than the version that
ships with the OS.  On glibc, for example, we copy the strings we want
to compare, so that we can add a terminating zero byte.  The first
thing that glibc does is call strlen().  That's pretty horrible, and
I'm not at all sure the horror ends there, either.
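
As a rough sketch of the two strategies being weighed (generic C, not
PostgreSQL code; error handling omitted): strcoll() is paid on every
comparison, while strxfrm() is paid once per datum and yields keys that can
be compared with plain strcmp(), at the cost of much larger keys:

/* Rough sketch of the two approaches, not PostgreSQL code. */
#include <stdlib.h>
#include <string.h>

/* Approach 1: locale-aware comparison on every compare. */
static int
cmp_with_strcoll(const char *a, const char *b)
{
    return strcoll(a, b);
}

/*
 * Approach 2: transform each datum once; afterwards plain strcmp() on the
 * keys gives the locale order.  The keys can be several times larger than
 * the original strings, which is the memory/spill tradeoff discussed above.
 */
static char *
make_sort_key(const char *s)
{
    size_t      len = strxfrm(NULL, s, 0);      /* required key length */
    char       *key = malloc(len + 1);

    if (key != NULL)
        strxfrm(key, s, len + 1);
    return key;
}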

It would be great to have support for user-defined collations in
PostgreSQL.  Let the user provide their own comparison function and
whatever else is needed and use that instead of the OS-specific
support.  Aside from the performance advantages, one could even create
collations that have the same names and orderings on all platforms we
support.  Our support team has gotten more than one inquiry of the
form "what's the equivalent of Linux collation XYZ on Windows?" - and
telling them that there is no exact equivalent is not the answer they
want to hear.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [9.4 CF 1] And then there were 5

2013-07-23 Thread Stephen Frost
Greg,

First, thanks for helping out with the CF.

* Greg Smith (g...@2ndquadrant.com) wrote:
 That leaves only 1 patch where I have no idea who will commit it:
 
   access to calls stack from GET DIAGNOSTICS statement

I'll take a look at this.

Thanks again,

Stephen




[HACKERS] make --silent

2013-07-23 Thread Peter Eisentraut
I have noticed that some people post examples using make --silent (-s).
I found this actually kind of neat to use from time to time, because
then you only see output if you have warnings or errors.  But we get
some extra output that doesn't quite fit:

Writing postgres.bki
Writing schemapg.h
Writing postgres.description
Writing postgres.shdescription
Writing fmgroids.h
Writing fmgrtab.c
Processing HTML.index...
2406 entries loaded...
0 entries ignored...
Done.
Note: Writing man3/SPI_connect.3
Note: Writing man3/SPI_finish.3
Note: Writing man3/SPI_push.3
...
Note: Writing man1/pg_test_timing.1
Note: Writing man1/pg_upgrade.1
Note: Writing man1/pg_xlogdump.1
In file included from gram.y:13612:0:
scan.c: In function ‘yy_try_NUL_trans’:
scan.c:10181:23: warning: unused variable ‘yyg’ [-Wunused-variable]
PostgreSQL, contrib, and documentation successfully made. Ready to install.

Is there interest in silencing this extra output (not the compiler
warning) when make -s is used?  The documentation tools have options to
do so, the rest are our own scripts.






Re: [HACKERS] proposal - psql - show longest tables

2013-07-23 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Even if we thought the functionality was worth the trouble, which I
 continue to doubt, this particular syntax proposal is a disaster.

Agreed.  While there might be things worthwhile to add to psql's
backslash commands, this isn't one of those.

Thanks,

Stephen




Re: [HACKERS] pg_upgrade across more than two neighboring major releases

2013-07-23 Thread Michael Paquier
On Tue, Jul 23, 2013 at 8:53 PM, Pavel Raiskup prais...@redhat.com wrote:

 Is pg_upgrade supposed to work even across multiple major releases?  (I
 know
 that the README.rpm-dist is mentioning just one major release step, but it
 could be bound just to actual pg_upgrade which is packaged..)

Yes, you should be able to do a one-step upgrade as long as you update from
at least an 8.3 server. There are some restrictions depending on the server
version that is going to be upgraded, though; you can have a look at the
documentation for more details here:
http://www.postgresql.org/docs/9.2/static/pgupgrade.html

Thanks,
-- 
Michael


Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread Dean Rasheed
On 23 July 2013 06:10, Tom Lane t...@sss.pgh.pa.us wrote:
 Andrew Gierth and...@tao11.riddles.org.uk writes:
 I must admit to finding all of this criticism of unread code a bit
 bizarre.

 If you yourself are still finding bugs in the code as of this afternoon,
 onlookers could be forgiven for doubting that the code is quite as
 readable or ready to commit as all that.


I had another look at this --- the bug fix looks reasonable, and
includes a sensible set of additional regression tests.

This was not a bug that implies anything fundamentally wrong with the
patch. Rather it was just a fairly trivial easy-to-overlook bug of
omission --- I certainly overlooked it in my previous reviews (sorry)
and at least 3 more experienced hackers than me overlooked it during
detailed code inspection.

I don't think that really reflects negatively on the quality of the
patch, or the approach taken, which I still think is good. That's also
backed up by the fact that Greg isn't able to find much he wants to
change.

I also don't see much that needs changing in the patch, except
possibly in the area where Greg expressed concerns over the comments
and code clarity. I had similar concerns during my initial review, so I
tend to agree that perhaps it's worth adding a separate boolean
has_ordinality flag to FunctionScanState just to improve code
readability. FWIW, I would probably have extended FunctionScanState
like so:

/* 
 *   FunctionScanState information
 *
 *  Function nodes are used to scan the results of a
 *  function appearing in FROM (typically a function returning set).
 *
 *  eflags  node's capability flags
 *  tupdesc node's expected return tuple description
 *  tuplestorestate private state of tuplestore.c
 *  funcexprstate for function expression being evaluated
 *  has_ordinality  function scan WITH ORDINALITY?
 *  func_tupdescfor WITH ORDINALITY, the raw function tuple
 *  description, without the added ordinality column
 *  func_slot   for WITH ORDINALITY, a slot for the raw function
 *  return tuples
 *  ordinal for WITH ORDINALITY, the ordinality of the return
 *  tuple
 * 
 */
typedef struct FunctionScanState
{
ScanState   ss; /* its first field is NodeTag */
int eflags;
TupleDesc   tupdesc;
Tuplestorestate *tuplestorestate;
ExprState  *funcexpr;
boolhas_ordinality;
/* these fields are used for a function scan WITH ORDINALITY */
TupleDesc   func_tupdesc;
TupleTableSlot *func_slot;
int64   ordinal;
} FunctionScanState;


Regards,
Dean




Re: [HACKERS] make --silent

2013-07-23 Thread Andres Freund
On 2013-07-23 08:38:03 -0400, Peter Eisentraut wrote:
 I have noticed that some people post examples using make --silent (-s).
 I found this actually kind of neat to use from time to time, because
 then you only see output if you have warnings or errors.  But we get
 some extra output that doesn't quite fit:

I am one of the people doing that regularly. I think it's fantastic
because it's so much harder to miss errors that way... Also it shortens
the buildtime measurably.

 Writing postgres.bki
 Writing schemapg.h
 Writing postgres.description
 Writing postgres.shdescription
 Writing fmgroids.h
 Writing fmgrtab.c

I personally don't feel the need for those to go away. They only appear
when doing a clean rebuild and/or when changing something
significant. Also it's just a couple of lines.

 Processing HTML.index...
 2406 entries loaded...
 0 entries ignored...
 Done.
 Note: Writing man3/SPI_connect.3
 Note: Writing man3/SPI_finish.3
 Note: Writing man3/SPI_push.3
 ...
 Note: Writing man1/pg_test_timing.1
 Note: Writing man1/pg_upgrade.1
 Note: Writing man1/pg_xlogdump.1

Those should imo be silenced. It's quite a bit of output, and especially
during a parallel build they tend to hide more important output.

 In file included from gram.y:13612:0:
 scan.c: In function ‘yy_try_NUL_trans’:
 scan.c:10181:23: warning: unused variable ‘yyg’ [-Wunused-variable]
 PostgreSQL, contrib, and documentation successfully made. Ready to install.

FWIW, I've patched debian's flex just to get rid of this ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Add more regression tests for ALTER OPERATOR FAMILY.. ADD / DROP

2013-07-23 Thread Robert Haas
On Thu, May 23, 2013 at 6:52 PM, Robins rob...@pobox.com wrote:
 Please find attached a patch to take code-coverage of ALTER OPERATOR
 FAMILY.. ADD / DROP (src/backend/commands/opclasscmds.c) from 50% to 87%.

 Any and all feedback is welcome.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_upgrade across more than two neighboring major releases

2013-07-23 Thread Bruce Momjian
On Tue, Jul 23, 2013 at 09:48:16PM +0900, Michael Paquier wrote:
 
 
 
 On Tue, Jul 23, 2013 at 8:53 PM, Pavel Raiskup prais...@redhat.com wrote:
 
 Is pg_upgrade supposed to work even across multiple major releases?  (I
 know
 that the README.rpm-dist is mentioning just one major release step, but it
 could be bound just to actual pg_upgrade which is packaged..)
 
 Yes you should be able to do a one-step upgrade as long as you update from at
 least a 8.3 server. There are some restrictions depending on the server 
 version
 that is going to be upgraded though, you can have a look at the documentation
 for more details here:
 http://www.postgresql.org/docs/9.2/static/pgupgrade.html

Yes, you can easily skip many major versions in one pg_upgrade run.
Please show us the errors you see.

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] [9.4 CF 1] And then there were 5

2013-07-23 Thread Robert Haas
On Tue, Jul 23, 2013 at 12:04 AM, Greg Smith g...@2ndquadrant.com wrote:
 A further look at recently ALTER OPERATOR FAMILY changes shows that Robert
 has been working on those quite a bit.  I'm putting him down as the
 committer on this last one.

I missed that one, now committed.  But just BTW, you kind of messed up
that CommitFest entry.  The link you added as a new version of the
patch was actually to a different patch from the same flock.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Suggestion for concurrent index creation using a single full scan operation

2013-07-23 Thread Greg Stark
We already do this in pg_restore by starting multiple worker processes.
Those processes should get the benefit of synchronised sequential scans.

Given the way the index API works, it wouldn't really be hard to start
multiple parallel index builds. I'm not sure how well the pg_restore approach
works, and sometimes the goal isn't to maximise speed, so starting
multiple processes isn't always ideal.  It might sometimes be interesting
to be able to do it explicitly in a single process.

-- 
greg
On Jul 23, 2013 1:06 PM, Tim Kane tim.k...@gmail.com wrote:

 Hi all,

 I haven't given this a lot of thought, but it struck me that when
 rebuilding tables (be it for a restore process, or some other operational
 activity) - there is more often than not a need to build an index or two,
 sometimes many indexes, against the same relation.

 It strikes me that in order to build just one index, we probably need to
 perform a full table scan (in a lot of cases).   If we are building
 multiple indexes sequentially against that same table, then we're probably
 performing multiple sequential scans in succession, once for each index.

 Could we architect a mechanism that allowed multiple index creation
 statements to execute concurrently, with all of their inputs fed directly
 from a single sequential scan against the full relation?

 From a language construct point of view, this may not be trivial to
 implement for raw/interactive SQL - but possibly this is a candidate for
 the custom format restore?

 I presume this would substantially increase the memory overhead required
 to build those indexes, though the performance gains may be advantageous.

 Feel free to shoot holes through this :)

 Apologies in advance if this is not the correct forum for suggestions..




Re: [HACKERS] make --silent

2013-07-23 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2013-07-23 08:38:03 -0400, Peter Eisentraut wrote:
  Writing postgres.bki
  Writing schemapg.h
  Writing postgres.description
  Writing postgres.shdescription
  Writing fmgroids.h
  Writing fmgrtab.c
 
 I personally don't feel the need for those to go away. They only appear
 when doing a clean rebuild and/or when changing something
 significant. Also it's just a couple of lines.

For my 2c, I'd be for getting rid of the above.  I agree it's not a huge
deal, but I'm also a big fan of "don't spam me when things go right".
Is there any particular reason we need to see them with every (clean)
build?

  Processing HTML.index...
  2406 entries loaded...
  0 entries ignored...
  Done.
  Note: Writing man3/SPI_connect.3
  Note: Writing man3/SPI_finish.3
  Note: Writing man3/SPI_push.3
  ...
  Note: Writing man1/pg_test_timing.1
  Note: Writing man1/pg_upgrade.1
  Note: Writing man1/pg_xlogdump.1
 
 Those should imo be silenced. It's quite a bit of output, and especially
 during a parallel build they tend to hide more important output.

Agreed.

  In file included from gram.y:13612:0:
  scan.c: In function ‘yy_try_NUL_trans’:
  scan.c:10181:23: warning: unused variable ‘yyg’ [-Wunused-variable]
  PostgreSQL, contrib, and documentation successfully made. Ready to install.
 
 FWIW, I've patched debian's flex just to get rid of this ;)

Any idea if that might be accepted upstream or perhaps by Debian..?  It
would surely be nice to make that error go away..

Thanks,

Stephen




Re: [HACKERS] [9.4 CF 1] And then there were 5

2013-07-23 Thread Greg Smith

On 7/23/13 9:06 AM, Robert Haas wrote:

But just BTW, you kind of messed up that CommitFest entry.  The link you added 
as a new version of the
patch was actually to a different patch from the same flock.


Fixed now, since I find myself looking back through history on 
submissions often enough to care about it being right.  My eyes were 
starting to glaze over by the end of staring at this flock.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] make --silent

2013-07-23 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-07-23 08:38:03 -0400, Peter Eisentraut wrote:
 I have noticed that some people post examples using make --silent (-s).
 I found this actually kind of neat to use from time to time, because
 then you only see output if you have warnings or errors.  But we get
 some extra output that doesn't quite fit:

 Writing postgres.bki
 Writing schemapg.h
 Writing postgres.description
 Writing postgres.shdescription
 Writing fmgroids.h
 Writing fmgrtab.c

 I personally don't feel the need for those to go away.

I agree --- these let you know that something happened that does not
happen every build, so they're kind of useful.  I concur that the noise
from doc building is just noise, though.

 In file included from gram.y:13612:0:
 scan.c: In function ‘yy_try_NUL_trans’:
 scan.c:10181:23: warning: unused variable ‘yyg’ [-Wunused-variable]
 PostgreSQL, contrib, and documentation successfully made. Ready to install.

 FWIW, I've patched debian's flex just to get rid of this ;)

I think it's fixed officially as of flex 2.5.35.

regards, tom lane




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2013-07-23 Thread Amit Kapila
On Tuesday, July 23, 2013 12:27 AM Andres Freund wrote:
 On 2013-07-19 10:40:01 +0530, Hari Babu wrote:
 
  On Friday, July 19, 2013 4:11 AM Greg Smith wrote:
  On 7/9/13 12:09 AM, Amit Kapila wrote:
  I think the first thing to verify is whether the results posted
 can be validated in some other environment setup by another person.
  The testcase used is posted at below link:
  http://www.postgresql.org/message-
 id/51366323.8070...@vmware.com
 
  That seems easy enough to do here, Heikki's test script is
 excellent.
  The latest patch Hari posted on July 2 has one hunk that doesn't
 apply
  anymore now.
 
  The Head code change from Heikki is correct.
  During the patch rebase to latest PG LZ optimization code, the above
 code change is missed.
 
  Apart from the above changed some more changes are done in the patch,
 those are.
 
 FWIW I don't like this approach very much:
 
 * I'd be very surprised if this doesn't make WAL replay of update heavy
   workloads slower by at least factor of 2.

Yes, if you just consider the cost of replay, but it involves other
operations as well: for the standby case, transfer of WAL, write of WAL,
read of WAL, and then apply.
Most of those operations will benefit from the reduced WAL size,
except apply, where you need to decode.
 
 * It makes data recovery from WAL *noticeably* harder since data
   corruption now is carried forwards and you need the old data to
 decode
   new data
  
   This is one of the reasons why this optimization is done only when the
new row goes on the same page.

 * It makes changeset extraction either more expensive or it would have
   to be disabled there.

I think, if there is any such implication, we can probably have an
option to disable it.

 I think my primary issue is that philosophically/architecturally I am
 of
 the opinion that a wal record should make sense of it's own without
 depending on heap data. And this patch looses that.

Is the main worry about corruption getting propagated?

With Regards,
Amit Kapila.





Re: [HACKERS] Suggestion for concurrent index creation using a single full scan operation

2013-07-23 Thread Andres Freund
On 2013-07-23 14:17:13 +0100, Greg Stark wrote:
 We already do this in pg_restore by starting multiple worker processes.
 Those processes should get the benefit of synchronised sequential scans.
 
 The way the api for indexes works y wouldn't really be hard to start
 multiple parallel index builds. I'm not sure how well the pg_restore thing
 works and sometimes the goal isn't to maximise the speed so starting
 multiple processes isn't always ideal.  It might sometimes be interesting
 to be able to do it explicit in a single process.

That's true for normal index modifications, but for ambuild (the initial
index creation function) much less so. That's pretty much a black box
from the outside. It's possible to change that, but it's certainly not
trivial.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] make --silent

2013-07-23 Thread Andres Freund
On 2013-07-23 09:21:54 -0400, Stephen Frost wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
  On 2013-07-23 08:38:03 -0400, Peter Eisentraut wrote:
   Writing postgres.bki
   Writing schemapg.h
   Writing postgres.description
   Writing postgres.shdescription
   Writing fmgroids.h
   Writing fmgrtab.c
  
  I personally don't feel the need for those to go away. They only appear
  when doing a clean rebuild and/or when changing something
  significant. Also it's just a couple of lines.
 
 For my 2c, I'd be for getting rid of the above.  I agree it's not a huge
 deal, but I'm also a big fan of don't spam me when things go right.
 Is there any particular reason we need to see them with every (clean)
 build..?

I don't have a big problem with getting rid of them either. I find it
moderately reassuring to see them triggered if I change something
relevant, because I don't fully trust the dependency mechanism around
them, but that's about it.

   In file included from gram.y:13612:0:
   scan.c: In function ‘yy_try_NUL_trans’:
   scan.c:10181:23: warning: unused variable ‘yyg’ [-Wunused-variable]
   PostgreSQL, contrib, and documentation successfully made. Ready to 
   install.
  
  FWIW, I've patched debian's flex just to get rid of this ;)
 
 Any idea if that might be accepted upstream or perhaps by Debian..?  It
 would surely be nice to make that error go away..

I think Tom ushered a patch upstream; debian just hasn't updated in a
good while, even in unstable. I looked at the package's bug page some
time back and it didn't look very active, so I didn't think further about
proposing a debian-only backport of that patch.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2013-07-23 Thread Andres Freund
On 2013-07-23 18:59:11 +0530, Amit Kapila wrote:
  * I'd be very surprised if this doesn't make WAL replay of update heavy
workloads slower by at least factor of 2.
 
 Yes, if you just consider the cost of replay, but it involves other
 operations as well: for the standby case, transfer of WAL, write of WAL,
 read from WAL and then apply.
 Most of those operations will benefit from the reduced WAL size, except
 apply, where you need to decode.

I still think it's rather unlikely that they offset those. I've seen wal
replay be a major bottleneck more than once...

  * It makes data recovery from WAL *noticeably* harder since data
    corruption now is carried forwards and you need the old data to
    decode new data
 
 This is one of the reasons why this optimization is done only when the
 new row goes in the same page.

That doesn't help all that much. It somewhat eases recovering data if
full_page_writes are on, but it's really hard to stitch together all
changes if the corruption occurred within a 1h-long checkpoint...

  * It makes changeset extraction either more expensive or it would have
to be disabled there.

 I think, if there is any such implication, we can probably have the
 option to disable it.

That can just be done when wal_level = logical, that's not the
problem. It's certainly not without precedent that we have wal_level
dependent optimizations.

  I think my primary issue is that philosophically/architecturally I am of
  the opinion that a wal record should make sense on its own without
  depending on heap data. And this patch loses that.
 
 Is the main worry about corruption getting propagated?

Not really. It feels wrong to me architecturally. That's subjective, I
know.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] improve Chinese locale performance

2013-07-23 Thread Craig Ringer
(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent 
collations?

I'd be interested in seeing the test data for this performance report, partly 
as I'd like to see how ICU collations would compare when ICU is crudely hacked 
into place for testing.


Re: [HACKERS] make --silent

2013-07-23 Thread Andres Freund
On 2013-07-23 09:27:18 -0400, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2013-07-23 08:38:03 -0400, Peter Eisentraut wrote:
  In file included from gram.y:13612:0:
  scan.c: In function ‘yy_try_NUL_trans’:
  scan.c:10181:23: warning: unused variable ‘yyg’ [-Wunused-variable]
  PostgreSQL, contrib, and documentation successfully made. Ready to install.
 
  FWIW, I've patched debian's flex just to get rid of this ;)
 
 I think it's fixed officially as of flex 2.5.35.

Unfortunately (because debian has 2.5.35) it seems the fix is in 2.5.36; the
former was released in 2008 :(

It's 62617a3183abb2945beedc0b0ff106182af4f266 in flex's repository if
somebody wants to submit it to debian.

The repository is at git://git.code.sf.net/p/flex/flex - their website
stated a different url the last time I looked.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] Performance problem in PLPgSQL

2013-07-23 Thread Tom Lane
Marc Cousin cousinm...@gmail.com writes:
 The example below is of course extremely simplified, and obviously not
 what we are really doing in the database, but it exhibits the slowdown
 between 9.1.9 and 9.2.4.

Hm.  Some experimentation shows that the plan cache is failing to switch
to a generic plan, but I'm not sure why the cast would have anything to
do with that ...

regards, tom lane




Re: [HACKERS] Bloom Filter lookup for hash joins

2013-07-23 Thread Greg Stark
This exact idea was discussed a while back. I think it was even
implemented.

The problem Tom raised at the time is that the memory usage of the bloom
filter implies smaller or less efficient hash table. It's difficult to
determine whether you're coming out ahead or behind.

I think it should be possible to figure this out though. Bloom filters have
well-understood math for the error rate given the size and number of hash
functions (and please read up on it and implement the optimal combination
for the target error rate, not just a WAG) and it should be possible to
determine the resulting advantage.
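
For reference, the textbook sizing formulas look roughly like this (a
sketch of the standard math, not code from any patch):

#include <math.h>

/*
 * Standard Bloom filter sizing: for n keys and a target false-positive
 * rate p, the optimal bit count m and hash-function count k are
 *     m = -n * ln(p) / (ln 2)^2        k = (m / n) * ln 2
 */
static void
bloom_sizing(double n_keys, double fp_rate, double *m_bits, int *k_hashes)
{
    *m_bits = -n_keys * log(fp_rate) / (M_LN2 * M_LN2);
    *k_hashes = (int) round((*m_bits / n_keys) * M_LN2);
}

For example, a million keys at a 1% false-positive target works out to
roughly 9.6 bits per key and 7 hash functions.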

Measuring the cost of the memory usage is harder but some empirical results
should give some idea.  I expect the result to be wildly uneven one way or
the other so hopefully it doesn't matter if it's not perfect. If it's close
then it's probably not worth doing anyway.

I would suggest looking up the archives of the previous discussion. You
might find the patch still usable. IIRC there have been no major changes
to the hash join code.

-- 
greg
On Jun 26, 2013 7:31 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Wed, Jun 26, 2013 at 5:01 AM, Atri Sharma atri.j...@gmail.com wrote:


  The problem here is that if the hash table is in memory, doing a hash
  table lookup directly is likely to be cheaper than a bloom filter
  lookup, even if the bloom filter fits into the processor cache and the
  hash table doesn't (10 last level cache hits is slower than one cache
  miss). Bloom filter will help when its not feasible to use an actual
  hash table (lots of large keys), the real lookup is slow (e.g. an
  index lookup), you are doing a lot of lookups to amortize the
  construction cost and the condition is expected to be selective (most
  lookups give a negative). There might be some data warehouse types of
  queries where it would help, but it seems like an awfully narrow use
  case with a potential for performance regressions when the planner has
  a bad estimate.

 Ok, sounds good. Can't we use bloom filters for the case where the hash
 table doesn't fit in memory? Specifically, when reading from disk is
 inevitable for accessing the hash table, we can use bloom filters for
 deciding which keys to actually read from disk.


 I don't think that sounds all that promising.  When the hash table does
 not fit in memory, it is either partitioned into multiple passes, each of
 which do fit in memory, or it chooses a different plan altogether.  Do we
 know under what conditions a Bloom filter would be superior to those
 options, and could we reliably detect those conditions?

 Cheers,

 Jeff



Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2013-07-23 Thread Amit Kapila
On Tuesday, July 23, 2013 7:06 PM Andres Freund wrote:
 On 2013-07-23 18:59:11 +0530, Amit Kapila wrote:
   * I'd be very surprised if this doesn't make WAL replay of update heavy
     workloads slower by at least factor of 2.
 
  Yes, if you just consider the cost of replay, but it involves other
  operations as well: for the standby case, transfer of WAL, write of WAL,
  read from WAL and then apply.
  Most of those operations will benefit from the reduced WAL size, except
  apply, where you need to decode.
 
 I still think it's rather unlikely that they offset those. I've seen wal
 replay be a major bottleneck more than once...
 
   * It makes data recovery from WAL *noticeably* harder since data
     corruption now is carried forwards and you need the old data to
     decode new data
 
  This is one of the reasons why this optimization is done only when the
  new row goes in the same page.
 
 That doesn't help all that much. It somewhat eases recovering data if
 full_page_writes are on, but it's really hard to stitch together all
 changes if the corruption occurred within a 1h-long checkpoint...
 
I think once a record is corrupted on a page, the user has to reconstruct
that page; it might be difficult to reconstruct just a record. And this
optimization will not carry forward any corruption from one page to
another.
 
   * It makes changeset extraction either more expensive or it would have
     to be disabled there.
 
  I think, if there is any such implication, we can probably have the
  option to disable it.
 
 That can just be done when wal_level = logical, that's not the
 problem. It's certainly not without precedent that we have wal_level
 dependent optimizations.


With Regards,
Amit Kapila.





Re: [HACKERS] improve Chinese locale performance

2013-07-23 Thread Robert Haas
On Tue, Jul 23, 2013 at 9:42 AM, Craig Ringer cr...@2ndquadrant.com wrote:
 (Replying on phone, please forgive bad quoting)

 Isn't this pretty much what adopting ICU is supposed to give us? 
 OS-independent collations?

Yes.

 I'd be interested in seeing the test data for this performance report, partly 
 as I'd like to see how ICU collations would compare when ICU is crudely 
 hacked into place for testing.

I pretty much lost interest in ICU upon reading that they use UTF-16
as their internal format.

http://userguide.icu-project.org/strings#TOC-Strings-in-ICU

What that would mean for us is that instead of copying the input
strings into a temporary buffer and passing the buffer to strcoll(),
we'd need to convert them to ICU's representation (which means writing
twice as many bytes as the length of the input string in cases where
the input string is mostly single-byte characters) and then call ICU's
strcoll() equivalent.  I agree that it might be worth testing, but I
can't work up much optimism.  It seems to me that something that
operates directly on the server encoding could run a whole lot faster.
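
For illustration, the conversion dance would look roughly like this (a
sketch against ICU's C API, ucol_strcoll() and u_strFromUTF8(); buffer
sizing and error handling are hand-waved):

#include <unicode/ucol.h>
#include <unicode/ustring.h>

/* Compare two UTF-8 strings by first widening both to ICU's UTF-16 form. */
static int
icu_compare_utf8(UCollator *coll, const char *a, const char *b)
{
    UChar       ua[1024], ub[1024];     /* illustrative fixed-size buffers */
    int32_t     ua_len, ub_len;
    UErrorCode  err = U_ZERO_ERROR;

    u_strFromUTF8(ua, 1024, &ua_len, a, -1, &err);
    u_strFromUTF8(ub, 1024, &ub_len, b, -1, &err);
    return ucol_strcoll(coll, ua, ua_len, ub, ub_len);
}

Every comparison pays for two UTF-8 to UTF-16 conversions before ICU even
starts comparing, which is exactly the extra copying described above.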

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [9.4 CF 1] And then there were 5

2013-07-23 Thread Robert Haas
On Tue, Jul 23, 2013 at 9:22 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 7/23/13 9:06 AM, Robert Haas wrote:

 But just BTW, you kind of messed up that CommitFest entry.  The link you
 added as a new version of the
 patch was actually to a different patch from the same flock.

 Fixed now, since I find myself looking back through history on submissions
 often enough to care about it being right.  My eyes were starting to glaze
 over by the end of staring at this flock.

Cool.

I'm pretty sure that's the correct collective noun for patches.  Right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-23 Thread Robert Haas
On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote:
 Recently I've been dismissing a lot of suggested changes to checkpoint fsync
 timing without suggesting an alternative.  I have a simple one in mind that
 captures the biggest problem I see:  that the number of backend and
 checkpoint writes to a file are not connected at all.

 We know that a 1GB relation segment can take a really long time to write
 out.  That could include up to 128 changed 8K pages, and we allow all of
 them to get dirty before any are forced to disk with fsync.

By my count, it can include up to 131,072 changed 8K pages.
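(That's just arithmetic: a 1GB segment is 2^30 bytes, and 2^30 / 2^13 =
2^17 = 131,072 pages of 8K.)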

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] getting rid of SnapshotNow

2013-07-23 Thread Robert Haas
On Thu, Jul 18, 2013 at 8:46 AM, Robert Haas robertmh...@gmail.com wrote:
 There seems to be a consensus that we should try to get rid of
 SnapshotNow entirely now that we have MVCC catalog scans, so I'm
 attaching two patches that together come close to achieving that goal:

 1. snapshot-error-v1.patch introduces a new special snapshot, called
 SnapshotError.  In the cases where we set SnapshotNow as a sort of
 default snapshot, this patch changes the code to use SnapshotError
 instead.  This affects scan-xs_snapshot in genam.c and
 estate-es_snapshot in execUtils.c.  This passes make check-world, so
 apparently there is no code in the core distribution that does this.
 However, this is safer for third-party code, which will ERROR instead
 of seg faulting.  The alternative approach would be to use
 InvalidSnapshot, which I think would be OK too if people dislike this
 approach.

It seems the consensus was mildly for InvalidSnapshot, so I did it that way.

 2. snapshot-self-not-now-v1.patch changes several uses of SnapshotNow
 to use SnapshotSelf instead.  These include pgrowlocks(),
 pgstat_heap(), and get_actual_variable_range().  In all of those
 cases, only an approximately-correct answer is needed, so the change
 should be fine.  I'd also generally expect that it's very unlikely for
 any of these things to get called in a context where the table being
 scanned has been updated by the current transaction after the most
 recent command-counter increment, so in practice the change in
 semantics will probably not be noticeable at all.

Tom proposed that we use SnapshotDirty for this case; let me just ask
whether there are any security concerns around that.  pgstattuple only
displays aggregate information so I think that's OK, but I wonder if
the value found in get_actual_variable_range() can leak out in EXPLAIN
output or whatever.  I can't particularly think of any reason why that
would actually matter, but I've generally shied away from exposing
data written by uncommitted transactions, and this would be a step in
the other direction.  Does this worry anyone else or am I being
paranoid?

But thinking about it a little more, I wonder why
get_actual_variable_range() is using a snapshot at all.  Presumably
what we want there is to find the last index key, regardless of the
visibility of the heap tuple to which it points.  We don't really need
to consult the heap at all, one would think; the value we need ought
to be present in the index tuple.  If we're going to use a snapshot
for simplicity of coding, maybe the right thing is SnapshotAny.  After
all, even if the index tuples are all dead, we still have to scan
them, so it's still relevant for costing purposes.

Thoughts?

 With that done, the only remaining uses of SnapshotNow in our code
 base will be in currtid_byreloid() and currtid_byrelname().  So far no
 one on this list has been able to understand clearly what the purpose
 of those functions is, so I'm copying this email to pgsql-odbc in case
 someone there can provide more insight.  If I were a betting man, I'd
 bet that they are used in contexts where the difference between
 SnapshotNow and SnapshotSelf wouldn't matter there, either.  For
 example, if those functions are always invoked in a query that does
 nothing but call those functions, the difference wouldn't be visible.
 If we don't want to risk any change to the semantics, we can (1) grit
 our teeth and keep SnapshotNow around or (2) use an instantaneous MVCC
 snapshot there, and accept that people who have very large connection
 counts and extremely heavy use of those functions may see a
 performance regression.

It seems like we're leaning toward a fresh MVCC snapshot for this case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Bloom Filter lookup for hash joins

2013-07-23 Thread Atri Sharma
On Tue, Jul 23, 2013 at 7:32 PM, Greg Stark st...@mit.edu wrote:
 This exact idea was discussed a while back. I think it was even implemented.

 The problem Tom raised at the time is that the memory usage of the bloom
 filter implies smaller or less efficient hash table. It's difficult to
 determine whether you're coming out ahead or behind.

 I think it should be possible to figure this out though. Bloom filters have
 well-understood math for the error rate given the size and number of hash
 functions (and please read up on it and implement the optimal combination
 for the target error rate, not just a WAG) and it should be possible to
 determine the resulting advantage.
 
 Measuring the cost of the memory usage is harder but some empirical results
 should give some idea.  I expect the result to be wildly uneven one way or
 the other so hopefully it doesn't matter if it's not perfect. If it's close
 then it's probably not worth doing anyway.
 
 I would suggest looking up the archives of the previous discussion. You might
 find the patch still usable. IIRC there have been no major changes to the hash
 join code.

Right, I will definitely have a look at the thread. Thanks for the info!

Regards,

Atri


-- 
Regards,

Atri
l'apprenant




Re: [HACKERS] [v9.4] row level security

2013-07-23 Thread Josh Berkus
On 07/22/2013 01:27 PM, Greg Smith wrote:
 
 Anyone who would like to see RLS committed should consider how you can
 get resources to support a skilled PostgreSQL reviewer spending time on
 it.  (This is a bit much to expect new reviewers to chew on usefully)
 I've been working on that here, but I don't have anything I can publicly
 commit to yet.

Apparently it's a little much for experienced reviewers to chew on, too,
since I've been trying to get someone to review it since the beginning
of the Commitfest.

While I understand the call for resources, this is a bit unfair to
KaiGai, who has put in his time reviewing other people's patches.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Design proposal: fsync absorb linear slider

2013-07-23 Thread Greg Smith

On 7/23/13 10:56 AM, Robert Haas wrote:

On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith g...@2ndquadrant.com wrote:

We know that a 1GB relation segment can take a really long time to write
out.  That could include up to 128 changed 8K pages, and we allow all of
them to get dirty before any are forced to disk with fsync.


By my count, it can include up to 131,072 changed 8K pages.


Even better!  I can pinpoint exactly what time last night I got tired 
enough to start making trivial mistakes.  Everywhere I said 128 it's 
actually 131,072, which just changes the range of the GUC I proposed.


Getting the number right really highlights just how bad the current 
situation is.  Would you expect the database to dump up to 128K writes 
into a file and then have low latency when it's flushed to disk with 
fsync?  Of course not.  But that's the job the checkpointer process is 
trying to do right now.  And it's doing it blind--it has no idea how 
many dirty pages might have accumulated before it started.


I'm not exactly sure how best to use the information collected.  fsync 
every N writes is one approach.  Another is to use accumulated writes to 
predict how long fsync on that relation should take.  Whenever I tried 
to spread fsync calls out before, the scale of the piled up writes from 
backends was the input I really wanted available.  The segment write 
count gives an alternate way to sort the blocks too; you might start 
with the heaviest-hit ones.
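
As a toy sketch of the fsync-every-N idea (the names and the threshold
here are hypothetical, not from any patch), the checkpointer would keep a
per-segment counter of absorbed backend writes and issue an early fsync
once it crosses whatever the slider is set to:

#include <unistd.h>

typedef struct
{
    int     fd;                 /* open file descriptor for this 1GB segment */
    int     writes_absorbed;    /* backend writes seen since the last fsync */
} SegmentFsyncState;

static void
absorb_segment_write(SegmentFsyncState *seg, int fsync_after_n_writes)
{
    if (++seg->writes_absorbed >= fsync_after_n_writes)
    {
        fsync(seg->fd);             /* push accumulated dirty data out early */
        seg->writes_absorbed = 0;
    }
}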


In all these cases, the fundamental thing I keep coming back to is wanting 
to cue off of past write statistics.  If you want to predict relative I/O 
delay times with any hope of accuracy, you have to start the checkpoint 
knowing something about the backend and background writer activity since 
the last one.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [9.4 CF 1] The Commitfest Slacker List

2013-07-23 Thread Josh Berkus
Greg,

 As far as motivating new reviewers goes, let's talk about positive
 feedback.  Anything that complicates the release notes is a non-starter
 because that resource is tightly controlled by a small number of people,
 and it's trying to satisfy a lot of purposes.  

Greg, you're re-arguing something we obtained a consensus on a week ago.
  Please read the thread Kudos for reviewers.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Comma Comma Comma 8601

2013-07-23 Thread David E. Wheeler
On Jul 23, 2013, at 1:17 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Does that create any ambiguities against formats we already support?
 I'm worried about examples like this one:
 
 select 'monday, july 22, 22:30 2013'::timestamptz;
        timestamptz       
  ------------------------
   2013-07-22 22:30:00-04
  (1 row)
 
 Right now I'm pretty sure that the datetime parser treats comma as a
 noise symbol.  If that stops being true, we're likely to break some
 applications out there (admittedly, possibly rather strange ones,
 but still ...)

I kind of suspect not, since this fails:

david=# select '12:24:53 654'::time;
ERROR:  invalid input syntax for type time: "12:24:53 654"
LINE 1: select '12:24:53 654'::time;
               ^

I would have guessed that the time parser gets to a state where it knows it is 
dealing with an ISO-8601-style time. But I have not looked at the code, of 
course.

David





Re: [HACKERS] [COMMITTERS] pgsql: Add support for REFRESH MATERIALIZED VIEW CONCURRENTLY.

2013-07-23 Thread Kevin Grittner
Andres Freund and...@2ndquadrant.com wrote:

 Hm. There seems to be more things that need some more improvement
 from a quick look.

 First, I have my doubts about the general modus operandi of
 CONCURRENTLY here. I think in many, many cases it would be faster
 to simply swap in the newly built heap + indexes concurrently.
 I think I have seen that mentioned in the review while mostly
 skipping the discussion, but still. That would be easy enough as
 of Robert's mvcc patch.

Yeah, in many cases you would not want to choose to specify this
technique. The point was to have a technique which could run
without blocking while a heavy load of queries (including perhaps a
very long-running query) was executing.  If subsequent events
provide an alternative way to do that, it's worth looking at it;
but my hope was to get this out of the way to allow time to develop
incremental maintenance of matviews for the next CF.  If we spend
too much time reworking this to micro-optimize it for new patches,
incremental maintenance might not happen for 9.4.

The bigger point is perhaps syntax.  If MVCC catalogs are really at
a point where we can substitute a new heap and new indexes for a
matview without taking an AccessExclusiveLock, then we should just
make the current code do that.  I think there is still a place for
a differential update for large matviews which only need small
changes on a REFRESH, but perhaps the right term for that is not
CONCURRENTLY.

 * Shouldn't the GetUserIdAndSecContext(), NewGUCNestLevel() dance
   be done a good bit earlier? We're executing queries before
   that.

This can't be in effect while we are creating temp tables.  If
we're going to support a differential REFRESH technique, it must be
done more surgically than to wrap the whole REFRESH statement
execution.

 * The loop over indexes in refresh_by_match_merge should
   index_open(ExclusiveLock) the indexes initially instead of
   searching the syscache manually. They are opened inside the
   loop in many cases anyway. There probably aren't any hazards
   currently, but I am not even sure about that. The index_open()
   in the loop at the very least processes the invalidation
   messages other backends send...

Fair enough.  That seems like a simple change.

   I'd even suggest using BuildIndexInfo() or such on the indexes,
   then you could use ->ii_Expressions et al instead of peeking
   into indkeys by hand.

Given that the function is in a source file described as containing
"code to create and destroy POSTGRES index relations", and the
comments for that function say that it stores "the information
about the index that's needed by FormIndexDatum, which is used for
both index_build() and later insertion of individual index tuples",
and we're not going to create or destroy an index or call
FormIndexDatum or insert individual index tuples from this code, it
would seem to be a significant expansion of the charter of that
function.  What benefits do you see to using that level?

 * All not explicitly looked up operators should be qualified
   using OPERATOR(). It's easy to define your own = operator for
   tid and set the search_path to public,pg_catalog. Now, the
   whole thing is restricted to table owners afaics, making this
   decidedly less dicey, but still. But if anybody actually uses a
   search_path like the above you can easily hurt them.
 * Same seems to hold true for the y = x and y.* IS DISTINCT FROM
   x.* operations.
 * (y.*) = (x.*) also needs to do proper operator lookups.

I wasn't aware that people could override the equality operators
for tid and RECORD, so that seemed like unnecessary overhead.  I
can pull the operators from the btree AM if people can do that.

 * OpenMatViewIncrementalMaintenance() et al seem to be
   underdocumented.

If the adjacent comment for
MatViewIncrementalMaintenanceIsEnabled() left you at all uncertain
about what OpenMatViewIncrementalMaintenance() and
CloseMatViewIncrementalMaintenance() are for, I can add comments.
At the time I wrote it, it seemed to me to be redundant to have
such comments, and would cause people to waste more time looking at
them than would be saved by anything they could add to what the
function names and one-line function bodies conveyed; but if there
was doubt in your mind about when they should be called, I'll add
something.

Would it help to move the static functions underneath the
(documented) public function and its comment?

 * why is it even allowed that matview_maintenance_depth is > 1?
   Unless I miss something there doesn't seem to be support for a
   recursive refresh whatever that would mean?

I was trying to create the function in a way that would be useful
for incremental maintenance, too.  Those could conceivably recurse.
Also, this way bugs in usage are more easily detected.

 * INSERT and DELETE should also alias the table names, to make
   sure there cannot be issues with the nested queries.

I don't see the hazard -- could you elaborate?

 * I'd strongly 

Re: [HACKERS] getting rid of SnapshotNow

2013-07-23 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 2. snapshot-self-not-now-v1.patch changes several uses of SnapshotNow
 to use SnapshotSelf instead.  These include pgrowlocks(),
 pgstat_heap(), and get_actual_variable_range().

 Tom proposed that we use SnapshotDirty for this case; let me just ask
 whether there are any security concerns around that.  pgstattuple only
 displays aggregate information so I think that's OK, but I wonder if
 the value found in get_actual_variable_range() can leak out in EXPLAIN
 output or whatever.  I can't particularly think of any reason why that
 would actually matter, but I've generally shied away from exposing
 data written by uncommitted transactions, and this would be a step in
 the other direction.  Does this worry anyone else or am I being
 paranoid?

As far as get_actual_variable_range() is concerned, an MVCC snapshot
would probably be the thing to use anyway; I see no need for the planner
to be using estimates that are more up to date than that.  pgrowlocks
and pgstat_heap() might be in a different category.

 But thinking about it a little more, I wonder why
 get_actual_variable_range() is using a snapshot at all.  Presumably
 what we want there is to find the last index key, regardless of the
 visibility of the heap tuple to which it points.

No, what we ideally want is to know the current variable range that
would be seen by the query being planned.

regards, tom lane




Re: [HACKERS] [v9.4] row level security

2013-07-23 Thread Greg Smith

On 7/23/13 12:10 PM, Josh Berkus wrote:

Apparently it's a little much for experienced reviewers to chew on, too,
since I've been trying to get someone to review it since the beginning
of the Commitfest.


It's more than the available experienced reviewers are willing to chew 
on fully as volunteers.  The reward for spending review time is pretty 
low right now.



While I understand the call for resources, this is a bit unfair to
KaiGai, who has put in his time reviewing other people's patches.


If you read Dean Rasheed's comments, it's obvious he put a useful amount 
of work into his review suggestions.  It is not the case here that 
KaiGai worked on a review and got nothing in return.  Unfortunately that 
has happened to this particular patch before, but the community did a 
little better this time.


The goal of the CF is usually to reach one of these outcomes for every 
submission:


-The submission is ready for commit
-The submission needs improvement in X

Review here went far enough to identify an X to be improved.  It was a 
big enough issue that a rewrite is needed, commit at this time isn't 
possible, and now KaiGai has something we hope is productive he can 
continue working on.  That's all we can really promise here--that if we 
say something isn't ready for commit yet, that there's some feedback as 
to why.


I would have preferred to see multiple X issues identified here, given 
that we know there has to be more than just the one in a submission of 
this size.  The rough fairness promises of the CommitFest seem satisfied 
to me though.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] getting rid of SnapshotNow

2013-07-23 Thread Robert Haas
On Tue, Jul 23, 2013 at 12:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 2. snapshot-self-not-now-v1.patch changes several uses of SnapshotNow
 to use SnapshotSelf instead.  These include pgrowlocks(),
 pgstat_heap(), and get_actual_variable_range().

 Tom proposed that we use SnapshotDirty for this case; let me just ask
 whether there are any security concerns around that.  pgstattuple only
 displays aggregate information so I think that's OK, but I wonder if
 the value found in get_actual_variable_range() can leak out in EXPLAIN
 output or whatever.  I can't particularly think of any reason why that
 would actually matter, but I've generally shied away from exposing
 data written by uncommitted transactions, and this would be a step in
 the other direction.  Does this worry anyone else or am I being
 paranoid?

 As far as get_actual_variable_range() is concerned, an MVCC snapshot
 would probably be the thing to use anyway; I see no need for the planner
 to be using estimates that are more up to date than that.  pgrowlocks
 and pgstat_heap() might be in a different category.

 But thinking about it a little more, I wonder why
 get_actual_variable_range() is using a snapshot at all.  Presumably
 what we want there is to find the last index key, regardless of the
 visibility of the heap tuple to which it points.

 No, what we ideally want is to know the current variable range that
 would be seen by the query being planned.

Oh, really?  Well, should we use GetActiveSnapshot() then?

That surprises me, though.  I really thought the point was to cost the
index scan, and surely that will be slowed down even by entries we
can't see.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] new row-level lock error messages

2013-07-23 Thread Alvaro Herrera
Alvaro Herrera wrote:
 Peter Eisentraut wrote:
 
  I would suggest that these changes be undone, except that the old
  SELECT FOR ... be replaced by a dynamic string that reverse-parses the
  LockingClause to provide the actual clause that was used.
 
 Here's a patch for this.

Pushed to 9.3 and master.  Sample output:

alvherre=# select * from foo, bar for update of foof for share of bar;
ERROR:  relation "foof" in FOR UPDATE clause not found in FROM clause

alvherre=# select * from foo, bar for update of foo for share of barf;
ERROR:  relation "barf" in FOR SHARE clause not found in FROM clause

Amusingly, the only test in which these error messages appeared, in
contrib/file_fdw, was removed after the two commits that changed the
wording.  So there's not a single test which needed to be tweaked for
this change.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services




Re: [HACKERS] maintenance_work_mem and CREATE INDEX time

2013-07-23 Thread Jeff Janes
On Mon, Jul 22, 2013 at 9:11 PM, Amit Langote amitlangot...@gmail.com wrote:
 Hello,

 While understanding the effect of maintenance_work_mem on time taken
 by CREATE INDEX, I observed that for the values of
 maintenance_work_mem less than the value for which an internal sort is
 performed, the time taken by CREATE INDEX increases as
 maintenance_work_mem increases.

 My guess is that for all those values an external sort is chosen at
 some point, and the larger the value of maintenance_work_mem, the later
 the switch to external sort would be made, causing CREATE INDEX to take
 longer. That is, a smaller value of maintenance_work_mem would be
 preferred when an external sort is performed anyway. Does that make
 sense?

The heap structure used in external sorts is cache-unfriendly.  The
bigger the heap used, the more this unfriendliness becomes apparent.
And the bigger maintenance_work_mem, the bigger the heap used.
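
To illustrate why (generic heap code, not tuplesort.c's): every sift-down
touches one element per level of the heap, and those elements are spread
across the whole array, so once the heap is much bigger than the CPU cache
nearly every level costs a cache miss.

/* Plain binary-heap sift-down of element i in a min-heap of nelems ints. */
static void
sift_down(int *heap, int nelems, int i)
{
    while (2 * i + 1 < nelems)
    {
        int     child = 2 * i + 1;

        if (child + 1 < nelems && heap[child + 1] < heap[child])
            child++;                        /* take the smaller child */
        if (heap[i] <= heap[child])
            break;                          /* heap property holds */
        int     tmp = heap[i];
        heap[i] = heap[child];
        heap[child] = tmp;
        i = child;
    }
}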

The bigger heap also means you have fewer runs to merge in the
external sort.  However, as long as the number of runs still fits in
the same number of merge passes, this is generally not a meaningful
difference.

Ideally the planner (or something) would figure out how much memory
would be needed to complete an external sort in just one external
pass, and then lower the effective maintenance_work_mem to that
amount.  But that would be hard to do with complete precision, and the
consequences of getting it wrong are asymmetric.

(More thoroughly, it would figure out the number of passes needed for
the given maintenance_work_mem, and then lower the effective
maintenance_work_mem to the smallest value that gives the same number
of passes. But for nearly all practical situations, I think the number
of passes for an index build is going to be 0 or 1.)

Cheers,

Jeff




Re: [HACKERS] getting rid of SnapshotNow

2013-07-23 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jul 23, 2013 at 12:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 As far as get_actual_variable_range() is concerned, an MVCC snapshot
 would probably be the thing to use anyway;

 That surprises me, though.  I really thought the point was to cost the
 index scan, and surely that will be slowed down even by entries we
 can't see.

No, the usage (or the main usage anyway) is for selectivity estimation,
ie how many rows will the query fetch.

regards, tom lane




Re: [HACKERS] [v9.4] row level security

2013-07-23 Thread Josh Berkus
Greg,

 It's more than the available experienced reviewers are willing to chew
 on fully as volunteers.  The reward for spending review time is pretty
 low right now.

Short of paying for review time, I don't think we have another solution
for getting the big patches reviewed, except to rely on the major
contributors who are paid full-time to hack Postgres.   You know as well
as me that, as consultants, we can get clients to pay for 10% extra time
for review in the course of developing a feature, but nobody is going to
pay on a contract basis for the kind of time that patches like Row
Security, Changesets, or other big patches need.  And nobody who is doing
this in their spare time has that kind of block of time.

So I don't think there's any good solution for the big patches.

I do think our project could do much more to recruit reviewers for the
small-medium patches, to take workload off the core contributors in
general.  Historically, however, this project (and the contributors on
this list) has made material decisions not to encourage or recruit new
people as reviewers, and has repeatedly stated that reviewers are not
important.  Until that changes, we are not going to get more reviewers
(and I'm not going to have much sympathy for existing contributors who
say they have no time for review).

If we want more reviewers and people spending more time on review, then
we need to give reviewers the *same* respect and the *same* rewards that
feature contributors get.  Not something else, the exact same.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] getting rid of SnapshotNow

2013-07-23 Thread Robert Haas
On Tue, Jul 23, 2013 at 2:24 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Jul 23, 2013 at 12:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 As far as get_actual_variable_range() is concerned, an MVCC snapshot
 would probably be the thing to use anyway;

 That surprises me, though.  I really thought the point was to cost the
 index scan, and surely that will be slowed down even by entries we
 can't see.

 No, the usage (or the main usage anyway) is for selectivity estimation,
 ie how many rows will the query fetch.

OK, so GetActiveSnapshot()?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] getting rid of SnapshotNow

2013-07-23 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 OK, so GetActiveSnapshot()?

Works for me.

regards, tom lane




[HACKERS] Adding Zigzag Merge Join to Index Nested Loops Join

2013-07-23 Thread tubadzin
Hi.
I want to add Zigzag Merge Join to the Index Nested Loops Join algorithm.
http://www.cs.berkeley.edu/~fox/summaries/database/query_eval_5-8.html 
Which files are responsible for Index nested loops join? (not simple nested 
loops join which has double foreach in nodeNestloop.c) nodeIndexscan.c? 
nodeIndexonlyscan.c?
Best Regards 
Tom

Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread Andrew Gierth
Tom Lane said:
 If you yourself are still finding bugs in the code as of this afternoon,
 onlookers could be forgiven for doubting that the code is quite as
 readable or ready to commit as all that.

Right, and we all know that all code is perfect when committed. sheesh.

(This is actually the first time in six months that I've had occasion
to look at that part of the code; that's how long it's been sitting in
the queue.  And yes, it was one of my bits, not David's.  Maybe I
should have left the bug in to see how long it took you to spot it?)

What I'm very notably not seeing from you is any substantive feedback.
You've repeatedly described this patch as broken on the basis of
nothing more than garbled hearsay while loudly proclaiming not to have
actually looked at it; I find this both incomprehensible and insulting.
If you have legitimate technical concerns then let's hear them, now.

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread Stephen Frost
Andrew,

* Andrew Gierth (and...@tao11.riddles.org.uk) wrote:
 Right, and we all know that all code is perfect when committed. sheesh.

That clearly wasn't what was claimed.

 (This is actually the first time in six months that I've had occasion
 to look at that part of the code; that's how long it's been sitting in
 the queue.

While such issues are frustrating for all of us, huffing about it here
isn't particularly useful.

 And yes, it was one of my bits, not David's.  Maybe I
 should have left the bug in to see how long it took you to spot it?)

That attitude is certainly discouraging.

 What I'm very notably not seeing from you is any substantive feedback.
 You've repeatedly described this patch as broken on the basis of
 nothing more than garbled hearsay while loudly proclaiming not to have
 actually looked at it; I find this both incomprehensible and insulting.

As Greg is the one looking to possibly commit this, I certainly didn't
consider his comments on the patch to be garbled hearsay.  It would have
been great if he had been more clear in his original comments, but I
don't feel that you can fault any of us for reading his email and
voicing what concerns we had from his review.  While you might wish that
we all read every patch submitted, none of us has time for that- simply
keeping up with this mailing list requires a significant amount of time.

 If you have legitimate technical concerns then let's hear them, now.

Fine- I'd have been as happy leaving this be and letting Greg commit it,
but if you'd really like to hear my concerns, I'd start with pointing
out that it's pretty horrid to have to copy every record around like
this and that the hack of CreateTupleDescCopyExtend is pretty terrible
and likely to catch people by surprise.  Having FunctionNext() operate
very differently depending on WITH ORDINALITY is ugly and the cause of
the bug that was found.  All-in-all, I'm not convinced that this is
really a good approach and feel it'd be better off implemented at a
different level, eg a node type instead of a hack on top of the existing
SRF code.

Now, what would be great to see would be your response to Dean's
comments and suggestions rather than berating someone who's barely
written 5 sentences on this whole thread.

Thanks,

Stephen




Re: [HACKERS] maintenance_work_mem and CREATE INDEX time

2013-07-23 Thread Jeff Janes
On Tue, Jul 23, 2013 at 1:23 AM, Amit Langote amitlangot...@gmail.com wrote:
 On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote amitlangot...@gmail.com wrote:
 Hello,

 While understanding the effect of maintenance_work_mem on time taken
 by CREATE INDEX, I observed that for the values of
 maintenance_work_mem less than the value for which an internal sort is
 performed, the time taken by CREATE INDEX increases as
 maintenance_work_mem increases.

 My guess is that for all those values an external sort is chosen at
 some point, and the larger the value of maintenance_work_mem, the later
 the switch to external sort would be made, causing CREATE INDEX to take
 longer. That is, a smaller value of maintenance_work_mem would be
 preferred when an external sort is performed anyway. Does that make
 sense?


 Upon further investigation, it is found that the delay to switch to
 external sort caused by a larger value of maintenance_work_mem is
 small compared to the total time of CREATE INDEX.

If you are using trace_sort to report that, it reports the switch as
happening as soon as it runs out of memory.

At that point, all we have been doing is reading tuples into memory.  The
time it takes to do that will depend on maintenance_work_mem, because
that affects how many tuples fit in memory.  But all the rest of the
tuples need to be read sooner or later anyway, so pushing more of them
to later doesn't improve things overall; it just shifts timing around.

After it reports the switch, it still needs to heapify the existing
in-memory tuples before the tapesort proper can begin.  This is where
the true lost opportunities start to arise, as the large heap starts
driving cache misses which would not happen at all under different
settings.

Once the existing tuples are heapified, it then continues to use the
heap to pop tuples from it to write out to tape, and to push newly
read tuples onto it.  This also suffers lost opportunities.

Once all the tuples are written out and it starts merging, then the
large maintenance_work_mem is no longer a penalty as the new heap is
limited by the number of tapes, which is almost always much smaller.
In fact this stage will actually be faster, but not by enough to make
up for the earlier slow down.

So it is not surprising that the time before the switch is reported is
a small part of the overall time difference.


Cheers,

Jeff




Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread Greg Stark
On Tue, Jul 23, 2013 at 9:27 PM, Stephen Frost sfr...@snowman.net wrote:

 Fine- I'd have been as happy leaving this be and letting Greg commit it,
 but if you'd really like to hear my concerns, I'd start with pointing
 out that it's pretty horrid to have to copy every record around like
 this and that the hack of CreateTupleDescCopyExtend is pretty terrible
 and likely to catch people by surprise.  Having FunctionNext() operate
 very differently depending on WITH ORDINALITY is ugly and the cause of
 the bug that was found.  All-in-all, I'm not convinced that this is
 really a good approach and feel it'd be better off implemented at a
 different level, eg a node type instead of a hack on top of the existing
 SRF code.

Fwiw I've been mulling over the same questions and came to the
conclusion that the existing approach makes the most sense.

In an ideal world an extra executor node would be the prettiest,
cleanest implementation. But the amount of extra code and busywork
that would necessitate just isn't justified for the amount of work it
would be doing.

It might be different if WITH ORDINALITY made sense for any other
types of target tables. But it really only makes sense for SRFs. The
whole motivation for having them in the spec is that UNNEST is taking
an ordered list and turning it into a relation which is unordered. You
can't just do row_number() because there's nothing to make it ordered
by.

By the same token for any other data source you would just use
row_number *precisely because* any other data source is unordered and
you should have to specify an order in order to make row_number
produce something meaningful.

So given that WITH ORDINALITY is really only useful for UNNEST, and we
can generalize it to all SRFs on the basis that Postgres SRFs do
produce ordered sets, not unordered relations, it isn't crazy for the
work to be in the Function node.

Now that I've written that though it occurs to me to wonder whether
FDW tables might be usefully said to be ordered too though?


-- 
greg




Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread Stephen Frost
* Greg Stark (st...@mit.edu) wrote:
 So given that WITH ORDINALITY is really only useful for UNNEST and we
 can generalize it to all SRFs on the basis that Postgres SRFs do
 produce ordered sets not unordered relations it isn't crazy for the
 work to be in the Function node.

I agree, it isn't *crazy*. :)

 Now that I've written that though it occurs to me to wonder whether
 FDW tables might be usefully said to be ordered too though?

My thought around this was a VALUES() construct, but FDWs are an
interesting case to consider also; with FDWs it's possible that
something is said in SQL/MED regarding this question- perhaps it would
make sense to look there?

Thanks,

Stephen




Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-07-23 Thread Josh Berkus

 Everything was already *not* hunky-dory as soon as you did that, since
 a SIGHUP would have had the same problem.
 
 I'd be inclined to think that ALTER SYSTEM SET should not be allowed to
 modify any PGC_POSTMASTER parameters.

That makes alter system set a bunch less useful, but that problem might be
insurmountable.

Anyway, I also regard this as a problem we need to resolve now that we
have the config directory if we expect people to build autotuning tools.
 Once someone is relying on an autotuning tool which drops a file in
config/, it becomes a real problem that there are uncommented
PGC_POSTMASTER options in the default postgresql.conf.

I'm not sure what the solution to that problem should be, but I do know
that we're going to hear it a lot from users as includes and the config
directory get more common.  Certainly any solution which says "first,
manually edit postgresql.conf" makes the config directory into a kind of
joke.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread David Fetter
On Tue, Jul 23, 2013 at 05:23:17PM -0400, Stephen Frost wrote:
 * Greg Stark (st...@mit.edu) wrote:
  So given that WITH ORDINALITY is really only useful for UNNEST and we
  can generalize it to all SRFs on the basis that Postgres SRFs do
  produce ordered sets not unordered relations it isn't crazy for the
  work to be in the Function node.
 
 I agree, it isn't *crazy*. :)
 
  Now that I've written that though it occurs to me to wonder whether
  FDW tables might be usefully said to be ordered too though?
 
 My thought around this was a VALUES() construct, but FDWs are an
 interesting case to consider also; with FDWs it's possible that
 something is said in SQL/MED regarding this question- perhaps it would
 make sense to look there?

There are a lot of ways foreign tables don't yet act like local ones.
Much as I'm a booster for fixing that problem, I'm thinking
improvements in this direction are for a separate patch.

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread Stephen Frost
David,

On Tuesday, July 23, 2013, David Fetter wrote:

 There are a lot of ways foreign tables don't yet act like local ones.
 Much as I'm a booster for fixing that problem, I'm thinking
 improvements in this direction are for a separate patch.


This isn't about making FDWs more like local tables- indeed, quite the
opposite. The question is if we should declare that WITH ORDINALITY will
only ever be for SRFs or if we should consider that it might be useful with
FDWs specifically because they are not unordered sets as tables are.
Claiming that question is unrelated to the implementation of WITH
ORDINALITY is rather... Bizarre.

Thanks,

Stephen


Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread David Fetter
On Tue, Jul 23, 2013 at 06:09:20PM -0400, Stephen Frost wrote:
 David
 
 On Tuesday, July 23, 2013, David Fetter wrote:
 
  There are a lot of ways foreign tables don't yet act like local
  ones.  Much as I'm a booster for fixing that problem, I'm thinking
  improvements in this direction are for a separate patch.
 
 
 This isn't about making FDWs more like local tables- indeed, quite
 the opposite. The question is if we should declare that WITH
 ORDINALITY will only ever be for SRFs or if we should consider that
 it might be useful with FDWs specifically because they are not
 unordered sets as tables are.  Claiming that question is unrelated
 to the implementation of WITH ORDINALITY is rather... Bizarre.

Are you saying that there's stuff that, if I don't put it in now, will
impede our ability to add this to FTs later?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] [v9.4] row level security

2013-07-23 Thread Greg Smith

On 7/23/13 2:30 PM, Josh Berkus wrote:


You know as well as me that, as consultants, we can get clients to pay for 10% 
extra time
for review in the course of developing a feature


Before this number gets quoted anywhere, I think it's on the low side. 
I've spent a good bit of time measuring how much time it takes to do a 
fair offsetting review--one where you put as much time in as it takes to 
review your submission--and I keep getting numbers more in the 20 to 25% 
range.  The work involved to do a high quality review takes a while.


I happen to think the review structure is one of the most important 
components to PostgreSQL release quality.  It used to be a single level 
review with just the committers, now it's a two level structure.  The 
reason the Postgres code is so good isn't that the submitted development 
is inherently any better than other projects.  There's plenty of bogus 
material submitted here.  It's the high standards for review and commit 
that are the key filter.  The importance of the process to the result 
isn't weighed as heavily as I think it should be.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] [v9.4] row level security

2013-07-23 Thread Josh Berkus
On 07/23/2013 03:34 PM, Greg Smith wrote:

 I happen to think the review structure is one of the most important
 components to PostgreSQL release quality.  It used to be a single level
 review with just the committers, now it's a two level structure.  The
 reason the Postgres code is so good isn't that the submitted development
 is inherently any better than other projects.  There's plenty of bogus
 material submitted here.  It's the high standards for review and commit
 that are the key filter.  The importance of the process to the result
 isn't weighed as heavily as I think it should be.

I think we're in violent agreement here.  Now, we just need to convince
everyone else on this list.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Performance problem in PLPgSQL

2013-07-23 Thread Tom Lane
I wrote:
 Marc Cousin cousinm...@gmail.com writes:
 The example below is of course extremely simplified, and obviously not
 what we are really doing in the database, but it exhibits the slowdown
 between 9.1.9 and 9.2.4.

 Hm.  Some experimentation shows that the plan cache is failing to switch
 to a generic plan, but I'm not sure why the cast would have anything to
 do with that ...

Hah, I see why:

(gdb) s
1009            if (plansource->generic_cost < avg_custom_cost * 1.1)
(gdb) p plansource->generic_cost
$18 = 0.012501
(gdb) p avg_custom_cost
$19 = 0.01
(gdb) p avg_custom_cost * 1.1
$20 = 0.011001

That is, the estimated cost of the custom plan is just the evaluation
time for a simple constant, while the estimated cost of the generic plan
includes a charge for evaluation of int4_numeric().  That makes the
latter more than ten percent more expensive, and since this logic isn't
considering anything else at all (particularly not the cost of
planning), it thinks that makes the custom plan worth picking.
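
(To make that concrete: a hypothetical sketch of the kind of statement at
issue, with invented table and function names, where the generic plan keeps
the int-to-numeric cast that the custom plan folds away:)

CREATE TABLE t (v numeric);

CREATE FUNCTION fill(n int) RETURNS void LANGUAGE plpgsql AS $$
BEGIN
  FOR i IN 1..n LOOP
    INSERT INTO t VALUES (i);  -- i is int4, v is numeric: cast required
  END LOOP;
END;
$$;

SELECT fill(100000);  -- on 9.2.x each INSERT may get a fresh custom plan,
                      -- so planning time dominates execution time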

We've been around on this before, but not yet thought of a reasonable
way to estimate planning cost, let alone compare it to estimated
execution costs.  Up to now I hadn't thought that was a particularly
urgent issue, but this example shows it's worth worrying about.

One thing that was suggested in the previous discussion is to base the
planning cost estimate on the length of the rangetable.  We could
do something trivial like add 10 * (length(plan->rangetable) + 1)
in this comparison.

Another thing that would fix this particular issue, though perhaps not
related ones, is to start charging something nonzero for ModifyTable
nodes, say along the lines of one seq_page_cost per inserted/modified
row.  That would knock the total estimated cost for an INSERT up enough
that the ten percent threshold wouldn't be exceeded.

Thoughts?

regards, tom lane




[HACKERS] Failure to use generic plans (was: Re: Performance problem in PLPgSQL)

2013-07-23 Thread Andrew Gierth
As failures to use a generic plan go, that one's fairly tame. I've
seen much worse. For example:

PREPARE foo(integer[]) AS SELECT * FROM complexview WHERE id = ANY ($1);

where the caller typically supplies 1-5 array elements (or any number
less than 10, because generic parameter arrays are assumed to have 10
elements). This one can be a massive performance regression between
9.1 and 9.2; the first guy who mentioned this on IRC was getting a 40x
slowdown (~20ms planning time vs. 0.5ms execution time).
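
(For anyone who wants to watch it happen, a sketch; complexview stands in
for the reporter's view and is not reproduced here:)

PREPARE foo(integer[]) AS SELECT * FROM complexview WHERE id = ANY ($1);

\timing on
EXECUTE foo(ARRAY[1, 2, 3]);
-- If the generic plan, costed as though the array had 10 elements, never
-- looks cheap enough, every EXECUTE re-plans the view: roughly 20ms of
-- planning against 0.5ms of execution in the case reported above.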

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread Stephen Frost
On Tuesday, July 23, 2013, David Fetter wrote:

 Are you saying that there's stuff that if I don't put it in now will
 impede our ability to add this to FTs later?


I'm saying that it'd be a completely different implementation and that this
one would get in the way and essentially have to be ripped out.

No one is saying that this patch wouldn't work for the specific use-case
that it set out to meet. Maybe it's unfair for us to consider possible
use-cases beyond the patch's goal and the spec requirement, but doing so is,
IMO, also one of the things that makes PG great. MVCC isn't necessary and
isn't required by the spec either.

Thanks,

Stephen


Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread Andrew Gierth
Stephen Frost said:
 [stuff about foreign tables]

I think the issue with foreign tables is probably moot because if you
_did_ want to have some types of foreign tables with a fixed
ordinality, you'd probably also want qual pushdown to work for it
(i.e. so that WHERE rownum=123 doesn't have to filter all the rows),
whereas with SRFs this doesn't really apply.

For this to work, foreign tables with a fixed ordering would have to
provide that in the FDW - which is in any case the only place that
knows whether a fixed order would even make any sense.

So I see no overlap here with the SRF ordinality case.

As for VALUES, the spec regards those as constructing a table and
therefore not having any inherent order - the user can add their own
ordinal column if need be. Even if you did want to add WITH ORDINALITY
for VALUES, though, it would actually make more sense to do it in the
Values Scan node since that maintains its own ordinal position already.
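
(For example, a user-supplied ordinal for VALUES is just another column; a
trivial sketch:)

SELECT * FROM (VALUES (1, 'a'), (2, 'b'), (3, 'c')) AS v(ord, val);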

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] improve Chinese locale performance

2013-07-23 Thread Quan Zongliang

On 07/23/2013 09:42 PM, Craig Ringer wrote:

(Replying on phone, please forgive bad quoting)

Isn't this pretty much what adopting ICU is supposed to give us? OS-independent 
collations?


Yes, we need OS-independent collations.


I'd be interested in seeing the test data for this performance report, partly 
as I'd like to see how ICU collations would compare when ICU is crudely hacked 
into place for testing.







Re: [HACKERS] Review: UNNEST (and other functions) WITH ORDINALITY

2013-07-23 Thread Tom Lane
Stephen Frost sfr...@snowman.net writes:
 * Andrew Gierth (and...@tao11.riddles.org.uk) wrote:
 If you have legitimate technical concerns then let's hear them, now.

 Fine- I'd have been as happy leaving this be and letting Greg commit it,
 but if you'd really like to hear my concerns, I'd start with pointing
 out that it's pretty horrid to have to copy every record around like
 this and that the hack of CreateTupleDescCopyExtend is pretty terrible
 and likely to catch people by surprise.  Having FunctionNext() operate
 very differently depending on WITH ORDINALITY is ugly and the cause of
 the bug that was found.  All-in-all, I'm not convinced that this is
 really a good approach and feel it'd be better off implemented at a
 different level, eg a node type instead of a hack on top of the existing
 SRF code.

I took the time to read through this patch, finally.  I think the $64
question is pretty much what Stephen says above: do we like tying this
behavior to FunctionScan, as opposed to having it be some kind of
expression node?  You could certainly imagine a WithOrdinality
expression node that takes in values of a set-returning expression,
and returns them with an extra column tacked on.  This would resolve
the problem that was mentioned awhile back that the current approach
can't support SRFs in targetlists.
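
(For reference, the FROM-clause form the patch does handle looks roughly
like this; the column aliases are chosen for illustration:)

SELECT * FROM unnest(ARRAY['a','b','c']) WITH ORDINALITY AS u(elem, ord);
--  elem | ord
--  a    |   1
--  b    |   2
--  c    |   3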

If it weren't that we've been speculating for years about deprecating
SRFs-in-tlists once we had LATERAL, I would personally consider this
patch DOA in this form.  If we do think we'll probably deprecate that
feature, then not extending WITH ORDINALITY to such cases is at least
defensible.  On the other hand, considering that we've yet to ship a
release supporting LATERAL, it's probably premature to commit to such
deprecation --- we don't really know whether people will find LATERAL
to be a convenient and performant substitute.

As far as performance goes, despite Stephen's gripe above, I think this
way is likely better than any alternative.  The reason is that once
we've disassembled the function result tuple and tacked on the counter,
we have a reasonable shot at things staying like that (in the form of
a virtual TupleTableSlot).  The next higher level of evaluation can
probably use the column Datums as-is.  A WithOrdinality expression node
would have to disassemble the input tuple and then make a new tuple,
which the next higher expression level would likely pull apart again :-(.
Also, any other approach would lead to needing to store the ordinality
values inside the FunctionScan's tuplestore, bloating that storage with
rather-low-value data.

The other big technical issue I see is representation of the rowtype of
the result.  If we did it with a WithOrdinality expression node, the
result would always be of type RECORD, and we'd have to use blessed tuple
descriptors to keep track of exactly which record type a particular
instance emits.  This is certainly do-able (see RowExpr for precedent).
Attaching the functionality to FunctionScan reduces the need for that
because the system mostly avoids trying to associate any type OID with
the rowtype of a FROM item.  Instead though, we've got a lot of ad-hoc
code that deals with RangeTblEntry type information.  A big part of the
patch is dealing with extending that code, and frankly I've got about
zero confidence that the patch has found everything that needs to be
found in that line.  A patch using an expression node would probably
need to touch only a much more localized set of places to handle the
type description issue.

Anyway, on balance I'm satisfied with this overall approach, though it's
not without room for debate.

I am fairly dissatisfied with the patch at a stylistic level, though.
It seems way too short on comments.  People griped about the code in
nodeFunctionscan in particular, but I think all the changes in ad-hoc
type-management code elsewhere are even more deserving of comments,
and they mostly didn't get that.  Likewise, there's no documentation
anywhere that I can see of the new interrelationships of struct fields,
such as that if a RangeTblEntry has funcordinality set, then its various
column-related fields such as eref->colnames include a trailing INT8
column for the ordinality.  Also, maybe I'm misreading the patch (have
I mentioned lately that large patches in -u format are practically
illegible?), but it sure looks like it flat out removed several existing
regression-test cases and a few existing comments as well.  How is that
considered acceptable?

FWIW, I concur with the gripe I remember seeing upthread that the
default name of the added column ought not be ?column?.

regards, tom lane




Re: [HACKERS] Preventing tuple-table leakage in plpgsql

2013-07-23 Thread Noah Misch
On Mon, Jul 22, 2013 at 10:02:30PM -0400, Tom Lane wrote:
 Noah Misch n...@leadboat.com writes:
  On Sun, Jul 21, 2013 at 12:40:38PM -0400, Tom Lane wrote:
  Hmm ... good point.  The other plan I'd been considering was to add
  explicit tracking inside spi.c of all tuple tables created within the
  current procedure, and then have AtEOSubXact_SPI flush any that were
  created inside a failed subxact.
 
  Is there reason to believe we wouldn't eventually find a half dozen other
  allocations calling for similar bespoke treatment?  Does something make
  tuple tables special among memory allocations, or are they just the
  garden-variety allocation that happens to bother the test case at hand?
 
 It's hard to speculate about the memory management habits of third-party
 SPI-using code.  But in plpgsql, the convention is that random bits of
 memory should be allocated in a short-term context separate from the SPI
 procCxt --- typically, the estate->eval_econtext expression context,
 which exec_stmt_block already takes care to clean up when catching an
 exception.  So the problem is that that doesn't work for tuple tables,
 which have procCxt lifespan.  The fact that they tend to be big (at
 least 8K apiece) compounds the issue.

Reasonable to treat them specially, per your plan, then.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




[HACKERS] Re: Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

2013-07-23 Thread Noah Misch
On Tue, Jul 23, 2013 at 01:21:52AM +, Andrew Gierth wrote:
 For hypothetical set functions we add a special case, aggordnargs=-1,
 for which both the aggregate and the finalfn must be defined as
 (variadic any) and parse analysis detects this case and unifies the
 types of the normal args with those of the ORDER BY args.

Other aggregates based on this syntax might not desire such type unification.
Having parse analysis do that distorts the character of an any argument.  I
think the proper place for such processing is the first call to a transition
function.  The transition functions could certainly call a new API exposed
under src/backend/parser to do the heavy lifting.  But let's not make the
parser presume that an aggordnargs=-1 aggregate always wants its any
arguments handled in the manner of the standard hypothetical set functions.

The rest of the plan looks good so far.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] maintenance_work_mem and CREATE INDEX time

2013-07-23 Thread Amit Langote
On Wed, Jul 24, 2013 at 6:02 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Tue, Jul 23, 2013 at 1:23 AM, Amit Langote amitlangot...@gmail.com wrote:
 On Tue, Jul 23, 2013 at 1:11 PM, Amit Langote amitlangot...@gmail.com 
 wrote:
 Hello,

 While understanding the effect of maintenance_work_mem on the time taken
 by CREATE INDEX, I observed that for values of maintenance_work_mem below
 the threshold at which an internal sort is performed, the time taken by
 CREATE INDEX increases as maintenance_work_mem increases.

 My guess is that for all those values an external sort is chosen at some
 point, and the larger the value of maintenance_work_mem, the later the
 switch to external sort is made, causing CREATE INDEX to take longer.
 That is, a smaller value of maintenance_work_mem would be preferable when
 an external sort is going to be performed anyway. Does that make sense?


 Upon further investigation, it is found that the delay to switch to
 external sort caused by a larger value of maintenance_work_mem is
 small compared to the total time of CREATE INDEX.

 If you are using trace_sort to report that, it reports the switch as
 happening as soon as it runs out of memory.

 At that point, all we have been doing is reading tuples into memory.  The
 time it takes to do that will depend on maintenance_work_mem, because
 that affects how many tuples fit in memory.  But all the rest of the
 tuples need to be read sooner or later anyway, so pushing more of them
 to later doesn't improve things overall; it just shifts timing around.

 After it reports the switch, it still needs to heapify the existing
 in-memory tuples before the tapesort proper can begin.  This is where
 the true lost opportunities start to arise, as the large heap starts
 driving cache misses which would not happen at all under different
 settings.

 Once the existing tuples are heapified, it then continues to use the
 heap to pop tuples from it to write out to tape, and to push newly
 read tuples onto it.  This also suffers lost opportunities.

 Once all the tuples are written out and it starts merging, then the
 large maintenance_work_mem is no longer a penalty as the new heap is
 limited by the number of tapes, which is almost always much smaller.
 In fact this stage will actually be faster, but not by enough to make
 up for the earlier slow down.

 So it is not surprising that the time before the switch is reported is
 a small part of the overall time difference.


So, is it the actual sorting (before merging) that suffers with larger
maintenance_work_mem? I am sorry but I can not grasp the complexity of
external sort code at this point, so all I can say is that during an
external sort a smaller value of maintenance_work_mem is beneficial
(based on my observations in tests). But how that follows from what is
going on in the implementation of external sort is still something I
am working on understanding.
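
(For reference, the phase breakdown can be watched directly with trace_sort;
a sketch, with placeholder settings and table name:)

SET trace_sort = on;
SET client_min_messages = 'log';   -- surface the LOG output in the session
SET maintenance_work_mem = '64MB';
CREATE INDEX t_v_idx ON t (v);
-- trace_sort reports the switch to external sort as soon as memory fills,
-- then each run written out, then the merge phase, so the extra time from
-- a larger setting can be attributed to a specific phase.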


-- 
Amit Langote




Re: [HACKERS] maintenance_work_mem and CREATE INDEX time

2013-07-23 Thread Amit Langote
On Wed, Jul 24, 2013 at 11:30 AM, Amit Langote amitlangot...@gmail.com wrote:
 On Wed, Jul 24, 2013 at 6:02 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 If you are using trace_sort to report that, it reports the switch as
 happening as soon as it runs out of memory.

 At that point, all we have been doing is reading tuples into memory.  The
 time it takes to do that will depend on maintenance_work_mem, because
 that affects how many tuples fit in memory.  But all the rest of the
 tuples need to be read sooner or later anyway, so pushing more of them
 to later doesn't improve things overall; it just shifts timing around.

 After it reports the switch, it still needs to heapify the existing
 in-memory tuples before the tapesort proper can begin.  This is where
 the true lost opportunities start to arise, as the large heap starts
 driving cache misses which would not happen at all under different
 settings.

 Once the existing tuples are heapified, it then continues to use the
 heap to pop tuples from it to write out to tape, and to push newly
 read tuples onto it.  This also suffers lost opportunities.

 Once all the tuples are written out and it starts merging, then the
 large maintenance_work_mem is no longer a penalty as the new heap is
 limited by the number of tapes, which is almost always much smaller.
 In fact this stage will actually be faster, but not by enough to make
 up for the earlier slow down.

 So it is not surprising that the time before the switch is reported is
 a small part of the overall time difference.


 So, is it the actual sorting (before merging) that suffers with larger
 maintenance_work_mem? I am sorry but I can not grasp the complexity of
 external sort code at this point, so all I can say is that during an
 external sort a smaller value of maintenance_work_mem is beneficial
 (based on my observations in tests). But how that follows from what is
 going on in the implementation of external sort is still something I
 am working on understanding.


Or whether the increased CREATE INDEX time follows from something else
altogether (not any part of the external sort at all) may still be another
question. Since we have to relate that to maintenance_work_mem, the first
thing I could think of was to look at the sorting part of it.



-- 
Amit Langote




[HACKERS] Re: Suggestion for concurrent index creation using a single full scan operation

2013-07-23 Thread Noah Misch
On Tue, Jul 23, 2013 at 01:06:26PM +0100, Tim Kane wrote:
 I haven't given this a lot of thought, but it struck me that when
 rebuilding tables (be it for a restore process, or some other operational
 activity) - there is more often than not a need to build an index or two,
 sometimes many indexes, against the same relation.
 
 It strikes me that in order to build just one index, we probably need to
 perform a full table scan (in a lot of cases).   If we are building
 multiple indexes sequentially against that same table, then we're probably
 performing multiple sequential scans in succession, once for each index.

Check.

 Could we architect a mechanism that allowed multiple index creation
 statements to execute concurrently, with all of their inputs fed directly
 from a single sequential scan against the full relation?
 
 From a language construct point of view, this may not be trivial to
 implement for raw/interactive SQL - but possibly this is a candidate for
 the custom format restore?

As Greg Stark mentioned, pg_restore can already issue index build commands in
parallel.  Where applicable, that's probably superior to having one backend
build multiple indexes during a single heap scan.  Index builds are
CPU-intensive, and the pg_restore approach takes advantage of additional CPU
cores in addition to possibly saving I/O.

However, the pg_restore method is not applicable if you want CREATE INDEX
CONCURRENTLY, and it's not applicable for implicit index building such as
happens for ALTER TABLE rewrites and for VACUUM FULL.  Backend-managed
concurrent index builds could shine there.

 I presume this would substantially increase the memory overhead required to
 build those indexes, though the performance gains may be advantageous.

The multi-index-build should respect maintenance_work_mem overall.  Avoiding
cases where that makes concurrent builds slower than sequential builds is a
key challenge for such a project:

- If the index builds each fit in maintenance_work_mem when run sequentially
  and some spill to disk when run concurrently, expect concurrency to lose.
- If the heap is small enough to stay in cache from one index build to the
  next, performing the builds concurrently is probably a wash or a loss.
- Concurrency should help when a wide-row table large enough to exhaust OS
  cache has narrow indexes that all fit in maintenance_work_mem.  I don't know
  whether concurrency would help for a huge-table scenario where the indexes
  do overspill maintenance_work_mem.  You would have N indexes worth of
  external merge files competing for disk bandwidth; that could cancel out
  heap I/O savings.

Overall, it's easy to end up with a loss.  We could punt by having an
index_build_concurrency GUC, much like pg_restore relies on the user to
discover a good -j value.  But if finding cases where concurrency helps is
too hard, leaving the GUC at one would become the standard advice.

 Apologies in advance if this is not the correct forum for suggestions..

It's the right forum.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] Shorter iterations of join_info_list

2013-07-23 Thread Tom Lane
[ sorry for slow response, this month has been mostly crazy ]

Antonin Houska antonin.hou...@gmail.com writes:
 As far as I understand, deconstruct_recurse() ensures that the
 SpecialJoinInfo of a new join always gets added at a higher position in
 join_info_list than the SJ infos of all joins located below the new join in
 the tree. I wonder if we can rely on that fact sometimes.

FWIW, I think of most of those planner lists as being unordered sets.
Depending on a particular ordering definitely adds fragility; so I'd
not want to introduce such a dependency without solid evidence of a
substantial speed improvement.  The cases you mention don't seem very
likely to offer any noticeable gain at all --- at least, I don't recall
seeing any of those functions show up high in profiles.

regards, tom lane




Re: [HACKERS] Proposal/design feedback needed: WITHIN GROUP (sql standard ordered set aggregate functions)

2013-07-23 Thread Andrew Gierth
Noah Misch said:
 Other aggregates based on this syntax might not desire such type unification.

Then there would have to be some way to distinguish that. Maybe those could
have -1 and the standard hypothetical set functions -2, with some flag in
CREATE AGGREGATE to sort it out.

 Having parse analysis do that distorts the character of an any argument.  I
 think the proper place for such processing is the first call to a transition
 function.

Except there isn't one.

 But let's not make the
 parser presume that an aggordnargs=-1 aggregate always wants its any
 arguments handled in the manner of the standard hypothetical set functions.

This has to happen in the parser because these are errors that should be
caught before execution:

rank(foo) within group (order by bar,baz)
rank(integercol) within group (order by textcol)

And collations have to be resolved (pairwise) before sorting can happen:

rank(textval COLLATE C) within group (order by foo)  -- sorts in C
rank(textval COLLATE C) within group (order by bar COLLATE en_US)  -- error

(basically, in rank(x) within group (order by y) where x and y are
collatable, the collation rules apply exactly as though you were doing
(x < y), with all the implicit vs. explicit stuff included)

And this:

rank(1.1) within group (order by intcol)

should become  rank(1.1) within group (order by intcol::numeric)
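
(For reference, a complete call of the form under discussion; the table and
column names are invented:)

SELECT rank(1.1) WITHIN GROUP (ORDER BY intcol) FROM tbl;
-- 1.1 and intcol are unified exactly as if the expression were (1.1 < intcol),
-- i.e. both resolve to numeric, which is where intcol::numeric above comes from.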

-- 
Andrew (irc:RhodiumToad)




Re: ALTER SYSTEM SET command to change postgresql.conf parameters (RE: [HACKERS] Proposal for Allow postgresql.conf values to be changed via SQL [review])

2013-07-23 Thread Amit Kapila
On Tuesday, July 23, 2013 5:06 AM Josh Berkus wrote:
 All,
 
 Christophe just discovered something with include files which is going
 to cause issues with ALTER SYSTEM SET.
 
 So, take as a hypothetical that you use the default postgresql.conf
 file, which sets shared_buffers = 32MB.
 
 Instead of editing this file, you do ALTER SYSTEM SET shared_buffers =
 '1GB', which updates config/postgresql.auto.conf.
 
 Then you restart PostgreSQL.
 
 Everything is hunky-dory, until a later occasion where you *reload*
 postgresql.  Then postgres startup hits the first shared_buffers=32MB
 (in postgresql.conf), says I can't change shared buffers on a reload,
 and aborts before ever getting to postgresql.auto.conf.

It doesn't abort after showing the initial message "parameter %s cannot be
changed without restart"; rather, it processes all remaining parameters.
We can even test this by setting one postmaster and one sighup variable:
after a reload, even though it logs a message for the postmaster variable,
it will still set the sighup variable.

So the real problem here is that there is a message in the log which can
create confusion; otherwise the behavior is exactly what the user wants,
per the Postgres specs.
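
(To spell the sequence out; the values are as in Josh's example and the auto
file is the one the patch writes:)

-- postgresql.conf still says:  shared_buffers = 32MB
ALTER SYSTEM SET shared_buffers = '1GB';  -- written to postgresql.auto.conf
-- ... restart, so shared_buffers is now 1GB ...
SELECT pg_reload_conf();
-- The reload logs the "cannot be changed without restart" message when it
-- hits the 32MB line, but it keeps going and still applies any sighup-able
-- settings from the files.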

To avoid the message, the following can be done:
1. Make sure that in ProcessConfigFile, before calling set_config_option(),
   the list of parameters is unique.
   a. This can be done while storing the elements in the list, as suggested
      in my mail yesterday:
      http://www.postgresql.org/message-id/004c01ce8761$b8a4ab20$29ee0160$@kap...@huawei.com
   b. Alternatively, after parsing all the config files, make sure the list
      contains only one entry for any variable (giving priority to elements
      that come later).
2. As this is just a confusing-message issue, we could simply update the
   docs to explain the behavior.


With Regards,
Amit Kapila.





Re: [HACKERS] maintenance_work_mem and CREATE INDEX time

2013-07-23 Thread Amit Langote
On Wed, Jul 24, 2013 at 3:20 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Mon, Jul 22, 2013 at 9:11 PM, Amit Langote amitlangot...@gmail.com wrote:
 Hello,

 While understanding the effect of maintenance_work_mem on the time taken
 by CREATE INDEX, I observed that for values of maintenance_work_mem below
 the threshold at which an internal sort is performed, the time taken by
 CREATE INDEX increases as maintenance_work_mem increases.

 My guess is that for all those values an external sort is chosen at some
 point, and the larger the value of maintenance_work_mem, the later the
 switch to external sort is made, causing CREATE INDEX to take longer.
 That is, a smaller value of maintenance_work_mem would be preferable when
 an external sort is going to be performed anyway. Does that make sense?

 The heap structure used in external sorts is cache-unfriendly.  The
 bigger the heap used, the more this unfriendliness becomes apparent.
 And the bigger maintenance_work_mem, the bigger the heap used.

 The bigger heap also means you have fewer runs to merge in the
 external sort.  However, as long as the number of runs still fits in
 the same number of merge passes, this is generally not a meaningful
 difference.

Do fewer runs mean more time (in whichever phase of the external sort)?


-- 
Amit Langote

