[HACKERS] Many processes blocked at ProcArrayLock

2014-12-02 Thread Xiaoyulei
Test configuration:
Hardware:
4-socket Intel server, 60 cores / 120 hardware threads.
Memory: 512GB
SSD: 2.4TB

PG:
max_connections = 160   # (change requires restart)
shared_buffers = 32GB   
work_mem = 128MB
maintenance_work_mem = 32MB 
bgwriter_delay = 100ms  # 10-10000ms between rounds
bgwriter_lru_maxpages = 200 # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 2.0   # 0-10.0 multiplier on buffers scanned/round
wal_level = minimal # minimal, archive, or hot_standby
wal_buffers = 256MB # min 32kB, -1 sets based on shared_buffers
autovacuum = off
checkpoint_timeout=60min
checkpoint_segments = 1000
archive_mode = off
synchronous_commit = off
fsync = off
full_page_writes = off  


We used TPC-C and pgbench to test PostgreSQL 9.4beta2 performance, and we found
that the tps/tpmC does not increase as the number of terminals increases.
Detailed information is in the attachment.

Many processes are blocked. I dumped the call stacks and found that these
processes are blocked on ProcArrayLock: about 60% of them are in
ProcArrayEndTransaction waiting for ProcArrayLock in EXCLUSIVE mode, and about
20% are in GetSnapshotData waiting for it in SHARED mode. Other locks, such as
WALInsertLock and the locking done in XLogFlush, are not very heavily contended.
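
For what it's worth, LWLock waits such as ProcArrayLock are not visible from
SQL on 9.4 (pg_stat_activity.waiting covers only heavyweight locks), so
call-stack sampling as described above is the right tool. A rough picture of
overall backend saturation can still be had with something like:

-- 9.4: "waiting" reflects heavyweight-lock waits only; backends stuck on an
-- LWLock such as ProcArrayLock show up as "active" but make little progress.
SELECT state, waiting, count(*)
FROM pg_stat_activity
GROUP BY state, waiting
ORDER BY count(*) DESC;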

Is there any way to solve this problem?




Re: [HACKERS] using Core Foundation locale functions

2014-12-02 Thread Peter Geoghegan
On Fri, Nov 28, 2014 at 8:43 AM, Peter Eisentraut  wrote:
> At the moment, this is probably just an experiment that shows where
> refactoring and better abstractions might be suitable if we want to
> support multiple locale libraries.  If we want to pursue ICU, I think
> this could be a useful third option.

FWIW, I think that the richer API that ICU provides for string
transformations could be handy in optimizing sorting using abbreviated
keys. For example, ICU will happily only produce parts of sort keys
(the equivalent of strxfrm() blobs) if that is all that is required
[1].

I think that ICU also allows clients to parse individual primary
weights in a principled way (primary weights tend to be isomorphic to
the Unicode code points in the original string). I think that this
will enable order-preserving compression of the type anticipated by
the Unicode collation algorithm [2]. That could be useful for certain
languages, like Russian, where the primary weight level usually
contains multi-byte code points with glibc's strxfrm() (this is
generally not true of languages that use the Latin alphabet, or of
East Asian languages).

Note that there is already naturally a form of what you might call
compression with strxfrm() [3]. This is very useful for abbreviated
keys.

[1] http://userguide.icu-project.org/collation/architecture
[2] http://www.unicode.org/reports/tr10/#Run-length_Compression
[3] 
http://www.postgresql.org/message-id/cam3swztywe5j69tapvzf2cm7mhskke3uhhnk9gluqckkwqo...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-02 Thread Tomas Vondra

On 2014-12-02 02:52, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 2.12.2014 01:33, Tom Lane wrote:
>>> What I suspect you're looking at here is the detritus of creation of
>>> a huge number of memory contexts. mcxt.c keeps its own state about
>>> existing contexts in TopMemoryContext. So, if we posit that those
>>> contexts weren't real small, there's certainly room to believe that
>>> there was some major memory bloat going on recently.
>>
>> Aha! MemoryContextCreate allocates the memory for the new context from
>> TopMemoryContext explicitly, so that it survives resets of the parent
>> context. Is that what you had in mind by keeping state about existing
>> contexts?
>
> Right.
>
>> That'd probably explain the TopMemoryContext size, because array_agg()
>> creates a separate context for each group. So if you have 1M groups, you
>> have 1M contexts. Although I don't see how the size of those contexts
>> would matter?
>
> Well, if they're each 6K, that's your 6GB right there.

Yeah, but this memory should be freed after the query finishes, no?

>> Maybe we could move this info (list of child contexts) to the parent
>> context somehow, so that it's freed when the context is destroyed?
>
> We intentionally didn't do that, because in many cases it'd result in
> parent contexts becoming nonempty when otherwise they'd never have
> anything actually in them.  The idea was that such "shell" parent contexts
> should be cheap, requiring only a control block in TopMemoryContext and
> not an actual allocation arena.  This idea of a million separate child
> contexts was never part of the design of course; we might need to rethink
> whether that's a good idea.  Or maybe there need to be two different
> policies about where to put child control blocks.

Maybe. For me, the 130MB is not really a big deal, because for this to
happen there really need to be many child contexts at the same time,
consuming much more memory. With 6.5GB consumed in total, 130MB amounts
to ~2%, which is negligible. Unless we can fix the RSS bloat.

>> Also, this explains the TopMemoryContext size, but not the RSS size
>> (or am I missing something)?
>
> Very possibly you're left with "islands" that prevent reclaiming very
> much of the peak RAM usage.  It'd be hard to be sure without some sort
> of memory map, of course.

Yes, that's something I was thinking about too - I believe what happens
is that allocations of info in TopMemoryContext and the actual contexts
are interleaved, and at the end only the memory contexts are deleted. The
blocks allocated in TopMemoryContext are kept, creating the "islands".

If that's the case, allocating the child context info within the parent
context would solve this, because these pieces would be reclaimed with the
rest of the parent memory.

But then again, there are probably other ways to create such islands
(e.g. allocating an additional block in a long-lived context while the
child contexts exist).

regards
Tomas
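
For reference, the workload shape we are discussing - array_agg() over on the
order of a million groups, each group getting its own per-group memory
context - can be sketched like this (table name and sizes are made up):

CREATE TABLE events (group_id int, payload int);

INSERT INTO events
SELECT g, i
FROM generate_series(1, 1000000) g,
     generate_series(1, 5) i;

-- ~1M groups => ~1M transient per-group contexts created by array_agg(),
-- each of which also leaves a small control block in TopMemoryContext.
SELECT group_id, array_agg(payload)
FROM events
GROUP BY group_id;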




[HACKERS] Serialization exception : Who else was involved?

2014-12-02 Thread Olivier MATROT
Hello,

 

I'm using PostgreSQL 9.2.8 on Windows from a .NET application using
Npgsql.

I'm working in the Radiology Information System field.

 

We have thousands of users against a big accounting database.

We're using the SERIALIZABLE isolation level to ensure data consistency.

 

Because of the large number of users, and probably because of the
database design, we're facing serialization exceptions, so we retry our
transactions.

So far so good.
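
For context, the kind of conflict we catch and retry can be reproduced with a
minimal sketch like this (table name and values are made up):

CREATE TABLE ledger (id int PRIMARY KEY, amount numeric);

-- Session 1:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(amount) FROM ledger;

-- Session 2:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT sum(amount) FROM ledger;
INSERT INTO ledger VALUES (2, 100.00);
COMMIT;

-- Session 1 again:
INSERT INTO ledger VALUES (1, 100.00);
COMMIT;

-- One of the two transactions is aborted with SQLSTATE 40001:
--   ERROR:  could not serialize access due to read/write dependencies
--           among transactions
-- That 40001 error is what the application traps before retrying.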

 

I was wondering if there was a log level in PostgreSQL that could tell
me which query was the trigger of a doomed transaction.

The goal is to understand the failures to improve the database and
application designs.

 

I pushed the logs to the DEBUG5 level with no luck.

 

After carefully reviewing the documentation, it seems that there was
nothing.

So I downloaded the code and looked at it.

 

Serialization conflict detection is done in
src/backend/storage/lmgr/predicate.c, where transactions that are doomed
to fail are marked as such with the SXACT_FLAG_DOOMED flag.

 

I simply added elog(...) calls at the NOTICE level each time the flag
is set, compiled the code and gave it a try.

 

The results are amazing for me, because this simple modification allows
me to know which query is marking other running transactions to fail.

I'm pretty sure that in the production environment of our major
customers, there should be no more than a few transactions involved.

 

I would like to see this useful and simple addition in a future version
of PostgreSQL.

Is this in the spirit of what is done to ease the work of developers?

Maybe the level I've chosen is not appropriate?

 

Please let me know what you think.

 

Kind Regards.

 

Olivier.

 



Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-02 Thread Stephen Frost
Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
> Ok.  Though, this would affect how CATUPDATE is handled.  Peter Eisentraut
> previously raised a question about whether superuser checks should be
> included with catupdate which led me to create the following post.
> 
> http://www.postgresql.org/message-id/cakrt6cqovt2kiykg2gff7h9k8+jvu1149zlb0extkkk7taq...@mail.gmail.com
> 
> Certainly, we could keep has_rolcatupdate for this case and put the
> superuser check in role_has_attribute, but it seems like it might be worth
> taking a look at whether a superuser can bypass catupdate or not.  Just a
> thought.

My recollection matches the documentation- rolcatupdate should be
required to update the catalogs.  The fact that rolcatupdate is set by
AlterUser to match rolsuper is an interesting point and one which we
might want to reconsider, but that's beyond the scope of this patch.

Ergo, I'd suggest keeping has_rolcatupdate, but have it do the check
itself directly instead of calling down into role_has_attribute().

There's an interesting flip side to that though, which is the question
of what to do with pg_roles and psql.  Based on the discussion this far,
it seems like we'd want to keep the distinction for pg_roles and psql
based on what bits have explicitly been set rather than what's actually
checked for.  As such, I'd have one other function-
check_has_attribute() which *doesn't* have the superuser allow-all check
and is what is used in pg_roles and by psql.  I'd expose both functions
at the SQL level.

> Ok.  I had originally thought for this patch that I would try to minimize
> these types of changes, though perhaps this is that opportunity previously
> mentioned in refactoring those.  However, the catupdate question still
> remains.

It makes sense to me, at least, to include removing those individual
attribute functions in this patch.

> I have no reason for one over the other, though I did ask myself that
> question.  I did find it curious that in some cases there is "has_X" and
> then in others "pg_has_X".  Perhaps I'm not looking in the right places,
> but I haven't found anything that helps to distinguish when one vs the
> other is appropriate (even if it is a general rule of thumb).

Given that we're changing things anyway, it seems to me that the pg_
prefix makes sense.

> Yes, we were, however the latter causes a syntax error with initdb. :-/

Ok, then just stuff the 255 back there and add a comment about why it's
required and mention that cute tricks to calculate the value won't work.

Thanks!

Stephen




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-02 Thread Tomas Vondra
On 2 December 2014, 10:59, Tomas Vondra wrote:
> On 2014-12-02 02:52, Tom Lane wrote:
>> Tomas Vondra  writes:
>>
>>> Also, this explains the TopMemoryContext size, but not the RSS size
>>> (or am I missing something)?
>>
>> Very possibly you're left with "islands" that prevent reclaiming very
>> much of the peak RAM usage.  It'd be hard to be sure without some sort
>> of memory map, of course.
>
> Yes, that's something I was thinking about too - I believe what happens
> is that allocations of info in TopMemoryContext and the actual contexts
> are interleaved, and at the end only the memory contexts are deleted.
> The blocks allocated in TopMemoryContexts are kept, creating the
> "islands".
>
> If that's the case, allocating the child context info within the parent
> context would solve this, because these pieces would be reclaimed with
> the rest of the parent memory.

On second thought, I'm not sure this explains the continuous increase of
consumed memory. When the first iteration consumes 2,818g of memory, why
should the following iterations consume significantly more? The allocation
patterns should be (almost) exactly the same, reusing the already
allocated memory (either at the system or TopMemoryContext level).

regards
Tomas





[HACKERS] pg_stat_statement normalization fails due to temporary tables

2014-12-02 Thread Andres Freund
Hi,

pg_stat_statement's query normalization fails when temporary tables are
used in a query. That's because JumbleRangeTable() uses the relid in the
RTE to build the query fingerprint. I think in this case pgss from 9.1
actually would do better than 9.2+ as the hash lookup previously didn't
use the relid.
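
A minimal way to see the effect (table name is made up; pg_stat_statements
must be installed; run the first two statements in several sessions):

-- in each session:
CREATE TEMP TABLE scratch (id int);
SELECT count(*) FROM scratch WHERE id = 1;

-- then, from any session:
SELECT query, calls
FROM pg_stat_statements
WHERE query LIKE '%FROM scratch%';
-- Each session gets its own entry for the "same" query, because the temp
-- table receives a fresh OID per session and that OID goes into the
-- fingerprint.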

I don't really have a good idea about fixing this though. The best thing
that comes to mind is simply use eref->aliasname for the
disambiguation...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Alex Shulgin

Michael Paquier  writes:

> On Tue, Dec 2, 2014 at 1:58 AM, Alex Shulgin  wrote:
>> Here's the patch rebased against current HEAD, that is including the
>> recently committed action_at_recovery_target option.
> If this patch gets in, it gives a good argument to jump to 10.0 IMO.
> That's not a bad thing, only the cost of making recovery params as
> GUCs which is still a feature wanted.
>
>> The default for the new GUC is 'pause', as in HEAD, and
>> pause_at_recovery_target is removed completely in favor of it.
> Makes sense. Another idea that popped out was to rename this parameter
> as recovery_target_action as well, but that's not really something
> this patch should care about.

Indeed, but changing the name after the fact is straightforward.

>> I've also taken the liberty to remove that part that errors out when
>> finding $PGDATA/recovery.conf.
> I am not in favor of this part. It may be better to let the users know
> that their old configuration is not valid anymore with an error. This
> patch cuts in the flesh with a huge axe, let's be sure that users do
> not ignore the side pain effects, or recovery.conf would be simply
> ignored and users would not be aware of that.

Yeah, that is a good point.

I'd be in favor of a solution that works the same way as before the
patch, without the need for extra trigger files, etc., but that doesn't
seem to be nearly possible.  Whatever tricks we might employ will likely
be defeated by the fact that the old-school user will fail to *include*
recovery.conf in the main conf file.

--
Alex




Re: [HACKERS] [v9.5] Custom Plan API

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 3:44 AM, Kouhei Kaigai  wrote:
> Today, I had a talk with Hanada-san to clarify which parts can be common
> between them and how to implement it. We concluded that both features can
> share most of the infrastructure.
> Let me put an introduction of join replacement by foreign-/custom-scan below.
>
> The overall design intends to inject a foreign-/custom-scan node in place of
> the built-in join logic (based on the estimated cost). From the viewpoint of
> the core backend, it looks like a sub-query scan that performs the relation
> join internally.
>
> What we need to do is below:
>
> (1) Add a hook in add_paths_to_joinrel()
> It gives extensions (including FDW drivers and custom-scan providers) a chance
> to add alternative paths for a particular join of relations, using
> ForeignScanPath or CustomScanPath, if they can run instead of the built-in ones.
>
> (2) Inform the core backend of the varno/varattno mapping
> One thing we need to pay attention to is that a foreign-/custom-scan node that
> runs instead of the built-in join node must return a mixture of values coming
> from both relations. When the FDW driver fetches a remote record (or a record
> computed by an external computing resource), the most reasonable way is to
> store it in the ecxt_scantuple of the ExprContext, then kick off projection
> with varnodes that reference this slot.
> It needs an infrastructure that tracks the relationship between the original
> varnode and the alternative varno/varattno. We thought it should be mapped to
> INDEX_VAR and a virtual attribute number that references ecxt_scantuple
> naturally, and this infrastructure is quite helpful for both ForeignScan and
> CustomScan.
> We'd like to add a List *fdw_varmap/*custom_varmap variable to both plan nodes.
> It contains the list of the original Var nodes that shall be mapped to the
> position given by the list index (e.g., the first varnode becomes
> varno=INDEX_VAR and varattno=1).
>
> (3) Reverse mapping on EXPLAIN
> For EXPLAIN support, the above varnodes on the pseudo relation scan need to be
> resolved. All we need to do is initialize dpns->inner_tlist in
> set_deparse_planstate() according to the above mapping.
>
> (4) The case of scanrelid == 0
> To skip opening/closing (foreign) tables, we need a mark that tells the
> backend not to initialize the scan node according to the table definition,
> but according to the pseudo varnode list.
> As the earlier custom-scan patch does, scanrelid == 0 is a straightforward
> mark to show that the scan node is not associated with a particular real
> relation. So we also need to add special-case handling around the
> foreign-/custom-scan code.
>
> We expect the above changes are small enough to implement basic join push-down
> functionality (one that does not involve external computation of complicated
> expression nodes), but valuable enough to support in v9.5.
>
> Please comment on the proposition above.

I don't really have any technical comments on this design right at the
moment, but I think it's an important area where PostgreSQL needs to
make some progress sooner rather than later, so I hope that we can get
something committed in time for 9.5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_stat_statement normalization fails due to temporary tables

2014-12-02 Thread Tom Lane
Andres Freund  writes:
> pg_stat_statement's query normalization fails when temporary tables are
> used in a query. That's because JumbleRangeTable() uses the relid in the
> RTE to build the query fingerprint. I think in this case pgss from 9.1
> actually would do better than 9.2+ as the hash lookup previously didn't
> use the relid.

> I don't really have a good idea about fixing this though. The best thing
> that comes to mind is simply use eref->aliasname for the
> disambiguation...

Hmm ... by "fails" I suppose you mean "doesn't treat two different
instances of the same temp table name as the same"?  I'm not sure
that's a bug.

If we go over to using the aliasname then "select x from foo a" and
"select x from bar a" would be treated as the same query, which
clearly *is* a bug.  More generally, I don't think that schema1.tab
and schema2.tab should be considered the same table for this purpose.
So it's hard to see how to do much better than using the OID.

regards, tom lane




Re: [HACKERS] pg_stat_statement normalization fails due to temporary tables

2014-12-02 Thread Andres Freund
On 2014-12-02 09:59:00 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > pg_stat_statement's query normalization fails when temporary tables are
> > used in a query. That's because JumbleRangeTable() uses the relid in the
> > RTE to build the query fingerprint. I think in this case pgss from 9.1
> > actually would do better than 9.2+ as the hash lookup previously didn't
> > use the relid.
> 
> > I don't really have a good idea about fixing this though. The best thing
> > that comes to mind is simply use eref->aliasname for the
> > disambiguation...
> 
> Hmm ... by "fails" I suppose you mean "doesn't treat two different
> instances of the same temp table name as the same"?  I'm not sure
> that's a bug.

Well, it did work < 9.2...

> If we go over to using the aliasname then "select x from foo a" and
> "select x from bar a" would be treated as the same query, which
> clearly *is* a bug.

Yep. The fully qualified name + alias would probably work - but IIRC
isn't available in the RTE without further syscache lookups...

>  More generally, I don't think that schema1.tab
> and schema2.tab should be considered the same table for this purpose.
> So it's hard to see how to do much better than using the OID.

Well, from a practical point of view it's highly annoying if half of
pg_stat_statement suddenly consists of the same query, just issued
in a different session.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] test_shm_mq failing on mips*

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 10:42 AM, Christoph Berg  wrote:
> Re: To Robert Haas 2014-11-24 <20141124200824.ga22...@msg.df7cb.de>
>> > Does it fail every time when run on a machine where it fails sometimes?
>>
>> So far there's a consistent host -> fail-or-not mapping, but most
>> mips/mipsel build hosts have seen only one attempt so far that
>> actually got far enough to run the shm_mq test.
>
> I got the build rescheduled on the same machine and it's hanging
> again.
>
>> Atm I don't have access to the boxes where it was failing (the builds
>> succeed on the mips(el) porter hosts available to Debian developers).
>> I'll see if I can arrange access there and run a test.
>
> Julien Cristau was so kind to poke into the hanging processes. The
> build has been stuck now for about 4h, while that postgres backend has
> only consumed 4s of CPU time (according to plain "ps"). The currently
> executing query is:
>
> SELECT test_shm_mq_pipelined(16384, (select 
> string_agg(chr(32+(random()*95)::int), '') from generate_series(1,27)), 
> 200, 3);
>
> (Waiting f, active, backend_start 6s older than xact_start, xact_start
> same as query_start, state_change 19s newer, all 4h old)

I can't tell from this exactly what's going wrong.  Questions:

1. Are there any background worker processes running at the same time?
 If so, how many?  We'd expect to see 3.
2. Can we get a printout of the following variables in stack frame 3
(test_shm_mq_pipelined)?  send_count, loop_count, *outqh, *inqh,
inqh->mqh_queue[0], outqh->mqh_queue[0]
3. What does a backtrace of each background worker process look like?
If they are stuck inside copy_messages(), can you print *outqh, *inqh,
inqh->mqh_queue[0], outqh->mqh_queue[0] from that stack frame?

Sorry for the hassle; I just don't have a better idea how to debug this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Nitpicky doc corrections for BRIN functions of pageinspect

2014-12-02 Thread Alvaro Herrera
Peter Geoghegan wrote:
> On Wed, Nov 19, 2014 at 8:02 PM, Michael Paquier
>  wrote:
> > Just a small thing I noticed while looking at pageinspect.sgml, the
> > set of SQL examples related to BRIN indexes uses lower-case characters
> > for reserved keywords. This has been introduced by 7516f52.
> > Patch is attached.
> 
> My nitpicky observation about pageinspect + BRIN was that the comments
> added to the pageinspect SQL file for the SQL-callable function
> brin_revmap_data() have a comment header style slightly inconsistent
> with the existing items.

Pushed.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] On partitioning

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 8:20 PM, Amit Langote
 wrote:
>> Before going too much further with this I'd mock up schemas for your
>> proposed catalogs and a list of DDL operations to be supported, with
>> the corresponding syntax, and float that here for comment.

More people should really comment on this.  This is a pretty big deal
if it goes forward, so it shouldn't be based on what one or two people
think.

> * Catalog schema:
>
> CREATE TABLE pg_catalog.pg_partitioned_rel
> (
>    partrelid    oid        NOT NULL,
>    partkind     oid        NOT NULL,
>    partissub    bool       NOT NULL,
>    partkey      int2vector NOT NULL, -- partitioning attributes
>    partopclass  oidvector,
>
>    PRIMARY KEY (partrelid, partissub),
>    FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
>    FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
> )
> WITHOUT OIDS ;

So, we're going to support exactly two levels of partitioning?
partitions with partissub=false and subpartitions with partissub=true?
 Why not support only one level of partitioning here but then let the
children have their own pg_partitioned_rel entries if they are
subpartitioned?  That seems like a cleaner design and lets us support
an arbitrary number of partitioning levels if we ever need them.

> CREATE TABLE pg_catalog.pg_partition_def
> (
>    partitionid          oid       NOT NULL,
>    partitionparentrel   oid       NOT NULL,
>    partitionisoverflow  bool      NOT NULL,
>    partitionvalues      anyarray,
>
>    PRIMARY KEY (partitionid),
>    FOREIGN KEY (partitionid) REFERENCES pg_class(oid)
> )
> WITHOUT OIDS;
>
> ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;

What is an overflow partition and why do we want that?

What are you going to do if the partitioning key has two columns of
different data types?

> * DDL syntax (no multi-column partitioning, sub-partitioning support as yet):
>
> -- create partitioned table and child partitions at once.
> CREATE TABLE parent (...)
> PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
> [ (
>  PARTITION child
>{
>VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
>  | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
>}
>[ WITH ( ... ) ] [ TABLESPACE tbs ]
>  [, ...]
>   ) ] ;

How are you going to dump and restore this, bearing in mind that you
have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
wants to do pg_dump --table name_of_a_partition?

I actually think it will be much cleaner to declare the parent first
and then have separate CREATE TABLE statements that glue the children
in, like CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1,
1).
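
Spelled out, that declare-parent-first style might look like this
(hypothetical syntax, purely to illustrate the proposal):

-- hypothetical, not implemented:
CREATE TABLE measurement (logdate date, value int)
PARTITION BY RANGE (logdate);

CREATE TABLE measurement_y2014 PARTITION OF measurement
VALUES LESS THAN ('2015-01-01');

CREATE TABLE measurement_y2015 PARTITION OF measurement
VALUES LESS THAN ('2016-01-01');

-- and pg_dump --table measurement_y2014 then has an obvious, self-contained
-- statement to emit for that one partition.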

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Robert Haas
On Wed, Nov 26, 2014 at 10:12 AM, Stephen Frost  wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> In the context at hand, I think most of the messages in question are
>> currently phrased like "must be superuser to do X".  I'd be fine with
>> changing that to "permission denied to do X", but not to just
>> "permission denied".
>
> Apologies for the terseness of my (earlier) reply.  This is exactly what
> I'm suggesting.  What was in the patch was this change:
>
> ! ERROR:  must be superuser or replication role to use replication slots
>
> ---
>
> ! ERROR:  permission denied to use replication slots
> ! HINT:  You must be superuser or replication role to use replication slots.

Your proposed change takes two lines to convey the same amount of
information that we are currently conveying in one line.  How is that
better?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] 9.2 recovery/startup problems

2014-12-02 Thread Robert Haas
On Wed, Nov 26, 2014 at 7:13 PM, Jeff Janes  wrote:
> If I do a pg_ctl stop -mf, then both files go away.  If I do a pg_ctl stop
> -mi, then neither goes away.  It is only with the /sbin/reboot that I get
> the fatal combination of _init being gone but the other still present.

Eh?  That sounds wonky.

I mean, reboot normally kills processes with SIGTERM or SIGKILL, in
which case I'd expect the outcome to match what you get with pg_ctl
stop -mf or pg_ctl stop -mi.  The only way I can see that you'd get a
different behavior is if you did a hard reboot (like echo b >
/proc/sysrq-trigger); if that changes things, then we might have a
missing-fsync bug.  How is that reboot managing to leave the main fork
behind while losing the init fork?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] memory explosion on planning complex query

2014-12-02 Thread Robert Haas
On Wed, Nov 26, 2014 at 7:24 PM, Andrew Dunstan  wrote:
> On 11/26/2014 05:00 PM, Andrew Dunstan wrote:
>> Attached is some anonymized DDL for a fairly complex schema from a
>> PostgreSQL Experts client. Also attached is an explain query that runs
>> against the schema. The client's problem is that in trying to run the
>> explain, Postgres simply runs out of memory. On my untuned 9.3 test rig
>> (Scientific Linux 6.4 with 24Gb of RAM and 24Gb of swap) vmstat clearly
>> shows the explain chewing up about 7Gb of memory. When it's done the free
>> memory jumps back to where it was. In a similar case on the client's test rig
>> we saw memory use jump lots more.
>>
>> The client's question is whether this is not a bug. It certainly seems
>> like it should be possible to plan a query without chewing up this much
>> memory, or at least to be able to limit the amount of memory that can be
>> grabbed during planning. Going from humming along happily to OOM conditions
>> all through running "explain " is not very friendly.
>>
>
> Further data point - thanks to Andrew Gierth (a.k.a. RhodiumToad) for
> pointing this out. The query itself grabs about 600Mb to 700Mb to run,
> whereas the EXPLAIN takes vastly more - on my system 10 times more. Surely
> that's not supposed to happen?

Hmm.  So you can run the query but you can't EXPLAIN it?

That sounds like it could well be a bug, but I'm thinking you might
have to instrument palloc() to find out where all of that space is
being allocated to figure out why it's happening - or maybe connect
gdb to the server while the EXPLAIN is chewing up memory and pull some
backtraces to figure out what section of code it's stuck in.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] dblink_get_connections() result for no connections

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 3:49 PM, Tom Lane  wrote:
> While fooling around with the array_agg(anyarray) patch, I noted
> that dblink_get_connections() returns NULL if there are no active
> connections.  It seems like an empty array might be a saner
> definition --- thoughts?

I don't think it matters much.  I wouldn't break backward
compatibility to change it from this behavior to that one, and if we'd
picked the other one I wouldn't break compatibility to change it to
this behavior.  Anybody who cares can wrap a COALESCE() around it.
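
For example, something along these lines (assuming the dblink extension is
installed):

-- treat "no connections" as an empty array rather than NULL:
SELECT coalesce(dblink_get_connections(), '{}'::text[]) AS connections;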

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] copy.c handling for RLS is insecure

2014-12-02 Thread Robert Haas
On Thu, Nov 27, 2014 at 2:03 AM, Stephen Frost  wrote:
> Alright, I've done the change to use the RangeVar from CopyStmt, but
> also added a check wherein we verify that the relation's OID returned
> from the planned query is the same as the relation's OID that we did the
> RLS check on- if they're different, we throw an error.  Please let me
> know if there are any remaining concerns.

That's clearly an improvement, but I'm not sure it's water-tight.
What if the name that originally referenced a table ended up
referencing a view?  Then you could get
list_length(plan->relationOids) != 1.

(And, in that case, I also wonder if you could get
eval_const_expressions() to do evil things on your behalf while
planning.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] test_shm_mq failing on mips*

2014-12-02 Thread Christoph Berg
Re: Robert Haas 2014-12-02 

> I can't tell from this exactly what's going wrong.  Questions:
> 
> 1. Are there any background worker processes running at the same time?
>  If so, how many?  We'd expect to see 3.
> 2. Can we get a printout of the following variables in stack frame 3
> (test_shm_mq_pipelined)?  send_count, loop_count, *outqh, *inqh,
> inqh->mqh_queue[0], outqh->mqh_queue[0]
> 3. What does a backtrace of each background worker process look like?
> If they are stuck inside copy_messages(), can you print *outqh, *inqh,
> inqh->mqh_queue[0], outqh->mqh_queue[0] from that stack frame?
> 
> Sorry for the hassle; I just don't have a better idea how to debug this.

No problem, I'm aware this isn't an easy target.

I didn't have access to the build env myself, I'm not sure if I can
find someone to retry the test there. I'll let you know when I have
more data, but that will take a while (if at all :-/).

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/




Re: [HACKERS] initdb: Improve error recovery.

2014-12-02 Thread Robert Haas
On Thu, Nov 27, 2014 at 7:28 AM, Mats Erik Andersson  
wrote:
> I would like to improve error recovery of initdb when the
> password file is empty. The present code declares "Error 0"
> as the cause of failure. It suffices to use ferror(), since
> fgets() returns NULL also at a premature EOF.
>
> In addition, a minor point: a line feed is needed so that
> the error message is printed on a line of its own.
>
> Best regards,
>   Mats Erik Andersson

Please add your patch here so that it doesn't get forgotten about:

https://commitfest.postgresql.org/action/commitfest_view/open

Also, note that the PostgreSQL project prefers for patches to be
attached rather than inline, as mailers have a tendency to mangle
them.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Nov 26, 2014 at 10:12 AM, Stephen Frost  wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> In the context at hand, I think most of the messages in question are
> >> currently phrased like "must be superuser to do X".  I'd be fine with
> >> changing that to "permission denied to do X", but not to just
> >> "permission denied".
> >
> > Apologies for the terseness of my (earlier) reply.  This is exactly what
> > I'm suggesting.  What was in the patch was this change:
> >
> > ! ERROR:  must be superuser or replication role to use replication slots
> >
> > ---
> >
> > ! ERROR:  permission denied to use replication slots
> > ! HINT:  You must be superuser or replication role to use replication slots.
> 
> Your proposed change takes two lines to convey the same amount of
> information that we are currently conveying in one line.  How is that
> better?

It includes the actual error message, unlike what we have today, which is
guidance as to what's required to get past the permission denied error.

In other words, I disagree that the same amount of information is being
conveyed.

Thanks,

Stephen




Re: [HACKERS] copy.c handling for RLS is insecure

2014-12-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Thu, Nov 27, 2014 at 2:03 AM, Stephen Frost  wrote:
> > Alright, I've done the change to use the RangeVar from CopyStmt, but
> > also added a check wherein we verify that the relation's OID returned
> > from the planned query is the same as the relation's OID that we did the
> > RLS check on- if they're different, we throw an error.  Please let me
> > know if there are any remaining concerns.
> 
> That's clearly an improvement, but I'm not sure it's water-tight.
> What if the name that originally referenced a table ended up
> referencing a view?  Then you could get
> list_length(plan->relationOids) != 1.

I'll test it out and see what happens.  Certainly a good question and
if there's an issue there then I'll get it addressed.

> (And, in that case, I also wonder if you could get
> eval_const_expressions() to do evil things on your behalf while
> planning.)

If it can be made to reference a view then there's an issue, as the view
might itself include a function call that is provided by the attacker.
I'm not sure that we have to really worry about anything more
complicated than that.

Clearly, if we found a relation originally then we need that same
relation with the same OID after the conversion to a query.

Thanks,

Stephen




Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 11:29 AM, Stephen Frost  wrote:
> * Robert Haas (robertmh...@gmail.com) wrote:
>> On Wed, Nov 26, 2014 at 10:12 AM, Stephen Frost  wrote:
>> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> >> In the context at hand, I think most of the messages in question are
>> >> currently phrased like "must be superuser to do X".  I'd be fine with
>> >> changing that to "permission denied to do X", but not to just
>> >> "permission denied".
>> >
>> > Apologies for the terseness of my (earlier) reply.  This is exactly what
>> > I'm suggesting.  What was in the patch was this change:
>> >
>> > ! ERROR:  must be superuser or replication role to use replication slots
>> >
>> > ---
>> >
>> > ! ERROR:  permission denied to use replication slots
>> > ! HINT:  You must be superuser or replication role to use replication 
>> > slots.
>>
>> Your proposed change takes two lines to convey the same amount of
>> information that we are currently conveying in one line.  How is that
>> better?
>
> It includes the actual error message, unlike what we have today, which is
> guidance as to what's required to get past the permission denied error.
>
> In other words, I disagree that the same amount of information is being
> conveyed.

So, you think that someone might see the message "must be superuser or
replication role to use replication slots" and fail to understand that
they have a permissions problem?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Add CREATE support to event triggers

2014-12-02 Thread Robert Haas
> Normally you would replicate between an older master and a newer
> replica, so this shouldn't be an issue.  I find it unlikely that we
> would de-support some syntax that works in an older version: it would
> break pg_dump, for one thing.

The most common way in which we break forward-compatibility is by
reserving additional keywords.  Logical replication can deal with this
by quoting all identifiers, so it's not a big issue.
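
For example, quote_ident() keeps an identifier usable even if it collides
with a (possibly newly) reserved keyword:

SELECT quote_ident('ordinary_name') AS a, quote_ident('select') AS b;
--        a        |    b
--  ordinary_name  | "select"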

It's not the only possible issue; it is certainly conceivable that we
could change something in the server and then try to change pg_dump to
compensate.  I think we wouldn't do that with anything very big, but
suppose spgist were supplanted by a new and better access method
tsigps.  Well, we might figure that there are few enough people
accustomed to the current syntax that we can get away with hacking the
pg_dump output, and yes, anyone with an existing dump from an older
system will have problems, but there won't be many of them, and they
can always adjust the dump by hand.  If we ever decided to do such a
thing, whatever syntax transformation we did in pg_dump would have to
be mimicked by any logical replication solution.

I think it's legitimate to say that this could be a problem, but I
think it probably won't be a problem very often, and I think when it
is a problem it probably won't be a very big problem, because we
already have good reasons not to break the ability to restore older
dumps on newer servers more often than absolutely necessary.

One of the somewhat disappointing things about this is that we're
talking about adding a lot of new deparsing code to the server that
probably, in some sense, duplicates pg_dump.  Perhaps in an ideal
world there would be a way to avoid that, but in the world we really
live in, there probably isn't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] HINT: pg_hba.conf changed since last config reload

2014-12-02 Thread Robert Haas
On Thu, Nov 27, 2014 at 8:49 AM, Bruce Momjian  wrote:
> On Thu, Nov  6, 2014 at 05:46:42PM -0500, Peter Eisentraut wrote:
>> Finally, the fact that a configuration change is in progress is
>> privileged information.  Unprivileged users can deduce from the presence
>> of this message that administrators are doing something, and possibly
>> that they have done something wrong.
>>
>> I think it's fine to log a message in the server log if the pg_hba.conf
>> file needs reloading.  But the client shouldn't know about this at all.
>
> Should we do this for postgresql.conf too?

It doesn't really make sense; or at least, the exact same thing
doesn't make any sense.  If an authentication attempt fails
unexpectedly, it might be because of a pg_hba.conf change that wasn't
reloaded, so it makes sense to try to tip off the DBA.  But it can't
really be because of a pending postgresql.conf change that hasn't been
reloaded, because those don't generally affect authentication.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] 9.2 recovery/startup problems

2014-12-02 Thread Jeff Janes
On Tue, Dec 2, 2014 at 7:41 AM, Robert Haas  wrote:

> On Wed, Nov 26, 2014 at 7:13 PM, Jeff Janes  wrote:
> > If I do a pg_ctl stop -mf, then both files go away.  If I do a pg_ctl
> stop
> > -mi, then neither goes away.  It is only with the /sbin/reboot that I get
> > the fatal combination of _init being gone but the other still present.
>
> Eh?  That sounds wonky.
>
> I mean, reboot normally kills processes with SIGTERM or SIGKILL, in
> which case I'd expect the outcome to match what you get with pg_ctl
> stop -mf or pg_ctl stop -mi.  The only way I can see that you'd get a
> different behavior is if you did a hard reboot (like echo b >
> /proc/sysrq-trigger); if that changes things, then we might have a
> missing-fsync bug.  How is that reboot managing to leave the main fork
> behind while losing the init fork?
>

During abort processing after getting a SIGTERM, the back end truncates
59288 to zero size, and unlinks all the other files
(including 59288_init).  The actual removal of 59288 is left until the
checkpoint.  So if you SIGTERM the backend, then take down the server
uncleanly before the next checkpoint completes, you are left with just
59288.

Here is the strace:

open("base/16416/59288", O_RDWR)= 8
ftruncate(8, 0) = 0
close(8)= 0
unlink("base/16416/59288.1")= -1 ENOENT (No such file or
directory)
unlink("base/16416/59288_fsm")  = -1 ENOENT (No such file or
directory)
unlink("base/16416/59288_vm")   = -1 ENOENT (No such file or
directory)
unlink("base/16416/59288_init") = 0
unlink("base/16416/59288_init.1")   = -1 ENOENT (No such file or
directory)

Cheers,

Jeff


Re: [HACKERS] why is PG_AUTOCONF_FILENAME in pg_config_manual.h?

2014-12-02 Thread Robert Haas
On Thu, Nov 27, 2014 at 12:56 PM, Fujii Masao  wrote:
> On Thu, Nov 27, 2014 at 11:58 PM, Peter Eisentraut  wrote:
>> Surely that's not a value that we expect users to be able to edit.  Is
>> pg_config_manual.h just abused as a place that's included everywhere?
>>
>> (I suggest utils/guc.h as a better place.)
>
> +1

+1

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Report search_path value back to the client.

2014-12-02 Thread Alexander Kukushkin
Hi,

As of now, postgres reports only "really" important variables, but among
them there are also less important ones, like 'application_name'. The only
reason to report it was: "help pgbouncer, and perhaps other connection
poolers". The change was introduced by commit
59ed94ad0c9f74a3f057f359316c845cedc4461e.

This makes me wonder why the 'search_path' value is not reported back to
the client. The use case is exactly the same as with 'application_name',
but a little bit more important.

Our databases provide different versions of stored procedures, which are
located in different schemas: 'api_version1', 'api_version2',
'api_version5', etc.
When an application establishes a connection to the database, it sets the
search_path value, for example to api_version1. At the same time, a new
version of the same application will set the search_path value to
api_version2. Sometimes we have hundreds of instances of applications which
may use different versions of the stored procedures, located in different
schemas.
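
Schematically, the setup looks like this (schema and function names are made
up):

CREATE SCHEMA api_version1;
CREATE SCHEMA api_version2;

CREATE FUNCTION api_version1.get_customer(int) RETURNS text
    LANGUAGE sql AS $$ SELECT 'v1' $$;
CREATE FUNCTION api_version2.get_customer(int) RETURNS text
    LANGUAGE sql AS $$ SELECT 'v2' $$;

-- an old application instance:
SET search_path TO api_version1;
SELECT get_customer(42);   -- resolves to api_version1.get_customer

-- a new application instance:
SET search_path TO api_version2;
SELECT get_customer(42);   -- resolves to api_version2.get_customer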

It's really crazy to keep so many (hundreds of) connections to the database;
it would be much better to have something like pgbouncer in front of
postgres. Right now that's not possible, because pgbouncer is not aware of
search_path, and it's not really possible to fix pgbouncer, because there is
no easy way to get the current value of search_path.

I would like to mark 'search_path' as GUC_REPORT:
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2904,7 +2904,7 @@ static struct config_string ConfigureNamesString[] =
{"search_path", PGC_USERSET, CLIENT_CONN_STATEMENT,
gettext_noop("Sets the schema search order for
names that are not schema-qualified."),
NULL,
-   GUC_LIST_INPUT | GUC_LIST_QUOTE
+   GUC_LIST_INPUT | GUC_LIST_QUOTE | GUC_REPORT
},
&namespace_search_path,
"\"$user\",public",

What do you think?

Regards,
--
Alexander Kukushkin


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-02 Thread Robert Haas
On Wed, Nov 26, 2014 at 11:00 PM, Michael Paquier
 wrote:
> On Wed, Nov 26, 2014 at 8:27 PM, Syed, Rahila  wrote:
>> Don't we need to initialize doPageCompression  similar to doPageWrites in 
>> InitXLOGAccess?
> Yep, you're right. I missed this code path.
>
>> Also, in the earlier patches compression was set 'on' even when the fpw GUC is
>> 'off'. This was to facilitate compression of FPWs which are forcibly written
>> even when the fpw GUC is turned off.
>> doPageCompression in this patch is set to true only if the value of the fpw
>> GUC is 'compress'. I think it's better to compress forcibly written full page writes.
> Meh? (stealing a famous quote).
> This is backward-incompatible in the fact that forcibly-written FPWs
> would be compressed all the time, even if FPW is set to off. The
> documentation of the previous patches also mentioned that images are
> compressed only if this parameter value is switched to compress.

If we have a separate GUC to determine whether to do compression of
full page writes, then it seems like that parameter ought to apply
regardless of WHY we are doing full page writes, which might be either
that full_page_writes=on in general, or that we've temporarily turned
them on for the duration of a full backup.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Testing DDL deparsing support

2014-12-02 Thread Robert Haas
On Thu, Nov 27, 2014 at 11:43 PM, Ian Barwick  wrote:
> DDL deparsing is a feature that allows collection of DDL commands as they
> are
> executed in a server, in some flexible, complete, and fully-contained format
> that allows manipulation, storage, and transmission.  This feature has
> several
> use cases; the two best known ones are DDL replication and DDL auditing.
>
> We have come up with a design that uses a JSON structure to store commands.
> It is similar to the C sprintf() call in spirit: there is a base format
> string, which is a generic template for each command type, and contains
> placeholders that represent the variable parts of the command.  The values
> for
> the placeholders in each specific command are members of the JSON object.  A
> helper function is provided that expands the format string and replaces the
> placeholders with the values, and returns the SQL command as text.  This
> design lets the user operate on the JSON structure in either a read-only
> fashion (for example to block table creation if the names don't satisfy a
> certain condition), or by modifying it (for example, to change the schema
> name
> so that tables are created in different schemas when they are replicated to
> some remote server).
>
> This design is mostly accepted by the community.  The one sticking point is
> testing: how do we ensure that the JSON representation we have created
> correctly deparses back into a command that has the same effect as the
> original command.  This was expressed by Robert Haas:
> http://www.postgresql.org/message-id/CA+TgmoZ=vzrijmxlkqi_v0jg4k4leapmwusc6rwxs5mquxu...@mail.gmail.com
>
> The problem cannot be solved by a standard regression test module which runs
> a
> bunch of previously-defined commands and verifies the output.  We need not
> only check the output for the commands as they exist today, but also we need
> to ensure that this does not get broken as future patches modify the
> existing
> commands as well as create completely new commands.
>
> The challenge here is to create a new testing framework that ensures the DDL
> deparsing module will be maintained by future hackers as the DDL grammar is
> modified.
>
> What and How to Test
> 
>
> Our goal should be that patch authors run "make check-world" in their
> patched
> copies and notice that the DDL deparse test is failing; they can then modify
> deparse_utility.c to add support for the new commands, which should in
> general
> be pretty straightforward.  This way, maintaining deparsing code would be
> part
> of new patches just like we require pg_dump support and documentation for
> new
> features.
>
> It would not work to require patch authors to add their new commands to a
> new
> pg_regress test file, because most would not be aware of the need, or they
> would just forget to do it, and patches would be submitted and possibly even
> committed without any realization of the breakage caused.
>
> There are two things we can rely on: standard regression tests, and pg_dump.
>
> Standard regression tests are helpful because patch authors include new test
> cases in the patches that stress their new options or commands.  This new
> test
> framework needs to be something that internally runs the regression tests
> and
> exercises DDL deparsing as the regression tests execute DDL.  That would
> mean
> that new commands and options would automatically be deparse-tested by our
> new
> framework as soon as patch authors add standard regress support.
>
> One thing is first-grade testing, that is ensure that the deparsed version
> of
> a DDL command can be executed at all, for if the deparsed version throws an
> error, it's immediately obvious that the deparse code is bogus.  This is
> because we know the original command did not throw an error: otherwise, the
> deparse code would not have run at all, because ddl_command_end triggers are
> only executed once the original command has completed execution.  So
> first-grade testing ensures that no trivial bugs are present.
>
> But there's second-grade verification as well: is the object produced by the
> deparsed version identical to the one produced by the original command?  One
> trivial but incomplete approach is to run the command, then save the
> deparsed
> version; run the deparsed version, and deparse that one too; compare both.
> The problem with this approach is that if the deparse code is omitting some
> clause (say it omits IN TABLESPACE in a CREATE TABLE command), then both
> deparsed versions would contain the same bug yet they would compare equal.
> Therefore this approach is not good enough.
>
> The best idea we have so far to attack second-grade testing is to trust
> pg_dump to expose differences: accumulate commands as they run in the
> regression database, the run the deparsed versions in a different database;
> then pg_dump both databases and compare the dumped outputs.
>
> Proof-of-concept
> 
>
> We have now implemented th

Re: [HACKERS] Using pg_rewind for differential backup

2014-12-02 Thread Robert Haas
On Fri, Nov 28, 2014 at 2:49 AM, Heikki Linnakangas
 wrote:
> It also would be quite straightforward to write a separate tool to do just
> that. Would be better than conflating pg_rewind with this. You could use
> pg_rewind as the basis for it - it's under the same license as PostgreSQL.

If we had such a tool in core, would that completely solve the
differential backup problem, or would more be needed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fillfactor for GIN indexes

2014-12-02 Thread Robert Haas
On Fri, Nov 28, 2014 at 4:27 AM, Alexander Korotkov
 wrote:
> On Fri, Nov 21, 2014 at 8:12 AM, Michael Paquier 
> wrote:
>> Please find attached a simple patch adding fillfactor as storage parameter
>> for GIN indexes. The default value is the same as the current one, i.e. 100,
>> to have the pages completely packed when a GIN index is created.
>
>
> That's not true. Let us discuss it a little bit.
> In the btree access method, the index is created by sorting index tuples. Once
> they are sorted, it can write a tree with any page fullness. With completely
> filled pages you will get a flood of index page splits on table updates or
> inserts. In order to avoid that, the default fillfactor is 90.
> Besides index creation, the btree access method tries to follow the given
> fillfactor when ordered data is inserted: when the rightmost page splits,
> fillfactor percent of the data goes to the left page.
> A btree page's life cycle runs from being freshly created by a split (50%
> filled) to being fully packed (100% filled) and then split again. Thus we can
> roughly estimate the average fullness of a btree page to be 75%. This
> "equilibrium" state is reached after a huge number of inserts into the btree
> index.
>
> Fillfactor was later extended to other access methods, for instance GiST. In
> GiST, the tree is always constructed by page splits, so a freshly created GiST
> index is already in its "equilibrium" state. Moreover, GiST doesn't split
> pages in halves; the split ratio depends on the opclass. So the "equilibrium"
> fullness of a particular GiST index is even less than 75%. What does
> fillfactor do with GiST? It makes GiST pages split at fillfactor fullness
> during index build, so GiST pages end up even less full. But why? You don't
> get a flood of splits on a fresh GiST index anyway, because it's in its
> "equilibrium" state. That's why I would rather delete fillfactor from GiST
> until we have some algorithm to construct GiST without page splits.
>
> What is going on with GIN? GIN is, basically, a two-level nested btree, so
> fillfactor should be applicable here. But GIN is not constructed like a
> btree...
> GIN accumulates maintenance_work_mem worth of data and then inserts it into
> the tree, so a fresh GIN index is the result of inserting
> maintenance_work_mem-sized buckets, and each bucket is sorted. On a small
> index (or with a large maintenance_work_mem) there would be only one bucket.
> GIN has two levels of btree: the entry tree and the posting tree. Currently
> the entry tree has no special logic to control fullness during index build,
> so with only one bucket the entry tree will be 50% filled, and with many
> buckets it will be in the "equilibrium" state. In some specific cases (about
> two buckets) the entry tree can be almost fully packed. Insertion into the
> posting tree is always ordered during index build, because the posting tree
> is a tree of TIDs. When a posting tree page splits, it leaves the left page
> fully packed during index build and 75% packed during normal insertion
> (because it expects some sequential inserts).
>
> The idea of the GIN build algorithm is that the entry tree is expected to be
> relatively small and fit in cache, and insertions into the posting trees are
> ordered. Thus this algorithm should be I/O efficient and have low overhead.
> However, with large entry trees this algorithm can perform much worse.
>
> My summary is the following:
> 1) In order to have fully correct support of fillfactor in GIN we need to
> rewrite the GIN build algorithm.
> 2) Without rewriting the GIN build algorithm, not much can be done with the
> entry tree. However, you can implement some heuristics.
> 3) You definitely need to touch the code that selects the split ratio in
> dataPlaceToPageLeaf (starting with if (!btree->isBuild)).
> 4) GIN data pages are always compressed, except for pg_upgraded indexes from
> pre-9.4. Take care of this in the following code.
>   if (GinPageIsCompressed(page))
>   freespace = GinDataLeafPageGetFreeSpace(page);
> + else if (btree->isBuild)
> + freespace = BLCKSZ * (100 - fillfactor) / 100;
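For reference, with a reloption like the one the patch proposes, a non-default
fillfactor would be requested the same way as for btree indexes; a sketch with
an invented table and column:

    CREATE INDEX docs_tsv_gin ON docs USING gin (tsv) WITH (fillfactor = 90);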

This is a very interesting explanation; thanks for writing it up!

It does leave me wondering why anyone would want fillfactor for GIN at
all, even if they were willing to rewrite the build algorithm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2014-12-02 Thread Robert Haas
On Fri, Nov 28, 2014 at 11:38 AM, Alex Shulgin  wrote:
> Should I add this patch to the next CommitFest?

If you don't want it to get forgotten about, yes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Add shutdown_at_recovery_target option to recovery.conf

2014-12-02 Thread Robert Haas
On Fri, Nov 28, 2014 at 11:59 AM, Petr Jelinek  wrote:
>>> I'm a bit late to the party, but wouldn't
>>>
>>> recovery_target_action = ...
>>>
>>> have been a better name for this? It'd be in line with the other
>>> recovery_target_* parameters, and also a bit shorter than the imho
>>> somewhat ugly "action_at_recovery_target".
>>
>> FWIW, I too think that "recovery_target_action" is a better name.
>
> I agree.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Testing DDL deparsing support

2014-12-02 Thread Alvaro Herrera
Robert Haas wrote:
> On Thu, Nov 27, 2014 at 11:43 PM, Ian Barwick  wrote:

> > A simple schedule to demonstrate this is available; execute from the
> > src/test/regress/ directory like this:
> >
> > ./pg_regress \
> >   --temp-install=./tmp_check \
> >   --top-builddir=../../.. \
> >   --dlpath=. \
> >   --schedule=./schedule_ddl_deparse_demo
> 
> I haven't read the code, but this concept seems good to me.

Excellent, thanks.

> It has the unfortunate weakness that a difference could exist during
> the *middle* of the regression test run that is gone by the *end* of
> the run, but our existing pg_upgrade testing has the same weakness, so
> I guess we can view this as one more reason not to be too aggressive
> about having regression tests drop the unshared objects they create.

Agreed.  Not dropping objects also helps test pg_dump itself; the normal
procedure there is to run the regression tests, then pg_dump the regression
database.  Objects that are dropped never exercise their corresponding
pg_dump support code, which I think is a bad thing.  I think we should
institute a policy that regression tests must keep the objects they
create; maybe not all of them, but at least a sample large enough to
cover all interesting possibilities.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Josh Berkus
On 12/02/2014 06:25 AM, Alex Shulgin wrote:
>> I am not in favor of this part. It may be better to let the users know
>> > that their old configuration is not valid anymore with an error. This
>> > patch cuts in the flesh with a huge axe, let's be sure that users do
>> > not ignore the side pain effects, or recovery.conf would be simply
>> > ignored and users would not be aware of that.
> Yeah, that is good point.
> 
> I'd be in favor of a solution that works the same way as before the
> patch, without the need for extra trigger files, etc., but that doesn't
> seem to be nearly possible.  

As previously discussed, there are ways to avoid having a trigger file
for replication.  However, it's hard to avoid having one for PITR
recovery; at least, I can't think of a method which isn't just as
awkward, and we might as well stick with the awkward method we know.

> Whatever tricks we might employ will likely
> be defeated by the fact that the oldschool user will fail to *include*
> recovery.conf in the main conf file.

Well, can we merge this patch and then fight out what to do for the
transitional users as a separate patch?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of GetUserId() Usage

2014-12-02 Thread Robert Haas
On Sat, Nov 29, 2014 at 3:00 PM, Stephen Frost  wrote:
> * Stephen Frost (sfr...@snowman.net) wrote:
>> * Stephen Frost (sfr...@snowman.net) wrote:
>> > Attached is a patch to address the pg_cancel/terminate_backend and the
>> > statistics info as discussed previously.  It sounds like we're coming to
>>
>> And I forgot the attachment, of course.  Apologies.
>
> Updated patch attached which also changes the error messages (which
> hadn't been updated in the prior versions and really should be).
>
> Barring objections, I plan to move forward with this one and get this
> relatively minor change wrapped up.  As mentioned in the commit, while
> this might be an arguably back-patchable change, the lack of field
> complaints and the fact that it changes existing behavior mean it should
> go only against master, imv.

This patch does a couple of different things:

1. It makes more of the crappy error message change that Andres and I
already objected to on the other thread.  Whether you disagree with
those objections or not, don't make an end-run around them by putting
more of the same stuff into patches on other threads.

2. It changes the functions in pgstatfuncs.c so that you can see the
relevant information not only for your own role, but also for roles of
which you are a member.  That seems fine, but do we need a
documentation change someplace?

3. It messes around with pg_signal_backend().  There are currently no
cases in which pg_signal_backend() throws an error, which is good,
because it lets you write queries against pg_stat_activity() that
don't fail halfway through, even if you are missing permissions on
some things.  This patch introduces such a case, which is bad.

I think it's unfathomable that you would consider anything in this
patch a back-patchable bug fix.  It's clearly a straight-up behavior
change... or more properly three different changes, only one of which
I agree with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Robert Haas
On Sat, Nov 29, 2014 at 11:46 PM, Jim Nasby  wrote:
> What do you mean by "never succeed"? Is it skipping a large number of pages?
> Might re-trying the locks within the same vacuum help, or are the user locks
> too persistent?

You are confused.  He's talking about the relation-level lock that
vacuum attempts to take before doing any work at all on a given table,
not the per-page cleanup locks that it takes while processing each
page.  If the relation-level lock can't be acquired, the whole table
is skipped.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Getting references for OID

2014-12-02 Thread Воронин Дмитрий
Hello,

How can I get all dependent objects for the OID of an object? For example, I
have the OID of a database, db_id, and I want to get the OIDs of all namespaces
contained in this database.

Thank you!

-- 
Best regards, Дмитрий Воронин



Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Alvaro Herrera
Josh Berkus wrote:
> On 12/02/2014 06:25 AM, Alex Shulgin wrote:

> > Whatever tricks we might employ will likely
> > be defeated by the fact that the oldschool user will fail to *include*
> > recovery.conf in the main conf file.
> 
> Well, can we merge this patch and then fight out what to do for the
> transitional users as a separate patch?

You seem to be saying "I don't have any good idea how to solve this
problem now, but I will magically have one once this is committed".  I'm
not sure that works very well.

In any case, the proposal upthread that we raise an error if
recovery.conf is found seems sensible enough.  Users will see it and
they will adjust their stuff -- it's a one-time thing.  It's not like
they switch a version forwards one week and back the following week.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Alvaro Herrera
Robert Haas wrote:
> On Sat, Nov 29, 2014 at 11:46 PM, Jim Nasby  wrote:
> > What do you mean by "never succeed"? Is it skipping a large number of pages?
> > Might re-trying the locks within the same vacuum help, or are the user locks
> > too persistent?
> 
> You are confused.  He's talking about the relation-level lock that
> vacuum attempts to take before doing any work at all on a given table,
> not the per-page cleanup locks that it takes while processing each
> page.  If the relation-level lock can't be acquired, the whole table
> is skipped.

Almost there.  Autovacuum takes the relation-level lock, starts
processing.  Some time later, another process wants a lock that
conflicts with the one autovacuum has.  This is flagged by the deadlock
detector, and a signal is sent to autovacuum, which commits suicide.

If the table is large, the time window for this to happen is large also;
there might never be a time window large enough between two lock
acquisitions for one autovacuum run to complete in a table.  This
starves the table from vacuuming completely, until things are bad enough
that an emergency vacuum is forced.  By then, the bloat is disastrous.

I think it's that suicide that Andres wants to disable.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Josh Berkus
On 12/02/2014 10:31 AM, Alvaro Herrera wrote:
> Josh Berkus wrote:
>> On 12/02/2014 06:25 AM, Alex Shulgin wrote:
> 
>>> Whatever tricks we might employ will likely
>>> be defeated by the fact that the oldschool user will fail to *include*
>>> recovery.conf in the main conf file.
>>
>> Well, can we merge this patch and then fight out what to do for the
>> transitional users as a separate patch?
> 
> You seem to be saying "I don't have any good idea how to solve this
> problem now, but I will magically have one once this is committed".  I'm
> not sure that works very well.

No, I'm saying "this problem is easy to solve technically, but we have
intractable arguments on this list about the best way to solve it, even
though the bulk of the patch isn't in dispute".

> In any case, the proposal upthread that we raise an error if
> recovery.conf is found seems sensible enough.  Users will see it and
> they will adjust their stuff -- it's a one-time thing.  It's not like
> they switch a version forwards one week and back the following week.

I'm OK with that solution.  Apparently others aren't though.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Josh Berkus
On 12/02/2014 10:35 AM, Alvaro Herrera wrote:
> If the table is large, the time window for this to happen is large also;
> there might never be a time window large enough between two lock
> acquisitions for one autovacuum run to complete in a table.  This
> starves the table from vacuuming completely, until things are bad enough
> that an emergency vacuum is forced.  By then, the bloat is disastrous.
> 
> I think it's that suicide that Andres wants to disable.

A much better solution for this ... and one which would solve a *lot* of
other issues with vacuum and autovacuum ... would be to give vacuum a
way to track which blocks an incomplete vacuum had already visited.
This would be even more valuable for freeze.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Andres Freund
On 2014-12-02 11:02:07 -0800, Josh Berkus wrote:
> On 12/02/2014 10:35 AM, Alvaro Herrera wrote:
> > If the table is large, the time window for this to happen is large also;
> > there might never be a time window large enough between two lock
> > acquisitions for one autovacuum run to complete in a table.  This
> > starves the table from vacuuming completely, until things are bad enough
> > that an emergency vacuum is forced.  By then, the bloat is disastrous.
> > 
> > I think it's that suicide that Andres wants to disable.

Correct.

> A much better solution for this ... and one which would solve a *lot* of
> other issues with vacuum and autovacuum ... would be to give vacuum a
> way to track which blocks an incomplete vacuum had already visited.
> This would be even more valuable for freeze.

That's pretty much a different problem. Yes, some more persistent state would
be helpful - although it'd need to be *much* more than which pages it
has visited - but you'd still be vulnerable to the same issue.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Josh Berkus
On 12/02/2014 11:08 AM, Andres Freund wrote:
> On 2014-12-02 11:02:07 -0800, Josh Berkus wrote:
>> On 12/02/2014 10:35 AM, Alvaro Herrera wrote:
>>> If the table is large, the time window for this to happen is large also;
>>> there might never be a time window large enough between two lock
>>> acquisitions for one autovacuum run to complete in a table.  This
>>> starves the table from vacuuming completely, until things are bad enough
>>> that an emergency vacuum is forced.  By then, the bloat is disastrous.
>>>
>>> I think it's that suicide that Andres wants to disable.
> 
> Correct.
> 
>> A much better solution for this ... and one which would solve a *lot* of
>> other issues with vacuum and autovacuum ... would be to give vacuum a
>> way to track which blocks an incomplete vacuum had already visited.
>> This would be even more valuable for freeze.
> 
> > That's pretty much a different problem. Yes, some more persistent state would
> be helpful - although it'd need to be *much* more than which pages it
> has visited - but you'd still be vulnerable to the same issue.

If we're trying to solve the problem that vacuums of large, high-update
tables never complete, it's solving the same problem.  And in a much
better way.

And yeah, doing a vacuum placeholder wouldn't be simple, but it's the
only solution I can think of that's worthwhile.  Just disabling the
vacuum-releases-sharelock behavior puts the user in the situation of
deciding between maintenance and uptime.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Removing INNER JOINs

2014-12-02 Thread Robert Haas
On Sun, Nov 30, 2014 at 12:51 PM, Tom Lane  wrote:
> Bottom line, given all the restrictions on whether the optimization can
> happen, I have very little enthusiasm for the whole idea.  I do not think
> the benefit will be big enough to justify the amount of mess this will
> introduce.

This optimization applies to a tremendous number of real-world cases,
and we really need to have it.  This was a huge problem for me in my
previous life as a web developer.  The previous work that we did to
remove LEFT JOINs was an enormous help, but it's not enough; we need a
way to remove INNER JOINs as well.
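For concreteness, this is the shape of query the optimization targets (table
and column names invented here): with a validated foreign key from
orders.customer_id to customers.id and customer_id declared NOT NULL, the join
below can neither add nor remove rows, so if no column of "customers" is
actually referenced the join could be skipped entirely:

    SELECT o.order_id, o.total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id;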

I thought that David's original approach of doing this in the planner
was a good one.  That fell down because of the possibility that
apparently-valid referential integrity constraints might not be valid
at execution time if the triggers were deferred.  But frankly, that
seems like an awfully nitpicky thing for this to fall down on.  Lots
of web applications are going to issue only SELECT statements that run
as single-statement transactions, and so that issue, so troubling
in theory, will never occur in practice.  That doesn't mean that we
don't need to account for it somehow to make the code safe, but any
argument that it abridges the use case significantly is, in my
opinion, not credible.

Anyway, David was undeterred by the rejection of that initial approach
and rearranged everything, based on suggestions from Andres and later
Simon, into the form it's reached now.  Kudos to him for his
persistence.  But your point that we might have chosen a whole
different plan if it had known that this join was cheaper is a good
one.  However, that takes us right back to square one, which is to do
this at plan time.  I happen to think that's probably better anyway,
but I fear we're just going around in circles here.  We can either do
it at plan time and find some way of handling the fact that there
might be deferred triggers that haven't fired yet; or we can do it at
execution time and live with the fact that we might have chosen a plan
that is not optimal, though still better than executing a
completely-unnecessary join.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Andres Freund
On 2014-12-02 11:12:40 -0800, Josh Berkus wrote:
> On 12/02/2014 11:08 AM, Andres Freund wrote:
> > On 2014-12-02 11:02:07 -0800, Josh Berkus wrote:
> >> On 12/02/2014 10:35 AM, Alvaro Herrera wrote:
> >>> If the table is large, the time window for this to happen is large also;
> >>> there might never be a time window large enough between two lock
> >>> acquisitions for one autovacuum run to complete in a table.  This
> >>> starves the table from vacuuming completely, until things are bad enough
> >>> that an emergency vacuum is forced.  By then, the bloat is disastrous.
> >>>
> >>> I think it's that suicide that Andres wants to disable.
> > 
> > Correct.
> > 
> >> A much better solution for this ... and one which would solve a *lot* of
> >> other issues with vacuum and autovacuum ... would be to give vacuum a
> >> way to track which blocks an incomplete vacuum had already visited.
> >> This would be even more valuable for freeze.
> > 
> > That's pretty much a different problem. Yes, some more persistent state would
> > be helpful - although it'd need to be *much* more than which pages it
> > has visited - but you'd still be vulnerable to the same issue.
> 
> If we're trying to solve the problem that vacuums of large, high-update
> tables never complete, it's solving the same problem.

Which isn't what I'm talking about.

The problem is that vacuum is cancelled if a conflicting lock request
comes in. Plain updates don't do that. But there are workloads where you
need more heavyweight updates, and then it can easily happen.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sequence Access Method WIP

2014-12-02 Thread Andres Freund
On 2014-11-24 13:16:24 +0200, Heikki Linnakangas wrote:
> To be clear: I don't think this API is very good for its stated purpose, for
> implementing global sequences for use in a cluster. For the reasons I've
> mentioned before.  I'd like to see two changes to this proposal:
> 
> 1. Make the AM implementation solely responsible for remembering the "last
> value". (if it's a global or remote sequence, the current value might not be
> stored in the local server at all)

I think that reason isn't particularly good. The practical applicability
for such an implementation doesn't seem to be particularly large.

> 2. Instead of the single amdata field, make it possible for the
> implementation to define any number of fields with any datatype in the
> tuple. That would make debugging, monitoring etc. easier.

My main problem with that approach is that it pretty much nails the door
shut for moving sequences into a catalog table instead of the current,
pretty insane, approach of a physical file per sequence. Currently, with
or without seqam, it'd not be all that hard to force it into a catalog,
taking care to force each tuple onto a separate page...


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Jeff Janes
On Tue, Dec 2, 2014 at 11:12 AM, Josh Berkus  wrote:

> On 12/02/2014 11:08 AM, Andres Freund wrote:
> > On 2014-12-02 11:02:07 -0800, Josh Berkus wrote:
> >> On 12/02/2014 10:35 AM, Alvaro Herrera wrote:
> >>> If the table is large, the time window for this to happen is large
> also;
> >>> there might never be a time window large enough between two lock
> >>> acquisitions for one autovacuum run to complete in a table.  This
> >>> starves the table from vacuuming completely, until things are bad
> enough
> >>> that an emergency vacuum is forced.  By then, the bloat is disastrous.
> >>>
> >>> I think it's that suicide that Andres wants to disable.
> >
> > Correct.
> >
> >> A much better solution for this ... and one which would solve a *lot* of
> >> other issues with vacuum and autovacuum ... would be to give vacuum a
> >> way to track which blocks an incomplete vacuum had already visited.
> >> This would be even more valuable for freeze.
> >
> > That's pretty much a different problem. Yes, some more persistent state would
> > be helpful - although it'd need to be *much* more than which pages it
> > has visited - but you'd still be vulnerable to the same issue.
>
> If we're trying to solve the problem that vacuums of large, high-update
> tables never complete, it's solving the same problem.  And in a much
> better way.
>
> And yeah, doing a vacuum placeholder wouldn't be simple, but it's the
> only solution I can think of that's worthwhile.  Just disabling the
> vacuum releases sharelock behavior puts the user in the situation of
> deciding between maintenance and uptime.
>

I think it would be more promising to work on downgrading lock strengths so
that fewer things conflict, and it would be not much more work than what
you propose.

What operations are people doing on a regular basis that take locks which
cancel vacuum?  create index?

Cheers,

Jeff


Re: [HACKERS] Getting references for OID

2014-12-02 Thread Borodin Vladimir

> Hello,
>  
> How can I get all dependent objects for the OID of an object? For example, I
> have the OID of a database, db_id, and I want to get the OIDs of all
> namespaces contained in this database.
> Thank you!

Hello, Dmitry.

Actually, this list is for developers (as described here [0]). You should ask 
this question on pgsql-general@, for example, or you can do it on 
pgsql-ru-general@, in Russian.

And if I understood you right, you can use the oid2name tool, which is part of
contrib. More info about it can be found here [1].

[0] http://www.postgresql.org/list/
[1] http://www.postgresql.org/docs/current/static/oid2name.html
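For illustration, typical oid2name invocations look like this (the database
name is invented; see the documentation linked above for the full option list):

    oid2name              # list all databases in the cluster with their OIDs
    oid2name -d mydb      # list the relations in "mydb" with their OIDs and filenodes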

>  
> -- 
> Best regards, Дмитрий Воронин
>  


--
Vladimir






Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Dec 2, 2014 at 11:29 AM, Stephen Frost  wrote:
> > It includes the actual error message, unlike what we have today which is
> > guidence as to what's required to get past the permission denied error.
> >
> > In other words, I disagree that the same amount of information is being
> > conveyed.
> 
> So, you think that someone might see the message "must be superuser or
> replication role to use replication slots" and fail to understand that
> they have a permissions problem?

I think that the error message about a permissions problem should be
that there is a permissions problem, first and foremost.

We don't say 'You must have SELECT rights on relation X to select from
it.' when you try to do a SELECT that you're not allowed to.  We throw a
permissions denied error.  As pointed out up-thread, we do similar
things for other cases of permissions denied and even beyond permission
denied errors we tend to match up the errmsg() with the errcode(), which
makes a great deal of sense to me and is why I'm advocating that
position.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Andres Freund
On 2014-12-02 11:23:31 -0800, Jeff Janes wrote:
> I think it would be more promising to work on downgrading lock strengths so
> that fewer things conflict, and it would be not much more work than what
> you propose.

I think you *massively* underestimate the effort required to lower
lock levels. There are some very ugly corners you have to think about to
do so. Just look at how long it took to implement the lock level
reductions for ALTER TABLE - and those were the simpler cases.

> What operations are people doing on a regular basis that take locks
> which cancel vacuum?  create index?

Locking tables against modifications in this case.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of GetUserId() Usage

2014-12-02 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Sat, Nov 29, 2014 at 3:00 PM, Stephen Frost  wrote:
> > * Stephen Frost (sfr...@snowman.net) wrote:
> > Updated patch attached which also changes the error messages (which
> > hadn't been updated in the prior versions and really should be).
> >
> > Barring objections, I plan to move forward with this one and get this
> > relatively minor change wrapped up.  As mentioned in the commit, while
> > this might be an arguably back-patchable change, the lack of field
> > complaints and the fact that it changes existing behavior mean it should
> > go only against master, imv.
> 
> This patch does a couple of different things:
> 
> 1. It makes more of the crappy error message change that Andres and I
> already objected to on the other thread.  Whether you disagree with
> those objections or not, don't make an end-run around them by putting
> more of the same stuff into patches on other threads.

The error message clearly needed to be updated either way or I wouldn't
have touched it.  I changed it to match what I feel is the prevalent and
certainly more commonly seen messaging from PG when it comes to
permissions errors, and drew attention to it by commenting on the fact
that I changed it.  Doing otherwise would have drawn similar criticism
(as it did upthread, by Peter or Alvaro, I believe..) that I wasn't
updating it to match the messaging which we should be using.

> 2. It changes the functions in pgstatfuncs.c so that you can see the
> relevant information not only for your own role, but also for roles of
> which you are a member.  That seems fine, but do we need a
> documentation change someplace?

Yes, I've added the documentation changes to my branch, just hadn't
posted an update yet (travelling today).

> 3. It messes around with pg_signal_backend().  There are currently no
> cases in which pg_signal_backend() throws an error, which is good,
> because it lets you write queries against pg_stat_activity() that
> don't fail halfway through, even if you are missing permissions on
> some things.  This patch introduces such a case, which is bad.

Good point, I'll move that check up into the other functions, which will
allow for a more descriptive error as well.

> I think it's unfathomable that you would consider anything in this
> patch a back-patchable bug fix.  It's clearly a straight-up behavior
> change... or more properly three different changes, only one of which
> I agree with.

I didn't think it was back-patchable and stated as much.  I anticipated
that argument and provided my thoughts on it.  I *do* think it's wrong
to be using GetUserId() in this case and it's only very slightly
mollified by being documented that way.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Removing INNER JOINs

2014-12-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Sun, Nov 30, 2014 at 12:51 PM, Tom Lane  wrote:
> > Bottom line, given all the restrictions on whether the optimization can
> > happen, I have very little enthusiasm for the whole idea.  I do not think
> > the benefit will be big enough to justify the amount of mess this will
> > introduce.
> 
> This optimization applies to a tremendous number of real-world cases,
> and we really need to have it.  This was a huge problem for me in my
> previous life as a web developer.  The previous work that we did to
> remove LEFT JOINs was an enormous help, but it's not enough; we need a
> way to remove INNER JOINs as well.

For my 2c, I'm completely with Robert on this one.  There are a lot of
cases this could help with, particularly things coming out of ORMs
(which, yes, might possibly be better written, but that's a different
issue).

> I thought that David's original approach of doing this in the planner
> was a good one.  That fell down because of the possibility that
> apparently-valid referential integrity constraints might not be valid
> at execution time if the triggers were deferred.  But frankly, that
> seems like an awfully nitpicky thing for this to fall down on.  Lots
> of web applications are going to issue only SELECT statements that run
> as single-statement transactions, and so that issue, so troubling
> in theory, will never occur in practice.  That doesn't mean that we
> don't need to account for it somehow to make the code safe, but any
> argument that it abridges the use case significantly is, in my
> opinion, not credible.

Agreed with this also; deferred triggers are not commonplace in my
experience, and when it *does* happen, ime at least, it's because you
have a long-running data load or similar where you're not going to
care one bit that large, complicated JOINs aren't as fast as they
might have been otherwise.

> Anyway, David was undeterred by the rejection of that initial approach
> and rearranged everything, based on suggestions from Andres and later
> Simon, into the form it's reached now.  Kudos to him for his
> persistence.  But your point that we might have chosen a whole
> different plan if it had known that this join was cheaper is a good
> one.  However, that takes us right back to square one, which is to do
> this at plan time.  I happen to think that's probably better anyway,
> but I fear we're just going around in circles here.  We can either do
> it at plan time and find some way of handling the fact that there
> might be deferred triggers that haven't fired yet; or we can do it at
> execution time and live with the fact that we might have chosen a plan
> that is not optimal, though still better than executing a
> completely-unnecessary join.

Right, we can't get it wrong in the face of deferred triggers either.
Have we considered only doing the optimization for read-only
transactions?  I'm not thrilled with that, but at least we'd get out
from under this deferred triggers concern.  Another way might be an
option to say "use the optimization, but throw an error if you run
into a deferred trigger", or perhaps save both plans and use whichever
one we can when we get to execution time?  That could make planning
time go up too much to work, but perhaps it's worth testing..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Jeff Janes
On Tue, Dec 2, 2014 at 11:41 AM, Andres Freund 
wrote:

> On 2014-12-02 11:23:31 -0800, Jeff Janes wrote:
> > I think it would be more promising to work on downgrading lock strengths
> so
> > that fewer things conflict, and it would be not much more work than what
> > you propose.
>
> I think you *massively* underestimate the effort required to lower
> lock levels. There are some very ugly corners you have to think about to
> do so. Just look at how long it took to implement the lock level
> reductions for ALTER TABLE - and those were the simpler cases.
>

Or maybe I overestimate how hard it would be to make vacuum restartable.
You would have to save a massive amount of state (up to a maintenance_work_mem
tid list, the block you left off at in both the table and all of the indexes
in that table), and you would somehow have to validate that saved state
against any changes that might have occurred to the table or the indexes
while it was saved and you were not holding the lock, which seems like it
would be almost as full of corner cases as weakening the lock in the first
place.  Aren't they logically the same thing?  If we could drop the lock
and take it up again later, maybe the answer is not to save the state, but
just to pause the vacuum until the lock becomes free again, in effect
saving the state in situ.  That would allow autovac worker to be held
hostage to anyone taking a lock, though.

The only easy way to do it that I see is to have it only stop at the end of
an index-cleaning cycle, which probably takes too long to block for.  Or
record a restart point at the end of each index-cleaning cycle, and then
when it yields the lock it abandons all work since the last cycle end,
rather than since the beginning.  That would be better than what we have,
but seems like a far cry from actual restarting from any point.


>
> > What operations are people doing on a regular basis that take locks
> > which cancel vacuum?  create index?
>
> Locking tables against modifications in this case.
>

So in "share mode", then?  I don't think there is any reason that there
can't be a lock mode that conflicts with "ROW EXCLUSIVE" but not "SHARE
UPDATE EXCLUSIVE".  Basically something that conflicts with logical
changes, but not with physical changes.
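For illustration, the kind of operation meant here is an explicit table lock
such as the one below (table name invented); SHARE mode conflicts with the
ShareUpdateExclusiveLock autovacuum holds, so a pending request for it is
exactly what makes the deadlock detector cancel the running autovacuum:

    BEGIN;
    LOCK TABLE accounts IN SHARE MODE;   -- blocks writers; also conflicts with autovacuum's lock
    -- ... consistent read-only processing ...
    COMMIT;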

Cheers,

Jeff


Re: [HACKERS] Review of GetUserId() Usage

2014-12-02 Thread Stephen Frost
* Stephen Frost (sfr...@snowman.net) wrote:
> > 3. It messes around with pg_signal_backend().  There are currently no
> > cases in which pg_signal_backend() throws an error, which is good,
> > because it lets you write queries against pg_stat_activity() that
> > don't fail halfway through, even if you are missing permissions on
> > some things.  This patch introduces such a case, which is bad.
> 
> Good point, I'll move that check up into the other functions, which will
> allow for a more descriptive error as well.

Err, I'm missing something here, as pg_signal_backend() is a misc.c
static internal function?  How would you be calling it from a query
against pg_stat_activity()?

I'm fine making the change anyway, just curious..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Review of GetUserId() Usage

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 2:50 PM, Stephen Frost  wrote:
>> 1. It makes more of the crappy error message change that Andres and I
>> already objected to on the other thread.  Whether you disagree with
>> those objections or not, don't make an end-run around them by putting
>> more of the same stuff into patches on other threads.
>
> The error message clearly needed to be updated either way or I wouldn't
> have touched it.  I changed it to match what I feel is the prevalent and
> certainly more commonly seen messaging from PG when it comes to
> permissions errors, and drew attention to it by commenting on the fact
> that I changed it.  Doing otherwise would have drawn similar criticism
> (as it did upthread, by Peter or Alvaro, I believe..) that I wasn't
> updating it to match the messaging which we should be using.

OK, I guess that's a fair point.

>> I think it's unfathomable that you would consider anything in this
>> patch a back-patchable bug fix.  It's clearly a straight-up behavior
>> change... or more properly three different changes, only one of which
>> I agree with.
>
> I didn't think it was back-patchable and stated as much.  I anticipated
> that argument and provided my thoughts on it.  I *do* think it's wrong
> to be using GetUserId() in this case and it's only very slightly
> mollified by being documented that way.

It's not wrong.  It's just different than what you happen to prefer.
It's fine to want to change things, but "not the way I would have done
it" is not the same as "arguably a bug".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Andres Freund
On 2014-12-02 12:22:42 -0800, Jeff Janes wrote:
> Or maybe I overestimate how hard it would be to make vacuum
> restartable.

That's a massive project. Which is why I'm explicitly *not* suggesting
that. What I instead suggest is a separate threshold after which vacuum
isn't going to abort automatically after a lock conflict. So after that
threshold, just behave like anti-wraparound vacuum already does.

Maybe autovacuum_vacuum/analyze_force_threshold or similar. If set to
zero, the default, that behaviour is disabled. If set to a positive
value it's an absolute one, if negative it's a factor of the normal
autovacuum_vacuum/analyze_threshold.
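Spelled out in postgresql.conf terms, the sketch might look roughly like this
(the parameter names are placeholders from the proposal above, not existing
settings):

    # hypothetical parameters, per the proposal above
    autovacuum_vacuum_force_threshold = 0       # 0 (default): keep today's cancel-on-conflict behaviour
    autovacuum_analyze_force_threshold = -2     # negative: a factor of autovacuum_analyze_threshold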


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 2:35 PM, Stephen Frost  wrote:
> * Robert Haas (robertmh...@gmail.com) wrote:
>> On Tue, Dec 2, 2014 at 11:29 AM, Stephen Frost  wrote:
>> > It includes the actual error message, unlike what we have today which is
>> > guidence as to what's required to get past the permission denied error.
>> >
>> > In other words, I disagree that the same amount of information is being
>> > conveyed.
>>
>> So, you think that someone might see the message "must be superuser or
>> replication role to use replication slots" and fail to understand that
>> they have a permissions problem?
>
> I think that the error message about a permissions problem should be
> that there is a permissions problem, first and foremost.

I can't disagree with that, but it's not like the current message is:

ERROR: I ran into some kind of a problem but I'm not going to tell you
what it is for a long time OK I guess that's enough well actually you
see there is a problem and it happens to do have to do with
permissions and in particular that they are denied.

It's pretty clear that the current message is complaining about a
permissions problem, so the fact that it doesn't specifically include
the words "permission" and "denied" doesn't bother me.  Let me ask the
question again: what information do you think is conveyed by your
proposed rewrite that is not conveyed by the current message?

> We don't say 'You must have SELECT rights on relation X to select from
> it.' when you try to do a SELECT that you're not allowed to.  We throw a
> permissions denied error.  As pointed out up-thread, we do similar
> things for other cases of permissions denied and even beyond permission
> denied errors we tend to match up the errmsg() with the errcode(), which
> makes a great deal of sense to me and is why I'm advocating that
> position.

I don't think that trying to match up the errmsg() with the errcode()
is a useful exercise.  That's just restricting the ways that people
can write error messages in a way that is otherwise unnecessary.  The
errmsg() and errcode() have different purposes; the former is to
provide a concise, human-readable description of the problem, and the
latter is to group errors into categories so that related errors can
be handled programmatically.  They need not be, and in many existing
cases are not, even vaguely similar; and the fact that many of our
permission denied errors happen to use that phrasing is not a
sufficient justification for changing the rest unless the new phrasing
is a bona fide improvement.
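As a concrete reference point, the two fields are supplied independently in an
ereport() call; the replication-slot message discussed elsewhere in this digest
looks roughly like this in the source (quoted from memory, so treat it as a
sketch):

    ereport(ERROR,
            (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
             errmsg("must be superuser or replication role to use replication slots")));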

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Andres Freund
On 2014-12-02 17:25:14 +0300, Alex Shulgin wrote:
> I'd be in favor of a solution that works the same way as before the
> patch, without the need for extra trigger files, etc., but that doesn't
> seem to be nearly possible.  Whatever tricks we might employ will likely
> be defeated by the fact that the oldschool user will fail to *include*
> recovery.conf in the main conf file.

I think removing the ability to define a trigger file is pretty much
unacceptable. It's not *too* bad to rewrite recovery.conf's contents
into actual valid postgresql.conf format, but it's harder to change an
existing complex failover setup that relies on the existence of such a
trigger.  I think aiming for removal of that is a sure way to prevent
the patch from getting in.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 1:38 PM, Peter Geoghegan  wrote:
> On Tue, Nov 25, 2014 at 4:01 AM, Robert Haas  wrote:
>> - This appears to needlessly reindent the comments for PG_CACHE_LINE_SIZE.
>
> Actually, the word "only" is removed (because PG_CACHE_LINE_SIZE has a
> new client now). So it isn't quite the same paragraph as before.

Oy, I missed that.

>> - I really don't think we need a #define in pg_config_manual.h for
>> this.  Please omit that.
>
> You'd prefer to not offer a way to disable abbreviation? Okay. I guess
> that makes sense - it should work well as a general optimization.

I'd prefer not to have a #define in pg_config_manual.h.  Only stuff
that we expect a reasonably decent number of users to need to change
should be in that file, and this is too marginal for that.  If anybody
other than the developers of the feature is disabling this, the whole
thing is going to be ripped back out.

> I'm not sure about that. I'd prefer to have tuplesort (and one or two
> other sites) set the "abbreviation is possible in principle" flag.
> Otherwise, sortsupport needs to assume that the leading attribute is
> going to be the abbreviation-applicable one, which might not always be
> true. Still, it's not as if I feel strongly about it.

When wouldn't that be true?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-12-02 Thread Emre Hasegeli
> I spent a fair chunk of the weekend hacking on this patch to make
> it more understandable and fix up a lot of what seemed to me pretty
> clear arithmetic errors in the "upper layers" of the patch.  However,
> I couldn't quite convince myself to commit it, because the business
> around estimation for partial histogram-bucket matches still doesn't
> make any sense to me.  Specifically this:
>
> /* Partial bucket match. */
> left_divider = inet_hist_match_divider(left, query, opr_codenum);
> right_divider = inet_hist_match_divider(right, query, 
> opr_codenum);
>
> if (left_divider >= 0 || right_divider >= 0)
> match += 1.0 / pow(2.0, Max(left_divider, right_divider));
>
> Now unless I'm missing something pretty basic about the divider
> function, it returns larger numbers for inputs that are "further away"
> from each other (ie, have more not-in-common significant address bits).
> So the above calculation seems exactly backwards to me: if one endpoint
> of a bucket is "close" to the query, or even an exact match, and the
> other endpoint is further away, we completely ignore the close/exact
> match and assign a bucket match fraction based only on the further-away
> endpoint.  Isn't that exactly backwards?

You are right that the partial bucket match calculation isn't fair in
some circumstances.

> I experimented with logic like this:
>
> if (left_divider >= 0 && right_divider >= 0)
> match += 1.0 / pow(2.0, Min(left_divider, right_divider));
> else if (left_divider >= 0 || right_divider >= 0)
> match += 1.0 / pow(2.0, Max(left_divider, right_divider));
>
> ie, consider the closer endpoint if both are valid.  But that didn't seem
> to work a whole lot better.  I think really we need to consider both
> endpoints not just one to the exclusion of the other.

I have tried many combinations like this, including arithmetic,
geometric, and logarithmic mean functions.  I could not get good results
with any of them, so I left it in the basic form.
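One of those attempts, written in the notation of the snippet quoted above,
would look roughly like this (a sketch of the arithmetic-mean variant, not code
from the patch):

    if (left_divider >= 0 && right_divider >= 0)
        match += 1.0 / pow(2.0, (left_divider + right_divider) / 2.0);
    else if (left_divider >= 0 || right_divider >= 0)
        match += 1.0 / pow(2.0, Max(left_divider, right_divider));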

Max() works pretty well most of the time, because if the query matches
the bucket it is generally close to both of the endpoints.  By using
Max(), we are actually crediting the ones which are close to both
of the endpoints.

> I'm also not exactly convinced by the divider function itself,
> specifically about the decision to fail and return -1 if the masklen
> comparison comes out wrong.  This effectively causes the masklen to be
> the most significant part of the value (after the IP family), which seems
> totally wrong.  ISTM we ought to consider the number of leading bits in
> common as the primary indicator of "how far apart" a query and a
> histogram endpoint are.

The partial match calculation with Max() is especially unfair to
the buckets where the more significant bits change, for example 63/8 and
64/8.  Returning -1 instead of a high divider forces it to use
the divider for the other endpoint.  We consider the number of leading
bits in common as the primary indicator, just for the other endpoint.

I have also experimented with the count of the common bits of
the endpoints of the bucket for better partial match calculation.
I could not find a meaningful equation using it.

> Even if the above aspects of the code are really completely right, the
> comments fail to explain why.  I spent a lot of time on the comments,
> but so far as these points are concerned they still only explain what
> is being done and not why it's a useful calculation to make.

I couldn't write better comments because I don't have strong arguments
about it.  We can say that we don't try to make use of both of
the endpoints, because we don't know how to combine them.  We only use
the one with matching family and masklen, and when both of them match
we use the more distant one to be on the safer side.

Thank you for looking at it.  Comments look much better now.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-12-02 Thread Robert Haas
On Tue, Nov 25, 2014 at 7:13 PM, Peter Geoghegan  wrote:
> Alright, so let me summarize what I think are the next steps in
> working towards getting this patch committed. I should produce a new
> revision which:
>
> * Uses default costings.
>
> * Applies a generic final quality check that enforces a distance of no
> greater than 50% of the total string size. (The use of default
> costings removes any reason to continue to do this)
>
> * Work through Robert's suggestions on other aspects that need work
> [1], most of which I already agreed to.

Sounds good so far.

> What is unclear is whether or not I should continue to charge extra
> for non-matching user supplied alias (and, I think more broadly,
> consider multiple RTEs iff the user did use an alias) - Robert was
> skeptical, but didn't seem to have made his mind up. I still think I
> should cost things based on aliases, and consider multiple RTEs even
> when the user supplied an alias (the penalty should just be a distance
> of 1 and not 3, though, in light of other changes to the
> weighing/costing). If I don't hear anything in the next day or two,
> I'll more or less preserve aliases-related aspects of the patch.

Basically, the case in which I think it's helpful to issue a
suggestion here is when the user has used the table name rather than
the alias name.  I wonder if it's worth checking for that case
specifically, in lieu of what you've done here, and issuing a totally
different hint in that case ("HINT: You must refer to this column as
"prime_minister.id" rather than "cameron.id").

Another idea, which I think I like less well, is to check the
Levenshtein distance between the allowed alias and the entered alias
and, if that's within the half-the-shorter-length threshold, consider
possible matches from that RTE, charge the distance between the
correct alias and the entered alias as a penalty to each potential
column match.

What I think won't do is to look at a situation where the user has
entered automobile.id and suggest that maybe they meant student.iq, or
even student.id. The amount of difference between the names has got to
matter for the RTE names, just as it does for the column names.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] superuser() shortcuts

2014-12-02 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Dec 2, 2014 at 2:35 PM, Stephen Frost  wrote:
> > * Robert Haas (robertmh...@gmail.com) wrote:
> >> On Tue, Dec 2, 2014 at 11:29 AM, Stephen Frost  wrote:
> >> > It includes the actual error message, unlike what we have today which is
> >> > guidence as to what's required to get past the permission denied error.
> >> >
> >> > In other words, I disagree that the same amount of information is being
> >> > conveyed.
> >>
> >> So, you think that someone might see the message "must be superuser or
> >> replication role to use replication slots" and fail to understand that
> >> they have a permissions problem?
> >
> > I think that the error message about a permissions problem should be
> > that there is a permissions problem, first and foremost.
> 
> I can't disagree with that, but it's not like the current message is:
> 
> ERROR: I ran into some kind of a problem but I'm not going to tell you
> what it is for a long time OK I guess that's enough well actually you
> see there is a problem and it happens to do have to do with
> permissions and in particular that they are denied.

I agree that doing this would be rather overkill.

> It's pretty clear that the current message is complaining about a
> permissions problem, so the fact that it doesn't specifically include
> the words "permission" and "denied" doesn't bother me.  Let me ask the
> question again: what information do you think is conveyed by your
> proposed rewrite that is not conveyed by the current message?

I do like that it's explicitly stating that the problem is with
permissions and not because of some lack of capability in the software,
and I like the consistency of messaging.  (Somewhere along the way I was
suggesting that we update the error messaging documentation to be
explicit that the errmsg and the errcode should make sense and match up-
NOT that we should use some simple mapping from one to the other, as
we'd then be reducing the information provided.  I've seen a number of
cases where people have the SQL error-code also returned in their psql
session; I've never been a fan of that, as I much prefer words over
error-codes, but I wonder if they've done that because, in some cases,
it *isn't* clear what the errcode is from the errmsg..).

> > We don't say 'You must have SELECT rights on relation X to select from
> > it.' when you try to do a SELECT that you're not allowed to.  We throw a
> > permissions denied error.  As pointed out up-thread, we do similar
> > things for other cases of permissions denied and even beyond permission
> > denied errors we tend to match up the errmsg() with the errcode(), which
> > makes a great deal of sense to me and is why I'm advocating that
> > position.
> 
> I don't think that trying to match up the errmsg() with the errcode()
> is a useful exercise.  That's just restricting the ways that people
> can write error messages in a way that is otherwise unnecessary.  The
> errmsg() and errcode() have different purposes; the former is to
> provide a concise, human-readable description of the problem, and the
> latter is to group errors into categories so that related errors can
> be handled programmatically.  They need not be, and in many existing
> cases are not, even vaguely similar; and the fact that many of our
> permission denied errors happen to use that phrasing is not a
> sufficient justification for changing the rest unless the new phrasing
> is a bona fide improvement.

It provides a consistency of messaging that I feel is an improvement.
That they aren't related in a number of existing cases strikes me as an
opportunity to improve rather than cases where we really "know better".
Perhaps in some of those cases the problem is that there isn't a good
errcode, and maybe we should actually add an error code instead of
changing the message.  Perhaps there are cases where we just know better
or it's a very generic errcode without any real meaning.  I agree that
we want the messaging to be good; I'd like it to also be consistent with
itself and with the errcode.

I've not gone and tried to do a deep look at the errmsg vs. errcode issue,
and that's likely too much to bite off at once anyway, but I have gone and
looked at the role-attribute uses and the other permission-denied errcode
cases: we're not at all consistent today, and we change from release to
release depending on changes to the permissions system (see: TRUNCATE).
I specifically don't like that.

If we want to say that role-attribute-related error messages should be
"You need X to do Y", then I'll go make those changes, removing the
'permission denied' wording where those are used, and I'll be happier that
we're at least consistent; but it'll be annoying if we ever make those
role attributes GRANT'able rights (GRANT .. ON CLUSTER?), as those
errmsgs will then change from "you need X" to "permission denied to do
Y".  Having the errdetail change bothers me a lot less.

How would you feel about

Re: [HACKERS] Selectivity estimation for inet operators

2014-12-02 Thread Emre Hasegeli
> Actually, there's a second large problem with this patch: blindly
> iterating through all combinations of MCV and histogram entries makes the
> runtime O(N^2) in the statistics target.  I made up some test data (by
> scanning my mail logs) and observed the following planning times, as
> reported by EXPLAIN ANALYZE:
>
> explain analyze select * from relays r1, relays r2 where r1.ip = r2.ip;
> explain analyze select * from relays r1, relays r2 where r1.ip && r2.ip;
>
> stats target    eqjoinsel    networkjoinsel
>
> 100             0.27 ms      1.85 ms
> 1000            2.56 ms      167.2 ms
> 10000           56.6 ms      13987.1 ms
>
> I don't think it's necessary for network selectivity to be quite as
> fast as eqjoinsel, but I doubt we can tolerate 14 seconds planning
> time for a query that might need just milliseconds to execute :-(
>
> It seemed to me that it might be possible to reduce the runtime by
> exploiting knowledge about the ordering of the histograms, but
> I don't have time to pursue that right now.

I made it break out of the loop once we have passed the last possible
match.  Patch attached.  It reduces the runtime by almost 50% with large
histograms.

We could also make it use only some elements of the MCV and histogram
for join estimation.  I experimented with reducing the right and the
left hand sides of the lists in previous versions.  I remember it was
better to reduce only the left hand side.  I think it would be enough to
use log(n) elements of the right hand side MCV and histogram.  I can
make that change, if you think the selectivity estimation function is
the right place for this optimization.
diff --git a/src/backend/utils/adt/network_selfuncs.c b/src/backend/utils/adt/network_selfuncs.c
index f854847..16f39db 100644
--- a/src/backend/utils/adt/network_selfuncs.c
+++ b/src/backend/utils/adt/network_selfuncs.c
@@ -612,20 +612,23 @@ inet_hist_value_sel(Datum *values, int nvalues, Datum constvalue,
 		return 0.0;
 
 	query = DatumGetInetPP(constvalue);
 
 	/* "left" is the left boundary value of the current bucket ... */
 	left = DatumGetInetPP(values[0]);
 	left_order = inet_inclusion_cmp(left, query, opr_codenum);
 
 	for (i = 1; i < nvalues; i++)
 	{
+		if (left_order == 256)
+			break;
+
 		/* ... and "right" is the right boundary value */
 		right = DatumGetInetPP(values[i]);
 		right_order = inet_inclusion_cmp(right, query, opr_codenum);
 
 		if (left_order == 0 && right_order == 0)
 		{
 			/* The whole bucket matches, since both endpoints do. */
 			match += 1.0;
 		}
 		else if ((left_order <= 0 && right_order >= 0) ||
@@ -854,20 +857,23 @@ inet_opr_codenum(Oid operator)
 static int
 inet_inclusion_cmp(inet *left, inet *right, int opr_codenum)
 {
 	if (ip_family(left) == ip_family(right))
 	{
 		int			order;
 
 		order = bitncmp(ip_addr(left), ip_addr(right),
 		Min(ip_bits(left), ip_bits(right)));
 
+		if (order > 0)
+			return 256;
+
 		if (order != 0)
 			return order;
 
 		return inet_masklen_inclusion_cmp(left, right, opr_codenum);
 	}
 
 	return ip_family(left) - ip_family(right);
 }
 
 /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 1:00 PM, Robert Haas  wrote:
> I'd prefer not to have a #define in pg_config_manual.h.  Only stuff
> that we expect a reasonably decent number of users to need to change
> should be in that file, and this is too marginal for that.  If anybody
> other than the developers of the feature is disabling this, the whole
> thing is going to be ripped back out.

I agree. The patch either works well in general and it goes in, or it
doesn't and should be rejected. That doesn't mean that the standard
applied is that regressions are absolutely unacceptable, but the
standard shouldn't be too far off that.  I feel pretty confident that
we'll be able to meet that standard, though, because database luminary
Jim Gray recommended this technique in about 1995. Even if the details
of what I have here could stand to be tweaked, there is no getting
around the fact that a good sort routine needs to strongly consider
locality. That was apparent even in 1995, but now it's a very major
trend.

Incidentally, I think that an under-appreciated possible source of
regressions here is abbreviated attributes that have a strong
physical/logical correlation. I could see a small regression for one
such case even though my cost model indicated that it should be very
profitable. On the other hand, on other occasions my cost model (i.e.
considering how good a proxy abbreviated key cardinality is for full key
cardinality) was quite pessimistic. At least it was only a small
regression, even though the correlation was something like 0.95. And at
least the sort will be very fast in any case.

You'll recall that Heikki's test case involved correlation like that,
even though it was mostly intended to make a point about the entropy
in abbreviated keys. Correlation was actually the most important
factor there. I think it might be generally true that it's the most
important factor, in practice more important even than capturing
sufficient entropy in the abbreviated key representation.

>> I'm not sure about that. I'd prefer to have tuplesort (and one or two
>> other sites) set the "abbreviation is possible in principle" flag.
>> Otherwise, sortsupport needs to assume that the leading attribute is
>> going to be the abbreviation-applicable one, which might not always be
>> true. Still, it's not as if I feel strongly about it.
>
> When wouldn't that be true?

It just feels a bit wrong to me. There might be a future in which we
want to use the datum1 field for a non-leading attribute. For example,
when it is known ahead of time that there are low cardinality integers
in the leading key/attribute. Granted, that's pretty speculative, but
then it's not as if I'm insisting that it must be done that way. I
defer to you.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-12-02 Thread Bruce Momjian
On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
> Likely for most aggregates, like count, sum, max, min, bit_and and bit_or the
> merge function would be the same as the transition function, as the state type
> is just the same as the input type. It would only be aggregates like avg(),
> stddev*(), bool_and() and bool_or() that would need a new merge function
> made... These would be no more complex than the transition functions... Which
> are just a few lines of code anyway.
> 
> We'd simply just not run parallel query if any aggregates used in the query
> didn't have a merge function.
> 
> When I mentioned this, I didn't mean to appear to be placing a road block. I was
> just bringing to the table the information that COUNT(*) + COUNT(*) works ok
> for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n) does not.

Sorry, late reply, but, FYI, I don't think our percentile functions
can be parallelized in the same way:

test=> \daS *percent*
                                             List of aggregate functions
   Schema   |      Name       |  Result data type  |             Argument data types              |             Description
------------+-----------------+--------------------+----------------------------------------------+--------------------------------------
 pg_catalog | percent_rank    | double precision   | VARIADIC "any" ORDER BY VARIADIC "any"       | fractional rank of hypothetical row
 pg_catalog | percentile_cont | double precision   | double precision ORDER BY double precision   | continuous distribution percentile
 pg_catalog | percentile_cont | double precision[] | double precision[] ORDER BY double precision | multiple continuous percentiles
 pg_catalog | percentile_cont | interval           | double precision ORDER BY interval           | continuous distribution percentile
 pg_catalog | percentile_cont | interval[]         | double precision[] ORDER BY interval         | multiple continuous percentiles
 pg_catalog | percentile_disc | anyelement         | double precision ORDER BY anyelement         | discrete percentile
 pg_catalog | percentile_disc | anyarray           | double precision[] ORDER BY anyelement       | multiple discrete percentiles

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2 recovery/startup problems

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 11:54 AM, Jeff Janes  wrote:
> During abort processing after getting a SIGTERM, the back end truncates
> 59288 to zero size, and unlinks all the other files (including 59288_init).
> The actual removal of 59288 is left until the checkpoint.  So if you SIGTERM
> the backend, then take down the server uncleanly before the next checkpoint
> completes, you are left with just 59288.
>
> Here is the strace:
>
> open("base/16416/59288", O_RDWR)= 8
> ftruncate(8, 0) = 0
> close(8)= 0
> unlink("base/16416/59288.1")= -1 ENOENT (No such file or
> directory)
> unlink("base/16416/59288_fsm")  = -1 ENOENT (No such file or
> directory)
> unlink("base/16416/59288_vm")   = -1 ENOENT (No such file or
> directory)
> unlink("base/16416/59288_init") = 0
> unlink("base/16416/59288_init.1")   = -1 ENOENT (No such file or
> directory)

Hmm, that's not good.

I guess we can either adopt your suggestion of adjusting
ResetUnloggedRelationsInDbspaceDir() to cope with the possibility that
the situation has changed during recovery, or else figure out how to
be more stringent about the order in which forks get removed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 4:21 PM, Peter Geoghegan  wrote:
>>> I'm not sure about that. I'd prefer to have tuplesort (and one or two
>>> other sites) set the "abbreviation is possible in principle" flag.
>>> Otherwise, sortsupport needs to assume that the leading attribute is
>>> going to be the abbreviation-applicable one, which might not always be
>>> true. Still, it's not as if I feel strongly about it.
>>
>> When wouldn't that be true?
>
> It just feels a bit wrong to me. There might be a future in which we
> want to use the datum1 field for a non-leading attribute. For example,
> when it is known ahead of time that there are low cardinality integers
> in the leading key/attribute. Granted, that's pretty speculative, but
> then it's not as if I'm insisting that it must be done that way. I
> defer to you.

Well, maybe you should make the updates we've agreed on and I can take
another look at it.  But I didn't think that I was proposing to change
anything about the level at which the decision about whether to
abbreviate or not was made; rather, I thought I was suggesting that we
pass that flag down to the code that initializes the sortsupport
object as an argument rather than through the sortsupport structure
itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-02 Thread Alex Shulgin

Andres Freund  writes:

> On 2014-12-02 17:25:14 +0300, Alex Shulgin wrote:
>> I'd be in favor of a solution that works the same way as before the
>> patch, without the need for extra trigger files, etc., but that doesn't
>> seem to be nearly possible.  Whatever tricks we might employ will likely
>> be defeated by the fact that the oldschool user will fail to *include*
>> recovery.conf in the main conf file.
>
> I think removing the ability to define a trigger file is pretty much
> unacceptable. It's not *too* bad to rewrite recovery.conf's contents
> into actual valid postgresql.conf format, but it's harder to change an
> existing complex failover setup that relies on the existence of such a
> trigger.  I think aiming for removal of that is a sure way to prevent
> the patch from getting in.

To make it clear, I was talking not about trigger_file, but about the
standby.enabled file, which triggers the recovery/PITR/standby scenario
in the current form of the patch and stands as a replacement for the
original check for the presence of recovery.conf.

--
Alex


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas  wrote:
> Well, maybe you should make the updates we've agreed on and I can take
> another look at it.

Agreed.

> But I didn't think that I was proposing to change
> anything about the level at which the decision about whether to
> abbreviate or not was made; rather, I thought I was suggesting that we
> pass that flag down to the code that initializes the sortsupport
> object as an argument rather than through the sortsupport structure
> itself.

The flag I'm talking about concerns the *applicability* of
abbreviation, and not whether or not it will actually be used (maybe
the opclass lacks support, or decides not to for some platform
specific reason). Tuplesort has a contract with abbreviation +
sortsupport that considers whether or not the function pointer used to
abbreviate is set, which relates to whether or not abbreviation will
*actually* be used. Note that for non-abbreviation-applicable
attributes, btsortsupport_worker() never sets the function pointer
(nor, incidentally, does it set the other abbreviation related
function pointers in the struct).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 5:16 PM, Peter Geoghegan  wrote:
> On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas  wrote:
>> Well, maybe you should make the updates we've agreed on and I can take
>> another look at it.
>
> Agreed.
>
>> But I didn't think that I was proposing to change
>> anything about the level at which the decision about whether to
>> abbreviate or not was made; rather, I thought I was suggesting that we
>> pass that flag down to the code that initializes the sortsupport
>> object as an argument rather than through the sortsupport structure
>> itself.
>
> The flag I'm talking about concerns the *applicability* of
> abbreviation, and not whether or not it will actually be used (maybe
> the opclass lacks support, or decides not to for some platform
> specific reason). Tuplesort has a contract with abbreviation +
> sortsupport that considers whether or not the function pointer used to
> abbreviate is set, which relates to whether or not abbreviation will
> *actually* be used. Note that for non-abbreviation-applicable
> attributes, btsortsupport_worker() never sets the function pointer
> (nor, incidentally, does it set the other abbreviation related
> function pointers in the struct).

Right, and what I'm saying is that maybe the "applicability" flag
shouldn't be stored in the SortSupport object, but passed down as an
argument.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas  wrote:
> Right, and what I'm saying is that maybe the "applicability" flag
> shouldn't be stored in the SortSupport object, but passed down as an
> argument.

But then how does that information get to any given sortsupport
routine? That's the place that really needs to know if abbreviation is
useful. In general, they're only passed a SortSupport object. Are you
suggesting revising the signature required of SortSupport routines to
add that extra flag as an additional argument?

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Dec 2, 2014 at 2:21 PM, Robert Haas  wrote:
>> Right, and what I'm saying is that maybe the "applicability" flag
>> shouldn't be stored in the SortSupport object, but passed down as an
>> argument.

> But then how does that information get to any given sortsupport
> routine? That's the place that really needs to know if abbreviation is
> useful. In general, they're only passed a SortSupport object. Are you
> suggesting revising the signature required of SortSupport routines to
> add that extra flag as an additional argument?

I think that is what he's suggesting, and I too am wondering why it's
a good idea.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Nitpicky doc corrections for BRIN functions of pageinspect

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 12:23 AM, Alvaro Herrera
 wrote:
> Peter Geoghegan wrote:
>> On Wed, Nov 19, 2014 at 8:02 PM, Michael Paquier
>>  wrote:
>> > Just a small thing I noticed while looking at pageinspect.sgml, the
>> > set of SQL examples related to BRIN indexes uses lower-case characters
>> > for reserved keywords. This has been introduced by 7516f52.
>> > Patch is attached.
>>
>> My nitpicky observation about pageinspect + BRIN was that the comments
>> added to the pageinspect SQL file for the SQL-callable function
>> brin_revmap_data() have a comment header style slightly inconsistent
>> with the existing items.
>
> Pushed.
Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 2:17 AM, Robert Haas  wrote:
> On Wed, Nov 26, 2014 at 11:00 PM, Michael Paquier
>  wrote:
>> On Wed, Nov 26, 2014 at 8:27 PM, Syed, Rahila  
>> wrote:
>>> Don't we need to initialize doPageCompression  similar to doPageWrites in 
>>> InitXLOGAccess?
>> Yep, you're right. I missed this code path.
>>
>>> Also , in the earlier patches compression was set 'on' even when fpw GUC is 
>>> 'off'. This was to facilitate compression of FPW which are forcibly written 
>>> even when fpw GUC is turned off.
>>>  doPageCompression in this patch is set to true only if value of fpw GUC is 
>>> 'compress'. I think its better to compress forcibly written full page 
>>> writes.
>> Meh? (stealing a famous quote).
>> This is backward-incompatible in the fact that forcibly-written FPWs
>> would be compressed all the time, even if FPW is set to off. The
>> documentation of the previous patches also mentioned that images are
>> compressed only if this parameter value is switched to compress.
>
> If we have a separate GUC to determine whether to do compression of
> full page writes, then it seems like that parameter ought to apply
> regardless of WHY we are doing full page writes, which might be either
> that full_page_writes=on in general, or that we've temporarily turned
> them on for the duration of a full backup.

In the latest versions of the patch, control of compression is done
within full_page_writes by assigning a new value 'compress'. Something
that I am scared of is that if we enforce compression of forcibly-written
pages when full_page_writes is off, and a bug shows up in the
compression/decompression algorithm at some point (that's unlikely to
happen as this has been used for years with TOAST, but let's say "if"),
we may corrupt a lot of backups. Hence, why not simply have a new GUC
parameter to fully control it?  The first versions of the patch did
that, and ISTM that this is better than enforcing the use of a new
feature on our user base.

Now, something that has not been mentioned on this thread is to make
compression the default behavior in all cases, so that we wouldn't even
need a GUC parameter. We are usually conservative about changing default
behaviors, so I don't really think that's the way to go; just mentioning
the possibility.

Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about a option to disable autovacuum cancellation on lock conflict?

2014-12-02 Thread Jim Nasby

On 12/2/14, 2:22 PM, Jeff Janes wrote:

Or maybe I overestimate how hard it would be to make vacuum restartable.  You 
would have to save a massive amount of state (up to a maintenance_work_mem tid 
list, plus the block you left off on in both the table and all of the indexes in 
that table), and you would somehow have to validate that saved state against any 
changes that might have occurred to the table or the indexes while it was saved 
and you were not holding the lock, which seems like it would be almost as full of 
corner cases as weakening the lock in the first place.  Aren't they logically 
the same thing?  If we could drop the lock and take it up again later, maybe 
the answer is not to save the state, but just to pause the vacuum until the 
lock becomes free again, in effect saving the state in situ.  That would allow 
the autovac worker to be held hostage to anyone taking a lock, though.


Yeah, rather than messing with any of that, I think it would make a lot more 
sense to split vacuum into smaller operations that don't require such a huge 
chunk of time.


The only easy way to do it that I see is to have it only stop at the end of an 
index-cleaning cycle, which probably takes too long to block for.  Or record a 
restart point at the end of each index-cleaning cycle, and then when it yields 
the lock it abandons all work since the last cycle end, rather than since the 
beginning.  That would be better than what we have, but seems like a far cry 
from actual restarting from any point.


Now that's not a bad idea. This would basically mean just saving a block number 
in pg_class after every intermediate index clean and then setting that back to 
zero when we're done with that relation, right?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.5] Custom Plan API

2014-12-02 Thread Kouhei Kaigai
> -Original Message-
> From: Simon Riggs [mailto:si...@2ndquadrant.com]
> Sent: Thursday, November 27, 2014 8:48 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; Thom Brown; Kohei KaiGai; Tom Lane; Alvaro Herrera; Shigeru
> Hanada; Stephen Frost; Andres Freund; PgHacker; Jim Mlodgenski; Peter
> Eisentraut
> Subject: Re: [HACKERS] [v9.5] Custom Plan API
> 
> On 27 November 2014 at 10:33, Kouhei Kaigai  wrote:
> 
> > The reason why documentation portion was not yet committed is, sorry,
> > it is due to quality of documentation from the standpoint of native
> > English speaker.
> > Now, I'm writing up a documentation stuff according to the latest code
> > base, please wait for several days and help to improve.
> 
> Happy to help with that.
> 
> Please post to the Wiki first so we can edit it communally.
> 
Simon, 

I tried to describe how the custom-scan provider interacts with the core
backend, and the expectations for the individual callbacks, here:

  https://wiki.postgresql.org/wiki/CustomScanInterface

I'd like to see what kind of description should be added, from a third
person's viewpoint.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Many processes blocked at ProcArrayLock

2014-12-02 Thread Michael Paquier
On Tue, Dec 2, 2014 at 5:07 PM, Xiaoyulei  wrote:
> Test configuration:
> Hardware:
> 4P intel server, 60 core 120 hard thread.
> Memory:512G
> SSD:2.4T
>
> PG:
> max_connections = 160   # (change requires restart)
> shared_buffers = 32GB
> work_mem = 128MB
> maintenance_work_mem = 32MB
> bgwriter_delay = 100ms  # 10-1ms between rounds
> bgwriter_lru_maxpages = 200 # 0-1000 max buffers written/round
> bgwriter_lru_multiplier = 2.0   # 0-10.0 multipler on buffers 
> scanned/round
> wal_level = minimal # minimal, archive, or hot_standby
> wal_buffers = 256MB # min 32kB, -1 sets based on 
> shared_buffers
> autovacuum = off
> checkpoint_timeout=60min
> checkpoint_segments = 1000
> archive_mode = off
> synchronous_commit = off
> fsync = off
> full_page_writes = off
>
>
> We use tpcc and pgbench to test postgresql 9.4beat2 performance. And we found 
> the tps/tpmc could not increase with the terminal increase. The detail 
> information is in attachment.
>
> Many processes is blocked, I dump the call stack, and found these processes 
> is blocked at: ProcArrayLock. 60% processes is blocked in 
> ProcArrayEndTransaction with ProcArrayLock EXCLUSIVE, 20% is in 
> GetSnapshotData with ProcArrayLock SHARED. Others locks like XLogFlush and 
> WALInsertLock are not very heavy.
>
> Is there any way we solve this problem?
Providing complete backtraces showing in which code paths those
processes are blocked would help better understand what may be
going on.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 2:16 PM, Peter Geoghegan  wrote:
> On Tue, Dec 2, 2014 at 2:07 PM, Robert Haas  wrote:
>> Well, maybe you should make the updates we've agreed on and I can take
>> another look at it.
>
> Agreed.

Attached, revised patchset makes these updates. I continue to use the
sortsupport struct to convey that a given attribute of a given sort is
abbreviation-applicable (although the field is now a bool, not an
enum).

-- 
Peter Geoghegan
From 865a1abeb6bfb601b1ec605afb1e339c0e444e10 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Sun, 9 Nov 2014 14:38:44 -0800
Subject: [PATCH 2/2] Estimate total number of rows to be sorted

Sortsupport opclasses now accept a row hint, indicating the estimated
number of rows to be sorted.  This gives opclasses a sense of proportion
about how far along the copying of tuples is when considering aborting
abbreviation.

Estimates come from various sources.  The text opclass now always avoids
aborting abbreviation if the total number of rows to be sorted is high
enough, without considering cardinality at all.
---
 src/backend/access/nbtree/nbtree.c |  5 ++-
 src/backend/access/nbtree/nbtsort.c| 14 +-
 src/backend/commands/cluster.c |  4 +-
 src/backend/executor/nodeAgg.c |  5 ++-
 src/backend/executor/nodeSort.c|  1 +
 src/backend/utils/adt/orderedsetaggs.c |  2 +-
 src/backend/utils/adt/varlena.c| 80 --
 src/backend/utils/sort/tuplesort.c | 14 --
 src/include/access/nbtree.h|  2 +-
 src/include/utils/sortsupport.h|  7 ++-
 src/include/utils/tuplesort.h  |  6 +--
 11 files changed, 121 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index d881525..d26c60b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -109,14 +109,15 @@ btbuild(PG_FUNCTION_ARGS)
 		elog(ERROR, "index \"%s\" already contains data",
 			 RelationGetRelationName(index));
 
-	buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false);
+	buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique,
+	 indexInfo->ii_Predicate != NIL, false);
 
 	/*
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.
 	 */
 	if (indexInfo->ii_Unique)
-		buildstate.spool2 = _bt_spoolinit(heap, index, false, true);
+		buildstate.spool2 = _bt_spoolinit(heap, index, false, true, true);
 
 	/* do the heap scan */
 	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 593571b..473ac54 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -73,6 +73,7 @@
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/rel.h"
+#include "utils/selfuncs.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplesort.h"
 
@@ -149,10 +150,13 @@ static void _bt_load(BTWriteState *wstate,
  * create and initialize a spool structure
  */
 BTSpool *
-_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
+_bt_spoolinit(Relation heap, Relation index, bool isunique, bool ispartial,
+			  bool isdead)
 {
 	BTSpool*btspool = (BTSpool *) palloc0(sizeof(BTSpool));
 	int			btKbytes;
+	double		estRows;
+	float4		relTuples;
 
 	btspool->heap = heap;
 	btspool->index = index;
@@ -165,10 +169,16 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
 	 * unique index actually requires two BTSpool objects.  We expect that the
 	 * second one (for dead tuples) won't get very full, so we give it only
 	 * work_mem.
+	 *
+	 * Certain cases will always have a relTuples of 0, such as reindexing as
+	 * part of a CLUSTER operation, or when reindexing toast tables.  This is
+	 * interpreted as "no estimate available".
 	 */
 	btKbytes = isdead ? work_mem : maintenance_work_mem;
+	relTuples = RelationGetForm(heap)->reltuples;
+	estRows =  relTuples * (isdead || ispartial ?  DEFAULT_INEQ_SEL : 1);
 	btspool->sortstate = tuplesort_begin_index_btree(heap, index, isunique,
-	 btKbytes, false);
+	 btKbytes, estRows, false);
 
 	return btspool;
 }
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index bc5f33f..8e5f536 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -890,7 +890,9 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	/* Set up sorting if wanted */
 	if (use_sort)
 		tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
-			maintenance_work_mem, false);
+			maintenance_work_mem,
+			RelationGetForm(OldHeap)->reltuples,
+			false);
 	else
 		tuplesort = NULL;
 
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 89de755..95143c3 100644
--- a/src/backend/executor/nodeAg

Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 5:28 PM, Peter Geoghegan  wrote:
> Attached, revised patchset makes these updates.

Whoops. Missed some obsolete comments. Here is a third commit that
makes a further small modification to one comment.

-- 
Peter Geoghegan
From 8d1aba80f95e05742047cba5bd83d8f17aa5ef37 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 2 Dec 2014 17:42:21 -0800
Subject: [PATCH 3/3] Alter comments to reflect current naming

---
 src/include/utils/sortsupport.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/include/utils/sortsupport.h b/src/include/utils/sortsupport.h
index 659233b..f7c73b3 100644
--- a/src/include/utils/sortsupport.h
+++ b/src/include/utils/sortsupport.h
@@ -127,9 +127,9 @@ typedef struct SortSupportData
 	 * Returning zero from the alternative comparator does not indicate
 	 * equality, as with a conventional support routine 1, though -- it
 	 * indicates that it wasn't possible to determine how the two abbreviated
-	 * values compared.  A proper comparison, using "auth_comparator"/
-	 * ApplySortComparatorFull() is therefore required.  In many cases this
-	 * results in most or all comparisons only using the cheap alternative
+	 * values compared.  A proper comparison, using "abbrev_full_comparator"/
+	 * ApplySortAbbrevFullComparator() is therefore required.  In many cases
+	 * this results in most or all comparisons only using the cheap alternative
 	 * comparison func, which is typically implemented as code that compiles to
 	 * just a few CPU instructions.  CPU cache miss penalties are expensive; to
 	 * get good overall performance, sort infrastructure must heavily weigh
-- 
1.9.1


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] About xmllint checking for the validity of postgres.xml in 9.5

2014-12-02 Thread Michael Paquier
Hi all,

Since commit 5d93ce2d, the output of xmllint is checked by passing
--valid to it. Isn't that a regression from what we were doing in
pre-9.5 versions? For example, with 9.4 and older versions it is
possible to compile man pages even if the xml spec is not entirely
valid when using docbook 4.2.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] About xmllint checking for the validity of postgres.xml in 9.5

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 12:09 PM, Michael Paquier
 wrote:
> Since commit 5d93ce2d, the output of xmllint is checked by passing
> --valid to it. Isn't that a regression with what we were doing for
> pre-9.4 versions? For example, with 9.4 and older versions it is
> possible to compile man pages even if the xml spec is not entirely
> valid when using docbook 4.2.

Another thing that comes to my mind: why don't we simply have a variable
to pass flags to xmllint, similarly to xsltproc? Packagers would then be
free to pass the arguments they want. (Note that in some of the
environments where I build the docs, postgres.xml is found to be invalid,
making the build fail for master only, not for older branches.)

In any case, attached is a patch showing the idea, bringing more
flexibility to the build, with the default value being "--valid --noout"
if the flag is not passed by the caller.
Regards,
-- 
Michael
diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 8bdd26c..99ce106 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -48,6 +48,10 @@ ifndef XMLLINT
 XMLLINT = $(missing) xmllint
 endif
 
+ifndef XMLLINTCFLAGS
+XMLLINTCFLAGS = --valid --noout
+endif
+
 ifndef XSLTPROC
 XSLTPROC = $(missing) xsltproc
 endif
@@ -82,7 +86,7 @@ override SPFLAGS += -wall -wno-unused-param -wno-empty -wfully-tagged
 man distprep-man: man-stamp
 
 man-stamp: stylesheet-man.xsl postgres.xml
-	$(XMLLINT) --noout --valid postgres.xml
+	$(XMLLINT) $(XMLLINTCFLAGS) postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $(XSLTPROC_MAN_FLAGS) $^
 	touch $@
 
@@ -259,13 +263,13 @@ endif
 xslthtml: xslthtml-stamp
 
 xslthtml-stamp: stylesheet.xsl postgres.xml
-	$(XMLLINT) --noout --valid postgres.xml
+	$(XMLLINT) $(XMLLINTCFLAGS) postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $(XSLTPROC_HTML_FLAGS) $^
 	cp $(srcdir)/stylesheet.css html/
 	touch $@
 
 htmlhelp: stylesheet-hh.xsl postgres.xml
-	$(XMLLINT) --noout --valid postgres.xml
+	$(XMLLINT) $(XMLLINTCFLAGS) postgres.xml
 	$(XSLTPROC) $(XSLTPROCFLAGS) $^
 
 %-A4.fo.tmp: stylesheet-fo.xsl %.xml
@@ -287,7 +291,7 @@ FOP = fop
 
 epub: postgres.epub
 postgres.epub: postgres.xml
-	$(XMLLINT) --noout --valid $<
+	$(XMLLINT) $(XMLLINTCFLAGS) $<
 	$(DBTOEPUB) $<
 
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-02 Thread Robert Haas
On Tue, Dec 2, 2014 at 7:16 PM, Michael Paquier
 wrote:
> In the latest versions of the patch, control of compression is done
> within full_page_writes by assigning a new value 'compress'. Something
> that I am scared of is that if we enforce compression of forcibly-written
> pages when full_page_writes is off, and a bug shows up in the
> compression/decompression algorithm at some point (that's unlikely to
> happen as this has been used for years with TOAST, but let's say "if"),
> we may corrupt a lot of backups. Hence, why not simply have a new GUC
> parameter to fully control it?  The first versions of the patch did
> that, and ISTM that this is better than enforcing the use of a new
> feature on our user base.

That's a very valid concern.  But maybe it shows that
full_page_writes=compress is not the Right Way To Do It, because then
there's no way for the user to choose the behavior they want when
full_page_writes=off but yet a backup is in progress.  If we had a
separate GUC, we could know the user's actual intention, instead of
guessing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 12:35 PM, Robert Haas  wrote:
> On Tue, Dec 2, 2014 at 7:16 PM, Michael Paquier
>  wrote:
>> In the latest versions of the patch, control of compression is done
>> within full_page_writes by assigning a new value 'compress'. Something
>> that I am scared of is that if we enforce compression of forcibly-written
>> pages when full_page_writes is off, and a bug shows up in the
>> compression/decompression algorithm at some point (that's unlikely to
>> happen as this has been used for years with TOAST, but let's say "if"),
>> we may corrupt a lot of backups. Hence, why not simply have a new GUC
>> parameter to fully control it?  The first versions of the patch did
>> that, and ISTM that this is better than enforcing the use of a new
>> feature on our user base.
>
> That's a very valid concern.  But maybe it shows that
> full_page_writes=compress is not the Right Way To Do It, because then
> there's no way for the user to choose the behavior they want when
> full_page_writes=off but yet a backup is in progress.  If we had a
> separate GUC, we could know the user's actual intention, instead of
> guessing.
Note that implementing a separate parameter for this patch would not
be very complicated if the core portion does not change much. What
about the long name full_page_compression, or the longer name
full_page_writes_compression?
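
For illustration only, the separate-GUC approach would presumably end up
looking something like this in postgresql.conf (the parameter name is just
one of the suggestions above, not a committed setting):

    full_page_writes = on
    full_page_compression = on   # hypothetical knob; would also apply to
                                 # FPWs forced while a backup is in progress
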
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On partitioning

2014-12-02 Thread Amit Langote

Hi Robert,

From: Robert Haas [mailto:robertmh...@gmail.com]
> > * Catalog schema:
> >
> > CREATE TABLE pg_catalog.pg_partitioned_rel
> > (
> >partrelidoidNOT NULL,
> >partkindoidNOT NULL,
> >partissub  bool  NOT NULL,
> >partkey int2vector NOT NULL, -- partitioning attributes
> >partopclass oidvector,
> >
> >PRIMARY KEY (partrelid, partissub),
> >FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
> >FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
> > )
> > WITHOUT OIDS ;
> 
> So, we're going to support exactly two levels of partitioning?
> partitions with partissub=false and subpartitions with partissub=true?
>  Why not support only one level of partitioning here but then let the
> children have their own pg_partitioned_rel entries if they are
> subpartitioned?  That seems like a cleaner design and lets us support
> an arbitrary number of partitioning levels if we ever need them.
> 

Yeah, that's what I thought at some point, in favour of dropping partissub 
altogether.

However, not that this design solves it, there is one question - would we want 
to support defining both a partition key and a sub-partition key for a table in 
advance?  That is, without having defined a first-level partition yet; in that 
case, what level do we associate the sub-(sub-)partitioning key with, or, more 
to the point, where do we keep it?  One way is to replace partissub with 
partkeylevel, with level 0 being the topmost-level partitioning key and so on, 
while keeping partrelid equal to the pg_class.oid of the parent.  That brings 
us to the next question of managing hierarchies in pg_partition_def 
corresponding to partkeylevel in the definition of the topmost partitioned 
relation.  But I guess those are implementation details rather than 
representational ones, unless I am being too naïve.
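
To make that alternative concrete, a rough sketch (illustrative only, simply 
reusing the column names already floated in this thread) could be:

CREATE TABLE pg_catalog.pg_partitioned_rel
(
   partrelid     oid        NOT NULL,  -- the partitioned relation
   partkeylevel  int2       NOT NULL,  -- 0 = topmost key, 1 = sub-partitioning key, ...
   partkind      oid        NOT NULL,
   partkey       int2vector NOT NULL,  -- partitioning attributes
   partopclass   oidvector,

   PRIMARY KEY (partrelid, partkeylevel),
   FOREIGN KEY (partrelid)   REFERENCES pg_class (oid),
   FOREIGN KEY (partopclass) REFERENCES pg_opclass (oid)
)
WITHOUT OIDS ;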

> > CREATE TABLE pg_catalog.pg_partition_def
> > (
> >partitionid  oid NOT NULL,
> >partitionparentrel   oidNOT NULL,
> >partitionisoverflow bool  NOT NULL,
> >partitionvalues anyarray,
> >
> >PRIMARY KEY (partitionid),
> >FOREIGN KEY (partitionid) REFERENCES pg_class(oid)
> > )
> > WITHOUT OIDS;
> >
> > ALTER TABLE pg_catalog.pg_class ADD COLUMN relispartitioned;
> 
> What is an overflow partition and why do we want that?
> 

That would be a default partition; that is, where the tuples that don't belong 
elsewhere (in other defined partitions) go. The VALUES clause of the definition 
for such a partition would look like:

(a range partition) ... VALUES LESS THAN MAXVALUE 
(a list partition) ... VALUES DEFAULT

There has been discussion about whether there should be such a place for tuples 
to go at all; that is, whether it should instead generate an error if a tuple 
can't go anywhere (or support auto-creating a new partition, like interval 
partitioning does?).
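
Using the DDL syntax quoted further below, a list-partitioned table with such a 
default/overflow partition might be declared like this (column and partition 
names invented for the example):

CREATE TABLE measurement (logdest text, value int)
PARTITION BY LIST (logdest)
(
  PARTITION m_local VALUES IN ('local', 'archive'),
  PARTITION m_other VALUES ( DEFAULT )
);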

> What are you going to do if the partitioning key has two columns of
> different data types?
> 

Sorry, this totally eluded me. Perhaps the 'values' field needs some more 
thought; it is one of the most crucial elements of the scheme.

I wonder if your suggestion of pg_node_tree plays well here. This could then be 
a list of CONSTs or some such... And I am thinking a multicolumn partition key 
is a concern only for range partitions, no?

I think partkind switches the interpretation of the field as appropriate. Am I 
missing something? By the way, I had mentioned we could have two values fields, 
one each for the range and list partition kinds.

> > * DDL syntax (no multi-column partitioning, sub-partitioning support as 
> > yet):
> >
> > -- create partitioned table and child partitions at once.
> > CREATE TABLE parent (...)
> > PARTITION BY [ RANGE | LIST ] (key_column) [ opclass ]
> > [ (
> >  PARTITION child
> >{
> >VALUES LESS THAN { ... | MAXVALUE } -- for RANGE
> >  | VALUES [ IN ] ( { ... | DEFAULT } ) -- for LIST
> >}
> >[ WITH ( ... ) ] [ TABLESPACE tbs ]
> >  [, ...]
> >   ) ] ;
> 
> How are you going to dump and restore this, bearing in mind that you
> have to preserve a bunch of OIDs across pg_upgrade?  What if somebody
> wants to do pg_dump --table name_of_a_partition?
> 

Assuming everything (including the partitioned relation and the partitions at 
all levels) has got a pg_class entry of its own, would OIDs be a problem? Or, 
if it is possible that they may be, what is the nature of the problem?

If someone pg_dumps an individual partition as a table, we could let it be 
dumped as just a plain table. I am thinking we should be able to do that, or 
perhaps should be doing just that (?)

> I actually think it will be much cleaner to declare the parent first
> and then have separate CREATE TABLE statements that glue the children
> in, like CREATE TABLE child PARTITION OF parent VALUES LESS THAN (1,
> 1).
> 

Oh, do you mean to do away with any syntax for defining partitions in the 
CREATE TABLE of the parent?

Re: [HACKERS] using Core Foundation locale functions

2014-12-02 Thread Noah Misch
On Fri, Nov 28, 2014 at 11:43:28AM -0500, Peter Eisentraut wrote:
> In light of the recent discussions about using ICU on OS X, I looked
> into the Core Foundation locale functions (Core Foundation = traditional
> Mac API in OS X, as opposed to the Unix/POSIX APIs).
> 
> Attached is a proof of concept patch that just about works for the
> sorting aspects.  (The ctype aspects aren't there yet and will crash,
> but they could be done similarly.)  It passes an appropriately adjusted
> collate.linux.utf8 test, meaning that it does produce language-aware
> sort orders that are equivalent to what glibc produces.
> 
> At the moment, this is probably just an experiment that shows where
> refactoring and better abstractions might be suitable if we want to
> support multiple locale libraries.  If we want to pursue ICU, I think
> this could be a useful third option.

Does this make the backend multi-threaded?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using Core Foundation locale functions

2014-12-02 Thread Craig Ringer
On 12/02/2014 12:52 AM, David E. Wheeler wrote:
> Gotta say, I’m thrilled to see movement on this front, and especially pleased 
> to see how consensus seems to be building around an abstracted interface to 
> keep options open. This platform-specific example really highlights the need 
> for it (I had no idea that there was separate and more up-to-date collation 
> support in Core Foundation than in the UNIX layer of OS X).

It'd also potentially let us make use of Windows' native locale APIs,
which AFAIK receive considerably more love on that platform than their
POSIX back-country cousins.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2014-12-02 Thread Kouhei Kaigai
> On Tue, Nov 25, 2014 at 3:44 AM, Kouhei Kaigai  wrote:
> > Today, I had a talk with Hanada-san to clarify which can be a common
> > portion of them and how to implement it. Then, we concluded both of
> > features can be shared most of the infrastructure.
> > Let me put an introduction of join replacement by foreign-/custom-scan
> below.
> >
> > Its overall design intends to inject foreign-/custom-scan node instead
> > of the built-in join logic (based on the estimated cost). From the
> > viewpoint of core backend, it looks like a sub-query scan that
> > contains relations join internally.
> >
> > What we need to do is below:
> >
> > (1) Add a hook add_paths_to_joinrel()
> > It gives extensions (including FDW drivers and custom-scan providers)
> > chance to add alternative paths towards a particular join of
> > relations, using ForeignScanPath or CustomScanPath, if it can run instead
> of the built-in ones.
> >
> > (2) Informs the core backend varno/varattno mapping One thing we need
> > to pay attention is, foreign-/custom-scan node that performs instead
> > of the built-in join node must return mixture of values come from both
> > relations. In case when FDW driver fetch a remote record (also, fetch
> > a record computed by external computing resource), the most reasonable
> > way is to store it on ecxt_scantuple of ExprContext, then kicks
> > projection with varnode that references this slot.
> > It needs an infrastructure that tracks relationship between original
> > varnode and the alternative varno/varattno. We thought, it shall be
> > mapped to INDEX_VAR and a virtual attribute number to reference
> > ecxt_scantuple naturally, and this infrastructure is quite helpful for
> both of ForegnScan/CustomScan.
> > We'd like to add List *fdw_varmap/*custom_varmap variable to both of plan
> nodes.
> > It contains list of the original Var node that shall be mapped on the
> > position according to the list index. (e.g, the first varnode is
> > varno=INDEX_VAR and
> > varattno=1)
> >
> > (3) Reverse mapping on EXPLAIN
> > For EXPLAIN support, above varnode on the pseudo relation scan needed
> > to be solved. All we need to do is initialization of dpns->inner_tlist
> > on
> > set_deparse_planstate() according to the above mapping.
> >
> > (4) case of scanrelid == 0
> > To skip open/close (foreign) tables, we need to have a mark to
> > introduce the backend not to initialize the scan node according to
> > table definition, but according to the pseudo varnodes list.
> > As earlier custom-scan patch doing, scanrelid == 0 is a
> > straightforward mark to show the scan node is not combined with a
> particular real relation.
> > So, it also need to add special case handling around foreign-/custom-scan
> code.
> >
> > We expect above changes are enough small to implement basic join
> > push-down functionality (that does not involves external computing of
> > complicated expression node), but valuable to support in v9.5.
> >
> > Please comment on the proposition above.
> 
> I don't really have any technical comments on this design right at the moment,
> but I think it's an important area where PostgreSQL needs to make some
> progress sooner rather than later, so I hope that we can get something
> committed in time for 9.5.
> 
I tried to implement the interface portion, as attached.
Hanada-san may be working on postgres_fdw based on this interface
definition for the next commitfest.

The overall design of this patch is identical to what I described above.
It intends to allow extensions (an FDW driver or a custom-scan provider) to
replace a join with a foreign/custom scan which internally contains the
result set of a relation join computed externally. It looks like a relation
scan on the pseudo relation.

One thing we need to pay attention to is how setrefs.c fixes up
varno/varattno, unlike for a regular join structure. I found that
IndexOnlyScan already has similar infrastructure that redirects varnode
references to a certain column of ecxt_scantuple in ExprContext, using a
pair of INDEX_VAR and an alternative varattno.

This patch puts a new field, fdw_ps_tlist, in ForeignScan, and
custom_ps_tlist in CustomScan. It is the extension's role to set the
pseudo-scan target list (hence, ps_tlist) of the foreign/custom scan that
replaced a join.
If it is not NIL, set_plan_refs() takes another strategy to fix them up.
It calls fix_upper_expr() to map the varnodes of the expression list onto
INDEX_VAR according to the ps_tlist; the extension is then expected to put
values/isnull pairs in the scan state's ss_ScanTupleSlot according to the
ps_tlist constructed beforehand.

Regarding the primary hook to add an alternative foreign/custom-scan
path instead of the built-in join paths, I added the following hook in
add_paths_to_joinrel().

  /* Hook for plugins to get control in add_paths_to_joinrel() */
  typedef void (*set_join_pathlist_hook_type) (PlannerInfo *root,
   RelOptInfo *joinrel,
  

Re: [HACKERS] using Core Foundation locale functions

2014-12-02 Thread Peter Geoghegan
On Tue, Dec 2, 2014 at 10:07 PM, Craig Ringer  wrote:
> On 12/02/2014 12:52 AM, David E. Wheeler wrote:
>> Gotta say, I’m thrilled to see movement on this front, and especially 
>> pleased to see how consensus seems to be building around an abstracted 
>> interface to keep options open. This platform-specific example really 
>> highlights the need for it (I had no idea that there was separate and more 
>> up-to-date collation support in Core Foundation than in the UNIX layer of OS 
>> X).
>
> It'd also potentially let us make use of Windows' native locale APIs,
> which AFAIK receive considerably more love on that platform than their
> POSIX back-country cousins.

Not to mention the fact that a MultiByteToWideChar() call could be
saved, and sortsupport for text would just work on Windows.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-12-02 Thread Kouhei Kaigai
> On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
> > Likely for most aggregates, like count, sum, max, min, bit_and and
> > bit_or the merge function would be the same as the transition
> > function, as the state type is just the same as the input type. It
> > would only be aggregates like avg(), stddev*(), bool_and() and
> > bool_or() that would need a new merge function made... These would be
> > no more complex than the transition functions... Which are just a few
> lines of code anyway.
> >
> > We'd simply just not run parallel query if any aggregates used in the
> > query didn't have a merge function.
> >
> > When I mentioned this, I didn't mean to appear to be placing a road
> > block.I was just bringing to the table the information that COUNT(*) +
> > COUNT(*) works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n)
> does not.
> 
> Sorry, late reply, but, FYI, I don't think our percentile functions can
> be parallelized in the same way:
> 
>   test=> \daS *percent*
>                                              List of aggregate functions
>     Schema   |      Name       |  Result data type  |             Argument data types              |             Description
>  ------------+-----------------+--------------------+----------------------------------------------+--------------------------------------
>   pg_catalog | percent_rank    | double precision   | VARIADIC "any" ORDER BY VARIADIC "any"       | fractional rank of hypothetical row
>   pg_catalog | percentile_cont | double precision   | double precision ORDER BY double precision   | continuous distribution percentile
>   pg_catalog | percentile_cont | double precision[] | double precision[] ORDER BY double precision | multiple continuous percentiles
>   pg_catalog | percentile_cont | interval           | double precision ORDER BY interval           | continuous distribution percentile
>   pg_catalog | percentile_cont | interval[]         | double precision[] ORDER BY interval         | multiple continuous percentiles
>   pg_catalog | percentile_disc | anyelement         | double precision ORDER BY anyelement         | discrete percentile
>   pg_catalog | percentile_disc | anyarray           | double precision[] ORDER BY anyelement       | multiple discrete percentiles
> 
Yep, these seem to be the kind of aggregate functions that cannot obviously
be split across multiple partitions.
Even so, I think it is valuable if we can push down just the subset of
aggregate functions that the core planner knows well.
For example, we know that count(*) = sum(nrows), and we also know that avg(X)
can be rewritten as an enhanced avg() that takes both nrows and the partial
sum of X. We can utilize this knowledge to break down aggregate functions.
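
As a concrete (and purely illustrative) SQL sketch of that rewrite, with
made-up table names: each partition reports a partial count and sum, and the
combining step recovers the global count(*) and avg(x) from those partials:

  -- per-partition (worker) query, run once per partition:
  --   SELECT count(*) AS nrows, sum(x) AS sum_x FROM t_part_n;

  -- combining step over the collected partial results:
  SELECT sum(nrows)                       AS total_count,  -- count(*) = sum(nrows)
         sum(sum_x)::numeric / sum(nrows) AS global_avg    -- avg(x) from (nrows, sum) pairs
  FROM   partial_results;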

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] printing table in asciidoc with psql

2014-12-02 Thread Michael Paquier
On Mon, Nov 17, 2014 at 7:48 AM, Pavel Stehule  wrote:
> 2014-11-07 22:37 GMT+01:00 Alvaro Herrera :
>>
>>
>> I did \o /tmp/tst, then
>> \dS
>> create table "eh | oh" ();
>> \dS
>>
>> and then filtered the output file to HTML.  The CREATE TABLE tag ended
>> up on the same line as the title of the following table.  I think
>> there's a newline missing somewhere.
>>
>>
>> The good news is that the | in the table name was processed correctly;
>> but I noticed that the table title is not using the escaped print, so I
>> would imagine that if I put a | in the title, things would go wrong.
>> (I don't know how to put titles on tables other than editing the
>> hardcoded titles for \ commands in psql).
>>
>> Another thing is that spaces around the | seem gratuitous and produce
>> bad results.  I tried "select * from pg_class" which results in a very
>> wide table, and then the HTML output contains some cells with the values
>> in the second line; this makes all rows taller than they must be,
>> because some cells use the first line and other cells in the same row
>> use the second line for the text.  I hand-edited the asciidoc and
>> removed the spaces around | which makes the result nicer.  (Maybe
>> removing the trailing space is enough.)
>
>
> I see the trailing spaces, but I don't see the described effect. Please, can
> you send a more specific test case?

This formatting problem is trivial to reproduce:
=# create table "foo" ();

CREATE TABLE
Time: 9.826 ms
=# \d

.List of relations
[options="header",cols="

Also, something a bit surprising is that this format always produces an extra
newline for each command, for example in the case of DDL:
=# create table foo ();

CREATE TABLE
I think that this extra newline should be removed as well, no?

This patch has been marked as "Waiting on Author" for a couple of
weeks, and the problems mentioned before have not been completely
addressed, hence marking this patch as returned with feedback. It
would be nice to see progress for the next CF.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] printing table in asciidoc with psql

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 3:52 PM, Michael Paquier
 wrote:
> This patch has been marked as "Waiting on Author" for a couple of
> weeks, and the problems mentioned before have not been completely
> addressed, hence marking this patch as returned with feedback. It
> would be nice to see progress for the next CF.
Btw, here is a resource showing table formatting in asciidoc:
http://asciidoc.org/newtables.html
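For anyone following along, the intended shape of the output is an AsciiDoc
table along these lines (a hand-written illustration of the format, not actual
output from the patch):

  .List of relations
  [options="header",cols="<,<,<,<"]
  |====
  | Schema | Name | Type  | Owner
  | public | foo  | table | postgres
  |====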
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait free LW_SHARED acquisition - v0.2

2014-12-02 Thread Michael Paquier
On Tue, Nov 18, 2014 at 12:33 AM, Robert Haas  wrote:
> On Mon, Nov 17, 2014 at 10:31 AM, Andres Freund  
> wrote:
>> On 2014-11-17 10:21:04 -0500, Robert Haas wrote:
>>> Andres, where are we with this patch?
>>>
>>> 1. You're going to commit it, but haven't gotten around to it yet.
>>>
>>> 2. You're going to modify it some more and repost, but haven't gotten
>>> around to it yet.
>>>
>>> 3. You're willing to see it modified if somebody else does the work,
>>> but are out of time to spend on it yourself.
>>>
>>> 4. Something else?
>>
>> I'm working on it. Amit had found a hang on PPC that I couldn't
>> reproduce on x86. Since then I've reproduced it and I think yesterday I
>> found the problem. Unfortunately it always took a couple hours to
>> trigger...
>>
>> I've also made some, in my opinion, cleanups to the patch since
>> then. Those have the nice side effect of making the size of struct
>> LWLock smaller, but that wasn't actually the intended effect.
>>
>> I'll repost once I've verified the problem is fixed and I've updated all
>> commentary.
>>
>> The current problem is that I seem to have found a problem that's also
>> reproducible with master :(. After a couple of hours a
>> pgbench -h /tmp -p 5440 scale3000 -M prepared -P 5 -c 180 -j 60 -T 2 -S
>> against a
>> -c max_connections=200 -c shared_buffers=4GB
>> cluster seems to hang on PPC. With all the backends waiting in buffer
>> mapping locks. I'm now making sure it's really master and not my patch
>> causing the problem - it's just not trivial with 180 processes involved.
>
> Ah, OK.  Thanks for the update.
Ping? This patch has been in a stale state for a couple of weeks and is still
marked as waiting on author for this CF.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using custom scan nodes to prototype parallel sequential scan

2014-12-02 Thread Michael Paquier
On Wed, Dec 3, 2014 at 3:23 PM, Kouhei Kaigai  wrote:
>> On Fri, Nov 14, 2014 at 02:51:32PM +1300, David Rowley wrote:
>> > Likely for most aggregates, like count, sum, max, min, bit_and and
>> > bit_or the merge function would be the same as the transition
>> > function, as the state type is just the same as the input type. It
>> > would only be aggregates like avg(), stddev*(), bool_and() and
>> > bool_or() that would need a new merge function made... These would be
>> > no more complex than the transition functions... Which are just a few
>> lines of code anyway.
>> >
>> > We'd simply just not run parallel query if any aggregates used in the
>> > query didn't have a merge function.
>> >
>> > When I mentioned this, I didn't mean to appear to be placing a road
>> > block. I was just bringing to the table the information that COUNT(*) +
>> > COUNT(*) works ok for merging COUNT(*)'s "sub totals", but AVG(n) + AVG(n)
>> does not.
>>
>> Sorry, late reply, but, FYI, I don't think our percentile functions can
>> be parallelized in the same way:
>>
>>   test=> \daS *percent*
>>                                              List of aggregate functions
>>     Schema   |      Name       |  Result data type  |             Argument data types              |             Description
>>  ------------+-----------------+--------------------+----------------------------------------------+--------------------------------------
>>   pg_catalog | percent_rank    | double precision   | VARIADIC "any" ORDER BY VARIADIC "any"       | fractional rank of hypothetical row
>>   pg_catalog | percentile_cont | double precision   | double precision ORDER BY double precision   | continuous distribution percentile
>>   pg_catalog | percentile_cont | double precision[] | double precision[] ORDER BY double precision | multiple continuous percentiles
>>   pg_catalog | percentile_cont | interval           | double precision ORDER BY interval           | continuous distribution percentile
>>   pg_catalog | percentile_cont | interval[]         | double precision[] ORDER BY interval         | multiple continuous percentiles
>>   pg_catalog | percentile_disc | anyelement         | double precision ORDER BY anyelement         | discrete percentile
>>   pg_catalog | percentile_disc | anyarray           | double precision[] ORDER BY anyelement       | multiple discrete percentiles
>>
> Yep, these seem to be the kind of aggregate functions that cannot obviously
> be split across multiple partitions.
> Even so, I think it is valuable if we can push down just the subset of
> aggregate functions that the core planner knows well.
> For example, we know that count(*) = sum(nrows), and we also know that avg(X)
> can be rewritten as an enhanced avg() that takes both nrows and the partial
> sum of X. We can utilize this knowledge to break down aggregate functions.
Postgres-XC (Postgres-XL) has implemented such parallel aggregate
logic some time ago using a set of sub functions and a finalization
function to do the work.
My 2c.
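
To illustrate the kind of decomposition meant here (a generic sketch only, not
Postgres-XC's actual catalog entries or syntax), avg() can be split into a
per-node transition state, a cross-node merge, and a finalization:

  -- shared state carried between nodes: (row count, running total)
  CREATE TYPE avg_state AS (nrows bigint, total numeric);

  -- per-node transition: fold one input value into the local state
  CREATE FUNCTION avg_trans(avg_state, numeric) RETURNS avg_state
      AS 'SELECT ROW(($1).nrows + 1, ($1).total + $2)::avg_state'
      LANGUAGE sql;

  -- cross-node merge: combine two partial states
  CREATE FUNCTION avg_merge(avg_state, avg_state) RETURNS avg_state
      AS 'SELECT ROW(($1).nrows + ($2).nrows, ($1).total + ($2).total)::avg_state'
      LANGUAGE sql;

  -- finalization: derive the actual average from the merged state
  CREATE FUNCTION avg_final(avg_state) RETURNS numeric
      AS 'SELECT CASE WHEN ($1).nrows = 0 THEN NULL ELSE ($1).total / ($1).nrows END'
      LANGUAGE sql;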
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers