Re: [HACKERS] Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-18 Thread Tom Lane
Craig Ringer ring...@ringerc.id.au writes:
 On 07/18/2012 08:31 AM, Tom Lane wrote:
 Not sure if we need a whole farm, but certainly having at least one
 machine testing this sort of stuff on a regular basis would make me feel
 a lot better.

 OK. That's something I can actually be useful for.

 My current qemu/kvm test harness control code is in Python since that's 
 what all the other tooling for the project I was using it for is in. Is 
 it likely to be useful for me to adapt that code for use for a Pg 
 crash-test harness, or will you need a particular tool/language to be 
 used? If so, which/what? I'll do pretty much anything except Perl. I'll 
 have a result for you more quickly working in Python, though I'm happy 
 enough to write it in C (or Java, but I'm guessing that won't get any 
 enthusiasm around here).

If we were talking about code that was going to end up in the PG
distribution, I'd kind of want it to be in C or Perl, just to keep down
the number of languages we're depending on.  However, it's not obvious
that a tool like this would ever go into our distribution.  I'd suggest
working with what you're comfortable with, and we can worry about
translation when and if there's a reason to.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-18 Thread Daniel Farina
On Tue, Jul 17, 2012 at 9:16 PM, Bruce Momjian br...@momjian.us wrote:
 WAL is not guaranteed to be the same between PG major versions, so doing
 anything with WAL is pretty much a no-go.

I understand that the WAL format changes, sometimes dramatically,
between versions. What I'm suggesting is that the first WAL record
emitted by the binary upgrade process could be entitled "WAL-stream
upgrade to 9.4", which would fail to be understood by old versions, or
possibly be understood to mean "stop replay: you won't even understand
what's about to be said."

At that point, start up new version in the same cluster and have it
continue replay from that position on forward, which should all be in
the new format that it can understand.  It need not understand the old
format in that case, but the tricky part is this single record that
tells the replayer of the old version to stop while a replayer of the
new version somehow will know it is the right place to start.

One mechanism could be a WAL file segment boundary: the standby could
be told to exit when it finishes recovery of the segment
0001123455CD, and to start the new version beginning
recovery at 0001123455CE (one higher), which would be
the first WAL emitted by pg_upgrade. In principle the same is possible
using the fine-grained record position, such as X/NN, but that may be
more complex for not much gain.
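
To make the segment-boundary idea concrete, here is a minimal sketch
using the standard 9.x admin functions (spellings as in 9.1/9.2),
showing a WAL position and the segment file that contains it:

    -- Current WAL insert position and the segment file holding it;
    -- the standby would stop after one such segment and the new
    -- binaries would resume at the next.
    SELECT pg_current_xlog_location()                   AS xlog_position,
           pg_xlogfile_name(pg_current_xlog_location()) AS containing_segment;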

This also means the database would be stuck in an inconsistent state
when it starts, not unlike when recovering from an on-line base backup.
And that's totally reasonable: the new version has to start up
presuming that the database cluster doesn't yet make enough sense to
enter hot standby.

Yet another mechanism is to not have the Postgres recovery-process
apply the WAL, but rather some special purpose program that knows how
to count through and apply specially-formatted WAL segments, and then
set the resultant cluster to start recovering from the WAL past this
span of specially-formatted WAL.  The crux is to get some continuity
in this stream, and there are many ways to slice it. Otherwise, the
continuous archives will have a gap while a new base backup is taken
of data that mostly rests unchanged.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-18 Thread Heikki Linnakangas

On 18.07.2012 02:48, Peter Geoghegan wrote:

On 17 July 2012 23:56, Tom Lane t...@sss.pgh.pa.us wrote:

This implies that nobody has done pull-the-plug testing on either HEAD
or 9.2 since the checkpointer split went in (2011-11-01), because even
a modicum of such testing would surely have shown that we're failing to
fsync a significant fraction of our write traffic.

Furthermore, I would say that any performance testing done since then,
if it wasn't looking at purely read-only scenarios, isn't worth the
electrons it's written on.  In particular, any performance gain that
anybody might have attributed to the checkpointer splitup is very
probably hogwash.

This is not giving me a warm feeling about our testing practices.


The checkpointer split-up was not justified as a performance
optimisation so much as a re-factoring effort that might have some
concomitant performance benefits.


Agreed, but it means that we need to re-run the tests that were done to 
make sure the extra fsync-request traffic is not causing a performance 
regression, 
http://archives.postgresql.org/pgsql-hackers/2011-10/msg01321.php.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [9.1] 2 bugs with extensions

2012-07-18 Thread Marko Kreen
I converted Skytools modules to extensions and found 2 problems:

1) Dumpable sequences are not supported - if a sequence is tagged
   with pg_catalog.pg_extension_config_dump(), pg_dump tries
   to run COPY on it.

2) If there is a schema containing functions but nothing else,
   and it is later converted to an extension by adding the
   functions and the schema to the extension, a later drop
   of that extension fails to remove the schema.
   A proper CREATE EXT + DROP EXT works fine.

To reproduce, checkout 'master' branch and go directly
to module directory:

$ git clone --recursive git://github.com/markokr/skytools.git
$ cd skytools

1) $ cd sql/pgq && make install installcheck
..
pg_dump regression > test.dump
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  cannot copy from sequence
"batch_id_seq"
pg_dump: The command was: COPY pgq.batch_id_seq TO stdout;


2) $ cd sql/pgq_coop && make install installcheck
..
 create extension pgq_coop from 'unpackaged';
 drop extension pgq_coop;
 create extension pgq_coop;
 ERROR:  schema "pgq_coop" already exists
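
For reference, a minimal, hypothetical sketch of the pattern behind
bug 1, independent of Skytools (the sequence name is made up):

    -- In an extension script, mark a sequence as configuration state
    -- so that pg_dump includes it:
    CREATE SEQUENCE batch_id_seq;
    SELECT pg_catalog.pg_extension_config_dump('batch_id_seq', '');

    -- pg_dump then emits "COPY batch_id_seq TO stdout;" for the data,
    -- which the server rejects: COPY cannot read from a sequence.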

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-18 Thread Robert Haas
On Tue, Jul 17, 2012 at 6:56 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 So I went to fix this in the obvious way (attached), but while testing
 it I found that the number of buffers_backend events reported during
 a regression test run barely changed; which surprised the heck out of
 me, so I dug deeper.  The cause turns out to be extremely scary:
 ForwardFsyncRequest isn't getting called at all in the bgwriter process,
 because the bgwriter process has a pendingOpsTable.  So it just queues
 its fsync requests locally, and then never acts on them, since it never
 runs any checkpoints anymore.

:-(

 This implies that nobody has done pull-the-plug testing on either HEAD
 or 9.2 since the checkpointer split went in (2011-11-01), because even
 a modicum of such testing would surely have shown that we're failing to
 fsync a significant fraction of our write traffic.

 Furthermore, I would say that any performance testing done since then,
 if it wasn't looking at purely read-only scenarios, isn't worth the
 electrons it's written on.  In particular, any performance gain that
 anybody might have attributed to the checkpointer splitup is very
 probably hogwash.

I don't think anybody thought that was going to result in a direct
performance gain, but I agree the performance testing needs to be
redone.  I suspect that the impact on my testing is limited, because I
do mostly pgbench testing, and the lost fsync requests were probably
duplicated by non-lost fsync requests from backend writes.  But I
agree that it needs to be redone once this is fixed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CompactCheckpointerRequestQueue versus pad bytes

2012-07-18 Thread Robert Haas
On Tue, Jul 17, 2012 at 1:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Jul 16, 2012 at 9:58 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, I wonder whether the code that checks for relfilenode conflict
 when selecting a pg_class or relfilenode OID tries both file naming
 conventions?  If not, should we make it do so?

 I don't believe it does, nor do I see what we would gain by making it
 do so.

 What we would gain is ensuring that we aren't using the same relfilenode
 for both a regular table and a temp table.  Do you really want to assume
 that such a conflict is 100% safe?  It sounds pretty scary to me, and
 even if we were sure the backend would never get confused, what about
 client-side code that's looking at relfilenode?

Well, when I was working on that patch, I spent a lot of time trying
to ensure that it was in fact safe.  So I hope it is.  Also, note that
that could perfectly well happen anyway in any prior release if the
relations happened to live in different tablespaces.  Anyone assuming
that <dboid,relfilenode> is unique is kidding themselves, because it
is not guaranteed to be and has never been guaranteed to be.  Yes,
there could be client code out there that assumes
<dboid,tsoid,relfilenode> is unique and such code will need adjustment
for 9.1+.  But I bet there isn't very much.  The thing that broke a
lot of client code in that commit was the replacement of relistemp
with relpersistence; we already decided we didn't care about that (and
it's too late to change it now anyway) so I can't really get excited
about this.  I think that code assuming that anything other than a
RelFileNodeBackend is sufficient to uniquely identify a relation is
just buggy, and if there is any, we should fix it.  All remaining uses
of RelFileNode rather than RelFileNodeBackend should be cases where we
know that the backend ID has got to be InvalidBackendId.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using pg_upgrade on log-shipping standby servers

2012-07-18 Thread Robert Haas
On Tue, Jul 17, 2012 at 6:02 PM, Bruce Momjian br...@momjian.us wrote:
 However, I have two ideas.  First, I don't know _why_ the
 primary/standby would be any different after pg_upgrade, so I added the
 documentation mention because I couldn't _guarantee_ they were the same.
 Actually, if people can test this, we might be able to say this is safe.

 Second, the user files (large) are certainly identical, it is only the
 system tables (small) that _might_ be different, so rsync'ing just those
 would add the guarantee, but I know of no easy way to rsync just the
 system tables.

I'm scratching my head in confusion here.  After pg_upgrade, the
master is a completely new cluster.  The system catalog contents are
completely different, and so are things like the database system
identifier and the WAL position - yeah, the latter is approximately
the same, but almost doesn't count except in horseshoes.  Obviously
any attempt to replay WAL from the new cluster on the old cluster is
doomed to failure, at least unless we do a bunch more engineering here
that hasn't really been thought about yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Event Triggers reduced, v1

2012-07-18 Thread Robert Haas
On Thu, Jul 12, 2012 at 8:50 AM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 [ new patch ]

Well, I think it's about time to start getting some of this code into
our tree.  However, I'm still not confident that the code that
actually runs the triggers is entirely solid, so I decided to rip that
stuff out and commit only the syntax support and documentation
portions of this patch.  We can add the actual trigger firing stuff
and PL language support in a subsequent commit or commits.  I made a
significant number of modifications.

First, I changed the representation of CreateEventTriggerStmt
considerably, so that we can more easily accommodate multiple filter
variables in future patches without having to whack the code around
too much.  I also disentangled the CreateEventTriggerStmt processing
from the event-cache machinery, because it doesn't seem like a good
idea  to have evtcache.c and event_trigger.c be deeply intertwined,
from a modularity point of view.  I think we might still need to make
a bit more adjustment to the organization of this code - perhaps the
code that is needed by both commands/event_triggers.c and
utils/cache/evtcache.c ought to be moved to something like
catalog/pg_event_trigger.c.  However, I haven't done that in this
commit.

Second, I made this work with COMMENT, SECURITY LABEL, and with the
extension mechanism, including updating the documentation.

Third, I also made some other documentation updates to match recent
changes and added the missing documentation for \dy.

Fourth, I rewrote the regression tests fairly extensively to make sure
we're adequately testing all of the syntax that this commit adds: not
only that it works when it's supposed to work, but also that it fails
when it's supposed to fail and emits a hopefully-good error message in
such cases.

And finally there were a bunch of miscellaneous code cleanups, some of
them fixing things that I muffed in earlier incremental patches that I
sent you to merge, and others touching code you wrote.

I suspect there are probably still a few oversights here, but
hopefully not too many.

The next step here is obviously to complete the work necessary to
actually be able to fire these triggers, as opposed to just saying
that we fire them.  I'll write up my thoughts on that topic in a
separate email.  I don't think there's a ton of work left to be done
there, but more than zero.
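
For readers following along, a hedged sketch of the syntax this commit
adds (identifiers are made up; note that as of this commit the trigger
is only recorded in the catalog, and the firing machinery and PL
support for event-trigger functions land in later commits):

    CREATE FUNCTION on_ddl() RETURNS event_trigger
        LANGUAGE plpgsql AS $$ BEGIN RAISE NOTICE 'ddl'; END $$;

    CREATE EVENT TRIGGER log_ddl ON ddl_command_start
        WHEN TAG IN ('CREATE TABLE', 'DROP TABLE')
        EXECUTE PROCEDURE on_ddl();

    COMMENT ON EVENT TRIGGER log_ddl IS 'logs selected DDL';
    DROP EVENT TRIGGER log_ddl;    -- triggers are listable with psql's \dy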

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Fix mapping of PostgreSQL encodings to Python encodings.

2012-07-18 Thread Heikki Linnakangas

On 14.07.2012 17:50, Jan Urbański wrote:

On 13/07/12 13:38, Jan Urbański wrote:

On 12/07/12 11:08, Heikki Linnakangas wrote:

On 07.07.2012 00:12, Jan Urbański wrote:

So you're in favour of doing unicode -> bytes by encoding with UTF-8
and then using the server's encoding functions?


Sounds reasonable to me. The extra conversion between UTF-8 and UCS-2
should be quite fast, and it would be good to be consistent in the way
we do conversions in both directions.



I'll implement that then (sorry for not following up on that earlier).


Here's a patch that always encodes Python unicode objects using UTF-8
and then uses Postgres's internal functions to produce bytes in the
server encoding.


Thanks.

If pg_do_encoding_conversion() throws an error, you don't get a chance 
to call Py_DECREF() to release the string. Is that a problem?


If an error occurs in PLy_traceback(), after incrementing 
recursion_depth, you don't get a chance to decrement it again. I'm not 
sure if the Py* function calls can fail, but at least seemingly trivial 
things like initStringInfo() can throw an out-of-memory error.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] enhanced error fields

2012-07-18 Thread Pavel Stehule
Hello

* renamed erritem to err_generic_string
* fixed CSVlog generation
* new file /utils/error/relerror.c with auxiliary functions -
declarations are in utils/rel.h

Regards

Pavel

2012/7/11 Tom Lane t...@sss.pgh.pa.us:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 FWIW about the new include:  I feel a strong dislike about the forward
 declaration you suggest.  Defining Relation in elog.h seems completely
 out of place.

 Agreed.  Maybe a reasonable solution is to allow some ereport helper
 functions (or, really, wrappers for the helper functions) to be declared
 someplace else than elog.h.  They'd likely need to be implemented
 someplace else than elog.c, too, so this doesn't seem unreasonable.

 The generic helper function approach doesn't seem too unreasonable for
 this: elog.h/.c would provide something like

 err_generic_string(int fieldid, const char *str)

 and then someplace else could provide functions built on this that
 insert table/schema/column/constraint/etc names into suitable fields.

 regards, tom lane


eelog-2012-07-18.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] row literal problem

2012-07-18 Thread Andrew Dunstan

I'm chasing up an issue from a client who has this problem (in 9.1):

with q as
(
some query here
)
select q.* from q

yields:

   job_scope                                   | checked_col
-----------------------------------------------+------------------------------------------
 Co Revenues: Co Revenues $100 to $999 Million | <input panel=data type=checkbox checked>
 Metropolitan Area: Austin-Round Rock          | <input panel=data type=checkbox checked>


which is as expected.

However,

with q as
(
same query here
)
select q from q

yields:

                                             q
------------------------------------------------------------------------------------------
 (Co Revenues: Co Revenues $100 to $999 Million,<input panel=data type=checkbox checked>,)
 (Metropolitan Area: Austin-Round Rock,<input panel=data type=checkbox checked>,)



Note the trailing comma inside the (), which certainly looks bogus to 
me. If I replace select q with select row(q.*) it disappears.


It doesn't happen in all cases, and I'm trying to work out a minimal 
reproducible example. But it sure is puzzling.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row literal problem

2012-07-18 Thread Merlin Moncure
On Wed, Jul 18, 2012 at 1:58 PM, Andrew Dunstan and...@dunslane.net wrote:
 I'm chasing up an issue from a client who has this problem (in 9.1):

 with q as
 (
 some query here
 )
 select q.* from q

 yields:

    job_scope                                   | checked_col
 -----------------------------------------------+------------------------------------------
  Co Revenues: Co Revenues $100 to $999 Million | <input panel=data type=checkbox checked>
  Metropolitan Area: Austin-Round Rock          | <input panel=data type=checkbox checked>

 which is as expected.

 However,

 with q as
 (
 same query here
 )
 select q from q

 yields:

                                              q
 ------------------------------------------------------------------------------------------
  (Co Revenues: Co Revenues $100 to $999 Million,<input panel=data type=checkbox checked>,)
  (Metropolitan Area: Austin-Round Rock,<input panel=data type=checkbox checked>,)


 Note the trailing comma inside the (), which certainly looks bogus to me. If
 I replace select q with select row(q.*) it disappears.

 It doesn't happen in all cases, and I'm trying to work out a minimal
 reproducible example. But it sure is puzzling.

there are no null fields, right? if the last field is sometimes null
you'd see that (you probably ruled that out though).  when you say
'sometimes', do you mean for some rows and not others? or for some
queries?

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row literal problem

2012-07-18 Thread Andrew Dunstan


On 07/18/2012 03:18 PM, Merlin Moncure wrote:

On Wed, Jul 18, 2012 at 1:58 PM, Andrew Dunstan and...@dunslane.net wrote:

I'm chasing up an issue from a client who has this problem (in 9.1):

with q as
(
 some query here
)
select q.* from q

yields:

   job_scope                                   | checked_col
-----------------------------------------------+------------------------------------------
 Co Revenues: Co Revenues $100 to $999 Million | <input panel=data type=checkbox checked>
 Metropolitan Area: Austin-Round Rock          | <input panel=data type=checkbox checked>

which is as expected.

However,

with q as
(
 same query here
)
select q from q

yields:

                                             q
------------------------------------------------------------------------------------------
 (Co Revenues: Co Revenues $100 to $999 Million,<input panel=data type=checkbox checked>,)
 (Metropolitan Area: Austin-Round Rock,<input panel=data type=checkbox checked>,)


Note the trailing comma inside the (), which certainly looks bogus to me. If
I replace select q with select row(q.*) it disappears.

It doesn't happen in all cases, and I'm trying to work out a minimal
reproducible example. But it sure is puzzling.

there are no null fields, right? if the last field is sometimes null
you'd see that (you probably ruled that out though).  when you say
'sometimes', do you mean for some rows and not others? or for some
queries?





No, the inner query has two fields.

It happens for all rows, but not for all two-field-resulting queries as 
q. I'm trying to find a simple case rather than the rather complex query 
my customer is using.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row literal problem

2012-07-18 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 07/18/2012 03:18 PM, Merlin Moncure wrote:
 there are no null fields, right? if the last field is sometimes null
 you'd see that (you probably ruled that out though).  when you say
 'sometimes', do you mean for some rows and not others? or for some
 queries?

 No, the inner query has two fields.

 It happens for all rows, but not for all two-field-resulting queries as 
 q. I'm trying to find a simple case rather than the rather complex query 
 my customer is using.

I'm wondering about a rowtype with a third, dropped column.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] bgwriter, regression tests, and default shared_buffers settings

2012-07-18 Thread Tom Lane
After fixing the assorted breakage discussed yesterday, I still wasn't
seeing any ForwardFsyncRequest requests coming from the bgwriter during
a regression test run, which made me wonder if there was yet another
bug.  What I find is that because of the recent increase in the
out-of-the-box shared_buffers setting to 128MB, the regression database
fits comfortably in shared buffers (its total footprint appears to be
about 41MB at the moment).  This means that the shared-buffer freelist
never becomes empty, so StrategyGetBuffer never advances the clock sweep
pointer, so after one pass over the buffer pool BgBufferSync decides
that it's lapped the clock sweep and never does anything more.
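
As a quick way to check that footprint claim on one's own build, an
illustrative query using the standard admin functions:

    -- Compare the regression database's on-disk size against
    -- shared_buffers; when the former is smaller, the freelist never
    -- empties and BgBufferSync finds nothing to do.
    SELECT pg_size_pretty(pg_database_size('regression')) AS db_size,
           current_setting('shared_buffers')              AS shared_buffers;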

In short, then, the background writer process is entirely useless for
any database that fits completely into shared buffers.  The only
background writes that get generated in such a case are from the
checkpoint sweep, and AFAICT the backend buffer writes that get
counted by ForwardFsyncRequest are not true buffer writes but mdextend
calls.  (Which likely explains why their number is so consistent over
repeated regression test runs --- the variance is well under 1% for me.)

So that raises two independent sets of questions:

1. Do we like the fact that the bgwriter isn't doing anything in this
situation?  It seems arguably OK for writes to happen only for
checkpointing purposes if there is no memory pressure.  But having the
bgwriter wake up 5 times a second to decide it has nothing to do seems
a bit wasteful.  I'm inclined to think maybe it should go into the
recently added hibernation mode anytime the buffer freelist isn't
empty.  Or maybe you could argue that this scenario isn't worth any
optimization effort, but with many-gig RAM becoming more and more
common, I don't think I agree.

2. It's rather disturbing that a fairly large swath of functionality
just stopped getting tested at all by the buildfarm.  Do we want to
rethink the shared_buffers increase?  Or artificially bloat the
regression database to make it larger than 128MB?  Or do something else
to ensure we still exercise the DB-bigger-than-buffers case?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree

2012-07-18 Thread Heikki Linnakangas

On 13.07.2012 02:00, Alexander Korotkov wrote:

On Thu, Jul 12, 2012 at 10:29 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


Thanks. Can you do something about TrickFunctionCall2, please? (
http://archives.postgresql.org/message-id/4fe2c968.2010...@enterprisedb.com)
A separate patch to refactor that in the existing gist opclass would
probably be best.


Done. There is a separate patch to get rid of TrickFunctionCall2, and a
version of SP-GiST for ranges based on that patch.


Committed the get-rid-of-TrickFunctionCall2 patch. I also changed one 
call in range_gist_same(), which was not using TrickFunctionCall2 but 
was nevertheless doing the same thing in spirit.


I'll try to take a look at the other patch in the next few days.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row literal problem

2012-07-18 Thread Andrew Dunstan


On 07/18/2012 03:30 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

On 07/18/2012 03:18 PM, Merlin Moncure wrote:

there are no null fields, right? if the last field is sometimes null
you'd see that (you probably ruled that out though).  when you say
'sometimes', do you mean for some rows and not others? or for some
queries?

No, the inner query has two fields.
It happens for all rows, but not for all two-field-resulting queries as
q. I'm trying to find a simple case rather than the rather complex query
my customer is using.

I'm wondering about a rowtype with a third, dropped column.



As usual Tom has hit the nail on the head. Here's a simple test case 
that demonstrates the problem. I could probably have cut it down more 
but I was following the structure of the original somewhat:


   # with q as
   (
   select max(nspname) as nspname, sum(allind.count) as indices
   from (select indrelid, count(*)
 from pg_index
 group by indrelid) allind
 left outer join pg_class on pg_class.oid = allind.indrelid
 left outer join pg_namespace on pg_class.relnamespace =
   pg_namespace.oid
   group by pg_namespace.oid
   )
   select q from q;

             q
    --------------------
     (pg_catalog,91,11)
     (pg_toast,18,99)
    (2 rows)


cheers

andrew










--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SP-GiST for ranges based on 2d-mapping and quad-tree

2012-07-18 Thread Alexander Korotkov
On Thu, Jul 19, 2012 at 12:22 AM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 On 13.07.2012 02:00, Alexander Korotkov wrote:

 On Thu, Jul 12, 2012 at 10:29 AM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com
  wrote:

  Thanks. Can you do something about TrickFunctionCall2, please? (
 http://archives.postgresql.org/message-id/4fe2c968.2010...@enterprisedb.com
 )

 A separate patch to refactor that in the existing gist opclass would
 probably be best.


 Done. There is a separate patch to get rid of TrickFunctionCall2, and a
 version of SP-GiST for ranges based on that patch.


 Committed the get-rid-of-TrickFunctionCall2 patch. I also changed one call
 in range_gist_same(), which was not using TrickFunctionCall2 but was
 nevertheless doing the same thing in spirit.


Good, thanks!

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] row literal problem

2012-07-18 Thread Merlin Moncure
On Wed, Jul 18, 2012 at 3:42 PM, Andrew Dunstan and...@dunslane.net wrote:

 On 07/18/2012 03:30 PM, Tom Lane wrote:

 Andrew Dunstan and...@dunslane.net writes:

 On 07/18/2012 03:18 PM, Merlin Moncure wrote:

 there are no null fields, right? if the last field is sometimes null
 you'd see that (you probably ruled that out though).  when you say
 'sometimes', do you mean for some rows and not others? or for some
 queries?

 No, the inner query has two fields.
 It happens for all rows, but not for all two-field-resulting queries as
 q. I'm trying to find a simple case rather than the rather complex query
 my customer is using.

 I'm wondering about a rowtype with a third, dropped column.



 As usual Tom has hit the nail on the head. Here's a simple test case that
 demonstrates the problem. I could probably have cut it down more but I was
 following the structure of the original somewhat:

# with q as
(
select max(nspname) as nspname, sum(allind.count) as indices
from (select indrelid, count(*)
  from pg_index
  group by indrelid) allind
  left outer join pg_class on pg_class.oid = allind.indrelid
  left outer join pg_namespace on pg_class.relnamespace =
pg_namespace.oid
group by pg_namespace.oid
)
select q from q;

          q
 --------------------
  (pg_catalog,91,11)
  (pg_toast,18,99)
 (2 rows)

hm, it's the 'group by' -- for example with group by pg_namespace.oid,
or group by pg_namespace.oid || 'abc', you can invent columns that come
back into the rowtype.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row literal problem

2012-07-18 Thread Merlin Moncure
On Wed, Jul 18, 2012 at 3:56 PM, Merlin Moncure mmonc...@gmail.com wrote:
 hm, it's the 'group by' -- for example with group by pg_namespace.oid,
 or group by pg_namespace.oid || 'abc', you can invent columns that come
 back into the rowtype.

here's a cut down example:
with q as (select max(v) from (select 1 as v) q group by v) select q from q;

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] DELETE vs TRUNCATE explanation

2012-07-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jul 16, 2012 at 12:36 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Well, that argument is exactly why the code is designed the way it is...
 but we are now finding out that sending useless fsync requests isn't as
 cheap as all that.

 I agree, but I think the problem can be solved for a pretty modest
 amount of effort without needing to make fsync PGC_POSTMASTER.  Your
 proposal to refactor the pendingOpsTable representation seems like it
 will help a lot.  Perhaps you should do that first and then we can
 reassess.
 ...
 In my view, the elephant in the room here is that it's dramatically
 inefficient for every backend to send an fsync request on every block
 write.  For many users, in many workloads, all of those requests will
 be for just a tiny handful of relation segments.  The fsync queue
 compaction code works as well as it does for precisely that reason -
 when it triggers, we typically can compact a list of thousands or
 millions of entries down to less than two dozen.  In other words, as I
 see it, the issue here is not so much that 100% of the fsync requests
 are useless when fsync=off, but rather that 99.9% of them are useless
 even when fsync=on.

 In any case, I'm still of the opinion that we ought to try making one
 fix (your proposed refactoring of the pendingOpsTable) and then see
 where we're at.

I've been chewing on this issue some more, and no longer like my
previous proposal, which was

 ... What I'm thinking about
 is reducing the hash key to just RelFileNodeBackend + ForkNumber,
 so that there's one hashtable entry per fork, and then storing a
 bitmap to indicate which segment numbers need to be sync'd.  At
 one gigabyte to the bit, I think we could expect the bitmap would
 not get terribly large.  We'd still have a cancel flag in each
 hash entry, but it'd apply to the whole relation fork not each
 segment.

The reason that's not so attractive is the later observation that what
we really care about optimizing is FORGET_RELATION_FSYNC for all the
forks of a relation at once, which we could produce just one request
for with trivial refactoring of smgrunlink/mdunlink.  The above
representation doesn't help for that.  So what I'm now thinking is that
we should create a second hash table, with key RelFileNode only,
carrying two booleans: a cancel-previous-fsyncs bool and a
please-unlink-after-checkpoint bool.  (The latter field would allow us
to drop the separate pending-unlinks data structure.)  Entries would
be made in this table when we got a FORGET_RELATION_FSYNC or
UNLINK_RELATION_REQUEST message -- note that in 99% of cases we'd get
both message types for each relation, since they're both created during
DROP.  (Maybe we could even combine these request types.)  To use the
table, as we scan the existing per-fork-and-segment hash table, we'd
have to do a lookup in the per-relation table to see if there was a
later cancel message for that relation.  Now this does add a few cycles
to the processing of each pendingOpsTable entry in mdsync ... but
considering that the major work in that loop is an fsync call, it is
tough to believe that anybody would notice an extra hashtable lookup.

However, I also came up with an entirely different line of thought,
which unfortunately seems incompatible with either of the improved
table designs above.  It is this: instead of having a request queue
that feeds into a hash table hidden within the checkpointer process,
what about storing the pending-fsyncs table as a shared hash table
in shared memory?  That is, ForwardFsyncRequest would not simply
try to add the request to a linear array, but would do a HASH_ENTER
call on a shared hash table.  This means the de-duplication occurs
for free and we no longer need CompactCheckpointerRequestQueue at all.
Basically, this would amount to saying that the original design was
wrong to try to micro-optimize the time spent in ForwardFsyncRequest,
and that we'd rather pay a little more per ForwardFsyncRequest call
to avoid the enormous response-time spike that will occur when
CompactCheckpointerRequestQueue has to run.  (Not to mention that
the checkpointer would eventually have to do HASH_ENTER anyway.)
I think this would address your observation above that the request
queue tends to contain an awful lot of duplicates.

But I only see how to make that work with the existing hash table
structure, because with either of the other table designs, it's
difficult to set a predetermined limit on the amount of shared
memory needed.  The segment-number bitmaps could grow uncomfortably
large in the first design, while in the second there's no good way
to know how large the per-relation table has to be to cover a given
size for the per-fork-and-segment table.  (The sore spot here is that
once we've accepted a per-fork entry, failing to record a relation-level
cancel for it is not an option, so we can't just return failure.)

So if we go that way it seems like we still have the 

Re: [HACKERS] bgwriter, regression tests, and default shared_buffers settings

2012-07-18 Thread Andrew Dunstan


On 07/18/2012 03:59 PM, Tom Lane wrote:

2. It's rather disturbing that a fairly large swath of functionality
just stopped getting tested at all by the buildfarm.  Do we want to
rethink the shared_buffers increase?  Or artificially bloat the
regression database to make it larger than 128MB?  Or do something else
to ensure we still exercise the DB-bigger-than-buffers case?


A couple of other ideas:

The buildfarm does have the ability to set config data after initdb has 
run (which I just enhanced in the latest release). So a buildfarm owner 
could add a config line for shared_buffers which would override what 
initdb had set.


Or we could provide an initdb flag which would set an upper bound on 
shared_buffers, and have make check (at least) use it.


I'd rather not bloat the regression database if we can reasonably avoid 
it. Buildfarm members are often tight on space.


cheers

andrew





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] row literal problem

2012-07-18 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 hm, it's the 'group by' -- for example with group by pg_namespace.oid,
 or group by pg_namespace.oid || 'abc', you can invent columns that come
 back into the rowtype.

Yeah, the whole-row variable is evidently picking up resjunk columns
from the inner query.  Haven't looked to see exactly where.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bgwriter, regression tests, and default shared_buffers settings

2012-07-18 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 The buildfarm does have the ability to set config data after initdb has 
 run (which I just enhanced in the latest release). So a buildfarm owner 
 could add a config line for shared_buffers which would override what 
 initdb had set.

 Or we could provide an initdb flag which would set an upper bound on 
 shared_buffers, and have make check (at least) use it.

 I'd rather not bloat the regression database if we can reasonably avoid 
 it. Buildfarm members are often tight on space.

Agreed on not wanting to bloat the regression DB just for this reason.
We see enough out of disk space failures already in the buildfarm.

I like the idea of modifying make check only, because then a typical
buildfarm run could exercise both DB-smaller-than-buffers (in the
installcheck case) and DB-larger-than-buffers (in make check).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Checkpointer split has broken things dramatically (was Re: DELETE vs TRUNCATE explanation)

2012-07-18 Thread Simon Riggs
On 17 July 2012 23:56, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Jul 16, 2012 at 3:18 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 BTW, while we are on the subject: hasn't this split completely broken
 the statistics about backend-initiated writes?

 Yes, it seems to have done just that.

 So I went to fix this in the obvious way (attached), but while testing
 it I found that the number of buffers_backend events reported during
 a regression test run barely changed; which surprised the heck out of
 me, so I dug deeper.  The cause turns out to be extremely scary:
 ForwardFsyncRequest isn't getting called at all in the bgwriter process,
 because the bgwriter process has a pendingOpsTable.  So it just queues
 its fsync requests locally, and then never acts on them, since it never
 runs any checkpoints anymore.

 This implies that nobody has done pull-the-plug testing on either HEAD
 or 9.2 since the checkpointer split went in (2011-11-01), because even
 a modicum of such testing would surely have shown that we're failing to
 fsync a significant fraction of our write traffic.

That problem was reported to me on list some time ago, and I made a note
to fix it after the last CF.

I added a note to 9.2 open items about it myself, but it appears my
fix was too simple and fixed only the reported problem not the
underlying issue. Reading your patch gave me strong deja vu, so not
sure what happened there.

Not very good from me. Feel free to thwack me to fix such things if I
seem not to respond quickly enough.

I'm now looking at the other open items in my area.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] b-tree index search algorithms

2012-07-18 Thread Samuel Vogel

Am 17.07.12 19:38, schrieb Tom Lane:
btree knows nothing about the datatypes it's working on except that 
they have comparison functions. Converting the values to some sort of 
numeric scale that you can interpolate on seems logically dubious and 
fraught with practical difficulties. Now, we do have some code in 
selfuncs.c that tries to do that, for some data types, but it's only 
for planner estimation purposes, and we don't rely very heavily on its 
results even in that context. Depending on it to be right for search 
purposes sounds pretty scary. 


'convert_string_to_scalar' and others look interesting in selfuncs.c, 
thanks for the pointer!



Okay, how are indexes on char/text columns handled then?

The datum values will be pointers to strings.


I can simply dereference it and read all bytes until a null-byte appears 
(depending on the collation and that I know that it actually is a string)?


The btree code is (or reasonably can be) aware that such values are 
pass-by-reference, and how to get to the bits. But the comparison 
semantics of two different values are not something it knows about 
except by asking the comparison function. This can be quite a 
nontrivial matter even for text, since we follow strcoll() comparison 
rules. regards, tom lane 


How would the b-tree know exactly that a value is only a reference? And 
even in that case you say that it could get the bits, but make no use of 
it, since it does not know what they represent, right?


Regards,
Samuel

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] CHECK NO INHERIT syntax

2012-07-18 Thread Peter Eisentraut
Sorry to raise this once again, but I still find this CHECK NO INHERIT
syntax to be a bit funny.  We are currently using something like

CHECK NO INHERIT (foo > 0)

But we already have a different syntax for attaching attributes to
constraints (NOT DEFERRABLE, NOT VALID,  etc.), so it would make more
sense to have

CHECK (foo > 0) NO INHERIT

Besides consistency, this makes more sense, because the attribute is a
property of the constraint as a whole, not of the checking.
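
Side by side, as a sketch (the table names and constraint body are
arbitrary):

    -- current spelling: the attribute splits CHECK from its expression
    CREATE TABLE t1 (foo int, CHECK NO INHERIT (foo > 0));

    -- proposed spelling: the attribute trails the whole constraint,
    -- as NOT DEFERRABLE and NOT VALID already do
    CREATE TABLE t2 (foo int, CHECK (foo > 0) NO INHERIT);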

This would also extend more easily to other constraint types.  For
example, when unifying CHECK and NOT NULL constraints, as is planned, or
when allowing inherited unique constraints, as is planned further down
the road.

There is also a hole in the current implementation.  Domain constraints
silently allow NO INHERIT to be specified, even though other senseless
attributes are rejected.
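
For instance, a sketch of the hole just described (domain name made up):

    -- accepted silently today, although NO INHERIT is meaningless here
    CREATE DOMAIN posint AS int CHECK NO INHERIT (VALUE > 0);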



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bgwriter, regression tests, and default shared_buffers settings

2012-07-18 Thread Andrew Dunstan


On 07/18/2012 05:37 PM, Tom Lane wrote:

Andrew Dunstan and...@dunslane.net writes:

The buildfarm does have the ability to set config data after initdb has
run (which I just enhanced in the latest release). So a buildfarm owner
could add a config line for shared_buffers which would override what
initdb had set.
Or we could provide an initdb flag which would set an upper bound on
shared_buffers, and have make check (at least) use it.
I'd rather not bloat the regression database if we can reasonably avoid
it. Buildfarm members are often tight on space.

Agreed on not wanting to bloat the regression DB just for this reason.
We see enough out of disk space failures already in the buildfarm.

I like the idea of modifying make check only, because then a typical
buildfarm run could exercise both DB-smaller-than-buffers (in the
installcheck case) and DB-larger-than-buffers (in make check).




Agreed. Something like:

--max_shared_buffers=32MB

for initdb, plus code to use it in pg_regress, should fit the bill.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] b-tree index search algorithms

2012-07-18 Thread Tom Lane
Samuel Vogel s...@muel-vogel.de writes:
 Am 17.07.12 19:38, schrieb Tom Lane:
 The datum values will be pointers to strings.

 I can simply dereference it and read all bytes until a null-byte appears 
 (depending on the collation and that I know that it actually is a string)?

We use a length word and then data, with no trailing null, at least for
text and varchar.  This area has been hacked for efficiency until it's
pretty complicated, but you can read about it in postgres.h.

 How would the b-tree know exactly that a value is only a reference? And 
 even in that case you say that it could get the bits, but make no use of 
 it, since it does not know what they represent, right?

It has access to the data type's basic storage parameters, which are
typbyval, typlen, and typalign; and we have standard conventions for
identifying the length etc of variable-length values.  It's just the
meaning of the payload data bytes that's data-type-private.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bgwriter, regression tests, and default shared_buffers settings

2012-07-18 Thread Amit Kapila
 From: pgsql-hackers-ow...@postgresql.org
[mailto:pgsql-hackers-ow...@postgresql.org] 
 On Behalf Of Tom Lane

 So that raises two independent sets of questions:

 1. Do we like the fact that the bgwriter isn't doing anything in this
 situation?  It seems arguably OK for writes to happen only for
 checkpointing purposes if there is no memory pressure.  But having the
 bgwriter wake up 5 times a second to decide it has nothing to do seems
 a bit wasteful.  I'm inclined to think maybe it should go into the
 recently added hibernation mode anytime the buffer freelist isn't
 empty.  Or maybe you could argue that this scenario isn't worth any
 optimization effort, but with many-gig RAM becoming more and more
 common, I don't think I agree.

I also believe it should go into hibernation mode if the freelist is not
exhausted, or perhaps even when the freelist is less than some threshold
(say 50%) used.
I have one doubt regarding this approach. From the above, what I
understood is that after going into hibernation it should ideally get
woken up either at timeout or when the freelist is exhausted.
However, the current code wakes up before ensuring whether the buffer
allocation can be done from the freelist.

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers