Re: [HACKERS] Latch for the WAL writer - further reducing idle wake-ups.

2012-05-02 Thread Heikki Linnakangas

On 03.05.2012 03:41, Robert Haas wrote:

On Wed, May 2, 2012 at 7:21 PM, Tom Lane  wrote:

Adding any contention at all to XLogInsert doesn't seem like a smart
idea, even if you failed to measure any problem in the specific tests
you made.  I wonder whether we could not improve matters by adding
an additional bool "wal_writer_needs_wakening" in the state that's
considered to be protected by WALInsertLock.


I am skeptical about this, although it could be right.  It could also
be better the way Peter did it; a fetch of an uncontended cache line
is pretty cheap.


I'm very wary of adding any extra shared memory accesses to XLogInsert. 
I spent a lot of time trying to eliminate them in my XLogInsert scaling 
patch. It might be ok if the flag is usually not modified, and we don't 
add any extra barrier instructions in there, but it would be better to 
avoid it.


One simple idea would be to only try to set the latch every 100 
XLogInsert calls in the backend. That would cut whatever contention it 
might cause by a factor of 100, making it negligible.
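A minimal sketch of that idea (the counter, the function, and the latch variable below are illustrative assumptions, not the actual patch):

    /* Hypothetical throttle: only every 100th XLogInsert() in this backend
     * even considers waking the WAL writer. */
    static uint32 xloginsert_count = 0;

    static void
    MaybeWakeWalWriter(Latch *walwriter_latch)
    {
        if (++xloginsert_count % 100 == 0)
            SetLatch(walwriter_latch);
    }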



 Another approach - which I think might be better
still - is to not bother kicking the WAL writer and let it wake up
when it wakes up.  Maybe have it hibernate for 3 seconds instead of
10, or something like that.  It seems unlikely to cause any real
problem if WAL writer takes a couple seconds to get with the program
after a long period of inactivity; note that an async commit will kick
it anyway, and a sync commit will probably have to flush WAL whether
the WAL writer wakes up or not.


Yeah, that'd be even simpler.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Future In-Core Replication

2012-05-02 Thread Jim Nasby

On 4/29/12 6:03 AM, Simon Riggs wrote:

>  The DML-WITH-LIMIT-1 is required to do single logical updates on tables
>  with non-unique rows.
>  And as for any logical updates we will have huge performance problem
>  when doing UPDATE or DELETE on large table with no indexes, but
>  fortunately this problem is on slave, not master;)

While that is possible, I would favour the do-nothing approach. By
making the default replication mode = none, we then require a PK to be
assigned before allowing replication mode = on for a table. Trying to
replicate tables without PKs is a problem that can wait basically.



Something that an in-core method might be able to do that an external one can't 
would be to support a method of uniquely identifying rows in tables with no 
PKs. A gross example (that undoubtedly wouldn't work in the real world) would 
be using TIDs. A real-world implementation might be based on a hidden serial 
column.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: xReader, double-effort (was: Temporary tables under hot standby)

2012-05-02 Thread Jim Nasby

On 4/29/12 9:27 AM, Kevin Grittner wrote:

Maybe I can help with that by describing what the Wisconsin court
system does for circuit court data.


Thanks for the write-up, it was insightful.

One thing I wanted to mention is that non-binary replication has an added 
advantage over binary from a DR standpoint: with binary replication, corruption 
that occurs on the master is more likely to make it into your replicas, thanks 
to full-page writes. You might want to consider that depending on how sensitive 
your data is.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Torn page hazard in ginRedoUpdateMetapage()

2012-05-02 Thread Tom Lane
Daniel Farina  writes:
> On Wed, May 2, 2012 at 6:06 PM, Noah Misch  wrote:
>> Can we indeed assume that all support-worthy filesystems align the start of
>> every file to a physical sector?  I know little about modern filesystem
>> design, but these references leave me wary of that assumption:
>> 
>> http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14690.html
>> http://en.wikipedia.org/wiki/Block_suballocation
>> 
>> If it is a safe assumption, we could exploit it elsewhere.

> Not to say whether this is safe or not, but it *is* exploited
> elsewhere, as I understand it: the pg_control information, whose
> justification for its safety is its small size.  That may point to a
>> very rare problem with pg_control rather than the safety of the assumption
> it makes.

I think it's somewhat common now for filesystems to attempt to optimize
very small files (on the order of a few dozen bytes) in that way.  It's
hard to see what the upside would be of changing the conventional storage
allocation once the file is sector-sized or larger; the file system does
have to be prepared to rewrite the file on demand, and moving it from
one place to another isn't cheap.

That wikipedia reference argues for doing this type of optimization on
the last partial block of a file, which is entirely irrelevant for our
purposes since we always ask for page-multiples of space.  (The fact
that much of that might be useless padding is, I think, unknown to the
filesystem.)

Having said all that, I wasn't really arguing that this was a guaranteed
safe thing for us to rely on; just pointing out that it's quite likely
that the issue hasn't been seen in the field because of this type of
consideration.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latch for the WAL writer - further reducing idle wake-ups.

2012-05-02 Thread Robert Haas
On Wed, May 2, 2012 at 11:42 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> ... It seems unlikely to cause any real
>> problem if WAL writer takes a couple seconds to get with the program
>> after a long period of inactivity; note that an async commit will kick
> it anyway, and a sync commit will probably have to flush WAL whether
>> the WAL writer wakes up or not.
>
> That's a good point.  What about only kicking the WAL writer in code
> paths where a backend found itself having to write/flush WAL for itself?
> The added overhead is very surely negligible in such a situation.

Yeah, I think that would make sense, though I'd probably still argue
for a hibernation period not quite so long as ten seconds.  Actually,
what I'd really like is for this to be adaptive: if we find that
there's no WAL to write, increase the time until the next wakeup by 10
ms until we hit the maximum of, say, 3 seconds.  If we find that there
is WAL to write, cut the time until the next wakeup in half until we
hit a minimum of, say, 20ms.  And, if we're forced to write/flush WAL
ourselves, or we async commit, kick the WAL writer in the pants and
wake him up right away.  That way we're willing to get
super-aggressive when needed, but we don't stay there very long once
the pounding ends.  Also, we avoid having a hard "cut" between regular
sleeps and deep hibernation; instead, we kind of gradually drift off.
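
Just to spell out the shape of that adaptive schedule (a standalone sketch; the
constants are simply the numbers from the paragraph above, and none of this is
actual walwriter code):

    #include <stdbool.h>

    #define MIN_NAP_MS     20      /* floor while WAL keeps arriving */
    #define MAX_NAP_MS   3000      /* deep-hibernation ceiling */
    #define NAP_STEP_MS    10      /* how fast we drift off when idle */

    static int nap_ms = 200;       /* arbitrary starting point */

    /* Pick the next sleep time based on whether the last cycle found work. */
    static int
    next_nap(bool wrote_wal)
    {
        if (wrote_wal)
        {
            nap_ms /= 2;                    /* get aggressive quickly */
            if (nap_ms < MIN_NAP_MS)
                nap_ms = MIN_NAP_MS;
        }
        else
        {
            nap_ms += NAP_STEP_MS;          /* gradually drift off */
            if (nap_ms > MAX_NAP_MS)
                nap_ms = MAX_NAP_MS;
        }
        return nap_ms;
    }

A forced write/flush or an async commit would bypass the schedule entirely by
setting the WAL writer's latch directly.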

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Torn page hazard in ginRedoUpdateMetapage()

2012-05-02 Thread Daniel Farina
On Wed, May 2, 2012 at 6:06 PM, Noah Misch  wrote:
> Can we indeed assume that all support-worthy filesystems align the start of
> every file to a physical sector?  I know little about modern filesystem
> design, but these references leave me wary of that assumption:
>
> http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14690.html
> http://en.wikipedia.org/wiki/Block_suballocation
>
> If it is a safe assumption, we could exploit it elsewhere.

Not to say whether this is safe or not, but it *is* exploited
elsewhere, as I understand it: the pg_control information, whose
justification for its safety is its small size.  That may point to a
very rare problem with pg_control rather than the safety of the assumption
it makes.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Features of Postgresql and Postgres-xc with MySQL

2012-05-02 Thread Vivek Singh Raghuwanshi
Thanks this is very helpful

Regards
ViVek

On Thu, May 3, 2012 at 4:57 AM, Michael Paquier
wrote:

> On Wed, May 2, 2012 at 2:37 PM, Vivek Singh Raghuwanshi <
> vivekraghuwan...@gmail.com> wrote:
>
>> Please send me the link or white papers from where i can get information
>> like.
>> 3. Feature comparison of PostgreSQL and Postgres-XC
>> 4. and can we use Postgres-XC in production with mission critical env
>> with heavy load
>>
> If you are looking for some white papers about Postgres-XC:
> - some presentation documents done by people:
> http://sourceforge.net/projects/postgres-xc/files/Presentation/
> - some publications:
> http://sourceforge.net/projects/postgres-xc/files/Publication/
> - some presentations I did about the project:
> http://michael.otacoo.com/presentations/conferences/
>
> Hope this helps.
> Regards,
> --
> Michael Paquier
> http://michael.otacoo.com
>



-- 
ViVek Raghuwanshi
Mobile -+91-09595950504

Skype - vivek_raghuwanshi


Re: [HACKERS] Latch for the WAL writer - further reducing idle wake-ups.

2012-05-02 Thread Tom Lane
Robert Haas  writes:
> ... It seems unlikely to cause any real
> problem if WAL writer takes a couple seconds to get with the program
> after a long period of inactivity; note that an async commit will kick
> it anyway, and a sync commit will probably have to flush WAL whether
> the WAL writer wakes up or not.

That's a good point.  What about only kicking the WAL writer in code
paths where a backend found itself having to write/flush WAL for itself?
The added overhead is very surely negligible in such a situation.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Torn page hazard in ginRedoUpdateMetapage()

2012-05-02 Thread Noah Misch
On Mon, Apr 30, 2012 at 02:35:20PM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > When GIN changes a metapage, we WAL-log its ex-header content and never use 
> > a
> > backup block.  This reduces WAL volume since the vast majority of the 
> > metapage
> > is unused.  However, ginRedoUpdateMetapage() only restores the WAL-logged
> > content if the metapage LSN predates the WAL record LSN.  If a metapage 
> > write
> > tore and updated the LSN but not the other content, we would fail to 
> > complete
> > the update.  Instead, unconditionally reinitialize the metapage similar to 
> > how
> > _bt_restore_meta() handles the situation.
> 
> > I found this problem by code reading and did not attempt to build a test 
> > case
> > illustrating its practical consequences.  It's possible that there's no
> > problem in practice on account of some reason I haven't contemplated.
> 
> I think there's no problem in practice; the reason is that the
> GinMetaPageData struct isn't large enough to extend past the first
> physical sector of the page.  So it's in the same disk sector as the
> LSN and tearing is impossible.  Still, this might be a good
> future-proofing move, in case GinMetaPageData gets larger.

Can we indeed assume that all support-worthy filesystems align the start of
every file to a physical sector?  I know little about modern filesystem
design, but these references leave me wary of that assumption:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg14690.html
http://en.wikipedia.org/wiki/Block_suballocation

If it is a safe assumption, we could exploit it elsewhere.
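
Tom's point above could be written down as a hypothetical compile-time check --
nothing like this exists in the tree, and it only holds if the metapage data
really does start right after the page header:

    /* The GIN metapage data, plus the page header in front of it, must fit in
     * one 512-byte physical sector so the LSN and the metadata cannot be torn
     * apart by a partial sector write. */
    StaticAssertStmt(MAXALIGN(SizeOfPageHeaderData) + sizeof(GinMetaPageData) <= 512,
                     "GIN metapage must fit in the first physical sector");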

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch submission: truncate trailing nulls from heap rows to reduce the size of the null bitmap

2012-05-02 Thread Josh Berkus
On 5/2/12 10:20 AM, Jameison Martin wrote:
> Attached are the following as per various requests:
>   * test_results.txt: the performance benchmarking results, 
> 
>   * TestTrailingNull.java: the performance benchmarking code, with a few 
> additional scenarios as per various requests
> 
>   * hardinfo_report.txt: some information about the hardware and OS of 
> the system on which the benchmarks were run, and
> 
>   * postgresql.conf: the postgresql.conf used when running benchmarks. 
> Note that the changes made to the vanilla postgresql.conf can be identified 
> by looking for the string 'jamie' in the file I attached (there aren't that 
> many)

Nice, thanks.  I'll try some of my own tests when I get a chance; I have
a really good use-case for this optimization.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Temporary tables under hot standby

2012-05-02 Thread Josh Berkus
Michael,

> What is the use case for temporary tables on a hot standby server?
> 
> Perhaps this is a noobie question, but it seems to me that a hot standby
> server's use by* applications* or *users* should be limited to transactions
> that don't alter the database in any form.

A very common use for asynchronous replicas is to offload long-running
reporting jobs onto the replica so that they don't bog down the master.
 However, long-running reporting jobs often require temporary tables,
especially if they use some 3rd-party vendor's reporting tool.  For
example, the average Microstrategy report involves between 1 and 12
temporary tables.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latch for the WAL writer - further reducing idle wake-ups.

2012-05-02 Thread Robert Haas
On Wed, May 2, 2012 at 7:21 PM, Tom Lane  wrote:
> It is getting a bit late to be considering such changes for 9.2, but
> I'm willing to review and commit this if there's not anybody who feels
> strongly that it's too late.  Personally I think it's in the nature of
> cleanup and so fair game as long as we haven't formally started beta.
> However I will confess to some bias about wanting to get the server's
> idle wake-up rate down, because Fedora people have been bugging me
> about that for a long time now.  So I'm probably not the best person to
> objectively evaluate whether we should hold this for 9.3.  Comments?

Well, I feel that one of the weaknesses of our CommitFest process is
that changes like this (which are really pretty small) end up having
the same deadline as patches that are large (command triggers,
checksums, etc.); in fact, they sometimes end up having an earlier
deadline, because the people doing the big stuff end up continuing to
hack on it for another couple months while the door is shut to smaller
improvements.  So I'm not going to object if you feel like slipping
this one in.  I looked it over myself and I think it's broadly
reasonable, although I'm not too sure about the particular criteria
chosen for sending the WAL writer to sleep and waking it up again.
And like you I'd like to see some more improvement in this area.

> Adding any contention at all to XLogInsert doesn't seem like a smart
> idea, even if you failed to measure any problem in the specific tests
> you made.  I wonder whether we could not improve matters by adding
> an additional bool "wal_writer_needs_wakening" in the state that's
> considered to be protected by WALInsertLock.

I am skeptical about this, although it could be right.  It could also
be better the way Peter did it; a fetch of an uncontended cache line
is pretty cheap.  Another approach - which I think might be better
still - is to not bother kicking the WAL writer and let it wake up
when it wakes up.  Maybe have it hibernate for 3 seconds instead of
10, or something like that.  It seems unlikely to cause any real
problem if WAL writer takes a couple seconds to get with the program
after a long period of inactivity; note that an async commit will kick
it anyway, and a sync commit will probably have to flush WAL whether
the WAL writer wakes up or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Advisory locks seem rather broken

2012-05-02 Thread Tom Lane
According to
http://archives.postgresql.org/pgsql-general/2012-04/msg00374.php
advisory locks now cause problems for prepared transactions, which
ought to ignore them.  It appears to me that this got broken by
commit 62c7bd31c8878dd45c9b9b2429ab7a12103f3590, which marked the
userlock lock method as transactional, which seems just about 100%
misguided to me.  At the very least this would require reconsidering
every single place that tests lock transactionality, and that evidently
did not happen.

If this patch weren't already in a released branch I would be arguing
for reverting it.  As is, I think we're going to have to clean it up.
I don't have time to look at it in detail right now, though.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Features of Postgresql and Postgres-xc with MySQL

2012-05-02 Thread Michael Paquier
On Wed, May 2, 2012 at 2:37 PM, Vivek Singh Raghuwanshi <
vivekraghuwan...@gmail.com> wrote:

> Please send me the link or white papers from where i can get information
> like.
> 3. Feature comparison of PostgreSQL and Postgres-XC
> 4. and can we use Postgres-XC in production with mission critical env with
> heavy load
>
If you are looking for some white papers about Postgres-XC:
- some presentation documents done by people:
http://sourceforge.net/projects/postgres-xc/files/Presentation/
- some publications:
http://sourceforge.net/projects/postgres-xc/files/Publication/
- some presentations I did about the project:
http://michael.otacoo.com/presentations/conferences/

Hope this helps.
Regards,
-- 
Michael Paquier
http://michael.otacoo.com


Re: [HACKERS] Latch for the WAL writer - further reducing idle wake-ups.

2012-05-02 Thread Tom Lane
Peter Geoghegan  writes:
> Attached patch latches up the WAL Writer, reducing wake-ups and thus
> saving electricity in a way that is more-or-less analogous to my work
> on the BGWriter:
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6d90eaaa89a007e0d365f49d6436f35d2392cfeb
> I am hoping this gets into 9.2.  I am conscious of the fact that this
> is quite late, but the patch addresses an open item, the concluding
> part of a much wider feature.

It is getting a bit late to be considering such changes for 9.2, but
I'm willing to review and commit this if there's not anybody who feels
strongly that it's too late.  Personally I think it's in the nature of
cleanup and so fair game as long as we haven't formally started beta.
However I will confess to some bias about wanting to get the server's
idle wake-up rate down, because Fedora people have been bugging me
about that for a long time now.  So I'm probably not the best person to
objectively evaluate whether we should hold this for 9.3.  Comments?

Schedule questions aside, I'm disturbed by this bit:

> My choice of XLogInsert() as an additional site at which to call
> SetLatch() was one that wasn't taken easily, and frankly I'm not
> entirely confident that I couldn't have been just as effective while
> placing the SetLatch() call in a less hot, perhaps higher-level
> codepath.

Adding any contention at all to XLogInsert doesn't seem like a smart
idea, even if you failed to measure any problem in the specific tests
you made.  I wonder whether we could not improve matters by adding
an additional bool "wal_writer_needs_wakening" in the state that's
considered to be protected by WALInsertLock.  XLogInsert would check
this while still holding the lock, and only consider that it needs to do
a SetLatch if the flag was set, whereupon it would clear it before
releasing the lock.  In the normal case this would add one uncontended
fetch followed by boolean-test-and-jump to the work done while holding
the lock, which should be pretty negligible.  Then, the WAL writer would
need to take WALInsertLock to set that flag, but presumably it should
only be doing that when there is no contention for the lock.  (In fact,
we could have it do a ConditionalLockAcquire on WALInsertLock for the
purpose, and consider that failure means it shouldn't go to sleep after
all.)
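
Roughly, the scheme might look like this (an illustrative fragment only; the
field, the latch variable, and the exact call sites are assumptions, not
working code):

    /* New field in the shared XLOG insert state, protected by WALInsertLock. */
    bool        wal_writer_needs_wakening;

    /* Backend side, in XLogInsert(), while WALInsertLock is still held: */
    bool        do_wakeup = false;

    if (Insert->wal_writer_needs_wakening)
    {
        Insert->wal_writer_needs_wakening = false;  /* clear before releasing */
        do_wakeup = true;
    }
    LWLockRelease(WALInsertLock);
    if (do_wakeup)
        SetLatch(&WalWriterLatch);                  /* latch name assumed */

    /* WAL writer side, just before deciding to hibernate: */
    if (LWLockConditionalAcquire(WALInsertLock, LW_EXCLUSIVE))
    {
        Insert->wal_writer_needs_wakening = true;
        LWLockRelease(WALInsertLock);
        /* safe to sleep on the latch now */
    }
    else
    {
        /* lock is contended, so presumably there is work: skip hibernating */
    }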

Now this might sound pretty much equivalent to testing the latch's
is_set flag; perhaps it is and I'm worrying over nothing.  But I'm
thinking that the wal_writer_needs_wakening flag would be in a cache
line that an acquirer of WALInsertLock would have to get ownership of
anyway, if it is adjacent to variables that XLogInsert has to manipulate
anyway.  On the other hand, the WAL writer's process latch would be in
some other cache line that would also need to get passed around a lot,
if it's touched during every XLogInsert.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Unnecessary WAL archiving after failover

2012-05-02 Thread Robert Haas
On Fri, Mar 23, 2012 at 10:03 AM, Fujii Masao  wrote:
> On second thought, I found other issues about WAL archiving after
> failover. So let me clarify the issues again.
>
> Just after failover, there can be three kinds of WAL files in new
> master's pg_xlog directory:
>
> (1) WAL files which were recycled to by restartpoint
>
> I've already explained upthread the issue which these WAL files cause
> after failover.

Check.

> (2) WAL files which were restored from the archive
>
> In 9.1 or before, the restored WAL files don't remain after failover
> because they are always restored onto the temporary filename
> "RECOVERYXLOG". So the issue which I explain from now doesn't exist
> in 9.1 or before.
>
> In 9.2dev, as the result of supporting cascade replication,
> an archived WAL file is restored onto correct file name so that
> cascading walsender can send it to another standby. This restored
> WAL file has neither .ready nor .done archive status file. After
> failover, checkpoint checks the archive status file of the restored
> WAL file to attempt to recycle it, finds that it has neither .ready
> nor .done, and creates .ready.  Because of the existence of .ready,
> it will be archived again even though it obviously already exists in
> the archival storage :(
>
> To prevent a restored WAL file from being archived again, I think
> that .done should be created whenever WAL file is successfully
> restored (of course this should happen only when archive_mode is
> enabled). Thought?
>
> Since this is the oversight of cascade replication, I'm thinking to
> implement the patch for 9.2dev.

Yes, I think we had better fix this in 9.2.  As you say, it's a loose
end from streaming replication.  Do you have a patch?
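
For what it's worth, a hedged sketch of that kind of fix: create the .done
status file as soon as a segment has been restored from the archive, so that
checkpoint never invents a .ready for it.  The helper name and call site here
are assumptions, not the eventual patch:

    /* Called after a WAL segment has been successfully restored from the
     * archive; only meaningful when archive_mode is on. */
    static void
    MarkRestoredSegmentDone(const char *xlogfname)
    {
        char    donepath[MAXPGPATH];
        FILE   *fd;

        if (!XLogArchivingActive())
            return;

        snprintf(donepath, MAXPGPATH, XLOGDIR "/archive_status/%s.done",
                 xlogfname);
        fd = fopen(donepath, "w");      /* empty file, like the .ready files */
        if (fd != NULL)
            fclose(fd);
    }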

> (3) WAL files which were streamed from the master
>
> These WAL files also don't have any archive status, so checkpoint
> creates .ready for them after failover. And then, all or many of
> them will be archived at a time, which would cause an I/O spike on
> both WAL and archival storage.
>
> To avoid this problem, I think that we should change walreceiver
> so that it creates .ready as soon as it completes the WAL file. Also
> we should change the archiver process so that it starts up even in
> standby mode and archives the WAL files.
>
> If each server has its own archival storage, the above solution would
> work fine. But if all servers share the archival storage, multiple archiver
> processes in those servers might archive the same WAL file to
> the shared area at the same time. Is this OK? If not, to avoid this,
> we might need to separate archive_mode into two: one for normal mode
> (i.e., master), another for standby mode.  If the archive is shared,
> we can ensure that only one archiver in the master copies the WAL file
> at the same time by disabling WAL archiving in standby mode but
> enabling it in normal mode. Thought?

Another option would be to run the archiver in both modes and somehow
pass a flag indicating whether it's running in standby mode or normal
running.

> Invoking the archiver process in standby mode is new feature,
> not a bug fix. It's too late to propose new feature for 9.2. So I'll
> propose this for 9.3.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: additional error fields

2012-05-02 Thread Tom Lane
"Kevin Grittner"  writes:
> Tom Lane  wrote:
>> "Kevin Grittner"  writes:
>>> That "F0" class looks suspicious; are those really defined by
>>> standard or did we encroach on standard naming space with
>>> PostgreSQL-specific values?

>> I think we screwed up on that :-(.  So we ought to renumber those
>> codes anyway.  Perhaps use "PF" instead of "F0"?
 
> Sounds good to me.

I thought for a few minutes about whether we ought to try to sneak
such a change into 9.2.  But given that we're talking about probably
doing a number of other SQLSTATE reassignments in the future, it
seems likely better to wait and absorb all that pain in a single
release cycle.  It seems moderately unlikely that any client-side
code is dependent on these specific assignments, but still I'd rather
not see a dribble of "we changed some SQLSTATEs" compatibility flags
across several successive releases.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] online debloatification (was: extending relations more efficiently)

2012-05-02 Thread Robert Haas
On Wed, May 2, 2012 at 4:39 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> Brainstorming wildly, how about something like this:
>
>> 1. Insert a new copy of the tuple onto some other heap page.  The new
>> tuple's xmin will be that of the process doing the tuple move, and
>> we'll also set a flag indicating that a move is in progress.
>> 2. Set a flag on the old tuple, indicating that a tuple move is in
>> progress.  Set its TID to the new location of the tuple.  Set xmax to
>> the tuple mover's XID.  Optionally, truncate away the old tuple data,
>> leaving just the tuple header.
>> 3. Scan all indexes and replace any references to the old tuple's TID
>> with references to the new tuple's TID.
>> 4. Commit.
>
> What happens when you crash partway through that?

Well, there are probably a few loose ends here, but the idea is that
if we crash after step 2 is complete, the next vacuum is responsible
for performing steps 3 and 4.  As written, there's probably a problem
if we crash between (1) and (2); I think those would need to be done
atomically, or at least we need to make sure that the moving-in flag
is set on the new tuple if and only if there is actually a redirect
pointing to it.

> Also, what happens if
> somebody wishes to update the tuple before the last step is complete?

Then we let them.  The idea is that they see the redirect tuple at the
old TID, follow it to the new copy of the tuple, and update that
instead.

> In any case, this doesn't address the fundamental problem with unlocked
> tuple movement, which is that you can't just arbitrarily change a
> tuple's TID when there might be other operations relying on the TID
> to hold still.

Well, that's why I invented the redirect tuple, so that anyone who was
relying on the TID to hold still would see the redirect and say, oh, I
need to go look at this other TID instead.  It's entirely possible
there's some reason why that can't work, but at the moment I'm not
seeing it.  I see that there's a problem if the old TID gets freed
while someone's relying on it, but replacing it with a pointer to some
other TID seems like it ought to be workable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] online debloatification (was: extending relations more efficiently)

2012-05-02 Thread Tom Lane
Robert Haas  writes:
> Brainstorming wildly, how about something like this:

> 1. Insert a new copy of the tuple onto some other heap page.  The new
> tuple's xmin will be that of the process doing the tuple move, and
> we'll also set a flag indicating that a move is in progress.
> 2. Set a flag on the old tuple, indicating that a tuple move is in
> progress.  Set its TID to the new location of the tuple.  Set xmax to
> the tuple mover's XID.  Optionally, truncate away the old tuple data,
> leaving just the tuple header.
> 3. Scan all indexes and replace any references to the old tuple's TID
> with references to the new tuple's TID.
> 4. Commit.

What happens when you crash partway through that?  Also, what happens if
somebody wishes to update the tuple before the last step is complete?

In any case, this doesn't address the fundamental problem with unlocked
tuple movement, which is that you can't just arbitrarily change a
tuple's TID when there might be other operations relying on the TID
to hold still.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Modeling consumed shmem sizes, and some thorns

2012-05-02 Thread Daniel Farina
Hello List,

I'd like to share with you some experiences we've had while
investigating what we'd have to do to make very-very tiny databases.

First, the formulae at
http://www.postgresql.org/docs/9.1/static/kernel-resources.html#SHARED-MEMORY-PARAMETERS
(17-2) seem misleading, particularly with regard to the overhead of
supporting a large number of connections: they undercount by rather a
lot.  I think the estimate of 270 bytes per lock is far too low on
Linux-amd64.  In addition, this number seems undercounted by 50% or
more because of the addition of predicate locks in 9.1.

Presuming the base cost of 1800 bytes per connection is still
right, experimentally it seems to me that the right numbers are closer
to 700 bytes per max_locks_per_transaction, and 650 for each
max_pred_locks_per_transaction, although there appears to be some
non-linear behavior that makes this a hazy projection.
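
To put those numbers together, here is a throwaway calculator using only the
empirical figures above (it does not come from the documentation's formulas):

    #include <stdio.h>

    int
    main(void)
    {
        long    max_connections = 500;
        long    max_locks_per_transaction = 64;         /* default */
        long    max_pred_locks_per_transaction = 64;    /* default */

        /* ~1800 bytes base + ~700 per lock slot + ~650 per predicate lock slot */
        long    per_conn = 1800
            + 700 * max_locks_per_transaction
            + 650 * max_pred_locks_per_transaction;

        printf("per-connection lock overhead: %ld bytes\n", per_conn);
        printf("total for %ld connections: %.1f MB\n", max_connections,
               per_conn * max_connections / (1024.0 * 1024.0));
        return 0;
    }

With the default lock settings and 500 connections that works out to roughly
42 MB, which is the same ballpark as the nearly-50%-of-100MB observation below.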

Besides accuracy, there is a thornier problem here that has to do with
hot standby (although the use case is replication more generally) when
one has heterogeneously sized database resources. As-is, the
locking-related settings -- max_connections, max_prepared_xacts, and
max_locks_per_xact (but not predicate locks; is that an oversight?) --
must be set at least as high on a standby as on the primary.

In a heterogeneous environment using WAL-based replication, that means
that to obtain full compatibility among different-sized systems one must
always permit a large number of connections (specifically, the largest
number supported by any database configuration).  That large number of
connections can occupy a large fraction of the overall memory allotted
to a small database: on a database that is intended to receive only
100MB of shmem, lock-related memory consumption can approach nearly 50%
of the overall total, which is rather unfortunate.  I can see why that'd
be hard to fix (maybe, instead, a more logical replication layer is a
better investment of time), but I thought it an interesting
consideration worth discussing.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How hard would it be to support LIKE in return declaration of generic record function calls ?

2012-05-02 Thread Merlin Moncure
On Wed, May 2, 2012 at 12:06 PM, Peter Eisentraut  wrote:
> On ons, 2012-05-02 at 13:40 +0200, Hannu Krosing wrote:
>> How hard would it be to add support for LIKE syntax, similar to table
>> def in field list declaration for generic record functions
>>
>> What I'd like to be able to do is to have a generic json_to_record
>> function
>>
>> CREATE OR REPLACE FUNCTION json_to_record(json) RETURNS RECORD AS $$
>> ...
>> $$ LANGUAGE ... ;
>>
>> and then be able to call it like this
>>
>> insert into test2
>> select * from json_to_record(jrec json) as (like test2);
>
> That would be very useful, and shouldn't be too hard to implement.  (I
> had to look about three times to understand what this was supposed to
> achieve, but I think the syntax is the right one after all.)

Although I like the functionality, is this better than the trick used
by hstore/populate_record?  That approach doesn't require syntax
changes and allows you to execute the function without 'FROM'.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PL/Python result set slicing broken in Python 3

2012-05-02 Thread Peter Eisentraut
This doesn't work anymore with Python 3:

rv = plpy.execute(...)
do_something(rv[0:1])

Apparently, they changed the C API for doing slicing, or rather made one
of the two APIs for it silently do nothing.  Details are difficult to
find, but this email message seems to contain something:
.

I'll try to sort this out sometime, but if someone wants to take a shot
at it, go ahead.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Beta time?

2012-05-02 Thread Bruce Momjian
 On Wed, May 02, 2012 at 01:15:48PM -0400, Tom Lane wrote:
> Peter Eisentraut  writes:
> > On ons, 2012-05-02 at 00:31 -0400, Tom Lane wrote:
> >> Checking this patch, I noticed that config.guess and config.sub harbor
> >> most of the remaining references to those platforms, which reminded me:
> >> don't we usually update those files from autoconf upstream before beta?
> 
> > Yes, once we know when beta is, we can move on that. ;-)
> 
> Next week, I thought.

How are we handling the Monday release with everyone at PGCon?  Was that
resolved?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Have we out-grown Flex?

2012-05-02 Thread Bruce Momjian
On Wed, May 02, 2012 at 10:37:58AM -0700, Jeff Janes wrote:
> I could try to clean up and post the patch that implements this if you want.
> 
> The second method was just to do --enable-profiling on a stock build
> and look at the call graph section of gprof output.  It attributed 50%
> to pg_plan_queries and children and about 10% to each of
> pg_parse_query and pg_analyze_and_rewrite (including their respective
> children).  I don't put tremendous faith in gprof's call graph, but
> the fact that the results are in agreement with the other method gives
> me more confidence in both.

Those are the ratios I (and I think Tom) expected to see.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] online debloatification (was: extending relations more efficiently)

2012-05-02 Thread Robert Haas
On Wed, May 2, 2012 at 1:06 PM, Alvaro Herrera
 wrote:
> Excerpts from Robert Haas's message of mié may 02 12:55:17 -0400 2012:
>> On Wed, May 2, 2012 at 12:46 PM, Alvaro Herrera
>>  wrote:
>> > Agreed.  Perhaps to solve this issue what we need is a way to migrate
>> > tuples from later pages into earlier ones.  (This was one of the points,
>> > never resolved, that we discussed during the VACUUM FULL rework.)
>>
>> Yeah, I agree.  And frankly, we need to find a way to make it work
>> without taking AccessExclusiveLock on the relation.  Having to run
>> VACUUM FULL is killing actual users and scaring off potential ones.
>
> And ideally without bloating the indexes while at it.

Yeah.

Brainstorming wildly, how about something like this:

1. Insert a new copy of the tuple onto some other heap page.  The new
tuple's xmin will be that of the process doing the tuple move, and
we'll also set a flag indicating that a move is in progress.
2. Set a flag on the old tuple, indicating that a tuple move is in
progress.  Set its TID to the new location of the tuple.  Set xmax to
the tuple mover's XID.  Optionally, truncate away the old tuple data,
leaving just the tuple header.
3. Scan all indexes and replace any references to the old tuple's TID
with references to the new tuple's TID.
4. Commit.
5. Once the XID of the tuple mover is all-visible, nuke the old TID
and clear the flag on the new tuple indicating a move-in-progress
(these two operations must be done together, atomically, with a single
WAL record covering both).

Any scan that encounters the old tuple will decide whether or not it
can see the tuple based on the xmin & xmax in the old tuple's header.
If it decides it can see it, it follows the TID pointer and does its
work using the new tuple instead.  Scans that encounter the new tuple
need no special handling; the existing visibility rules are fine for
that case.  Prune operations must not truncate away tuples that are
being moved into or out of the page, and vacuum must not mark pages
containing such tuples as all-visible.
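
Restated as pseudocode, with every helper name below being hypothetical (this
is just the brainstormed protocol in code form, not a design):

    static void
    move_tuple(Relation rel, ItemPointer old_tid)
    {
        TransactionId   mover_xid = GetCurrentTransactionId();
        ItemPointerData new_tid;

        /* 1. new copy on another page: xmin = mover, move-in-progress flag set */
        new_tid = heap_copy_tuple_for_move(rel, old_tid, mover_xid);

        /* 2. old tuple becomes a redirect: flag it, point its TID at new_tid,
         *    set xmax = mover, optionally truncate the data to just the header */
        heap_mark_moved_off(rel, old_tid, new_tid, mover_xid);

        /* 3. swing index entries from old_tid to new_tid */
        reindex_replace_tid(rel, old_tid, new_tid);

        /* 4. commit */
        CommitTransactionCommand();

        /* 5. later, once mover_xid is all-visible: remove old_tid and clear the
         *    move-in-progress flag on the new tuple under a single WAL record */
    }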

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plpython crash (PG 92)

2012-05-02 Thread Peter Eisentraut
On lör, 2012-04-28 at 00:32 -0400, Tom Lane wrote:
> I'm inclined to think that the best fix is for
> PLy_spi_execute_fetch_result to copy the tupledesc into
> TopMemoryContext, not the current context.  This is a tad scary from a
> memory leakage standpoint, but I suppose that if python fails to recover
> the PLyResultObject, this isn't the only memory that's going to be
> leaked.
> 
> This area appears to be shy a regression test case or two, in any event.

Fixed like that.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch pg_is_in_backup()

2012-05-02 Thread Gabriele Bartolini

Hi Gilles,

   Sorry for the delay.

Il 03/04/12 14:21, Gilles Darold ha scritto:

+1, this is also my point of view.


   I have looked at the patch that contains both pg_is_in_backup() and 
pg_backup_start_time().


   From a functional point of view it looks fine to me. I was thinking 
of adding a BackupInProgress() check at the beginning of 
pg_backup_start_time(), but the AllocateFile() call already makes 
sure the file exists.


   I have performed some basic testing of both functions and tried to 
inject invalid characters in the start time field of the backup_label 
file and it is handled (with an exception) by the server. Cool.


   I spotted though some formatting issues, in particular indentation 
and multi-line comments. Some rows are longer than 80 chars.


   Please resubmit with these cosmetic changes and it is fine with me. 
Thank you.


Cheers,
Gabriele

--
 Gabriele Bartolini - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Have we out-grown Flex?

2012-05-02 Thread Magnus Hagander
On Wed, May 2, 2012 at 3:33 PM, Peter Geoghegan  wrote:
> On 2 May 2012 04:57, Tom Lane  wrote:
>> FWIW, I think only developers not packagers would really be taking such
>> a hit.  I assume we'd continue to ship prebuilt lexer output in
>> tarballs, so there'd seldom be a reason for a packager to need to run
>> the tool.  Given the extremely slow rate of churn of the lexer, it might
>> not be necessary for most developers to have the tool installed, either,
>> if we were willing to put the derived file into git.
>
> Incidentally, I had an unrelated conversation with someone (I think it
> might have been Heikki) a while back, where it was suggested that Flex
> and Bison could be run through web services. This might actually make

Might've been me - I've been doing that for a long time to work around
winflex issues. But I never got around to doing anything like access
control or such; I just ran it on a hidden IP on a random port, with a
simple curl call on the Windows box.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Have we out-grown Flex?

2012-05-02 Thread Jeff Janes
On Wed, May 2, 2012 at 9:31 AM, Peter Geoghegan  wrote:
> On 2 May 2012 17:20, Jeff Janes  wrote:
>> For -S -M simple, the time spent planning is 5 times more than the
>> time spent parsing.  It may be worthwhile to reduce the time spent
>> parsing, but if the goal is parity with MySQL it probably isn't the
>> place to start.
>
> Could you please share your figures and methodology? I've heard of far
> larger proportions than that.

I used two methods.  One was to hack exec_simple_query so that, under
the control of a new GUC, it would, for example, run pg_parse_query on
the query 101 times, throwing away the results of the first 100 before
proceeding to use the 101st parse result as normal.  Then I just ran
pgbench under both settings, took the difference in 1/TPS between them,
and divided by 100 to get the seconds per parse (and multiplied by 1e6
to get usec/parse).

I did the same thing for pg_analyze_and_rewrite, and for
pg_analyze_and_rewrite+pg_plan_queries (pg_plan_queries scribbles on
the structure produced by pg_analyze_and_rewrite, so you have to
repeat both as a unit, and then subtract the
pg_analyze_and_rewrite timings off afterwards to isolate just the
planning).
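
The arithmetic behind that, spelled out (the TPS figures are made up purely
for the example; only the formula matters):

    #include <stdio.h>

    int
    main(void)
    {
        double  tps_normal = 9500.0;    /* pgbench -S with the repeat GUC off */
        double  tps_repeat = 3300.0;    /* same run with 100 extra parses/query */
        int     extra_calls = 100;

        double  usec_per_call =
            (1.0 / tps_repeat - 1.0 / tps_normal) / extra_calls * 1e6;

        printf("%.2f usec per call\n", usec_per_call);
        return 0;
    }

Plugging in real pgbench numbers for each of the three repeated code paths
gives the per-call figures quoted below.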

On my current laptop and rebased to git HEAD, I got
2usec/pg_parse_query, 2usec/pg_analyze_and_rewrite, and
12usec/pg_plan_queries.   Since my laptop is dual core, I did this
with -c2 -j2.

Back when I originally implemented and tested this on much older
hardware and about one year older pgsql code base, the absolute values
of usec/action were several fold higher, but the ratios of 1:1:6 were
about the same.

This does risk that it will overlook caching effects by repeating the
same thing in a tight loop.  I.e. parses 2 through 101 might be much
faster than parse 1 was.  Also, it risks that I simply don't know what
I'm doing and my attempts to throw away the results of a parse are
misbegotten--I just overwrote the old pointer with the new one and
assumed the memory context would clean up the resulting orphans.

I could try to clean up and post the patch that implements this if you want.

The second method was just to do --enable-profiling on a stock build
and look at the call graph section of gprof output.  It attributed 50%
to pg_plan_queries and children and about 10% to each of
pg_parse_query and pg_analyze_and_rewrite (including their respective
children).  I don't put tremendous faith in gprof's call graph, but
the fact that the results are in agreement with the other method gives
me more confidence in both.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Remove dead ports

2012-05-02 Thread Tom Lane
Peter Eisentraut  writes:
> On ons, 2012-05-02 at 00:31 -0400, Tom Lane wrote:
>> Checking this patch, I noticed that config.guess and config.sub harbor
>> most of the remaining references to those platforms, which reminded me:
>> don't we usually update those files from autoconf upstream before beta?

> Yes, once we know when beta is, we can move on that. ;-)

Next week, I thought.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Remove dead ports

2012-05-02 Thread Peter Eisentraut
On ons, 2012-05-02 at 00:31 -0400, Tom Lane wrote:
> Peter Eisentraut  writes:
> > Remove dead ports
> 
> Checking this patch, I noticed that config.guess and config.sub harbor
> most of the remaining references to those platforms, which reminded me:
> don't we usually update those files from autoconf upstream before beta?

Yes, once we know when beta is, we can move on that. ;-)

Btw., I had intentionally kept the uses in ps_status.c and getrusage.c,
because they remain useful in case someone wants to use these files for
reference.  But I guess that's debatable, because we no longer have a
way to prove that those uses actually continue to work.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extending relations more efficiently

2012-05-02 Thread Alvaro Herrera

Excerpts from Robert Haas's message of mié may 02 12:55:17 -0400 2012:
> 
> On Wed, May 2, 2012 at 12:46 PM, Alvaro Herrera
>  wrote:
> > Agreed.  Perhaps to solve this issue what we need is a way to migrate
> > tuples from later pages into earlier ones.  (This was one of the points,
> > never resolved, that we discussed during the VACUUM FULL rework.)
> 
> Yeah, I agree.  And frankly, we need to find a way to make it work
> without taking AccessExclusiveLock on the relation.  Having to run
> VACUUM FULL is killing actual users and scaring off potential ones.

And ideally without bloating the indexes while at it.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How hard would it be to support LIKE in return declaration of generic record function calls ?

2012-05-02 Thread Peter Eisentraut
On ons, 2012-05-02 at 13:40 +0200, Hannu Krosing wrote:
> How hard would it be to add support for LIKE syntax, similar to table
> def in field list declaration for generic record functions
> 
> What I'd like to be able to do is to have a generic json_to_record
> function 
> 
> CREATE OR REPLACE FUNCTION json_to_record(json) RETURNS RECORD AS $$
> ...
> $$ LANGUAGE ... ;
> 
> and then be able to call it like this
> 
> insert into test2 
> select * from json_to_record(jrec json) as (like test2);

That would be very useful, and shouldn't be too hard to implement.  (I
had to look about three times to understand what this was supposed to
achieve, but I think the syntax is the right one after all.)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] clog double-dip in heap_hot_search_buffer

2012-05-02 Thread Robert Haas
On Wed, May 2, 2012 at 12:19 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Wed, May 2, 2012 at 12:06 PM, Tom Lane  wrote:
>>> Otherwise this seems like a good idea.
>
>> Do you think I should apply this to 9.2, or wait until 9.3?
>
> We're not at beta yet, and it seems pretty safe/self-contained, so
> I have no objection to committing now.

OK, done.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extending relations more efficiently

2012-05-02 Thread Robert Haas
On Wed, May 2, 2012 at 12:46 PM, Alvaro Herrera
 wrote:
> Agreed.  Perhaps to solve this issue what we need is a way to migrate
> tuples from later pages into earlier ones.  (This was one of the points,
> never resolved, that we discussed during the VACUUM FULL rework.)

Yeah, I agree.  And frankly, we need to find a way to make it work
without taking AccessExclusiveLock on the relation.  Having to run
VACUUM FULL is killing actual users and scaring off potential ones.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extending relations more efficiently

2012-05-02 Thread Alvaro Herrera

Excerpts from Robert Haas's message of mié may 02 12:37:35 -0400 2012:
> 
> On Wed, May 2, 2012 at 12:26 PM, Alvaro Herrera
>  wrote:
> > Excerpts from Robert Haas's message of mié may 02 08:14:36 -0400 2012:
> >> On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen  wrote:
> >> > On 2012-05-01 22:06, Robert Haas wrote:
> >> >> It might also be interesting to provide a mechanism to pre-extend a
> >> >> relation to a certain number of blocks, though if we did that we'd
> >> >> have to make sure that autovac got the memo not to truncate those
> >> >> pages away again.
> >> >
> >> > Good point.  And just to check before skipping over it, do we know that
> >> > autovacuum not leaving enough slack space is not a significant cause of 
> >> > the
> >> > bottlenecks in the first place?
> >>
> >> I'm not sure exactly what you mean by this: autovacuum doesn't need
> >> any slack space.  Regular DML operations can certainly benefit from
> >> slack space, both within each page and overall within the relation.
> >> But there's no evidence that vacuum is doing too good a job cleaning
> >> up the mess, forcing the relation to be re-extended.  Rather, the
> >> usual (and frequent) complaint is that vacuum is leaving way too much
> >> slack space - i.e. bloat.
> >
> > Hm.  I see those two things as different -- to me, bloat is unremoved
> > dead tuples, whereas slack space would be free space that can be reused
> > by new tuples.  Slack space is useful as it avoids relation extension;
> > bloat is not.
> 
> I guess I think of bloat as including both unremoved dead tuples and
> unwanted internal free space.  If you create a giant table, delete 9
> out of every 10 tuples, and vacuum, the table is still "bloated", IMV.

Agreed.  Perhaps to solve this issue what we need is a way to migrate
tuples from later pages into earlier ones.  (This was one of the points,
never resolved, that we discussed during the VACUUM FULL rework.)

> > I wonder, though, if we should set a less-than-100 fillfactor for heap
> > relations.  Just like default_statistic_target, it seems that the
> > default value should be a conservative tradeoff between two extremes.
> > This doesn't help extension for bulk insertions a lot, of course, but
> > it'd be useful for tables where HOT updates happen with some regularity.
> 
> Perhaps, but in theory that should be self-correcting: the data should
> spread itself onto the number of pages where HOT pruning is able to
> prevent further relation extension.

True.  I gather you consider that the cases where it doesn't happen due
to particular conditions are the ones that need manual tweaking.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extending relations more efficiently

2012-05-02 Thread Tom Lane
Robert Haas  writes:
> On Wed, May 2, 2012 at 12:26 PM, Alvaro Herrera
>  wrote:
>> Hm.  I see those two things as different -- to me, bloat is unremoved
>> dead tuples, whereas slack space would be free space that can be reused
>> by new tuples.  Slack space is useful as it avoids relation extension;
>> bloat is not.

> I guess I think of bloat as including both unremoved dead tuples and
> unwanted internal free space.  If you create a giant table, delete 9
> out of every 10 tuples, and vacuum, the table is still "bloated", IMV.

The difficulty is to tell the difference between useless free space and
useful free space.  If there's a reasonable probability of putting new
data into a given chunk of free space in the near future, it's not
bloat.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Request to add options to tools/git_changelog

2012-05-02 Thread Bruce Momjian
On Sun, Apr 29, 2012 at 02:06:48PM -0400, Jay Levitt wrote:
> Bruce Momjian wrote:
> >I am again requesting the addition of options to tools/git_changelog so
> >I can more easily produce the release notes.  I asked for this during
> >9.1 development and it was rejected.  I am currently using my own
> >custom version of the tool, but have to merge community improvements
> >into the tool every year before I use it.
> 
> FYI in the general case of "I have to maintain a patch set": Now
> that PG is on git, there's a tool called Stacked Git that lets you
> use git's excellent merge capabilities to maintain patches.
> 
> http://www.procode.org/stgit/

I am unclear what stgit does that can't be done with git branches?  It
mentions pushing and popping patches --- is that it?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extending relations more efficiently

2012-05-02 Thread Robert Haas
On Wed, May 2, 2012 at 12:26 PM, Alvaro Herrera
 wrote:
> Excerpts from Robert Haas's message of mié may 02 08:14:36 -0400 2012:
>> On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen  wrote:
>> > On 2012-05-01 22:06, Robert Haas wrote:
>> >> It might also be interesting to provide a mechanism to pre-extend a
>> >> relation to a certain number of blocks, though if we did that we'd
>> >> have to make sure that autovac got the memo not to truncate those
>> >> pages away again.
>> >
>> > Good point.  And just to check before skipping over it, do we know that
>> > autovacuum not leaving enough slack space is not a significant cause of the
>> > bottlenecks in the first place?
>>
>> I'm not sure exactly what you mean by this: autovacuum doesn't need
>> any slack space.  Regular DML operations can certainly benefit from
>> slack space, both within each page and overall within the relation.
>> But there's no evidence that vacuum is doing too good a job cleaning
>> up the mess, forcing the relation to be re-extended.  Rather, the
>> usual (and frequent) complaint is that vacuum is leaving way too much
>> slack space - i.e. bloat.
>
> Hm.  I see those two things as different -- to me, bloat is unremoved
> dead tuples, whereas slack space would be free space that can be reused
> by new tuples.  Slack space is useful as it avoids relation extension;
> bloat is not.

I guess I think of bloat as including both unremoved dead tuples and
unwanted internal free space.  If you create a giant table, delete 9
out of every 10 tuples, and vacuum, the table is still "bloated", IMV.

> I wonder, though, if we should set a less-than-100 fillfactor for heap
> relations.  Just like default_statistic_target, it seems that the
> default value should be a conservative tradeoff between two extremes.
> This doesn't help extension for bulk insertions a lot, of course, but
> it'd be useful for tables where HOT updates happen with some regularity.

Perhaps, but in theory that should be self-correcting: the data should
spread itself onto the number of pages where HOT pruning is able to
prevent further relation extension.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] eqjoinsel_semi still sucks ...

2012-05-02 Thread Tom Lane
I looked into Maxim Boguk's complaint of bad estimation of antijoin size:
http://archives.postgresql.org/pgsql-general/2012-05/msg00033.php

I can reproduce what I think the problem is in the regression database.
We do okay with this:

regression=# explain analyze select * from tenk1 a where not exists(select 1 
from tenk1 b where a.thousand = b.unique2);
QUERY PLAN  
   
---
 Hash Anti Join  (cost=395.26..1003.26 rows=1 width=244) (actual 
time=264.324..264.324 rows=0 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual 
time=0.050..47.798 rows=10000 loops=1)
   ->  Hash  (cost=270.26..270.26 rows=10000 width=4) (actual 
time=129.420..129.420 rows=10000 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 274kB
 ->  Index Only Scan using tenk1_unique2 on tenk1 b  (cost=0.00..270.26 
rows=10000 width=4) (actual time=0.422..65.480 rows=10000 loops=1)
   Heap Fetches: 0
 Total runtime: 267.732 ms
(8 rows)

but not so okay when a filter condition is added inside the sub-select:

regression=# explain analyze select * from tenk1 a where not exists(select 1 
from tenk1 b where a.thousand = b.unique2 and b.two = 0);
  QUERY PLAN
  
--
 Hash Anti Join  (cost=545.50..1091.00 rows=1 width=244) (actual 
time=123.713..265.185 rows=5090 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..458.00 rows=10000 width=244) (actual 
time=0.048..46.685 rows=10000 loops=1)
   ->  Hash  (cost=483.00..483.00 rows=5000 width=4) (actual 
time=123.483..123.483 rows=5000 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 137kB
 ->  Seq Scan on tenk1 b  (cost=0.00..483.00 rows=5000 width=4) (actual 
time=0.059..91.405 rows=5000 loops=1)
   Filter: (two = 0)
   Rows Removed by Filter: 5000
 Total runtime: 284.889 ms
(9 rows)

Now, eqjoinsel_semi is correctly estimating that the condition
a.thousand = b.unique2 is unselective in itself: all values of
a.thousand will have join partners in the first case.  The problem comes
in trying to account for the additional filter condition.  The heuristic
we're currently using is to reduce the number of distinct values assumed
for the inner variable according to the selectivity of the additional
conditions.  In this case, though, that results in reducing ndistinct
for b.unique2 from 10000 to 5000, which is still more than ndistinct for
a.thousand (i.e., 1000), so the final selectivity estimate doesn't
change at all.  Oops.

On reflection I think that the idea of clamping ndistinct beforehand is
just wrong, and what we ought to do instead is apply a multiplier to the
selectivity estimate afterwards.  In the case of a base rel we could
just multiply by the selectivity of its baserestrictinfo list.  For join
rels it's a bit harder to guess how much a given input relation might
have been decimated, but if the join's estimated size is smaller than
the output size of the base rel the correlation var came from, we could
multiply by that ratio (on top of whatever correction came from the base
rel's restriction clauses).
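
To make the arithmetic concrete, here is a rough stand-alone C sketch of
that correction (the function and parameter names are invented for
illustration; this is not proposed eqjoinsel_semi code):

/*
 * raw_sel               - semijoin selectivity computed as if the inner rel
 *                         were unfiltered
 * base_restrict_sel     - combined selectivity of the inner base rel's
 *                         baserestrictinfo clauses (0..1)
 * joinrel_rows          - estimated rows of the inner rel if it is itself a
 *                         join, else equal to filtered_baserel_rows
 * filtered_baserel_rows - estimated output rows of the base rel the
 *                         correlated Var came from
 */
static double
corrected_semijoin_sel(double raw_sel,
                       double base_restrict_sel,
                       double joinrel_rows,
                       double filtered_baserel_rows)
{
    double mult = base_restrict_sel;

    /* inner join rel already smaller than its base rel: assume further decimation */
    if (filtered_baserel_rows > 0 && joinrel_rows < filtered_baserel_rows)
        mult *= joinrel_rows / filtered_baserel_rows;

    return raw_sel * mult;
}

In the example above, the raw semijoin selectivity is about 1.0 (every
a.thousand value has a join partner) and the filter's selectivity is 0.5,
so the corrected semijoin selectivity would be about 0.5 and the anti join
estimate would land near 5000 rows, much closer to the observed 5090.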

Thoughts?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Have we out-grown Flex?

2012-05-02 Thread Peter Geoghegan
On 2 May 2012 17:20, Jeff Janes  wrote:
> For -S -M simple, the time spent planning is 5 times more than the
> time spent parsing.  It may be worthwhile to reduce the time spent
> parsing, but if the goal is parity with MySQL it probably isn't the
> place to start.

Could you please share your figures and methodology? I've heard of far
larger proportions than that.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extending relations more efficiently

2012-05-02 Thread Alvaro Herrera

Excerpts from Robert Haas's message of mié may 02 08:14:36 -0400 2012:
> On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen  wrote:
> > On 2012-05-01 22:06, Robert Haas wrote:
> >> It might also be interesting to provide a mechanism to pre-extend a
> >> relation to a certain number of blocks, though if we did that we'd
> >> have to make sure that autovac got the memo not to truncate those
> >> pages away again.
> >
> > Good point.  And just to check before skipping over it, do we know that
> > autovacuum not leaving enough slack space is not a significant cause of the
> > bottlenecks in the first place?
> 
> I'm not sure exactly what you mean by this: autovacuum doesn't need
> any slack space.  Regular DML operations can certainly benefit from
> slack space, both within each page and overall within the relation.
> But there's no evidence that vacuum is doing too good a job cleaning
> up the mess, forcing the relation to be re-extended.  Rather, the
> usual (and frequent) complaint is that vacuum is leaving way too much
> slack space - i.e. bloat.

Hm.  I see those two things as different -- to me, bloat is unremoved
dead tuples, whereas slack space would be free space that can be reused
by new tuples.  Slack space is useful as it avoids relation extension;
bloat is not.

I wonder, though, if we should set a less-than-100 fillfactor for heap
relations.  Just like default_statistic_target, it seems that the
default value should be a conservative tradeoff between two extremes.
This doesn't help extension for bulk insertions a lot, of course, but
it'd be useful for tables where HOT updates happen with some regularity.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] smart shutdown at end of transaction (was: Default mode for shutdown)

2012-05-02 Thread Bruce Momjian
On Sun, Apr 29, 2012 at 10:19:38AM +0100, Simon Riggs wrote:
> Maybe we don't need to do this over multiple releases, but we do need
> to give warning of possible incompatibilities. It would be good to see
> a specific post on hackers called "Planned Incompatibilities in 9.2",
> or collect such things on the open items wiki, so that people
> listening can see what might happen and get a chance to object. Or if
> changes do go ahead, at least we give them a few months warning to
> change the downstream software. Otherwise all that happens is our new
> release comes out and fewer people use it because it takes ages to
> actually realign the software stack enough for our software to be
> used.

The release notes would certainly feature this as an incompatibility,
and would be present even before beta started.  Unless they skip reading
the release notes, it would be hard for them to miss this change.  I
also blog when major release notes are available for viewing.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Have we out-grown Flex?

2012-05-02 Thread Jeff Janes
On Tue, May 1, 2012 at 5:53 PM, Peter Geoghegan  wrote:
> Quite apart from the practical difficulties that we have with Flex
> (the fact that the authors are non-responsive and possibly retired,
> that annoying compiler warning, and the fact that we are forced to
> maintain our own Windows binaries of 2.5.35), it has some notable
> disadvantages. I am aware that the MySQL people use their own
> hand-coded lexical analyzer named sql_lex.cc, which provides a yacc
> interface, while avoiding using any lexical analyzer generator
> whatsoever. They can't have done this just for fun, and no doubt this
> goes some way to explaining their continued performance advantage for
> very simple queries. I have heard people complain about Postgres
> parser overhead for "pgbench -S" style use-cases where it is
> unreasonably high, and I've heard them do so more than once.

For -S -M simple, the time spent planning is 5 times more than the
time spent parsing.  It may be worthwhile to reduce the time spent
parsing, but if the goal is parity with MySQL it probably isn't the
place to start.

(If you use a bottom-up profiler, the time spent in planning is
scattered over so many different functions that none of them look very
important individually)

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] clog double-dip in heap_hot_search_buffer

2012-05-02 Thread Tom Lane
Robert Haas  writes:
> On Wed, May 2, 2012 at 12:06 PM, Tom Lane  wrote:
>> Otherwise this seems like a good idea.

> Do you think I should apply this to 9.2, or wait until 9.3?

We're not at beta yet, and it seems pretty safe/self-contained, so
I have no objection to committing now.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] clog double-dip in heap_hot_search_buffer

2012-05-02 Thread Simon Riggs
On Wed, May 2, 2012 at 5:07 PM, Robert Haas  wrote:
> On Wed, May 2, 2012 at 12:06 PM, Tom Lane  wrote:
>> Robert Haas  writes:
>>> +     /* Deleter committed, so tuple is alive if the XID is old enough. */
>>> +     return TransactionIdPrecedes(HeapTupleHeaderGetXmax(tuple), 
>>> OldestXmin);
>>
>> s/alive/dead/ in that comment?
>
> Yep, good catch, thanks.
>
>> Otherwise this seems like a good idea.
>
> Do you think I should apply this to 9.2, or wait until 9.3?

Now please. Well done.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Temporary tables under hot standby

2012-05-02 Thread Simon Riggs
On Wed, May 2, 2012 at 4:14 PM, Bruce Momjian  wrote:

> I think if implementing global temporary tables only for hot standby
> user (#7), it might be of limited usefulness, but the ability to avoid
> system table churn (#1) means global temporary tables would have a wide
> usefulness, even without hot standby use.

Yes, many parts of Noah's proposal would be useful for normal running.
And as Jim points out, they are SQL Standard, as well as supported by
Oracle and DB2, so again GTTs would hit that TODO item.

And Noah's proposal does take us more than 50%, maybe 80% of the way
to what I think would be most useful for HS.

So overall, I do encourage the proposal.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] clog double-dip in heap_hot_search_buffer

2012-05-02 Thread Robert Haas
On Wed, May 2, 2012 at 12:06 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> +     /* Deleter committed, so tuple is alive if the XID is old enough. */
>> +     return TransactionIdPrecedes(HeapTupleHeaderGetXmax(tuple), 
>> OldestXmin);
>
> s/alive/dead/ in that comment?

Yep, good catch, thanks.

> Otherwise this seems like a good idea.

Do you think I should apply this to 9.2, or wait until 9.3?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] clog double-dip in heap_hot_search_buffer

2012-05-02 Thread Tom Lane
Robert Haas  writes:
> + /* Deleter committed, so tuple is alive if the XID is old enough. */
> + return TransactionIdPrecedes(HeapTupleHeaderGetXmax(tuple), OldestXmin);

s/alive/dead/ in that comment?  Otherwise this seems like a good idea.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Temporary tables under hot standby

2012-05-02 Thread Christopher Browne
On Wed, May 2, 2012 at 11:39 AM, Michael Nolan  wrote:
> What is the use case for temporary tables on a hot standby server?

Simple...

We required a "hot standby" server in order to get improved reliability.

But we don't want it to sit there chewing power + money, unused.

We want to *use* it to support our reporting applications.

And the developers used temporary tables to marshal results used in
some of those reports.

There are conflicting senses of "read-only" here...
  - In one strict sense, to generate tuples in temp tables means it's
not "read only" access.
  - But since the users running reports aren't allowed to modify the
data in the application tables that they are querying, how is that
*not* fairly characterized as "read only" access???
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Temporary tables under hot standby

2012-05-02 Thread Michael Nolan
What is the use case for temporary tables on a hot standby server?

Perhaps this is a noobie question, but it seems to me that a hot standby
server's use by *applications* or *users* should be limited to transactions
that don't alter the database in any form.

However, I can see where temporary tables might be needed at the system
level (if not already available) in order to prepare more efficient plans
for some complex read-only queries.
--
Mike Nolan


[HACKERS] Latch for the WAL writer - further reducing idle wake-ups.

2012-05-02 Thread Peter Geoghegan
Attached patch latches up the WAL Writer, reducing wake-ups and thus
saving electricity in a way that is more-or-less analogous to my work
on the BGWriter:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6d90eaaa89a007e0d365f49d6436f35d2392cfeb

I am hoping this gets into 9.2. I am conscious of the fact that this
is quite late, but the patch addresses an open item, the concluding
part of a much wider feature. In any case, it is a useful patch that
ought to be committed at some point. I should point out:

1. This functionality was covered by the group commit patch that I
worked on back in January, which was submitted in advance of the
commitfest deadline. However, an alternative implementation was
ultimately committed that did not consider WAL Writer wake-ups.

2. The WAL writer is the most important auxiliary process to latch-up.
Though it is tied with the BGWriter at 5 wake-ups per second by
default, I consider the WAL Writer to be more important than the
BGWriter because I find it much more plausible that the WAL Writer
really won't need to be around for much of the time, as with a
read-mostly work load. "Cloud" type deployments often have read-mostly
workloads, so we can still save some power even if the DB is actually
servicing lots of read queries. That being the case, it would be a
shame if we didn't get this last one in, as it adds a lot more value
than any of the other patches.

3. This is a fairly simple patch; as I've said, it works in a way that
is quite analogous to the BGWriter patch, applying lessons learned
there.

With this patch, my instrumentation shows that wake-ups when Postgres
reaches a fully idle state are just 2.7 per second for the entire
postgres process group, quite an improvement on the 7.6 per second in
HEAD. This is exactly what you'd expect from a reduction of 5 wake-ups
per second to 0.1 per second on average for the WAL Writer.

I have determined this with PowerTOP 1.13 on my Fedora 16 laptop. Here
is an example session, begun after the cluster reached a fully idle
state, with this patch applied (if, alternatively, I want to see
things at per-process granularity, I can get that from PowerTOP 1.98
beta 1, which is available from my system's package manager):

[peter@peterlaptop powertop-1.13]$ sudo ./powertop -d --time=300
[sudo] password for peter:
PowerTOP 1.13   (C) 2007 - 2010 Intel Corporation

Collecting data for 300 seconds


Cn                Avg residency         P-states (frequencies)
C0 (cpu running)        ( 2.8%)           2.61 Ghz     5.7%
polling           0.0ms ( 0.0%)           1.80 Ghz     0.1%
C1 mwait          0.5ms ( 1.0%)           1200 Mhz     0.1%
C2 mwait          0.9ms ( 0.6%)           1000 Mhz     0.2%
C3 mwait          1.4ms ( 0.1%)            800 Mhz    93.5%
C4 mwait          6.7ms (95.4%)
Wakeups-from-idle per second : 171.3    interval: 300.0s
no ACPI power usage estimate available
Top causes for wakeups:
  23.0% (134.5)   chrome
***SNIP***
   0.5% (  2.7)   postgres
***SNIP***

This is a rather low number that will make us really competitive with
other RDBMSs in this area. Recall that we started from 11.5 wake-ups
for an idle Postgres cluster with a default configuration.

To put the 2.7 number in context, I measured MySQL's wake-ups at 2.2
last year (mysql-server version 5.1.56, Fedora 14), though I
subsequently saw much higher numbers (over 20 per second) for version
5.5.19 on Fedora 16, so you should probably take that with a grain of
salt - I don't know anything about MySQL, and so cannot really be sure
that I'm making an objective comparison in comparing that number with
the number of wake-ups Postgres has with a stock postgresql.conf.

I've employed the same trick used when a buffer is dirtied for the
BGWriter - most of the time, the SetLatch() calls will check a single
flag, and find it already set. We are careful to only "arm" the latch
with a call to ResetLatch() when it is really needed. Rather than
waiting for the clocksweep to be lapped, we wait for a set number of
iterations of consistent inactivity.
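
In pseudo-C, the pattern looks roughly like this (the names are made up for
illustration; this is not the patch itself, and a real version has to think
carefully about ordering and races around the flag):

/* Shared flag: true only while the WAL writer is hibernating. */
static volatile sig_atomic_t *WalWriterAsleep;  /* lives in shared memory */
static Latch *WalWriterLatch;                   /* WAL writer's process latch */

/* Hot path (e.g. from XLogInsert): usually just one read of the flag. */
static inline void
WakeWalWriterIfAsleep(void)
{
    if (*WalWriterAsleep)
    {
        *WalWriterAsleep = false;
        SetLatch(WalWriterLatch);
    }
}

/* WAL writer, after enough idle cycles: "arm" the latch and hibernate. */
static void
WalWriterStartHibernation(void)
{
    ResetLatch(WalWriterLatch);
    *WalWriterAsleep = true;
    /* ...then WaitLatch() with a long (or no) timeout follows... */
}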

I've made the WAL Writer use its process latch, rather than the latch
that was previously within XLogCtl. This seems much more idiomatic, as
in doing so we reserve the right to register generic signal handlers.
With a non-process latch, we'd have to worry about signal invalidation
issues on an ongoing basis, since the handler wouldn't be calling
SetLatch() against the latch we waited on. I have also added a comment
in latch.h generally advising against ad-hoc shared latches where .

I took initial steps to quantify the performance hit from this patch.
A simple "insert.sql" pgbench-tools benchmark on my laptop, with a
generic configuration showed no problems, though I do not assume that
this conclusively proves the case. Results:

http://walwriterlatch.staticloud.com/

My choice of XLogInsert() as an additional site at which to call
SetLatch() was one that wasn't taken easily, and frankly I'm not
entirely confident that I couldn't have been just as effective w

Re: [HACKERS] Temporary tables under hot standby

2012-05-02 Thread Bruce Momjian
On Thu, Apr 26, 2012 at 08:56:40AM +0100, Simon Riggs wrote:
> A full GTT implementation is not required and the design differed from
> that. I don't think "hideously complicated" is accurate, that's just
> you're way of saying "and I disagree". Either route is pretty complex
> and not much to choose between them, apart from the usefulness of the
> end product - GTTs are not that beneficial as a feature in themselves.
> 
> The current problems of our temp table approach are
> 1. Catalog bloat
> 2. Consumption of permanent XIDs for DML on temp tables.  This increases 
> COMMIT
>   cost on the master and is a non-starter under hot standby.
> 3. autovacuum cannot VACUUM temporary tables, but their relfrozenxid values
>   delay pg_clog truncation and can trigger a wraparound-prevention shutdown.
> 4. sinval traffic from every CREATE TEMP TABLE et al.
> 5. CREATE TEMP TABLE records AccessExclusiveLock in the global lock table.
> 6. We don't automatically drop temporary tables that existed at the point of a
>   crash, because they look much like permanent tables.
> 2. Cross-backend access/security
> 7. Temp tables don't work on HS
> 8. No Global Temp tables
> 
> Implementing GTTs solves (8) and provides some useful tools to solve
> other points. Note that GTTs do not themselves solve 1-7 in full,
> hence my point that GTTs are an endpoint not a way station. The way
> forwards is not to concentrate on GTTs but to provide a set of
> facilities that allow all the more basic points 1-6 to be addressed,
> in full and then solve (7) and (8).  If we pretend (8) solves (7) as
> well, we will make mistakes in implementation that will waste time and
> deliver reduced value.
> 
> In passing I note that GTTs are required to allow PostgresXC to
> support temp tables, since they need a mechanism to makes a single
> temp table definition work on multiple nodes with different data in
> each.
> 
> Simply put, I don't think we should be emphasising things that are
> needed for PostgresXC and EDB AS, but not that important for
> PostgreSQL users.

I think if we implement global temporary tables only for hot standby
use (#7), they might be of limited usefulness, but the ability to avoid
system table churn (#1) means global temporary tables would have wide
usefulness, even without hot standby use.  The idea of sharing optimizer
statistics also has a lot of merit.

FYI, global temp tables have been on the TODO list long before EDB
employment or PostgresXC.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] clog double-dip in heap_hot_search_buffer

2012-05-02 Thread Robert Haas
heap_hot_search_buffer() does this:

valid = HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer);

If it turns out that the tuple isn't valid (i.e. visible to our scan)
and we haven't yet found any live tuples in the current HOT chain,
then we check whether it's visible to anyone at all:

if (all_dead && *all_dead &&
HeapTupleSatisfiesVacuum(heapTuple->t_data, RecentGlobalXmin,
 buffer) != HEAPTUPLE_DEAD)
*all_dead = false;

This is obviously an important optimization for accelerating index
cleanup, but it has an unfortunate side-effect: it considerably
increases the frequency of CLOG access.  Normally,
HeapTupleSatisfiesVisibility() will set hint bits on the tuple, but
sometimes it can't, either because the inserter has not yet committed
or the inserter's commit record hasn't been flushed or the deleter
hasn't committed or the deleter's commit record hasn't been flushed.
When that happens, HeapTupleSatisfiesVacuum() gets called a moment
later and repeats the same CLOG lookups.  It is of course possible for
a state change to happen in the interim, but that's not really a
reason to repeat the lookups; asking the same question twice in a row
just in case you should happen to get an answer you like better the
second time is not generally a good practice, even if it occasionally
works.

The attached patch adds a new function HeapTupleIsSurelyDead(), a
cut-down version of HeapTupleSatisfiesVacuum().  It assumes that,
first, we only care about distinguishing between dead and anything
else, and, second, that any transaction for which hint bits aren't yet
set is still running.  This allows it to be a whole lot simpler than
HeapTupleSatisfiesVacuum() and to get away without doing any CLOG
access.  It also changes heap_hot_search_buffer() to use this new
function in lieu of HeapTupleSatisfiesVacuum().
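
For readers who don't want to open the attachment, the shape of the new
check is roughly this - a sketch of the hint-bit-only logic described
above, not the patch verbatim:

static bool
tuple_is_surely_dead(HeapTupleHeader tuple, TransactionId OldestXmin)
{
    /*
     * Inserter not hinted committed: if it's hinted aborted the tuple is
     * dead, otherwise assume the inserter is still running.
     */
    if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
        return (tuple->t_infomask & HEAP_XMIN_INVALID) != 0;

    /* No deleter, a multixact, or just a row lock: can't be surely dead. */
    if (tuple->t_infomask &
        (HEAP_XMAX_INVALID | HEAP_XMAX_IS_MULTI | HEAP_IS_LOCKED))
        return false;

    /* Deleter not hinted committed: assume it's still running. */
    if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED))
        return false;

    /* Deleter committed, so tuple is dead if the XID is old enough. */
    return TransactionIdPrecedes(HeapTupleHeaderGetXmax(tuple), OldestXmin);
}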

I found this problem by using 'perf record -e cs -g' and 'perf report
-g' to find out where context switches were happening.  It turns out
that this is a very significant contributor to CLOG-related context
switches.  Retesting with those same tools shows that the patch does
in fact make those context switches go away.  On a long pgbench test,
the effects of WALInsertLock contention, ProcArrayLock contention,
checkpoint-related latency, etc. will probably swamp the effect of the
patch.  On a short test, however, the effects are visible; and in
general anything that optimizes away access to heavily contended
shared memory data structures is probably a good thing.  Permanent
tables, scale factor 100, 30-second tests:

master:
tps = 22175.025992 (including connections establishing)
tps = 22072.166338 (including connections establishing)
tps = 22653.876341 (including connections establishing)

with patch:
tps = 26586.623556 (including connections establishing)
tps = 25564.098898 (including connections establishing)
tps = 25756.036647 (including connections establishing)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


surely-dead-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: additional error fields

2012-05-02 Thread Kevin Grittner
Tom Lane  wrote:
> "Kevin Grittner"  writes:
>> That "F0" class looks suspicious; are those really defined by
>> standard or did we encroach on standard naming space with
>> PostgreSQL-specific values?
> 
> I think we screwed up on that :-(.  So we ought to renumber those
> codes anyway.  Perhaps use "PF" instead of "F0"?
 
Sounds good to me.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: additional error fields

2012-05-02 Thread Tom Lane
"Kevin Grittner"  writes:
> That "F0" class looks suspicious; are those really defined by standard or
> did we encroach on standard naming space with PostgreSQL-specific
> values?

I think we screwed up on that :-(.  So we ought to renumber those
codes anyway.  Perhaps use "PF" instead of "F0"?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: additional error fields

2012-05-02 Thread Kevin Grittner
Tom Lane  wrote:
 
> My guess is that all the ones defined in the SQL standard are
> "expected" errors, more or less by definition, and thus not
> interesting according to Peter G's criteria.
 
On a scan through the list, I didn't see any exceptions to that,
except for the "F0" class.  To restate what the standard reserves
for standard SQLSTATE values in the form of a regular expression, it
looks like:
 
'^[0-4A-H][0-9A-Z][0-4A-H][0-9A-Z][0-9A-Z]$'
 
Eyeballing the errcode page in the docs, it looks like there are
PostgreSQL-assigned values that start with '5', 'P', and 'X'.  That
"F0" class looks suspicious; are those really defined by standard or
did we encroach on standard naming space with PostgreSQL-specific
values?

We also have PostgreSQL-specific values in standard classes where we
use 'P' for the third character, which is fine.
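
Restating that regular expression as a trivial C check (illustration only,
not something proposed for the tree; it assumes the remaining characters
are already valid [0-9A-Z]):

#include <stdbool.h>

/* A class or subclass is reserved for the standard when its first
 * character is a digit 0-4 or a letter A-H. */
static bool
in_standard_space(char c)
{
    return (c >= '0' && c <= '4') || (c >= 'A' && c <= 'H');
}

/* sqlstate must point at exactly five characters. */
static bool
sqlstate_is_standard_reserved(const char *sqlstate)
{
    return in_standard_space(sqlstate[0]) &&   /* class */
           in_standard_space(sqlstate[2]);     /* subclass */
}

By that test the existing "F0" codes fall inside the reserved space (hence
the suspicion), while anything in a "PF" class, or with 'P' as the third
character, does not.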
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: add conversion from pg_wchar to multibyte

2012-05-02 Thread Alexander Korotkov
On Wed, May 2, 2012 at 5:48 PM, Robert Haas  wrote:

> On Wed, May 2, 2012 at 9:35 AM, Alexander Korotkov 
> wrote:
>  > Imagine we've two queries:
> > 1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
> > 2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';
> >
> > The first query require reading posting lists of trigrams "abc" and
> "bcd".
> > The second query require reading posting lists of trigrams "abc", "bcd",
> > "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
> > We could decide to use index scan for first query and sequential scan for
> > second query because number of posting list to read is high. But it is
> > unreasonable because actually second query is narrower than the first
> one.
> > We can use same index scan for it, recheck will remove all false
> positives.
> > When number of trigrams is high we can just exclude some of them from
> index
> > scan. It would be better than just decide to do sequential scan. But the
> > question is what trigrams to exclude? Ideally we would leave most rare
> > trigrams to make index scan cheaper.
>
> True.  I guess I was thinking more of the case where you've got
> abc|def|ghi|jkl|mno|pqr|stu|vwx|yza|  There's probably some point
> at which it becomes silly to think about using the index.


Yes, such situations are also possible.

 >> Well, I'm not an expert on encodings, but it seems like a logical
> >> extension of what we're doing right now, so I don't really see why
> >> not.  I'm confused by the diff hunks in pg_mule2wchar_with_len,
> >> though.  Presumably either the old code is right (in which case, don't
> >> change it) or the new code is right (in which case, there's a bug fix
> >> needed here that ought to be discussed and committed separately from
> >> the rest of the patch).  Maybe I am missing something.
> >
> > Unfortunately I didn't understand original logic
> of pg_mule2wchar_with_len.
> > I just did proposal about how it could be. I hope somebody more familiar
> > with this code would clarify this situation.
>
> Well, do you think the current code is buggy, or not?


Probably, but I'm not sure. The conversion seems lossy to me unless I'm
missing something about mule encoding.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Patch: add conversion from pg_wchar to multibyte

2012-05-02 Thread Robert Haas
On Wed, May 2, 2012 at 9:35 AM, Alexander Korotkov  wrote:
>> I was thinking you could perhaps do it just based on the *number* of
>> trigrams, not necessarily their frequency.
>
> Imagine we've two queries:
> 1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
> 2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';
>
> The first query require reading posting lists of trigrams "abc" and "bcd".
> The second query require reading posting lists of trigrams "abc", "bcd",
> "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
> We could decide to use index scan for first query and sequential scan for
> second query because number of posting list to read is high. But it is
> unreasonable because actually second query is narrower than the first one.
> We can use same index scan for it, recheck will remove all false positives.
> When number of trigrams is high we can just exclude some of them from index
> scan. It would be better than just decide to do sequential scan. But the
> question is what trigrams to exclude? Ideally we would leave most rare
> trigrams to make index scan cheaper.

True.  I guess I was thinking more of the case where you've got
abc|def|ghi|jkl|mno|pqr|stu|vwx|yza|  There's probably some point
at which it becomes silly to think about using the index.

>> > Probably you have some comments on idea of conversion from pg_wchar to
>> > multibyte? Is it acceptable at all?
>>
>> Well, I'm not an expert on encodings, but it seems like a logical
>> extension of what we're doing right now, so I don't really see why
>> not.  I'm confused by the diff hunks in pg_mule2wchar_with_len,
>> though.  Presumably either the old code is right (in which case, don't
>> change it) or the new code is right (in which case, there's a bug fix
>> needed here that ought to be discussed and committed separately from
>> the rest of the patch).  Maybe I am missing something.
>
> Unfortunately I didn't understand original logic of pg_mule2wchar_with_len.
> I just did proposal about how it could be. I hope somebody more familiar
> with this code would clarify this situation.

Well, do you think the current code is buggy, or not?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Have we out-grown Flex?

2012-05-02 Thread Peter Geoghegan
On 2 May 2012 04:57, Tom Lane  wrote:
> FWIW, I think only developers not packagers would really be taking such
> a hit.  I assume we'd continue to ship prebuilt lexer output in
> tarballs, so there'd seldom be a reason for a packager to need to run
> the tool.  Given the extremely slow rate of churn of the lexer, it might
> not be necessary for most developers to have the tool installed, either,
> if we were willing to put the derived file into git.

Incidentally, I had an unrelated conversation with someone (I think it
might have been Heikki) a while back, where it was suggested that Flex
and Bison could be run through web services. This might actually make
hacking Postgres on windows far easier, because the last time I tried
to do that the hard way, I gave up, suspecting that there must be some
kind of Winflex bug that selectively manifests itself - certainly, the
population of windows hackers is small enough that it could go
unnoticed for quite a while. Such an approach could be part of the
solution to this problem (although, incidentally, Quex maintains
visual studio support quite well, and even has graphical instructions
here: http://quex.sourceforge.net/doc/html/intro/visual_studio_trouble.html
).

It might be the case that some kind of virtualisation and/or
authentication (Postgres community account required) could make this
approach practical. It just isn't the path of least resistance right
now. Visual studio builds would be far easier if we did this, which
might encourage more hackers to venture into Windows land.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: add conversion from pg_wchar to multibyte

2012-05-02 Thread Alexander Korotkov
On Wed, May 2, 2012 at 4:50 PM, Robert Haas  wrote:

> On Tue, May 1, 2012 at 6:02 PM, Alexander Korotkov 
> wrote:
> > Right. When number of trigrams is big, it is slow to scan posting list of
> > all of them. The solution is this case is to exclude most frequent
> trigrams
> > from index scan.  But, it require some kind of statistics of trigrams
> > frequencies which we don't have. We could estimate frequencies using some
> > hard-coded assumptions about natural languages. Or we could exclude
> > arbitrary trigrams. But I don't like both these ideas. This problem is
> also
> > relevant for LIKE/ILIKE search using trigram indexes.
>
> I was thinking you could perhaps do it just based on the *number* of
> trigrams, not necessarily their frequency.
>

Imagine we have two queries:
1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';

The first query requires reading the posting lists of trigrams "abc" and "bcd".
The second query requires reading the posting lists of trigrams "abc", "bcd",
"cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
We could decide to use an index scan for the first query and a sequential scan
for the second query because the number of posting lists to read is high. But
that is unreasonable, because the second query is actually narrower than the
first one. We can use the same index scan for it; the recheck will remove all
false positives. When the number of trigrams is high, we can just exclude some
of them from the index scan. That would be better than simply deciding to do a
sequential scan. But the question is which trigrams to exclude? Ideally we
would keep the rarest trigrams, to make the index scan cheaper.
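
Purely to illustrate the heuristic, here is a rough C sketch of "keep only
the rarest trigrams", assuming we somehow had per-trigram frequency
estimates (which, as said, we currently don't; all names are hypothetical):

#include <stdlib.h>

typedef struct
{
    const char *trgm;     /* the trigram, just a pointer for brevity */
    double      freq;     /* estimated fraction of rows containing it */
} TrgmFreq;

static int
cmp_by_freq(const void *a, const void *b)
{
    double fa = ((const TrgmFreq *) a)->freq;
    double fb = ((const TrgmFreq *) b)->freq;

    return (fa > fb) - (fa < fb);
}

/* Sort ascending by frequency and keep at most max_trgms of the rarest. */
static int
choose_index_trigrams(TrgmFreq *trgms, int ntrgms, int max_trgms)
{
    qsort(trgms, ntrgms, sizeof(TrgmFreq), cmp_by_freq);
    return (ntrgms < max_trgms) ? ntrgms : max_trgms;
}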


> > Probably you have some comments on idea of conversion from pg_wchar to
> > multibyte? Is it acceptable at all?
>
> Well, I'm not an expert on encodings, but it seems like a logical
> extension of what we're doing right now, so I don't really see why
> not.  I'm confused by the diff hunks in pg_mule2wchar_with_len,
> though.  Presumably either the old code is right (in which case, don't
> change it) or the new code is right (in which case, there's a bug fix
> needed here that ought to be discussed and committed separately from
> the rest of the patch).  Maybe I am missing something.


Unfortunately I didn't understand the original logic of pg_mule2wchar_with_len.
I just made a proposal for how it could work. I hope somebody more familiar
with this code can clarify the situation.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] proposal: additional error fields

2012-05-02 Thread Tom Lane
Peter Eisentraut  writes:
> On tis, 2012-05-01 at 20:13 -0400, Tom Lane wrote:
>> I don't deny that we probably need to reclassify a few error cases,
>> and fix some elogs that should be ereports, before this approach would
>> be really workable.  My point is that it's *close*, whereas "let's
>> invent some new error severities" is not close to reality and will
>> break all sorts of stuff.

> We might hit a road block because some of these sqlstates are defined by
> the SQL standard.

My guess is that all the ones defined in the SQL standard are "expected"
errors, more or less by definition, and thus not interesting according
to Peter G's criteria.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: add conversion from pg_wchar to multibyte

2012-05-02 Thread Robert Haas
On Tue, May 1, 2012 at 6:02 PM, Alexander Korotkov  wrote:
> Right. When number of trigrams is big, it is slow to scan posting list of
> all of them. The solution is this case is to exclude most frequent trigrams
> from index scan.  But, it require some kind of statistics of trigrams
> frequencies which we don't have. We could estimate frequencies using some
> hard-coded assumptions about natural languages. Or we could exclude
> arbitrary trigrams. But I don't like both these ideas. This problem is also
> relevant for LIKE/ILIKE search using trigram indexes.

I was thinking you could perhaps do it just based on the *number* of
trigrams, not necessarily their frequency.

> Probably you have some comments on idea of conversion from pg_wchar to
> multibyte? Is it acceptable at all?

Well, I'm not an expert on encodings, but it seems like a logical
extension of what we're doing right now, so I don't really see why
not.  I'm confused by the diff hunks in pg_mule2wchar_with_len,
though.  Presumably either the old code is right (in which case, don't
change it) or the new code is right (in which case, there's a bug fix
needed here that ought to be discussed and committed separately from
the rest of the patch).  Maybe I am missing something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] index-only scans vs. Hot Standby, round two

2012-05-02 Thread Robert Haas
On Thu, Apr 26, 2012 at 8:03 PM, Robert Haas  wrote:
> So, as a first step, I've committed a patch that just throws a hard
> conflict.  I think we probably want to optimize this further, and I'm
> going to work investigate that next.  But it seemed productive to get
> this much out of the way first, so I did.

I've been thinking about this some more.  What's worrying me is that a
visibility conflict, however we implement it, could be *worse* from
the user's point of view than just killing the query.  After all,
there's a reasonable likelihood that a single visibility map page
covers the whole relation (or all the blocks that the user is
interested in), so any sort of conflict is basically going to turn the
index-only scan into an index-scan plus some extra overhead.  And if
the planner had known that the user was going to get an index-only
scan rather than just a plain index scan, it might well have picked
some other plan in the first place.

Another problem is that, if we add a test for visibility conflicts
into visibilitymap_test(), I'm afraid we're going to drive up the cost
of that function very significantly.  Previous testing suggests that
the efficiency, or lack thereof, of that function is already a
performance problem for index-only scans, which kinda makes me not
that excited about adding another branch in there somewhere (and even
less excited about any proposed implementation that would add an
lwlock acquire/release or similar).

So on further reflection I'm thinking it may be best just to stick
with a hard conflict for now and see what feedback we get from beta
testers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extending relations more efficiently

2012-05-02 Thread Robert Haas
On Wed, May 2, 2012 at 7:16 AM, Jeroen Vermeulen  wrote:
> On 2012-05-01 22:06, Robert Haas wrote:
>> It might also be interesting to provide a mechanism to pre-extend a
>> relation to a certain number of blocks, though if we did that we'd
>> have to make sure that autovac got the memo not to truncate those
>> pages away again.
>
> Good point.  And just to check before skipping over it, do we know that
> autovacuum not leaving enough slack space is not a significant cause of the
> bottlenecks in the first place?

I'm not sure exactly what you mean by this: autovacuum doesn't need
any slack space.  Regular DML operations can certainly benefit from
slack space, both within each page and overall within the relation.
But there's no evidence that vacuum is doing too good a job cleaning
up the mess, forcing the relation to be re-extended.  Rather, the
usual (and frequent) complaint is that vacuum is leaving way too much
slack space - i.e. bloat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Have we out-grown Flex?

2012-05-02 Thread Peter Geoghegan
On 2 May 2012 04:24, Robert Haas  wrote:
> I think Tom's question of whether the parser or lexer is the problem
> is something that ought to be investigated.  Personally, I suspect
> that our tendency to use lists everywhere, for everything, an artifact
> of our proud LISP heritage, may be costing us dearly in terms of parse
> time.  However, there's a difference between suspicion and proof, and
> I certainly have none of the latter.

It's funny that you should say that, because I actually had a
discussion with Greg Stark over drinks about this recently. John
Bentley has described an experiment that undermines many traditional
beliefs about the trade-offs represented by using a linked list rather
than an array. The test is, using a modern computer, generate N
integers at random and insert them in order into a sequence. Then,
randomly remove integers from the sequence, one at a time. What is the
performance at different sizes of N when the sequence is a
doubly-linked list, and when it is an array? If you graph the two, the
results are perhaps rather surprising. I think his graph went up to
100,000. The initial result shows a line representing an array down
near the bottom of the graph. The list line looks exponential. Even if
you use memory pooling so the list doesn't have to allocate memory as
needed, the array still roundly beats the list, albeit not quite so
convincingly and without the list hitting the roof near 100,000. I
don't think that I need to point out that this is for inserting and
deleting, and that's when you're supposed to use lists.

The point is that on modern architectures, with many layers of cache,
the cost of the linear search to get the insertion point completely
dominates - this is without the array availing of a binary search, in
the interest of fairness. CPU caches happen to do a very good job of
moving over on-average half of the array for inserting elements at
random points. The list is much larger than the array, with the two
extra pointers per element (yeah, I know that we use singly linked
lists, but we have other disadvantages compared to typical C lists),
which matters. Predictable usage patterns - that is, temporal and
spatial locality, resulting in good usage of the memory hierarchy
matters a lot. We're not talking about a small difference, either. I
think the difference in the published test was something like the
array was 50 - 100 times faster. The list results in far more cache
misses than the array. So, I'm right there with you - using lists
everywhere is bad news.
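
To make the experiment concrete, here is roughly what the two insertion
paths look like in C (my own sketch of the setup, not the published
benchmark code; removal-at-random is omitted for brevity):

#include <stdlib.h>
#include <string.h>

/* Array version: linear search for the slot, then memmove the tail down.
 * The caller must ensure the array has room for one more element. */
static void
array_insert_sorted(int *arr, int *len, int value)
{
    int i = 0;

    while (i < *len && arr[i] < value)
        i++;
    memmove(&arr[i + 1], &arr[i], (*len - i) * sizeof(int));
    arr[i] = value;
    (*len)++;
}

/* Doubly-linked list version: same linear search, pointer splice instead. */
typedef struct Node { int value; struct Node *prev, *next; } Node;

static Node *
list_insert_sorted(Node *head, int value)
{
    Node *n = malloc(sizeof(Node));
    Node *cur = head, *prev = NULL;

    n->value = value;
    while (cur != NULL && cur->value < value)
    {
        prev = cur;
        cur = cur->next;
    }
    n->prev = prev;
    n->next = cur;
    if (prev)
        prev->next = n;
    if (cur)
        cur->prev = n;
    return prev ? head : n;    /* new head if inserted at the front */
}

Both do the same linear scan for the insertion point; the array pays an
extra memmove() and still wins, because the scan itself is where the cache
behaviour dominates.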

As for the question of Flex/Quex, I'm not in an immediate position to
sink any more time into it, but it remains on my list of things to
pursue for 9.3, though it's only about number 3 on that list right
now.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] How hard would it be to support LIKE in return declaration of generic record function calls ?

2012-05-02 Thread Hannu Krosing
Hi Hackers

How hard would it be to add support for LIKE syntax, similar to what a table
definition allows, in the field-list declaration for generic record functions?

What I'd like to be able to do is to have a generic json_to_record
function

CREATE OR REPLACE FUNCTION json_to_record(json) RETURNS RECORD AS $$
...
$$ LANGUAGE ... ;

and then be able to call it like this

insert into test2 
select * from json_to_record(jrec json) as (like test2);

ERROR:  syntax error at or near "like"

instead of explicitly spelling out the structure of table test2 in the
AS part.

insert into test2 
select * from json_to_record(jrec json)
 as (id int, data2 test, tstamp timestamp);
INSERT 0 1


PS.

As a pie-in-the-sky wish I'd prefer of course even simpler syntax of

insert into test2 json_to_record(jrec json);

or at least  

insert into test2 json_to_record(jrec json)::test2;

:)
- 
---
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] extending relations more efficiently

2012-05-02 Thread Jeroen Vermeulen

On 2012-05-01 22:06, Robert Haas wrote:


It might also be interesting to provide a mechanism to pre-extend a
relation to a certain number of blocks, though if we did that we'd
have to make sure that autovac got the memo not to truncate those
pages away again.


Good point.  And just to check before skipping over it, do we know that 
autovacuum not leaving enough slack space is not a significant cause of 
the bottlenecks in the first place?



Jeroen

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Analyzing foreign tables & memory problems

2012-05-02 Thread Albe Laurenz
I wrote:
> Noah Misch wrote:
>>> During ANALYZE, in analyze.c, functions compute_minimal_stats
>>> and compute_scalar_stats, values whose length exceed
>>> WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
>>> other than that they are counted as "too wide rows" and assumed
>>> to be all different.
>>>
>>> This works fine with regular tables;

>>> With foreign tables the situation is different.  Even though
>>> values exceeding WIDTH_THRESHOLD won't get used, the complete
>>> rows will be fetched from the foreign table.  This can easily
>>> exhaust maintenance_work_mem.

>>> I can think of two remedies:
>>> 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
>>>so that the authors of foreign data wrappers are aware of the
>>>problem and can avoid it on their side.
>>>This would be quite simple.

>> Seems reasonable.  How would the FDW return an indication that a
value was
>> non-NULL but removed due to excess width?
> 
> The FDW would return a value of length WIDTH_THRESHOLD+1 that is
> long enough to be recognized as too long, but not long enough to
> cause a problem.

Here is a simple patch for that.
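
To illustrate the FDW side of the idea (only a sketch with a made-up helper,
not part of the attached patch; it assumes WIDTH_THRESHOLD has been made
visible as proposed, and a real implementation would also have to truncate
at a character boundary):

static Datum
fdw_sample_text_value(const char *remote_value, int remote_len)
{
    /*
     * analyze.c ignores values wider than WIDTH_THRESHOLD, so there is no
     * point in keeping more than WIDTH_THRESHOLD + 1 bytes: that length is
     * still recognized as "too wide" but cannot exhaust
     * maintenance_work_mem.
     */
    if (remote_len > WIDTH_THRESHOLD)
        remote_len = WIDTH_THRESHOLD + 1;

    return PointerGetDatum(cstring_to_text_with_len(remote_value, remote_len));
}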

Yours,
Laurenz Albe


analyze.patch
Description: analyze.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON in 9.2 - Could we have just one to_json() function instead of two separate versions ?

2012-05-02 Thread Hannu Krosing
On Tue, 2012-05-01 at 19:11 -0400, Tom Lane wrote:
> Andrew Dunstan  writes:
> > On Tue, May 1, 2012 at 9:56 AM, Joey Adams 
> > wrote:
> >> No, the RFC says (emphasis mine):
> >> 
> >> A JSON *text* is a serialized object or array.
> >> 
> >> If we let the JSON type correspond to a *value* instead, this
> >> restriction does not apply, and the JSON type has a useful recursive
> >> definition.
> 
> > I think you're playing with words. But in any case, the RFC says this
> > regarding generators:
> > 5. Generators
> >A JSON generator produces JSON text.  The resulting text MUST
> >strictly conform to the JSON grammar.
> 
> I read over the RFC, and I think the only reason why they restricted
> JSON texts to represent just a subset of JSON values is this cute
> little hack in section 3 (Encoding):
> 
>Since the first two characters of a JSON text will always be ASCII
>characters [RFC0020], it is possible to determine whether an octet
>stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
>at the pattern of nulls in the first four octets.
>00 00 00 xx  UTF-32BE
>00 xx 00 xx  UTF-16BE
>xx 00 00 00  UTF-32LE
>xx 00 xx 00  UTF-16LE
>xx xx xx xx  UTF-8
> 
> They need a guaranteed 2 ASCII characters to make that work, and
> they won't necessarily get that many with a bare string literal.
> 
> Since for our purposes there is not, and never will be, any need to
> figure out whether a JSON input string is encoded in UTF16 or UTF32,
> I find myself agreeing with the camp that says we might as well consider
> that our JSON type corresponds to JSON values not JSON texts.  I also
> notice that json_in() seems to believe that already.
> 
> However, that doesn't mean I'm sold on the idea of getting rid of
> array_to_json and row_to_json in favor of a universal "to_json()"
> function.  In particular, both of those have optional "pretty_bool"
> arguments that don't fit nicely at all in a generic conversion
> function.  The meaning of that flag is very closely tied to the
> input being an array or record respectively.

The flags probably should not be tied to a specific type: JSON is
recursive, and as such I think the current one-top-level-element-per-row
output is quite a limited form of pretty-printing.

I have a table with a field whose type is an array of another table's
row type, and what I currently get with pretty=true is

hannu=# select row_to_json(test3, true) from test3;
-[ RECORD 1 ]---------------------------------------------------------------
row_to_json | {"id":1,
            |  "data3":[{"id":1,"data2":{"id":1,"data":"0.262814193032682","tstamp":"2012-04-05 13:21:03.235204"},"tstamp":"2012-04-05 13:25:03.644497"},{"id":2,"data2":{"id":2,"data":"0.157406373415142","tstamp":"2012-04-05 13:21:05.2033"},"tstamp":"2012-04-05 13:25:03.644497"}],
            |  "tstamp":"2012-04-16 14:40:15.795947"}

What I would like to get is what python's pprint does for the same json:

>>> pprint(row)
{'id': 1,
 'data3': [{'data2': {'data': '0.262814193032682',
                      'id': 1,
                      'tstamp': '2012-04-05 13:21:03.235204'},
            'id': 1,
            'tstamp': '2012-04-05 13:25:03.644497'},
           {'data2': {'data': '0.157406373415142',
                      'id': 2,
                      'tstamp': '2012-04-05 13:21:05.2033'},
            'id': 2,
            'tstamp': '2012-04-05 13:25:03.644497'}],
 'tstamp': '2012-04-16 14:40:15.795947'}

If we have a pretty flag, why not make it work all the way down the
structure?

> I'm inclined to leave these functions as they are, and consider
> adding a universal "to_json(anyelement)" (with no options) later.

To achieve recursive pretty-printing, the better way is to have a
universal to_json(anyelement) with a pretty-printing option

to_json(datum anyelement, indent int)

with the behavior that if indent is NULL or a negative integer, no
pretty-printing is done; if it is 0, printing starts at the left margin;
and if it is a positive integer, that number of spaces is added on the
left for each row (except the first one) of the json representation.

And it would be overridable for specific types, so that hstore could
provide its own

to_json(datum hstore, indent int)

which would do the correct pretty-printing for the hstore-as-json_object
representation.

> Because it would not have options, it would not be meant to cover
> cases where there's value in formatting or conversion options;
> so it wouldn't render the existing functions entirely obsolete,
> nor would it mean there would be no need for other specialized
> conversion functions.

I don't object to row_to_json() and array_to_json() functions being
there as a convenience and as the two "official" functions guara

Re: [HACKERS] JSON in 9.2 - Could we have just one to_json() function instead of two separate versions ?

2012-05-02 Thread Hannu Krosing
On Tue, 2012-05-01 at 21:22 -0400, David Johnston wrote:
> On May 1, 2012, at 20:41, Hannu Krosing  wrote:
> > 
> > Most people don't work in strongly-typed environment, and thus would
> > work around such restriction if they need a simple JSON value at the
> > other end of the interchange.
> > 
> > 
> >> My personal take it is have it fail since any arbitrary decision to cast 
> >> to JSON Text

For arrays and records the JSON text and JSON value are exactly the
same. It is just that JSON representations of simple types are
officially not JSON texts.

>  is going to make someone unhappy and supposedly they can 
> >> modify their query so that the result generates whatever format they 
> >> desire.
> > 
> > Do you actually have such an experience or is it just a wild guess ?
> > 
> > 
> 
> So even given the semantic differences between an object and a scalar 
> I am better understanding where interpreting JSON as JSON Value makes 
> sense.  However, if I convert a record or array to JSON I expect to get 
> a JSON Text even if the there is only a single column or value in the input.  

Of course you will, and you will get a JSON text even for an empty object
or array.

array[1] and 1 and {'one':1} are all different and will stay that way.

> I guess my take is that record -> JSON text while anything else is JSON 
> value.  Whether it is worth maiming the special case for record is 
> worthwhile I really do not know but the semantic difference does exist; 
> and record output is a significant aspect of PostgreSQL output.

I have never suggested that we special-case a 1-element record or list
and start returning only the contained value for these.

> I get the ease-of-use aspect but also recognize that sometimes being slightly 
> harder to use is worthwhile if you eliminate ambiguities or limit the 
> possibility to make mistakes.

There are no ambiguities in what is returned for a record or array.

But not being able to return JSON values via a cast to json for some types,
or not using such casts, will make it much, much harder for users to extend
json support to new types. And nonstandard.

Using a simple cast to json is a very PostgreSQL-ish way to add json
support for any type.



-- 
---
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: additional error fields

2012-05-02 Thread Peter Eisentraut
On tis, 2012-05-01 at 20:13 -0400, Tom Lane wrote:
> I don't deny that we probably need to reclassify a few error cases,
> and fix some elogs that should be ereports, before this approach would
> be really workable.  My point is that it's *close*, whereas "let's
> invent some new error severities" is not close to reality and will
> break all sorts of stuff.

We might hit a road block because some of these sqlstates are defined by
the SQL standard.  But at least we should try this first, and if it
doesn't work make another field that contains the admin/server-level
severity instead of the client-side/flow-control severity level.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers