Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread Jeff Davis
On Thu, 2011-06-30 at 12:28 +0200, Florian Pflug wrote:
> Well, arrays are containers, and we need two values to construct a range,

What about empty ranges? What about infinite ranges?

It seems quite a bit more awkward to shoehorn ranges into an array than
to use a real type (even if it's intermediate and otherwise useless).

> Hm, I guess. I'm still no huge fan of RANGEINPUT, but if we prevent
> it from being used as a column type and from being used as an argument
> type, then I guess it's workable...
> 
> Btw, what happened to the idea of making RANGE(...) a special syntactic
> construct instead of a normal function call? Did we discard that for its
> intrusiveness, or were there other reasons?

It has not been discarded; as far as I'm concerned it's still on the
table. The main advantage is that it doesn't require an intermediate
type, and that requiring a cast (or some specification of the range
type) might be a little more natural. The downside is that, well, it's
new syntax, and there's a little inertia there.

But if it's actually better, we should do it. If an intermediate type
seems to be problematic, or if people think it's strange to require
casting, then I think this is reasonable.
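
For illustration, a sketch of how that might read (the RANGE(...) construct
and the int4range type name are hypothetical at this point):

    -- a cast (or other context) pins down which range type is meant:
    SELECT RANGE(1, 10)::int4range;
    -- with no cast and more than one range type over integer, the
    -- construct would have nothing to resolve against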

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread Jeff Davis
On Thu, 2011-06-30 at 09:59 -0700, David E. Wheeler wrote:
> On Jun 30, 2011, at 9:34 AM, Jeff Davis wrote:
> 
> > Then how do you get a text range that doesn't correspond to the
> > LC_COLLATE setting?
> 
> You cast it.

My original solution was something like this, except involving domains.
With a sufficient amount of casting of all arguments to anything
involving a range type, it works, but it's a little too un-SQL-like.
There was at least one fairly strong objection to my approach, but if
you have some further thoughts along that line, I'm open to suggestion.

Also, what if the LC_COLLATE is C, and you want to cast it to en_US
collation?
  range('a','Z')
would be invalid in the C locale, and it would fail before you had a
chance to cast it.
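
For comparison, a sketch of how per-collation range types sidestep the
ordering problem (the syntax is close to what eventually shipped as CREATE
TYPE ... AS RANGE in 9.2; the type names are invented and the collations
are assumed to exist):

    CREATE TYPE textrange_c  AS RANGE (SUBTYPE = text, COLLATION = "C");
    CREATE TYPE textrange_en AS RANGE (SUBTYPE = text, COLLATION = "en_US");

    -- each range type carries its own ordering, so bounds are checked
    -- against the right collation from the start:
    SELECT textrange_en('a', 'Z');  -- valid: 'a' sorts before 'Z' in en_US
    SELECT textrange_c('a', 'Z');   -- error: lower bound must be <= upper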

> Cast where you need it explicit, and have a reasonable default when
> it's not cast.

I thought about that, too, but it's not ideal, either. That means that
something might start out as the only range type for a given subtype,
and so doesn't need explicit casts. Then you define another range type
over that subtype, and all the original queries break because they are
now ambiguous.
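
A minimal sketch of that failure mode, assuming a generic range()
constructor shared across range types:

    -- with a single range type over integer, this resolves fine:
    SELECT range(1, 10);
    -- once a second integer-based range type exists, the same call
    -- fails with something like:
    -- ERROR:  function range(integer, integer) is not unique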

I think the fundamental differences with range types that we're dealing
with are:
 1. multiple range types might reasonably exist for a single subtype
 2. the order is a fundamental part of the type definition, not just an
extra argument useful for operations on the range type

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [ADMIN] PANIC while doing failover (streaming replication)

2011-06-30 Thread Fujii Masao
On Fri, Jul 1, 2011 at 2:18 PM, Mikko Partio  wrote:
> Hello list
> I have a two-machine cluster with PostgreSQL 9.0.4 and streaming
> replication. In a normal situation I did a failover -- touched the trigger
> file on the standby to promote it to production mode. I have done this
> previously without any complications, but this time the to-be-production
> database had a PANIC and shut itself down. From the logs:
> postgres[10751]: [2-1] 2011-06-30 17:25:24 EEST [10751]: [1-1] user=,db=
> LOG:  streaming replication successfully connected to primary
> postgres[10736]: [10-1] 2011-07-01 07:50:29 EEST [10736]: [10-1] user=,db=
> LOG:  trigger file found: /postgresql/data/finish_replication.trigger
> postgres[10751]: [3-1] 2011-07-01 07:50:29 EEST [10751]: [2-1] user=,db=
> FATAL:  terminating walreceiver process due to administrator command
> postgres[10736]: [11-1] 2011-07-01 07:50:30 EEST [10736]: [11-1] user=,db=
> LOG:  redo done at AE/8B1855C8
> postgres[10736]: [12-1] 2011-07-01 07:50:30 EEST [10736]: [12-1] user=,db=
> LOG:  last completed transaction was at log time 2011-07-01
> 07:50:14.67424+03
> postgres[10736]: [13-1] 2011-07-01 07:50:30 EEST [10736]: [13-1] user=,db=
> LOG:  restored log file "000600AE008B" from archive
> postgres[10736]: [14-1] 2011-07-01 07:50:30 EEST [10736]: [14-1] user=,db=
> LOG:  selected new timeline ID: 7
> postgres[10736]: [15-1] 2011-07-01 07:50:30 EEST [10736]: [15-1] user=,db=
> LOG:  restored log file "0006.history" from archive
> postgres[10736]: [16-1] 2011-07-01 07:50:30 EEST [10736]: [16-1] user=,db=
> LOG:  archive recovery complete
> postgres[10736]: [17-1] 2011-07-01 07:50:30 EEST [10736]: [17-1] user=,db=
> WARNING:  page 68975 of relation base/3711580/4398155 was uninitialized
> postgres[10736]: [18-1] 2011-07-01 07:50:30 EEST [10736]: [18-1] user=,db=
> WARNING:  page 782 of relation base/3711580/4395394 was uninitialized
> postgres[10736]: [19-1] 2011-07-01 07:50:30 EEST [10736]: [19-1] user=,db=
> WARNING:  page 68948 of relation base/3711580/4398155 was uninitialized
> postgres[10736]: [20-1] 2011-07-01 07:50:30 EEST [10736]: [20-1] user=,db=
> WARNING:  page 68986 of relation base/3711580/4398155 was uninitialized
> [ there are maybe 50 lines of these WARNING messages, for altogether six
> different relations, all indexes ]
> postgres[10736]: [73-1] 2011-07-01 07:50:31 EEST [10736]: [73-1] user=,db=
> PANIC:  WAL contains references to invalid pages
> postgres[10727]: [2-1] 2011-07-01 07:50:31 EEST [10727]: [2-1] user=,db=
> LOG:  startup process (PID 10736) was terminated by signal 6: Aborted
> postgres[10727]: [3-1] 2011-07-01 07:50:31 EEST [10727]: [3-1] user=,db=
> LOG:  terminating any other active server processes
> Then I started postgres again, since it had shut down:
> postgres[12191]: [1-1] 2011-07-01 07:50:59 EEST [12191]: [1-1] user=,db=
> LOG:  database system was interrupted while in recovery at log time
> 2011-07-01 07:44:02 EEST
> postgres[12191]: [1-2] 2011-07-01 07:50:59 EEST [12191]: [2-1] user=,db=
> HINT:  If this has occurred more than once some data might be corrupted and
> you might need to choose an earlier recovery target.
> postgres[12191]: [2-1] 2011-07-01 07:50:59 EEST [12191]: [3-1] user=,db=
> LOG:  database system was not properly shut down; automatic recovery in
> progress
> postgres[12191]: [3-1] 2011-07-01 07:50:59 EEST [12191]: [4-1] user=,db=
> LOG:  redo starts at AE/8B13E398
> postgres[12191]: [4-1] 2011-07-01 07:50:59 EEST [12191]: [5-1] user=,db=
> LOG:  consistent recovery state reached at AE/8C00
> postgres[12191]: [5-1] 2011-07-01 07:50:59 EEST [12191]: [6-1] user=,db=
> LOG:  unexpected pageaddr AE/6E00 in log file 174, segment 140, offset 0
> postgres[12191]: [6-1] 2011-07-01 07:50:59 EEST [12191]: [7-1] user=,db=
> LOG:  redo done at AE/8B1855C8
> postgres[12191]: [7-1] 2011-07-01 07:50:59 EEST [12191]: [8-1] user=,db=
> LOG:  last completed transaction was at log time 2011-07-01
> 07:50:14.67424+03
> postgres[12194]: [1-1] 2011-07-01 07:50:59 EEST [12194]: [1-1] user=,db=
> LOG:  autovacuum launcher started
> postgres[12189]: [1-1] 2011-07-01 07:50:59 EEST [12189]: [1-1] user=,db=
> LOG:  database system is ready to accept connections
> Are these log messages something to worry about?

http://archives.postgresql.org/pgsql-hackers/2008-12/msg01335.php
According to the above thread, this issue seems to have existed for a few
years and remains unsolved. Could you provide a self-contained test case?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] add support for logging current role (what to review?)

2011-06-30 Thread Magnus Hagander
On Fri, Jul 1, 2011 at 03:36, Alvaro Herrera  wrote:
> Excerpts from Stephen Frost's message of Thu Jun 30 18:35:40 -0400 2011:
>
>> > I know of no such list, and I think this field
>> > useful/important enough that people who are using csv logging would
>> > want it anyway. +1 on just tacking on the field and causing a flag day
>> > for csv users.
>>
>> Honestly, I think it was *me* who raised the issue that we don't have a
>> header for CSV logs and that it sucks for people using CSV files.  We've
>> changed it in the past (application_name was added, iirc) and there
>> wasn't much noise about it that I recall.  If everyone's happy with that,
>> it's fine by me.
>
> I don't understand why it is such a big deal that 9.1 has a different CSV
> table definition than 9.0 anyway (or any two releases, for that matter).  As
> long as both are clearly and correctly documented in the respective
> pages, it shouldn't be an issue at all.  If anyone attempts to load CSV
> log files into the wrong table definition, the problem should make
> itself evident pretty quickly.

I don't see that as a big problem either. A file header would make it
easier for tools to detect, though - either a regular CSV header or
just always logging a row with the version number in the first line
of each file. Making it an actual CSV header has the added benefit of
being a standard mechanism that non-pg-specific tools know how to deal
with.

If it's a pg-specific tool, it's likely going to require
version-specific information anyway; this is just one more piece of it.
Now, if the order and contents of the fields are made entirely
configurable, that basically creates an unlimited number of
permutations, and *that* may make things really hard on tool
developers. Without a header, that makes the tool developers' work
completely impossible - and it's still hard even with one.
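
For context, a minimal sketch of the documented loading recipe this affects
(the file path is invented; postgres_log is the table definition given in
the docs for the matching server version):

    COPY postgres_log FROM '/var/log/pg_log/postgresql-2011-06-30.csv' WITH csv;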


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread Jeff Davis
On Thu, 2011-06-30 at 09:58 -0700, David E. Wheeler wrote:
> On Jun 30, 2011, at 9:29 AM, Jeff Davis wrote:
> 
> > Right. In that respect, it's more like a record type: many possible
> > record types exist, but you only define the ones you want.
> 
> Well, okay. How is this same problem handled for RECORD types, then?

What problem, exactly? For a given list of subtypes, there is only one
valid record type.
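
For illustration, every composite value of a given shape resolves to the
single generic record type (pg_typeof has been available since 8.4):

    SELECT pg_typeof(ROW(1, 'a'::text));  -- record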

Also, record is not a great example. The implementation uses at least
one pretty horrible hack.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Online base backup from the hot-standby

2011-06-30 Thread Jun Ishiduka

> On this commitfest, the goal of the patch is to make it possible to
> recover using the "Minimum recovery ending location" in the control file.

Done.

Regards.


Jun Ishizuka
NTT Software Corporation
TEL:045-317-7018
E-Mail: ishizuka@po.ntts.co.jp



standby_online_backup_02.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Jaime Casanova
Fujii Masao  writes:

> On Fri, Jul 1, 2011 at 2:25 AM, Robert Haas  wrote:
>>> Some actions aren't even transactional, such as DROP DATABASE, amongst
>>
>> Good point.  We'd probably need to add a timestamp to the drop
>> database record, as that's a case that people would likely want to
>> defend against with this feature.
>
> This means that recovery_target_* code would also need to deal with
> DROP DATABASE case.
>

There is no problem if you use "restore point" names... but of course
you lose flexibility (i.e., you can't restore to 5 minutes before now).

Hmm... a lazy idea: can't we just create a restore-point WAL record
*before* we actually drop the database? Then we won't need to modify the
recovery_target_* logic (if DROP DATABASE is the only such case, maybe
that's enough to avoid complicating the code), and we could provide that
protection as of 9.1.
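
A minimal sketch of that protection done by hand with the 9.1 facilities
(the database and restore-point names are invented):

    -- on the master, before the dangerous action:
    SELECT pg_create_restore_point('before_drop_mydb');
    DROP DATABASE mydb;

    -- later, in recovery.conf on the server doing point-in-time recovery:
    -- recovery_target_name = 'before_drop_mydb'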

-- 
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL 
24x7 support, development, training and services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] add support for logging current role (what to review?)

2011-06-30 Thread Alvaro Herrera
Excerpts from Stephen Frost's message of Thu Jun 30 18:35:40 -0400 2011:

> > I know of no such list, and I think this field
> > useful/important enough that people who are using csv logging would
> > want it anyway. +1 on just tacking on the field and causing a flag day
> > for csv users.
> 
> Honestly, I think it was *me* who raised the issue that we don't have a
> header for CSV logs and that it sucks for people using CSV files.  We've
> changed it in the past (application_name was added, iirc) and there
> wasn't much noise about it that I recall.  If everyone's happy with that,
> it's fine by me.

I don't understand why it is such a big deal that 9.1 has a different CSV
table definition than 9.0 anyway (or any two releases, for that matter).  As
long as both are clearly and correctly documented in the respective
pages, it shouldn't be an issue at all.  If anyone attempts to load CSV
log files into the wrong table definition, the problem should make
itself evident pretty quickly.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Fujii Masao
On Fri, Jul 1, 2011 at 3:25 AM, Robert Haas  wrote:
> On Thu, Jun 30, 2011 at 1:51 PM, Kevin Grittner
>  wrote:
>> I think doing anything in PostgreSQL around this beyond allowing
>> DBAs to trust their server clocks is insane.  The arguments for
>> using and trusting ntpd are pretty much identical to the arguments
>> for using and trusting the OS file systems.
>
> Except that implementing our own file system would likely have more
> benefit and be less work than implementing our own time
> synchronization, at least if we want it to be reliable.
>
> Again, I am not trying to pretend that this is any great shakes.
> MySQL's version of this feature apparently does somehow compensate for
> time skew, which I assume must mean that their replication works
> differently than ours - inter alia, it probably requires a TCP socket
> connection between the servers.  Since we don't require that, it
> limits our options in this area, but also gives us more options in
> other areas.  Still, if I could think of a way to do this that didn't
> depend on time synchronization, then I'd be in favor of eliminating
> that requirement.  I just can't; and I'm inclined to think it isn't
> possible.
>
> I wouldn't be opposed to having an option to try to detect time skew
> between the master and the slave and, say, display that information in
> pg_stat_replication.  It might be useful to have that data for
> monitoring purposes, and it probably wouldn't even be that much code.
> However, I'd be a bit hesitant to use that data to "correct" the
> amount of time we spend waiting for time-delayed replication, because
> it would doubtless be extremely imprecise compared to real time
> synchronization, and considerably more error-prone.  IOW, what you
> said.

I agree with Robert. It's difficult to implement a time-synchronization
feature that can deal with all the cases, and I'm not sure that's really
worth our time.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Fujii Masao
On Fri, Jul 1, 2011 at 2:25 AM, Robert Haas  wrote:
>> Some actions aren't even transactional, such as DROP DATABASE, amongst
>
> Good point.  We'd probably need to add a timestamp to the drop
> database record, as that's a case that people would likely want to
> defend against with this feature.

This means that recovery_target_* code would also need to deal with
DROP DATABASE case.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] reducing the overhead of frequent table locks, v4

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 6:28 PM, Thom Brown  wrote:
> On 27 June 2011 15:13, Robert Haas  wrote:
>> I didn't get a lot of comments on the previous version of my patch
>> to accelerate table locks.
>>
>> http://archives.postgresql.org/pgsql-hackers/2011-06/msg00953.php
>>
>> Here's a new version anyway.  In this version, I have:
>>
>> 1. Made pg_locks show fast-path locks as well as regular locks, with a
>> new Boolean column to distinguish.
>> 2. Renamed fpLWLock to backendLock, on the belief that we're going to
>> want something like this for quite a bit more than just this one
>> patch.
>> 3. Removed some debugging code.
>
> Are you able to benchmark this again, and possibly get Stefan to give
> it a go too?

I probably can, but there's not going to be any significant
difference.  The patch hasn't changed much.  Note that the version
where Stefan saw a big performance regression when he pounded the
server was actually a combination of this patch and the fast vxid
locks patch.  If we only apply this one, I believe there will still be
enough lock manager contention to stifle the worst of those effects.

Of course, even if we apply both, there's only going to be a
regression on servers with >32 cores, I think, and even then only when
the number of clients is greater than the number of cores.  Everyone
else will see a benefit.  I think that the next thing for me to tackle
is the SInvalReadLock spinlock contention, but what I'm really hurting
for is some code review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch file questions?

2011-06-30 Thread Andrew Dunstan



On 06/30/2011 07:13 PM, Casey Havenor wrote:

> When I run the patch I'm getting some Hunk fails?  Is this because I'm not
> using the same version that the author intended the patch for?

Yes. Welcome to the world of patch bitrot.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch file questions?

2011-06-30 Thread Casey Havenor
Ok - loaded up Linux - fun stuff. 
Figured out how to get the code for PostgreSQL version 9.0.4 - have that in
a directory.
Got the patch from above... 
Placed into same directory. 

Loaded the dependencies for the final compile and install with ./configure.

BUT 

When I run the patch I'm getting some Hunk fails?  Is this because I'm not
using the same version that the author intended the patch for? 

From the output it looks to be going through the files properly, but it's
just not able to insert the new code?

I've tried ignoring white space and even tried a force - of course I backed
everything up prior to this. :)

-
Warmest regards, 

Casey Havenor
--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Patch-file-questions-tp4536031p4540442.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] add support for logging current role (what to review?)

2011-06-30 Thread Stephen Frost
* Alex Hunsaker (bada...@gmail.com) wrote:
> I think if Stephen was proposing 10 fields, or if there was a list of
> fields we were planning on adding in the next release or 3, it might
> be worth re-factoring. 

I know of at least one person (in an earlier piece of the thread
discussing this patch) who was talking about other fields they'd like
included in the CSV log which aren't currently there.  I don't recall
exactly what it was, but I think it might have been something like the
line number inside stored procedures.

> I know of no such list, and I think this field
> useful/important enough that people who are using csv logging would
> want it anyway. +1 on just tacking on the field and causing a flag day
> for csv users.

Honestly, I think it was *me* who raised the issue that we don't have a
header for CSV logs and that it sucks for people using CSV files.  We've
changed it in the past (application_name was added, iirc) and there
wasn't much noise about it that I recall.  If everyone's happy with that,
it's fine by me.

I do want to rework the logging infrastructure (as discussed in the dev
meeting), but I see that whole thing as rather orthogonal to this
change.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] reducing the overhead of frequent table locks, v4

2011-06-30 Thread Thom Brown
On 27 June 2011 15:13, Robert Haas  wrote:
> I didn't get a lot of comments on the previous version of my patch
> to accelerate table locks.
>
> http://archives.postgresql.org/pgsql-hackers/2011-06/msg00953.php
>
> Here's a new version anyway.  In this version, I have:
>
> 1. Made pg_locks show fast-path locks as well as regular locks, with a
> new Boolean column to distinguish.
> 2. Renamed fpLWLock to backendLock, on the belief that we're going to
> want something like this for quite a bit more than just this one
> patch.
> 3. Removed some debugging code.

Are you able to benchmark this again, and possibly get Stefan to give
it a go too?

Thom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Make the visibility map crash-safe.

2011-06-30 Thread Jeff Davis
On Thu, 2011-06-30 at 07:50 -0400, Robert Haas wrote:
> I compare the performance of commit
> 431ab0e82819b31fcd1e33ecb52c2cd3b4b41da7 (post-patch) with commit
> 431ab0e82819b31fcd1e33ecb52c2cd3b4b41da7 (pre-patch).

I believe that is a copy/paste error.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] hint bit cache v6

2011-06-30 Thread Merlin Moncure
On Thu, Jun 30, 2011 at 11:44 AM, Robert Haas  wrote:
> On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure  wrote:
>>> I think the basic problem is that the cache pages are so large.  It's
>>> hard to make them smaller because that increases the cost of accessing
>>> the cache, as you already noted, but at the same time, making an
>>> eviction decision on a cache that holds 64K entries based on 100 data
>>> points seems like waving around a pretty large-caliber weapon in a
>>> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
>>> it's hard to avoid the niggling fear that someone might accidentally
>>> get shot.
>>
>> Right...it's essentially a rolling statistical sampling of transaction
>> IDs based on past accesses.  Going from, say, 100 to 1000 will probably
>> drop your error margin quite a bit, but I really wonder how much benefit
>> you can get from going past that.
>
> An interesting idea might be to forcibly cause a replacement every 100
> tuples (or every 1000 tuples) and see if that causes a noticeable
> performance regression.  If it doesn't, that's a good data point.

To test this I disabled the cache check on top of
HeapTupleSatisfiesMVCC and forced a cache entry in place of it with a
bogus xid (so every tuple scanned was treated as a 'miss').  Scanning 1
million records in memory over and over went up a few percentage
points -- converging on about 280ms instead of 270ms.  This is with the
miss array bumped to 1000 entries:

oprofile said:

regular hinted tuple case:
120  10.2041  advance_aggregates
107   9.0986  heapgettup_pagemode
 77   6.5476  ExecProject
 74   6.2925  heapgetpage
 72   6.1224  ExecProcNode
 72   6.1224  ExecScan
 69   5.8673  advance_transition_function
 66   5.6122  heap_getnext
 58   4.9320  HeapTupleSatisfiesMVCC

hinted tuple with pathological cache entry:
111   8.9588  advance_aggregates
109   8.7974  heapgettup_pagemode
 85   6.8604  ExecProject
 81   6.5375  heapgetpage
 77   6.2147  ExecScan
 70   5.6497  ExecStoreTuple
 70   5.6497  HeapTupleSatisfiesMVCC

This was a fairly short profiling run, but the numbers are fairly
consistent.  The replacement is fairly cheap... sorting 1000 ints
doesn't take all that long, and 100 is even faster.

> I think the sour spot for this whole idea is likely to be the case
> where you have a workload that is 90% read and 10% write, or something
> like that.  On an all-read workload (like pgbench -S), you're
> hopefully going to converge to a state where all the hint-bits are
> set, once VACUUM kicks in.  And on an all-write workload I think that
> you might lose the effect in the noise.  Not sure if we have an easy
> way to set up such a workload with pgbench, though.

It's trivial to test with pgbench -- just use the random number
facility to generate an id for some table and update where random() >
0.9.  However, this does not generate nearly enough 'misses' to
exercise cache replacement in any meaningful way.  My workstation VM
can punch out about 5k transactions/sec.  With only 10% getting a new
xid and writing a tuple back to the table, only 500 xids are getting
generated per second.  At that rate it takes quite a while to burn
through the 256k transactions the cache ranges over.  Also, this case
is not bound by scan performance anyway, which makes profiling it a
little silly -- HeapTupleSatisfiesMVCC is simply not as big a
bottleneck in OLTP workloads.  Scan-heavy loads are where this matters,
and scan-heavy loads naturally tend to generate fewer xids per tuple
scanned.
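
A sketch of such a custom pgbench script under those assumptions (9.0-era
\setrandom syntax; the table and the 10% threshold are invented for
illustration):

    \setrandom id 1 1000000
    SELECT n FROM t WHERE n = :id;
    -- the matched row is written back (consuming an xid) in ~10% of runs:
    UPDATE t SET n = n WHERE n = :id AND random() > 0.9;

Run it with something like "pgbench -n -f miss_test.sql".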

>>> Just for the sake of argument, let's suppose we made an array with 64K
>>> elements, each one representing one 64K chunk of the XID space.  Each
>>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>>> memory... probably not good, but I'm just thinking out loud here.
>>> Every time we're asked about an XID, we bump the corresponding
>>> counter.  Periodically (say, every 64K probes) we zip through all the
>>> counters and consider performing a cache replacement.  We look at the
>>> not-already-cached page that has the highest counter value and compare
>>> it to the counter value for the cached pages.  If it is higher and the
>>> difference exceeds some safety margin (to protect against hysteresis),
>>> then we do the replacement.  I think something like this would allow
>>> you to have a lot more information available to make a direct
>>> apples-to-apples comparison at cache replacement time, as compared
>>> with what you have now.
>>
>> yeah -- what you've done here is basically two things: 1. by mapping
>> static ranges you get to skip the sort (but not the scan) during the
>> replacement, and 2. by vastly expanding the sampling size you
>> eliminate thrashing via noise in the rolling sample.  This comes at a
>> significant memory cost and loss of some nimbleness in the cache (i'm
>> not completely sure more aggressive replacement is 'bad')  -- although
>> that could be mitigated by replacing at more frequent intervals.

Re: [HACKERS] [COMMITTERS] pgsql: Enable CHECK constraints to be declared NOT VALID

2011-06-30 Thread Devrim GÜNDÜZ
On Thu, 2011-06-30 at 15:09 -0400, Alvaro Herrera wrote:



> I don't think we have a policy for this, but I have done it for some
> time now and nobody has complained, so I sort of assumed it was okay.
> Besides, some of the people pouring the money in do care about it;
> moreover, it provides a little incentive for other companies that
> might also be in a position to fund development but lack the "peer
> approval" of the idea, or a final little push.
> 
> So what's the general opinion here? 

+1 for adding the sponsor name to the commit message. It will encourage
companies more.
-- 
Devrim GÜNDÜZ
Principal Systems Engineer @ EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Community: devrim~PostgreSQL.org, devrim.gunduz~linux.org.tr
http://www.gunduz.org  Twitter: http://twitter.com/devrimgunduz


signature.asc
Description: This is a digitally signed message part


Re: [HACKERS] [COMMITTERS] pgsql: Enable CHECK constraints to be declared NOT VALID

2011-06-30 Thread David E. Wheeler
On Jun 30, 2011, at 12:09 PM, Alvaro Herrera wrote:

> Robert Haas (whose name I misspelled in the commit message above) just
> mentioned to me (in an answer to my apologizing about it) that he didn't
> think that mentioning sponsors for patch development was a good idea.
> 
> I don't think we have a policy for this, but I have done it for some
> time now and nobody has complained, so I sort of assumed it was okay.
> Besides, some of the people pouring the money in do care about it;
> moreover, it provides a little incentive for other companies that might
> also be in a position to fund development but lack the "peer approval"
> of the idea, or a final little push.
> 
> So what's the general opinion here?

I certainly see no harm in it, and contributors at all levels -- including 
sponsors of new features or fixes -- ought to be acknowledged and thanked.

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Enable CHECK constraints to be declared NOT VALID

2011-06-30 Thread Alvaro Herrera
Excerpts from Alvaro Herrera's message of Thu Jun 30 11:58:09 -0400 2011:
> Enable CHECK constraints to be declared NOT VALID
> 
> [...]
> 
> This patch was sponsored by Enova Financial.

Robert Haas (whose name I misspelled in the commit message above) just
mentioned to me (in an answer to my apologizing about it) that he didn't
think that mentioning sponsors for patch development was a good idea.

I don't think we have a policy for this, but I have done it for some
time now and nobody has complained, so I sort of assumed it was okay.
Besides, some of the people pouring the money in do care about it;
moreover, it provides a little incentive for other companies that might
also be in a position to fund development but lack the "peer approval"
of the idea, or a final little push.

So what's the general opinion here?

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 1:51 PM, Kevin Grittner
 wrote:
> I think doing anything in PostgreSQL around this beyond allowing
> DBAs to trust their server clocks is insane.  The arguments for
> using and trusting ntpd are pretty much identical to the arguments
> for using and trusting the OS file systems.

Except that implementing our own file system would likely have more
benefit and be less work than implementing our own time
synchronization, at least if we want it to be reliable.

Again, I am not trying to pretend that this is any great shakes.
MySQL's version of this feature apparently does somehow compensate for
time skew, which I assume must mean that their replication works
differently than ours - inter alia, it probably requires a TCP socket
connection between the servers.  Since we don't require that, it
limits our options in this area, but also gives us more options in
other areas.  Still, if I could think of a way to do this that didn't
depend on time synchronization, then I'd be in favor of eliminating
that requirement.  I just can't; and I'm inclined to think it isn't
possible.

I wouldn't be opposed to having an option to try to detect time skew
between the master and the slave and, say, display that information in
pg_stat_replication.  It might be useful to have that data for
monitoring purposes, and it probably wouldn't even be that much code.
However, I'd be a bit hesitant to use that data to "correct" the
amount of time we spend waiting for time-delayed replication, because
it would doubtless be extremely imprecise compared to real time
synchronization, and considerably more error-prone.  IOW, what you
said.
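
For reference, the pg_stat_replication view added in 9.1 is the natural
home for such a skew column; it already exposes per-standby WAL positions:

    -- on the master:
    SELECT application_name, state, sent_location, replay_location
    FROM pg_stat_replication;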

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Josh Berkus
Kevin,

> I think doing anything in PostgreSQL around this beyond allowing
> DBAs to trust their server clocks is insane.  The arguments for
> using and trusting ntpd are pretty much identical to the arguments
> for using and trusting the OS file systems.

Oh, you don't want to implement our own NTP?  Coward!

;-)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Kevin Grittner
Josh Berkus  wrote:
 
> when we last had the argument about time synchronization,
> Kevin Grittner (I believe) pointed out that unsynchronized
> replication servers have an assortment of other issues ... like
> any read query involving now().
 
I don't remember making that point, although I think it's a valid
one.
 
What I'm sure I pointed out is that we have one central router which
synchronizes to a whole bunch of atomic clocks around the world
using the normal "discard the outliers and average the rest"
algorithm, and then *every single server and workstation on our
network synchronizes to that router*.  Our database servers are all
running on Linux using ntpd.  Our monitoring spams us with email if
any of the clocks falls outside nominal bounds.  (It's been many
years since we had a misconfigured server which triggered that.)
 
I think doing anything in PostgreSQL around this beyond allowing
DBAs to trust their server clocks is insane.  The arguments for
using and trusting ntpd are pretty much identical to the arguments
for using and trusting the OS file systems.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Josh Berkus
On 6/30/11 10:25 AM, Robert Haas wrote:
> So I think keeping it defined in terms of time is the
> right way forward, even though the need for external time
> synchronization is, certainly, not ideal.

Actually, when we last had the argument about time synchronization,
Kevin Grittner (I believe) pointed out that unsynchronized replication
servers have an assortment of other issues ... like any read query
involving now().  As the person who originally brought up this hurdle, I
felt that his argument defeated mine.

Certainly I can't see any logical way to have time delay in the absence
of clock synchronization of some kind.  Also, I kinda feel like this
discussion seems aimed at overcomplicating a feature which only a small
fraction of our users will ever use.  Let's keep it as simple as possible.

As for delay on streaming replication, I'm for it.  I think that
post-9.1, thanks to pg_basebackup, the number of our users who are doing
archive log shipping is going to drop tremendously.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 6:45 AM, Simon Riggs  wrote:
> The only way to control this is with a time delay that can be changed
> while the server is running. A recovery.conf parameter doesn't allow
> that, so another way is preferable.

True.  We've talked about making the recovery.conf parameters into
GUCs, which would address that concern (and some others).

> I think the time problems are more complex than said. The patch relies
> upon transaction completion times, but not all WAL records have a time
> attached to them. Plus you only used commits anyway, not sure why.

For the same reason we do that with the recovery_target_* code -
replaying something like a heap insert or heap update doesn't change
the user-visible state of the database, because the records aren't
visible anyway until the commit record is replayed.

> Some actions aren't even transactional, such as DROP DATABASE, amongst

Good point.  We'd probably need to add a timestamp to the drop
database record, as that's a case that people would likely want to
defend against with this feature.

> others. Consecutive records can be hours apart, so it would be
> possible to delay on some WAL records but then replay records that
> happened minutes ago, then wait hours for the next apply. So this
> patch doesn't do what it claims in all cases.
>
> Similar discussion on max_standby_delay covered exactly that ground
> and went on for weeks in 9.0. IIRC I presented the same case you just
> did and we agreed in the end that was not acceptable. I'm not going to
> repeat it. Please check the archives.

I think this case is a bit different.  First, max_standby_delay is
relevant for any installation using Hot Standby, whereas this is a
feature that specifically involves time.  Saying that you have to have
time synchronization for Hot Standby to work as designed is more of a
burden than saying you need time synchronization *if you want to use
the time-delayed recovery feature*.  Second, and maybe more
importantly, no one has come up with an idea for how to make this work
reliably in the presence of time skew.  Perhaps we could provide a
simple time-skew correction feature that would work in the streaming
case (though probably not nearly as well as running ntpd), but as I
understand your argument, you're saying that most people will want to
use this with archiving.  I don't see how to make that work without
time synchronization.  In the max_standby_delay case, the requirement
is that queries not get cancelled too aggressively while at the same
time not letting the standby get too far behind the master, which leaves
some flexibility in terms of how we actually make that trade-off, and
we eventually found a way that didn't require time synchronization,
which was an improvement.  But for a time-delayed standby, the
requirement at least AIUI is that the state of the standby lag the
master by a certain time interval, and I don't see any way to do that
without comparing slave timestamps with master timestamps.  If we can
find a similar clever trick here, great!  But I'm not seeing how to do
it.

Now, another option here is to give up on the idea of a time-delayed
standby altogether and instead allow the standby to lag the master by
a certain number of WAL segments or XIDs.  Of course, if we do that,
then we will not have a feature called "time-delayed standbys".
Instead, we will have a feature called "standbys delayed by a certain
number of WAL segments (or XIDs)".  That certainly caters to some of
the same use cases, but I think it severely lacking in the usability
department.  I bet the first thing most people will do is to try to
figure out how to translate between those metrics and time, and I bet
we'll get complaints on systems where the activity load is variable
and therefore the time lag for a fixed WAL-segment lag or XID-lag is
unpredictable.  So I think keeping it defined in terms of time is the
right way forward, even though the need for external time
synchronization is, certainly, not ideal.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 1:00 PM, Josh Berkus  wrote:
> On 6/30/11 2:00 AM, Simon Riggs wrote:
>>>> Manual (or scripted) intervention is always necessary if you reach disk
>>>> 100% full.
>>>
>>> Wow, that's a pretty crappy failure mode... but I don't think we need
>>> to fix it just on account of this patch.  It would be nice to fix, of
>>> course.
>>
>> How is that different to running out of space in the main database?
>>
>> If I try to pour a pint of milk into a small cup, I don't blame the cup.
>
> I have to agree with Simon here.  ;-)
>
> We can do some things to make this easier for administrators, but
> there's no way to "solve" the problem.  And the things we could do would
> have to be advanced optional modes which aren't on by default, so they
> wouldn't really help the DBA with poor planning skills.  Here's my
> suggestions:
>
> 1) Have a utility (pg_archivecleanup?) which checks if we have more than
> a specific setting's worth of archive_logs, and breaks replication and
> deletes the archive logs if we hit that number.  This would also require
> some way for the standby to stop replicating *without* becoming a
> standalone server, which I don't think we currently have.
>
> 2) Have a setting where, regardless of standby_delay settings, the
> standby will interrupt any running queries and start applying logs as
> fast as possible if it hits a certain number of unapplied archive logs.
>  Of course, given the issues we had with standby_delay, I'm not sure I
> want to complicate it further.
>
> I think we've already fixed the biggest issue in 9.1, since we now have
> a limit on the number of WALs the master will keep if archiving is
> failing ... yes?  That's the only big *avoidable* failure mode we have,
> where a failing standby effectively shuts down the master.

I'm not sure we changed anything in this area for 9.1.  Am I wrong?
wal_keep_segments was present in 9.0.  Using that instead of archiving
is a reasonable way to bound the amount of disk space that can get
used, at the cost of possibly needing to rebuild the standby if things
get too far behind.  Of course, in any version, you could also use an
archive_command that will remove old files to make space if the disk
is full, with the same downside: if the standby isn't done with those
files, you're now in for a rebuild.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoid index rebuilds for no-rewrite ALTER TABLE ALTER TYPE

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 11:55 AM, Noah Misch  wrote:
> On Wed, Jun 29, 2011 at 09:42:06AM -0400, Robert Haas wrote:
>> On Tue, Jun 28, 2011 at 3:40 PM, Noah Misch  wrote:
>
>> > Here's the call stack in question:
>> >
>> >        RelationBuildLocalRelation
>> >        heap_create
>> >        index_create
>> >        DefineIndex
>> >        ATExecAddIndex
>> >
>> > Looking at it again, it wouldn't be the end of the world to add a
>> > relfilenode argument to each. None of those have more than four callers.
>>
>> Yeah.  Those functions take an awful lot of arguments, which suggests
>> that some refactoring might be in order, but I still think it's
>> cleaner to add another argument than to change the state around
>> after-the-fact.
>>
>> > ATExecAddIndex()
>> > would then call RelationPreserveStorage() before calling DefineIndex(),
>> > which would in turn put things in a correct state from the start.  Does
>> > that seem appropriate?  Offhand, I do like it better than what I had.
>>
>> I wish we could avoid the whole death-and-resurrection thing
>> altogether, but off-hand I'm not seeing a real clean way to do that.
>> At the very least we had better comment it to death.
>
> I couldn't think of a massive amount to say about that, but see what you think
> of this level of commentary.
>
> Looking at this again turned up a live bug in the previous version: if the old
> index file were created in the current transaction, we would wrongly remove
> its delete-at-abort entry as well as its delete-at-commit entry.  This
> leaked the disk file.  Fixed by adding an argument to
> RelationPreserveStorage() indicating which kind to remove.  Test case:
>
>        BEGIN;
>        CREATE TABLE t AS SELECT * FROM generate_series(1,10) t(n);
>        CREATE INDEX ti ON t(n);
>        SELECT pg_relation_filepath('ti');
>        ALTER TABLE t ALTER n TYPE int;
>        ROLLBACK;
>        CHECKPOINT;
>        -- file named above should be gone
>
> I also updated the ATPostAlterTypeCleanup() variable names per discussion and
> moved IndexStmt.oldNode to a more-natural position in the structure.

On first blush, that looks a whole lot cleaner.  I'll try to find some
time for a more detailed review soon.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Josh Berkus
On 6/30/11 2:00 AM, Simon Riggs wrote:
>>> Manual (or scripted) intervention is always necessary if you reach disk
>>> 100% full.
>>
>> Wow, that's a pretty crappy failure mode... but I don't think we need
>> to fix it just on account of this patch.  It would be nice to fix, of
>> course.
> How is that different to running out of space in the main database?
> 
> If I try to pour a pint of milk into a small cup, I don't blame the cup.

I have to agree with Simon here.  ;-)

We can do some things to make this easier for administrators, but
there's no way to "solve" the problem.  And the things we could do would
have to be advanced optional modes which aren't on by default, so they
wouldn't really help the DBA with poor planning skills.  Here's my
suggestions:

1) Have a utility (pg_archivecleanup?) which checks if we have more than
a specific setting's worth of archive_logs, and breaks replication and
deletes the archive logs if we hit that number.  This would also require
some way for the standby to stop replicating *without* becoming a
standalone server, which I don't think we currently have.

2) Have a setting where, regardless of standby_delay settings, the
standby will interrupt any running queries and start applying logs as
fast as possible if it hits a certain number of unapplied archive logs.
 Of course, given the issues we had with standby_delay, I'm not sure I
want to complicate it further.

I think we've already fixed the biggest issue in 9.1, since we now have
a limit on the number of WALs the master will keep if archiving is
failing ... yes?  That's the only big *avoidable* failure mode we have,
where a failing standby effectively shuts down the master.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread David E. Wheeler
On Jun 30, 2011, at 9:34 AM, Jeff Davis wrote:

> Then how do you get a text range that doesn't correspond to the
> LC_COLLATE setting?

You cast it.

> Does that mean you couldn't dump/reload from a
> system with one collation and get the same values in a system with a
> different collation? That would be very strange.

No, pg_dump should always explicitly cast things. But there should be a 
reasonable default behavior if I'm in psql and don't cast.

> Or, what about other types that just happen to have multiple useful
> total orders?

Cast where you need it explicit, and have a reasonable default when it's not 
cast.

Best,

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread David E. Wheeler
On Jun 30, 2011, at 9:29 AM, Jeff Davis wrote:

> Right. In that respect, it's more like a record type: many possible
> record types exist, but you only define the ones you want.

Well, okay. How is this same problem handled for RECORD types, then?

>> By default, no range types would exists I believe.
> 
> I was planning to include _some_ by default. Probably not text ranges,
> but integer and timestamp[tz] ranges. If nothing else, it makes it
> easier to document.

+1

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] hint bit cache v6

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 11:18 AM, Merlin Moncure  wrote:
>> I think the basic problem is that the cache pages are so large.  It's
>> hard to make them smaller because that increases the cost of accessing
>> the cache, as you already noted, but at the same time, making an
>> eviction decision on a cache that holds 64K entries based on 100 data
>> points seems like waving around a pretty large-caliber weapon in a
>> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
>> it's hard to avoid the niggling fear that someone might accidentally
>> get shot.
>
> Right...it's essentially a rolling statistical sampling of transaction
> IDs based on past accesses.  Going from, say, 100 to 1000 will probably
> drop your error margin quite a bit, but I really wonder how much benefit
> you can get from going past that.

An interesting idea might be to forcibly cause a replacement every 100
tuples (or every 1000 tuples) and see if that causes a noticeable
performance regression.  If it doesn't, that's a good data point.

I think the sour spot for this whole idea is likely to be the case
where you have a workload that is 90% read and 10% write, or something
like that.  On an all-read workload (like pgbench -S), you're
hopefully going to converge to a state where all the hint-bits are
set, once VACUUM kicks in.  And on an all-write workload I think that
you might lose the effect in the noise.  Not sure if we have an easy
way to set up such a workload with pgbench, though.

>> Just for the sake of argument, let's suppose we made an array with 64K
>> elements, each one representing one 64K chunk of the XID space.  Each
>> element is a 4-byte unsigned integer, so the array takes up 256kB of
>> memory... probably not good, but I'm just thinking out loud here.
>> Every time we're asked about an XID, we bump the corresponding
>> counter.  Periodically (say, every 64K probes) we zip through all the
>> counters and consider performing a cache replacement.  We look at the
>> not-already-cached page that has the highest counter value and compare
>> it to the counter value for the cached pages.  If it is higher and the
>> difference exceeds some safety margin (to protect against hysteresis),
>> then we do the replacement.  I think something like this would allow
>> you to have a lot more information available to make a direct
>> apples-to-apples comparison at cache replacement time, as compared
>> with what you have now.
>
> yeah -- what you've done here is basically two things: 1. by mapping
> static ranges you get to skip the sort (but not the scan) during the
> replacement, and 2. by vastly expanding the sampling size you
> eliminate thrashing via noise in the rolling sample.  This comes at a
> significant memory cost and loss of some nimbleness in the cache (i'm
> not completely sure more aggressive replacement is 'bad')  -- although
> that could be mitigated by replacing at more frequent intervals.

Well, that gets at an interesting question: how often SHOULD we
replace cache pages?  My gut is that it's 10K-100K and your gut is
that it's 100-1000, which is a hundred-fold difference.  Seems like we
need some kind of empirical data to decide what actually works well in
real life.

> Your range counting strategy might work -- I think to do it right so
> that you don't have to map the entire range is going to require two
> levels of ranges such that you activate a very wide range first before
> looking at 64k subranges. That way you could reduce memory consumption
> significantly without sorting during the replacement.

I don't think you need two levels of ranges.  Just keep track of the
minimum and maximum XID covered by the array (these must always be 64M
transactions apart, say, if the array has 1024 entries, and each cache
page covers 64K XIDs).  When you see a given XID, and it's between the
minimum and the maximum, then bump the counter in bucket XID/64K.  If
you see an XID that is older than the minimum, then ignore it; we
won't consider devoting cache pages to really old XIDs.  If you see an
XID that is newer than the maximum, then just increase the minimum and
maximum by enough to put the new XID in range.  For every 64K
increment by which you advance the minimum and maximum, you need to
zero one entry in the array (since it's no longer covering the same
portion of the XID space) but all the other entries remain intact.

(There is a small problem here of how to initially set
the minimum and maximum to something sensible, and to make sure that
they don't get out of whack if a backend sits idle for a few billion
transactions, but that seems like it should be solvable.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread Jeff Davis
On Wed, 2011-06-29 at 12:34 -0400, Robert Haas wrote:
> But now that I'm thinking about this a little more, I'm worried about this 
> case:
> 
> CREATE TABLE foo AS SELECT RANGE('something'::funkytype, 'somethingelse'::funkytype);
> DROP TYPE funkytype;
> 
> It seems to me that the first statement had better fail, or else the
> second one is going to create a hopeless mess (imagine that a new type
> comes along and gets the OID of funkytype).

Interesting point. I don't think it's a problem because pseudo-types
can't be used that way, so that provides us a mechanism to stop it. But
it means that we have to be a little more sure that such values can't
persist anywhere.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread Jeff Davis
On Wed, 2011-06-29 at 10:15 -0700, David E. Wheeler wrote:
> On Jun 29, 2011, at 10:13 AM, Florian Pflug wrote:
> 
> > Because there might be more than one range type for a
> > base type. Say there are two range types over text, one
> > with collation 'de_DE' and one with collation 'en_US'.
> > What would the type of
> >  range('foo', 'f')
> > be?
> 
> The one that corresponds to the current LC_COLLATE setting.

Then how do you get a text range that doesn't correspond to the
LC_COLLATE setting? Does that mean you couldn't dump/reload from a
system with one collation and get the same values in a system with a
different collation? That would be very strange.

Or, what about other types that just happen to have multiple useful
total orders?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread Jeff Davis
On Thu, 2011-06-30 at 09:11 +0200, Florian Pflug wrote:
> > How would the system catalogs be initialized under that theory: surely
> > you're not going to seed (nr. of types) * (nr. of collations) * (nr. of
> > opclasses) range types in initdb?
> 
> There's CREATE RANGE.

Right. In that respect, it's more like a record type: many possible
record types exist, but you only define the ones you want.

> By default, no range types would exists I believe.

I was planning to include _some_ by default. Probably not text ranges,
but integer and timestamp[tz] ranges. If nothing else, it makes it
easier to document.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] creating CHECK constraints as NOT VALID

2011-06-30 Thread Alvaro Herrera
Excerpts from Robert Haas's message of sáb jun 18 23:53:17 -0400 2011:

> I agree.  That's pretty contorted.  How about something like this:
> 

Thanks Jaime and Robert.  I have pushed this patch with these fixes.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoid index rebuilds for no-rewrite ALTER TABLE ALTER TYPE

2011-06-30 Thread Noah Misch
On Wed, Jun 29, 2011 at 09:42:06AM -0400, Robert Haas wrote:
> On Tue, Jun 28, 2011 at 3:40 PM, Noah Misch  wrote:

> > Here's the call stack in question:
> >
> >        RelationBuildLocalRelation
> >        heap_create
> >        index_create
> >        DefineIndex
> >        ATExecAddIndex
> >
> > Looking at it again, it wouldn't bring the end of the world to add a
> > relfilenode argument to each. None of those have more than four callers.
> 
> Yeah.  Those functions take an awful lot of arguments, which suggests
> that some refactoring might be in order, but I still think it's
> cleaner to add another argument than to change the state around
> after-the-fact.
> 
> > ATExecAddIndex()
> > would then call RelationPreserveStorage() before calling DefineIndex(),
> > which would in turn put things in a correct state from the start.  Does
> > that seem appropriate?  Offhand, I do like it better than what I had.
> 
> I wish we could avoid the whole death-and-resurrection thing
> altogether, but off-hand I'm not seeing a real clean way to do that.
> At the very least we had better comment it to death.

I couldn't think of a massive amount to say about that, but see what you think
of this level of commentary.

Looking at this again turned up a live bug in the previous version: if the old
index file were created in the current transaction, we would wrongly remove its
delete-at-abort entry as well as its delete-at-commit entry.  This leaked the
disk file.  Fixed by adding an argument to RelationPreserveStorage() indicating
which kind to remove.  Test case:

BEGIN;
CREATE TABLE t AS SELECT * FROM generate_series(1,10) t(n);
CREATE INDEX ti ON t(n);
SELECT pg_relation_filepath('ti');
ALTER TABLE t ALTER n TYPE int;
ROLLBACK;
CHECKPOINT;
-- file named above should be gone

I also updated the ATPostAlterTypeCleanup() variable names per discussion and
moved IndexStmt.oldNode to a more-natural position in the structure.
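
For reference, the adjusted entry point then looks roughly like this (a
sketch, not the exact patch hunk):

/*
 * Sketch: the new second argument selects which pending-delete entry to
 * cancel -- the delete-at-commit one or the delete-at-abort one.
 */
extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);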

Thanks,
nm
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index a90b4e2..b6fe908 100644
*** a/doc/src/sgml/xindex.sgml
--- b/doc/src/sgml/xindex.sgml
***
*** 834,846  ALTER OPERATOR FAMILY integer_ops USING btree ADD

 In a B-tree operator family, all the operators in the family must sort
 compatibly, meaning that the transitive laws hold across all the data types
!supported by the family: if A = B and B = C, then A =
!C, and if A < B and B < C, then A < C.  For each
!operator in the family there must be a support function having the same
!two input data types as the operator.  It is recommended that a family be
!complete, i.e., for each combination of data types, all operators are
!included.  Each operator class should include just the non-cross-type
!operators and support function for its data type.

  

--- 834,848 

 In a B-tree operator family, all the operators in the family must sort
 compatibly, meaning that the transitive laws hold across all the data types
!supported by the family: if A = B and B = C, then A = C,
!and if A < B and B < C, then A < C.  Subjecting operands
!to any number of implicit casts or binary coercion casts involving only
!types represented in the family must not change the result other than to
!require a similar cast thereon.  For each operator in the family there must
!be a support function having the same two input data types as the operator.
!It is recommended that a family be complete, i.e., for each combination of
!data types, all operators are included.  Each operator class should include
!just the non-cross-type operators and support function for its data type.

  

***
*** 851,861  ALTER OPERATOR FAMILY integer_ops USING btree ADD
 by the family's equality operators, even when the values are of different
 types.  This is usually difficult to accomplish when the types have
 different physical representations, but it can be done in some cases.
!Notice that there is only one support function per data type, not one
!per equality operator.  It is recommended that a family be complete, i.e.,
!provide an equality operator for each combination of data types.
!Each operator class should include just the non-cross-type equality
!operator and the support function for its data type.

  

--- 853,865 
 by the family's equality operators, even when the values are of different
 types.  This is usually difficult to accomplish when the types have
 different physical representations, but it can be done in some cases.
!Implicit casts and binary coercion casts among types represented in the
!family must preserve this invariant.  Notice that there is only one support
!function per data type, not one per equality operator.  It is recommended
!that a family be complete, i.e., provide an equality operator for each
!combination of data types.  Each operator class should include just the
!non-cross-type equality operator and the support function for its data type.

Re: [HACKERS] hint bit cache v6

2011-06-30 Thread Merlin Moncure
On Wed, Jun 29, 2011 at 3:18 PM, Robert Haas  wrote:
>> With the algorithm as coded, to fully flush the cache you'd have to
>> find a series of *unhinted* tuples distributed across minimum of four
>> 64k wide page ranges, with the number of tuples being over the 5%
>> noise floor.  Furthermore, to eject a cache page you have to have more
>> hits than a page already in the cache got -- hits are tracked on the
>> existing pages and the candidate pages are effectively with the
>> existing pages based on hits with the best four picked.  Also, is it
>> reasonable to expect the cache to help in these types of cases
>> anyways?
>
> *scratches head*
>
> Well, surely you must need to reset the counters for the pages you've
> chosen to include in the cache at some point.  If you don't, then
> there's a strong likelihood that after some period of time the pages
> you have in there will become so firmly nailed into the cache that
> they can never be evicted.

yup -- hit counts are reset on replacement.

> Yeah, I am inclined to think that going very far in that direction is
> going to be a big pile of fail.  TBH, I'm somewhat surprised that you
> managed to get what you already have to work without a performance
> regression.  That code path is extremely hot, and injecting more
> complexity seems like it ought to be a loser.  For purposes of this
> review, I'm taking it as given that you've tested this carefully, but
> I'll admit to lingering skepticism.

You are absolutely right to be skeptical.  Here's the rub -- it's
incredibly difficult to contrive a set of circumstances where the
cache is aggressively replaced, and even if you can somehow coerce
that to happen, the underlying data is permanently altered so that it
can't happen again.  Cheating by mucking around with the xid or
altering the visibility routines to force replacement isn't really
fair.  AFAICT, unhinted tuples tend to be both recent and grouped into
a fairly narrow transaction range basically always.  Sooner or later,
at least for tables that matter, a vacuum is going to roll around and
smear the bits back into the table.  My point is that even for the
worst case I could think of -- pgbench continually jamming records
into the table, replacements tend to be quite rare.  Let's assume
though I'm wrong and the pathological case is likely to happen.

> I think the basic problem is that the cache pages are so large.  It's
> hard to make them smaller because that increases the cost of accessing
> the cache, as you already noted, but at the same time, making an
> eviction decision on a cache that holds 64K entries based on 100 data
> points seems like waving around a pretty large-caliber weapon in a
> fairly uncontrolled fashion.  Maybe it works OK 90% of the time, but
> it's hard to avoid the niggling fear that someone might accidentally
> get shot.

Right...it's essentially a rolling statistical sampling of transaction
IDs based on past accesses.  Going from, say, 100 to 1000 will probably
drop your error margin quite a bit, but I really wonder how much benefit
you can get from going past that.

> Just for the sake of argument, let's suppose we made an array with 64K
> elements, each one representing one 64K chunk of the XID space.  Each
> element is a 4-byte unsigned integer, so the array takes up 256kB of
> memory... probably not good, but I'm just thinking out loud here.
> Every time we're asked about an XID, we bump the corresponding
> counter.  Periodically (say, every 64K probes) we zip through all the
> counters and consider performing a cache replacement.  We look at the
> not-already-cached page that has the highest counter value and compare
> it to the counter value for the cached pages.  If it is higher and the
> difference exceeds some safety margin (to protect against hysteresis),
> then we do the replacement.  I think something like this would allow
> you to have a lot more information available to make a direct
> apples-to-apples comparison at cache replacement time, as compared
> with what you have now.

yeah -- what you've done here is basically two things: 1. by mapping
static ranges you get to skip the sort (but not the scan) during the
replacement, and 2. by vastly expanding the sampling size you
eliminate thrashing via noise in the rolling sample.  This comes at a
significant memory cost and loss of some nimbleness in the cache (i'm
not completely sure more aggressive replacement is 'bad')  -- although
that could be mitigated by replacing at more frequent intervals.

> Now, the big problem here is that a 256kB array is probably too big,
> but because we don't want to blow out the lower-level CPU caches and
> also because every time we consider performing a cache replacement
> we're going to want to zero the whole thing, and that has to be fast.
> But we probably don't really need the array to cover the entire XID
> space.  vacuum_freeze_min_age defaults to 50 million transactions, and
vacuum_freeze_table_age defaults to 150 million transactions.

Re: [HACKERS] Small patch for GiST: move childoffnum to child

2011-06-30 Thread Alexander Korotkov
Hi Jeff,

Thank you for review.

On Thu, Jun 30, 2011 at 8:50 AM, Jeff Janes  wrote:

> So I tried provoking situations where this surrounding code section
> would get executed, both patched and unpatched.  I have been unable to
> do so--apparently this code is for an incredibly obscure situation
> which I can't induce at will.
>
Yes, it also seems pretty hard to get this code section executed for me. I'm
going to ask Teodor and Oleg about it.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Bug in SQL/MED?

2011-06-30 Thread Alvaro Herrera
Excerpts from 花田 茂's message of jue jun 30 06:00:23 -0400 2011:

> I attached a patch which fixes file_fdw to check the required option
> (filename) in its validator function.  I think that such a requirement
> should be checked again in PlanForeignScan(), as it has been so far.
> Note that this patch requires that fdw.patch has been applied.

I think it'd be good to keep the error check in fileGetOptions() just to
ensure that we don't crash in case a catalog is messed up with, though
I'd degrade it to elog(ERROR) from ereport.

-- 
Álvaro Herrera 
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] libpq SSL with non-blocking sockets

2011-06-30 Thread Steve Singer

On 11-06-28 02:14 PM, Martin Pihlak wrote:


Hmm, I thought I thought about that. There was a check in the original
patch: "if (conn->sslRetryBytes || (conn->outCount - remaining)>  0)"
So if the SSL retry buffer was emptied it would return 1 if there was
something left in the regular output buffer. Or did I miss something
there?



The issue I saw in the original patch was that at that point in the 
function, sslRetryBytes could be zero (if the data was sent) but  
conn->outCount - remaining would be an incorrect value since "remaining" 
is the number of bytes left to send from sslRetryBuf NOT 
conn->outBuffer.   This is no longer an issue in the updated patch.

I will try to take a closer look at your updated patch in the next few days.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] how to call the function--pqCatenateResultError()

2011-06-30 Thread Robert Haas
2011/6/27 _石头 :
> Hello!~
>
>        Now I encounter a function call problem in PostgreSQL's psql module!
>
>        The situation is as follow:
>         In ./src/bin/psql/common.c, I want to call the
> function pqCatenateResultError().
>         Function pqCatenateResultError() is declared in
> ./src/interfaces/libpq/libpq-init.h
>                                     extern void
> pqCatenateResultError(PGresult *res, const char *msg);
>         and is defined in ./src/interfaces/libpq/fe-exec.c
>                                void
>                                pqCatenateResultError(PGresult *res, const
> char *msg)
>                               {
>                           PQExpBufferData errorBuf;
>                           if (!res || !msg)
>                              return;
>                           initPQExpBuffer(&errorBuf);
>                           if (res->errMsg)
>                   appendPQExpBufferStr(&errorBuf, res->errMsg);
>                           appendPQExpBufferStr(&errorBuf, msg);
>                           pqSetResultError(res, errorBuf.data);
>                           termPQExpBuffer(&errorBuf);
>                               }
>       To call this function in ./common.c, I include 'libpq-init.h' in
> ./src/bin/psql/common.h .
>       As ./common.c include the header file 'common.h'.
>      But when I use pqCatenateResultError() in ./common.c, It appears
> "undefined reference to pqCatenateResultError()" first.
>
>      So I include 'extern void pqCatenateResultError(PGresult *res, const
> char *msg);' at the begining of './common.c' .
>      But this change make no difference to the result.
>
>       I do not know why this happened! Someone help me!  Thank you.
>       There is another situation similar to the situation above:
>       Function PQexec() is declared in ./src/interfaces/libpq/libpq-fe.h and
> defined in ./src/interfaces/libpq/fe-exec.c
>                              extern PGresult *PQexec(PGconn *conn, const
> char *query);
>       I can call this function with no error happening!
>       These two situation puzzled me!~

Are you developing on Windows, by any chance?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latch implementation that wakes on postmaster death on both win32 and Unix

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 5:47 AM, Peter Geoghegan  wrote:
> I don't think that the way I've phrased my error messages is
> inconsistent with that style guide, excepty perhaps the pipe()
> reference, but if you feel it's important to try and use "could not",
> I have no objections.

I like Fujii's rephrasing - we don't usually mention the name of the
system call.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Adding Japanese README

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 1:49 AM, Hitoshi Harada  wrote:
> 2011/6/30 Itagaki Takahiro :
>> On Thu, Jun 30, 2011 at 09:42, Tatsuo Ishii  wrote:
>>> BTW I will talk to some Japanese-speaking developers about my idea if the
>>> community agrees to add a Japanese README to the source tree, so that I am
>>> not the only one contributing to this project
>>>
 Now, if someone wanted to set up a web site or Wiki page with
 translations, that would be up to them.
>>
>> IMHO, the Wiki approach seems to be reasonable than a README file.
>> It will be suitable for adding non-Japanese translations and
>> non-core developer can join to translate or fix the docs.
>
> +1. If we really want to prove the demand, let's start with Wiki,
> which is less invasive than README (though I doubt such pages would be
> updated in the end.)

We could consider adding a note to the README files pointing people to
an appropriate wiki page for translated versions, so that someone who
just checks out the source tree has a place to go and start looking.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] hint bit cache v6

2011-06-30 Thread Robert Haas
On Thu, Jun 30, 2011 at 12:31 AM, Jim Nasby  wrote:
> Would it be reasonable to keep a second level cache that store individual 
> XIDs instead of blocks? That would provide protection for XIDs that are 
> extremely common but don't have a good fit with the pattern of XID ranges 
> that we're caching. I would expect this to happen if you had a transaction 
> that touched a bunch of data (ie: bulk load or update) some time ago (so the 
> other XIDs around it are less likely to be interesting) but not old enough to 
> have been frozen yet. Obviously you couldn't keep too many XIDs in this 
> secondary cache, but if you're just trying to prevent certain pathological 
> cases then hopefully you wouldn't need to keep that many.

Maybe, but I think that's probably still papering around the problem.
I'd really like to find an algorithm that bounds how often we can
flush a page out of the cache to some number of tuples significantly
greater than 100.  The one I suggested yesterday has that property,
for example, although it may have other problems I'm not thinking of.
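
In rough C, that earlier suggestion amounts to something like this -- a
sketch only, with placeholder names and constants (the 64K-entry counter
array, four cache pages, a margin of 100):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BUCKETS     65536        /* one counter per 64K-XID chunk */
#define CACHE_PAGES 4
#define MARGIN      100          /* hysteresis safety margin */

static uint32_t hits[BUCKETS];       /* bumped on every XID probe */
static int      cached[CACHE_PAGES]; /* bucket numbers currently cached */

/* run every 64K probes or so */
static void
consider_replacement(void)
{
    int best = -1, victim = 0;

    /* find the hottest bucket that is not already cached */
    for (int b = 0; b < BUCKETS; b++)
    {
        bool is_cached = false;

        for (int j = 0; j < CACHE_PAGES; j++)
            if (cached[j] == b)
                is_cached = true;
        if (!is_cached && (best < 0 || hits[b] > hits[best]))
            best = b;
    }

    /* find the coldest page currently in the cache */
    for (int j = 1; j < CACHE_PAGES; j++)
        if (hits[cached[j]] < hits[cached[victim]])
            victim = j;

    /* replace only if the difference clears the safety margin */
    if (best >= 0 && hits[best] > hits[cached[victim]] + MARGIN)
        cached[victim] = best;       /* actually loading the page elided */

    memset(hits, 0, sizeof(hits));   /* restart the sampling period */
}

Since the sweep runs only once per sampling period, an eviction can
happen at most once per 64K probes, no matter how adversarial the
workload gets.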

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Make the visibility map crash-safe.

2011-06-30 Thread Robert Haas
On Tue, Jun 28, 2011 at 12:29 PM, Robert Haas  wrote:
> On Tue, Jun 28, 2011 at 11:44 AM, Cédric Villemain
>  wrote:
>> out of curiosity, does it affect the previous benchmarks of the feature ?
>
> I don't think there's much performance impact, because the only case
> where we actually have to do any real work is when vacuum comes
> through and marks a buffer we're about to update all-visible.  It
> would probably be good to run some tests to verify that, though.  I'll
> try to set something up.

I did some pgbench runs on the 32-core box Nate Boley gave me access
to.  I'm not sure that pgbench is the best workload to test this with,
but I gave it a shot.  I used these settings:

checkpoint_segments=64
shared_buffers=8GB
synchronous_commit=off
checkpoint_completion_target=0.9

I compare the performance of commit
431ab0e82819b31fcd1e33ecb52c2cd3b4b41da7 (post-patch) with commit
431ab0e82819b31fcd1e33ecb52c2cd3b4b41da7 (pre-patch).  I tried 3
15-minute pgbench runs each with one client, eight clients, and
thirty-two clients, on each branch.  I used scale factor 100,
reinitializing the entire cluster after each run.

one client (prepatch)
tps = 616.412009 (including connections establishing)
tps = 619.189040 (including connections establishing)
tps = 605.357710 (including connections establishing)

one client (postpatch)
tps = 583.295139 (including connections establishing)
tps = 609.236975 (including connections establishing)
tps = 597.674354 (including connections establishing)

eight clients (prepatch)
tps = 2635.459096 (including connections establishing)
tps = 2468.973962 (including connections establishing)
tps = 2592.889454 (including connections establishing)

eight clients (postpatch)
tps = 2602.226439 (including connections establishing)
tps = 2644.201228 (including connections establishing)
tps = 2725.760364 (including connections establishing)

thirty-two clients (prepatch)
tps = 3889.761627 (including connections establishing)
tps = 4365.501093 (including connections establishing)
tps = 3816.119328 (including connections establishing)

thirty-two clients (postpatch)
tps = 3888.620044 (including connections establishing)
tps = 3614.252463 (including connections establishing)
tps = 4090.430296 (including connections establishing)

I think it's pretty difficult to draw any firm conclusions from this
data.  The one- and thirty-two-client results came out a bit slower; the
eight-client results came out a bit faster.  But the variation between
branches is less than the inter-run variation on the same branch, so
if there is an effect, this test doesn't clearly capture it.

Having thought about it a bit, I think that if we ran this test for
long enough, we probably could measure a small effect.   pgbench is a
very write-intensive test, so anything that writes additional WAL
records is going to cause checkpoints to happen more frequently, and
there's no denying that this change increases WAL volume slightly.

On another note, these results are also interesting from a scalability
perspective.  The eight-client results are about 4.4x the one-client
results, and the thirty-two-client results are about 6.5x the one-client
results.  Needless to say, there's a lot of room for improvement
there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parameterized aggregate subquery

2011-06-30 Thread Yeb Havinga

On 2011-06-30 09:39, Yeb Havinga wrote:


9) as remarked up a different thread, more than one clause could be 
pushed down a subquery. The current patch only considers alternative 
paths that have at most one clause pushed down. Is this because of the 
call site of best_inner_subqueryscan, i.e. under one join rel? Would 
it be an idea to consider, for each subquery, only a single 
alternative plan that tries to have all suitable clauses pushed down?


Ehm, please forget this remark, I've mistaken join rel for base rel.

-- Yeb


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Simon Riggs
On Wed, Jun 29, 2011 at 7:11 PM, Robert Haas  wrote:

> I don't really see how that's any different from what happens now.  If
> (for whatever reason) the master is generating WAL faster than a
> streaming standby can replay it, then the excess WAL is going to pile
> up someplace, and you might run out of disk space.   Time-delaying the
> standby creates an additional way for that to happen, but I don't
> think it's an entirely new problem.

The only way to control this is with a time delay that can be changed
while the server is running. A recovery.conf parameter doesn't allow
that, so another way is preferable.

I think the time problems are more complex than said. The patch relies
upon transaction completion times, but not all WAL records have a time
attached to them. Plus you only used commits anyway, not sure why.
Some actions aren't even transactional, such as DROP DATABASE, amongst
others. Consecutive records can be hours apart, so it would be
possible to delay on some WAL records but then replay records that
happened minutes ago, then wait hours for the next apply. So this
patch doesn't do what it claims in all cases.
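
Roughly, the logic under discussion amounts to this (a sketch with
hypothetical helper and variable names, just to show where it breaks
down):

static void
maybe_delay_apply(XLogRecord *record)
{
    TimestampTz stamp;

    /* hypothetical: true only for records that carry a completion
     * timestamp, i.e. essentially commit records */
    if (!record_completion_time(record, &stamp))
        return;     /* DROP DATABASE etc.: no timestamp, applied at once */

    /* standby_delay_ms: the configured apply delay (hypothetical GUC) */
    while (GetCurrentTimestamp() <
           TimestampTzPlusMilliseconds(stamp, standby_delay_ms))
        pg_usleep(100000L);     /* poll every 100ms until the delay elapses */
}

Everything without a timestamp sails straight through, and a long gap
between commits turns the delay into bursts of stall-then-catch-up.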

Similar discussion on max_standby_delay covered exactly that ground
and went on for weeks in 9.0. IIRC I presented the same case you just
did and we agreed in the end that was not acceptable. I'm not going to
repeat it. Please check the archives.

So, again +1 for the feature, but -1 for the currently proposed
implementation, based upon review.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.2] Fix leaky-view problem, part 1

2011-06-30 Thread Noah Misch
On Wed, Jun 29, 2011 at 05:05:22PM +0100, Kohei KaiGai wrote:
> 2011/6/28 Noah Misch :
> > On Tue, Jun 28, 2011 at 10:11:59PM +0200, Kohei KaiGai wrote:

> > CREATE VIEW a AS SELECT * FROM ta WHERE ac = 5;
> > ALTER VIEW a OWNER TO alice;
> > CREATE VIEW b AS SELECT * FROM tb WHERE bc = 6;
> > ALTER VIEW b OWNER TO bob;
> > SELECT * FROM a, b;
> >
> > Both the ac=5 and the bc=6 quals do run at the same depth despite enforcing
> > security for different principals.  I can't think of a way that one view
> > owner could use this situation to subvert the security of the other view
> > owner, but I wanted to throw it out.
> >
> Even if view owner set a trap in his view, we have no way to reference 
> variables
> come from outside of the view. In above example, even if I added f_leak() into
> the definition of VIEW A, we cannot give argument to reference VIEW B.

Good point.  Yes, it should be rigorously safe on that account.

> > I was referring to this paragraph:
> >
> >  On the technical side, I am pretty doubtful that the approach of adding a
> >  nestlevel to FuncExpr and RelOptInfo is the right way to go.  I believe we
> >  have existing code (to handle left joins) that prevents quals from being
> >  pushed down too far by fudging the set of relations that are supposedly
> >  needed to evaluate the qual.  I suspect a similar approach would work here.
> >
> It seems to me the latter half of this paragraph is talking about the
> problem of unexpected qualifier push-down over the security barrier; I'm
> trying to solve that problem with the part.2 patch.
> The scenario the part.1 patch tries to solve is the order in which
> qualifiers are launched, not unexpected push-down.

Okay, you're probably correct that it wasn't referring to the topic at hand.
I'm still suspicious of the silent assumption about how quals can be assigned
to plan nodes, but I don't have any concrete ideas for avoiding that.

> >> In addition, the implementation will become complex if qualifiers
> >> pulled up from a security barrier view and qualifiers pulled up from
> >> regular views are mixed within a single qualifier list.
> >
> > I only scanned the part 2 patch, but isn't the bookkeeping already
> > happening for its purposes?  How much more complexity would we get to
> > apply the same strategy to the behavior of this patch?
> >
> If conditional, what criteria should we use to reorder the qualifiers?
> The current patch checks the depth first, then checks cost if the depth
> is the same. It is quite a simple rule. I have no idea what criteria
> would order a mix of qualifiers coming from security-barrier views and
> regular views.

Let's see.  Every qual list will have some depth d such that all quals having
depth >= d are security-relevant, and all others are not security-relevant.
(This does not hold for all means of identifying security-relevant quals, but
it does hold for the CREATE SECURITY VIEW/reloptions strategy proposed in your
part 2 patch.)  Suppose you track whether each Query node represents a
security view, then only increment the qualifier depth for such Query nodes,
rather than all Query nodes.  The tracked depth then becomes a security
partition depth.  Keep the actual sorting algorithm the same.  (Disclaimer: I
haven't been thinking about this nearly as long as you have, so I may be
missing something relatively obvious.)
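
In rough pseudo-C, the ordering rule would keep the same shape as the
current patch; this is only a sketch, and the structure and field names
are hypothetical:

typedef struct QualItem
{
    int     security_depth;     /* bumped only when crossing a security view */
    double  per_tuple_cost;     /* planner's estimated evaluation cost */
    void   *clause;             /* the qual itself */
} QualItem;

static int
qual_item_cmp(const void *a, const void *b)
{
    const QualItem *qa = (const QualItem *) a;
    const QualItem *qb = (const QualItem *) b;

    /* quals from deeper security partitions must be evaluated first */
    if (qa->security_depth != qb->security_depth)
        return (qa->security_depth > qb->security_depth) ? -1 : 1;
    /* within one partition, run the cheaper quals first */
    if (qa->per_tuple_cost != qb->per_tuple_cost)
        return (qa->per_tuple_cost < qb->per_tuple_cost) ? -1 : 1;
    return 0;
}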

As it stands, the patch badly damages the performance of this example:

CREATE FUNCTION expensive(int) RETURNS boolean LANGUAGE sql
AS 'SELECT pg_sleep(1); SELECT true' COST 100;
CREATE TABLE t(c) AS SELECT * FROM generate_series(1,3);
EXPLAIN ANALYZE
SELECT * FROM (SELECT * FROM t WHERE expensive(c)) t0 WHERE c = 2;

That doesn't even use a view, let alone a security view.  While I like the
patch's current simplicity, we need to narrow its impact.

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread Florian Pflug
On Jun29, 2011, at 17:41 , Jeff Davis wrote:
>> Is it? That's actually too bad, since I kinda like it. But anyway,
>> if that's a concern it could also be
>>  range_bounds(ARRAY[1,2]::int8range, '(]')
> 
> What type would the result of that be? What value?

ARRAY[1,2]::int8range would return an int8range instance representing
the range [1,2] (i.e. the set of values {1,2}). range_bounds would then
modify the left bound to be exclusive, and thus return the int8range
(1,2] (i.e. the set of values {2}). range_bounds would have the signature
  range_bounds(anyrange) returns anyrange.

I do think we'll probably want to have functions to modify the boundary
type (open or closed) anyway, so it wouldn't be that huge of a deal if
the range constructor didn't let you specify them.

Empty ranges would be constructed by
  ARRAY[]::int8range
(Hm, ok, now I'm cheating... Currently you'd need to write
ARRAY[]::int8[]::int8range, but fixing that only needs a few lines
in the transformExpression* function that makes ARRAY[]::int8[] work).

>>> * It still suffers similar problems as casting back and forth to text:
>>> ANYARRAY is too general, doesn't really take advantage of the type
>>> system, and not a great fit anyway.
>> 
>> I believe it alleviates the gravest problems of casting back and forth
>> to text. It doesn't have quoting issues and it doesn't potentially lose
>> information.
> 
> I think it still circumvents the type system to a degree. We're just
> putting stuff in an array with no intention of really using it that way.

Well, arrays are containers, and we need two values to construct a range,
so putting them into a container first and then creating the range from that
doesn't seem so bad to me. We don't use the full set of features that arrays
provide, since we only ever expect zero, one or two entries. But I don't
think this is different from functions that only support single-dimensional
arrays - they too choose to use only a subset of the feature set of arrays.

>> In any case, I wouldn't expect this to *stay* the only way to construct
>> a range forever. But it does have its virtues for a first incarnation of
>> range types, I believe, mostly because it's completely unintrusive and
>> won't cause any backwards-compatibility headaches in the future
> 
> I'm not sure that your overloading of arrays is completely immune from
> backwards-compatibility problems, should we decide to change it later.
> 
> But regardless, we have quite a lot of time to make a decision before
> 9.2 is released; so let's do it once and do it right.
> 
>> I fear that the intermediate type will turn out to be quite intrusive,
>> at least if we try to handle all the corner cases and loose ends. And if
>> we don't, I'm concerned that we're painting ourselves into a corner here...
> 
> Can you expand on some of the corner-cases and loose ends you're
> concerned about? Does marking it as a pseudotype and making the IO
> functions throw exceptions handle them?

Hm, I guess. I'm sill no huge fan of RANGEINPUT, but if we prevent
it from being used as a column type and from being used as an argument
type, then I guess it's workable...

Btw, what happened to the idea of making RANGE(...) a special syntactic
construct instead of a normal function call? Did we discard that for its
intrusiveness, or were there other reasons?

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Review of patch Bugfix for XPATH() if expression returns a scalar value

2011-06-30 Thread Florian Pflug
On Jun30, 2011, at 10:33 , Radosław Smogura wrote:
> You may manually enter invalid XML too; you don't need XPATH for this.

Please give an example, instead of merely asserting it.
I get

postgres=# select '<'::xml;
ERROR:  invalid XML content at character 8

> Much more
> In PostgreSQL 9.0.1, compiled by Visual C++ build 1500, 64-bit
> I executed
> SELECT XMLELEMENT(name "a", XMLATTRIBUTES(''::xml as v))
> and it looks like I got the correct output ""
> when I looked into the table files with a text editor, I saw the same value.

I fail to see your point. XMLATTRIBUTES currently escapes
the value unconditionally, which happens to work as expected in
this case. But try
  SELECT XMLELEMENT(name "a", XMLATTRIBUTES('<'::xml as v))
which returns
  
which I guess isn't what one would expect, otherwise one wouldn't
have added the cast to xml.

I believe the latter example should probably return
  
instead, just as XMLELEMENT(name "a", '<'::xml) returns
  <

But again, this has little to do with the patch at hand, and should
therefore be discussed separately. The fact that XPATH's failure to
escape its return value happens to fit XMLATTRIBUTES' failure to
not escape its input value doesn't prove that both functions are
correct. Especially not if other functions like XMLELEMENT *don't*
escape their input values if they're already of type XML.

> But indeed, now PostgreSQL is not type safe against the XML type. See
> SELECT XMLELEMENT(name "root", '<', (xpath('//text()',
> '<root>&lt;</root>'))[1]));

That is *exactly* what my other patch fixes, the one you deem
undesirable. With that other patch applied, I get

       xmlelement       
------------------------
 <root>&lt;&lt;</root>

which is correct. So here too I fail to see your point.

> You found some problem with escaping, but the solution is bad.
> I think the problem lies with the XML type, which may be used to hold
> both plain strings and proper XML.

But it's not *supposed* to hold arbitrary strings. If it was, the
input function wouldn't check whether or not the value was valid.

Do you agree that, for any type, "SELECT value_of_type_X::text::X"
should never throw an error? Because that, IMHO quite basic, guarantee
is broken without proper escaping of XML values. Try

SELECT (XPATH('/text()', '&lt;'::xml))[1]::text::xml

on unpatched 9.0. It fails, which IMNSHO is very clearly a bug.

> For example above query can't distinguish if
> (xpath('//text()', '<'))[1] is xml text, or xml element.

Text *is* a kind of element in XML. There are different types of
nodes, one of which are is "text node". The fact that it's not surrounded
by <..>,  doesn't make a different, and doesn't change the quoting
rules one bit.

> For me, adding support for scalars is really good and needed, but the
> escaping is not.

So far, I remain unconvinced by your arguments. What you argue for
might be sensible if we *didn't* have an XML type and instead simply
had stuff like XPATH() for type TEXT. But we *do* have a type XML,
which *does* validate it's input, so I fail to see how allowing values
of type XML which *don't* pass that input validation can be anything but
a bug.

Compare this to the textual types. We've gone through some pain in
the past to guarantee that textual types never hold values which aren't
valid text in the database's encoding (e.g, we prevent textual values
from containing bytes sequences which aren't valid UTF-8 if the database
encoding is UTF-8). It follows that we should do the same for XML types.

best regards,
Florian Pflug



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in SQL/MED?

2011-06-30 Thread 花田 茂
(2011/06/29 21:23), Albe Laurenz wrote:
> If you invoke any of the SQL/MED CREATE or ALTER commands,
> the validator function is only called if an option list was given.
> 
> That means that you cannot enforce required options at object creation
> time, because the validator function is not always called.
> I consider that unexpected an undesirable behaviour.
> 
> Example:
> CREATE EXTENSION file_fdw;
> CREATE FOREIGN DATA WRAPPER file HANDLER file_fdw_handler VALIDATOR
> file_fdw_validator;
> CREATE SERVER file FOREIGN DATA WRAPPER file;
> CREATE FOREIGN TABLE flat (id integer) SERVER file OPTIONS (format
> 'csv');
> SELECT * FROM flat;
> ERROR:  filename is required for file_fdw foreign tables
> 
> Now file_fdw does not try to enforce the "filename" option, but it
> would be nice to be able to do so.
> 
> The attached patch would change the behaviour so that the validator
> function
> is always called.
> 
> 
> If that change meets with approval, should file_fdw be changed so that
> it
> requires "filename" at table creation time?

I think this proposal is reasonable.  IMHO this fix is enough trivial to
be merged into 9.1 release.

I attached a patch which fixes file_fdw to check the required option
(filename) in its validator function.  I think that such a requirement
should be checked again in PlanForeignScan(), as it has been so far.
Note that this patch requires that fdw.patch has been applied.

With this patch, file_fdw rejects commands which would eliminate the
filename option as a result.  Please see the attached sample.txt for details.

BTW, I noticed that current document says just "the validator function
is called for CREATE FDW/SERVER/FOREIGN TABLE", and doesn't mention
ALTER command or USER MAPPING.  I'll post another patch for this issue
later.

Regards,
-- 
Shigeru Hanada

diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 466c015..57e522f 100644
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
*** file_fdw_validator(PG_FUNCTION_ARGS)
*** 215,220 
--- 215,230 
 */
ProcessCopyOptions(NULL, true, other_options);
  
+   /*
+* filename is a required option.  Validity of other options, including
+* relative ones, has been checked in ProcessCopyOptions().
+* Note: we don't care about its value even though it might be empty,
+* because the COPY command doesn't.
+*/
+   if (catalog == ForeignTableRelationId && filename == NULL)
+   ereport(ERROR,
+   (errmsg("filename is required for file_fdw 
foreign tables")));
+ 
PG_RETURN_VOID();
  }
  
*** fileGetOptions(Oid foreigntableid,
*** 286,295 
}
prev = lc;
}
-   if (*filename == NULL)
-   ereport(ERROR,
-   (errcode(ERRCODE_FDW_UNABLE_TO_CREATE_REPLY),
-errmsg("filename is required for file_fdw 
foreign tables")));
*other_options = options;
  }
  
--- 296,301 
*** filePlanForeignScan(Oid foreigntableid,
*** 308,313 
--- 314,323 
  
/* Fetch options --- we only need filename at this point */
fileGetOptions(foreigntableid, &filename, &options);
+   if (filename == NULL)
+   ereport(ERROR,
+   (errcode(ERRCODE_FDW_UNABLE_TO_CREATE_REPLY),
+errmsg("filename is required for file_fdw 
foreign tables")));
  
/* Construct FdwPlan with cost estimates */
fdwplan = makeNode(FdwPlan);
diff --git a/contrib/file_fdw/input/file_fdw.source 
b/contrib/file_fdw/input/file_fdw.source
index 9ff7235..8d6dfa3 100644
*** a/contrib/file_fdw/input/file_fdw.source
--- b/contrib/file_fdw/input/file_fdw.source
*** CREATE FOREIGN TABLE tbl () SERVER file_
*** 59,64 
--- 59,65 
  ');   -- ERROR
  CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'csv', null '
  ');   -- ERROR
+ CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'csv');
   -- ERROR
  
  CREATE FOREIGN TABLE agg_text (
a   int2,
diff --git a/contrib/file_fdw/output/file_fdw.source 
b/contrib/file_fdw/output/file_fdw.source
index 2ba36c9..6cc6746 100644
*** a/contrib/file_fdw/output/file_fdw.source
--- b/contrib/file_fdw/output/file_fdw.source
*** ERROR:  COPY delimiter cannot be newline
*** 75,80 
--- 75,82 
  CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'csv', null '
  ');   -- ERROR
  ERROR:  COPY null representation cannot use newline or carriage return
+ CREATE FOREIGN TABLE tbl () SERVER file_server OPTIONS (format 'csv');
   -- ERROR
+ ERROR:  filename is required for file_fdw foreign tables
  CREATE FOREIGN TABLE agg_text (
a   int2,
b   float4
postgres=# CREATE FOREIGN TABLE foo () SERVER file;
ERROR:  filename is required for file_fdw foreign tables

Re: [HACKERS] Latch implementation that wakes on postmaster death on both win32 and Unix

2011-06-30 Thread Peter Geoghegan
On 30 June 2011 08:58, Heikki Linnakangas
 wrote:

> Here's a WIP patch with some mostly cosmetic changes I've done this far. I
> haven't tested the Windows code at all yet. It seems that no-one is
> objecting to removing silent_mode altogether, so I'm going to do that before
> committing this patch.

I'm mostly happy with the changes you've made, but I note:

Fujii is of course correct in pointing out that
InitPostmasterDeathWatchHandle() should be a static function.

s/the implementation, as described in unix_latch.c/the implementation/
 - This is my mistake. I see no reason to mention the .c file. Use
ctags.

Minor niggle, but there is a little errant whitespace at the top of
fork_process.c.

(wakeEvents & WL_TIMEOUT) != 0 -- I was going to note that this was
redundant, but then I remembered that stupid MSVC warning, which I
wouldn't have seen here because I didn't use it for my Windows build
due to an infuriating issue with winflex (our own Cygwin-built version
of flex for Windows). I wouldn't have expected that when it was set to
build C though. I'm not sure exactly why it isn't necessary in other
places where we're (arguably) doing the same thing.


On 30 June 2011 07:36, Fujii Masao  wrote:

> ReleasePostmasterDeathWatchHandle() can call ereport(FATAL) before
> StartChildProcess() or BackendStartup() calls on_exit_reset() and resets
> MyProcPid. This looks unsafe. If that ereport(FATAL) is unfortunately
> called, a process other than postmaster would perform the postmaster's
> proc-exit handlers. And that ereport(FATAL) would report wrong pid
> when %p is specified in log_line_prefix. What about closing the pipe in
> ClosePostmasterPorts() and removing ReleasePostmasterDeathWatchHandle()?

Hmm. That is a valid criticism. I'd rather move the
ReleasePostmasterDeathWatchHandle() call into ClosePostmasterPorts()
though.

> http://developer.postgresql.org/pgdocs/postgres/error-style-guide.html
> According to the error style guide, I think that it's better to change the
> following messages:

I don't think that the way I've phrased my error messages is
inconsistent with that style guide, excepty perhaps the pipe()
reference, but if you feel it's important to try and use "could not",
I have no objections.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] time-delayed standbys

2011-06-30 Thread Simon Riggs
On Thu, Jun 30, 2011 at 2:56 AM, Robert Haas  wrote:
> On Wed, Jun 29, 2011 at 9:54 PM, Josh Berkus  wrote:
>>> I am not sure exactly how walreceiver handles it if the disk is full.
>>> I assume it craps out and eventually retries, so probably what will
>>> happen is that, after the standby's pg_xlog directory fills up,
>>> walreceiver will sit there and error out until replay advances enough
>>> to remove a WAL file and thus permit some more data to be streamed.
>>
>> Nope, it gets stuck and stops there.  Replay doesn't advance unless you
>> can somehow clear out some space manually; if the disk is full, the disk
>> is full, and PostgreSQL doesn't remove WAL files without being able to
>> write files first.
>>
>> Manual (or scripted) intervention is always necessary if you reach disk
>> 100% full.
>
> Wow, that's a pretty crappy failure mode... but I don't think we need
> to fix it just on account of this patch.  It would be nice to fix, of
> course.

How is that different to running out of space in the main database?

If I try to pour a pint of milk into a small cup, I don't blame the cup.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011-06-30 Thread Hitoshi Harada
2011/6/30 Yeb Havinga :
> On 2011-06-29 19:22, Hitoshi Harada wrote:
>>
>> Other things are all good points. Thanks for elaborate review!
>> More than anything, I'm going to fix the 6) issue, at least to find the
>> cause.
>>
> Some more questions:
> 8) why are cheapest start path and cheapest total path in
> best_inner_subqueryscan the same?

Because best_inner_indexscan has the two. Actually, one of them is
enough so far. I aligned it with the existing interface, but they could
be merged into one.

> 10) I have a hard time imagining use cases that will actually result in an
> alternative plan, especially since not all subqueries are allowed to have
> quals pushed down into them, and, as Simon Riggs pointed out, many users
> will write queries like this with the subqueries pulled up. If it is the
> case that the subqueries that can't be pulled up have a large overlap with
> the ones that are not pushdown safe (limit, set operations etc), there might
> be few actual use cases for this patch.

I have seen many cases where this planner hack would help
significantly, and which were difficult to rewrite. Why were they
difficult to rewrite? Because the quals on size_m (and they had quals
on size_l too, in fact) are usually very complicated (5-10 op clauses),
and the join+agg part is itself a kind of subquery inside another big
query. Of course there were workarounds, like splitting the statement
in two: filter size_m first, then aggregate size_l by the result of the
first statement. However, that goes against instinct. The reason a
planner is in an RDBMS is to let users write statements as simply as
they need. I don't know if the example I raise here is common or not,
but I believe it represents a "one to many" relation simply, so there
should be many users who just haven't yet noticed that they are hitting
this slow query performance.

> I think the most important thing for this patch to go forward is to have a
> few examples, from which it's clear that the patch is beneficial.

What would be good examples to show the benefit of the patch? I guess
the size_m/size_l test case shows it. What does that case lack, do you
think?

Regards,


-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Review of patch Bugfix for XPATH() if expression returns a scalar value

2011-06-30 Thread Radosław Smogura

On Wed, 29 Jun 2011 22:26:39 +0200, Florian Pflug wrote:

On Jun29, 2011, at 19:57 , Radosław Smogura wrote:

This is review of patch
https://commitfest.postgresql.org/action/patch_view?id=565
"Bugfix for XPATH() if expression returns a scalar value"

SELECT XMLELEMENT(name root, XMLATTRIBUTES(foo.namespace AS sth)) 
FROM (SELECT
(XPATH('namespace-uri(/*)', x))[1] AS namespace FROM (VALUES 
(XMLELEMENT(name
"root", XMLATTRIBUTES('foo;


  xmlelement
-


	It's clearly visible that the value from the attribute is escaped.
XML requires '<' to be escaped in attribute values. So if
XMLATTRIBUTES, like XMLELEMENT, never escaped values which are
already of type XML, the following piece of code

  XMLELEMENT(name "a", XMLATTRIBUTES('<foo/>'::xml as v))

would produce

  <a v="<foo/>"/>

which, alas, is not well-formed :-(

The correct solution, I believe, is to teach XMLATTRIBUTES to
only escape those characters which are invalid in attribute
values but valid in XML (Probably only '<' and '>' but I'll
check). If there are no objections, I'll prepare a separate patch
for that. I don't want to include it in this patch because it's
really a distinct issue (it even modifies a different function).

best regards,
Florian Pflug


You may manually enter invalid XML too; you don't need XPATH for this.
Much more

In PostgreSQL 9.0.1, compiled by Visual C++ build 1500, 64-bit
I executed
SELECT XMLELEMENT(name "a", XMLATTRIBUTES(''::xml as v))
and it looks like I got the correct output ""
when I looked into the table files with a text editor, I saw the same value.

I will check on last master if I can reproduce this.

But indeed, now PostgreSQL is not type safe against the XML type. See
SELECT XMLELEMENT(name "root", '<', (xpath('//text()',
'<root>&lt;</root>'))[1]));


You found some problem with escaping, but the solution is bad. I think
the problem lies with the XML type, which may be used to hold both plain
strings and proper XML. For example, the above query can't distinguish
whether (xpath('//text()', '<root>&lt;</root>'))[1] is XML text or an
XML element.


For me, adding support for scalars is really good and needed, but the
escaping is not.


Regards,
Radosław Smogura


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Latch implementation that wakes on postmaster death on both win32 and Unix

2011-06-30 Thread Heikki Linnakangas

On 30.06.2011 09:36, Fujii Masao wrote:

On Sat, Jun 25, 2011 at 10:41 AM, Peter Geoghegan  wrote:

Attached is a patch that addresses Fujii's third and most recent set of concerns.


Thanks for updating the patch!


I think that Heikki is currently taking another look at my work,
because he indicates in a new message to the list a short time ago
that while reviewing my patch, he realised that there may be an
independent problem with silent_mode. I will wait for his remarks
before producing another version of the patch that incorporates those
two small changes.


Yes, we should wait for the comments from Heikki. But, I have another
comments;


Here's a WIP patch with some mostly cosmetic changes I've done this far. 
I haven't tested the Windows code at all yet. It seems that no-one is 
objecting to removing silent_mode altogether, so I'm going to do that 
before committing this patch.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a7f5373..155acea 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10165,7 +10165,7 @@ retry:
 	/*
 	 * Wait for more WAL to arrive, or timeout to be reached
 	 */
-	WaitLatch(&XLogCtl->recoveryWakeupLatch, 500L);
+	WaitLatch(&XLogCtl->recoveryWakeupLatch, WL_LATCH_SET | WL_TIMEOUT, 500L);
 	ResetLatch(&XLogCtl->recoveryWakeupLatch);
 }
 else
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 6dae7c9..1a2e141 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -93,6 +93,7 @@
 #endif
 
 #include "miscadmin.h"
+#include "postmaster/postmaster.h"
 #include "storage/latch.h"
 #include "storage/shmem.h"
 
@@ -179,31 +180,32 @@ DisownLatch(volatile Latch *latch)
  * Wait for given latch to be set or until timeout is exceeded.
  * If the latch is already set, the function returns immediately.
  *
- * The 'timeout' is given in microseconds, and -1 means wait forever.
- * On some platforms, signals cause the timeout to be restarted, so beware
- * that the function can sleep for several times longer than the specified
- * timeout.
+ * The 'timeout' is given in microseconds. It must be >= 0 if WL_TIMEOUT
+ * event is given, otherwise it is ignored. On some platforms, signals cause
+ * the timeout to be restarted, so beware that the function can sleep for
+ * several times longer than the specified timeout.
  *
  * The latch must be owned by the current process, ie. it must be a
  * backend-local latch initialized with InitLatch, or a shared latch
  * associated with the current process by calling OwnLatch.
  *
- * Returns 'true' if the latch was set, or 'false' if timeout was reached.
+ * Returns bit field indicating which condition(s) caused the wake-up. Note
+ * that if multiple wake-up conditions are true, there is no guarantee that
+ * we return all of them in one call, but we will return at least one.
  */
-bool
-WaitLatch(volatile Latch *latch, long timeout)
+int
+WaitLatch(volatile Latch *latch, int wakeEvents, long timeout)
 {
-	return WaitLatchOrSocket(latch, PGINVALID_SOCKET, false, false, timeout) > 0;
+	return WaitLatchOrSocket(latch, wakeEvents, PGINVALID_SOCKET, timeout);
 }
 
 /*
- * Like WaitLatch, but will also return when there's data available in
- * 'sock' for reading or writing. Returns 0 if timeout was reached,
- * 1 if the latch was set, 2 if the socket became readable or writable.
+ * Like WaitLatch, but with an extra socket argument for WL_SOCKET_*
+ * conditions.
  */
 int
-WaitLatchOrSocket(volatile Latch *latch, pgsocket sock, bool forRead,
-  bool forWrite, long timeout)
+WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
+  long timeout)
 {
 	struct timeval tv,
 			   *tvp = NULL;
@@ -212,19 +214,26 @@ WaitLatchOrSocket(volatile Latch *latch, pgsocket sock, bool forRead,
 	int			rc;
 	int			result = 0;
 
+	Assert(wakeEvents != 0);
+
+	/* Ignore WL_SOCKET_* events if no valid socket is given */
+	if (sock == PGINVALID_SOCKET)
+		wakeEvents &= ~(WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE);
+
 	if (latch->owner_pid != MyProcPid)
 		elog(ERROR, "cannot wait on a latch owned by another process");
 
 	/* Initialize timeout */
-	if (timeout >= 0)
+	if (wakeEvents & WL_TIMEOUT)
 	{
+		Assert(timeout >= 0);
 		tv.tv_sec = timeout / 100L;
 		tv.tv_usec = timeout % 100L;
 		tvp = &tv;
 	}
 
 	waiting = true;
-	for (;;)
+	do
 	{
 		int			hifd;
 
@@ -235,16 +244,28 @@ WaitLatchOrSocket(volatile Latch *latch, pgsocket sock, bool forRead,
 		 * do that), and the select() will return immediately.
 		 */
 		drainSelfPipe();
-		if (latch->is_set)
+		if (latch->is_set && (wakeEvents & WL_LATCH_SET))
 		{
-			result = 1;
+			result |= WL_LATCH_SET;
+			/*
+			 * Leave loop immediately, avoid blocking again. We don't attempt
+			 * to report any other events that might also be satisfied.
+			 */
 		

Re: [HACKERS] Parameterized aggregate subquery (was: Pull up aggregate subquery)

2011-06-30 Thread Yeb Havinga

On 2011-06-29 19:22, Hitoshi Harada wrote:

Other things are all good points. Thanks for elaborate review!
More than anything, I'm going to fix the 6) issue, at least to find the cause.


Some more questions:
8) why are cheapest start path and cheapest total path in 
best_inner_subqueryscan the same?


9) as remarked up a different thread, more than one clause could be 
pushed down a subquery. The current patch only considers alternative 
paths that have at most one clause pushed down. Is this because of the 
call site of best_inner_subqueryscan, i.e. under one join rel? Would it 
be an idea to consider, for each subquery, only a single alternative 
plan that tries to have all suitable clauses pushed down?


10) I have a hard time imagining use cases that will actually result in 
an alternative plan, especially since not all subqueries are allowed to 
have quals pushed down into them, and, as Simon Riggs pointed out, many 
users will write such queries with the subqueries already pulled up. If 
the subqueries that can't be pulled up largely overlap with the ones that 
are not pushdown-safe (LIMIT, set operations, etc.), there might be few 
actual use cases for this patch.


I think the most important thing for this patch to move forward is to have 
a few examples that make it clear the patch is beneficial.
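
For instance, something along these lines (hypothetical tables, just to 
sketch the kind of query I mean): the subquery below cannot be pulled up 
because of the GROUP BY, and without the patch it aggregates all of 
orders even though only one customer's rows are needed. Pushing the join 
clause down as a parameter would let it aggregate only the matching rows:

SELECT c.name, t.total
  FROM customers c
  JOIN (SELECT customer_id, sum(amount) AS total
          FROM orders
         GROUP BY customer_id) t
    ON t.customer_id = c.id
 WHERE c.name = 'ACME';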


regards,

--

Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in SQL/MED?

2011-06-30 Thread Albe Laurenz
I wrote:

> If you invoke any of the SQL/MED CREATE or ALTER commands,
> the validator function is only called if an option list was given.

[...]
> Example:
[...]

The example is misleading. Here is a better one:

CREATE SERVER myoradb FOREIGN DATA WRAPPER oracle_fdw OPTIONS (foo 'bar');
ERROR:  invalid option "foo"
HINT:  Valid options in this context are: dbserver, user, password

but:
CREATE SERVER myoradb FOREIGN DATA WRAPPER oracle_fdw;
gives no error, because the validator function is never invoked.

Yours,
Laurenz Albe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread Florian Pflug
On Jun30, 2011, at 09:05 , Peter Eisentraut wrote:
> On tor, 2011-06-30 at 08:45 +0200, Florian Pflug wrote:
>> I don't think it will - as it stands, there isn't a single collatable
>> type RANGE but instead one *distinct* type per combination of base
>> type, btree opclass and collation. The reasons for that were discussed
>> at length - the basic argument for doing it that way was to make a
>> range represent a fixed set of values.
> 
> How would the system catalogs be initialized under that theory: surely
> you're not going to seed (nr. of types) * (nr. of collations) * (nr. of
> opclasses) range types in initdb?

There's CREATE RANGE. By default, no range types would exist, I believe.
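
For example (the spelling is purely illustrative, nothing is settled), a
user would declare just the combinations actually needed, something like

  CREATE RANGE textrange_en (subtype = text, collation = "en_US");
  CREATE RANGE textrange_c  (subtype = text, collation = "C");

each producing a distinct range type with its collation and opclass baked in.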

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Range Types, constructors, and the type system

2011-06-30 Thread Peter Eisentraut
On tor, 2011-06-30 at 08:45 +0200, Florian Pflug wrote:
> I don't think it will - as it stands, there isn't a single collatable
> type RANGE but instead one *distinct* type per combination of base
> type, btree opclass and collation. The reasons for that were discussed
> at length - the basic argument for doing it that way was to make a
> range represent a fixed set of values.

How would the system catalogs be initialized under that theory: surely
you're not going to seed (nr. of types) * (nr. of collations) * (nr. of
opclasses) range types in initdb?
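
(For scale: seeding even a modest 30 base types, 100 collations, and 2
opclasses apiece would already mean 30 * 100 * 2 = 6000 built-in range
types.)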


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Adding Japanese README

2011-06-30 Thread Peter Eisentraut
On tor, 2011-06-30 at 09:42 +0900, Tatsuo Ishii wrote:
> Why don't you worry about message translations then? Translation is a
> human process and there's no way to guarantee the translation is
> perfect.

At least for message translations we have a process and sophisticated
tools that track whether a translation has been done, and when and by
whom.  If you are going to create additional translation maintenance
effort, integrating it into the existing PO-based system would, I'm
sure, make the objections fewer (or at least different).

And another important point: the PO-based system never shows outdated
translations to the user.

But if we're adding another step to release preparation where someone
has to chase down half a dozen people to update some file, that creates
a lot more burden and more bottlenecks, including the possibility of
holding up minor releases or having to remove files from them.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers