Re: [HACKERS] per-column FDW options, v5

2011-08-08 Thread Shigeru Hanada
Sorry, I missed sending a copy to the list, so I've quoted the off-list discussion.

 On Aug 5, 2011, at 7:59 PM, Shigeru Hanada shigeru.han...@gmail.com
wrote:

 2011/8/6 Robert Haas robertmh...@gmail.com:
 Done.

 Thanks!

 Incidentally, I notice that if you do:

 \d some_foreign_table

 ...the table-level options are not displayed.  We probably ought to do
 something about that...

 Currently table-level options are shown in the result of the \det+ command
 (only verbose mode), in the same style as FDWs and foreign servers.

 But \d is more popular for describing tables, so moving table-level
 options from \det+ to \d might be better.  Thoughts?

(2011/08/06 9:26), Robert Haas wrote:
 I'd show it both places.

After taking a look at describe.c, I think several styles are applicable to
FDW options in the result of the \d command.

(1) simple array literal style
Show table-level FDW options like other FDW options.  It is simply the
result of array_out(); each option is shown as key=value, quoted if
necessary and delimited by ','.  The whole line is surrounded by { and }.
If an element includes any character which needs to be escaped, the
element is quoted with double quotes.

Ex)
FDW Options: {"delimiter=,","quote=\""}
# delimiter is a comma, and quote is a double-quote

(2) reloptions style
Show FDW options like the reloptions in \d+ output.  Each option is shown
as key=value without quoting.  Some special characters might make it a
little illegible.

Ex)
FDW Options: delimiter=,, quote="
# delimiter is a comma, and quote is a double-quote

(3) OPTIONS clause style
Show FDW options as they would appear in an OPTIONS clause.  Each option
is shown as key 'value', delimited with ','.

Ex)
FDW Options: delimiter ',', quote 
# delimiter is a comma, and quote is a single-quote

(1) can be implemented with minimal change, and it also keeps the
behavior consistent with other existing FDW objects.  But IMHO (3) is the
most readable, though it breaks backward compatibility regarding the
format of FDW options shown in the result of \d* commands.  Thoughts?

BTW, I found a wrong description of per-table FDW options; the psql
documentation says that \d+ shows table-level FDW options, but actually it
doesn't.  I'll post this issue in a new thread.

Regards,
-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] force_not_null option support for file_fdw

2011-08-08 Thread Shigeru Hanada
Hi,

I propose to support the force_not_null option for file_fdw too.

In the 9.1 development cycle, file_fdw was implemented on top of exported
COPY FROM routines, but the force_not_null option has not been
supported yet.

Originally, in COPY FROM, force_not_null is specified as a list of
columns which are not matched against the null string.  Now that per-column
FDW options are available, I implemented force_not_null as a boolean
value for each column, to avoid parsing a column list in file_fdw.

True means that the column is not matched against the null string; it's
equivalent to specifying the column's name in the force_not_null option of
COPY FROM.  The default value is false.
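
The body of get_force_not_null() is cut off in the quoted patch below; as a
rough sketch (not the attached patch itself), such a helper could build the
column-name list COPY expects from the new per-column options API like this:

    /*
     * Sketch only: collect names of columns whose per-column
     * force_not_null option is true, in the list-of-names form that
     * COPY's force_not_null expects.
     */
    static List *
    get_force_not_null(Oid relid)
    {
        List       *columns = NIL;
        Relation    rel = heap_open(relid, AccessShareLock);
        TupleDesc   tupdesc = RelationGetDescr(rel);
        AttrNumber  attnum;

        for (attnum = 1; attnum <= tupdesc->natts; attnum++)
        {
            Form_pg_attribute attr = tupdesc->attrs[attnum - 1];
            ListCell   *lc;

            if (attr->attisdropped)
                continue;

            foreach(lc, GetForeignColumnOptions(relid, attnum))
            {
                DefElem    *def = (DefElem *) lfirst(lc);

                if (strcmp(def->defname, "force_not_null") == 0 &&
                    defGetBoolean(def))
                    columns = lappend(columns,
                                      makeString(pstrdup(NameStr(attr->attname))));
            }
        }
        heap_close(rel, AccessShareLock);
        return columns;
    }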

The patch includes changes to the code, documentation and regression tests.
Any comments/questions are welcome.

Regards,
-- 
Shigeru Hanada
diff --git a/contrib/file_fdw/data/text.csv b/contrib/file_fdw/data/text.csv
index ...bd4c327 .
*** a/contrib/file_fdw/data/text.csv
--- b/contrib/file_fdw/data/text.csv
***
*** 0 
--- 1,4 
+ 123,123
+ abc,abc
+ NULL,NULL
+ ABC,ABC
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 224e74f..e846176 100644
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***
*** 23,30 
--- 23,32 
  #include "foreign/fdwapi.h"
  #include "foreign/foreign.h"
  #include "miscadmin.h"
+ #include "nodes/makefuncs.h"
  #include "optimizer/cost.h"
  #include "utils/rel.h"
+ #include "utils/syscache.h"
  
  PG_MODULE_MAGIC;
  
*** static struct FileFdwOption valid_option
*** 57,72 
	{"escape", ForeignTableRelationId},
	{"null", ForeignTableRelationId},
	{"encoding", ForeignTableRelationId},
  
/*
 * force_quote is not supported by file_fdw because it's for COPY TO.
 */
  
-   /*
-* force_not_null is not supported by file_fdw.  It would need a parser
-* for list of columns, not to mention a way to check the column list
-* against the table.
-*/
  
/* Sentinel */
{NULL, InvalidOid}
--- 59,70 
	{"escape", ForeignTableRelationId},
	{"null", ForeignTableRelationId},
	{"encoding", ForeignTableRelationId},
+	{"force_not_null", AttributeRelationId},	/* specified as boolean value */
  
/*
 * force_quote is not supported by file_fdw because it's for COPY TO.
 */
  
  
/* Sentinel */
{NULL, InvalidOid}
*** static void fileGetOptions(Oid foreignta
*** 112,117 
--- 110,116 
  static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
   const char *filename,
   Cost *startup_cost, Cost *total_cost);
+ static List * get_force_not_null(Oid relid);
  
  
  /*
*** file_fdw_validator(PG_FUNCTION_ARGS)
*** 145,150 
--- 144,150 
List   *options_list = untransformRelOptions(PG_GETARG_DATUM(0));
Oid catalog = PG_GETARG_OID(1);
char   *filename = NULL;
+   char   *force_not_null = NULL;
List   *other_options = NIL;
ListCell   *cell;
  
*** file_fdw_validator(PG_FUNCTION_ARGS)
*** 198,204 
 buf.data)));
}
  
!	/* Separate out filename, since ProcessCopyOptions won't allow it */
	if (strcmp(def->defname, "filename") == 0)
{
if (filename)
--- 198,207 
 buf.data)));
}
  
!	/*
!	 * Separate out filename and force_not_null, since ProcessCopyOptions
!	 * won't allow them.
!	 */
	if (strcmp(def->defname, "filename") == 0)
{
if (filename)
*** file_fdw_validator(PG_FUNCTION_ARGS)
*** 207,212 
--- 210,229 
				 errmsg("conflicting or redundant options")));
filename = defGetString(def);
}
+	else if (strcmp(def->defname, "force_not_null") == 0)
+   {
+		if (force_not_null)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("conflicting or redundant options")));
+
+		force_not_null = defGetString(def);
+		if (strcmp(force_not_null, "true") != 0 &&
+			strcmp(force_not_null, "false") != 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_SYNTAX_ERROR),
+					 errmsg("force_not_null must be true or false")));
+   }
else
other_options = 

Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Kohei KaiGai
2011/8/7 Tom Lane t...@sss.pgh.pa.us:
 Kohei KaiGai kai...@kaigai.gr.jp writes:
 I'm implementing this code according to the suggestion.
 However, I'm not sure whether it is a really portable way (at least, GCC
 accepts it),
 and whether the interface is simpler than what Robert suggested at first.

 #define get_object_property_attnum_name(objtype)                        \
     ({  AttrNumber temp;                                            \
         get_object_property((objtype), NULL, NULL, NULL, NULL,          \
                             &temp, NULL, NULL);                     \
         temp; })

 Blocks within expressions are a gcc-ism and will fail on any other
 compiler, so you can't do it that way.

Thanks for your suggestion.
So, it seems to me the interface should return a pointer to the specified
entry of the array, rather than the above approach.

E.g., the above macro could probably be rewritten as follows:
  #define get_object_property_attnum_name(objtype) \
  (get_object_property(objtype)->attnum_name)
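
For illustration, here is a rough sketch of what the table-driven version
could look like (the names here are hypothetical, following the thread):

    typedef struct
    {
        ObjectType  objtype;
        Oid         class_oid;          /* catalog the object lives in */
        AttrNumber  attnum_name;        /* attnum of its name column */
        AttrNumber  attnum_namespace;   /* attnum of its namespace column */
    } ObjectPropertyType;

    static const ObjectPropertyType ObjectProperty[] =
    {
        {OBJECT_TABLE, RelationRelationId,
         Anum_pg_class_relname, Anum_pg_class_relnamespace},
        /* ... one entry per supported object type ... */
    };

    static const ObjectPropertyType *
    get_object_property(ObjectType objtype)
    {
        int         i;

        for (i = 0; i < lengthof(ObjectProperty); i++)
        {
            if (ObjectProperty[i].objtype == objtype)
                return &ObjectProperty[i];
        }
        elog(ERROR, "unrecognized object type: %d", (int) objtype);
        return NULL;            /* keep compiler quiet */
    }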

-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] per-column FDW options, v5

2011-08-08 Thread Shigeru Hanada
(2011/07/29 17:37), Shigeru Hanada wrote:
 I also attached a rebased version of force_not_null patch, which adds
 force_not_null option support to file_fdw.  This is a use case of
 per-column FDW option.

[just for redirection]

Robert has committed only per_column_option patch. So I posted
force_not_null patch in a new thread, and added it to CF 2011-09 as an
independent new item.

https://commitfest.postgresql.org/action/patch_view?id=615

Regards,
-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] plperl crash with Debian 6 (64 bit), pl/perlu, libwww and https

2011-08-08 Thread Alex Hunsaker
On Sun, Aug 7, 2011 at 17:06, Tim Bunce tim.bu...@pobox.com wrote:

 On Sat, Aug 06, 2011 at 12:37:28PM -0600, Alex Hunsaker wrote:
 ...
 Find attached a version that does the equivalent of local %SIG for
 each pl/perl(u) call.

 +     gv = gv_fetchpv("SIG", 0, SVt_PVHV);
 +     save_hash(gv);                  /* local %SIG */

 ... [ local %SIG doesn't work ] The %SIG does become empty but the OS
 level handlers, even those installed by perl, *aren't changed*:

Looks like I trusted too much in $SIG{'ALRM'} being undef after it had been
set in a different scope :-( Thanks for pointing this out.

 That sure seems like a bug (I'll check with the perl5-porters list).

Well, even if it was deemed a bug, it doesn't do us any good.

 Localizing an individual element of %SIG works fine.
 In C that's something like this (untested):

    hv = gv_fetchpv("SIG", 0, SVt_PVHV);
    keysv = ...SV containing "ALRM"...
    he = hv_fetch_ent(hv, keysv, 0, 0);
    if (he) {  /* arrange to restore existing elem */
        save_helem_flags(hv, keysv, HeVAL(he), SAVEf_SETMAGIC);
    }
    else {     /* arrange to delete a new elem */
        SAVEHDELETE(hv, keysv);
    }

I played with this a bit... and found yes, it locals them, but no, it
does not fix the reported problem. After playing with things a bit
more I found even local $SIG{'ALRM'} = ...; alarm(1); still results
in postgres crashing. To wit, local does squat. AFAICT it just resets
the signal handler back to the default with SIG_DFL. (Which in
hindsight I don't know what else I expected it to do...)

So I think for this to be robust we would have to detect what signals
they set and then reset those back to what postgres wants. Doable, but
is it worth it? Anyone else have any bright ideas?

Find below my test case and attached a patch that locals individual
%SIG elements the way mentioned above.

=> set statement_timeout to '5s';
SET

=> create or replace function test_alarm() returns void as $$ local
$SIG{'ALRM'} = sub { warn "alarm"; }; alarm(1); sleep 2; $$ language
plperlu;
CREATE FUNCTION

=> select test_alarm();
WARNING:  alarm at line 1.
CONTEXT:  PL/Perl function test_alarm
 test_alarm
------------
 
(1 row)

=> select pg_sleep(6);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Server Log:
WARNING:  alarm at line 1.
CONTEXT:  PL/Perl function test_alarm
LOG:  server process (PID 32659) was terminated by signal 14: Alarm clock
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
FATAL:  the database system is in recovery mode
*** a/src/pl/plperl/expected/plperl.out
--- b/src/pl/plperl/expected/plperl.out
***
*** 639,641  CONTEXT:  PL/Perl anonymous code block
--- 639,643 
  DO $do$ use warnings FATAL => qw(void) ; my @y; my $x = sort @y; 1; $do$ LANGUAGE plperl;
  ERROR:  Useless use of sort in scalar context at line 1.
  CONTEXT:  PL/Perl anonymous code block
+ DO $do$ die "SIG ALRM is set: $SIG{'ALRM'}" if($SIG{'ALRM'}); $SIG{'ALRM'} = sub { print "alarm!\n"}; $do$ LANGUAGE plperl;
+ DO $do$ die "SIG ALRM is set: $SIG{'ALRM'}" if($SIG{'ALRM'}); $do$ LANGUAGE plperl;
*** a/src/pl/plperl/plperl.c
--- b/src/pl/plperl/plperl.c
***
*** 268,273  static void plperl_inline_callback(void *arg);
--- 268,275 
  static char *strip_trailing_ws(const char *msg);
  static OP  *pp_require_safe(pTHX);
  static void activate_interpreter(plperl_interp_desc *interp_desc);
+ static void local_sigs(void);
+ static void local_sig(HV *hv, SV *tmpsv, const char *signame);
  
  #ifdef WIN32
  static char *setlocale_perl(int category, char *locale);
***
*** 1901,1906  plperl_call_perl_func(plperl_proc_desc *desc, FunctionCallInfo fcinfo)
--- 1903,1910 
  	ENTER;
  	SAVETMPS;
  
+ 	local_sigs();
+ 
  	PUSHMARK(SP);
  	EXTEND(sp, desc-nargs);
  
***
*** 1968,1973  plperl_call_perl_func(plperl_proc_desc *desc, FunctionCallInfo fcinfo)
--- 1972,2028 
  	return retval;
  }
  
+ /*
+  * local() all of our sig handlers: some modules like LWP like to set an
+  * alarm sig handler for things like network timeouts, and this can cause
+  * bad stuff to happen (not to mention what happens if someone sets USR1).
+  *
+  * For now we just local() them all, so they should get reset back to what
+  * postgres expects when the pl function is done.
+  */
+ static void
+ local_sigs(void)
+ {
+ 	HV	*hv;
+ 	SV	*sv = newSV(9);
+ 	int i;
+ 
+ 	hv = get_hv("SIG", 0);
+ 	if (!hv)
+ 		elog(ERROR, "couldn't fetch %%SIG");
+ 
+ 	/*
+ 	 * char 

Re: [HACKERS] plperl crash with Debian 6 (64 bit), pl/perlu, libwww and https

2011-08-08 Thread Tim Bunce
On Mon, Aug 08, 2011 at 01:23:08AM -0600, Alex Hunsaker wrote:
 On Sun, Aug 7, 2011 at 17:06, Tim Bunce tim.bu...@pobox.com wrote:
 
  Localizing an individual element of %SIG works fine.
  In C that's something like this (untested):
 
     hv = gv_fetchpv("SIG", 0, SVt_PVHV);
     keysv = ...SV containing "ALRM"...
     he = hv_fetch_ent(hv, keysv, 0, 0);
     if (he) {  /* arrange to restore existing elem */
         save_helem_flags(hv, keysv, HeVAL(he), SAVEf_SETMAGIC);
     }
     else {     /* arrange to delete a new elem */
         SAVEHDELETE(hv, keysv);
     }
 
 I played with this a bit... and found yes, it locals them, but no, it
 does not fix the reported problem. After playing with things a bit
 more I found even local $SIG{'ALRM'} = ...; alarm(1); still results
 in postgres crashing. To wit, local does squat. AFAICT it just resets
 the signal handler back to the default with SIG_DFL. (Which in
 hindsight I don't know what else I expected it to do...)

Ah, yes. Hindsight is great. I should have spotted that. Sorry.

 So I think for this to be robust we would have to detect what signals
 they set and then reset those back to what postgres wants. Doable, but
 is it worth it? Anyone else have any bright ideas?

I'm only considering ALRM. At least that's the only one that seems worth
offering some limited support for. The others fall under "don't do that".

After giving it some more thought it seems reasonable to simply force the
SIGALRM handler back to postgres when a plperlu function returns:

pqsignal(SIGALRM, handle_sig_alarm);
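
A sketch of the same idea generalized to a handful of signals (the signal
list and the wrapper are my illustration, not part of the proposal above):

    #include <signal.h>

    static const int guarded_signals[] = {SIGALRM, SIGUSR1, SIGUSR2};
    #define N_GUARDED (sizeof(guarded_signals) / sizeof(guarded_signals[0]))

    static void
    call_perl_guarded(void (*perl_call) (void))
    {
        struct sigaction saved[N_GUARDED];
        int         i;

        /* snapshot the OS-level handlers before calling into Perl */
        for (i = 0; i < N_GUARDED; i++)
            sigaction(guarded_signals[i], NULL, &saved[i]);

        perl_call();            /* may install its own handlers */

        /* force the handlers back to what postgres expects */
        for (i = 0; i < N_GUARDED; i++)
            sigaction(guarded_signals[i], &saved[i], NULL);
    }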

Tim.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Transient plans versus the SPI API

2011-08-08 Thread Anssi Kääriäinen

On 08/07/2011 12:25 PM, Hannu Krosing wrote:

On Sun, 2011-08-07 at 11:15 +0200, Hannu Krosing wrote:

On Wed, 2011-08-03 at 15:19 -0400, Tom Lane wrote:

Hm, you mean reverse-engineering the parameterization of the query?

Yes, basically re-generate the query after (or while) parsing, replacing
constants and arguments with another set of generated arguments and
printing the list of these arguments at the end. It may be easiest to do
this in parallel with parsing.


Interesting thought, but I really don't see a way to make it practical.

Another place where this could be really useful is logging & monitoring

If there were an option to log the above queries as

SELECT a, b FROM foo WHERE c = $1, (123)
SELECT a, b FROM foo WHERE c = $1, (97)
SELECT a, b FROM foo WHERE c = $1, (236)

The main monitoring use case would be pg_stat_statements,
http://developer.postgresql.org/pgdocs/postgres/pgstatstatements.html
which is currently pretty useless for queries without parameters


I was trying to implement something similar for pgpool-II. The user
could configure queries for which cached plans are wanted. The
configuration would have been a file containing lines in the format
"SELECT * FROM foo WHERE id = ?". I did not get anything implemented, as there
were some problems. The problems were mainly with DEALLOCATE ALL called
without pgpool-II knowing it, issues with search_path and the amount of
work needed to implement parse tree matching.


It would be interesting if pg_stat_statements would be globally 
available with queries using generic arguments. First, there would be an 
obvious heuristic for when to cache the plan: If the average runtime of 
the query is much larger than the average planning time, there is no 
point in caching the plan. This would also give one option for cache hit 
estimation. The hit_percent is directly available. On the other hand 
pg_stat_statements could easily become a choke-point.


I would love to work on this, but I lack the needed skills. Maybe I 
could take another try for writing a proof-of-concept parse tree 
transformer and matcher, but I doubt I can produce anything useful.


 - Anssi

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-08-08 Thread Heikki Linnakangas

On 07.08.2011 22:28, Alexander Korotkov wrote:

Here is the latest version of the patch. Here is the list of the most
significant changes in comparison with your version of the patch:
1) Reference counting of path items was added. It should help against
possible accumulation of path items.


Ok.


2) Neighbor relocation was added.


Ok. I think we're going to need some sort of a heuristic on when to 
enable neighbor relocation. If I remember the performance tests 
correctly, it improves the quality of the resulting index, but incurs a 
significant CPU overhead.


Actually, looking at the performance numbers on the wiki page again 
(http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011), it 
looks like neighbor relocation doesn't help very much with the index 
quality - sometimes it even results in a slightly worse index. Based on 
those results, shouldn't we just remove it? Or is there some other data 
set where it helps significantly?



3) Subtree prefetching was added.


I'm inclined to leave out the prefetching code for now. Unless you have 
some performance numbers that show that it's critical for the overall 
performance. But I don't think that was the case, it's just an 
additional optimization for servers with big RAID arrays. So, please 
separate that into an add-on patch. It needs to be performance tested and 
reviewed separately.



4) The final emptying algorithm was reverted to the original one. My
experiments show that the typical number of passes in final emptying in
your version of the patch is about 5. That may be significant in itself.
Also, I have no estimate of the number of passes and no guarantee that it
will not be high in some corner cases. I.e. I prefer the more predictable
single-pass algorithm even though it's a little more complex.


I was trying to get rid of that complexity during index build. Some 
extra code in the final pass would be easier to understand than extra 
work that needs to be done through the index build. It's not a huge 
amount of code, but still.


I'm not worried about the extra CPU overhead of scanning the data 
structures at the final pass. I guess in my patch you had to do extra 
I/O as well, because the buffers were not emptied in strict top-down 
order, so let's avoid that. How about:


Track all buffers in the lists, not only those that are non-empty. Add 
the buffer to the right list at getNodeBuffer(). That way in the final 
stage, you need to scan through all buffers instead of just the 
non-empty ones. But the overhead of that should be minimal in 
practice, scanning some in-memory data structures is pretty cheap 
compared to building an index. That way you wouldn't need to maintain 
the lists during the index build, except for adding each buffer to 
correct lists in getNodeBuffer().


BTW, please use List for the linked lists. No need to re-implement the 
wheel.
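
To illustrate with the existing List API (a sketch; GISTNodeBuffer,
levelLists and empty_buffer are hypothetical names following the patch):

    #include "nodes/pg_list.h"

    /* one List of buffers per level; sized to the tree depth elsewhere */
    static List **levelLists;

    static void empty_buffer(GISTNodeBuffer *buf);      /* hypothetical */

    /* in getNodeBuffer(): register every buffer, empty or not */
    static void
    register_buffer(int level, GISTNodeBuffer *nodeBuffer)
    {
        /* lappend handles the initial NIL case itself */
        levelLists[level] = lappend(levelLists[level], nodeBuffer);
    }

    /* final emptying: strict top-down, one scan over each level's list */
    static void
    final_emptying(int maxLevel)
    {
        int         level;

        for (level = maxLevel; level >= 0; level--)
        {
            ListCell   *lc;

            foreach(lc, levelLists[level])
                empty_buffer((GISTNodeBuffer *) lfirst(lc));
        }
    }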



5) Unloading even the last page of a node buffer from main memory to disk.
Imagine that with levelstep = 1 each inner node has a buffer. It
seems to me that keeping one page of each buffer in memory may be memory
consuming.

Open items I see at this moment:
1) I dislike my method of switching to the buffering build because it's
based on very unproven assumptions. And I haven't found more reliable
assumptions in scientific papers so far. I would like to replace it with
something much simpler. For example, switching to the buffering build when
the regular build actually starts to produce a lot of IO. To implement this
approach I need to somehow detect actual IO (not just a buffer read, but a
miss of the OS cache).


Yeah, that's a surprisingly hard problem. I don't much like the method 
used in the patch either.



2) I'm worried about the possible size of nodeBuffersTab and path items. If
we imagine an extremely large tree with levelstep = 1, the size of these
data structures will be significant. And it's hard to predict that size
without knowing the tree size.


I'm not very worried about that in practice. If you have a very large 
index, you presumably have a fair amount of memory too. Otherwise the 
machine is horrendously underpowered to build or do anything useful with 
the index anyway. Nevertheless it would be nice to have some figures on 
that. If you have, say, an index of 1 TB in size, how much memory will 
building the index need?


Miscellaneous observations:

* Please run pgindent over the code, there's a lot of spurious 
whitespace in the patch.
* How about renaming GISTLoadedPartItem to something like 
GISTBulkInsertStack, to resemble the GISTInsertStack struct used in the 
normal insertion code. The "loaded part" nomenclature is obsolete, as 
the patch doesn't explicitly load parts of the tree into memory anymore. 
Think about the names of other structs, variables and functions too, 
GISTLoadedPartItem just caught my eye first but there's probably others 
that could have better names.

* Any user-visible options need to be documented in the user manual.
* And of course, make sure comments and the readme are up-to-date.
* Compiler 

Re: [HACKERS] Transient plans versus the SPI API

2011-08-08 Thread Hannu Krosing
On Mon, 2011-08-08 at 11:39 +0300, Anssi Kääriäinen wrote:
 On 08/07/2011 12:25 PM, Hannu Krosing wrote:
  On Sun, 2011-08-07 at 11:15 +0200, Hannu Krosing wrote:
  On Wed, 2011-08-03 at 15:19 -0400, Tom Lane wrote:
  Hm, you mean reverse-engineering the parameterization of the query?
  Yes, basically re-generate the query after (or while) parsing, replacing
  constants and arguments with another set of generated arguments and
  printing the list of these arguments at the end. It may be easiest to do
  this in parallel with parsing.
 
  Interesting thought, but I really don't see a way to make it practical.
  Another place where this could be really useful is logging & monitoring
 
  If there were an option to log the above queries as
 
  SELECT a, b FROM foo WHERE c = $1, (123)
  SELECT a, b FROM foo WHERE c = $1, (97)
  SELECT a, b FROM foo WHERE c = $1, (236)
  The main monitoring use case would be pg_stat_statements,
  http://developer.postgresql.org/pgdocs/postgres/pgstatstatements.html
  which is currently pretty useless for queries without parameters
 
 I was trying to implement something similar for pgpool-II. The user 
 could configure queries for which cached plans are wanted. The 
 configuration would have been a file containing lines in format SELECT 
 * FROM foo WHERE id = ?. I did not get anything implemented, as there 
 were some problems. The problems were mainly with DEALLOCATE ALL called 
 without pgpool-II knowing it, issues with search_path and the amount of 
 work needed to implement parse tree matching.
 
 It would be interesting if pg_stat_statements would be globally 
 available with queries using generic arguments. First, there would be an 
 obvious heuristic for when to cache the plan: If the average runtime of 
 the query is much larger than the average planning time, there is no 
 point in caching the plan. This would also give one option for cache hit 
 estimation. The hit_percent is directly available. On the other hand 
 pg_stat_statements could easily become a choke-point.
 
 I would love to work on this, but I lack the needed skills. Maybe I 
 could take another try for writing a proof-of-concept parse tree 
 transformer and matcher, but I doubt I can produce anything useful.

That is why I think it is best done in the main parser - it has to parse
and analyse the query anyway and likely knows which constants are
arguments to the query.

If doing it outside the main backend parser, it would be best to still
use the PostgreSQL lex/bison files as much as possible for this.
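
Just to make the idea concrete, here is a deliberately naive, standalone
sketch that replaces integer literals with $n placeholders. A real
implementation would have to work on (or alongside) the parse tree and
handle string literals, quoting, comments, negative numbers, typed
constants and so on:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Replace each integer literal in src with $1, $2, ... */
    static void
    normalize_query(const char *src, char *dst, size_t dstlen)
    {
        int         n = 0;
        size_t      o = 0;

        while (*src && o + 16 < dstlen)
        {
            /* a digit not glued to an identifier starts a literal */
            if (isdigit((unsigned char) *src) &&
                (o == 0 || !isalnum((unsigned char) dst[o - 1])))
            {
                o += snprintf(dst + o, dstlen - o, "$%d", ++n);
                while (isdigit((unsigned char) *src))
                    src++;
            }
            else
                dst[o++] = *src++;
        }
        dst[o] = '\0';
    }

    int
    main(void)
    {
        char        buf[128];

        normalize_query("SELECT a, b FROM foo WHERE c = 123", buf, sizeof(buf));
        puts(buf);      /* SELECT a, b FROM foo WHERE c = $1 */
        return 0;
    }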


-- 
---
Hannu Krosing
PostgreSQL Unlimited Scalability and Performance Consultant
2ndQuadrant Nordic
PG Admin Book: http://www.2ndQuadrant.com/books/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-08-08 Thread Alexander Korotkov
On Mon, Aug 8, 2011 at 1:23 PM, Heikki Linnakangas 
heikki.linnakan...@enterprisedb.com wrote:

 2) Neighbor relocation was added.


 Ok. I think we're going to need some sort of a heuristic on when to enable
 neighbor relocation. If I remember the performance tests correctly, it
 improves the quality of the resulting index, but incurs a significant CPU
 overhead.

 Actually, looking at the performance numbers on the wiki page again (
 http://wiki.postgresql.org/wiki/Fast_GiST_index_build_GSoC_2011),
 it looks like neighbor relocation doesn't help very much with the index
 quality - sometimes it even results in a slightly worse index. Based on
 those results, shouldn't we just remove it? Or is there some other data set
 where it helps significantly?

Oh, actually I didn't add any results with neighbor relocation = off. I
would like to rerun some tests with the current version of the patch.


  3) Subtree prefetching was added.


 I'm inclined to leave out the prefetching code for now. Unless you have
 some performance numbers that show that it's critical for the overall
 performance. But I don't think that was the case, it's just an additional
 optimization for servers with big RAID arrays. So, please separate that into
 an add-on patch. It needs to be performance tests and reviewed separately.

I thought that prefetch helps even on separate hard disks by ordering the IOs.


  4) The final emptying algorithm was reverted to the original one. My
 experiments show that the typical number of passes in final emptying in
 your version of the patch is about 5. That may be significant in itself.
 Also, I have no estimate of the number of passes and no guarantee that it
 will not be high in some corner cases. I.e. I prefer the more predictable
 single-pass algorithm even though it's a little more complex.


 I was trying to get rid of that complexity during index build. Some extra
 code in the final pass would be easier to understand than extra work that
 needs to be done through the index build. It's not a huge amount of code,
 but still.

 I'm not worried about the extra CPU overhead of scanning the data
 structures at the final pass. I guess in my patch you had to do extra I/O as
 well, because the buffers were not emptied in strict top-down order, so
 let's avoid that. How about:

 Track all buffers in the lists, not only those that are non-empty. Add the
 buffer to the right list at getNodeBuffer(). That way in the final stage,
 you need to scan through all buffers instead of just the non-empty ones. But
 the overhead of that should be minimal in practice, scanning some
 in-memory data structures is pretty cheap compared to building an index.
 That way you wouldn't need to maintain the lists during the index build,
 except for adding each buffer to correct lists in getNodeBuffer().

 BTW, please use List for the linked lists. No need to re-implement the
 wheel.

Ok.


  5) Unloading even the last page of a node buffer from main memory to disk.
 Imagine that with levelstep = 1 each inner node has a buffer. It
 seems to me that keeping one page of each buffer in memory may be memory
 consuming.

 Open items I see at this moment:
 1) I dislike my method of switching to the buffering build because it's
 based on very unproven assumptions. And I haven't found more reliable
 assumptions in scientific papers so far. I would like to replace it with
 something much simpler. For example, switching to the buffering build when
 the regular build actually starts to produce a lot of IO. To implement this
 approach I need to somehow detect actual IO (not just a buffer read, but a
 miss of the OS cache).


 Yeah, that's a surprisingly hard problem. I don't much like the method used
 in the patch either.

Is it possible to make buffering build a user defined option until we have a
better idea?


  2) I'm worried about the possible size of nodeBuffersTab and path items. If
 we imagine an extremely large tree with levelstep = 1, the size of these
 data structures will be significant. And it's hard to predict that size
 without knowing the tree size.


 I'm not very worried about that in practice. If you have a very large
 index, you presumably have a fair amount of memory too. Otherwise the
 machine is horrendously underpowered to build or do anything useful with the
 index anyway. Nevertheless it would be nice to have some figures on that. If
 you have, say, an index of 1 TB in size, how much memory will building the
 index need?

I think with points it would be about 1 million buffers and about 100-300
megabytes of RAM depending on space utilization. That may be OK, because 1 TB
is a really huge index. But if maintenance_work_mem is low we can run out of
it. Though a low maintenance_work_mem is quite strange for a system with 1 TB
indexes.


 Miscellaneous observations:

 * Please run pgindent over the code, there's a lot of spurious whitespace
 in the patch.
 * How about renaming GISTLoadedPartItem to something like
 

[HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alexander Korotkov
String-formatted relopts were never used before, but I've used them in the
buffering GiST index build patch and encountered the following compiler
warnings:

reloptions.c:259: warning: initializer-string for array of chars is too long
reloptions.c:259: warning: (near initialization for
‘stringRelOpts[0].default_val’)

It is caused by the definition of the default value field of the
relopt_string structure as a 1-length character array. This seems to be a
design flaw in the reloptions.c code. Any thoughts?
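
For reference, the struct in question (abridged from
src/include/access/reloptions.h):

    typedef struct relopt_string
    {
        relopt_gen  gen;
        int         default_len;
        bool        default_isnull;
        validate_string_relopt validate_cb;
        char        default_val[1];     /* variable length, zero-terminated */
    } relopt_string;

default_val[1] is the usual variable-length-struct trick; it works when
add_string_reloption() pallocs the entry at run time, but a static
stringRelOpts[] initializer with a default longer than one byte overflows
the declared array, hence the warning.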

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Transient plans versus the SPI API

2011-08-08 Thread Anssi Kääriäinen

On 08/08/2011 01:07 PM, Hannu Krosing wrote:

That is why I think it is best done in the main parser - it has to parse
and analyse the query anyway and likely knows which constants are
arguments to the query
As far as I understand the problem, the parsing must transform table
references to schema-qualified references. The table_foobar in "SELECT *
FROM table_foobar WHERE id = ?" is not enough to identify a table. Using
(search_path, query_str) as a key is one possibility, but the search_path
is likely to be different for each user, and this could result in the
same query being cached multiple times.


By the way, I checked current Git HEAD and pg_stat_statements seems to
not handle search_path correctly. For pg_stat_statements this is not
critical, but if the raw query string is used as a plan cache key things
will obviously break...


 - Anssi

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

2011-08-08 Thread Robert Haas
On Sun, Aug 7, 2011 at 1:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Any protocol of that sort has *obviously* got a race condition between
 changes of the latch state and changes of the separate flag state, if run
 in a weak-memory-ordering machine.

 At least on the hardware I'm testing, it seems that the key requirement
 for a failure to be seen is that there's a lot of contention for the cache
 line containing the flag, but not for the cache line containing the latch.
 This allows the write to the flag to take longer to propagate to main
 memory than the write to the latch.

This is interesting research, especially because it provides sobering
evidence of how easily a real problem could be overlooked in testing.
If it took several minutes for this to fail on an 8-CPU system, it's
easy to imagine a less egregious example sitting in our code-base for
years undetected.  (Indeed, I think we probably have a few of those
already...)

On the flip side, I'm not sure that anyone ever really expected that a
latch could be safely used this way.  Normally, I'd expect the flag to
be some piece of state protected by an LWLock, and I think that ought
to be OK provided that the lwlock is released before setting or
checking the latch.  Maybe we should try to document the contract here
a little better; I think it's just that there must be a full memory
barrier (such as LWLockRelease) in both processes involved in the
interaction.
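
To make that contract concrete, here is a sketch of the pattern I'd consider
safe (the shared struct and SomeSharedLock are hypothetical, and the
two-argument WaitLatch() is the 9.1-era form):

    /* waker: publish the flag under the lock, then set the latch */
    LWLockAcquire(SomeSharedLock, LW_EXCLUSIVE);
    shared->work_available = true;
    LWLockRelease(SomeSharedLock);      /* acts as the memory barrier */
    SetLatch(&shared->latch);

    /* waiter: reset, re-check the flag under the lock, only then sleep */
    for (;;)
    {
        bool        work;

        ResetLatch(&shared->latch);

        LWLockAcquire(SomeSharedLock, LW_SHARED);
        work = shared->work_available;
        LWLockRelease(SomeSharedLock);
        if (work)
            break;

        WaitLatch(&shared->latch, -1L); /* block until set */
    }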

 I'm not sure whether we need to do anything about this for 9.1; it may be
 that none of the existing uses of latches are subject to the kind of
 contention that would create a risk.  But I'm entirely convinced that we
 need to insert some memory-barrier instructions into the latch code before
 we expand our uses of latches much more.  Since we were already finding
 that memory barriers will be needed for performance reasons elsewhere,
 I think it's time to bite the bullet and develop that infrastructure.

I've been thinking about this too and actually went so far as to do
some research and put together something that I hope covers most of
the interesting cases.  The attached patch is pretty much entirely
untested, but reflects my present belief about how things ought to
work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


barrier-v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] vacuumlo patch

2011-08-08 Thread Robert Haas
On Sun, Aug 7, 2011 at 7:53 PM, Tim elatl...@gmail.com wrote:
 Thanks Josh,
 I like the ability to bail out on PQTRANS_INERROR, and I think it's a small
 enough fix to be appropriate to include in this patch.
 I did consider it before but did not implement it because I am still new to
 pgsql-hackers and did not know how, off-the-cuff.
 So thanks for the big improvement.

I've committed this patch with some changes, mostly cosmetic.  One
not-quite-so-cosmetic change is that I removed the suggestion that -l
should be used with a limit one lower than whatever provoked the
previous failure.  That might be true on a completely idle system, but
is an oversimplification in real life.  Maybe some kind of hint is
appropriate here, but I think if we're going to have one it ought to
be more generically worded.  I also updated the failure error message
to avoid saying that we "failed to remove NNN objects".  Instead, it
now says how many objects it wanted to remove and how far it had
gotten when it failed, which I think is more useful.

Please feel free to propose further patches if you don't like what
I've done here, or want to further build on it or fine-tune...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mosbench revisited

2011-08-08 Thread Robert Haas
On Sat, Aug 6, 2011 at 1:43 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 The approach is to move the important things from a LWLock to a
 spinlock, and to not do any locking for increments of the clock hand
 and numBufferAllocs.
 That means that some buffers might occasionally get inspected twice
 and some might not get inspected at all during any given clock cycle,
 but this should not lead to any correctness problems.   (Disclosure:
 Tom didn't like this approach when it was last discussed.)

 I just offer this for whatever it is worth to you--I'm not proposing
 it as an actual patch to be applied.

Interesting approach.
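
If I'm reading the description right, the heart of it would be something
like this (a sketch of the described approach, not the actual patch):

    /*
     * Clock hand advanced with a plain unlocked increment; a lost or
     * duplicated increment just means a buffer gets inspected twice or
     * skipped once in a cycle, which is harmless for correctness.
     */
    static volatile int nextVictimBuffer;

    static BufferDesc *
    clock_sweep_candidate(void)
    {
        for (;;)
        {
            unsigned    hand = (unsigned) nextVictimBuffer++;
            BufferDesc *buf = &BufferDescriptors[hand % NBuffers];

            LockBufHdr(buf);            /* per-buffer spinlock only */
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;             /* caller unlocks the header */
            if (buf->usage_count > 0)
                buf->usage_count--;     /* decrement and keep sweeping */
            UnlockBufHdr(buf);
        }
    }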

 When data fits in RAM but not shared_buffers, maybe the easiest fix is
 to increase shared_buffers.  Which brings up the other question I had
 for you about your work with Nate's celebrated loaner machine.  Have
 you tried to reproduce the performance problems that have been
 reported (but without public disclosure of how to reproduce) with
 shared_buffers > 8GB on machines with RAM > 8GB?

No.  That's on my list, but thus far has not made it to the top of
said list.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mosbench revisited

2011-08-08 Thread Robert Haas
On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Robert Haas robertmh...@gmail.com writes:
 It would be nice if the Linux guys would fix this problem for us, but
 I'm not sure whether they will.  For those who may be curious, the
 problem is in generic_file_llseek() in fs/read-write.c.  On a platform
 with 8-byte atomic reads, it seems like it ought to be very possible
 to read inode->i_size without taking a spinlock.  A little Googling
 around suggests that some patches along these lines have been proposed
 and - for reasons that I don't fully understand - rejected.  That now
 seems unfortunate.  Barring a kernel-level fix, we could try to
 implement our own cache to work around this problem.  However, any
 such cache would need to be darn cheap to check and update (since we
 can't assume that relation extension is an infrequent event) and must
 somehow avoid the same sort of mutex contention that's killing the
 kernel in this workload.

 What about making the relation extension much less frequent?  It's been
 talked about before here, that instead of extending 8kB at a time we
 could (should) extend by much larger chunks.  I would go as far as
 preallocating the whole next segment (1GB) (in the background) as soon
 as the current is more than half full, or such a policy.

 Then you have the problem that you can't really use lseek() anymore to
 guess'timate a relation size, but Tom said in this thread that the
 planner certainly doesn't need something that accurate.  Maybe the
 reltuples would do?  If not, it could be that some adapting of its
 accuracy could be done?

I think that pre-extending relations or extending them in larger
increments is probably a good idea, although I think the AMOUNT of
preallocation you just proposed would be severe overkill.  If we
extended the relation in 1MB chunks, we'd reduce the number of
relation extensions by more than 99%, and with far less space wastage
than the approach you are proposing.
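
A sketch of what chunked extension could look like (posix_fallocate is used
here for brevity; mdextend actually writes zero-filled blocks, and a real
patch would need to keep that behavior and its error handling):

    #include <fcntl.h>

    #define EXTEND_CHUNK    (1024 * 1024)   /* 128 8kB blocks per syscall */

    /* extend the file up to the next chunk boundary past current_size */
    static int
    preextend(int fd, off_t current_size)
    {
        off_t       target = ((current_size / EXTEND_CHUNK) + 1) * EXTEND_CHUNK;

        return posix_fallocate(fd, current_size, target - current_size);
    }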

However, it doesn't really do anything to solve this problem.  The
problem here is not that the size of the relation is changing too
frequently - indeed, it's not changing at all in this test case.  The
problem is rather that testing whether or not the size has in fact
changed is costing too much.

The reason why we are testing the size of the relation here rather
than just using reltuples is because the relation might have been
extended since it was last analyzed.  We can't count on analyze to run
often enough to avoid looking at the actual file size.  If the file's
grown, we have to scale the number of tuples up proportional to the
growth in relpages.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mosbench revisited

2011-08-08 Thread Robert Haas
On Sat, Aug 6, 2011 at 3:30 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 My experiments have shown that the freelist proper is not
 substantially faster than the freelist clocksweep--and that is even
 under the assumption that putting things back into the freelist is
 absolutely free.

 The freelist isn't there to make buffer allocation faster, though;
 it's there for allocation efficiency.  The point is that when some
 buffers have become completely useless (eg, because we dropped the table
 they were for), they'll be recycled in preference to reclaiming buffers
 that contain still-possibly-useful data.  It would certainly be simple
 to get rid of the freelist and only recycle dead buffers when the clock
 sweep reaches them, but I think we'd be paying for that in extra,
 unnecessary I/O.

This is an interesting point, because I think it really gets at the
heart of the trade-off we're facing here: quality of buffer allocation
vs. concurrency.

Assuming no free buffers are present, we'd ideally like to evict the
buffer that won't be used until the latest possible future time.  LRU
is an approximation of this.  Our clock-sweep algorithm is a
less-accurate approximation of this that pays for itself with less
locking.  If we really wanted true LRU, we'd have to move things
around in the list every time a buffer was pinned.  That would be
really expensive.  Our clock sweep algorithm only requires fighting
over a piece of shared state when we want to kick a buffer out, rather
than every time we want to do anything to a buffer.  And it's probably
only slightly worse in terms of buffer allocation efficiency than true
LRU, so on balance it appears to be a good trade-off.

Even so, I think it's pretty trivial to construct a workload where
performance is throttled by BufFreelistLock.  I seriously doubt we're
going to be able to solve that problem by optimizing the code that
runs while holding that lock.  We might buy ourselves a few more
percentage points here or there, but if we really want to bust this
bottleneck we're going to need to move to some algorithm that is an
even-less-perfect approximation of LRU but which doesn't involve a
single cluster-wide lock that throttles all buffer allocation.  No
matter how fast you make the critical section protected by that lock,
given enough CPUs, it will eventually not be fast enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] fstat vs. lseek

2011-08-08 Thread Robert Haas
In response to my blog post on lseek contention, someone posted a
comment wherein they proposed using fstat() rather than lseek() to get
file sizes.

http://rhaas.blogspot.com/2011/08/linux-and-glibc-scalability.html

I tried that on a RHEL 6.1 machine with 64-cores running
2.6.32-131.6.1.el6.x86_64, and it's pretty clear that the locking
characteristics are completely different.  At 1 client, the lseek
method appears to be slightly faster, although it's not beyond belief
that the difference could be in the noise.  Above 40 cores, however,
the fstat method thumps the lseek method up one side and down the
other.
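
For reference, the two ways of asking the kernel for a file's size (in md.c
terms this sits behind FileSeek() in _mdnblocks(); shown standalone here):

    #include <sys/stat.h>
    #include <unistd.h>

    static off_t
    size_by_lseek(int fd)
    {
        /* serializes on the per-inode lock discussed in the blog post */
        return lseek(fd, 0, SEEK_END);
    }

    static off_t
    size_by_fstat(int fd)
    {
        struct stat st;

        return (fstat(fd, &st) == 0) ? st.st_size : (off_t) -1;
    }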

Patch and test results are attached.  Test runs are 5-minute runs with
scale factor 100 and shared_buffers=8GB.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


filesize.patch
Description: Binary data


pgbench -S at 32 cores - fstat patch comparison.ods
Description: application/vnd.oasis.opendocument.spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 In response to my blog post on lseek contention, someone posted a
 comment wherein they proposed using fstat() rather than lseek() to get
 file sizes.
 Patch and test results are attached.  Test runs are 5-minute runs with
 scale factor 100 and shared_buffers=8GB.

 Thoughts?

I'm a bit concerned by the fact that you've only tested this on one
operating system, and thus the performance characteristics could be
quite different elsewhere.  The comment in mdextend also points out
a way in which this might not be a win --- did you test anything besides
read-only scenarios?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Andres Freund
On Monday, August 08, 2011 10:30:38 Robert Haas wrote:
 In response to my blog post on lseek contention, someone posted a
 comment wherein they proposed using fstat() rather than lseek() to get
 file sizes.
 
 Thoughts?
I don't think it's a good idea to replace lseek with fstat in the long run. The
likelihood that the lockless generic_file_llseek will get included seems rather
high to me. In contrast to that, fstat will always be more expensive than that,
as it's going through a security check and then the fs' getattr implementation
(which actually takes a lock on some fs).
On the other hand it's currently lockless if the security subsystem is compiled
out (i.e. no selinux et al) for some common fs (ext3/4, xfs).

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP fix proposal for bug #6123

2011-08-08 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 Florian Pflug f...@phlo.org wrote:
 Going down that road opens the door to a *lot* of subtle semantic
 differences between currently equivalent queries. For example,

  UPDATE T SET f=f, a=1

 would behave differently then

  UPDATE T SET a=1

 because in the first case the new row would depend on the old
 row's value of f, while in the second case it doesn't.
 
 True.  But I'm not really bothered by that.  If the user gets an
 error message that says:
 
 ERROR: updated column "f" has already been modified by a BEFORE
 trigger
 
 ...the user will realize the error of their ways.
 
 There's no way to get the same result as if you'd done either
 one of them first, because  they are inextricably
 intertwined.

 In practice, my hand-wavy reference to reconciling the updates
 is a problem because of the way the trigger interface works.  It
 feels pretty impossible to decide that we're going to do the
 update, but with some other random values we dredged up from
 some other update replacing some of the ones the user explicitly
 handed to us. But if the trigger could return an indication of
 which column values it wished to override, then it seems to me
 that we could piece together a reasonable set of semantics. 
 It's not exactly clear how to make that work, though.

 I dunno, that all still feels awfully complex. As you said
 yourself, this case is quite similar to a serialization anomaly.
 Taking that correspondence further, that reconciliation of
 updates is pretty much what the EPQ machinery does in READ
 COMMITTED mode. Now, we ourselves have warned users in the past
 to *not* use READ COMMITTED mode if they do complex updates
 (e.g., use UPDATE ... FROM ...), because the behaviour of that
 reconciliation machinery in the present of concurrent updates is
 extremely hard to predict. I thus don't believe that it's a good
 idea to introduce similarly complex behaviour in other parts of
 the system - and particularly not if you cannot disable it by
 switching to another isolation level.

 Simply throwing an error, on the other hand, makes the behaviour
 simple to understand and explain.
 
 True.  I'm coming around to liking that behavior better than I did
 on first hearing, but I'm still not crazy about it, because as an
 app developer I would really like to have at least the
 unproblematic cases actually work.  Throwing an error at least
 makes it clear that you've done something which isn't supported,
 and that's probably an improvement over the current, somewhat-
 baffling behavior. However, it's not even 25% as nice as having it
 actually work as intended.  That's why, even if we can't make all
 the cases work sanely, I'd be a lot more enthusiastic about it if
 we could find a way to make at least some of them work sanely. 
 The mind-bending cases are unlikely to be the ones people do on
 purpose.
 
So, we have three options on the table here:
 
(1) We (the Wisconsin Courts) are using a very simple fix to work
around the issue so we can move forward with conversion to
PostgreSQL triggers.  A DELETE is allowed to complete if the BEFORE
trigger doesn't return NULL, even if the row was updated while the
trigger was executing.  An UPDATE fails if the BEFORE trigger
doesn't return NULL and any other update is made to the row while
the trigger was executing.
 
(2) You (Robert) have proposed (as I understand it) modifying that
approach to allow some UPDATE cases to work, where there are not
conflicting updates to any one column within the row.
 
(3) Florian has proposed banning all updates to a row which is being
processed by a BEFORE UPDATE or BEFORE DELETE trigger.  As I
understand it, this would be similar to the approach taken by
Oracle, although less strict.
 
I can live with any of these solutions.  How do we move forward to
consensus so that a patch can be written?  Does anyone else have a
preference for one of these approaches versus another?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] mosbench revisited

2011-08-08 Thread Jesper Krogh

On 2011-08-08 15:29, Robert Haas wrote:

On Sat, Aug 6, 2011 at 2:16 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:

Robert Haas robertmh...@gmail.com writes:

It would be nice if the Linux guys would fix this problem for us, but
I'm not sure whether they will.  For those who may be curious, the
problem is in generic_file_llseek() in fs/read-write.c.  On a platform
with 8-byte atomic reads, it seems like it ought to be very possible
to read inode->i_size without taking a spinlock.  A little Googling
around suggests that some patches along these lines have been proposed
and - for reasons that I don't fully understand - rejected.  That now
seems unfortunate.  Barring a kernel-level fix, we could try to
implement our own cache to work around this problem.  However, any
such cache would need to be darn cheap to check and update (since we
can't assume that relation extension is an infrequent event) and must
somehow having the same sort of mutex contention that's killing the
kernel in this workload.

What about making the relation extension much less frequent?  It's been
talked about before here, that instead of extending 8kB at a time we
could (should) extend by much larger chunks.  I would go as far as
preallocating the whole next segment (1GB) (in the background) as soon
as the current is more than half full, or such a policy.

Then you have the problem that you can't really use lseek() anymore to
guess'timate a relation size, but Tom said in this thread that the
planner certainly doesn't need something that accurate.  Maybe the
reltuples would do?  If not, it could be that some adapting of its
accuracy could be done?

I think that pre-extending relations or extending them in larger
increments is probably a good idea, although I think the AMOUNT of
preallocation you just proposed would be severe overkill.  If we
extended the relation in 1MB chunks, we'd reduce the number of
relation extensions by more than 99%, and with far less space wastage
than the approach you are proposing.

Pre-extending in bigger chunks has other benefits
as well, since it helps the filesystem (if it supports extents) to get
the data from the relation laid out in sequential order on disk.

On a well-filled relation, doing filefrag on an ext4 filesystem reveals
that data loaded during initial creation gives 10-11 extents per 1GB
file, whereas a relation filled over time gives as many as 128 extents.

I would suggest 5% of the current relation size or 25-100MB, whichever is
smaller. That would still keep the size down on small relations.

--
Jesper


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

2011-08-08 Thread Peter Geoghegan
On 8 August 2011 13:47, Robert Haas robertmh...@gmail.com wrote:
 On the flip side, I'm not sure that anyone ever really expected that a
 latch could be safely used this way.  Normally, I'd expect the flag to
 be some piece of state protected by an LWLock, and I think that ought
 to be OK provided that the lwlock is released before setting or
 checking the latch.

I'm inclined to agree. FWIW I'll point out that in addition to your
point, it's also worth remembering that a shared latch isn't always
necessary, as in the case of the archiver, so this shouldn't
fundamentally shake our confidence in the latch implementation.

 Maybe we should try to document the contract here
 a little better; I think it's just that there must be a full memory
 barrier (such as LWLockRelease) in both processes involved in the
 interaction.

Or, maybe we should think about pushing that into the latch
implementation, while providing an interface that is easy to use
correctly and hard to use incorrectly. The latch code already has a
pretty complicated contract, and I had to bend over backwards using it
in the AVLauncher patch that I submitted to the upcoming commitfest.
Weakening or equivocating the contract any further should be
considered a last resort.

 I've been thinking about this too and actually went so far as to do
 some research and put together something that I hope covers most of
 the interesting cases.  The attached patch is pretty much entirely
 untested, but reflects my present belief about how things ought to
 work.

That's interesting. Nice work.

It seems likely that Windows will support ARM in some way in the next
couple of years, so it's good that you're handling this using a win32
abstraction where available. Of course, Itanic is currently supported
on Windows, though not by us, and it is obviously never going to be
worth pursuing such support. The point is that it generally isn't safe
to assume that we'll only ever support Windows on x86 and x86_64.

I'd like to see a #warning where you fall back on acquiring and
releasing a spinlock, or at least a #warning if that in turn falls
back on the semaphore-based spinlock implementation. Granted, that
directive isn't in the C standard, but it's pretty portable in
practice. If it breaks the build on some very esoteric platform, that
may not be a bad thing - in fact, maybe you should use the portable
#error directive to make sure it breaks. I'd rather have someone
complain about that than lots more people complain about Postgres
being inexplicably slow, or worse not complain and dismiss Postgres
out of hand for their use-case.

By the way, I don't think that the comment "Won't work on Visual
Studio 2003" is accurate. Everything you do for the
defined(WIN32_ONLY_COMPILER) case is documented as working with 2003.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 10:45 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm a bit concerned by the fact that you've only tested this on one
 operating system, and thus the performance characteristics could be
 quite different elsewhere.  The comment in mdextend also points out
 a way in which this might not be a win --- did you test anything besides
 read-only scenarios?

Nope.

On Mon, Aug 8, 2011 at 10:49 AM, Andres Freund and...@anarazel.de wrote:
 I don't think it's a good idea to replace lseek with fstat in the long run. The
 likelihood that the lockless generic_file_llseek will get included seems rather
 high to me. In contrast to that, fstat will always be more expensive, as it's
 going through a security check and then the fs' getattr implementation (which
 actually takes a lock on some fs).

*scratches head*  I understand that stat() would need a security
check, but why would fstat()?

I think both of you raise good points.  I wasn't too enthusiastic
about this approach either.  It's not very appealing to adopt an
approach where the right performance decision is going to depend on
operating system, file system, kernel version, core count, and
workload.  We could add a GUC, but it would be pretty annoying to have
a setting that won't matter for most people at all, except
occasionally when it makes a huge difference.
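
For reference, the two calls being weighed both yield the current file length;
the difference is purely in their kernel paths (a sketch -- as I understand it,
md.c today obtains relation sizes via the lseek form, through
FileSeek(..., SEEK_END)):

#include <sys/stat.h>
#include <unistd.h>

/* lseek form: note it also moves the file offset to EOF as a side effect */
static off_t
file_size_lseek(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

/* fstat form: leaves the offset alone, but fills in a whole struct stat */
static off_t
file_size_fstat(int fd)
{
    struct stat st;

    return (fstat(fd, &st) < 0) ? -1 : st.st_size;
}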

I wasn't aware that there was any current activity around this on the Linux
side.  But Andres' comments made me Google it again, and now I see
this:

https://lkml.org/lkml/2011/6/16/800

Andres, any idea what the status of that patch is?  I'm not clear on
how Linux works in terms of things getting upstreamed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP fix proposal for bug #6123

2011-08-08 Thread Florian Pflug
On Aug8, 2011, at 17:02 , Kevin Grittner wrote:
 So, we have three options on the table here:
 
 (1) We (the Wisconsin Courts) are using a very simple fix to work
 around the issue so we can move forward with conversion to
 PostgreSQL triggers.  A DELETE is allowed to complete if the BEFORE
 trigger doesn't return NULL, even if the row was updated while the
 trigger was executing.  An UPDATE fails if the BEFORE trigger
 doesn't return NULL and any other update is made to the row while
 the trigger was executing.
 
 (2) You (Robert) have proposed (as I understand it) modifying that
 approach to allow some UPDATE cases to work, where there are not
 conflicting updates to any one column within the row.
 
 (3) Florian has proposed banning all updates to a row which is being
 processed by a BEFORE UPDATE or BEFORE DELETE trigger.  As I
 understand it, this would be similar to the approach taken by
 Oracle, although less strict.
Yeah, though much, much less strict.

I think it would be helpful if we had a more precise idea about the
intended use-cases. So far, the only use-case that has been described in
detail is the one which made Kevin aware of the problem. But if I
understood Kevin correctly, the fact that they use BEFORE and not AFTER
triggers is more of an accident than the result of a conscious design
decision. Though he did also mention that there might actually be advantages
to using BEFORE instead of AFTER triggers, because that prevents other
triggers from seeing a non-consistent state.

What I can add is that AFAIR all instances of same-row UPDATEs from BEFORE
triggers I ever encountered were replaceable by a simple assignment to NEW.
(That, however, is more of an anti-usecase.)
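
For concreteness, the replaceable pattern is the one below (a sketch with an
invented column name):

-- in a BEFORE UPDATE trigger body, instead of
--   UPDATE t SET modified_at = now() WHERE id = NEW.id;
-- it suffices to write
NEW.modified_at := now();
RETURN NEW;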

best regards,
Florian Pflug




Re: [HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alvaro Herrera
Excerpts from Alexander Korotkov's message of lun ago 08 06:27:33 -0400 2011:
 String-formatted relopts were never used before, but I've used them in the
 buffering GiST index build patch and encountered the following compiler
 warnings:
 
 reloptions.c:259: warning: initializer-string for array of chars is too long
 reloptions.c:259: warning: (near initialization for 'stringRelOpts[0].default_val')
 
 It is caused by the default_val field of the relopt_string structure being
 defined as a 1-length character array.  This seems to be a design flaw in the
 reloptions.c code.  Any thoughts?

Maybe this needs to use the new FLEXIBLE_ARRAY_MEMBER stuff.  Can you try that 
please?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alexander Korotkov
On Mon, Aug 8, 2011 at 7:43 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 Maybe this needs to use the new FLEXIBLE_ARRAY_MEMBER stuff.  Can you try
 that please?


typedef struct relopt_string
{
    relopt_gen  gen;
    int         default_len;
    bool        default_isnull;
    validate_string_relopt validate_cb;
    char        default_val[1]; /* variable length, zero-terminated */
} relopt_string;

static relopt_string stringRelOpts[] =
...

I doubt a variable-length data structure is possible in this case, because we
don't have an array of pointers to relopt_string, just an array of
relopt_string directly.  Maybe just
char *default_val;
is possible?
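
For what it's worth, the pointer variant would at least make the static
initializer legal (a sketch of the struct change only; the single-palloc
allocation path used by add_string_reloption would presumably need matching
adjustments):

typedef struct relopt_string
{
    relopt_gen  gen;
    int         default_len;
    bool        default_isnull;
    validate_string_relopt validate_cb;
    char       *default_val;    /* a string literal of any length now fits */
} relopt_string;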

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 11:28 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Maybe we should try to document the contract here
 a little better; I think it's just that there must be a full memory
 barrier (such as LWLockRelease) in both processes involved in the
 interaction.

 Or, maybe we should think about pushing that into the latch
 implementation, while providing an interface that is easy to use
 correctly and hard to use incorrectly.

That would probably result in unnecessary double-synchronization in
many cases, though.  Memory barriers are expensive, and I think we'd
better adopt a policy of trying to figure out where they are truly
needed and using them only in those places.  Or to put that another
way, having spent the last couple of months trying to reduce our
synchronization overhead, I'm not real keen on adding more unless we
really need it.

 I've been thinking about this too and actually went so far as to do
 some research and put together something that I hope covers most of
 the interesting cases.  The attached patch is pretty much entirely
 untested, but reflects my present belief about how things ought to
 work.

 That's interesting. Nice work.

Thanks!

 It seems likely that Windows will support ARM in some way in the next
 couple of years, so it's good that you're handling this using a win32
 abstraction where available. Of course, Itanic is currently supported
 on Windows, though not by us, and it is obviously never going to be
 worth pursuing such support. The point is that it generally isn't safe
 to assume that we'll only ever support Windows on x86 and x86_64.

Agreed.  I saw an MSDN article that referred to architectures on
which Windows is supported.  It didn't say which those were, but I
figured I'd better not assume they were all platforms with strong
memory ordering.

 I'd like to see a #warning where you fall back on acquiring and
 releasing a spinlock, or at least a #warning if that in turn falls
 back on the semaphore-based spinlock implementation. Granted, that
 directive isn't in the C standard, but it's pretty portable in
 practice. If it breaks the build on some very esoteric platform, that
 may not be a bad thing - in fact, maybe you should use the portable
 #error directive to make sure it breaks. I'd rather have someone
 complain about that than lots more people complain about Postgres
 being inexplicably slow, or worse not complain and dismiss Postgres
 out of hand for their use-case.

I was thinking about whether and how we could expose that information.
 I was almost tempted to add a read-only GUC that would output the
details of what kind of synchronization we're using.  So the bad case
would look something like this:

spinlock=semaphore,memory_barrier=spinlock,read_barrier=spinlock,write_barrier=spinlock,compiler_barrier=spinlock

And the good case would look something like this:

spinlock=gcc-inline-asm,memory_barrier=gcc-inline-asm,read_barrier=gcc-inline-asm,write_barrier=gcc-inline-asm,compiler_barrier=gcc-inline-asm

And you can imagine other values, like compiler-intrinsic.  Of course,
jamming a whole CSV file into a GUC value is a bit unappealing.  Maybe
it would be better to have a system view with two columns,
synchronization_primitive and implementation.
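
The output of such a view might then look something like this (entirely
hypothetical -- the view name and rows here are invented for illustration):

SELECT * FROM pg_synchronization_implementation;

 synchronization_primitive | implementation
---------------------------+-----------------
 spinlock                  | gcc-inline-asm
 memory_barrier            | gcc-inline-asm
 read_barrier              | gcc-inline-asm
 write_barrier             | gcc-inline-asm
 compiler_barrier          | gcc-inline-asm
(5 rows)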

Either way, I'd prefer this to a #warning, because that's only going
to be visible at compile-time, which means that when $PGCOMPANY gets
called in to figure out why the system is dog slow, it's going to take
a bit of forensics to figure out where the build came from and what
options are in use.  Is that a huge problem?  Maybe not, but it seems
like it would be a nice thing to make it easy for people to log into a
running system and immediately be able to determine which forms of
wizardry we're using there.

 By the way, I don't think that the comment Won't work on Visual
 Studio 2003 is accurate. Everything you do for the
 defined(WIN32_ONLY_COMPILER) case is documented as working with 2003.

Oh, really?  I thought I ran across something that said otherwise, but
it might have been wrong, or maybe I got confused somewhere along the
line.  I haven't tested anything more than that this compiles without
erroring out on MacOS X, so the probability of errors and omissions is
not small.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Alvaro Herrera
Excerpts from Kohei KaiGai's message of lun ago 08 03:12:20 -0400 2011:

 Thanks for your suggestion.
 So, it seems to me the interface should return a pointer to the entry
 of array being specified, rather than above approach.
 
 E.g, the above macro could be probably rewritten as follows:
   #define get_object_property_attnum_name(objtype) \
   (get_object_property(objtype)->attnum_name)

I don't understand why don't you just do something like

   #define get_object_property_attnum_name(objtype, attnum_name_value) \
   (get_object_property((objtype), NULL, NULL, (attnum_name_value), NULL, NULL))

and the caller does

AttrNumber  attnum_name;
get_object_property_attnum_name(OBJTYPE_TABLE, attnum_name);

i.e. the caller must still pass pointers, instead of expecting the
values to be returned.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 11:57 AM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Kohei KaiGai's message of lun ago 08 03:12:20 -0400 2011:

 Thanks for your suggestion.
 So, it seems to me the interface should return a pointer to the entry
 of array being specified, rather than above approach.

 E.g, the above macro could be probably rewritten as follows:
   #define get_object_property_attnum_name(objtype) \
        (get_object_property(objtype)->attnum_name)

 I don't understand why don't you just do something like

   #define get_object_property_attnum_name(objtype, attnum_name_value) \
        (get_object_property((objtype), NULL, NULL, (attnum_name_value), NULL, NULL))

 and the caller does

 AttrNumber      attnum_name;
 get_object_property_attnum_name(OBJTYPE_TABLE, attnum_name);

 i.e. the caller must still pass pointers, instead of expecting the
 values to be returned.

We could do that, but what the heck is the point?   What benefit are
we trying to get by not returning a pointer to the structure?  I feel
like we're making this ludicrously complicated with no real
justification of why all of this complexity is adding any value.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] per-column FDW options, v5

2011-08-08 Thread Robert Haas
2011/8/8 Shigeru Hanada shigeru.han...@gmail.com:
 Currently table-level options are shown in the result of the \det+ command
 (only verbose mode), in the same style as FDWs and foreign servers.

 But \d is more popular for table describing, so moving table-level
 options from \det+ to \d might be better.  Thoughts?

 (2011/08/06 9:26), Robert Haas wrote:
 I'd show it both places.

 After taking a look at describe.c, I think some styles are applicable to
 FDW options in result of \d command.

 (1) simple array literal style
 Show table-level FDW options like other FDW options.  It is simply the
 result of array_out(); each option is shown as key=value with quoting
 if necessary and delimited by ','.  The whole line is surrounded by { and }.
 If an element includes any character which needs to be escaped, such an
 element is quoted with double quotes.

    Ex)
    FDW Options: {delimiter=,,quote=\}
    #delimiter is a comma, and quote is a double-quote

 (2) reloptions style
 Show FDW options like reloptions in \d+ output.  Each option is shown
 as key=value without quoting.  Some special characters might make it a
 little illegible.

    Ex)
    FDW Options: delimiter=,, quote=
    #delimiter is a comma, and quote is a double-quote

 (3) OPTIONS clause style
 Show FDW options as they appear in the OPTIONS clause.  Each option is
 shown as key 'value', and delimited with ','.

    Ex)
    FDW Options: delimiter ',', quote 
    #delimiter is a comma, and quote is a single-quote

 (1) can be implemented with minimal change, and it also keeps the
 behavior consistent with other existing FDW objects.  But IMHO (3) is
 the most readable, though it breaks backward compatibility regarding the
 format of FDW options used in the result of \d* commands.  Thoughts?

I'm against #2, but I could go either way on #1 vs. #3.  If you pick
#3, would you also change the column options to be displayed that way,
or would we end up with table and column options displayed
differently?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] psql document fix about showing FDW options

2011-08-08 Thread Robert Haas
2011/8/8 Shigeru Hanada shigeru.han...@gmail.com:
 I noticed that psql document wrongly says that \d+ command shows
 per-table FDW options of a foreign table, but in fact, per-table FDW
 options are shown only in the result of \det+ command.  Attached patch
 removes this wrong description.

 This fix should be applied to 9.1 too.

I think we ought to fix the behavior, rather than the documentation.
It's just weird to have something show up in the list view that
doesn't appear in the detail view.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Alvaro Herrera
Excerpts from Robert Haas's message of lun ago 08 12:05:05 -0400 2011:

 We could do that, but what the heck is the point?   What benefit are
 we trying to get by not returning a pointer to the structure?  I feel
 like we're making this ludicrously complicated with no real
 justification of why all of this complexity is adding any value.

I don't see that it's complicated in any way.  Seems pretty simple to me.
The justification is simple too: don't export struct definitions without
need.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alvaro Herrera
Excerpts from Alexander Korotkov's message of lun ago 08 11:50:53 -0400 2011:
 On Mon, Aug 8, 2011 at 7:43 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 
  Maybe this needs to use the new FLEXIBLE_ARRAY_MEMBER stuff.  Can you try
  that please?
 
 
 typedef struct relopt_string
 {
     relopt_gen  gen;
     int         default_len;
     bool        default_isnull;
     validate_string_relopt validate_cb;
     char        default_val[1]; /* variable length, zero-terminated */
 } relopt_string;
 
 static relopt_string stringRelOpts[] =
 ...
 
 I doubt a variable-length data structure is possible in this case, because we
 don't have an array of pointers to relopt_string, just an array of
 relopt_string directly.  Maybe just
 char *default_val;
 is possible?

An array of relopt_string?  Isn't that a bit strange?  If I recall
correctly, the point of this was to be able to allocate the
relopt_string struct and the char array itself as a single palloc unit,
in a single call somewhere in the reloptions API (which was convoluted
in some points precisely to let the string case work).  I don't have the
details of this fresh in my mind though.  It certainly worked with more
than one string option when I committed it, IIRC.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 12:22 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of lun ago 08 12:05:05 -0400 2011:
 We could do that, but what the heck is the point?   What benefit are
 we trying to get by not returning a pointer to the structure?  I feel
 like we're making this ludicrously complicated with no real
 justification of why all of this complexity is adding any value.

 I don't see that it's complicated in any way.  Seems pretty simple to me.

It's quite complicated.  Every time someone adds a new member to the
struct, they need to adjust the code all over the place.  The call
sequence of the accessor function will constantly be changing.  We
generally try to structure our code base so that patches don't need to
touch more call sites than necessary, and we regularly get very
nervous about patches that touch too many call sites and ask them to
be changed so that they don't.  Here, you're proposing a design that
is practically guaranteed to have that problem every time someone
makes a change in that area.

 The justification is simple too: don't export struct definitions without
 need.

That's a reasonable guiding principle, but I believe that the
underlying reason to do that is information hiding.  In other words,
if the details of what's in the struct don't matter to other modules,
then they shouldn't peek into it.  That way, when you want to change
something, you don't have to worry about whether anyone else cares
about the encapsulated data, because you already know they don't.  In
this case, however, the struct - by design - isn't going to contain
any private data.  If callers are going to potentially need access to
every member of the struct, then you may as well export it and be done
with it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Alvaro Herrera
Excerpts from Kohei KaiGai's message of lun ago 08 12:18:47 -0400 2011:
 2011/8/8 Robert Haas robertmh...@gmail.com:

  We could do that, but what the heck is the point?   What benefit are
  we trying to get by not returning a pointer to the structure?  I feel
  like we're making this ludicrously complicated with no real
  justification of why all of this complexity is adding any value.
 
 I agree with Robert's opinion. It seems to me we have little benefit in
 keeping the structure confidential to other components.

So you coded my suggestion in an extremely awkward way just to be able
to dismiss it?

We use that pattern in a lot of places.  See get_op_opfamily_properties
for the first example I found (took me 15 secs to find it).  I don't
understand why you think it's so complicated or horrible.

Please tell me, why don't we just return Form_pg_amop in that function?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

2011-08-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Sun, Aug 7, 2011 at 1:47 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Any protocol of that sort has *obviously* got a race condition between
 changes of the latch state and changes of the separate flag state, if run
 in a weak-memory-ordering machine.

 On the flip side, I'm not sure that anyone ever really expected that a
 latch could be safely used this way.  Normally, I'd expect the flag to
 be some piece of state protected by an LWLock, and I think that ought
 to be OK provided that the lwlock is released before setting or
 checking the latch.  Maybe we should try to document the contract here
 a little better; I think it's just that there must be a full memory
 barrier (such as LWLockRelease) in both processes involved in the
 interaction.

Heikki argued the same thing in
http://archives.postgresql.org/message-id/4cea5a0f.1030...@enterprisedb.com
but I don't think anyone has actually done the legwork to verify that
current uses are safe.  So [ ... warms up grep ... ]

Current callers of WaitLatch(OrSocket):

XLogPageRead waits on XLogCtl-recoveryWakeupLatch

latch is set by WakeupRecovery, which is called by process'
own signal handlers (OK) and XLogWalRcvFlush.  The latter is OK
because there's a spinlock protecting the transmitted data.

pgarch_MainLoop waits on mainloop_latch

non-issue because latch is process-local

WalSndLoop waits on MyWalSnd-latch

latch is set by signal handlers and WalSndWakeup.  The latter is
OK because it's called right after XLogFlush (which certainly
locks whatever it touches) or a spinlock release.

SyncRepWaitForLSN waits on MyProc-waitLatch

latch is set by signal handlers and SyncRepWakeQueue.  The
latter appears unsafe at first glance, because it sets the
shared variable (thisproc-syncRepState) and immediately
sets the latch.  However, the receiver does this curious dance:

syncRepState = MyProc->syncRepState;
if (syncRepState == SYNC_REP_WAITING)
{
    LWLockAcquire(SyncRepLock, LW_SHARED);
    syncRepState = MyProc->syncRepState;
    LWLockRelease(SyncRepLock);
}

which I believe makes it safe, though rather underdocumented;
if a race does occur, the receiver will acquire the lock and
recheck, and the lock acquisition will block until the caller of
SyncRepWakeQueue has released SyncRepLock, and that release
will flush any pending writes to shared memory.

Conclusions:

(1) 9.1 does not have a problem of this sort.

(2) Breathing hard on any of this code could break it.

(3) I'm not going to hold still for not inserting a memory barrier
into SetLatch, once we have the code.  Unsubstantiated performance
worries are not a sufficient argument not to --- as a wise man once
said, you can make the code arbitrarily fast if it doesn't have to
give the right answer.

(4) In the meantime, we'd better document that it's caller's
responsibility to ensure that the flag variable is adequately
protected by locking.  I think SyncRepWaitForLSN could use a warning
comment, too.  I will go do that.
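
The sort of warning comment in question might read roughly as follows (wording
invented here, not the actual commit text):

/*
 * Note: when a latch is used to wait for a condition recorded in a shared
 * flag variable, the caller must protect that variable with a lock (or
 * another operation implying a full memory barrier) on both the setting
 * and the waiting side; SetLatch/ResetLatch by themselves do not currently
 * guarantee memory ordering.
 */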

regards, tom lane



Re: [HACKERS] Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

2011-08-08 Thread Heikki Linnakangas

On 08.08.2011 19:39, Tom Lane wrote:

Robert Haas robertmh...@gmail.com writes:

On the flip side, I'm not sure that anyone ever really expected that a
latch could be safely used this way.  Normally, I'd expect the flag to
be some piece of state protected by an LWLock, and I think that ought
to be OK provided that the lwlock is released before setting or
checking the latch.  Maybe we should try to document the contract here
a little better; I think it's just that there must be a full memory
barrier (such as LWLockRelease) in both processes involved in the
interaction.


Heikki argued the same thing in
http://archives.postgresql.org/message-id/4cea5a0f.1030...@enterprisedb.com
but I don't think anyone has actually done the legwork to verify that
current uses are safe.  So [ ... warms up grep ... ]


Thanks for double-checking.


(3) I'm not going to hold still for not inserting a memory barrier
into SetLatch, once we have the code.  Unsubstantiated performance
worries are not a sufficient argument not to --- as a wise man once
said, you can make the code arbitrarily fast if it doesn't have to
give the right answer.


I agree we definitely should add the memory barrier instructions there
once we have them.



(4) In the meantime, we'd better document that it's caller's
responsibility to ensure that the flag variable is adequately
protected by locking.  I think SyncRepWaitForLSN could use a warning
comment, too.  I will go do that.


Ok, thanks.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Kohei KaiGai
2011/8/8 Robert Haas robertmh...@gmail.com:
 On Mon, Aug 8, 2011 at 11:57 AM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Excerpts from Kohei KaiGai's message of lun ago 08 03:12:20 -0400 2011:

 Thanks for your suggestion.
 So, it seems to me the interface should return a pointer to the entry
 of array being specified, rather than above approach.

 E.g, the above macro could be probably rewritten as follows:
   #define get_object_property_attnum_name(objtype) \
        (get_object_property(objtype)->attnum_name)

 I don't understand why don't you just do something like

   #define get_object_property_attnum_name(objtype, attnum_name_value) \
        (get_object_property((objtype), NULL, NULL, (attnum_name_value), NULL, NULL))

 and the caller does

 AttrNumber      attnum_name;
 get_object_property_attnum_name(OBJTYPE_TABLE, attnum_name);

 i.e. the caller must still pass pointers, instead of expecting the
 values to be returned.

 We could do that, but what the heck is the point?   What benefit are
 we trying to get by not returning a pointer to the structure?  I feel
 like we're making this ludicrously complicated with no real
 justification of why all of this complexity is adding any value.

I agree with Robert's opinion. It seems to me we have little benefit in
keeping the structure confidential to other components.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Alvaro Herrera
Excerpts from Robert Haas's message of lun ago 08 12:33:45 -0400 2011:
 On Mon, Aug 8, 2011 at 12:22 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Excerpts from Robert Haas's message of lun ago 08 12:05:05 -0400 2011:
  We could do that, but what the heck is the point?   What benefit are
  we trying to get by not returning a pointer to the structure?  I feel
  like we're making this ludicrously complicated with no real
  justification of why all of this complexity is adding any value.
 
  I don't see that it's complicated in any way.  Seems pretty simple to me.
 
 It's quite complicated.  Every time someone adds a new member to the
 struct, they need to adjust the code all over the place.  The call
 sequence of the accessor function will constantly be changing.

You are rehashing an argument that I already rebutted: the way to solve
this problem is with the macros, the format of which we were discussing
in this sub-thread.  I am not sure why you bring it up again.

You say the reason for not exporting the struct definition is that you
can later change it without having to touch the callers.  That seems a
good argument, and it seems to me to work equally for pg_amop as for the
new header KaiGai is creating here.

In any case, at the start of the thread KaiGai was asking for opinions,
and I gave mine.  I will now shut up.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Andres Freund
On Monday, August 08, 2011 11:33:29 Robert Haas wrote:

 On Mon, Aug 8, 2011 at 10:49 AM, Andres Freund and...@anarazel.de wrote:
  I don't think it's a good idea to replace lseek with fstat in the long
  run. The likelihood that the lockless generic_file_llseek will get
  included seems rather high to me. In contrast to that, fstat will always
  be more expensive, as it's going through a security check and
  then the fs' getattr implementation (which actually takes a lock on
  some fs).
 *scratches head*  I understand that stat() would need a security
 check, but why would fstat()?
That I am not totally sure of either; I guess KaiGai might know more about
that.
It might be that a forked process is no longer allowed to access the
information from an inherited file handle? Also, I think a process can change
its permissions during runtime.

 I think both of you raise good points.  I wasn't too enthusiastic
 about this approach either.  It's not very appealing to adopt an
 approach where the right performance decision is going to depend on
 operating system, file system, kernel version, core count, and
 workload.  We could add a GUC, but it would be pretty annoying to have
 a setting that won't matter for most people at all, except
 occasionally when it makes a huge difference.
 
 I wasn't aware that there was any current activity around this on the Linux
 side.  But Andres' comments made me Google it again, and now I see
 this:
 
 https://lkml.org/lkml/2011/6/16/800
 
 Andres, any idea what the status of that patch is?  I'm not clear on
 how Linux works in terms of things getting upstreamed.
There doesn't seem to have been any activity to include it in 3.1. The merge
window for 3.1 just ended. The next one will open for about a week after the
release.
It's also not yet included in linux-next, which is a preview for the currently
worked-on release + 1. A release takes roughly 3 months.

For upstreaming, somebody needs to be persistent enough to convince one of the
maintainers of the particular area to include the code so that Linus can then
pull it.
I guess citing your numbers would go a long way in that direction. Naturally
it would be even better to include results with the patch applied.
The largest machine I can reboot often enough to test such a thing has only two
sockets (4-core E5520). I guess you cannot reboot your loaned machine with a
new kernel easily?

Greetings, 
Andres



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 We could do that, but what the heck is the point?   What benefit are
 we trying to get by not returning a pointer to the structure?

Not having an ABI break if we find it necessary to add members to the
struct ... which I grant is unlikely to happen in a minor version
update, but it could happen.

Having said that, the proposed coding with a single function returning N
pointer parameters seems like it's been written in the ugliest, least
efficient possible way, and it fails to resolve the ABI-break objection
anyway.  (If you did need another struct member, you'd also need another
return parameter, thus forcing recompiles.)  The form I had in mind was
N actual functions (not macros) that internally look like

sometype
get_object_property_foo(...)
{
    structptr  *p = get_object_property_struct(...);

    return p->foo;
}

The only objection I can see to this is that somebody who needs several
different attributes of the same object would have to pay the lookup
cost several times.  That may or may not be an adequate reason to allow
people to fetch the struct directly; without seeing a list of places
where this would happen, it's impossible to evaluate the performance
costs.

regards, tom lane



Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 1:10 PM, Andres Freund and...@anarazel.de wrote:
 There doesn't seem to have been any activity to include it in 3.1. The merge
 window for 3.1 just ended. The next one will open for about a week after the
 release.
 It's also not yet included in linux-next, which is a preview for the currently
 worked-on release + 1. A release takes roughly 3 months.

OK.  If it doesn't get into Linux 3.2 we had better start thinking
hard about a workaround on our side.  I am not too concerned about
people hitting this with PostgreSQL 9.1 or prior, because you'd
basically need a workload targeted to exercise the problem, which
workload is not that similar to the way people actually do things in
real life.  However, in PostgreSQL 9.2devel, it's going to be much
more of a real-world problem, so I'd hate to wait until after our
feature freeze and then decide we've got a problem we have to fix.

 For upstreaming somebody needs to be persistent enough to convince one of the
 maintainers of the particular area to include the code so that linus then can
 pull that.
 I guess citing your numbers would go a long way in that direction. Naturally
 it would be even better to include results with the patch applied.
 The largest machine I can reboot often enough to test such a thing has only two
 sockets (4-core E5520). I guess you cannot reboot your loaned machine with a
 new kernel easily?

Not really.  I do have root access to a 64-core box at the moment, and
I could probably get permission to reboot it, but if it didn't come
back on-line that would be awkward.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alexander Korotkov
On Mon, Aug 8, 2011 at 8:27 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:

 An array of relopt_string?  Isn't that a bit strange?  If I recall
 correctly, the point of this was to be able to allocate the
 relopt_string struct and the char array itself as a single palloc unit,
 in a single call somewhere in the reloptions API (which was convoluted
 in some points precisely to let the string case work).  I don't have the
 details of this fresh in my mind though.  It certainly worked with more
 than one string option when I committed it, IIRC.

Yes, it seems strange. But it also seems that both were added by your commit
of the table-based parser:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ba748f7a11ef884277b61d1708a17a44acfd1736

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 1:16 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 We could do that, but what the heck is the point?   What benefit are
 we trying to get by not returning a pointer to the structure?

 Not having an ABI break if we find it necessary to add members to the
 struct ... which I grant is unlikely to happen in a minor version
 update, but it could happen.

I think that would only be a problem if we added members to the middle
of the struct, rather than at the end.

However...

 Having said that, the proposed coding with a single function returning N
 pointer parameters seems like it's been written in the ugliest, least
 efficient possible way, and it fails to resolve the ABI-break objection
 anyway.  (If you did need another struct member, you'd also need another
 return parameter, thus forcing recompiles.)  The form I had in mind was
 N actual functions (not macros) that internally look like

        sometype
        get_object_property_foo(...)
        {
                structptr *p = get_object_property_struct(...);

                return p->foo;
        }

 The only objection I can see to this is that somebody who needs several
 different attributes of the same object would have to pay the lookup
 cost several times.  That may or may not be an adequate reason to allow
 people to fetch the struct directly; without seeing a list of places
 where this would happen, it's impossible to evaluate the performance
 costs.

...doing it this way seems fine, because IIRC this is all only being
used to support DDL operations, and the overhead of a few extra
function calls isn't going to matter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Not really.  I do have root access to a 64-core box at the moment, and
 I could probably get permission to reboot it, but if it didn't come
 back on-line that would be awkward.

Red Hat has some test hardware that I can use (... pokes around ...)
Hmm, this one looks promising:

Memory    NUMA Nodes
64348 MB  4

  Cpu
Vendor        Model Name                              Family  Model  Stepping  Speed   Processors  Cores  Sockets  Hyper
GenuineIntel  Intel(R) Xeon(R) CPU E7-4860 @ 2.27GHz  6       47     2         1064.0  80          40     4        True

If you can wrap something up to the point where someone else can
run it, I'll give it a shot.

regards, tom lane



Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Andres Freund
On Monday, August 08, 2011 13:19:13 Robert Haas wrote:
 On Mon, Aug 8, 2011 at 1:10 PM, Andres Freund and...@anarazel.de wrote:
  There doesn't seem to have been any activity to include it in 3.1. The
  merge window for 3.1 just ended. The next one will open for about a
  week after the release.
  It's also not yet included in linux-next, which is a preview for the
  currently worked-on release + 1. A release takes roughly 3 months.
 
 OK.  If it doesn't get into Linux 3.2 we had better start thinking
 hard about a workaround on our side. 
If it's OK, I will write a mail to lkml referencing this thread and your numbers
inline (with attribution, obviously).
I don't think it will be that hard to convince them. But I constantly surprise
myself with naivety, so I may be wrong.


  The largest machine I can reboot often enough to test such a thing has only
  two sockets (4-core E5520). I guess you cannot reboot your loaned machine
  with a new kernel easily?
 Not really.  I do have root access to a 64-core box at the moment, and
 I could probably get permission to reboot it, but if it didn't come
 back on-line that would be awkward.
As I feared. Any chance that the person lending you the machine can give you a
hand?
Although, having read the code, I don't see how that could be, it would be
disappointing to wait for 3.2 with the llseek fixes appearing in $distribution,
just to notice fstat is still faster for $unobvious_reason...

Andres



Re: [HACKERS] [RFC] Common object property boards

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 12:49 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of lun ago 08 12:33:45 -0400 2011:
 On Mon, Aug 8, 2011 at 12:22 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Excerpts from Robert Haas's message of lun ago 08 12:05:05 -0400 2011:
  We could do that, but what the heck is the point?   What benefit are
  we trying to get by not returning a pointer to the structure?  I feel
  like we're making this ludicrously complicated with no real
  justification of why all of this complexity is adding any value.
 
  I don't see that it's complicated in any way.  Seems pretty simple to me.

 It's quite complicated.  Every time someone adds a new member to the
 struct, they need to adjust the code all over the place.  The call
 sequence of the accessor function will constantly be changing.

 You are rehashing an argument that I already rebutted: the way to solve
 this problem is with the macros, the format of which we were discussing
 in this sub-thread.  I am not sure why you bring it up again.

Well, the proposed macro design seems like it doesn't go as far in
that direction as I would have hoped.  I thought you were proposing to
have one function that returns a pointer to the struct, and then a
bunch of macros that return particular elements out of the returned
struct.  If we did that, the macro definitions wouldn't need to change
when you added new members to the struct.  On the other hand, if you
have one function with N out parameters, and the macros are just
calling that function with mostly NULL arguments, then the calls have
to be updated every time you add a new struct member.  IMHO, that's a
pain.  YMMV.

 You say the reason for not exporting the struct definition is that you
 can later change it without having to touch the callers.  That seems a
 good argument, and it seems to me to work equally for pg_amop as for the
 new header KaiGai is creating here.

In the case of pg_amop, there's an additional thing to worry about,
which is that the result is allocated in some memory context, and you
now have to worry about how long it sticks around in that context vs.
how long the caller needs it for.  Copying the data into out
parameters (or a pg_amop struct in the caller's memory context, if we
had preferred to go that route) addresses that problem.  However, here
we're just dealing with static data, so that's not a problem.

Also, if get_op_opfamily_properties() had as many out parameters as
this function is likely to need, I'd favor changing that, too.
Three is reasonable.  Ten is pushing it, and any more than that is
kind of nuts.  I think we're going to end up with quite a few in this
case, because we're rapidly accreting a set of operations that work on
any old SQL object (COMMENT, SECURITY LABEL, ALTER EXTENSION ..
ADD/DROP, ALTER FOO .. SET SCHEMA, etc.).  I think that's a good
thing, but I don't want to assume that the number of struct members is
going to remain small.

 In any case, at the start of the thread KaiGai was asking for opinions,
 and I gave mine.  I will now shut up.

Sorry to have given offense.  My goal was to understand why we were
having this argument, not to piss you off.  However, I seem to have
failed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] index sizes: single table vs partitioned

2011-08-08 Thread Andrew Hammond
For a large table, should there be a difference in index sizes between a
single-table representation and a representation based on multiple partitions
with identical indexes?

A


Re: [HACKERS] fstat vs. lseek

2011-08-08 Thread Robert Haas
On Mon, Aug 8, 2011 at 1:31 PM, Andres Freund and...@anarazel.de wrote:
 If its ok I will write a mail to lkml referencing this thread and your numbers
 inline (with attribution obviously).

That would be great.  Please go ahead.

 I don't think it will be that hard to convince them. But I constantly surprise
 myself with naivety, so I may be wrong.

Heh, heh, open source is fun.

  The largest machine I can reboot often enough to test such a thing has only
  two sockets (4-core E5520). I guess you cannot reboot your loaned machine
  with a new kernel easily?
 Not really.  I do have root access to a 64-core box at the moment, and
 I could probably get permission to reboot it, but if it didn't come
 back on-line that would be awkward.
 As I feared. Any chance that the person lending you the machine can give you a
 hand?

Uh, maybe, but considering my relative inexperience in compiling the
Linux kernel, I'd be a little worried about having to iterate too many
times.

 Although, having read the code, I don't see how that could be, it would be
 disappointing to wait for 3.2 with the llseek fixes appearing in $distribution,
 just to notice fstat is still faster for $unobvious_reason...

Well, the good thing here is that we are really only concerned with
gross effects.  It's pretty obvious from the numbers I posted upthread
that the problem is related to lock contention.  If that gets fixed,
and lseek is still 20% slower under some set of circumstances, it's
not clear that we're really gonna care.  I mean, maybe it would be
nice to avoid going to the kernel at all here just so we're immune to
possible inefficiencies in other operating systems (it would be nice
if someone could repeat these tests on a big SMP box running Windows
and/or one of the BSD systems) and to save the overhead of a system call,
but those effects are pretty tiny.  We could spend a lot of time
optimizing other things before that one percolated up to the top of
the heap, at least based on what I've seen so far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Compiler warnings with stringRelOpts (was WIP: Fast GiST index build)

2011-08-08 Thread Alvaro Herrera
Excerpts from Alexander Korotkov's message of lun ago 08 13:21:17 -0400 2011:
 On Mon, Aug 8, 2011 at 8:27 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 
  An array of relopt_string?  Isn't that a bit strange?  If I recall
  correctly, the point of this was to be able to allocate the
  relopt_string struct and the char array itself as a single palloc unit,
  in a single call somewhere in the reloptions API (which was convoluted
  in some points precisely to let the string case work).  I don't have the
  details of this fresh in my mind though.  It certainly worked with more
  than one string option when I committed it, IIRC.
 
 Yes, it seems strange. But it also seems that both were added by your commit
 of table-based parser:
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=ba748f7a11ef884277b61d1708a17a44acfd1736

Oh, you're adding options directly to stringRelOpts?  Hmm, I can't
remember whether I tried to do that; I have a memory of testing string
options via add_string_reloption ... and on reflection, it seems obvious
that it would fail.

Perhaps the easiest way to fix it is as you suggest, by declaring the
struct to take a pointer rather than the value directly.  Not sure how
to make both cases work sanely; the add_string_reloption path will need
updates.  I don't have time to work on it right now though.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] WIP fix proposal for bug #6123

2011-08-08 Thread Florian Pflug
On Aug8, 2011, at 17:35 , Florian Pflug wrote:
 I think it would be helpful if we had a more precise idea about the
 intended use-cases. So far, the only use-case that has been described in
 detail is the one which made Kevin aware of the problem. But if I
 understood Kevin correctly, that fact that they use BEFORE and not AFTER
 triggers it more of an accident than the result of a conscious design
 decision. Though he did also mention that there might actually be advantages
 to using BEFORE instead of AFTER triggers, because that prevents other
 triggers from seeing a non-consistent state.

I pondered this some more, and came up with the following use-case of
same-row modification from a BEFORE trigger which is not easily moved
to an AFTER trigger.

Assume you have a partitioned table T with two partitions. Partition
Tcompleted contains records which are completed in some sense, while
Tcommencing contains records which are not. A record's state is indicated
by a boolean column completed, and CHECK constraints guarantee that
completed is indeed false for all records in Tcommencing and true for all
records in Tcompleted. Now, if the partitioning is supposed to be transparent
to applications, a record must automatically be moved from Tcommencing to
Tcompleted if its completed flag is changed from false to true. The
following BEFORE UPDATE trigger on Tcommencing accomplishes that.

  IF NEW.completed THEN
    DELETE FROM Tcommencing WHERE id = OLD.id;
    INSERT INTO Tcompleted VALUES (NEW.*);  -- Probably a syntax error,
                                            -- but intent is clear
    RETURN NULL;
  ELSE
    RETURN NEW;
  END IF;

Doing this from within an AFTER trigger doesn't work because the CHECK
constraint on Tcommencing would prevent the UPDATE from occurring.
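
Fleshed out, the sketch above might read as follows (untested, and assuming
id is the key column; note that the partitioning examples in the documentation
use the same VALUES (NEW.*) form, so it is likely not a syntax error after all):

CREATE FUNCTION tcommencing_move_completed() RETURNS trigger AS $$
BEGIN
    IF NEW.completed THEN
        DELETE FROM Tcommencing WHERE id = OLD.id;
        INSERT INTO Tcompleted VALUES (NEW.*);
        RETURN NULL;    -- suppress the UPDATE against Tcommencing
    ELSE
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER move_completed BEFORE UPDATE ON Tcommencing
    FOR EACH ROW EXECUTE PROCEDURE tcommencing_move_completed();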

This would also fail if we did as I suggested, and forbade any modifications
to a row for which a BEFORE trigger is in progress. I take this as evidence
that my suggestion was misguided, and that we should in fact only report an
error if the row was modified since the BEFORE trigger started *and* the
trigger returns non-NULL.

I still believe we shouldn't distinguish between UPDATEs and DELETEs, so
a DELETE should fail if the row is updated from within a BEFORE DELETE trigger
which returns non-NULL.

BTW, the case of multiple BEFORE triggers also raises some interesting
questions. Assume there are two BEFORE UPDATE triggers T1 and T2 defined on a
table T, and T1 causes a same-row UPDATE and returns non-NULL. Also assume that
T1 is coded in a way that prevents infinite recursion in that case (i.e., the
same-row UPDATE is skipped if the T1 invocation is already the result of a
same-row UPDATE). The sequence of invocations after an update of a row would be:

1) T1 (for the original UPDATE)
2) T1 (for the same-row UPDATE initiated in (1))
3) T2 (for the same-row UPDATE initiated in (1))
4) T2 (for the original UPDATE)

Now, T2 is invoked in (4) for a row which isn't visible anymore. So I'm thinking
that maybe we should check after every trigger invocation whether or not a
same-row UPDATE occurred, and immediately raise an error if it did and the
trigger didn't return NULL, rather than delay the check until all triggers have
run. It seems unlikely that delaying the check until all triggers have run adds
any significant functionality. For that to be the case, T2 would have to know
whether T1 did a same-row update, and return NULL, since otherwise we'd raise an
error anyway. The beauty of doing the check after every trigger lies in the fact
that it makes the coding rule "Thou shall return NULL if thou dared to modify
thine own row" local (i.e. per-trigger) instead of global (i.e. per
trigger-chain), and thus less likely to be violated by some strange interactions
of triggers serving distinct purposes.

To summarize, here's what I currently believe would be a sensible approach:

  After every BEFORE trigger invocation, if the trigger returned non-NULL,
  check if the latest row version is still the same as when the trigger
  started. If not, complain.

best regards,
Florian Pflug




Re: [HACKERS] psql: display of object comments

2011-08-08 Thread Robert Haas
On Fri, Aug 5, 2011 at 7:25 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 On Fri, Aug 5, 2011 at 8:32 AM, Robert Haas robertmh...@gmail.com wrote:
 I guess my vote is to make the SQL/MED stuff show the description only
 in verbose mode, and always at the end; and revise what we did with
 \dL to put the description at the end even in verbose mode.

 Yeah, that sounds fine to me. I've revised the residual patch to do
 so, as well as incorporated your earlier suggestion about having \dD
 and \dc only display descriptions in verbose-mode.

 I did more testing, and found and fixed some brokenness related to
 d.objsubid I had introduced into the listConversions() query, as well
 as improved the version checking in objectDescription(). Updated patch
 attached.

OK, I've now committed most of this, with some additions to the
documentation.  Remaining bits attached.

I am a bit confused as to why we have both \det and \dE.  They seem
redundant.  Shouldn't we rip one of those out?  IMHO, \det should be
the one to go, as it could be useful to do, e.g. \dtvE, which isn't
going to work with the \det syntax.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


describe_comments.v5.patch
Description: Binary data



Re: [HACKERS] psql: display of object comments

2011-08-08 Thread Josh Kupershmidt
On Mon, Aug 8, 2011 at 4:34 PM, Robert Haas robertmh...@gmail.com wrote:
 OK, I've now committed most of this, with some additions to the
 documentation.  Remaining bits attached.

Looks good, thanks.

 I am a bit confused as to why we have both \det and \dE.  They seem
 redundant.  Shouldn't we rip one of those out?  IMHO, \det should be
 the one to go, as it could be useful to do, e.g. \dtvE, which isn't
 going to work with the \det syntax.

Yeah, I was wondering that myself. At first I thought maybe someone
added in one without being aware of the other, but it looks like both
\dE and \det got added in by commit
0d692a0dc9f0e532c67c577187fe5d7d323cb95b. They are using different
queries internally (pg_class vs. pg_foreign_table), but I doubt end
users would care about that. Or perhaps the author just wanted a
command name similar to \dew and \des.

+1 for getting rid of one of them; I don't really care which one gets
the axe. Though we should make sure to be able to show all possible
columns in whichever command we keep (i.e. add an "Options" column to
\dE+ if we keep that one). BTW, I haven't bothered setting up
functioning foreign tables yet, but is "Size" always going to be 0
bytes in \dE+?

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql: display of object comments

2011-08-08 Thread Josh Kupershmidt
On Mon, Aug 8, 2011 at 6:01 PM, Josh Kupershmidt schmi...@gmail.com wrote:
 (i.e. add Options column to \dE+ if we keep that one).

Oh, never mind: "Options" is displayed by \d+ foreign_table_name.

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Selecting user-defined CASTs

2011-08-08 Thread Joe Abbate
Hi,

I'm trying to query the catalogs to select only the user-defined CASTs
(my test db only has one such CAST).  Looking at pg_dump.c, I've come up
with the following so far:

SELECT castsource::regtype AS source,
   casttarget::regtype AS target,
   castfunc::regprocedure AS function,
   castcontext AS context, castmethod AS method,
   description
FROM pg_cast c
 LEFT JOIN pg_description d
  ON (c.oid = d.objoid AND d.objsubid = 0)
WHERE (castfunc != 0
   AND substring(castfunc::regproc::text for 3) != 'pg_')
ORDER BY castsource, casttarget;

This excludes the vast majority of internal casts (172 of them), where
castfunc is 0 or castfunc::regproc shows up with "pg_catalog." prepended
to the function name.  However, this still pulls in 19 other rows, as
shown in the excerpt output below (after setting search_path to
pg_catalog):

   source    |   target    |               function               | context | method |       description
-------------+-------------+--------------------------------------+---------+--------+--------------------------
 bigint      | regproc     | oid(bigint)                          | i       | f      |
 bigint      | oid         | oid(bigint)                          | i       | f      |
...
 smallint    | boolean     | public.int2_bool(smallint)           | e       | f      | Test comment for cast 1
 integer     | boolean     | bool(integer)                        | e       | f      |
...
 interval    | reltime     | reltime(interval)                    | a       | f      |
 bit varying | bit varying | varbit(bit varying,integer,boolean)  | i       | f      |
(20 rows)

The smallint AS boolean CAST is mine and is the only one I want to retrieve.

It seems the only way out is to do something like a 9-way join between
pg_cast, pg_type, pg_proc and pg_namespace to test the source, target
and function namespaces, much as dumpCast() does in pg_dump.c.  Before I
go that route, I thought I'd check with -hackers to see if there's a
simpler way.

Regards,

Joe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selecting user-defined CASTs

2011-08-08 Thread Joe Abbate
On 08/08/2011 06:31 PM, Joe Abbate wrote:
 It seems the only way out is to do something like a 9-way join between
 pg_cast, pg_type, pg_proc and pg_namespace to test the source, target
 and function namespaces much as dumpCast() does in pg_dump.c.  Before I
 go that route, I'd thought I'd check with -hackers to see if there's a
 simpler way.

Well, for my specific example (both source and target are pg_catalog
types and the function is in the public schema), the following query
does the trick:

SELECT castsource::regtype AS source,
   casttarget::regtype AS target,
   castfunc::regprocedure AS function,
   castcontext AS context, castmethod AS method,
   description
FROM pg_cast c
 JOIN pg_type s ON (castsource = s.oid)
  JOIN pg_namespace sn ON (s.typnamespace = sn.oid)
 JOIN pg_type t ON (casttarget = t.oid)
  JOIN pg_namespace tn ON (t.typnamespace = tn.oid)
 LEFT JOIN pg_proc p ON (castfunc = p.oid)
  LEFT JOIN pg_namespace pn ON (p.pronamespace = pn.oid)
 LEFT JOIN pg_description d
  ON (c.oid = d.objoid AND d.objsubid = 0)
WHERE (substring(sn.nspname for 3) = 'pg_'
   AND substring(tn.nspname for 3) = 'pg_'
   AND castfunc != 0 AND substring(pn.nspname for 3) != 'pg_')
ORDER BY castsource, casttarget;

I realize that for the general case, the WHERE clause has to be expanded
(and may look much, much uglier). Nevertheless, if somebody has some
simplifications, I'd be glad to hear them.
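
One possible generalization (a sketch only; the schema list is an
assumption, not something taken from pg_dump.c): treat a cast as
user-defined when its source type, target type, or cast function lives
outside the built-in schemas.

SELECT c.castsource::regtype AS source,
       c.casttarget::regtype AS target,
       c.castfunc::regprocedure AS function,
       c.castcontext AS context, c.castmethod AS method,
       d.description
FROM pg_cast c
 JOIN pg_type s ON (c.castsource = s.oid)
  JOIN pg_namespace sn ON (s.typnamespace = sn.oid)
 JOIN pg_type t ON (c.casttarget = t.oid)
  JOIN pg_namespace tn ON (t.typnamespace = tn.oid)
 LEFT JOIN pg_proc p ON (c.castfunc = p.oid)
  LEFT JOIN pg_namespace pn ON (p.pronamespace = pn.oid)
 LEFT JOIN pg_description d ON (c.oid = d.objoid AND d.objsubid = 0)
WHERE sn.nspname NOT IN ('pg_catalog', 'information_schema')
   OR tn.nspname NOT IN ('pg_catalog', 'information_schema')
   OR (c.castfunc <> 0
       AND pn.nspname NOT IN ('pg_catalog', 'information_schema'))
ORDER BY c.castsource, c.casttarget;

This still can't recognize a user-created binary-coercible cast
(castfunc = 0) between two built-in types, since such a cast carries no
schema information at all.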

Joe

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selecting user-defined CASTs

2011-08-08 Thread Tom Lane
Joe Abbate j...@freedomcircle.com writes:
 I'm trying to query the catalogs to select only the user-defined CASTs

This is rather difficult to do, actually, because pg_cast stores
neither an owner nor a schema for casts, which eliminates all of the
principled ways in which you might decide that a cast belongs to the
system or the user.

An easy but unprincipled way to do it is

select ... from pg_cast c where c.oid >= 16384;

What that really does is eliminate the casts that were installed during
initdb, which are at least a subset of the system ones, and might be
all of them depending on what you feel a "system" cast is.  The main
shortcoming of it is that there's no very good way to eliminate casts
installed by extensions, should you want to not consider those "user"
casts.

Another approach is to check pg_depend.  A cast installed by initdb will
match a pin entry in pg_depend (refclassid = pg_cast, refobjid =
cast's OID, deptype = 'p').  You're still out of luck for distinguishing
extension members in existing releases, but in 9.1 and up it'll be
possible to identify casts belonging to extensions from pg_depend
entries.
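
To make that concrete, here is a sketch of the pin-entry filter (the
output column list is arbitrary; on 9.1, extension-owned casts could be
screened out the same way via classid/objid entries with deptype = 'e'):

SELECT c.castsource::regtype AS source,
       c.casttarget::regtype AS target,
       c.castcontext AS context, c.castmethod AS method
FROM pg_cast c
WHERE NOT EXISTS (SELECT 1
                  FROM pg_depend d
                  WHERE d.refclassid = 'pg_cast'::regclass
                    AND d.refobjid = c.oid
                    AND d.deptype = 'p')
ORDER BY c.castsource, c.casttarget;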

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers