Re: [HACKERS] Query Plan Columns

2010-11-09 Thread Dimitri Fontaine
David E. Wheeler da...@kineticode.com writes:
 I'm wondering if there's any reason why we couldn't have EXPLAIN do something 
 like this itself in core:

 EXPLAIN (format table) SELECT * FROM bar;

+1 from me here, as it happens parsing a table-like resultset is
already implemented everywhere :)

I'd add another +1 if it were easily usable as a normal relation (or
SRF) from inside a query, e.g. in a WITH table expression.

WITH plan AS (
  EXPLAIN (format table) SELECT * FROM bar
)
SELECT some_plan_analysis_query_here;

Or even:

WITH plan AS (
  EXPLAIN (format table) SELECT * FROM bar
)
INSERT INTO plan_audit
SELECT * FROM plan
 WHERE actual_total_time > 12 * interval '100 ms';

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Query Plan Columns

2010-11-09 Thread David E. Wheeler
On Nov 9, 2010, at 12:12 AM, Dimitri Fontaine wrote:

 WITH plan AS (
  EXPLAIN (format table) SELECT * FROM bar
 )
 INSERT INTO plan_audit
 SELECT * FROM plan
 WHERE actual_total_time > 12 * interval '100 ms';

Yeah, that would be nice, but my current implementation has a row for each 
node, and a single explain can have many nodes. With this, you'd only get the 
top-level node (and not even that, as you didn't do EXPLAIN ANALYZE, so 
actual_total_time would be blank!).
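For illustration only (the table format doesn't exist yet, and per-node rows
would need EXPLAIN ANALYZE to carry timings, assumed here to be in
milliseconds), the ANALYZE variant might look like:

WITH plan AS (
  EXPLAIN (ANALYZE, format table) SELECT * FROM bar
)
INSERT INTO plan_audit
SELECT * FROM plan
 WHERE actual_total_time > 100;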

But I do like the idea…

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLUSTER can change t_len

2010-11-09 Thread Itagaki Takahiro
On Tue, Nov 9, 2010 at 12:44 PM, Jeff Davis pg...@j-davis.com wrote:
 See case below. After the item length gets changed, then when reading
 the tuple later you get a t_len that includes padding.

We can easily find it with pageinspect:

\i pageinspect.sql
create table foo(i int4);
insert into foo values(1);
SELECT lp, lp_len FROM heap_page_items(get_raw_page('foo', 0));
 lp | lp_len
----+--------
  1 |     28
VACUUM FULL foo;
SELECT lp, lp_len FROM heap_page_items(get_raw_page('foo', 0));
 lp | lp_len
----+--------
  1 |     32

 We should document in a comment that t_len can mean multiple things. Or,
 we should fix raw_heap_insert() to be consistent with the rest of the
 code, which doesn't MAXALIGN the t_len.

We have a comment /* be conservative */ in the function, but I'm not sure
we actually need the MAXALIGN. However, there would be almost no benefit
to keeping t_len small, because we often treat memory in MAXALIGN units.

diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 0bd1865..0ed94ef 100644
*** a/src/backend/access/heap/rewriteheap.c
--- b/src/backend/access/heap/rewriteheap.c
*** raw_heap_insert(RewriteState state, Heap
*** 586,592 
else
heaptup = tup;

!   len = MAXALIGN(heaptup->t_len); /* be conservative */

/*
 * If we're gonna fail for oversize tuple, do it right away
--- 586,592 
else
heaptup = tup;

!   len = heaptup->t_len;

/*
 * If we're gonna fail for oversize tuple, do it right away


-- 
Itagaki Takahiro

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Dmitriy Igrishin
Hey Pavel,

2010/11/9 Pavel Stehule pavel.steh...@gmail.com

  a) transformation to common type
 
  + simple - it is one day job - function record_to_array,
  array_to_record, and fieldnames_to_array
   - lost of type info, hidden problems with IO cast - int a := 10.0/2.0
  is a problem
 
  using a plperl, pltcl is same like @a
 

 I'm thinking about a simple API that can be based on transformation
 to text. It could be enough for almost all cases.

 * text[] = record_to_array(record)
 * table(id, key, datatype, value) = record_to_table(record)
 * text = record_get_field(record, text)
 * record = record_set_field(record, text, anyelement)

 ??

I personally like it. But I propose to add as well:
  integer := record_nkeys();
  text := record_get_field(record, integer);
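
For a feel of how such an API might be used, here is a rough plpgsql
sketch (none of these functions exist yet; record_to_table() is just the
proposed name, and the sample record is made up):

DO $$
DECLARE
  r record;
  f record;
BEGIN
  SELECT 1 AS id, 'foo'::text AS name INTO r;
  FOR f IN SELECT * FROM record_to_table(r) LOOP
    RAISE NOTICE 'field % (%) = %', f.key, f.datatype, f.value;
  END LOOP;
END;
$$;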


 Pavel

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
// Dmitriy.


Re: [HACKERS] CLUSTER can change t_len

2010-11-09 Thread Heikki Linnakangas

On 09.11.2010 11:11, Itagaki Takahiro wrote:

On Tue, Nov 9, 2010 at 12:44 PM, Jeff Davispg...@j-davis.com  wrote:

See case below. After the item length gets changed, then when reading
the tuple later you get a t_len that includes padding.


We can easily find it with pageinspect:

\i pageinspect.sql
create table foo(i int4);
insert into foo values(1);
SELECT lp, lp_len FROM heap_page_items(get_raw_page('foo', 0));
  lp | lp_len
 ----+--------
   1 |     28
VACUUM FULL foo;
SELECT lp, lp_len FROM heap_page_items(get_raw_page('foo', 0));
  lp | lp_len
 ----+--------
   1 |     32


We should document in a comment that t_len can mean multiple things. Or,
we should fix raw_heap_insert() to be consistent with the rest of the
code, which doesn't MAXALIGN the t_len.


We have a comment /* be conservative */ in the function, but I'm not sure
we actually need the MAXALIGN. However, there would be almost no benefits
to keep t_len in small value because we often treat memory in MAXALIGN unit.


Hmm, the conservatism at that point affects the free space calculations. 
I'm not sure if it makes any difference in practice, but I'm also not 
sure it doesn't. pd_upper is always MAXALIGNed, but pd_lower is not.


This would be more in line with what the main heap_insert code does:

--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -641,7 +641,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
}

/* And now we can insert the tuple into the page */
-   newoff = PageAddItem(page, (Item) heaptup->t_data, len,
+   newoff = PageAddItem(page, (Item) heaptup->t_data, heaptup->t_len,
 InvalidOffsetNumber, false, true);
if (newoff == InvalidOffsetNumber)
elog(ERROR, "failed to add tuple");

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] security hooks on object creation

2010-11-09 Thread KaiGai Kohei
The attached patch provides plugin modules with a hook just after object
creation time. In typical use cases, it enables external security providers
to assign default security labels at object creation.

As Robert suggested before, it provides a general-purpose main hook.
It takes an enum of ObjectAccessType, which informs plugins what kind
of access is being performed, and the identifier of the object being
referenced. But in this version, no additional information, such as the
new name in ALTER xxx RENAME TO, is supported yet.

The ObjectAccessType is defined as follows:

  typedef enum ObjectAccessType {
OAT_POST_CREATE,/* Post creation fixups; such as security labeling */
  } ObjectAccessType;

We will support a more complete set of access types in a future version;
for now we focus on default labeling rather than DDL permissions, so only
OAT_POST_CREATE is defined here.
Later we may add OAT_ALTER, OAT_DROP, OAT_COMMENT, and so on.

In this patch, I put hooks just after the creation of the database objects
that we can assign security labels to (schema, relation, attribute,
procedure, language, type, large object).

However, I didn't touch or move CommandCounterIncrement() yet, although
we had a long discussion about MVCC visibility of new objects, because
I'm not convinced it is really preferable to inject CCIs at random points
such as TypeCreate() or ProcedureCreate(), as the version I developed and
then killed off did. (In other words, it was simply ugly...)

At least, we can see the new entries with SnapshotSelf, although we pay
a performance penalty for that. The purpose of the post-creation hooks is
to assign default security labels, not to check DDL permissions, so it is
not a bad idea to leave the CCI-related routines untouched in this earlier
version of the external security provider.

In this patch, we put InvokeObjectAccessHook0 calls in the following functions.

- heap_create_with_catalog() for relations/attributes
- ATExecAddColumn() for attributes
- NamespaceCreate() for schemas
- ProcedureCreate() for aggregates/functions
- TypeCreate() and TypeShellMake() for types
- create_proc_lang() for procedural languages
- inv_create() for large objects

Thanks,
-- 
KaiGai Kohei kai...@ak.jp.nec.com
*** a/src/backend/catalog/heap.c
--- b/src/backend/catalog/heap.c
***
*** 61,66 
--- 61,67 
  #include "storage/freespace.h"
  #include "storage/smgr.h"
  #include "utils/acl.h"
+ #include "utils/esp.h"
  #include "utils/builtins.h"
  #include "utils/fmgroids.h"
  #include "utils/inval.h"
***
*** 1189,1194  heap_create_with_catalog(const char *relname,
--- 1190,1202 
  	}
  
  	/*
+ 	 * If installed, ESP can assign initial properties (such as security
+ 	 * labels) of the relation.
+ 	 */
+ 	InvokeObjectAccessHook0(OAT_POST_CREATE,
+ 			RelationRelationId, relid, 0);
+ 
+ 	/*
  	 * Store any supplied constraints and defaults.
  	 *
  	 * NB: this may do a CommandCounterIncrement and rebuild the relcache
*** a/src/backend/catalog/pg_namespace.c
--- b/src/backend/catalog/pg_namespace.c
***
*** 19,24 
--- 19,25 
  #include "catalog/indexing.h"
  #include "catalog/pg_namespace.h"
  #include "utils/builtins.h"
+ #include "utils/esp.h"
  #include "utils/rel.h"
  #include "utils/syscache.h"
  
***
*** 75,79  NamespaceCreate(const char *nspName, Oid ownerId)
--- 76,87 
  	/* Record dependency on owner */
  	recordDependencyOnOwner(NamespaceRelationId, nspoid, ownerId);
  
+ 	/*
+ 	 * If installed, ESP can assign initial properties (such as security
+ 	 * labels) of the new namespace.
+ 	 */
+ 	InvokeObjectAccessHook0(OAT_POST_CREATE,
+ 			NamespaceRelationId, nspoid, 0);
+ 
  	return nspoid;
  }
*** a/src/backend/catalog/pg_proc.c
--- b/src/backend/catalog/pg_proc.c
***
*** 32,37 
--- 32,38 
  #include "tcop/pquery.h"
  #include "tcop/tcopprot.h"
  #include "utils/acl.h"
+ #include "utils/esp.h"
  #include "utils/builtins.h"
  #include "utils/lsyscache.h"
  #include "utils/syscache.h"
***
*** 614,619  ProcedureCreate(const char *procedureName,
--- 615,627 
  			  nnewmembers, newmembers);
  	}
  
+ 	/*
+ 	 * If installed, ESP can assign initial properties (such as security
+ 	 * labels) of the new function.
+ 	 */
+ 	InvokeObjectAccessHook0(OAT_POST_CREATE,
+ 			ProcedureRelationId, retval, 0);
+ 
  	heap_freetuple(tup);
  
  	heap_close(rel, RowExclusiveLock);
*** a/src/backend/catalog/pg_type.c
--- b/src/backend/catalog/pg_type.c
***
*** 26,31 
--- 26,32 
  #include "miscadmin.h"
  #include "parser/scansup.h"
  #include "utils/acl.h"
+ #include "utils/esp.h"
  #include "utils/builtins.h"
  #include "utils/fmgroids.h"
  #include "utils/lsyscache.h"
***
*** 156,161  TypeShellMake(const char *typeName, Oid typeNamespace, Oid ownerId)
--- 157,169 
   false);
  
  	/*
+ 	 * If installed, ESP can assign initial properties (such as security
+ 	 

Re: [HACKERS] W3C Specs: Web SQL

2010-11-09 Thread Sam Mason
On Mon, Nov 08, 2010 at 12:55:22PM -0300, Alvaro Herrera wrote:
 Excerpts from Charles Pritchard's message of sáb nov 06 23:20:13 -0300 2010:
 
  Simple async sql sub-set (the spec in trouble):
  http://dev.w3.org/html5/webdatabase/
 
 This is insane.  This spec allows the server to run arbitrary SQL
 commands on the client, AFAICT.  That seems like infinite joy for
 malicious people running webservers.  The more powerful the dialect of
 SQL the client implements, the more dangerous it is.

How is this different from the server asking the client to run an
infinite loop in javascript?

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Avoid memory leaks during ANALYZE's compute_index_stats() ?

2010-11-09 Thread Jakub Ouhrabka

Hi Tom,

thanks for brilliant analysis - now we know how to avoid the problem.

As a side note: from the user's point of view it would be really nice to 
know that the error was caused by auto-ANALYZE - at least on 8.2 that's 
not obvious from the server log. It was the first message with the given 
backend PID, so it looked to me like a problem during backend startup - 
and we have log_connections set to on...


Thanks,

Kuba

On 9.11.2010 2:04, Tom Lane wrote:

I looked into the out-of-memory problem reported by Jakub Ouhrabka here:
http://archives.postgresql.org/pgsql-general/2010-11/msg00353.php

It's pretty simple to reproduce, even in HEAD; what you need is an index
expression that computes a bulky intermediate result.  His example is

md5(array_to_string(f1, ''::text))

where f1 is a bytea array occupying typically 15kB per row.  Even
though the final result of md5() is only 32 bytes, evaluation of this
expression will eat about 15kB for the detoasted value of f1, roughly
double that for the results of the per-element output function calls
done inside array_to_string, and another 30k for the final result string
of array_to_string.  And *none of that gets freed* until
compute_index_stats() is all done.  In my testing, with the default
stats target of 100, this gets repeated for 30k sample rows, requiring
something in excess of 2GB in transient space.  Jakub was using stats
target 500 so it'd be closer to 10GB for him.
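
For reference, a minimal reproducer along these lines might look like the
sketch below (the names are made up, and depending on the server version
array_to_string() may need an IMMUTABLE wrapper to be accepted in an index
expression):

CREATE TABLE leaky (f1 bytea[]);

INSERT INTO leaky
  SELECT ARRAY[decode(repeat('00', 15000), 'hex')]   -- ~15kB per row
  FROM generate_series(1, 100000);

CREATE INDEX leaky_md5_idx ON leaky (md5(array_to_string(f1, ''::text)));

SET default_statistics_target = 500;
ANALYZE leaky;  -- transient memory use grows with the sample size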

AFAICS the only practical fix for this is to have the inner loop of
compute_index_stats() copy each index expression value out of the
per-tuple memory context and into the per-index Analyze Index context.
That would allow it to reset the per-tuple memory context after each
FormIndexDatum call and thus clean up whatever intermediate result trash
the evaluation left behind.  The extra copying is a bit annoying, since
it would add cycles while accomplishing nothing useful for index
expressions with no intermediate results, but I'm thinking this is a
must-fix.

Comments?

regards, tom lane


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] W3C Specs: Web SQL

2010-11-09 Thread Andrew Dunstan



On 11/09/2010 09:59 AM, Alvaro Herrera wrote:

And this makes me think that SQLite is indeed the right tool for the job
here, and not PostgreSQL.  If someone intrudes, it's going to be in the
same process running the web browser, not in some server running under
another user identity in the machine.  That seems like a feature to me,
not a bug.



Right. Then it has some chance to be run in a sandbox. This doesn't 
strike me at all as a good fit for Postgres.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] improved parallel make support

2010-11-09 Thread Dave Page
On Sat, Nov 6, 2010 at 4:35 AM, Peter Eisentraut pete...@gmx.net wrote:
 On ons, 2010-11-03 at 16:34 +0200, Peter Eisentraut wrote:
 On tis, 2010-11-02 at 10:21 -0400, Tom Lane wrote:
  Do we have a handle on how many buildfarm members this will break?

 I suppose we don't.  One way to find out would be to commit just this
 bit

 +# We need the $(eval) function, which is available in GNU make 3.80.
 +# That also happens to be the version where the .VARIABLES variable
 +# was introduced, so this is a simple check.
 +ifndef .VARIABLES
 +$(error GNU make 3.80 or newer is required)
 +endif

 with a $(warning) instead, and let it run for a bit.

 So far, two machines have reported an older make version:

 dawn_bat
 narwhal

 both of the mingw type.  Andrew, Dave, could you see about upgrading the
 GNU make installation there?

Narwhal should be OK now.

/D

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLUSTER can change t_len

2010-11-09 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 On 09.11.2010 11:11, Itagaki Takahiro wrote:
 We have a comment /* be conservative */ in the function, but I'm not sure
 we actually need the MAXALIGN. However, there would be almost no benefits
 to keep t_len in small value because we often treat memory in MAXALIGN unit.

 Hmm, the conservatism at that point affects the free space calculations. 
 I'm not sure if it makes any difference in practice, but I'm also not 
 sure it doesn't. pd_upper is always MAXALIGNed, but pd_lower is not.

I tend to agree with Jeff's original point that the behavior should
match regular tuple insertion exactly.  This isn't about saving space,
because it won't; it's about not confusing readers by doing the same
thing in randomly different ways.  I will also note that the regular
path is FAR better tested than raw_heap_insert.  If there are any bugs
here, it's 1000:1 they're in raw_heap_insert not the regular path.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby b-tree delete records review

2010-11-09 Thread Simon Riggs
On Tue, 2010-11-09 at 13:34 +0200, Heikki Linnakangas wrote:
 (cleaning up my inbox, and bumped into this..)
 
 On 22.04.2010 12:31, Simon Riggs wrote:
  On Thu, 2010-04-22 at 12:18 +0300, Heikki Linnakangas wrote:
  Simon Riggs wrote:
  On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:
 
  If none of the removed heap tuples were present anymore, we currently
  return InvalidTransactionId, which kills/waits out all read-only
  queries. But if none of the tuples were present anymore, the 
  read-only
  queries wouldn't have seen them anyway, so ISTM that we should treat
  InvalidTransactionId return value as we don't need to kill anyone.
  That's not the point. The tuples were not themselves the sole focus,
  Yes, they were. We're replaying a b-tree deletion record, which removes
  pointers to some heap tuples, making them unreachable to any read-only
  queries. If any of them still need to be visible to read-only queries,
  we have a conflict. But if all of the heap tuples are gone already,
  removing the index pointers to them can't change the situation for any
  query. If any of them should've been visible to a query, the damage was
  done already by whoever pruned the heap tuples leaving just the
  tombstone LP_DEAD item pointers (in the heap) behind.
  You're missing my point. Those tuples are indicators of what may lie
  elsewhere in the database, completely unreferenced by this WAL record.
  Just because these referenced tuples are gone doesn't imply that all
  tuple versions written by the as yet-unknown-xids are also gone. We
  can't infer anything about the whole database just from one small group
  of records.
  Have you got an example of that?
 
  I don't need one, I have suggested the safe route. In order to infer
  anything, and thereby further optimise things, we would need proof that
  no cases can exist, which I don't have. Perhaps we can add yet, not
  sure about that either.
 
  It's good to be safe rather than sorry, but I'd still like to know
  because I'm quite surprised by that, and got me worried that I don't
  understand how hot standby works as well as I thought I did. I thought
  the point of stopping replay/killing queries at a b-tree deletion record
  is precisely that it makes some heap tuples invisible to running
  read-only queries. If it doesn't make any tuples invisible, why do any
  queries need to be killed? And why was it OK for them to be running just
  before replaying the b-tree deletion record?
 
  I'm sorry but I'm too busy to talk further on this today. Since we are
  discussing a further optimisation rather than a bug, I hope it is OK to
  come back to this again later.
 
 Would now be a good time to revisit this? I still don't see why a b-tree 
 deletion record should conflict with anything, if all the removed index 
 tuples point to just LP_DEAD tombstones in the heap.

I want what you say to be true. The question is: is it? We just need to
explain why that will never be a problem.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Mon, Nov 8, 2010 at 5:59 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 The problem that putting checksums in a different place solves is the
 page layout (binary upgrade) problem.  You're still doing to need to
 buffer the page as you calculate the checksum and write it out.
 buffering that page is absolutely necessary no mater where you put the
 checksum, unless you've got an exclusive lock that blocks even hint
 updates on the page.

But buffering the page only means you've got some consistent view of
the page. It doesn't mean the checksum will actually match the data in
the page that gets written out. So when you read it back in the
checksum may be invalid.

I wonder if we could get by by having some global counter on the page
which you increment when you set a hint bit. That way when you read
the page back in you could compare the counter on the page and the
counter for the checksum and if the checksum counter is behind ignore
the checksum? It would be nice to do better but I'm not sure we can.



 But if we can start using forks to put other data, that means that
 keeping the page layouts is easier, and thus binary upgrades are much
 more feasible.


The difficulty with the page layout didn't come from the checksum
itself. We can add 4 or 8 bytes to the page header easily enough. The
difficulty came from trying to move the hint bits for all the tuples
to a dedicated area. That means three resizable areas, so either one of
them would have to be relocatable, or we'd need some other solution (like
not checksumming the line pointers and putting the hint bits in the line
pointers). If we're willing to have invalid checksums whenever the
hint bits get set then this wouldn't be necessary.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] W3C Specs: Web SQL

2010-11-09 Thread Kevin Grittner
Alvaro Herrera  wrote:
 Excerpts from Charles Pritchard's message:
 
 I don't believe the webmaster is granted free rein:
 Disk quotas are enforced, data is separated per origin,
 hanging processes are up to the implementer, and postgres has
 plenty of settings for that.

 The day a privilege escalation is found and some webserver runs
 pg_read_file() on your browser, will be a sad one indeed.
 
Personally, I feel somewhat more safe about trusting PostgreSQL on
this than JavaScript, Java applets, a Flash plug-in, and cookies --
all of which are enabled in my browser.  Sure, I occasionally hit an
ill-behaved page and need to xkill my browser.  I don't visit that
site again.  And it really doesn't happen to me very often.
 
Can you make a case that this proposal is more dangerous than
having all the above enabled?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] security hooks on object creation

2010-11-09 Thread KaiGai Kohei

(2010/11/09 20:34), Itagaki Takahiro wrote:

2010/11/9 KaiGai Koheikai...@ak.jp.nec.com:

The attached patch provides plugin modules a hook just after object
creation time. In typical use cases, it enables to assign default
security labels on object creation by the external security providers.


It looks like DDL triggers in other database products.
Do we need to consider both security hooks and DDL triggers now?
Or, is it enough to design DDL triggers after the hooks are merged?
Low-level hooks might be better for security providers because
SQL-level triggers could be uninstalled by superusers.


An interesting viewpoint. Do DDL triggers allow us to do something
on CREATE/ALTER/DROP commands?

One thing we need to pay attention to is that the CREATE command is an
exception among DDL commands, because the database object to be modified
does not exist before the actual work is done. So, I'm saying we need both
pre- and post-creation hooks for the complete feature set.
Meanwhile, I don't think we need security hooks after ALTER/DROP commands.
Thus, we will put those security hooks next to the existing permission
checks, not after the actual work of these commands.
Is that reasonable for DDL triggers (if they have something like BEFORE/AFTER)?

Thanks,
--
KaiGai Kohei kai...@kaigai.gr.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] timestamp of the last replayed transaction

2010-11-09 Thread Fujii Masao
On Tue, Nov 9, 2010 at 1:05 AM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Nov 8, 2010 at 6:00 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sat, Nov 6, 2010 at 9:58 AM, Robert Haas robertmh...@gmail.com wrote:
 This looks good, but how about adding:

 if (!RecoveryInProgress())
    PG_RETURN_NULL();

 Otherwise, if we're in Hot Standby mode for a while and then enter
 normal running, wouldn't this still return a (stale) value?

 Yes, but isn't that (stale) value useful to check how far WAL records
 have been replayed, *after failover*?

 Oh, OK.  I guess that makes sense.  One other question - should we say
 pg_last_xact_replay_timestamp() rather than
 pg_xact_last_replay_timestamp(), for consistency with
 pg_last_xlog_replay_location()?

Yeah, pg_last_xact_replay_timestamp was my first idea. But to me (with my
poor English), that sounded like it returns the timestamp of when the last
transaction was replayed on the standby, rather than the timestamp of the
last replayed transaction (i.e., when the last replayed transaction was
committed in the master). So I didn't choose that name.

But if pg_last_xact_replay_timestamp is more intuitive for most people,
let's change the name.
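
For context, the typical use on the standby would be a monitoring query
along these lines (a sketch only, assuming the function keeps the proposed
name and returns the master-side commit time):

SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;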

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_ctl init doc bug

2010-11-09 Thread Euler Taveira de Oliveira
Hi,

While executing the following command I got:

$ pg_ctl init -D /tmp/foo -o -N 11
/home/euler/pg/bin/initdb: invalid option -- N
Try "initdb --help" for more information.
pg_ctl: database system initialization failed

I tried -N 11 (a postgres option) after looking at the manual, but the -o
option only accepts initdb options, so I think there is a documentation bug.
Patch is attached.

While at it, is it worth adding a pg_ctl init example?


-- 
  Euler Taveira de Oliveira
  http://www.timbira.com/
diff --git a/doc/src/sgml/ref/pg_ctl-ref.sgml b/doc/src/sgml/ref/pg_ctl-ref.sgml
index 29bea24..0cc82cd 100644
*** a/doc/src/sgml/ref/pg_ctl-ref.sgml
--- b/doc/src/sgml/ref/pg_ctl-ref.sgml
*** PostgreSQL documentation
*** 25,31 
    <arg choice="plain">init[db]</arg>
    <arg>-s</arg>
    <arg>-D <replaceable>datadir</replaceable></arg>
!   <arg>-o <replaceable>options</replaceable></arg>
   </cmdsynopsis>
  
   <cmdsynopsis>
--- 25,31 
    <arg choice="plain">init[db]</arg>
    <arg>-s</arg>
    <arg>-D <replaceable>datadir</replaceable></arg>
!   <arg>-o <replaceable>init-options</replaceable></arg>
   </cmdsynopsis>
  
   <cmdsynopsis>
*** PostgreSQL documentation
*** 263,268 
--- 263,282 
   </varlistentry>
  
   <varlistentry>
+   <term><option>-o <replaceable class="parameter">init-options</replaceable></option></term>
+   <listitem>
+    <para>
+     Specifies options to be passed directly to the
+     <command>initdb</command> command.
+    </para>
+    <para>
+     The options are usually surrounded by single or double
+     quotes to ensure that they are passed through as a group.
+    </para>
+   </listitem>
+  </varlistentry>
+ 
+  <varlistentry>
    <term><option>-m <replaceable class="parameter">mode</replaceable></option></term>
    <listitem>
     <para>

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How can we tell how far behind the standby is?

2010-11-09 Thread Fujii Masao
On Tue, Nov 9, 2010 at 7:57 AM, Josh Berkus j...@agliodbs.com wrote:

 The patch which I'm proposing is helpful for you?
 http://archives.postgresql.org/pgsql-hackers/2010-11/msg00167.php

 Depends.  Is that the timestamp on the master (when it was synced), or
 the timestamp on the standby (when it was replayed)?  It's only helpful
 if it's the former.

It's the former. The function which I'm proposing returns the timestamp
of the last replayed commit/abort log record. That timestamp is given
when the commit/abort log record is generated in the master.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Aidan Van Dyk
On Tue, Nov 9, 2010 at 8:45 AM, Greg Stark gsst...@mit.edu wrote:

 But buffering the page only means you've got some consistent view of
 the page. It doesn't mean the checksum will actually match the data in
 the page that gets written out. So when you read it back in the
 checksum may be invalid.

I was assuming that if the code went through the trouble to buffer the
shared page to get a stable, non-changing copy to use for
checksumming/writing it, it would write() the buffered copy it just
made, not the original in shared memory...  I'm not sure how that
write could be inconsistent.

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 2:28 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 On Tue, Nov 9, 2010 at 8:45 AM, Greg Stark gsst...@mit.edu wrote:

 But buffering the page only means you've got some consistent view of
 the page. It doesn't mean the checksum will actually match the data in
 the page that gets written out. So when you read it back in the
 checksum may be invalid.

 I was assuming that if the code went through the trouble to buffer the
 shared page to get a stable, non-changing copy to use for
 checksumming/writing it, it would write() the buffered copy it just
 made, not the original in shared memory...  I'm not sure how that
 write could be in-consistent.

Oh, I'm mistaken. The problem was that buffering the writes was
insufficient to deal with torn pages. Even if you buffer the writes if
the machine crashes while only having written half the buffer out then
the checksum won't match. If the only changes on the page were hint
bit updates then there will be no full page write in the WAL log to
repair the block.

It's possible that *that* situation is rare enough to let the checksum
raise a warning but not an error.

But personally I'm pretty loath to buffer every page write. The state
of the art is zero-copy processing and we should be looking to reduce
copies rather than increase them. Though I suppose if we did a
zero-copy CRC that might actually get us this buffered write for free.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark gsst...@mit.edu wrote:
 Oh, I'm mistaken. The problem was that buffering the writes was
 insufficient to deal with torn pages. Even if you buffer the writes if
 the machine crashes while only having written half the buffer out then
 the checksum won't match. If the only changes on the page were hint
 bit updates then there will be no full page write in the WAL log to
 repair the block.

Huh, this implies that if we did go through all the work of
segregating the hint bits and could arrange that they all appear on
the same 512-byte sector and if we buffered them so that we were
writing the same bits we checksummed then we actually *could* include
them in the CRC after all since even a torn page will almost certainly
not tear an individual sector.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLUSTER can change t_len

2010-11-09 Thread Heikki Linnakangas

On 09.11.2010 15:57, Greg Stark wrote:

On Tue, Nov 9, 2010 at 10:20 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


We have a comment /* be conservative */ in the function, but I'm not sure
we actually need the MAXALIGN. However, there would be almost no benefits
to keep t_len in small value because we often treat memory in MAXALIGN
unit.


Hmm, the conservatism at that point affects the free space calculations. I'm
not sure if it makes any difference in practice, but I'm also not sure it
doesn't. pd_upper is always MAXALIGNed, but pd_lower is not.

This would be more in line with what the main heap_insert code does:


Doesn't this cause assertion failures in heap_fill_tuple when the data
size isn't what's expected? I guess we never actually use the t_len
for later tuple reconstructions, we just recompute the needed size?


Right, the length from t_len or the item pointer is never passed to 
heap_fill_tuple.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLUSTER can change t_len

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 10:20 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 We have a comment /* be conservative */ in the function, but I'm not sure
 we actually need the MAXALIGN. However, there would be almost no benefits
 to keep t_len in small value because we often treat memory in MAXALIGN
 unit.

 Hmm, the conservatism at that point affects the free space calculations. I'm
 not sure if it makes any difference in practice, but I'm also not sure it
 doesn't. pd_upper is always MAXALIGNed, but pd_lower is not.

 This would be more in line with what the main heap_insert code does:



Doesn't this cause assertion failures in heap_fill_tuple when the data
size isn't what's expected? I guess we never actually use the t_len
for later tuple reconstructions, we just recompute the needed size?
-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] TODO Alter Table Rename Constraint

2010-11-09 Thread Viktor Valy
Hi Everybody!

I looked at this TODO item and worked out a plan for how the implementation
could be done.
The main challenge is to keep constraints and indexes (for UNIQUE and PK
constraints) consistent.
Fortunately this is already handled in the index case: for ALTER INDEX
... RENAME, consistency is ensured by an extra check in tablecmds.c
(line 2246), which ultimately calls RenameConstraintById
(pg_constraint.c, line 604).

My idea is to do the analogous thing for ALTER ... RENAME CONSTRAINT.
Renaming the constraint would be done in tablecmds.c, as for indexes,
tables, sequences, and views. After checking whether the rename target is
a constraint, an extra rename has to be done for the index. The index can
be obtained with get_constraint_index (pg_depend.c, line 475); then it
should be possible to do the same as in RenameConstraintById.

Is that approach legitimate? Is anything else to be considered?

I'd appreciate any suggestions, and maybe some help too, as I'm not very
familiar with the source.
Thanks in advance,

Viktor
(Student at the Technical University of Vienna)


Re: [HACKERS] CLUSTER can change t_len

2010-11-09 Thread Heikki Linnakangas

On 09.11.2010 17:14, Tom Lane wrote:

Heikki Linnakangasheikki.linnakan...@enterprisedb.com  writes:

On 09.11.2010 11:11, Itagaki Takahiro wrote:

We have a comment /* be conservative */ in the function, but I'm not sure
we actually need the MAXALIGN. However, there would be almost no benefits
to keep t_len in small value because we often treat memory in MAXALIGN unit.



Hmm, the conservatism at that point affects the free space calculations.
I'm not sure if it makes any difference in practice, but I'm also not
sure it doesn't. pd_upper is always MAXALIGNed, but pd_lower is not.


I tend to agree with Jeff's original point that the behavior should
match regular tuple insertion exactly.  This isn't about saving space,
because it won't; it's about not confusing readers by doing the same
thing in randomly different ways.  I will also note that the regular
path is FAR better tested than raw_heap_insert.  If there are any bugs
here, it's 1000:1 they're in raw_heap_insert not the regular path.


Agreed. I've committed my patch to make it behave like heap_insert. 
Thank you, Itagaki, for the easy test case using pageinspect.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Jim Nasby
On Nov 9, 2010, at 9:27 AM, Greg Stark wrote:
 On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark gsst...@mit.edu wrote:
 Oh, I'm mistaken. The problem was that buffering the writes was
 insufficient to deal with torn pages. Even if you buffer the writes if
 the machine crashes while only having written half the buffer out then
 the checksum won't match. If the only changes on the page were hint
 bit updates then there will be no full page write in the WAL log to
 repair the block.
 
 Huh, this implies that if we did go through all the work of
 segregating the hint bits and could arrange that they all appear on
 the same 512-byte sector and if we buffered them so that we were
 writing the same bits we checksummed then we actually *could* include
 them in the CRC after all since even a torn page will almost certainly
 not tear an individual sector.

If there's a torn page then we've crashed, which means we go through crash 
recovery, which puts a valid page (with valid CRC) back in place from the WAL. 
What am I missing?

BTW, I agree that at minimum we need to leave the option of only raising a 
warning when we hit a checksum failure. Some people might want Postgres to 
treat it as an error by default, but most folks will at least want the option 
to look at their (corrupt) data.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Gurjeet Singh
On Tue, Nov 9, 2010 at 12:32 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 There are also crosschecks that you can apply: if it's a heap page, are
 there any index pages with pointers to it?  If it's an index page, are
 there downlink or sibling links to it from elsewhere in the index?
 A page that Postgres left as zeroes would not have any references to it.

 IMO there are a lot of methods that can separate filesystem misfeasance
 from Postgres errors, probably with greater reliability than this hack.
 I would also suggest that you don't really need to prove conclusively
 that any particular instance is one or the other --- a pattern across
 multiple instances will tell you what you want to know.


Doing this postmortem on a regular deployment and fixing the problem would
not be too difficult. But this platform, which Postgres is a part of,  would
be mostly left unattended once deployed (pardon me for not sharing the
details, as I am not sure if I can).

An external HA component is supposed to detect any problems (by querying
Postgres or by external means) and take an evasive action. It is this
automation of problem detection that we are seeking.

As Greg pointed out, even with this hack in place, we might still get zero
pages from the FS (say, when ext3 does metadata journaling but not block
journaling). In that case we'd rely on recovery's WAL replay of relation
extension to reintroduce the magic number in pages.


 What's more, if I did believe that this was a safe and
 reliable technique, I'd be unhappy about the opportunity cost of
 reserving it for zero-page testing rather than other purposes.


This is one of those times where you are a bit too terse for me. What does
zero-page imply that this hack wouldn't?

Regards,
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurj...@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


[HACKERS] DROP TABLESPACE needs crash-resistance

2010-11-09 Thread Gurjeet Singh
We are facing a problem dropping a tablespace after crash recovery.
Recovery starts from the last checkpoint, but tables that were created in
a tablespace by a transaction before the checkpoint are still lying around;
the transaction had not finished by the time of the crash.

After recovery, when the app tries to drop the tablespace, the command fails
because the tablespace directory is not empty.

Solving this problem has become quite critical, since the platform that
Postgres is part of is supposed to run unattended. The problem is currently
being worked around with an application-specific kluge, which is highly
undesirable, as the kluge might not work as the application evolves.

Has this problem been reported/discussed earlier? Any suggestions to avoid
this situation?

I have a hackish idea: list the files created by yet-to-be-committed
transactions after every checkpoint, so that the recovery code can remember
to remove such files if the creating transaction's commit record is not
encountered by the end of recovery. But this would require every
smgrcreate() to be communicated to the bgwriter, and some way to make the
bgwriter forget this list when the transaction commits.

Regards,
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurj...@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 4:26 PM, Jim Nasby j...@nasby.net wrote:
 On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark gsst...@mit.edu wrote:
 Oh, I'm mistaken. The problem was that buffering the writes was
 insufficient to deal with torn pages. Even if you buffer the writes if
 the machine crashes while only having written half the buffer out then
 the checksum won't match. If the only changes on the page were hint
 bit updates then there will be no full page write in the WAL log to
 repair the block.

 If there's a torn page then we've crashed, which means we go through crash 
 recovery, which puts a valid page (with valid CRC) back in place from the 
 WAL. What am I missing?

If the only changes on the page were hint bit updates then there will
be no full page write in the WAL to repair the block



-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Aidan Van Dyk
On Tue, Nov 9, 2010 at 11:26 AM, Jim Nasby j...@nasby.net wrote:

 Huh, this implies that if we did go through all the work of
 segregating the hint bits and could arrange that they all appear on
 the same 512-byte sector and if we buffered them so that we were
 writing the same bits we checksummed then we actually *could* include
 them in the CRC after all since even a torn page will almost certainly
 not tear an individual sector.

 If there's a torn page then we've crashed, which means we go through crash 
 recovery, which puts a valid page (with valid CRC) back in place from the 
 WAL. What am I missing?

The problem case is where hint bits have been set.  Hint bits have
always been "we don't really care, but we write them".

A torn page on hint-bit-only writes is OK, because with a torn page
(assuming you don't get zeroed pages), you get the old or new chunks
of the complete 8K buffer, but they are identical except for the
hint bits, for which either the old or the new state is sufficient.

But with a checksum, a torn page with only hint-bit updates now gets
noticed.  Before, it might have happened, but we wouldn't have noticed
or cared.

So, for getting checksums, we have to give up a few things:
1) zero-copy writes; we need to buffer the write to get a consistent
checksum (or lock the buffer tight)
2) saving hint bits on an otherwise unchanged page; we either need to
just not write that page, and lose the work the hint bits did, or do
a full-page WAL write of it, so the torn-page checksum is fixed

Both of these are theoretical performance tradeoffs.  How badly do we
want to verify on read that it is *exactly* what we thought we wrote?

a.


-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread David E. Wheeler
On Nov 9, 2010, at 1:38 AM, Dmitriy Igrishin wrote:

 * text[] = record_to_array(record)
 * table(id, key, datatype, value) = record_to_table(record)
 * text = record_get_field(record, text)
 * record = record_set_field(record, text, anyelement)
 
 ??
 I personally like it. But I propose to add as well:
   integer := record_nkeys();
   text := record_get_field(record, integer);

You realize you can pretty much do all this with hstore, right?

  hstore hash := hstore(record);
  ary text[] := hstore_to_matrix(hash);
  select * from each(hash); -- TABLE
  text foo := hash -> somekey;
  hash := hash || 'key,value'::hstore;
  record := populate_record(record, hash);

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Tom Lane
Gurjeet Singh singh.gurj...@gmail.com writes:
 On Tue, Nov 9, 2010 at 12:32 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 IMO there are a lot of methods that can separate filesystem misfeasance
 from Postgres errors, probably with greater reliability than this hack.

 Doing this postmortem on a regular deployment and fixing the problem would
 not be too difficult. But this platform, which Postgres is a part of,  would
 be mostly left unattended once deployed (pardon me for not sharing the
 details, as I am not sure if I can).

 An external HA component is supposed to detect any problems (by querying
 Postgres or by external means) and take an evasive action. It is this
 automation of problem detection that we are seeking.

To be blunt, this argument is utter nonsense.  The changes you propose
would still require manual analysis of any detected issues in order to
do anything useful about them.  Once you know that there is, or isn't,
a filesystem-level error involved, what are you going to do next?
You're going to go try to debug the component you know is at fault,
that's what.  And that problem is still AI-complete.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Dmitriy Igrishin
Hey David,

2010/11/9 David E. Wheeler da...@kineticode.com

 On Nov 9, 2010, at 1:38 AM, Dmitriy Igrishin wrote:

  * text[] = record_to_array(record)
  * table(id, key, datatype, value) = record_to_table(record)
  * text = record_get_field(record, text)
  * record = record_set_field(record, text, anyelement)
 
  ??
  I personally like it. But I propose to add as well:
integer := record_nkeys();
text := record_get_field(record, integer);

 You realize you can pretty much do all this with hstore, right?

  hstore hash := hstore(record);
  ary text[] := hstore_to_matrix(hash);
  select * from each(hash); -- TABLE
  text foo := hash -> somekey;
  hash := hash || 'key,value'::hstore;
  record := populate_record(record, hash);

 Yep, but hstore is an additional module. Although, it's not a problem.


 Best,

 David


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
// Dmitriy.


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread David E. Wheeler
On Nov 9, 2010, at 9:18 AM, Dmitriy Igrishin wrote:

 Yep, but hstore is an additional module. Although, its not a problem.

Yeah, but JSON will be in core, and with luck, before long, it will have the 
same (or similar) capabilities.

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP TABLESPACE needs crash-resistance

2010-11-09 Thread Tom Lane
Gurjeet Singh singh.gurj...@gmail.com writes:
 We are facing a problem in dropping a tablespace after crash recovery. The
 recovery starts from the last checkpoint, but the tables that were created
 by
 a transaction in a tablespace before the checkpoint are still lying around;
 the
 transaction had not finished by the time of crash.

 After recovery, when the app tries to drop the tablespace, the command fails
 because the tablespace directory is not empty.

Hmm.  The reason DROP TABLESPACE fails in that case, rather than just
arbitrarily rm -rf'ing the files, is fear of deleting valuable data by
accident.  I suppose we could have a mode that deletes the files without
any manual intervention, but personally I'd regard that as a foot-gun.

 Solving this problem has become quite critical since the the platform where
 Postgres is being used is supposed to run unattended.

I'm not entirely clear as to the use-case for unattended DROP TABLESPACE?
That doesn't really seem like an operation you should need on a routine
basis.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 So, for getting checksums, we have to offer up a few things:
 1) zero-copy writes, we need to buffer the write to get a consistent
 checksum (or lock the buffer tight)
 2) saving hint-bits on an otherwise unchanged page.  We either need to
 just not write that page, and loose the work the hint-bits did, or do
 a full-page WAL of it, so the torn-page checksum is fixed

Actually the consensus the last go-around on this topic was to
segregate the hint bits into a single area of the page and skip them
in the checksum. That way we don't have to do any of the above. It's
just that that's a lot of work.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 You realize you can pretty much do all this with hstore, right?

Yeah.  Anything that involves smashing all the fields to text is not
really an advance over (a) hstore or (b) using plperl or one of the
other weakly-typed PLs.

I think there's a fairly fundamental contradiction involved here.
One of the basic design attributes of plpgsql is that it's strongly
typed.  Sometimes that's a blessing, and sometimes it's not, but
it's a fact.  There really isn't a good way to deal with run-time
field selection while still maintaining strong typing.  I do not
believe that the answer to that problem is "so let's break strong
typing".  Rather, the answer is that if that's what you need, you
need to use a different tool.  There's a reason we support multiple
PLs.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Pavel Stehule
2010/11/9 David E. Wheeler da...@kineticode.com:
 On Nov 9, 2010, at 1:38 AM, Dmitriy Igrishin wrote:

 * text[] = record_to_array(record)
 * table(id, key, datatype, value) = record_to_table(record)
 * text = record_get_field(record, text)
 * record = record_set_field(record, text, anyelement)

 ??
 I personally like it. But I propose to add as well:
   integer := record_nkeys();
   text := record_get_field(record, integer);

 You realize you can pretty much do all this with hstore, right?

hstore has similar functionality, but it misses some details and adds a
lot of other functionality - it doesn't identify the type of a field.
Personally it is not something I like, but it can be better than
nothing.

Pavel



  hstore hash := hstore(record);
  ary    text[] := hstore_to_matrix(hash);
  select * from each(hash); -- TABLE
  text foo := hash -> somekey;
  hash := hash || 'key,value'::hstore;
  record := populate_record(record, hash);

 Best,

 David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread David E. Wheeler
On Nov 9, 2010, at 9:35 AM, Pavel Stehule wrote:

 You realize you can pretty much do all this with hstore, right?
 
 hstore has similar functionality, but missing a some details and add
 lot of other functionality - it doesn't identify type of field.
 Personally - it is nothing what I like - but can be better than
 nothing.

The JSON data type will give you some basic types (text, number, boolean).

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Pavel Stehule
2010/11/9 Tom Lane t...@sss.pgh.pa.us:
 David E. Wheeler da...@kineticode.com writes:
 You realize you can pretty much do all this with hstore, right?

 Yeah.  Anything that involves smashing all the fields to text is not
 really an advance over (a) hstore or (b) using plperl or one of the
 other weakly-typed PLs.

 I think there's a fairly fundamental contradiction involved here.
 One of the basic design attributes of plpgsql is that it's strongly
 typed.  Sometimes that's a blessing, and sometimes it's not, but
 it's a fact.  There really isn't a good way to deal with run-time
 field selection while still maintaining strong typing.  I do not
 believe that the answer to that problem is so let's break strong
 typing.  Rather, the answer is that if that's what you need, you
 need to use a different tool.  There's a reason we support multiple
 PLs.

yes - I know these arguments well. But you have to know that any
combination of PLs increases a project's complexity and the cost of
maintenance and installation. Right now it's relatively safe to say to
somebody - you need plpgsql. But it's more difficult to say the same
about plperl, pltcl, plpython - I like plperl very much, but I would
use it for untrusted operations and not for some very simple and
general task.
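
For comparison, this kind of field iteration is already easy in a plperl
trigger function, because the row arrives as a hash reference (a minimal
sketch, with made-up names):

  CREATE FUNCTION dump_fields_perl() RETURNS trigger AS $$
      my $new = $_TD->{new};              # NEW row as a hash reference
      foreach my $col (sort keys %$new) {
          elog(NOTICE, "column $col = $new->{$col}");
      }
      return;                             # execute the operation normally
  $$ LANGUAGE plperl;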

Pavel



                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread David E. Wheeler
On Nov 9, 2010, at 9:34 AM, Tom Lane wrote:

 I think there's a fairly fundamental contradiction involved here.
 One of the basic design attributes of plpgsql is that it's strongly
 typed.  Sometimes that's a blessing, and sometimes it's not, but
 it's a fact.  There really isn't a good way to deal with run-time
 field selection while still maintaining strong typing.  I do not
 believe that the answer to that problem is so let's break strong
 typing.  Rather, the answer is that if that's what you need, you
 need to use a different tool.  There's a reason we support multiple
 PLs.

Yeah, I think that Pavel wants to get at the record values with their types 
intact. Not sure how that'd work though, really. If you know the type of the 
record columns already, you can just get them. But I'm not sure how you could 
introspect the column names and their types, and then get those types out 
without casting, except perhaps via EXECUTE…

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread David E. Wheeler
On Nov 9, 2010, at 9:35 AM, Pavel Stehule wrote:

 hstore has similar functionality, but missing a some details and add
 lot of other functionality - it doesn't identify type of field.
 Personally - it is nothing what I like - but can be better than
 nothing.

What are you going to do with the type once you have it?

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Pavel Stehule
2010/11/9 David E. Wheeler da...@kineticode.com:
 On Nov 9, 2010, at 9:35 AM, Pavel Stehule wrote:

 You realize you can pretty much do all this with hstore, right?

 hstore has similar functionality, but missing a some details and add
 lot of other functionality - it doesn't identify type of field.
 Personally - it is nothing what I like - but can be better than
 nothing.

 The JSON data type will give you some basic types (text, number, boolean).


yes, but it's the same - a transformation via text, and still it's more
or less a hack - when you want to iterate over a record, you need to
transform the record (a generic and basic type) to JSON and then to text.
It's not nice - and I don't see a difference between using
plperl or JSON.

Pavel

 Best,

 David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Pavel Stehule
2010/11/9 David E. Wheeler da...@kineticode.com:
 On Nov 9, 2010, at 9:34 AM, Tom Lane wrote:

 I think there's a fairly fundamental contradiction involved here.
 One of the basic design attributes of plpgsql is that it's strongly
 typed.  Sometimes that's a blessing, and sometimes it's not, but
 it's a fact.  There really isn't a good way to deal with run-time
 field selection while still maintaining strong typing.  I do not
 believe that the answer to that problem is so let's break strong
 typing.  Rather, the answer is that if that's what you need, you
 need to use a different tool.  There's a reason we support multiple
 PLs.

 Yeah, I think that Pavel wants to get at the record values with their types 
 in tact. Not sure how that'd work though, really. If you know the type of the 
 record columns already, you can just get them. But I'm not sure how you could 
 introspect the column names and their types, and then get those types out 
 without casting, except perhaps via EXECUTE…

every variable in plpgsql has a known type descriptor, so that's not a
problem. The problem is access to the fields in a loop - and then you
need a simple trick like polymorphic parameters.

Pavel


 Best,

 David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Pavel Stehule
2010/11/9 David E. Wheeler da...@kineticode.com:
 On Nov 9, 2010, at 9:35 AM, Pavel Stehule wrote:

 hstore has similar functionality, but missing a some details and add
 lot of other functionality - it doesn't identify type of field.
 Personally - it is nothing what I like - but can be better than
 nothing.

 What are you going to do with the type once you have it?


for example, you can use it for formatting, for an explicit cast, or for
a different serialization format - like JSON - without knowledge of the
type you can't build a correct JSON value. So you can write an application
with knowledge of the type, and you don't need to detect the type from the
value, which isn't robust.

Pavel

 David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby b-tree delete records review

2010-11-09 Thread Heikki Linnakangas

(cleaning up my inbox, and bumped into this..)

On 22.04.2010 12:31, Simon Riggs wrote:

On Thu, 2010-04-22 at 12:18 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

On Thu, 2010-04-22 at 11:56 +0300, Heikki Linnakangas wrote:


If none of the removed heap tuples were present anymore, we currently
return InvalidTransactionId, which kills/waits out all read-only
queries. But if none of the tuples were present anymore, the read-only
queries wouldn't have seen them anyway, so ISTM that we should treat
InvalidTransactionId return value as we don't need to kill anyone.

That's not the point. The tuples were not themselves the sole focus,

Yes, they were. We're replaying a b-tree deletion record, which removes
pointers to some heap tuples, making them unreachable to any read-only
queries. If any of them still need to be visible to read-only queries,
we have a conflict. But if all of the heap tuples are gone already,
removing the index pointers to them can't change the situation for any
query. If any of them should've been visible to a query, the damage was
done already by whoever pruned the heap tuples leaving just the
tombstone LP_DEAD item pointers (in the heap) behind.

You're missing my point. Those tuples are indicators of what may lie
elsewhere in the database, completely unreferenced by this WAL record.
Just because these referenced tuples are gone doesn't imply that all
tuple versions written by the as yet-unknown-xids are also gone. We
can't infer anything about the whole database just from one small group
of records.

Have you got an example of that?


I don't need one, I have suggested the safe route. In order to infer
anything, and thereby further optimise things, we would need proof that
no cases can exist, which I don't have. Perhaps we can add yet, not
sure about that either.


It's good to be safe rather than sorry, but I'd still like to know
because I'm quite surprised by that, and got me worried that I don't
understand how hot standby works as well as I thought I did. I thought
the point of stopping replay/killing queries at a b-tree deletion record
is precisely that it makes some heap tuples invisible to running
read-only queries. If it doesn't make any tuples invisible, why do any
queries need to be killed? And why was it OK for them to be running just
before replaying the b-tree deletion record?


I'm sorry but I'm too busy to talk further on this today. Since we are
discussing a further optimisation rather than a bug, I hope it is OK to
come back to this again later.


Would now be a good time to revisit this? I still don't see why a b-tree 
deletion record should conflict with anything, if all the removed index 
tuples point to just LP_DEAD tombstones in the heap.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] security hooks on object creation

2010-11-09 Thread Itagaki Takahiro
2010/11/9 KaiGai Kohei kai...@ak.jp.nec.com:
 The attached patch provides plugin modules a hook just after object
 creation time. In typical use cases, it enables to assign default
 security labels on object creation by the external security providers.

It looks like the DDL triggers found in other database products.
Do we need to consider both security hooks and DDL triggers now?
Or is it enough to design DDL triggers after the hooks are merged?
Low-level hooks might be better for security providers, because
SQL-level triggers could be uninstalled by superusers.

-- 
Itagaki Takahiro

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Pavel Stehule

 What are you going to do with the type once you have it?


 for example, you can use it for formatting, for explicit cast, for
 different serialization type - like JSON - without knowledge of type,
 you can't to build correct JSON value. So you can write a application
 with knowledge of type and you don't need to detect type from value,
 that isn't robust.


there is another disadvantage of accessing fields via a PL function (or
native function) based on transformation to text: the typmod is lost.

Pavel

 Pavel

 David




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] W3C Specs: Web SQL

2010-11-09 Thread Alvaro Herrera
Excerpts from Sam Mason's message of mar nov 09 08:06:12 -0300 2010:
 On Mon, Nov 08, 2010 at 12:55:22PM -0300, Alvaro Herrera wrote:
  Excerpts from Charles Pritchard's message of sáb nov 06 23:20:13 -0300 2010:
  
   Simple async sql sub-set (the spec in trouble):
   http://dev.w3.org/html5/webdatabase/
  
  This is insane.  This spec allows the server to run arbitrary SQL
  commands on the client, AFAICT.  That seems like infinite joy for
  malicious people running webservers.  The more powerful the dialect of
  SQL the client implements, the more dangerous it is.
 
 How is this different from the server asking the client to run an
 infinite loop in javascript?

So we already failed :-)  It seems that being able to kill processes is
seen as good enough ... well, I guess I just don't visit many
malicious sites.

And this makes me think that SQLite is indeed the right tool for the job
here, and not PostgreSQL.  If someone intrudes, it's going to be in the
same process running the web browser, not in some server running under
another user identity in the machine.  That seems like a feature to me,
not a bug.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl init doc bug

2010-11-09 Thread Tom Lane
Euler Taveira de Oliveira eu...@timbira.com writes:
 I tried -N 11 (postgres option) after looking at the manual but the -o
 option only accept initdb options so I think there is a documentation bug.
 Patch is attached.

Yeah, the init option was squeezed in later with only minimal attention
to documentation.  Agree with your idea of treating its -o option as
taking a different kind of argument.  Will apply.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How can we tell how far behind the standby is?

2010-11-09 Thread Josh Berkus

 It's the former. The function which I'm proposing returns the timestamp
 of the last replayed commit/abort log record. That timestamp is given
 when the commit/abort log record is generated in the master.

That would be *extremely* helpful for all kinds of monitoring tools.
Please complete/submit this.
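
For instance (a sketch only - the final function name is still under
discussion; pg_last_xact_replay_timestamp is one of the candidates), a
monitoring tool could estimate replay lag on the standby with:

  SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;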

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ALTER TYPE recursion to typed tables

2010-11-09 Thread Peter Eisentraut
Here is the patch that adds [RESTRICT|CASCADE] to ALTER TYPE ...
ADD/ALTER/DROP/RENAME ATTRIBUTE, so that it recurses to typed tables.
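
For example, with the patch applied the intended behaviour is roughly the
following (a sketch based on the synopsis below; type and table names are
made up):

  CREATE TYPE person AS (name text);
  CREATE TABLE people OF person;

  ALTER TYPE person ADD ATTRIBUTE age integer CASCADE;  -- column is added to people too
  ALTER TYPE person DROP ATTRIBUTE age;                 -- fails: RESTRICT is the default
  ALTER TYPE person DROP ATTRIBUTE age CASCADE;         -- column is dropped from people as well
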
diff --git a/doc/src/sgml/ref/alter_type.sgml b/doc/src/sgml/ref/alter_type.sgml
index 90de2e8..04395c9 100644
--- a/doc/src/sgml/ref/alter_type.sgml
+++ b/doc/src/sgml/ref/alter_type.sgml
@@ -26,15 +26,15 @@ PostgreSQL documentation
 ALTER TYPE replaceable class=PARAMETERname/replaceable replaceable class=PARAMETERaction/replaceable [, ... ]
 ALTER TYPE replaceable class=PARAMETERname/replaceable OWNER TO replaceable class=PARAMETERnew_owner/replaceable
 ALTER TYPE replaceable class=PARAMETERname/replaceable RENAME ATTRIBUTE replaceable class=PARAMETERattribute_name/replaceable TO replaceable class=PARAMETERnew_attribute_name/replaceable
-ALTER TYPE replaceable class=PARAMETERname/replaceable RENAME TO replaceable class=PARAMETERnew_name/replaceable
+ALTER TYPE replaceable class=PARAMETERname/replaceable RENAME TO replaceable class=PARAMETERnew_name/replaceable [ CASCADE | RESTRICT ]
 ALTER TYPE replaceable class=PARAMETERname/replaceable SET SCHEMA replaceable class=PARAMETERnew_schema/replaceable
 ALTER TYPE replaceable class=PARAMETERname/replaceable ADD replaceable class=PARAMETERnew_enum_value/replaceable [ { BEFORE | AFTER } replaceable class=PARAMETERexisting_enum_value/replaceable ]
 
 phrasewhere replaceable class=PARAMETERaction/replaceable is one of:/phrase
 
-ADD ATTRIBUTE replaceable class=PARAMETERattribute_name/replaceable replaceable class=PARAMETERdata_type/replaceable
-DROP ATTRIBUTE [ IF EXISTS ] replaceable class=PARAMETERattribute_name/replaceable
-ALTER ATTRIBUTE replaceable class=PARAMETERattribute_name/replaceable [ SET DATA ] TYPE replaceable class=PARAMETERdata_type/replaceable
+ADD ATTRIBUTE replaceable class=PARAMETERattribute_name/replaceable replaceable class=PARAMETERdata_type/replaceable [ CASCADE | RESTRICT ]
+DROP ATTRIBUTE [ IF EXISTS ] replaceable class=PARAMETERattribute_name/replaceable [ CASCADE | RESTRICT ]
+ALTER ATTRIBUTE replaceable class=PARAMETERattribute_name/replaceable [ SET DATA ] TYPE replaceable class=PARAMETERdata_type/replaceable [ CASCADE | RESTRICT ]
 /synopsis
  /refsynopsisdiv
 
@@ -116,6 +116,26 @@ ALTER TYPE replaceable class=PARAMETERname/replaceable ADD replaceable cl
  /para
 /listitem
/varlistentry
+
+   varlistentry
+termliteralCASCADE/literal/term
+listitem
+ para
+  Automatically propagate the operation to typed tables of the
+  type being altered.
+ /para
+/listitem
+   /varlistentry
+
+   varlistentry
+termliteralRESTRICT/literal/term
+listitem
+ para
+  Refuse the operation if the type being altered is the type of a
+  typed table.  This is the default.
+ /para
+/listitem
+   /varlistentry
   /variablelist
   /para
 
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 794d355..0d0227d 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -125,11 +125,7 @@ ExecRenameStmt(RenameStmt *stmt)
 		}
 	case OBJECT_COLUMN:
 	case OBJECT_ATTRIBUTE:
-		renameatt(relid,
-  stmt-subname,		/* old att name */
-  stmt-newname,		/* new att name */
-  interpretInhOption(stmt-relation-inhOpt),	/* recursive? */
-  0);	/* expected inhcount */
+		renameatt(relid, stmt);
 		break;
 	case OBJECT_TRIGGER:
 		renametrig(relid,
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 6ec8a85..2b35943 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -269,8 +269,11 @@ static void ATSimpleRecursion(List **wqueue, Relation rel,
   AlterTableCmd *cmd, bool recurse, LOCKMODE lockmode);
 static void ATOneLevelRecursion(List **wqueue, Relation rel,
 	AlterTableCmd *cmd, LOCKMODE lockmode);
-static void find_typed_table_dependencies(Oid typeOid, const char *typeName);
-static void ATPrepAddColumn(List **wqueue, Relation rel, bool recurse,
+static void ATTypedTableRecursion(List **wqueue, Relation rel, AlterTableCmd *cmd,
+  LOCKMODE lockmode);
+static List *find_typed_table_dependencies(Oid typeOid, const char *typeName,
+		   DropBehavior behavior);
+static void ATPrepAddColumn(List **wqueue, Relation rel, bool recurse, bool recursing,
 AlterTableCmd *cmd, LOCKMODE lockmode);
 static void ATExecAddColumn(AlteredTableInfo *tab, Relation rel,
 ColumnDef *colDef, bool isOid, LOCKMODE lockmode);
@@ -290,7 +293,8 @@ static void ATExecSetOptions(Relation rel, const char *colName,
  Node *options, bool isReset, LOCKMODE lockmode);
 static void ATExecSetStorage(Relation rel, const char *colName,
  Node *newValue, LOCKMODE lockmode);
-static void ATPrepDropColumn(Relation rel, bool recurse, AlterTableCmd *cmd);
+static void ATPrepDropColumn(List **wqueue, Relation rel, bool recurse, bool recursing,
+			 AlterTableCmd *cmd, LOCKMODE 

Re: [HACKERS] timestamp of the last replayed transaction

2010-11-09 Thread Josh Berkus
On 11/9/10 5:44 AM, Fujii Masao wrote:
 But, pg_last_xact_replay_timestamp is more intuitive for many people?
 If so, let's change
 the name.

*None* of these names are intuitive.  So let's just go for consistency.

If you want an intuitive name, it would be:

pg_replication_log_timestamp()

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark gsst...@mit.edu wrote:
 On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 So, for getting checksums, we have to offer up a few things:
 1) zero-copy writes, we need to buffer the write to get a consistent
 checksum (or lock the buffer tight)
 2) saving hint-bits on an otherwise unchanged page.  We either need to
 just not write that page, and loose the work the hint-bits did, or do
 a full-page WAL of it, so the torn-page checksum is fixed

 Actually the consensus the last go-around on this topic was to
 segregate the hint bits into a single area of the page and skip them
 in the checksum. That way we don't have to do any of the above. It's
 just that that's a lot of work.

And it still allows silent data corruption, because bogusly clearing a
hint bit is, at the moment, harmless, but bogusly setting one is not.
I really have to wonder how other products handle this.  PostgreSQL
isn't the only database product that uses MVCC - not by a long shot -
and the problem of detecting whether an XID is visible to the current
snapshot can't be ours alone.  So what do other people do about this?
They either don't cache the information about whether the XID is
committed in-page (in which case, are they just slower or do they have
some other means of avoiding the performance hit?) or they cache it in
the page (in which case, they either WAL log it or they don't checksum
it).  I mean, there aren't any other options, are there?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Kenneth Marshall
On Tue, Nov 09, 2010 at 02:05:57PM -0500, Robert Haas wrote:
 On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark gsst...@mit.edu wrote:
  On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk ai...@highrise.ca wrote:
  So, for getting checksums, we have to offer up a few things:
  1) zero-copy writes, we need to buffer the write to get a consistent
  checksum (or lock the buffer tight)
  2) saving hint-bits on an otherwise unchanged page. ?We either need to
  just not write that page, and loose the work the hint-bits did, or do
  a full-page WAL of it, so the torn-page checksum is fixed
 
  Actually the consensus the last go-around on this topic was to
  segregate the hint bits into a single area of the page and skip them
  in the checksum. That way we don't have to do any of the above. It's
  just that that's a lot of work.
 
 And it still allows silent data corruption, because bogusly clearing a
 hint bit is, at the moment, harmless, but bogusly setting one is not.
 I really have to wonder how other products handle this.  PostgreSQL
 isn't the only database product that uses MVCC - not by a long shot -
 and the problem of detecting whether an XID is visible to the current
 snapshot can't be ours alone.  So what do other people do about this?
 They either don't cache the information about whether the XID is
 committed in-page (in which case, are they just slower or do they have
 some other means of avoiding the performance hit?) or they cache it in
 the page (in which case, they either WAL log it or they don't checksum
 it).  I mean, there aren't any other options, are there?
 
 -- 
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company
 

That would imply that we need to have a CRC for just the hint bit
section or some type of ECC calculation that can detect bad hint
bits independent of the CRC for the rest of the page.

Regards,
Ken

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Alvaro Herrera
Excerpts from Robert Haas's message of mar nov 09 16:05:57 -0300 2010:

 And it still allows silent data corruption, because bogusly clearing a
 hint bit is, at the moment, harmless, but bogusly setting one is not.
 I really have to wonder how other products handle this.  PostgreSQL
 isn't the only database product that uses MVCC - not by a long shot -
 and the problem of detecting whether an XID is visible to the current
 snapshot can't be ours alone.  So what do other people do about this?
 They either don't cache the information about whether the XID is
 committed in-page (in which case, are they just slower or do they have
 some other means of avoiding the performance hit?) or they cache it in
 the page (in which case, they either WAL log it or they don't checksum
 it).  I mean, there aren't any other options, are there?

Maybe allocate enough shared memory for pg_clog buffers back to the
freeze horizon, and just don't use hint bits?  Maybe some intermediate
solution, i.e. allocate a large bunch of pg_clog buffers, and do
WAL-logged setting of hint bits only for tuples that go further back.

I remember someone had a patch to set all the bits in a page that passed
a threshold of some kind.  Ah, no, that was for freezing tuples.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO Alter Table Rename Constraint

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 10:50 AM, Viktor Valy vili0...@gmail.com wrote:
 Hi Everybody!
 I looked up this todo, and figured out a plan, how the implementation could
 be written.
 The main challenge is to keep constraints and indexes (for unique and PK
 constraints) consistent.
 Fortunately this is already realized in the case of an index. So at ALTER
 INDEX RENAME the consistency is given by an extra check in tablecmds.c (Line
 2246), where a function is finally called, RenameConstraintById
 (pg_constraint.c Line 604).
 My idea is, to do that analog for ALTER CONSTRAINT RENAME too. So renaming
 the constraint is going to be done in tablecmds.c as for indexes, tables,
 sequences, views. And after checking whether the renametype is constraint,
 an extra rename has to be done for the index. Getting the index can be done
 with the function get_constraint_index (pg_depend.c Line  475). Now it
 should be possible to do the same as in RenameConstraintById.
 Is that so legal? Is anything else to be considered?

I think the biggest problem is handling inherited tables properly,
especially in complex inheritance hierarchies where there are
multiple, separate paths from the top of the hierarchy to the bottom.

See here for a couple of relevant test cases:

http://archives.postgresql.org/pgsql-hackers/2010-07/msg01570.php

I believe that the rename needs to fail if any table in the
inheritance hierarchy rooted at the target table also inherits the
constraint from someplace outside that hierarchy; or if any table in
that hierarchy has a local copy of the constraint that got merged with
the inherited one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Josh Berkus

 PostgreSQL
 isn't the only database product that uses MVCC - not by a long shot -
 and the problem of detecting whether an XID is visible to the current
 snapshot can't be ours alone.  So what do other people do about this?
 They either don't cache the information about whether the XID is
 committed in-page (in which case, are they just slower or do they have
 some other means of avoiding the performance hit?) or they cache it in
 the page (in which case, they either WAL log it or they don't checksum
 it).

Well, most of the other MVCC-in-table DBMSes simply don't deal with
large, on-disk databases.  In fact, I can't think of one which does,
currently; while MVCC has been popular for the New Databases, they're
all focused on in-memory databases.  Oracle and InnoDB use rollback
segments.

Might be worth asking the BDB folks.

Personally, I think we're headed inevitably towards having a set of
metadata bitmaps for each table, like we do currently for the FSM.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Merlin Moncure
On Tue, Nov 9, 2010 at 12:34 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 David E. Wheeler da...@kineticode.com writes:
 You realize you can pretty much do all this with hstore, right?

 Yeah.  Anything that involves smashing all the fields to text is not
 really an advance over (a) hstore or (b) using plperl or one of the
 other weakly-typed PLs.

 I think there's a fairly fundamental contradiction involved here.
 One of the basic design attributes of plpgsql is that it's strongly
 typed.  Sometimes that's a blessing, and sometimes it's not, but
 it's a fact.  There really isn't a good way to deal with run-time
 field selection while still maintaining strong typing.  I do not
 believe that the answer to that problem is so let's break strong
 typing.  Rather, the answer is that if that's what you need, you
 need to use a different tool.  There's a reason we support multiple
 PLs.


In cases where both the field layout and the field of interest are
known at plan time this not violating the static principles of
plpgsql.  Suppose we decided to access field by name via
recordvar{name} or recordvar{field pos}:

DECLARE
  r record;
  f foo;
  t text default 'id';
BEGIN
  some code
  r{'id'} = 5;  -- no good, r is dynamic record
  f{t} = 5; -- no good, t is not immutable
  f{'id'} = 5; -- ok;

Iterating over fields of type foo is not interesting because fields
are already known to whoever is writing the function, and flatten to
text cases are already covered.  IOW, the above syntax is not really
useful because you can just do:
   f.id = 5;

The only exception I see is in trigger functions.  If the trigger
function plan is specific to the firing trigger, new and old are
defined at plan time, so something like:

  new{TG_FIELDNAMES[1]} = 5; -- is ok (at least IMO), since
TG_FIELDNAMES is immutable (at least to the plan).

I don't honestly know if it's worth it -- the main case is performance
(plus Pavel's secondary argument of loss of type information).
Something like this would address an awful lot of gripes about trigger
functions though.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE CONSTRAINT TRIGGER

2010-11-09 Thread Alvaro Herrera
Excerpts from Richard Broersma's message of vie nov 05 18:54:54 -0300 2010:
 On Fri, Nov 5, 2010 at 2:37 PM, Alvaro Herrera alvhe...@alvh.no-ip.org 
 wrote:
 
  Recent developments have turned it back into non-deprecated mode; it's
  not going anywhere, and it needs to be fully documented.
 
 From what I recall, there isn't anything in the trigger documentation
 or CREATE CONSTRAINT TRIGGER documentation that says the trigger
 function must explicitly raise an exception to create the notification
 that the custom constraint was violated.
 
 
 Would this be a good place for it?

I added a sentence about this, and pushed it.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: plpgsql - iteration over fields of rec or row variable

2010-11-09 Thread Alvaro Herrera
Excerpts from Merlin Moncure's message of mar nov 09 16:41:32 -0300 2010:

 The only exception I see is in trigger functions.  If the trigger
 function plan is specific to the firing trigger, new and old are
 defined at plan time, so something like:
 
   new{TG_FIELDNAMES[1]} = 5; -- is ok (at least IMO), since
 TG_FIELDNAMES is immutable (at least to the plan).
 
 I don't honestly know if it's worth it -- the main case is performance
 (plus Pavel's secondary argument of loss of type information).
 Something like this would address an awful lot of gripes about trigger
 functions though.

I think the interesting bit (wrt the examples I've seen, that is) would
be to be able to use the TG_ARGS array as the element specifier.  Not
sure if this is any different from your example.  It's been some time
since I've been near this though, so maybe what I wanted is now possible
with USING tricks.
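
For illustration only - none of this syntax exists, it is just the proposal
above spelled out, using the existing TG_ARGV array - a generic
stamp-a-column trigger might then look like:

  CREATE FUNCTION set_stamp() RETURNS trigger AS $$
  BEGIN
      -- hypothetical {} syntax: the column to set is named by a trigger argument
      new{TG_ARGV[0]} := now();
      RETURN new;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER stamp BEFORE INSERT OR UPDATE ON some_table
      FOR EACH ROW EXECUTE PROCEDURE set_stamp('updated_at');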

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 7:37 PM, Josh Berkus j...@agliodbs.com wrote:
 Well, most of the other MVCC-in-table DBMSes simply don't deal with
 large, on-disk databases.  In fact, I can't think of one which does,
 currently; while MVCC has been popular for the New Databases, they're
 all focused on in-memory databases.  Oracle and InnoDB use rollback
 segments.

Well rollback segments are still MVCC. However Oracle's MVCC is
block-based. So they only have to do the visibility check once per
block, not once per row. Once they find the right block version they
can process all the rows on it.

Also Oracle's snapshots are just the log position. Instead of having
to check whether every transaction committed or not, they just find
the block version which was last modified before the log position for
when their transaction started.

 Might be worth asking the BDB folks.

 Personally, I think we're headed inevitably towards having a set of
 metadata bitmaps for each table, like we do currently for the FSM.

Well we already have a metadata bitmap for transaction visibility.
It's called the clog. There's no point in having one structured
differently around the table.

The whole point of the hint bits is that it's in the same place as the data.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Josh Berkus

 The whole point of the hint bits is that it's in the same place as the data.

Yes, but the hint bits are currently causing us trouble on several
features or potential features:

* page-level CRC checks
* eliminating vacuum freeze for cold data
* index-only access
* replication
* this patch
* etc.

At a certain point, it's worth the trouble to handle them differently
because of the other features that enables or makes much easier.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] multi-platform, multi-locale regression tests

2010-11-09 Thread Peter Eisentraut
I'm looking for some ideas on how to deal with the regression tests for
the per-column collation feature.  These are the issues:

* The feature only works on some platforms (tentatively: Linux,
Windows).

- Possible solution: like xml test

* The locale names are platform dependent, so there would need to be
different test files per locale.

* The test files need to use some non-ASCII characters.  So far, I have
encoded the test file in UTF-8 and run the tests with make check
MULTIBYTE=UTF8.

* Also, the allowed collations depend on the server encoding, so any
solution for the previous point that results in the server encoding of
the test database being variable will make the setup of the regression
test SQL file more interesting.

* Of course the actual sort orders could also be different on different
platforms, but that problem can likely be contained.

One possible way out is not to include these tests in the main test set
and instead require manual invocation.

Better ideas?



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Greg Stark
On Tue, Nov 9, 2010 at 8:12 PM, Josh Berkus j...@agliodbs.com wrote:
 The whole point of the hint bits is that it's in the same place as the data.

 Yes, but the hint bits are currently causing us trouble on several
 features or potential features:

Then we might have to get rid of hint bits. But they're hint bits for
a metadata file that already exists, creating another metadata file
doesn't solve anything.

Though incidentally all of the other items you mentioned are generic
problems caused by MVCC, not hint bits.


-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] W3C Specs: Web SQL

2010-11-09 Thread Charles Pritchard

On 11/8/2010 4:47 PM, Alvaro Herrera wrote:

Excerpts from Charles Pritchard's message of lun nov 08 20:25:21 -0300 2010:

On 11/8/2010 3:03 PM, Alvaro Herrera wrote:

Excerpts from Kevin Grittner's message of lun nov 08 19:30:54 -0300 2010:

David Fetterda...@fetter.org   wrote:

That's not proof against a DoS

What client API is?

This spec gives free rein into every web user's system to webmasters.
If this isn't terminally dangerous, I don't know what is.

DoS is more-or-less the responsibility of the host to send up alerts like:
This page is hanging, do you want to continue... or otherwise
automatically close hanging queries.

I classify that kind of approach to security as terminally dangerous, yes.


I don't believe the webmaster is granted free rein:
Disk quotas are enforced, data is separated per origin,
hanging processes are up to the implementer, and postgres has plenty of
settings for that.

The day a privilege escalation is found and some webserver runs
pg_read_file() on your browser, will be a sad one indeed.


The default disk quota per origin is generally 5megs; beyond that,
additional user interaction is requested.

So 5 megs to a.example.com, 5 megs to b.example.com, and so on?  Sounds,
eh, great.



I don't think it's fair to assume a privilege escalation will be found:
using that argument, no software should ever run on a client/server.

That said, NaCl and PNaCl are under active development and I've no doubt
that Postgres could be compiled by the tool set in the future.

http://code.google.com/p/nativeclient/

Still, that's a diversion from the topic: Postgres can run on workstations,
with an audience of browser-oriented implementations.

Postgres is more stable than Sqlite for enterprise-level activity;
hardened/enterprise browser distributions would choose Postgres over
Sqlite for Web SQL implementations.


And as for the quota issues: that's really up to the browser vendor.
It's completely out of spec here.
And it's how the web currently works for hundreds of millions of users:
it's not introducing a security issue, as it reflects the current state of security.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] W3C Specs: Web SQL

2010-11-09 Thread Jochem van Dieten
On Tue, Nov 9, 2010 at 9:14 PM, Charles Pritchard wrote:
 Postgres is more stable than Sqlite for enterprise-level activity,
 hardened/enterprise
 browser distributions would choose Postgres over Sqlite for Web SQL
 implementations.

I find that very unlikely. Web SQL is meant to be an upgrade from cookies as
the client storage mechanism; it is not meant to store a few TB of
GIS data. Implementors will choose based on much more practical
concerns. Portability (SQLite is not just available for Android
and iOS, it is included), filesystem layout (put the databases in one
folder for each domain just like Flash lays out its offline storage),
embeddability (5 processes just to start a DB), recovery speed (when a
mobile browser gets pushed from RAM and later paged back in, it has to
replay a 16 MB WAL file) and even just convenience (how many browsers
already use SQLite for bookmark storage?) will weigh far heavier than
some perceived enterprise readiness.

Jochem

--
Jochem van Dieten
http://jochem.vandieten.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Aidan Van Dyk
On Tue, Nov 9, 2010 at 3:25 PM, Greg Stark gsst...@mit.edu wrote:

 Then we might have to get rid of hint bits. But they're hint bits for
 a metadata file that already exists, creating another metadata file
 doesn't solve anything.

Is there any way to instrument the writes of dirty buffers from
shared memory, and see how many of the pages normally being written are
not backed by WAL (hint-only updates)?  Just dropping those buffers
without writing them would allow at least *checksums* to go through without
losing all the benefits of the hint bits.

I've got a hunch (with no proof) that the penalty of not writing them
will be borne largely by small database installs.  Large OLTP databases
probably won't have pages without a WAL'ed change and hint bits set,
and large data warehouse ones will probably vacuum freeze big tables
on load to avoid the huge write penalty the first time they scan the
tables...

/waving hands

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Josh Berkus

 Though incidentally all of the other items you mentioned are generic
 problems caused by with MVCC, not hint bits.

Yes, but the hint bits prevent us from implementing workarounds.


-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 2:05 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 9, 2010 at 12:31 PM, Greg Stark gsst...@mit.edu wrote:
 On Tue, Nov 9, 2010 at 5:06 PM, Aidan Van Dyk ai...@highrise.ca wrote:
 So, for getting checksums, we have to offer up a few things:
 1) zero-copy writes, we need to buffer the write to get a consistent
 checksum (or lock the buffer tight)
 2) saving hint-bits on an otherwise unchanged page.  We either need to
 just not write that page, and loose the work the hint-bits did, or do
 a full-page WAL of it, so the torn-page checksum is fixed

 Actually the consensus the last go-around on this topic was to
 segregate the hint bits into a single area of the page and skip them
 in the checksum. That way we don't have to do any of the above. It's
 just that that's a lot of work.

 And it still allows silent data corruption, because bogusly clearing a
 hint bit is, at the moment, harmless, but bogusly setting one is not.
 I really have to wonder how other products handle this.  PostgreSQL
 isn't the only database product that uses MVCC - not by a long shot -
 and the problem of detecting whether an XID is visible to the current
 snapshot can't be ours alone.  So what do other people do about this?
 They either don't cache the information about whether the XID is
 committed in-page (in which case, are they just slower or do they have
 some other means of avoiding the performance hit?) or they cache it in
 the page (in which case, they either WAL log it or they don't checksum
 it).  I mean, there aren't any other options, are there?

An examination of the MySQL source code reveals their answer.  In
row_vers_build_for_semi_consistent_read(), which I can't swear is the
right place but seems to be, there is this comment:

/* We assume that a rolled-back transaction stays in
TRX_ACTIVE state until all the changes have been
rolled back and the transaction is removed from
the global list of transactions. */

Which makes sense.  If you never leave rows from aborted transactions
in the heap forever, then the list of aborted transactions that you
need to remember for MVCC purposes will remain relatively small and
you can just include those XIDs in your MVCC snapshot.  Our problem is
that we have no particular bound on the number of aborted transactions
whose XIDs may still be floating around, so we can't do it that way.

dons asbestos underpants

To impose a similar bound in PostgreSQL, you'd need to maintain the
set of aborted XIDs and the relations that need to be vacuumed for
each one.  As you vacuum, you prune any tuples with aborted xmins
(which is WAL-logged already anyway) and additionally WAL-log clearing
the xmax for each tuple with an aborted xmax.  Thus, when you
finishing vacuuming the relation, the aborted XID is no longer present
anywhere in it.  When you vacuum the last relation for a particular
XID, that XID no longer exists in the relation files anywhere and you
can remove it from the list of aborted XIDs.  I think that WAL logging
the list of XIDs and list of unvacuumed relations for each at each
checkpoint would be sufficient for crash safety.  If you did this, you
could then assume that any XID which precedes your snapshot's xmin is
committed.

1. When a big abort happens, you may have to carry that XID around in
every snapshot - and avoid advancing RecentGlobalXmin - for quite a
long time.
2. You have to WAL log marking the XMAX of an aborted transaction invalid.
3. You have to WAL log the not-yet-cleaned-up XIDs and the relations
each one needs vacuumed at each checkpoint.
4. There would presumably be some finite limit on the size of the
shared memory structure for aborted transactions.  I don't think
there'd be any reason to make it particularly small, but if you sat
there and aborted transactions at top speed you might eventually run
out of room, at which point any transactions you started wouldn't be
able to abort until vacuum made enough progress to free up an entry.
5. It would be pretty much impossible to run with autovacuum turned
off, and in fact you would likely need to make it a good deal more
aggressive in the specific case of aborted transactions, to mitigate
problems #1, #3, and #4.

I'm not sure how bad those things would be, or if there are more that
I'm missing (besides the obvious it would be a lot of work).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multi-platform, multi-locale regression tests

2010-11-09 Thread David E. Wheeler
On Nov 9, 2010, at 12:18 PM, Peter Eisentraut wrote:

 One possible way out is not to include these tests in the main test set
 and instead require manual invocation.
 
 Better ideas?

I've been talking to Nasby and Dunstan about adding a Test::More/pgTAP-based 
test suite to the core. It wouldn't run with the usual core suite used by 
developers, which would continue to use pg_regress. But they could run it if 
they wanted (and had the prerequisites), and the build farm animals would run 
them regularly.

The nice thing about using a TAP-based framework is that you can skip tests 
that don't meet platform requirements, and compare values within the tests, 
right where you write them, rather than in a separate file. You can also 
dynamically change how you compare things depending on the environment, such as 
the locales that vary on different platforms.

Thoughts?

Best,

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Josh Berkus
On 11/9/10 1:50 PM, Robert Haas wrote:
 5. It would be pretty much impossible to run with autovacuum turned
 off, and in fact you would likely need to make it a good deal more
 aggressive in the specific case of aborted transactions, to mitigate
 problems #1, #3, and #4.

6. This would require us to be more aggressive about VACUUMing old, cold
relations/pages, e.g. VACUUM FREEZE.  This would make one of our worst
issues for data warehousing even worse.

What about having this map (and other hintbits) be per-relation?  Hmmm.
 That wouldn't work for DDL, I suppose ...

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Kevin Grittner
Josh Berkus j...@agliodbs.com wrote:
 
 6. This would require us to be more aggressive about VACUUMing
 old-cold relations/page, e.g. VACUUM FREEZE.  This it would make
 one of our worst issues for data warehousing even worse.
 
I continue to feel that it is insane that when a table is populated
within the same database transaction which created it (e.g., a bulk
load of a table or partition), that we don't write the tuples with
hint bits set for commit and xmin frozen.  By the time any but the
creating transaction can see the tuples, *if* any other transaction
is ever able to see the tuples, these will be the correct values;
we really should be able to deal with it within the creating
transaction somehow.
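
A sketch of the pattern meant here (names made up): the table only becomes
visible to other transactions once the loading transaction commits, so the
tuples could in principle be written pre-hinted and frozen.

  BEGIN;
  CREATE TABLE sales_2010_11 (LIKE sales INCLUDING DEFAULTS);
  COPY sales_2010_11 FROM '/data/sales_2010_11.csv' CSV;
  COMMIT;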
 
If we ever handle that, would #6 be a moot point, or do you think
it's still a significant issue?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Build farm server database migration complete

2010-11-09 Thread Andrew Dunstan


It took a little longer than expected, due to a slightly clagged network 
between the old and new servers, but the database migration is complete 
and the server is back up and running.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 5:03 PM, Josh Berkus j...@agliodbs.com wrote:
 On 11/9/10 1:50 PM, Robert Haas wrote:
 5. It would be pretty much impossible to run with autovacuum turned
 off, and in fact you would likely need to make it a good deal more
 aggressive in the specific case of aborted transactions, to mitigate
 problems #1, #3, and #4.

 6. This would require us to be more aggressive about VACUUMing old-cold
 relations/page, e.g. VACUUM FREEZE.  This it would make one of our worst
 issues for data warehousing even worse.

Uh, no it doesn't.  It only requires you to be more aggressive about
vacuuming the transactions that are in the aborted-XIDs array.  It
doesn't affect transaction wraparound vacuuming at all, either
positively or negatively.  You still have to freeze xmins before they
flip from being in the past to being in the future, but that's it.

 What about having this map (and other hintbits) be per-relation?  Hmmm.
  That wouldn't work for DDL, I suppose ...

This map?  I suppose you could track aborted XIDs per relation
instead of globally, but I don't see why that would affect DDL any
differently than anything else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 5:15 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Josh Berkus j...@agliodbs.com wrote:

 6. This would require us to be more aggressive about VACUUMing
 old-cold relations/page, e.g. VACUUM FREEZE.  This it would make
 one of our worst issues for data warehousing even worse.

 I continue to feel that it is insane that when a table is populated
 within the same database transaction which created it (e.g., a bulk
 load of a table or partition), that we don't write the tuples with
 hint bits set for commit and xmin frozen.  By the time any but the
 creating transaction can see the tuples, *if* any other transaction
 is ever able to see the tuples, these will be the correct values;
 we really should be able to deal with it within the creating
 transaction somehow.

I agree.

 If we ever handle that, would #6 be a moot point, or do you think
 it's still a significant issue?

I think it's a moot point anyway, per previous email.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 3:05 PM, Greg Stark gsst...@mit.edu wrote:
 On Tue, Nov 9, 2010 at 7:37 PM, Josh Berkus j...@agliodbs.com wrote:
 Well, most of the other MVCC-in-table DBMSes simply don't deal with
 large, on-disk databases.  In fact, I can't think of one which does,
 currently; while MVCC has been popular for the New Databases, they're
 all focused on in-memory databases.  Oracle and InnoDB use rollback
 segments.

 Well rollback segments are still MVCC. However Oracle's MVCC is
 block-based. So they only have to do the visibility check once per
 block, not once per row. Once they find the right block version they
 can process all the rows on it.

 Also Oracle's snapshots are just the log position. Instead of having
 to check whether every transaction committed or not, they just find
 the block version which was last modified before the log position for
 when their transaction started.

That is cool.  One problem is that it might sometimes result in
additional I/O.  A transaction begins and writes a tuple.  We must
write a preimage of the page (or at least, sufficient information to
reconstruct a preimage of the page) to the undo segment.  If the
transaction commits relatively quickly, and all transactions which
took their snapshots before the commit end either by committing or by
aborting, we can discard that information from the undo segment
without ever writing it to disk.  However, if that doesn't happen, the
undo log page may get evicted, and we're now doing three writes (WAL,
page, undo) rather than just two (WAL, page).  That's no worse than an
update where the old and new tuples land on different pages, but it IS
worse than an update where the old and new tuples are on the same
page, or at least I think it is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multi-platform, multi-locale regression tests

2010-11-09 Thread Cédric Villemain
2010/11/9 David E. Wheeler da...@kineticode.com:
 On Nov 9, 2010, at 12:18 PM, Peter Eisentraut wrote:

 One possible way out is not to include these tests in the main test set
 and instead require manual invocation.

 Better ideas?

 I've been talking to Nasby and Dunstan about adding a Test::More/pgTAP-based 
 test suite to the core. It wouldn't run with the usual core suite used by 
 developers, which would continue to use pg_regress. But they could run it if 
 they wanted (and had the prerequisites), and the build farm animals would run 
 them regularly.

 The nice thing about using a TAP-based framework is that you can skip tests 
 that don't meet platform requirements, and compare values within the tests, 
 right where you write them, rather than in a separate file. You can also 
 dynamically change how you compare things depending on the environment, such 
 as the locales that vary on different platforms.

 Thoughts?

Are you thinking of a contrib module 'pgtap' that we can use to
accomplish the optional regression tests?

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multi-platform, multi-locale regression tests

2010-11-09 Thread David E. Wheeler
On Nov 9, 2010, at 2:42 PM, Cédric Villemain wrote:

 Are you thinking of a contrib module 'pgtap' that we can use to
 accomplish the optional regression tests?

Oh, if the project wants it in contrib, sure. Otherwise I'd probably just have 
the test stuff include it somehow.

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Josh Berkus
Robert,

 Uh, no it doesn't.  It only requires you to be more aggressive about
 vacuuming the transactions that are in the aborted-XIDs array.  It
 doesn't affect transaction wraparound vacuuming at all, either
 positively or negatively.  You still have to freeze xmins before they
 flip from being in the past to being in the future, but that's it.

Sorry, I was trying to say that it's similar to the freeze issue, not
that it affects freeze.  Sorry for the lack of clarity.

What I was getting at is that this could cause us to vacuum
relations/pages which would otherwise never be vacuumed (or at least,
not until freeze).  Imagine a very large DW table which is normally
insert-only and seldom queried, but once a month or so the insert aborts
and rolls back.

I'm not saying that your proposal isn't worth testing.  I'm just saying
that it may prove to be a net loss to overall system efficiency.

 If we ever handle that, would #6 be a moot point, or do you think
  it's still a significant issue?

Kevin, the case which your solution doesn't fix is the common one of
log tables which keep adding records continuously, with < 5% inserts
or updates.  That may seem like a corner case but such a table,
partitioned or unpartitioned, exists in around 1/3 of the commercial
applications I've worked on, so it's a common pattern.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-11-09 Thread Robert Haas
On Mon, Nov 8, 2010 at 9:52 PM, Itagaki Takahiro
itagaki.takah...@gmail.com wrote:
 Are there any activities in JSON data types for the next commitfest?

I'm leaning toward the view that we shouldn't commit a JSON
implementation to core (or contrib) for 9.1.  We have at least three
competing proposals on the table.  I thought of picking it up and
hacking on it myself, but then we'd have four competing proposals on
the table.  Even if we could come to some consensus on which of those
proposals is technically superior, the rate at which new ideas are
being proposed suggests to me that it would be premature to anoint any
single implementation as our canonical one.  I'd like to see some of
these patches finished and put up on pgfoundry or github, and then
consider moving one of them into core when we have a clear and stable
consensus that one of them is technically awesome and the best thing
we're going to get.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multi-platform, multi-locale regression tests

2010-11-09 Thread Cédric Villemain
2010/11/9 David E. Wheeler da...@kineticode.com:
 On Nov 9, 2010, at 2:42 PM, Cédric Villemain wrote:

 Are you thinking of a contrib module 'pgtap' that we can use to
 accomplish the optional regression tests?

 Oh, if the project wants it in contrib, sure. Otherwise I'd probably just 
 have the test stuff include it somehow.

Adding a unit test layer shipped with postgresql sounds good to me.
And pgTAP can claim to be platform agnostic.
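
For example, a locale-sensitive check could look roughly like this (just a
sketch; the pgTAP function names and the exact skip() signature are from
memory):

  SELECT plan(2);

  -- a check that should hold on every platform
  SELECT ok( 2 + 2 = 4, 'arithmetic works the same everywhere' );

  -- only run the collation-dependent check where the expected locale is in use
  SELECT CASE
           WHEN current_setting('lc_collate') LIKE 'en_US%'
           THEN ok( 'a' < 'B', 'en_US collation sorts case-insensitively' )
           ELSE skip( 'collation-specific check, lc_collate is not en_US', 1 )
         END;

  SELECT * FROM finish();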

-- 
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch: Add JSON datatype to PostgreSQL (GSoC, WIP)

2010-11-09 Thread Terry Laurenzo
Robert,
I think I agree.  At a minimum, I would like to see whichever of the
competing proposals is chosen live on as an outside module for use by
previous versions.  Even having proposed one, and soon to be two, of the
competing implementations, I'm nervous about committing to one at this
juncture.

I'm wrapping some items up this week but expect to have some time over the
next two weeks to complete my implementation.

Here's a quick status on where I'm at:
   - Binary format has been implemented as specified here:
https://github.com/tlaurenzo/pgjson/blob/master/pgjson/jsonlib/BINARY-README.txt
   - Hand coded a JSON-text lexer/parser and JSON-binary parser and
transcoders
   - Ran out of time to do exhaustive tests, but typical payloads yield
significant space savings
   - Based on an admittedly small number of test cases, I identified that
the process of encoding a string literal is the most expensive operation in
the general case, accounting for 1/2 to 2/3 of the time spent in a
transcoding operation.  This is fairly obvious but good to know.  I did not
spend any time looking into this further.
   - Drastically simplified the code in preparation to build a stand-alone
module

As soon as I get a bit of time I was going to do the following:
   - Create a simple PGXS based build, stripping out the rest of the bits I
was doodling on
   - Re-implement the PG module based on the new jsonlib binary format and
parser
   - Add JSONPath and some encoding bits back in from the original patch
   - Do some holistic profiling between the JSON-as-Text approach and the
JSON-as-Binary approach

This is still a bit of a fishing expedition, IMO, and I would have a hard
time getting this ready for commit on Monday.  If getting something in right
now is critical, Joey's original patch is the most complete at this point.

Terry


On Tue, Nov 9, 2010 at 3:48 PM, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Nov 8, 2010 at 9:52 PM, Itagaki Takahiro
 itagaki.takah...@gmail.com wrote:
  Are there any activities in JSON data types for the next commitfest?

 I'm leaning toward the view that we shouldn't commit a JSON
 implementation to core (or contrib) for 9.1.  We have at least three
 competing proposals on the table.  I thought of picking it up and
 hacking on it myself, but then we'd have four competing proposals on
 the table.  Even if we could come to some consensus on which of those
 proposals is technically superior, the rate at which new ideas are
 being proposed suggests to me that it would be premature to anoint any
 single implementation as our canonical one.  I'd like to see some of
 these patches finished and put up on pgfoundry or github, and then
 consider moving one of them into core when we have a clear and stable
 consensus that one of them is technically awesome and the best thing
 we're going to get.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 Though incidentally all of the other items you mentioned are generic
 problems caused by MVCC, not hint bits.

 Yes, but the hint bits prevent us from implementing workarounds.

If we got rid of hint bits, we'd need workarounds for the ensuing
massive performance loss.  There is no reason whatsoever to imagine
that we'd come out ahead in the end.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 dons asbestos underpants
 4. There would presumably be some finite limit on the size of the
 shared memory structure for aborted transactions.  I don't think
 there'd be any reason to make it particularly small, but if you sat
 there and aborted transactions at top speed you might eventually run
 out of room, at which point any transactions you started wouldn't be
 able to abort until vacuum made enough progress to free up an entry.

Um, that bit is a *complete* nonstarter.  The possibility of a failed
transaction always has to be allowed.  What if vacuum itself gets an
error for example?  Or, what if the system crashes?

I thought for a bit about inverting the idea, such that there were a
limit on the number of unvacuumed *successful* transactions rather than
the number of failed ones.  But that seems just as unforgiving: what if
you really need to commit a transaction to effect some system state
change?  An example might be dropping some enormous table that you no
longer need, but vacuum is going to insist on plowing through before
it'll let you have any more transactions.

I'm of the opinion that any design that presumes it can always fit all
the required transaction-status data in memory is probably not even
worth discussing.  There always has to be a way for status data to spill
to disk.  What's interesting is how you can achieve enough locality of
access so that most of what you need to look at is usually in memory.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 5:45 PM, Josh Berkus j...@agliodbs.com wrote:
 Robert,

 Uh, no it doesn't.  It only requires you to be more aggressive about
 vacuuming the transactions that are in the aborted-XIDs array.  It
 doesn't affect transaction wraparound vacuuming at all, either
 positively or negatively.  You still have to freeze xmins before they
 flip from being in the past to being in the future, but that's it.

 Sorry, I was trying to say that it's similar to the freeze issue, not
 that it affects freeze.  Sorry for the lack of clarity.

 What I was getting at is that this could cause us to vacuum
 relations/pages which would otherwise never be vacuumed (or at least,
 not until freeze).  Imagine a very large DW table which is normally
 insert-only and seldom queried, but once a month or so the insert aborts
 and rolls back.

Oh, I see.  In that case, under the proposed scheme, you'd get an
immediate vacuum of everything inserted into the table since the last
failed insert.  Everything prior to the last failed insert would be
OK, since the visibility map bits would already be set for those
pages.  Yeah, that would be annoying.

There's a related problem with index-only scans.  A large DW table
which is normally insert-only, but which IS queried regularly, won't
be able to use index-only scans effectively because, without regular
vacuuming, the visibility map bits won't be set.  We've
previously discussed the possibility of having the background writer
set hint bits before writing the pages, and maybe it could even set
the all-visible bit and update the visibility map, too.  But that
won't help if the transaction inserts a large enough quantity of data
that it starts spilling buffers to disk before it commits.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Gurjeet Singh
On Wed, Nov 10, 2010 at 1:15 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Once you know that there is, or isn't,
 a filesystem-level error involved, what are you going to do next?
 You're going to go try to debug the component you know is at fault,
 that's what.  And that problem is still AI-complete.


If we know for sure that Postgres was not at fault, then we have a standby
node to fail over to, where a Postgres warm standby is maintained by
streaming replication.

Regards
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurj...@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 6:42 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 dons asbestos underpants
 4. There would presumably be some finite limit on the size of the
 shared memory structure for aborted transactions.  I don't think
 there'd be any reason to make it particularly small, but if you sat
 there and aborted transactions at top speed you might eventually run
 out of room, at which point any transactions you started wouldn't be
 able to abort until vacuum made enough progress to free up an entry.

 Um, that bit is a *complete* nonstarter.  The possibility of a failed
 transaction always has to be allowed.  What if vacuum itself gets an
 error for example?  Or, what if the system crashes?

I wasn't proposing that it was impossible to abort, only that aborts
might have to block.  I admit I don't know what to do about VACUUM
itself failing.  A transient failure mightn't be so bad, but if you
find yourself permanently unable to eradicate the XIDs left behind by
an aborted transaction, you'll eventually have to shut down the
database, lest the XID space wrap around.

Actually, come to think of it, there's no reason you COULDN'T spill
the list of aborted-but-not-yet-cleaned-up XIDs to disk.  It's just
that XidInMVCCSnapshot() would get reeally expensive after a
while.

 I thought for a bit about inverting the idea, such that there were a
 limit on the number of unvacuumed *successful* transactions rather than
 the number of failed ones.  But that seems just as unforgiving: what if
 you really need to commit a transaction to effect some system state
 change?  An example might be dropping some enormous table that you no
 longer need, but vacuum is going to insist on plowing through before
 it'll let you have any more transactions.

The number of relevant aborted XIDs tends naturally to decline to zero
as vacuum does its thing, while the number of relevant committed XIDs
tends to grow very, very large (it starts to decline only when we
start freezing things), so remembering the not-yet-cleaned-up aborted
XIDs seems likely to be cheaper.  In fact, in many cases, the set of
not-yet-cleaned-up aborted XIDs will be completely empty.

 I'm of the opinion that any design that presumes it can always fit all
 the required transaction-status data in memory is probably not even
 worth discussing.

Well, InnoDB does it.

 There always has to be a way for status data to spill
 to disk.  What's interesting is how you can achieve enough locality of
 access so that most of what you need to look at is usually in memory.

We're not going to get any more locality of reference than we're
already getting from hint bits, are we?  The advantage of trying to do
timely cleanup of aborted transactions is that you can assume that any
XID before RecentGlobalXmin is committed, without checking CLOG and
without having to update hint bits and write out the ensuing dirty
pages.  If we could make CLOG access cheap enough that we didn't need
hint bits, that would also solve that problem, but nobody (including
me) seems to think that's feasible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DROP TABLESPACE needs crash-resistance

2010-11-09 Thread Gurjeet Singh
On Wed, Nov 10, 2010 at 1:24 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Gurjeet Singh singh.gurj...@gmail.com writes:
  We are facing a problem in dropping a tablespace after crash recovery. The
  recovery starts from the last checkpoint, but the tables that were created
  by a transaction in a tablespace before the checkpoint are still lying
  around; the transaction had not finished by the time of crash.

  After recovery, when the app tries to drop the tablespace, the command fails
  because the tablespace directory is not empty.

 Hmm.  The reason DROP TABLESPACE fails in that case, rather than just
 arbitrarily rm -rf'ing the files, is fear of deleting valuable data by
 accident.  I suppose we could have a mode that deletes the files without
 any manual intervention, but personally I'd regard that as a foot-gun.


That'd be a lot of help, especially when the DBA/app knows that nothing is
supposed to be left over in that tablespace; maybe querying pg_class in each
of the databases would act as a cross check.

The problem is that if there are open transactions that have just created a
relation in that tablespace, those rows won't be visible in pg_class yet.
Also, we don't take any kind of lock on a tablespace when opening a relation,
so we cannot be sure that there are no running transactions with an open
relation from that tablespace (I guess this is moot if the DBA/app knows
there's nothing supposed to be in the tablespace, but it doesn't hold for any
new connections trying to create tables there). A join between pg_locks and
pg_class would have been helpful, but pg_locks doesn't hold the tablespace
OID, and pg_class rows may not be visible yet.



  Solving this problem has become quite critical since the platform where
  Postgres is being used is supposed to run unattended.

 I'm not entirely clear as to the use-case for unattended DROP TABLESPACE?
 That doesn't really seem like an operation you should need on a routine
 basis.


For every new element to be managed, the application creates all the
relevant objects in a new schema, and assigns all the objects in that schema
to a new tablespace. So when that element is to be removed, we need to drop
the schema and the associated tablespace.
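
For example, the teardown is roughly (names hypothetical):

  DROP SCHEMA element_42 CASCADE;
  DROP TABLESPACE element_42_ts;
  -- after crash recovery the second statement can fail with
  --   ERROR:  tablespace "element_42_ts" is not empty
  -- because files from an unfinished pre-crash transaction are still on disk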

Regardless, having Postgres leave its trash behind is not desirable in any
scenario, so a solution that clears such files at the end of recovery would
be much more desirable.

Regards,
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurj...@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


Re: [HACKERS] timestamp of the last replayed transaction

2010-11-09 Thread Fujii Masao
On Wed, Nov 10, 2010 at 3:28 AM, Josh Berkus j...@agliodbs.com wrote:
 On 11/9/10 5:44 AM, Fujii Masao wrote:
 But, pg_last_xact_replay_timestamp is more intuitive for many people?
 If so, let's change
 the name.

 *None* of these names are intuitive.  So let's just go for consistency.

OK. I changed the name to pg_last_xact_replay_timestamp.
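
With that name, a standby-side monitoring query can be as simple as, for
example:

  -- on the standby: approximate how far replay lags behind the primary
  SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;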

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


pg_last_xact_replay_timestamp_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against unexpected zero-pages: proposal

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 7:04 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 9, 2010 at 5:45 PM, Josh Berkus j...@agliodbs.com wrote:
 Robert,

 Uh, no it doesn't.  It only requires you to be more aggressive about
 vacuuming the transactions that are in the aborted-XIDs array.  It
 doesn't affect transaction wraparound vacuuming at all, either
 positively or negatively.  You still have to freeze xmins before they
 flip from being in the past to being in the future, but that's it.

 Sorry, I was trying to say that it's similar to the freeze issue, not
 that it affects freeze.  Sorry for the lack of clarity.

 What I was getting at is that this could cause us to vacuum
 relations/pages which would otherwise never be vacuumed (or at least,
 not until freeze).  Imagine a very large DW table which is normally
 insert-only and seldom queried, but once a month or so the insert aborts
 and rolls back.

 Oh, I see.  In that case, under the proposed scheme, you'd get an
 immediate vacuum of everything inserted into the table since the last
 failed insert.  Everything prior to the last failed insert would be
 OK, since the visibility map bits would already be set for those
 pages.  Yeah, that would be annoying.

Ah, but it might be fixable.  You wouldn't really need to do a
full-fledged vacuum.  It would be sufficient to scan the heap pages
that might contain the XID we're trying to clean up after, without
touching the indexes.  Instead of actually removing tuples with an
aborted XMIN, you could just mark the line pointers LP_DEAD.  Tuples
with an aborted XMAX don't require touching the indexes anyway.  So as
long as you have some idea which segment of the relation was
potentially dirtied by that transaction, you could just scan those
blocks and update the item pointers and/or XMAX values for the
offending tuples without doing anything else (although you'd probably
want to opportunistically grab the buffer cleanup lock and defragment
if possible).

Unfortunately, I'm now realizing another problem.  During recovery,
you have to assume that any XIDs that didn't commit are aborted; under
the scheme I proposed upthread, if a transaction that was in-flight at
crash time had begun prior to the last checkpoint, you wouldn't know
which relations it had potentially dirtied.  Ouch.  But I think this
is fixable, too.  Let's invent a new on-disk structure called the
modified-content log.  Transactions that want to insert, update, or
delete tuples allocate pages from this structure.  The header of each
page stores the XID of the transaction that owns that page and the ID
of the database to which that transaction is bound.  Following the
header, there are a series of records of the form: tablespace OID,
table OID, starting page number, ending page number.  Each such record
indicates that the given XID may have put its XID on disk within the
given page range of the specified relation.  Each checkpoint flushes
the dirty pages of the modified-content log to disk along with
everything else.  Thus, on redo, we can reconstruct the additional
entries that need to be added to the log from the contents of WAL
subsequent to the redo pointer.

If a transaction commits, we can remove all of its pages from the
modified-content log; in fact, if a transaction begins and commits
without an intervening checkpoint, the pages never need to hit the
disk at all.  If a transaction aborts, its modified-content log pages
must stick around until we've eradicated any copies of its XID in the
relation data files.  We maintain a global value for the oldest
aborted XID which is not yet fully cleaned up (let's called this the
OldestNotQuiteDeadYetXID).  When we see an XID which precedes
OldestNotQuiteDeadYetXID, we know it's committed.  Otherwise, we check
whether the XID precedes the xmin of our snapshot.  If it does, we
have to check whether the XID is committed or aborted (it must be one
or the other).  If it does not, we use our snapshot, as now.  Checking
XIDs between OldestNotQuiteDeadYetXID and our snapshot's xmin is
potentially expensive, but (1) if there aren't many aborted
transactions, this case shouldn't arise very often; (2) if the XID
turns out to be aborted and we can get an exclusive buffer content
lock, we can nuke that copy of the XID to save the next guy the
trouble of examining it; and (3) we can maintain a size-limited
per-backend cache of this information, which should help in the normal
cases where there either aren't that many XIDs that fall into this
category or our transaction doesn't see all that many of them.

This also addresses Tom's concern about needing to store all the
information in memory, and the need to WAL-log not-yet-cleaned-up XIDs
at each checkpoint.  You still need to aggressively clean up after
aborted transactions, either using our current vacuum mechanism or the
"just zap the XIDs" shortcut described above.

(An additional interesting point about this design is that you could
potentially also use it to 

Re: [HACKERS] timestamp of the last replayed transaction

2010-11-09 Thread Robert Haas
On Tue, Nov 9, 2010 at 8:55 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Nov 10, 2010 at 3:28 AM, Josh Berkus j...@agliodbs.com wrote:
 On 11/9/10 5:44 AM, Fujii Masao wrote:
 But, pg_last_xact_replay_timestamp is more intuitive for many people?
 If so, let's change
 the name.

 *None* of these names are intuitive.  So let's just go for consistency.

 OK. I changed the name to pg_last_xact_replay_timestamp.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] We need index-only scans

2010-11-09 Thread Bruce Momjian
We last researched index-only scans, also called covering indexes, in
September of 2008, but have made little progress on it since.  Many have
been waiting for Heikki to implement this but I talked to him and he
doesn't have time.  
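
As a reminder of the goal, a covering index lets a query be answered from the
index alone, without visiting the heap at all, e.g. (hypothetical schema):

  CREATE INDEX orders_cust_date_idx ON orders (customer_id, order_date);

  -- every column the query touches is in the index, so an index-only scan
  -- could skip the heap once tuple visibility can be established cheaply
  SELECT customer_id, order_date
    FROM orders
   WHERE customer_id = 42;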

I believe it is time for the community to move forward and I would like
to assemble a team to work on this feature.  We might not be able to
implement it for Postgres 9.1, but hopefully we can make some progress
on this.

I have created a wiki page for this TODO item:

http://wiki.postgresql.org/wiki/Index-only_scans

I am interested in people improving this wiki page, and in people
discussing and coding some of the items needed to implement this
feature.  I have personally talked to a few people already who seemed
receptive.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] timestamp of the last replayed transaction

2010-11-09 Thread Fujii Masao
On Wed, Nov 10, 2010 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Nov 9, 2010 at 8:55 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Nov 10, 2010 at 3:28 AM, Josh Berkus j...@agliodbs.com wrote:
 On 11/9/10 5:44 AM, Fujii Masao wrote:
 But, pg_last_xact_replay_timestamp is more intuitive for many people?
 If so, let's change
 the name.

 *None* of these names are intuitive.  So let's just go for consistency.

 OK. I changed the name to pg_last_xact_replay_timestamp.

 Committed.

Thanks!

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] security hooks on object creation

2010-11-09 Thread Robert Haas
2010/11/9 KaiGai Kohei kai...@ak.jp.nec.com:
 The attached patch provides plugin modules with a hook invoked just after
 object creation. In typical use cases, it enables external security
 providers to assign default security labels at object creation time.

 As Robert suggested before, it provides a single general-purpose hook.
 It takes an ObjectAccessType enum, which informs plugins what kind of
 access is being performed, and the identifier of the object being
 referenced. In this version, however, additional information, such as
 the new name in ALTER xxx RENAME TO, is not yet supported.

 The ObjectAccessType is defined as follows:

  typedef enum ObjectAccessType {
    OAT_POST_CREATE,    /* Post creation fixups; such as security labeling */
  } ObjectAccessType;

 We will support a more complete set of access types in a future version;
 however, we are focusing on default labeling rather than DDL permissions
 right now, so only OAT_POST_CREATE is defined here.
 Later we may add OAT_ALTER, OAT_DROP, OAT_COMMENT and so on.

 In this patch, I put hooks just after the creation of the database objects
 to which we can assign security labels (schema, relation, attribute,
 procedure, language, type, large object).

 However, I didn't touch or move CommandCounterIncrement() yet, although
 we had a long discussion about the MVCC visibility of new objects,
 because I'm not sure whether it is really preferable to inject CCIs
 at random points such as TypeCreate() or ProcedureCreate(), as in the
 earlier version of the patch that I abandoned.
 (In other words, it was simply ugly...)

 At least we can see the new entries with SnapshotSelf, although we will
 pay a performance penalty. If so, one idea is not to touch anything
 related to CCIs.
 The purpose of the post-creation hooks is to assign default security
 labels, not to enforce DDL permissions. So it is not a bad idea to leave
 the CCI-related routines alone in the early versions of the external
 security provider.

 In this patch, we put InvokeObjectAccessHook0 on the following functions.

 - heap_create_with_catalog() for relations/attributes
 - ATExecAddColumn() for attributes
 - NamespaceCreate() for schemas
 - ProcedureCreate() for aggregates/functions
 - TypeCreate() and TypeShellMake() for types
 - create_proc_lang() for procedural languages
 - inv_create() for large objects

I think you ought to try to arrange to avoid the overhead of a
function call in the common case where nobody's using the hook.
That's why I originally suggested making InvokeObjectAccessHook() a
macro around the actual function call.

I don't want to refer to this as a framework for enhanced security
providers.  Let's stick with the term object access hook.  Calling
it an enhanced security provider overspecifies; it could equally well
be used for, say, logging.

Is there any compelling reason not to apply this to every object type
in the system (e.g. all the ones COMMENT can apply to)?  I don't see
any reason to restrict it to the set of objects to which it's sensible
to apply security labels.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Which file does the SELECT?

2010-11-09 Thread Robert Haas
On Mon, Nov 8, 2010 at 9:55 PM, Vaibhav Kaushal
vaibhavkaushal...@gmail.com wrote:
 I have started on the work and am using Eclipse, which helps quite a lot.
 I can find the declarations quite easily. Thanks to open source.

 BTW, I am encountering too many (just too many) data types as I try to
 understand the backend (specifically the executor). I do think that it's
 normal because the executor has to consider almost everything that the other
 parts of the DB can possibly command it to do.

 Is there some documentation available on the data types / structures which
 are in use at the backend?

There's less than one might hope.  I think you pretty much have to
look through the README files and source code comments.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] multi-platform, multi-locale regression tests

2010-11-09 Thread Peter Eisentraut
On tis, 2010-11-09 at 14:00 -0800, David E. Wheeler wrote:
 I've been talking to Nasby and Dunstan about adding a Test::More/pgTAP-based 
 test suite to the core. It wouldn't run with the usual core suite used by 
 developers, which would continue to use pg_regress. But they could run it if 
 they wanted (and had the prerequisites), and the build farm animals would run 
 them regularly.

I'd welcome something like that, but I'm not sure that that's the best
overall solution to this particular problem in the short run.  But it
would be great to have anyway.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers