Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Simon Riggs
On 30 October 2013 14:34, Yann Fontana  wrote:
>
>
>> On 30 October 2013 11:23, Leonardo Francalanci  wrote:
>>
>> >> In terms of generality, do you think its worth a man year of developer
>> >> effort to replicate what you have already achieved? Who would pay?
>
>
> I work on an application that does exactly what Leonardo described. We hit
> the exact same problem and came up with the exact same solution (down to
> the 15-minute interval). But I have also worked on various other
> datamarts (all using Oracle), and they are all subject to this problem in
> some form: B-tree indexes slow down bulk data inserts too much and need to
> be disabled or dropped and then recreated after the load. In some cases this
> is done easily enough, in others it's more complicated (example: every day,
> a process imports from 1 million to 1 billion records into a table partition
> that may contain from 0 to 1 billion records. To be as efficient as
> possible, you need some logic to compare the number of rows to insert to the
> number of rows already present, in order to decide whether to drop the
> indexes or not).
>
> Basically, my point is that this is a common problem for datawarehouses and
> datamarts. In my view, indexes that don't require developers to work around
> poor insert performance would be a significant feature in a
> "datawarehouse-ready" DBMS.


Everybody on this thread is advised to look closely at Min Max indexes
before starting any further work.

MinMax indexes will give us access to many new kinds of plan, and they are
about as close to perfectly efficient with regard to inserts (by which I
mean almost zero overhead) as it is possible to get.
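The block-range idea behind MinMax indexes can be sketched with a small simulation. This is an illustrative toy model (the names and block size are made up, and it is not PostgreSQL code): each block of rows is summarized by only its minimum and maximum value, so inserts merely widen one summary instead of descending a per-row tree.

```python
# Toy min/max (block-range) index: one (min, max) summary per block of rows.
BLOCK_SIZE = 4  # rows per block; a real system summarizes pages or page ranges

def build_minmax(values, block_size=BLOCK_SIZE):
    """Summarize each block of values as a (min, max) pair."""
    return [(min(values[i:i + block_size]), max(values[i:i + block_size]))
            for i in range(0, len(values), block_size)]

def candidate_blocks(summaries, lo, hi):
    """Blocks whose [min, max] overlaps the query range [lo, hi];
    only these blocks need a row-by-row scan."""
    return [i for i, (mn, mx) in enumerate(summaries)
            if mx >= lo and mn <= hi]

values = [10, 12, 11, 13, 50, 55, 52, 51, 90, 95, 91, 93]
summaries = build_minmax(values)
# summaries == [(10, 13), (50, 55), (90, 95)]
# A query for values between 50 and 60 only needs to scan block 1:
assert candidate_blocks(summaries, 50, 60) == [1]
```

An insert into an existing block only needs to widen that block's (min, max) pair, which is why the per-insert overhead is so low.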

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shave a few instructions from child-process startup sequence

2013-11-04 Thread Gurjeet Singh
On Mon, Nov 4, 2013 at 12:20 AM, Tom Lane  wrote:

> But we're not buying much.  A few instructions during postmaster shutdown
> is entirely negligible.
>

The patch is for ClosePostmasterPorts(), which is called from every child
process startup sequence (as $subject also implies), not in postmaster
shutdown. I hope that adds some weight to the argument.

Best regards,
-- 
Gurjeet Singh   gurjeet.singh.im
EnterpriseDB Inc. www.enterprisedb.com


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Simon Riggs
On 4 November 2013 19:55, Gavin Flower  wrote:

> How about having a 'TRANSIENT INDEX' that only exists in memory, so there is
> no requirement to write it to disk or to replicate directly? This type of
> index would be very fast and easier to implement.  Recovery would involve
> rebuilding the index, and sharing would involve recreating on a slave.
> Probably not appropriate for a primary index, but may be okay for secondary
> indexes used to speed specific queries.
>
> This could be useful in some situations now, and allow time to get
> experience in how best to implement the basic concept.  Then a more robust
> solution using WAL etc can be developed later.
>
> I suspect that such a TRANSIENT INDEX would still be useful even when a more
> robust in-memory index method was available, as I expect it would be faster
> to set up than a robust in-memory index. That might be good when you need to
> have one or more indexes for a short period of time, or when the index is so
> small that recreating it takes very little time (the total elapsed time
> might even be less than for a disk-backed one).

UNLOGGED indexes have been discussed over many years and there is
pretty good acceptance of the idea for some use cases.

The main tasks are
* marking the index so they are unavailable on a hot standby
* rebuilding the index in full at the end of recovery - requires an
additional process to rebuild them possibly in priority order
* make sure the above doesn't violate security
* marking the index so it can't be used in the planner until rebuild
is complete - subtle stuff

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] logical column order and physical column order

2013-11-04 Thread David Rowley
On Mon, Nov 4, 2013 at 3:14 AM, Alvaro Herrera wrote:

> David Rowley escribió:
> > I've just been looking at how alignment of columns in tuples can make the
> > tuple larger than needed.
>
> This has been discussed at length previously, and there was a design
> proposed to solve this problem.  See these past discussions:
>
> http://archives.postgresql.org/pgsql-hackers/2006-12/msg00983.php
> http://archives.postgresql.org/pgsql-hackers/2007-02/msg01235.php
> http://archives.postgresql.org/pgsql-hackers/2008-11/msg01680.php
>
> I started work on this, and managed to get parts of it to work.  While
> doing so I realized that it was quite a lot more hideous than I had
> originally expected.  I published a tree at github:
> https://github.com/alvherre/postgres/tree/column
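The tuple-size effect of column alignment quoted above can be illustrated with a small calculation. This is a rough sketch using hypothetical 4- and 8-byte columns that must start at offsets aligned to their own size (as int4 and int8 do); it is not the backend's actual tuple-layout code.

```python
# Rough illustration of how column order affects tuple size when each
# column must start at an offset that is a multiple of its size.
def tuple_size(column_sizes):
    offset = 0
    for size in column_sizes:
        offset += -offset % size  # pad up to the column's alignment boundary
        offset += size
    return offset

# An int4 followed by an int8 wastes 4 padding bytes before the int8:
assert tuple_size([4, 8, 4]) == 20   # 4 + 4 padding + 8 + 4
# Physically reordering the same columns removes the padding:
assert tuple_size([8, 4, 4]) == 16   # 8 + 4 + 4
```

Separating logical from physical order would let the system store columns in the padding-minimizing order while still displaying them in the order the user declared.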


Thanks for the archive links... I read these last night and pulled out some
key pieces of information from some of the posts.

I should say that I've not dived into the code enough to see how hard it
would be, but my (perhaps naive) original idea would have been to just add
one column to pg_attribute to store the logical order and have attnum store
the physical order. This would have meant that, at least, only the following
places would have to take the change into account.


1. pg_dump needs to display columns in logical order both for create tables
and copy/insert statements.
2. INSERT INTO table VALUES ( ... ) (i.e. without column names) needs to look
at logical order.
3. create table like 
4. backup and restore using copy
5. select * expand to column names

And of lesser importance, as I'd assume it would just be a change to an
ORDER BY clause in their queries of pg_attribute:

1. Display in clients... psql Pg Admin

I thought the above would have been doable and I did wonder what all the
fuss was about relating to bugs in the code where it could use the logical
number instead of attnum.

On reading the posts last night I can see that the idea was to add not one
but two new fields to pg_attribute: one for the physical order and one for
the logical order. At first I didn't really understand, as I thought attnum
would always be the physical order. I didn't realise before this that attnum
was static. I did some tests where I dropped one of the middle columns out
of a table and then rewrote the table with CLUSTER, and I see that the
pg_attribute record is kept even though the remains of the column values
have been wiped out by the rewrite. Is this done because things like
indexes, foreign keys and sequences reference {attrelid, attnum}? If so,
then I see why the two extra columns are needed, and I guess that's where
the extra complications come from.

So now I'm wondering: with my freshly clustered table, from which I dropped
one of the middle columns before the CLUSTER, my pg_attribute rows look
something like:

 relname |   attname| attnum
-+--+
 dropcol | tableoid | -7
 dropcol | cmax | -6
 dropcol | xmax | -5
 dropcol | cmin | -4
 dropcol | xmin | -3
 dropcol | ctid | -1
 dropcol | one  |  1
 dropcol | pg.dropped.2 |  2
 dropcol | three|  3

and I would imagine, since the table has just been clustered, that the
columns are stored like {..., ctid, one, three}.
In this case how does PostgreSQL know that attnum 3 is the 2nd user column
in that table? Unless I have misunderstood something, there must be some
logic in there to skip dropped columns, and if so, could it not just grab
the "attphynum" at that location, and then just modify the 1-5 places
listed above to sort on attlognum?
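The mapping being asked about here can be sketched as a toy model. The attribute names follow the proposal in the thread (`attnum` as permanent identity, a hypothetical `attlognum` for display order), and `is_dropped` stands in for PostgreSQL's pg.dropped marker; this is illustrative, not backend code.

```python
# Toy model of per-column metadata: attnum stays fixed forever (so indexes,
# foreign keys, etc. that reference {attrelid, attnum} remain valid), while
# a separate attlognum would control logical/display order.
class Attribute:
    def __init__(self, attnum, attname, attlognum, is_dropped=False):
        self.attnum = attnum        # permanent identity, never reused
        self.attname = attname
        self.attlognum = attlognum  # hypothetical logical position
        self.is_dropped = is_dropped

attrs = [
    Attribute(1, "one", 1),
    Attribute(2, "pg.dropped.2", 2, is_dropped=True),
    Attribute(3, "three", 3),
]

def user_columns_in_logical_order(attrs):
    """Skip dropped columns and sort the remainder by logical position."""
    live = [a for a in attrs if not a.is_dropped]
    return [a.attname for a in sorted(live, key=lambda a: a.attlognum)]

# attnum 3 becomes the 2nd *visible* column once dropped columns are skipped:
assert user_columns_in_logical_order(attrs) == ["one", "three"]
```

The point of the sketch is that "2nd user column" is a derived notion (skip dropped, order by the logical field), while attnum itself never changes.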

Regards

David Rowley


Re: [HACKERS] WITHIN GROUP patch

2013-11-04 Thread Vik Fearing
On 11/04/2013 08:43 AM, Atri Sharma wrote:
> Please find attached our latest version of the patch. This version
> fixes the issues pointed out by the reviewers.

No, it doesn't.  The documentation still contains formatting and
grammatical errors, and the code comments still do not match their
surroundings.  (The use of "I" in the code comments is a point I have
conceded on IRC, but I stand by my other remarks.)

Don't bother submitting a new patch until I've posted my technical
review, but please fix these issues on your local copy.

-- 
Vik





Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Amit Kapila
On Tue, Nov 5, 2013 at 12:52 AM, Tom Lane  wrote:
> Amit Kapila  writes:
>> On Mon, Nov 4, 2013 at 8:15 PM, Tom Lane  wrote:
>>> Really I'd like to see standalone mode, in its current form, go away
>>> completely.  I had a prototype patch for allowing psql and other clients
>>> to interface to a standalone backend.  I think getting that finished would
>>> be a way more productive use of time than improving debugtup.
>
>> The last patch including Windows implementation was posted at below
>> link sometime back.
>> http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853263C@szxeml509-mbs
>
>> If this is the right thing, I can rebase the patch and make it ready.
>
> IIRC, the discussion stalled out because people had security concerns
> and/or there wasn't consensus about how it should look at the user level.
> There's probably not much point in polishing a patch until we have more
> agreement about the high-level design.

Okay. However, +1 for working on that idea, as a lot of people have shown
interest in it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] List of "binary-compatible" data types

2013-11-04 Thread Noah Misch
On Mon, Nov 04, 2013 at 05:23:36PM -0800, Josh Berkus wrote:
> On 11/04/2013 05:21 PM, Josh Berkus wrote:
> > Thom,
> > 
> > 
> >> SELECT
> >>   castsource::regtype::text,
> >>   array_agg(casttarget::regtype order by casttarget::regtype::text) 
> >> casttargets
> >> FROM pg_cast
> >> WHERE castmethod = 'b'
> >> GROUP BY 1
> >> ORDER BY 1;
> > 
> > Are we actually covering 100% of these for ALTER COLUMN now?

Yes; ALTER TABLE ALTER TYPE refers to the same metadata as Thom's query.  If
you add to the list by issuing CREATE CAST ... WITHOUT FUNCTION, ALTER TABLE
will respect that, too.

> Also, JSON <--> Text seems to be missing from the possible binary
> conversions.  That's a TODO, I suppose.

Only json --> text, not json <-- text.  Note that you can add the cast
manually if you have an immediate need.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] [v9.4] row level security

2013-11-04 Thread Craig Ringer
On 11/04/2013 11:17 PM, Robert Haas wrote:
> I'd still like to hear what's wrong with what I said here:
> 
> http://www.postgresql.org/message-id/ca+tgmoyr1phw3x9vnvuwdcfxkzk2p_jhtwc0fv2q58negcx...@mail.gmail.com

For me, it's just my understanding: I'm still too new to the planner and
rewriter to grasp that properly as written.

I was responding to Tom's objection, and he makes a good point about
CASE and optimisation. We have to be free to re-order and pre-evaluate
where LEAKPROOF flags make it safe and permissible, without ever
otherwise doing so. That makes the SECURITY BARRIER subquery look
better, since the limited pull-up / push-down is already implemented there.

Robert, any suggestions on how to approach what you suggest? I'm pretty
new to the planner's guts, but I know there've been some complaints
about the way the current RLS code fiddles with Vars when it changes a
direct scan of a rel into a subquery scan.

The code in:

https://github.com/ringerc/postgres/blob/rls-9.4/src/backend/optimizer/prep/prepunion.c#L1647

and

https://github.com/ringerc/postgres/blob/rls-9.4/src/backend/optimizer/prep/prepunion.c#L1591

seem to be the ones folks were complaining about earlier.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] List of "binary-compatible" data types

2013-11-04 Thread Josh Berkus
On 11/04/2013 05:21 PM, Josh Berkus wrote:
> Thom,
> 
> 
>> SELECT
>>   castsource::regtype::text,
>>   array_agg(casttarget::regtype order by casttarget::regtype::text) 
>> casttargets
>> FROM pg_cast
>> WHERE castmethod = 'b'
>> GROUP BY 1
>> ORDER BY 1;
> 
> Are we actually covering 100% of these for ALTER COLUMN now?

Also, JSON <--> Text seems to be missing from the possible binary
conversions.  That's a TODO, I suppose.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] List of "binary-compatible" data types

2013-11-04 Thread Josh Berkus
Thom,


> SELECT
>   castsource::regtype::text,
>   array_agg(casttarget::regtype order by casttarget::regtype::text) 
> casttargets
> FROM pg_cast
> WHERE castmethod = 'b'
> GROUP BY 1
> ORDER BY 1;

Are we actually covering 100% of these for ALTER COLUMN now?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] pgsql: Remove internal uses of CTimeZone/HasCTZSet.

2013-11-04 Thread Noah Misch
On Mon, Nov 04, 2013 at 02:34:02PM -0500, Tom Lane wrote:
> Noah Misch  writes:
> > On Fri, Nov 01, 2013 at 04:51:34PM +, Tom Lane wrote:
> >> Remove internal uses of CTimeZone/HasCTZSet.
> 
> > This changed EncodeDateTime() output for USE_SQL_DATES and USE_GERMAN_DATES
> > styles, because it inserts a space before "tzn" but does not insert a space
> > before EncodeTimezone() output.  Example:
> 
> >   set datestyle = sql,mdy;
> >   select '2013-01-01'::timestamptz;
> 
> > old output:
> 
> >   timestamptz   
> > 
> >  01/01/2013 00:00:00+00
> 
> > new output:
> 
> >timestamptz
> > -
> >  01/01/2013 00:00:00 +00
> 
> > I assume this was unintended.
> 
> Yeah, I had seen some cases of that.  I don't find it of great concern.
> We'd have to insert some ugly special-case code that looks at the zone
> name to decide whether to insert a space or not, and I don't think it'd
> actually be an improvement to do so.  (Arguably, these formats are
> more consistent this way.)  Still, this change is probably a sufficient
> reason not to back-patch this part of the fix.

I was prepared to suppose that no substantial clientele relies on the
to_char() "TZ" format code expanding to blank, the other behavior that changed
with this patch.  It's more of a stretch to figure applications won't stumble
over this new whitespace.  I do see greater consistency in the new behavior,
but changing type output is a big deal.  While preserving the old output does
require ugly special-case code, such code was in place for years until removal
by this commit and the commit following it.  Perhaps making the behavior
change is best anyway, but we should not conclude that decision on the basis
of its origin as a side effect of otherwise-desirable refactoring.

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] Row-security writer-side checks proposal

2013-11-04 Thread Craig Ringer
On 11/04/2013 09:55 PM, Robert Haas wrote:
> I continue to think that this syntax is misguided.  For SELECT and
> DELETE there is only read-side security, and for INSERT there is only
> write-side security, so that's OK as far as it goes, but for UPDATE
> both read-side security and write-side security are possible, and
> there ought to be a way to get one without the other.  This syntax
> won't support that cleanly.

That's what I was thinking earlier too - separate "FOR READ" and "FOR
WRITE" instead.

The reason I came back to insert/update/delete was that it's entirely
reasonable to want to prohibit deletes but permit updates to the same
tuple. Both are writes; both set xmax, it's just that one _replaces_ the
tuple, the other doesn't.

So really, there are four cases:

READ
WRITE INSERT
WRITE UPDATE
WRITE DELETE

> I wonder whether it's worth thinking about the relationship between
> the write-side security contemplated for this feature iand the WITH
> CHECK OPTION syntax that we have for auto-updateable views, which
> serves more or less the same purpose.  I'm not sure that syntax is any
> great shakes, but it's existing precedent of some form and could
> perhaps at least be looked at as a source of inspiration.

I've been thinking about the overlap with WITH CHECK OPTION as well.

> I would generally expect that most people would want either "read side
> security for all commands" or "read and write side security for all
> commands".  I think whatever syntax we come up with this feature ought
> to make each of those things straightforward to get.

But sometimes with different predicates for read and write: i.e. you can
see rows you can't modify, or you can insert/update rows that you can't
see after the change.

Similarly, saying you can update but not delete seems quite reasonable
to me.

On the other hand, we might choose to say "if you want to do things with
that granularity use your own triggers to enforce it" and provide only
READ and WRITE for RLS.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] List of "binary-compatible" data types

2013-11-04 Thread Thom Brown
On 4 November 2013 21:58, Josh Berkus  wrote:
> Folks,
>
> From our docs:
>
> "Adding a column with a non-null default or changing the type of an
> existing column will require the entire table and indexes to be
> rewritten. As an exception, if the USING clause does not change the
> column contents and the old type is either binary coercible to the new
> type or an unconstrained domain over the new type, a table rewrite is
> not needed ..."
>
> Which is nice, but nowhere do we present users with a set of
> binary-compatible data types, even among the built-in types.  I'd
> happily write this up, if I knew what the binary-compatible data types
> *were*.

You could try this:

SELECT
  castsource::regtype::text,
  array_agg(casttarget::regtype order by casttarget::regtype::text) casttargets
FROM pg_cast
WHERE castmethod = 'b'
GROUP BY 1
ORDER BY 1;

-- 
Thom




Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Claudio Freire
On Mon, Nov 4, 2013 at 5:01 PM, Simon Riggs  wrote:
>> Of course, it's possible that even we do get a shared memory
>> allocator, a hypothetical person working on this project might prefer
>> to make the data block-structured anyway and steal storage from
>> shared_buffers.  So my aspirations in this area may not even be
>> relevant.  But I wanted to mention them, just in case anyone else is
>> thinking about similar things, so that we can potentially coordinate.
>
> If anyone was going to work on LSM tree, I would advise building a
> tree in shared/temp buffers first, then merging with the main tree.
> The merge process could use the killed tuple approach to mark the
> merging.
>
> The most difficult thing about buffering the inserts is deciding which
> poor sucker gets the task of cleaning up. That's probably better as an
> off-line process, which is where the work comes in. Non shared
> buffered approaches would add too much overhead to the main task.

Thing is, if you want crash safety guarantees, you cannot use temp
(unlogged) buffers, and then you always have to flush to WAL at each
commit. If the staging index is shared, then it could mean a lot of
WAL (ie: probably around double the amount of WAL a regular b-tree
would generate).

Process-private staging trees that get merged on commit, ie:
transaction-scope staging trees, on the other hand, do not require WAL
logging, they can use temp buffers, and since they don't outlive the
transaction, it's quite obvious who does the merging (the committer).
Question is what kind of workload does that speed up with any
significance and whether the amount of work is worth that speedup on
those workloads.




[HACKERS] List of "binary-compatible" data types

2013-11-04 Thread Josh Berkus
Folks,

From our docs:

"Adding a column with a non-null default or changing the type of an
existing column will require the entire table and indexes to be
rewritten. As an exception, if the USING clause does not change the
column contents and the old type is either binary coercible to the new
type or an unconstrained domain over the new type, a table rewrite is
not needed ..."

Which is nice, but nowhere do we present users with a set of
binary-compatible data types, even among the built-in types.  I'd
happily write this up, if I knew what the binary-compatible data types
*were*.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] GIN improvements part 1: additional information

2013-11-04 Thread Alexander Korotkov
On Mon, Oct 21, 2013 at 11:12 PM, Alexander Korotkov
wrote:

> The attached version of the patch is debugged. I would like to note that the
> number of bugs was low and it wasn't very hard to debug. I've rerun the tests
> on it. You can see that the numbers have improved as a result of your
> refactoring.
>
>  event | period
> ---+-
>  index_build   | 00:01:45.416822
>  index_build_recovery  | 00:00:06
>  index_update  | 00:05:17.263606
>  index_update_recovery | 00:01:22
>  search_new| 00:24:07.706482
>  search_updated| 00:26:25.364494
> (6 rows)
>
>  label  | blocks_mark
> +-
>  search_new |   847587636
>  search_updated |   881210486
> (2 rows)
>
>  label |   size
> ---+---
>  new   | 419299328
>  after_updates | 715915264
> (2 rows)
>
> Besides simple bugs, there was a subtle bug in the incremental item-index
> update. I've added some more comments, including an ASCII picture of how
> item indexes are shifted.
>
> Now, I'm trying to implement support for the old page format. Then we can
> decide which approach to use.
>

The attached version of the patch adds support for the old page format. The
patch still needs more documentation and probably some refactoring, but I
believe the idea is clear and it can be reviewed. In the patch I had to
revert the change of null category placement, for compatibility with the old
posting list format.

--
With best regards,
Alexander Korotkov.


gin-packed-postinglists-12.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] [BUGS] BUG #8542: Materialized View with another column_name does not work?

2013-11-04 Thread Kevin Grittner
Kevin Grittner  wrote:
> Michael Paquier  wrote:
>
>> I am not sure that adding a boolean flag introducing a concept
>> related to matview inside checkRuleResultList is the best
>> approach to solve that. checkRuleResultList is something related
>> only to rules, and has nothing related to matviews in it yet.
>
> Well, I was tempted to keep that concept in DefineQueryRewrite()
> where the call is made, by calling  the new bool something like
> requireColumnNameMatch and not having checkRuleResultList()
> continue to use isSelect for that purpose at all.
> DefineQueryRewrite() already references RELKIND_RELATION once,
> RELKIND_MATVIEW twice, and RELKIND_VIEW three times, so it would
> hardly be introducing a new concept there.

Upon reflection, that seemed to be cleaner.  Pushed fix that way. 
Not much of a change from the previously posted patch, but attached
here in case anyone wants to argue against this approach.

Thanks for the report!

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

*** a/src/backend/rewrite/rewriteDefine.c
--- b/src/backend/rewrite/rewriteDefine.c
***
*** 44,50 
  
  
  static void checkRuleResultList(List *targetList, TupleDesc resultDesc,
! 	bool isSelect);
  static bool setRuleCheckAsUser_walker(Node *node, Oid *context);
  static void setRuleCheckAsUser_Query(Query *qry, Oid userid);
  
--- 44,50 
  
  
  static void checkRuleResultList(List *targetList, TupleDesc resultDesc,
! 	bool isSelect, bool requireColumnNameMatch);
  static bool setRuleCheckAsUser_walker(Node *node, Oid *context);
  static void setRuleCheckAsUser_Query(Query *qry, Oid userid);
  
***
*** 355,361  DefineQueryRewrite(char *rulename,
  		 */
  		checkRuleResultList(query->targetList,
  			RelationGetDescr(event_relation),
! 			true);
  
  		/*
  		 * ... there must not be another ON SELECT rule already ...
--- 355,363 
  		 */
  		checkRuleResultList(query->targetList,
  			RelationGetDescr(event_relation),
! 			true,
! 			event_relation->rd_rel->relkind !=
! RELKIND_MATVIEW);
  
  		/*
  		 * ... there must not be another ON SELECT rule already ...
***
*** 484,490  DefineQueryRewrite(char *rulename,
  		 errmsg("RETURNING lists are not supported in non-INSTEAD rules")));
  			checkRuleResultList(query->returningList,
  RelationGetDescr(event_relation),
! false);
  		}
  	}
  
--- 486,492 
  		 errmsg("RETURNING lists are not supported in non-INSTEAD rules")));
  			checkRuleResultList(query->returningList,
  RelationGetDescr(event_relation),
! false, false);
  		}
  	}
  
***
*** 613,627  DefineQueryRewrite(char *rulename,
   *		Verify that targetList produces output compatible with a tupledesc
   *
   * The targetList might be either a SELECT targetlist, or a RETURNING list;
!  * isSelect tells which.  (This is mostly used for choosing error messages,
!  * but also we don't enforce column name matching for RETURNING.)
   */
  static void
! checkRuleResultList(List *targetList, TupleDesc resultDesc, bool isSelect)
  {
  	ListCell   *tllist;
  	int			i;
  
  	i = 0;
  	foreach(tllist, targetList)
  	{
--- 615,634 
   *		Verify that targetList produces output compatible with a tupledesc
   *
   * The targetList might be either a SELECT targetlist, or a RETURNING list;
!  * isSelect tells which.  This is used for choosing error messages.
!  *
!  * A SELECT targetlist may optionally require that column names match.
   */
  static void
! checkRuleResultList(List *targetList, TupleDesc resultDesc, bool isSelect,
! 	bool requireColumnNameMatch)
  {
  	ListCell   *tllist;
  	int			i;
  
+ 	/* Only a SELECT may require a column name match. */
+ 	Assert(isSelect || !requireColumnNameMatch);
+ 
  	i = 0;
  	foreach(tllist, targetList)
  	{
***
*** 657,663  checkRuleResultList(List *targetList, TupleDesc resultDesc, bool isSelect)
  	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  	 errmsg("cannot convert relation containing dropped columns to view")));
  
! 		if (isSelect && strcmp(tle->resname, attname) != 0)
  			ereport(ERROR,
  	(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
  	 errmsg("SELECT rule's target entry %d has different column name from \"%s\"", i, attname)));
--- 664,670 
  	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
  	 errmsg("cannot convert relation containing dropped columns to view")));
  
! 		if (requireColumnNameMatch && strcmp(tle->resname, attname) != 0)
  			ereport(ERROR,
  	(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
  	 errmsg("SELECT rule's target entry %d has different column name from \"%s\"", i, attname)));
*** a/src/test/regress/expected/matview.out
--- b/src/test/regress/expected/matview.out
***
*** 450,452  SELECT * FROM boxmv ORDER BY id;
--- 450,475 
  
  DROP TABLE boxes CASCADE;
  NOTICE:  drop cascades to materialized view

Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Jeff Janes
On Mon, Nov 4, 2013 at 8:09 AM, Robert Haas  wrote:

> On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs  wrote:
> > On 29 October 2013 16:10, Peter Geoghegan  wrote:
> >> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci 
> wrote:
> >>> I don't see much interest in insert-efficient indexes.
> >>
> >> Presumably someone will get around to implementing a btree index
> >> insertion buffer one day. I think that would be a particularly
> >> compelling optimization for us, because we could avoid ever inserting
> >> index tuples that are already dead when the deferred insertion
> >> actually occurs.
> >
> > That's pretty much what the LSM-tree is.
>
> What is pretty cool about this sort of thing is that there's no
> intrinsic reason the insertion buffer needs to be block-structured or
> disk-backed.


How do we commit to not spilling to disk, in the face of an unbounded
number of indexes existing and wanting to use this mechanism
simultaneously?  If it routinely needs to spill to disk, that would
probably defeat the purpose of having it in the first place, but committing
to never doing so seems to be extremely restrictive. As you say it is also
freeing, in terms of using pointers and such, but I think the restrictions
would outweigh the freedom.

>  In theory, you can structure the in-memory portion of
> the tree any way you like, using pointers and arbitrary-size memory
> allocations and all that fun stuff.  You need to log that there's a
> deferred insert (or commit to flushing the insertion buffer before
> every commit, which would seem to miss the point) so that recovery can
> reconstruct the in-memory data structure and flush it, but that's it:
> the WAL format need not know any other details of the in-memory
> portion of the tree.  I think that, plus the ability to use pointers
> and so forth, might lead to significant performance gains.
>
> In practice, the topology of our shared memory segment makes this a
> bit tricky.  The problem isn't so much that it's fixed size as that it
> lacks a real allocator, and that all the space used for shared_buffers
> is nailed down and can't be borrowed for other purposes.


I think the fixed size is also a real problem, especially given the
ubiquitous advice not to exceed 2 to 8 GB.

Cheers,

Jeff


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Simon Riggs
On 4 November 2013 16:09, Robert Haas  wrote:
> On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs  wrote:
>> On 29 October 2013 16:10, Peter Geoghegan  wrote:
>>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci  
>>> wrote:
 I don't see much interest in insert-efficient indexes.
>>>
>>> Presumably someone will get around to implementing a btree index
>>> insertion buffer one day. I think that would be a particularly
>>> compelling optimization for us, because we could avoid ever inserting
>>> index tuples that are already dead when the deferred insertion
>>> actually occurs.
>>
>> That's pretty much what the LSM-tree is.
>
> What is pretty cool about this sort of thing is that there's no
> intrinsic reason the insertion buffer needs to be block-structured or
> disk-backed.  In theory, you can structure the in-memory portion of
> the tree any way you like, using pointers and arbitrary-size memory
> allocations and all that fun stuff.  You need to log that there's a
> deferred insert (or commit to flushing the insertion buffer before
> every commit, which would seem to miss the point) so that recovery can
> reconstruct the in-memory data structure and flush it, but that's it:
> the WAL format need not know any other details of the in-memory
> portion of the tree.  I think that, plus the ability to use pointers
> and so forth, might lead to significant performance gains.

Sounds like it could be cool, yes.

> In practice, the topology of our shared memory segment makes this a
> bit tricky.  The problem isn't so much that it's fixed size as that it
> lacks a real allocator, and that all the space used for shared_buffers
> is nailed down and can't be borrowed for other purposes.  I'm very
> interested in solving the problem of getting a real allocator for
> shared memory because I think I need it for parallel sort

Agreed, you need a shared memory allocator for parallel query. It's
just too tempting to build a hash table in shared memory on one
thread, then use the hash table from multiple sessions as we do a
parallel scan. Roughly same thing for parallel sort - passing pointers
to complete data objects around is much easier than actually moving
the data between processes, which would slow things down. We also
need a generalised inter-node pipe, but that's not the best solution to
every problem.

It's also a great way of controlling over-allocation of resources.

> , and even if
> there's a way to avoid needing it there I have a strong feeling that
> we'll want it for other applications of dynamic shared memory.  But it
> should be possible to write the allocator in such a way that you just
> give it a chunk of memory to manage, and it does, remaining agnostic
> about whether the memory is from the main shared memory segment or a
> dynamic one.

Agreed

> Of course, it's possible that even if we do get a shared memory
> allocator, a hypothetical person working on this project might prefer
> to make the data block-structured anyway and steal storage from
> shared_buffers.  So my aspirations in this area may not even be
> relevant.  But I wanted to mention them, just in case anyone else is
> thinking about similar things, so that we can potentially coordinate.

If anyone was going to work on LSM tree, I would advise building a
tree in shared/temp buffers first, then merging with the main tree.
The merge process could use the killed tuple approach to mark the
merging.

The most difficult thing about buffering the inserts is deciding which
poor sucker gets the task of cleaning up. That's probably better as an
off-line process, which is where the work comes in. Non-shared-buffered
approaches would add too much overhead to the main task.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Gavin Flower

On 05/11/13 05:35, Robert Haas wrote:
> On Mon, Nov 4, 2013 at 11:32 AM, Andres Freund  wrote:
>> I think doing this outside of s_b will make stuff rather hard for
>> physical replication and crash recovery since we either will need to
>> flush the whole buffer at checkpoints - which is hard since the
>> checkpointer doesn't work inside individual databases - or we need to
>> persist the in-memory buffer across restart which also sucks.
>
> You might be right, but I think part of the value of LSM-trees is that
> the in-memory portion of the data structure is supposed to be able to
> be optimized for in-memory storage rather than on disk storage.  It
> may be that block-structuring that data bleeds away much of the
> performance benefit.  Of course, I'm talking out of my rear end here:
> I don't really have a clue how these algorithms are supposed to work.

How about having a 'TRANSIENT INDEX' that only exists in memory, so 
there is no requirement to write it to disk or to replicate directly? 
This type of index would be very fast and easier to implement.  Recovery 
would involve rebuilding the index, and sharing would involve recreating 
it on a slave.  Probably not appropriate for a primary index, but may be 
okay for secondary indexes used to speed up specific queries.


This could be useful in some situations now, and allow time to get 
experience in how best to implement the basic concept.  Then a more 
robust solution using WAL etc can be developed later.


I suspect that such a TRANSIENT INDEX would still be useful even when a 
more robust in-memory index method was available, as I expect it would 
be faster to set up.  That might be good when you need one or more 
indexes for a short period of time, or when the index is so small that 
recreating it requires very little time (total elapsed time might even 
be less than for a disk-backed one?).



Cheers,
Gavin




Re: [HACKERS] pgsql: Remove internal uses of CTimeZone/HasCTZSet.

2013-11-04 Thread Tom Lane
Noah Misch  writes:
> On Fri, Nov 01, 2013 at 04:51:34PM +, Tom Lane wrote:
>> Remove internal uses of CTimeZone/HasCTZSet.

> This changed EncodeDateTime() output for USE_SQL_DATES and USE_GERMAN_DATES
> styles, because it inserts a space before "tzn" but does not insert a space
> before EncodeTimezone() output.  Example:

>   set datestyle = sql,mdy;
>   select '2013-01-01'::timestamptz;

> old output:

>       timestamptz       
> ------------------------
>  01/01/2013 00:00:00+00

> new output:

>       timestamptz        
> -------------------------
>  01/01/2013 00:00:00 +00

> I assume this was unintended.

Yeah, I had seen some cases of that.  I don't find it of great concern.
We'd have to insert some ugly special-case code that looks at the zone
name to decide whether to insert a space or not, and I don't think it'd
actually be an improvement to do so.  (Arguably, these formats are
more consistent this way.)  Still, this change is probably a sufficient
reason not to back-patch this part of the fix.

regards, tom lane




Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Tom Lane
Amit Kapila  writes:
> On Mon, Nov 4, 2013 at 8:15 PM, Tom Lane  wrote:
>> Really I'd like to see standalone mode, in its current form, go away
>> completely.  I had a prototype patch for allowing psql and other clients
>> to interface to a standalone backend.  I think getting that finished would
>> be a way more productive use of time than improving debugtup.

> The last patch including Windows implementation was posted at below
> link sometime back.
> http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853263C@szxeml509-mbs

> If this is the right thing, I can rebase the patch and make it ready.

IIRC, the discussion stalled out because people had security concerns
and/or there wasn't consensus about how it should look at the user level.
There's probably not much point in polishing a patch until we have more
agreement about the high-level design.

regards, tom lane




Re: [HACKERS] pgsql: Remove internal uses of CTimeZone/HasCTZSet.

2013-11-04 Thread Noah Misch
On Fri, Nov 01, 2013 at 04:51:34PM +, Tom Lane wrote:
> Remove internal uses of CTimeZone/HasCTZSet.
> 
> The only remaining places where we actually look at CTimeZone/HasCTZSet
> are abstime2tm() and timestamp2tm().  Now that session_timezone is always
> valid, we can remove these special cases.  The caller-visible impact of
> this is that these functions now always return a valid zone abbreviation
> if requested, whereas before they'd return a NULL pointer if a brute-force
> timezone was in use.  In the existing code, the only place I can find that
> changes behavior is to_char(), whose TZ format code will now print
> something useful rather than nothing for such zones.  (In the places where
> the returned zone abbreviation is passed to EncodeDateTime, the lack of
> visible change is because we've chosen the abbreviation used for these
> zones to match what EncodeTimezone would have printed.)

This changed EncodeDateTime() output for USE_SQL_DATES and USE_GERMAN_DATES
styles, because it inserts a space before "tzn" but does not insert a space
before EncodeTimezone() output.  Example:

  set datestyle = sql,mdy;
  select '2013-01-01'::timestamptz;

old output:

      timestamptz       
------------------------
 01/01/2013 00:00:00+00

new output:

   timestamptz
-
 01/01/2013 00:00:00 +00

I assume this was unintended.

> This could be back-patched if we decide we'd like to fix to_char()'s
> behavior in the back branches, but there doesn't seem to be much
> enthusiasm for that at present.

Note that for the corresponding scenario in Oracle, the "TZD" format code
yields blank and the "TZR" format code yields "-01:30".  (Oracle does not have
a plain "TZ" escape.)  So there's precedent for each choice of behavior, and I
don't see this as a back-patch candidate.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com




Re: [HACKERS] How can I build OSSP UUID support on Windows to avoid duplicate UUIDs?

2013-11-04 Thread Christopher Browne
On Thu, Oct 31, 2013 at 3:42 PM, Robert Haas  wrote:
> On Thu, Oct 31, 2013 at 2:44 PM, Garick Hamlin  wrote:
>> I think using /dev/urandom directly would be surprising. At least it
>> would probably have taken me a while to figure out what was depleting the
>> entropy pool here.
>
> Perhaps so; a bigger problem IMHO is that it's not portable. I think
> the only way to solve this problem is to import (or have an option to
> link with) a strong, sophisticated PRNG with much larger internal
> state than pg_lrand48, which uses precisely 48 bits of internal state.
> For this kind of thing, I'm fairly sure that we need something with
> at least 128 bits of internal state (as wide as the random value we
> want to generate) and I suspect it might be advantageous to have
> something a whole lot wider, maybe a few kB.

I mentioned the notion of building an entropy pool, into which one might
add various sorts of random inputs, under separate cover...

The last time I had need of a rather non-repeating RNG, I went with
a Fibonacci-based one, namely Mersenne Twister...



The sample has 624 integers (presumably that means 624x32 bits) as
its internal state. Apparently not terribly suitable for cryptographic
purposes, but definitely highly non-repetitive, which is what we're
notably worried about for UUIDs.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


Re: [HACKERS] Removal of archive in wal_level

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 11:45 AM, Stephen Frost  wrote:
> * Peter Eisentraut (pete...@gmx.net) wrote:
>> On 11/4/13, 8:58 AM, Robert Haas wrote:
>> > On Mon, Nov 4, 2013 at 5:57 AM, Michael Paquier
>> >  wrote:
>> >>> Please find attached a patch doing what is written in the $subject.
>> >> With the documentation updated, this is even better...
>> >
>> > I'm unconvinced that there's any value in this.
>>
>> Yeah, the only thing this will accomplish is to annoy people who are
>> actually using that level.  It would be more interesting if we could get
>> rid of the wal_level setting altogether, but of course there are valid
>> reasons against that.
>
> It would actually be valuable to 'upgrade' those people to
> hot_standby, which is what I had kind of been hoping would happen
> eventually.  I agree that there's no use for 'archive' today, but rather
> than break existing configs that use it, just make 'archive' and
> 'hot_standby' mean the same thing.  In the end, I'd probably vote to
> make 'hot_standby' the 'legacy/deprecated' term anyway.

That strikes me as a better idea than what the patch actually does,
but I still think it's nanny-ism.  I don't believe we have the right
to second-guess the choices our users make in this area.  We can make
recommendations in the documentation, but at the end of the day if
users choose to use archive rather than hot_standby, we should respect
that choice, not break it because we think we know better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Removal of archive in wal_level

2013-11-04 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
> On 11/4/13, 8:58 AM, Robert Haas wrote:
> > On Mon, Nov 4, 2013 at 5:57 AM, Michael Paquier
> >  wrote:
> >>> Please find attached a patch doing what is written in the $subject.
> >> With the documentation updated, this is even better...
> > 
> > I'm unconvinced that there's any value in this.
> 
> Yeah, the only thing this will accomplish is to annoy people who are
> actually using that level.  It would be more interesting if we could get
> rid of the wal_level setting altogether, but of course there are valid
> reasons against that.

It would actually be valuable to 'upgrade' those people to
hot_standby, which is what I had kind of been hoping would happen
eventually.  I agree that there's no use for 'archive' today, but rather
than break existing configs that use it, just make 'archive' and
'hot_standby' mean the same thing.  In the end, I'd probably vote to
make 'hot_standby' the 'legacy/deprecated' term anyway.

Thanks,

Stephen




Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 11:32 AM, Andres Freund  wrote:
> I think doing this outside of s_b will make stuff rather hard for
> physical replication and crash recovery since we either will need to
> flush the whole buffer at checkpoints - which is hard since the
> checkpointer doesn't work inside individual databases - or we need to
> persist the in-memory buffer across restart which also sucks.

You might be right, but I think part of the value of LSM-trees is that
the in-memory portion of the data structure is supposed to be able to
be optimized for in-memory storage rather than on disk storage.  It
may be that block-structuring that data bleeds away much of the
performance benefit.  Of course, I'm talking out of my rear end here:
I don't really have a clue how these algorithms are supposed to work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Andres Freund
On 2013-11-04 11:27:33 -0500, Robert Haas wrote:
> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire  
> wrote:
> > Such a thing would help COPY, so maybe it's worth a look
> 
> I have little doubt that a deferred insertion buffer of some kind
> could help performance on some workloads, though I suspect the buffer
> would have to be pretty big to make it worthwhile on a big COPY that
> generates mostly-random insertions.

Even for random data, presorting the to-be-inserted data appropriately
could result in much better IO patterns.

> I think the question is not so
> much whether it's worth doing but where anyone's going to find the
> time to do it.

Yea :(

I think doing this outside of s_b will make stuff rather hard for
physical replication and crash recovery since we either will need to
flush the whole buffer at checkpoints - which is hard since the
checkpointer doesn't work inside individual databases - or we need to
persist the in-memory buffer across restart which also sucks.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 11:31 AM, Claudio Freire  wrote:
> On Mon, Nov 4, 2013 at 1:27 PM, Robert Haas  wrote:
>> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire  
>> wrote:
>>> Such a thing would help COPY, so maybe it's worth a look
>>
>> I have little doubt that a deferred insertion buffer of some kind
>> could help performance on some workloads, though I suspect the buffer
>> would have to be pretty big to make it worthwhile on a big COPY that
>> generates mostly-random insertions.  I think the question is not so
>> much whether it's worth doing but where anyone's going to find the
>> time to do it.
>
>
> However, since an admin can increase work_mem for that COPY, using
> work_mem for this would be reasonable, don't you agree?

Without implementing it and benchmarking the result, it's pretty hard
to know.  But if I were a betting man, I'd bet that's not nearly big
enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] pg_ctl status with nonexistent data directory

2013-11-04 Thread Robert Haas
On Sat, Nov 2, 2013 at 3:32 PM, Peter Eisentraut  wrote:
> This doesn't seem right:
>
> $ pg_ctl -D /nowhere status
> pg_ctl: no server running
>
> It does exit with status 3, so it's not all that broken, but I think the
> error message could be more accurate.

I doubt anyone will object if you feel the urge to fix it.  I won't, anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Claudio Freire
On Mon, Nov 4, 2013 at 1:27 PM, Robert Haas  wrote:
> On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire  
> wrote:
>> Such a thing would help COPY, so maybe it's worth a look
>
> I have little doubt that a deferred insertion buffer of some kind
> could help performance on some workloads, though I suspect the buffer
> would have to be pretty big to make it worthwhile on a big COPY that
> generates mostly-random insertions.  I think the question is not so
> much whether it's worth doing but where anyone's going to find the
> time to do it.


However, since an admin can increase work_mem for that COPY, using
work_mem for this would be reasonable, don't you agree?




Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 11:24 AM, Claudio Freire  wrote:
> Such a thing would help COPY, so maybe it's worth a look

I have little doubt that a deferred insertion buffer of some kind
could help performance on some workloads, though I suspect the buffer
would have to be pretty big to make it worthwhile on a big COPY that
generates mostly-random insertions.  I think the question is not so
much whether it's worth doing but where anyone's going to find the
time to do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Claudio Freire
On Mon, Nov 4, 2013 at 1:09 PM, Robert Haas  wrote:
> On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs  wrote:
>> On 29 October 2013 16:10, Peter Geoghegan  wrote:
>>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci  
>>> wrote:
 I don't see much interest in insert-efficient indexes.
>>>
>>> Presumably someone will get around to implementing a btree index
>>> insertion buffer one day. I think that would be a particularly
>>> compelling optimization for us, because we could avoid ever inserting
>>> index tuples that are already dead when the deferred insertion
>>> actually occurs.
>>
>> That's pretty much what the LSM-tree is.
>
> What is pretty cool about this sort of thing is that there's no
> intrinsic reason the insertion buffer needs to be block-structured or
> disk-backed.  In theory, you can structure the in-memory portion of
> the tree any way you like, using pointers and arbitrary-size memory
> allocations and all that fun stuff.  You need to log that there's a
> deferred insert (or commit to flushing the insertion buffer before
> every commit, which would seem to miss the point)


Such a thing would help COPY, so maybe it's worth a look




Re: [HACKERS] dsm use of uint64

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 10:55 AM, Andres Freund  wrote:
>> Ah.  This is because I didn't change the format code used to print the
>> arguments; it's still using UINT64_FORMAT, but the argument is now a
>> Size.  What's the right way to print out a Size, anyway?
>
> There isn't a nice one currently. glibc/gcc have %z that automatically
> has the right width, but that's not supported by windows. I've been
> wondering if we shouldn't add support for that just like we have added
> support for %m.
>
>>  Should I
>> just try %lu?  It seems that sysv_shmem.c uses %lu, but also casts the
>> value to (unsigned long); I could follow that precedent here, if
>> there's no consistency to what format is needed to print a size_t.
>
> Yes, you need a cast like that.

Whee, portability is fun.  OK, I changed it to work that way.
Hopefully that'll do the trick.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Fast insertion indexes: why no developments

2013-11-04 Thread Robert Haas
On Sat, Nov 2, 2013 at 6:07 AM, Simon Riggs  wrote:
> On 29 October 2013 16:10, Peter Geoghegan  wrote:
>> On Tue, Oct 29, 2013 at 7:53 AM, Leonardo Francalanci  
>> wrote:
>>> I don't see much interest in insert-efficient indexes.
>>
>> Presumably someone will get around to implementing a btree index
>> insertion buffer one day. I think that would be a particularly
>> compelling optimization for us, because we could avoid ever inserting
>> index tuples that are already dead when the deferred insertion
>> actually occurs.
>
> That's pretty much what the LSM-tree is.

What is pretty cool about this sort of thing is that there's no
intrinsic reason the insertion buffer needs to be block-structured or
disk-backed.  In theory, you can structure the in-memory portion of
the tree any way you like, using pointers and arbitrary-size memory
allocations and all that fun stuff.  You need to log that there's a
deferred insert (or commit to flushing the insertion buffer before
every commit, which would seem to miss the point) so that recovery can
reconstruct the in-memory data structure and flush it, but that's it:
the WAL format need not know any other details of the in-memory
portion of the tree.  I think that, plus the ability to use pointers
and so forth, might lead to significant performance gains.

In practice, the topology of our shared memory segment makes this a
bit tricky.  The problem isn't so much that it's fixed size as that it
lacks a real allocator, and that all the space used for shared_buffers
is nailed down and can't be borrowed for other purposes.  I'm very
interested in solving the problem of getting a real allocator for
shared memory because I think I need it for parallel sort, and even if
there's a way to avoid needing it there I have a strong feeling that
we'll want it for other applications of dynamic shared memory.  But it
should be possible to write the allocator in such a way that you just
give it a chunk of memory to manage, and it does, remaining agnostic
about whether the memory is from the main shared memory segment or a
dynamic one.

Of course, it's possible that even if we do get a shared memory
allocator, a hypothetical person working on this project might prefer
to make the data block-structured anyway and steal storage from
shared_buffers.  So my aspirations in this area may not even be
relevant.  But I wanted to mention them, just in case anyone else is
thinking about similar things, so that we can potentially coordinate.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] dsm use of uint64

2013-11-04 Thread Andres Freund
On 2013-11-04 10:46:06 -0500, Robert Haas wrote:
> On Fri, Nov 1, 2013 at 11:45 PM, Peter Eisentraut  wrote:
> > On Mon, 2013-10-28 at 12:17 -0400, Robert Haas wrote:
> >> On Sun, Oct 27, 2013 at 11:34 PM, Noah Misch  wrote:
> >> > On Fri, Oct 25, 2013 at 10:11:41PM -0400, Robert Haas wrote:
> >> >> When I wrote the dynamic shared memory patch, I used uint64 everywhere
> >> >> to measure sizes - rather than, as we do for the main shared memory
> >> >> segment, Size.  This now seems to me to have been the wrong decision;
> >
> > This change is now causing compiler warnings on 32-bit platforms.  You
> > can see them here, for example:
> > http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lapwing&dt=2013-11-01%2020%3A45%3A01&stg=make
> 
> Ah.  This is because I didn't change the format code used to print the
> arguments; it's still using UINT64_FORMAT, but the argument is now a
> Size.  What's the right way to print out a Size, anyway?

There isn't a nice one currently. glibc/gcc have %z that automatically
has the right width, but that's not supported by windows. I've been
wondering if we shouldn't add support for that just like we have added
support for %m.

>  Should I
> just try %lu?  It seems that sysv_shmem.c uses %lu, but also casts the
> value to (unsigned long); I could follow that precedent here, if
> there's no consistency to what format is needed to print a size_t.

Yes, you need a cast like that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Removal of archive in wal_level

2013-11-04 Thread Peter Eisentraut
On 11/4/13, 8:58 AM, Robert Haas wrote:
> On Mon, Nov 4, 2013 at 5:57 AM, Michael Paquier
>  wrote:
>>> Please find attached a patch doing what is written in the $subject.
>> With the documentation updated, this is even better...
> 
> I'm unconvinced that there's any value in this.

Yeah, the only thing this will accomplish is to annoy people who are
actually using that level.  It would be more interesting if we could get
rid of the wal_level setting altogether, but of course there are valid
reasons against that.




Re: [HACKERS] dsm use of uint64

2013-11-04 Thread Robert Haas
On Fri, Nov 1, 2013 at 11:45 PM, Peter Eisentraut  wrote:
> On Mon, 2013-10-28 at 12:17 -0400, Robert Haas wrote:
>> On Sun, Oct 27, 2013 at 11:34 PM, Noah Misch  wrote:
>> > On Fri, Oct 25, 2013 at 10:11:41PM -0400, Robert Haas wrote:
>> >> When I wrote the dynamic shared memory patch, I used uint64 everywhere
>> >> to measure sizes - rather than, as we do for the main shared memory
>> >> segment, Size.  This now seems to me to have been the wrong decision;
>
> This change is now causing compiler warnings on 32-bit platforms.  You
> can see them here, for example:
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=lapwing&dt=2013-11-01%2020%3A45%3A01&stg=make

Ah.  This is because I didn't change the format code used to print the
arguments; it's still using UINT64_FORMAT, but the argument is now a
Size.  What's the right way to print out a Size, anyway?  Should I
just try %lu?  It seems that sysv_shmem.c uses %lu, but also casts the
value to (unsigned long); I could follow that precedent here, if
there's no consistency to what format is needed to print a size_t.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Amit Kapila
On Mon, Nov 4, 2013 at 8:15 PM, Tom Lane  wrote:
> Andres Freund  writes:
>> On 2013-11-02 15:29:36 -0400, Tom Lane wrote:
>>> Attached is a proposed patch for this.  It fixes most of the functions
>>> in printtup.c to use a per-row memory context.  (I did not bother to
>>> fix debugtup(), which is used only in standalone mode.  If you're doing
>>> queries large enough for mem leaks to be problematic in standalone mode,
>>> you're nuts.)
>
>> FWIW, by that definition I have been nuts several time in the past -
>> without much choice since I was recovering data from a corrupted cluster
>> and the database couldn't be started normally. This now has gotten worse
>> (even in the backbranches) since debugtup won't clean up detoasted data
>> anymore.
>> But I guess the solution for that is to use COPY in those situations
>> which shouldn't have that problem and should work. Also, easier to parse ;)
>
> Considering the output from debugtup goes to stdout where it will be
> intermixed with prompts etc, I'd have to think that COPY is a way better
> solution for dealing with bulk data.
>
> Really I'd like to see standalone mode, in its current form, go away
> completely.  I had a prototype patch for allowing psql and other clients
> to interface to a standalone backend.  I think getting that finished would
> be a way more productive use of time than improving debugtup.

The last patch including Windows implementation was posted at below
link sometime back.
http://www.postgresql.org/message-id/6C0B27F7206C9E4CA54AE035729E9C382853263C@szxeml509-mbs

If this is the right thing, I can rebase the patch and make it ready.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [v9.4] row level security

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 9:37 AM, Tom Lane  wrote:
> Craig Ringer  writes:
>> On 09/04/2013 11:22 PM, Tom Lane wrote:
>>> AFAICT, to deal with update/delete the RLS patch needs to constrain order
>>> of qual application without the crutch of having a separate level of
>>> subquery; and it's that behavior that I have zero confidence in, either
>>> as to whether it works as submitted or as to our odds of not breaking it
>>> in the future.
>
>> Wouldn't CASE do that job, albeit in a somewhat ugly manner, and also
>> protect against malicious RLS functions?
>
> You mean wrap all the query quals in a CASE?  Sure, if you didn't mind
> totally destroying any optimization possibilities.  If you did that,
> every table scan would become a seqscan and every join a nestloop.

I'd still like to hear what's wrong with what I said here:

http://www.postgresql.org/message-id/ca+tgmoyr1phw3x9vnvuwdcfxkzk2p_jhtwc0fv2q58negcx...@mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Something fishy happening on frogmouth

2013-11-04 Thread Tom Lane
Andres Freund  writes:
> On 2013-11-04 13:13:27 +0200, Heikki Linnakangas wrote:
>> On 04.11.2013 11:55, Andres Freund wrote:
>>> Also, I don't think it's portable across platforms to access segments
>>> that already have been unlinked.

>> See
>> http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html:
>> "If one or more references to the shared memory object exist when the object
>> is unlinked, the name shall be removed before shm_unlink() returns, but the
>> removal of the memory object contents shall be postponed until all open and
>> map references to the shared memory object have been removed."

> We also support sysv shmem and have the same cleanup problem there.

And what about Windows?

regards, tom lane




Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Andres Freund
On 2013-11-04 09:45:22 -0500, Tom Lane wrote:
> Really I'd like to see standalone mode, in its current form, go away
> completely.  I had a prototype patch for allowing psql and other clients
> to interface to a standalone backend.  I think getting that finished would
> be a way more productive use of time than improving debugtup.

+many

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Tom Lane
Andres Freund  writes:
> On 2013-11-02 15:29:36 -0400, Tom Lane wrote:
>> Attached is a proposed patch for this.  It fixes most of the functions
>> in printtup.c to use a per-row memory context.  (I did not bother to
>> fix debugtup(), which is used only in standalone mode.  If you're doing
>> queries large enough for mem leaks to be problematic in standalone mode,
>> you're nuts.) 

> FWIW, by that definition I have been nuts several times in the past -
> without much choice since I was recovering data from a corrupted cluster
> and the database couldn't be started normally. This now has gotten worse
> (even in the backbranches) since debugtup won't clean up detoasted data
> anymore.
> But I guess the solution for that is to use COPY in those situations
> which shouldn't have that problem and should work. Also, easier to parse ;)

Considering the output from debugtup goes to stdout where it will be
intermixed with prompts etc, I'd have to think that COPY is a way better
solution for dealing with bulk data.

Really I'd like to see standalone mode, in its current form, go away
completely.  I had a prototype patch for allowing psql and other clients
to interface to a standalone backend.  I think getting that finished would
be a way more productive use of time than improving debugtup.

regards, tom lane




Re: [HACKERS] [v9.4] row level security

2013-11-04 Thread Tom Lane
Craig Ringer  writes:
> On 09/04/2013 11:22 PM, Tom Lane wrote:
>> AFAICT, to deal with update/delete the RLS patch needs to constrain order
>> of qual application without the crutch of having a separate level of
>> subquery; and it's that behavior that I have zero confidence in, either
>> as to whether it works as submitted or as to our odds of not breaking it
>> in the future.

> Wouldn't CASE do that job, albeit in a somewhat ugly manner, and also
> protect against malicious RLS functions?

You mean wrap all the query quals in a CASE?  Sure, if you didn't mind
totally destroying any optimization possibilities.  If you did that,
every table scan would become a seqscan and every join a nestloop.

regards, tom lane




Re: [HACKERS] Removal of archive in wal_level

2013-11-04 Thread Robert Haas
On Mon, Nov 4, 2013 at 5:57 AM, Michael Paquier
 wrote:
>> Please find attached a patch doing what is written in the $subject.
> With the documentation updated, this is even better...

I'm unconvinced that there's any value in this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Row-security writer-side checks proposal

2013-11-04 Thread Robert Haas
On Fri, Nov 1, 2013 at 3:52 AM, Craig Ringer  wrote:
> I've been looking some more into write-side checks in row-security and
> have a suggestion.
>
> Even though write-side checks are actually fairly separate to read
> checks, and can be done as another step, I'd like to think about them
> before the catalog format and syntax are settled. I think we need fields
> for write operations in pg_rowsecurity and the syntax to set them so
> that the catalog information can be used by triggers to enforce write
> checks. Even if, for the first cut, they're not supported by built-in
> auto-created triggers.
>
> Here's my proposal, let me know what you think:
>
> SET ROW SECURITY FOR { ALL COMMANDS | {[SELECT,INSERT,UPDATE,DELETE}+}
>
> in other words, you specify either:
>
> SET ROW SECURITY FOR ALL COMMANDS

I continue to think that this syntax is misguided.  For SELECT and
DELETE there is only read-side security, and for INSERT there is only
write-side security, so that's OK as far as it goes, but for UPDATE
both read-side security and write-side security are possible, and
there ought to be a way to get one without the other.  This syntax
won't support that cleanly.

I wonder whether it's worth thinking about the relationship between
the write-side security contemplated for this feature and the WITH
CHECK OPTION syntax that we have for auto-updatable views, which
serves more or less the same purpose.  I'm not sure that syntax is any
great shakes, but it's existing precedent of some form and could
perhaps at least be looked at as a source of inspiration.

I would generally expect that most people would want either "read side
security for all commands" or "read and write side security for all
commands".  I think whatever syntax we come up with for this feature ought
to make each of those things straightforward to get.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Extension Templates S03E11

2013-11-04 Thread Stephen Frost
Dimitri,

* Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote:
> So please find v15 of the patch attached to this email, that passes all
> previously done checks and this one too now.

Looks like there's been a bit of unfortunate bitrot due to Tom's change
to disable fancy output:

patching file src/test/regress/expected/sanity_check.out
Hunk #1 FAILED at 104.
Hunk #2 FAILED at 166.
2 out of 2 hunks FAILED -- saving rejects to file 
src/test/regress/expected/sanity_check.out.rej

Are there any other changes you have pending for this..?  Would be nice
to see the latest version which you've tested and which patches cleanly
against master... ;)

I'll still go ahead and start looking through this, per our discussion.

Thanks,

Stephen




Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?

2013-11-04 Thread Andres Freund
On 2013-11-04 13:48:32 +0100, Andres Freund wrote:
> > Hmm, the startup process doesn't participate in sinval messaging at all,
> > does it?
> 
> Well, not sinval but inval, in hot standby via commit messages.

Err, that's bullshit, sorry for that. We send the messages via sinval,
but never (probably at least?) process sinval messages.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?

2013-11-04 Thread Andres Freund
On 2013-11-04 14:37:53 +0200, Heikki Linnakangas wrote:
> On 04.11.2013 11:35, Andres Freund wrote:
> >On 2013-11-04 09:38:27 +0200, Heikki Linnakangas wrote:
> >>diff --git a/src/backend/access/transam/xlogutils.c 
> >>b/src/backend/access/transam/xlogutils.c
> >>index 5429d5e..f732e71 100644
> >>--- a/src/backend/access/transam/xlogutils.c
> >>+++ b/src/backend/access/transam/xlogutils.c
> >>@@ -422,7 +422,11 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
> >>rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
> >>rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
> >>
> >>-   rel->rd_smgr = NULL;
> >>+   /*
> >>+* Open the underlying storage, but don't set rd_owner. We want the
> >>+* SmgrRelation to persist after the fake relcache entry is freed.
> >>+*/
> >>+   rel->rd_smgr = smgropen(rnode, InvalidBackendId);
> >>
> >>return rel;
> >>  }
> >
> >I don't really like that - that will mean we'll leak open smgr handles
> >for every relation touched via smgr during recovery since they will
> >never be closed unless the relation is dropped or such. And in some
> >databases there can be huge amounts of relations.
> >Since recovery is long lived that doesn't seem like a good idea, besides
> >the memory usage it will also bloat smgr's internal hash which actually
> >seems just as likely to hurt performance.
> 
> That's the way SmgrRelations are supposed to be used. Opened on first use,
> and kept open forever. We never smgrclose() during normal operation either,
> unless the relation is dropped or something like that. And see how
> XLogReadBuffer() also calls smgropen() with no corresponding smgrclose().

Ok, that's quite the fair argument although I'd say normal backends
won't open many relations in comparison to the startup process when
doing replication. But that certainly isn't sufficient argument for
changing logic in the back branches or even HEAD.

> >I think intentionally not using the owner mechanism also is dangerous -
> >what if we have longer lived fake relcache entries and we've just
> >processed sinval messages? Then we'll have a ->rd_smgr pointing into
> >nowhere.

> Hmm, the startup process doesn't participate in sinval messaging at all,
> does it?

Well, not sinval but inval, in hot standby via commit messages.

What about just unowning the smgr entry with

if (rel->rd_smgr != NULL)
    smgrsetowner(NULL, rel->rd_smgr);

when closing the fake relcache entry?

That shouldn't require any further changes than changing the comment in
smgrsetowner() that this isn't something expected to frequently happen.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?

2013-11-04 Thread Heikki Linnakangas

On 04.11.2013 11:35, Andres Freund wrote:
> On 2013-11-04 09:38:27 +0200, Heikki Linnakangas wrote:
>> Secondly, it will fail if you create
>> two fake relcache entries for the same relfilenode. Freeing the first will
>> close the smgr entry, and freeing the second will try to close the same smgr
>> entry again. We never do that in the current code, but it seems like a
>> possible source of bugs in the future.
>
> I think if we go there we'll need refcounted FakeRelcacheEntry's
> anyway. Otherwise we'll end up with a relation smgropen()ed multiple
> times in the same backend and such which doesn't seem like a promising
> course to me since the smgr itself isn't refcounted.
>
>> How about the attached instead?
>>
>> diff --git a/src/backend/access/transam/xlogutils.c
>> b/src/backend/access/transam/xlogutils.c
>> index 5429d5e..f732e71 100644
>> --- a/src/backend/access/transam/xlogutils.c
>> +++ b/src/backend/access/transam/xlogutils.c
>> @@ -422,7 +422,11 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
>> 	rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
>> 	rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
>>
>> -	rel->rd_smgr = NULL;
>> +	/*
>> +	 * Open the underlying storage, but don't set rd_owner. We want the
>> +	 * SmgrRelation to persist after the fake relcache entry is freed.
>> +	 */
>> +	rel->rd_smgr = smgropen(rnode, InvalidBackendId);
>>
>> 	return rel;
>>  }
>
> I don't really like that - that will mean we'll leak open smgr handles
> for every relation touched via smgr during recovery since they will
> never be closed unless the relation is dropped or such. And in some
> databases there can be huge amounts of relations.
> Since recovery is long lived that doesn't seem like a good idea, besides
> the memory usage it will also bloat smgr's internal hash which actually
> seems just as likely to hurt performance.

That's the way SmgrRelations are supposed to be used. Opened on first
use, and kept open forever. We never smgrclose() during normal operation
either, unless the relation is dropped or something like that. And see
how XLogReadBuffer() also calls smgropen() with no corresponding
smgrclose().

> I think intentionally not using the owner mechanism also is dangerous -
> what if we have longer lived fake relcache entries and we've just
> processed sinval messages? Then we'll have a ->rd_smgr pointing into
> nowhere.

Hmm, the startup process doesn't participate in sinval messaging at all,
does it?


- Heikki




Re: [HACKERS] [PATCH] pg_receivexlog: fixed to work with logical segno > 0

2013-11-04 Thread Mika Eloranta
On Nov 4, 2013, at 11:06, Heikki Linnakangas wrote:
> On 01.11.2013 11:42, Mika Eloranta wrote:
>> pg_receivexlog calculated the xlog segment number incorrectly
>> when started after the previous instance was interrupted.
>> 
>> Resuming streaming only worked when the physical wal segment
>> counter was zero, i.e. for the first 256 segments or so.
> 
> Oops. Fixed, thanks for the report!
> 
> It's a bit scary that this bug went unnoticed for this long; it was 
> introduced quite early in the 9.3 development cycle. Seems that I did all the 
> testing of streaming timeline changes with pg_receivexlog later in 9.3 cycle 
> with segment numbers < 256, and no-one else has done long-running tests with 
> pg_receivexlog either.

Thanks for the fix, Heikki!

It sounds like PostgreSQL 9.3.x and/or pg_receivexlog is not yet used in 
a lot of places. Otherwise this probably would have been found earlier.

Affected versions:

$ git tag --contains dfda6eba
REL9_3_0
REL9_3_1
REL9_3_BETA1
REL9_3_BETA2
REL9_3_RC1

What makes this a really sneaky and severe problem is the way it stays dormant 
for a period of time after a fresh db init or pg_upgrade. Here's how I bumped 
into it:

1. Old postgresql 9.2 db running, pg_receivexlog streaming extra backups to a 
remote box.
2. pg_upgrade to 9.3.1.
3. pg_receivexlog from the upgraded DB still works ok and handles restarts 
fine, because the xlog indexes were reset back to zero at pg_upgrade.
4. xlog history eventually grows over 256 * 16MB.
5. pg_receivexlog gets interrupted for whatever reason (gets stopped, killed, 
crashes, host is restarted).
6. A new pg_receivexlog instance fails to resume streaming and there is no easy 
workaround that would maintain an uninterrupted, gapless xlog history.

Initially, before I had analysed the problem any further, I had to stash the 
xlogs, restart pg_receivexlog and after that trigger new pg_basebackups.

Regardless of this bug, I find that pg_receivexlog (and pg_basebackup) are 
excellent tools and people should use them more!

PS. something like "pg_receivexlog --start-pos=2D/1500" might be nice for 
overriding the streaming start position.

-- 
Mika Eloranta
Ohmu Ltd.  http://www.ohmu.fi/



Re: [HACKERS] Something fishy happening on frogmouth

2013-11-04 Thread Andres Freund
On 2013-11-04 13:13:27 +0200, Heikki Linnakangas wrote:
> On 04.11.2013 11:55, Andres Freund wrote:
> >On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
> >>Postmaster creates the POSIX shared memory object at startup, by calling
> >>shm_open(), and immediately calls shm_unlink on it. That way, once all the
> >>processes have exited, the object will be removed automatically. Child
> >>processes inherit the file descriptor at fork(), and don't need to call
> >>shm_open, just mmap().
> >
> >Uh. Won't that completely and utterly remove the point of dsm which is
> >that you can create segments *after* startup? We surely don't want to
> >start overallocating enough shmem so we don't ever dynamically need to
> >allocate segments.

> You don't need to allocate the shared memory beforehand, only create the
> file descriptor. Note that the size of the segment is specified in the
> shm_open() call, but the mmap() that comes later.
>
> If we need a large amount of small segments, so that it's not feasible to
> shm_open() them all at postmaster startup, you could shm_open() just one
> segment, and carve out smaller segments from it by mmap()ing at different
> offsets.

That quickly will result in fragmentation which we don't have the tools
to handle.

> >Also, I don't think it's portable across platforms to access segments
> >that already have been unlinked.
> 
> See
> http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html:
> "If one or more references to the shared memory object exist when the object
> is unlinked, the name shall be removed before shm_unlink() returns, but the
> removal of the memory object contents shall be postponed until all open and
> map references to the shared memory object have been removed."

We also support sysv shmem and have the same cleanup problem there.

> That doesn't explicitly say that a new shm_open() on the file descriptor
> must still work after shm_unlink, but I would be surprised if there is a
> platform where it doesn't.

Probably true.

> >I think this is looking for a solution without an actually relevant
> >problem disregarding the actual problem space.

To make that clearer: I think the discussions about making it impossible
to leak segments after rm -rf are the irrelevant problem.

> I agree. What are these dynamic shared memory segments supposed to be used
> for?

Parallel sort and stuff like that.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Something fishy happening on frogmouth

2013-11-04 Thread Heikki Linnakangas

On 04.11.2013 11:55, Andres Freund wrote:
> On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
>> Hmm, here's another idea:
>>
>> Postmaster creates the POSIX shared memory object at startup, by calling
>> shm_open(), and immediately calls shm_unlink on it. That way, once all the
>> processes have exited, the object will be removed automatically. Child
>> processes inherit the file descriptor at fork(), and don't need to call
>> shm_open, just mmap().
>
> Uh. Won't that completely and utterly remove the point of dsm which is
> that you can create segments *after* startup? We surely don't want to
> start overallocating enough shmem so we don't ever dynamically need to
> allocate segments.

You don't need to allocate the shared memory beforehand, only create the
file descriptor. Note that the size of the segment is specified in the
shm_open() call, but the mmap() that comes later.

If we need a large amount of small segments, so that it's not feasible
to shm_open() them all at postmaster startup, you could shm_open() just
one segment, and carve out smaller segments from it by mmap()ing at
different offsets.

> Also, I don't think it's portable across platforms to access segments
> that already have been unlinked.

See
http://pubs.opengroup.org/onlinepubs/009695299/functions/shm_unlink.html:
"If one or more references to the shared memory object exist when the
object is unlinked, the name shall be removed before shm_unlink()
returns, but the removal of the memory object contents shall be
postponed until all open and map references to the shared memory object
have been removed."

That doesn't explicitly say that a new shm_open() on the file descriptor
must still work after shm_unlink, but I would be surprised if there is a
platform where it doesn't.

> I think this is looking for a solution without an actually relevant
> problem disregarding the actual problem space.

I agree. What are these dynamic shared memory segments supposed to be
used for?


- Heikki




Re: [HACKERS] Removal of archive in wal_level

2013-11-04 Thread Michael Paquier
> Please find attached a patch doing what is written in the $subject.
With the documentation updated, this is even better...
Regards,
--
Michael
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index ccb76d8..0f20253 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -582,7 +582,7 @@ tar -cf backup.tar /usr/local/pgsql/data
 

 To enable WAL archiving, set the 
-configuration parameter to archive (or hot_standby),
+configuration parameter to hot_standby,
  to on,
 and specify the shell command to use in the  configuration parameter.  In practice
@@ -1246,7 +1246,7 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
   If more flexibility in copying the backup files is needed, a lower
   level process can be used for standalone hot backups as well.
   To prepare for low level standalone hot backups, set wal_level to
-  archive (or hot_standby), archive_mode to
+  hot_standby, archive_mode to
   on, and set up an archive_command that performs
   archiving only when a switch file exists.  For example:
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 77a9303..2c6a022 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1606,10 +1606,9 @@ include 'filename'
 wal_level determines how much information is written
 to the WAL. The default value is minimal, which writes
 only the information needed to recover from a crash or immediate
-shutdown. archive adds logging required for WAL archiving,
-and hot_standby further adds information required to run
-read-only queries on a standby server.
-This parameter can only be set at server start.
+shutdown. hot_standby adds logging required for WAL
+archiving and information required to run read-only queries on a
+standby server. This parameter can only be set at server start.


 In minimal level, WAL-logging of some bulk
@@ -1625,22 +1624,17 @@ include 'filename'
 
 But minimal WAL does not contain
 enough information to reconstruct the data from a base backup and the
-WAL logs, so either archive or hot_standby
-level must be used to enable
+WAL logs hot_standby level must be used to enable
 WAL archiving () and streaming
 replication.


-In hot_standby level, the same information is logged as
-with archive, plus information needed to reconstruct
-the status of running transactions from the WAL. To enable read-only
-queries on a standby server, wal_level must be set to
-hot_standby on the primary, and
- must be enabled in the standby. It is
-thought that there is
-little measurable difference in performance between using
-hot_standby and archive levels, so feedback
-is welcome if any production impacts are noticeable.
+In hot_standby level, necessary information is logged
+to reconstruct the status of running transactions from the WAL and
+to enable read-only queries on a standby server. Note that
+wal_level must be set to hot_standby on
+the primary, and  must be enabled
+in the standby.

   
  
@@ -2198,8 +2192,8 @@ include 'filename'
 of connections, so the parameter cannot be set higher than
 .  This parameter can only
 be set at server start. wal_level must be set
-to archive or hot_standby to allow
-connections from standby servers.
+to hot_standby to allow connections from standby
+servers.


   
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index f407753..af4 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -630,7 +630,7 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
 	 * WAL record that will allow us to conflict with queries
 	 * running on standby.
 	 */
-	if (XLogStandbyInfoActive())
+	if (XLogIsNeeded())
 	{
 		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 073190f..81f7c75 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -826,7 +826,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * take care to issue the record for last actual block and not for the
 	 * last block that was scanned. Ignore empty indexes.
 	 */
-	if (XLogStandbyInfoActive() &&
+	if (XLogIsNeeded() &&
 		num_pages > 1 && vstate.lastBlockVacuumed < (num_pages - 1))
 	{
 		Buffer		buf;
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 1b36f9a..4735cfb 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/back

Re: [HACKERS] [BUGS] BUG #8573: int4range memory consumption

2013-11-04 Thread Andres Freund
On 2013-11-02 15:29:36 -0400, Tom Lane wrote:
> Attached is a proposed patch for this.  It fixes most of the functions
> in printtup.c to use a per-row memory context.  (I did not bother to
> fix debugtup(), which is used only in standalone mode.  If you're doing
> queries large enough for mem leaks to be problematic in standalone mode,
> you're nuts.) 

FWIW, by that definition I have been nuts several times in the past -
without much choice since I was recovering data from a corrupted cluster
and the database couldn't be started normally. This now has gotten worse
(even in the backbranches) since debugtup won't clean up detoasted data
anymore.
But I guess the solution for that is to use COPY in those situations
which shouldn't have that problem and should work. Also, easier to parse ;)

> My original thought had been to get rid of all pfree's of output function
> results, so as to make the world safe for returning constant strings when
> such functions find it appropriate.  However, I ran into a showstopper
> problem: SPI_getvalue(), which is little more than a wrapper around the
> appropriate type output function, is documented as returning a pfree'able
> string.  Some though not all of the callers in core and contrib take the
> hint and pfree the result, and I'm sure there are more in third-party
> extensions.  So if we want to allow output functions to return
> non-palloc'd strings, it seems we have to either change SPI_getvalue()'s
> API contract or insert a pstrdup() call in it.  Neither of these is
> attractive, mainly because the vast majority of output function results
> would still be palloc'd even if we made aggressive use of the option not
> to.  That means that if we did the former, light testing wouldn't
> necessarily show a problem if someone forgot to remove a pfree() after a
> SPI_getvalue() call, and it also means that if we did the latter, the
> pstrdup() would usually be a waste of cycles and space.

I guess one option for the future is to to pstrdup() in SPI_getvalue()
and adding a new function that tells whether the result can be freed or
not. Then callers that care can move over. Although I'd guess that many
of those caring will already use SPI_getbinval() manually.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Something fishy happening on frogmouth

2013-11-04 Thread Andres Freund
On 2013-11-04 10:27:47 +0200, Heikki Linnakangas wrote:
> On 01.11.2013 18:22, Noah Misch wrote:
> >On Fri, Nov 01, 2013 at 12:27:31AM -0400, Robert Haas wrote:
> >>On Thu, Oct 31, 2013 at 7:48 PM, Heikki Linnakangas 
> >> wrote:
> >>>On 31.10.2013 16:43, Robert Haas wrote:
> There should be no cases where the main shared memory
> segment gets cleaned up and the dynamic shared memory segments do not.
> >>>
> >>>1. initdb -D data1
> >>>2. initdb -D data2
> >>>3. postgres -D data1
> >>>4. killall -9 postgres
> >>>5. postgres -D data2
> >>>
> >>>The system V shmem segment orphaned at step 4 will be cleaned up at step 5.
> >>>The DSM segment will not.
> >
> >Note that dynamic_shared_memory_type='mmap' will give the desired behavior.

Well, with the significant price of causing file-io.

> Hmm, here's another idea:
> 
> Postmaster creates the POSIX shared memory object at startup, by calling
> shm_open(), and immediately calls shm_unlink on it. That way, once all the
> processes have exited, the object will be removed automatically. Child
> processes inherit the file descriptor at fork(), and don't need to call
> shm_open, just mmap().

Uh. Won't that completely and utterly remove the point of dsm which is
that you can create segments *after* startup? We surely don't want to
start overallocating enough shmem so we don't ever dynamically need to
allocate segments.
Also, I don't think it's portable across platforms to access segments
that already have been unlinked.

I think this is looking for a solution to a problem that isn't actually
relevant, while disregarding the actual problem space.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] buffile.c resource owner breakage on segment extension

2013-11-04 Thread Andres Freund
On 2013-11-01 15:28:54 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > While not particularly nice, given the API, it seems best for buffile.c
> > to remember the resource owner used for the original segment and
> > temporarily set that during the extension.
> 
> Hm, yeah, that seems right.  It's just like repalloc keeping the memory
> chunk in its original context.  The comments here are a bit inadequate...

Thanks for committing, and sorry for the need to freshen up the
comments. I don't think I had ever opened buffile.c before and thus
wasn't sure whether there was a better fix, so I didn't want to spend too
much time on it before somebody agreed it was the right fix.

Also, I was in the middle of recovering some data from a corrupted
database, and this bug was getting in my way ;)

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] missing locking in at least INSERT INTO view WITH CHECK

2013-11-04 Thread Andres Freund
On 2013-11-02 17:05:24 -0700, Kevin Grittner wrote:
> Andres Freund  wrote:
> 
> > the matview patch (0002)
> 
> This is definitely needed as a bug fix.  Will adjust comments and
> commit, back-patched to 9.3.

Thanks.

> > Also attached is 0004 which just adds a heap_lock() around a
> > newly created temporary table in the matview code which shouldn't
> > be required for correctness but gives warm and fuzzy feelings as
> > well as less debugging noise.
> 
> Will think about this.  I agree it is probably worth doing
> something to reduce the noise when looking for cases that actually
> matter.

It's pretty much free, so I don't think there really is any reason to
deviate from other parts of the code. Note how e.g. copy_heap_data(),
DefineRelation() and ATRewriteTable() all lock the new relation, even if
it has just been created and is (and in the latter two cases will always
be) invisible.

> > Wouldn't it be a good idea to tack such WARNINGs (in a proper and
> > clean form) to index_open (checking the underlying relation is
> > locked), relation_open(..., NoLock) (checking the relation has
> > previously been locked) and maybe RelationIdGetRelation() when
> > cassert is enabled? ISTM we frequently had bugs around this.
> 
> It would be nice to have such omissions pointed out during early
> testing.

Will try to come up with something.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] missing RelationCloseSmgr in FreeFakeRelcacheEntry?

2013-11-04 Thread Andres Freund
On 2013-11-04 09:38:27 +0200, Heikki Linnakangas wrote:
> On 29.10.2013 03:16, Andres Freund wrote:
> >Hi,
> >
> >I've started a valgrind run earlier when trying to run the regression
> >tests with valgrind --error-exitcode=122 (to cause the regression tests
> >to fail visibly) but it crashed frequently...
> >
> >One of them was:
> >...
> >Which seems to indicate a dangling ->rd_smgr pointer, confusing the heck
> >out of me because I couldn't see how that'd happen.
> >
> >Looking a bit closer it seems to me that the fake relcache
> >infrastructure seems to neglect the chance that something used the fake
> >entry to read something which will have done a RelationOpenSmgr(). Which
> >in turn will have set rel->rd_smgr->smgr_owner to rel.
> >When we then pfree() the fake relation in FreeFakeRelcacheEntry we have
> >a dangling pointer.
> 
> Yeah, that's clearly a bug.
> 
> >diff --git a/src/backend/access/transam/xlogutils.c 
> >b/src/backend/access/transam/xlogutils.c
> >index 5429d5e..ee7c892 100644
> >--- a/src/backend/access/transam/xlogutils.c
> >+++ b/src/backend/access/transam/xlogutils.c
> >@@ -433,6 +433,7 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
> >  void
> >  FreeFakeRelcacheEntry(Relation fakerel)
> >  {
> >+   RelationCloseSmgr(fakerel);
> > pfree(fakerel);
> >  }
> 
> Hmm, I don't think that's a very good solution. Firstly, it will force the
> underlying files to be closed, hurting performance. Fake relcache entries
> are used in heapam when clearing visibility map bits, which might happen
> frequently enough for that to matter.

Well, it will only close them if they actually were smgropen()ed, which
will not always be the case, although it probably is in the visibility
map case.

Feels like premature optimization to me.

> Secondly, it will fail if you create
> two fake relcache entries for the same relfilenode. Freeing the first will
> close the smgr entry, and freeing the second will try to close the same smgr
> entry again. We never do that in the current code, but it seems like a
> possible source of bugs in the future.

I think if we go there we'll need refcounted FakeRelcacheEntry's
anyway. Otherwise we'll end up with a relation smgropen()ed multiple
times in the same backend and such which doesn't seem like a promising
course to me since the smgr itself isn't refcounted.

> How about the attached instead?

> diff --git a/src/backend/access/transam/xlogutils.c 
> b/src/backend/access/transam/xlogutils.c
> index 5429d5e..f732e71 100644
> --- a/src/backend/access/transam/xlogutils.c
> +++ b/src/backend/access/transam/xlogutils.c
> @@ -422,7 +422,11 @@ CreateFakeRelcacheEntry(RelFileNode rnode)
>   rel->rd_lockInfo.lockRelId.dbId = rnode.dbNode;
>   rel->rd_lockInfo.lockRelId.relId = rnode.relNode;
>  
> - rel->rd_smgr = NULL;
> + /*
> +  * Open the underlying storage, but don't set rd_owner. We want the
> +  * SmgrRelation to persist after the fake relcache entry is freed.
> +  */
> + rel->rd_smgr = smgropen(rnode, InvalidBackendId);
>  
>   return rel;
>  }

I don't really like that - it means we'll leak open smgr handles for
every relation touched via smgr during recovery, since they will never
be closed unless the relation is dropped or the like. And some databases
contain huge numbers of relations.
Since recovery is long-lived, that doesn't seem like a good idea;
besides the memory usage, it will also bloat smgr's internal hash table,
which seems just as likely to hurt performance.

I think intentionally not using the owner mechanism is also dangerous -
what if we have longer-lived fake relcache entries and we've just
processed sinval messages? Then we'll have a ->rd_smgr pointing into
nowhere.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] pg_receivexlog: fixed to work with logical segno > 0

2013-11-04 Thread Heikki Linnakangas

On 01.11.2013 11:42, Mika Eloranta wrote:

pg_receivexlog calculated the xlog segment number incorrectly
when started after the previous instance was interrupted.

Resuming streaming only worked when the physical wal segment
counter was zero, i.e. for the first 256 segments or so.


Oops. Fixed, thanks for the report!

It's a bit scary that this bug went unnoticed for this long; it was 
introduced quite early in the 9.3 development cycle. It seems that I did 
all my testing of streaming timeline changes with pg_receivexlog later 
in the 9.3 cycle with segment numbers < 256, and no-one else has done 
long-running tests with pg_receivexlog either.


- Heikki




[HACKERS] What stopped SECURITY BARRIER views from being auto-updatable?

2013-11-04 Thread Craig Ringer
The current code just reads:

/*
 * For now, we also don't support security-barrier views, because of the
 * difficulty of keeping upper-level qual expressions away from
 * lower-level data.  This might get relaxed in future.
 */
if (RelationIsSecurityView(view))
    return gettext_noop("Security-barrier views are not automatically updatable.");


which doesn't tell me tons. Is it the same problem RLS faces where you
can't just append the view predicate to the update predicate because you
may leak information if the update predicate gets evaluated before the
view predicate?

RLS attempts to solve that by wrapping the range table entry for the
relation in a subquery over the original table qualified with the rls
predicate, forcing its evaluation before the outer predicate of the
user-supplied query. That approach seems to be unpopular, but how much
of that is objection to the implementation (doing the rewriting in the
planner itself), and how much is objection to the principle?

If updatable views supported security barriers it'd really help reduce
the amount of complex planner messing-around in the RLS patch, so this
is something I'm quite interested in tackling if anyone has a few
pointers on where to start.

It wouldn't be a total solution to RLS, as we'd still have the issue of
the rewriter not getting re-run on plan invalidation and re-planning,
which causes problems if the RLS clause was omitted the first time and
shouldn't be this time, or vice versa. We'd need to be able to store the
pre-rewrite parse tree and re-run the rewriter for plans where one or
more relations were flagged as having RLS.

From what Kohei KaiGai was saying, it sounds like we'd also need to
redefine how RLS and inheritance interact, so that a parent's RLS
predicate applies to all its children; one of the reasons he had to do
it in the optimizer/planner was the difficulty of handling totally
independent RLS expressions for child and parent tables at the rewriter
level.

Might this be a more palatable way forward than the fiddling with Vars
and doing query rewriting inside the optimizer that the current RLS
patch does?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Something fishy happening on frogmouth

2013-11-04 Thread Heikki Linnakangas

On 01.11.2013 18:22, Noah Misch wrote:

On Fri, Nov 01, 2013 at 12:27:31AM -0400, Robert Haas wrote:

On Thu, Oct 31, 2013 at 7:48 PM, Heikki Linnakangas  
wrote:

On 31.10.2013 16:43, Robert Haas wrote:

There should be no cases where the main shared memory
segment gets cleaned up and the dynamic shared memory segments do not.


1. initdb -D data1
2. initdb -D data2
3. postgres -D data1
4. killall -9 postgres
5. postgres -D data2

The system V shmem segment orphaned at step 4 will be cleaned up at step 5.
The DSM segment will not.


Note that dynamic_shared_memory_type='mmap' will give the desired behavior.


Hmm, here's another idea:

Postmaster creates the POSIX shared memory object at startup, by calling 
shm_open(), and immediately calls shm_unlink on it. That way, once all 
the processes have exited, the object will be removed automatically. 
Child processes inherit the file descriptor at fork(), and don't need to 
call shm_open, just mmap().


I'm not sure how dynamic these segments need to be, but if 1-2 such file 
descriptors is not enough, you could mmap() different offsets from the 
same shmem object for different segments.


- Heikki




Re: [HACKERS] [v9.4] row level security

2013-11-04 Thread Craig Ringer
On 09/04/2013 11:22 PM, Tom Lane wrote:
> AFAICT, to deal with update/delete the RLS patch needs to constrain order
> of qual application without the crutch of having a separate level of
> subquery; and it's that behavior that I have zero confidence in, either
> as to whether it works as submitted or as to our odds of not breaking it
> in the future.

Wouldn't CASE do that job, albeit in a somewhat ugly manner, and also
protect against malicious RLS functions?

For any subclause in a WHERE clause "(a) OR (b)" you can instead write a
short-circuit OR version as:

  CASE WHEN (a) THEN true ELSE (b) END

no?

So, given a row security condition like "(rowowner = current_user AND
rls_malicious_leak())" on table "test", this:

SELECT * FROM test WHERE client_leak_fn();

could be rewritten by row security as:

SELECT *
FROM test
WHERE

CASE WHEN
  CASE WHEN is_superuser() THEN true
  ELSE (rowowner = current_user AND rls_malicious_leak())
  END
THEN
  client_leak_fn()
END;


in other words: if the user is the superuser, don't evaluate the RLS
predicate, just assume they have the right to see the row. Otherwise
evaluate the RLS predicate to determine whether they can see the row. In
either case, if they have the right to see the row, run the original
WHERE clause.

My main concern is that it'd be relying on a guarantee that CASE is
always completely ordered, and that might not be ideal. It's also
hideously ugly, and future optimiser smarts could create unexpected issues.

Unless you propose the creation of a new short-circuit, left-to-right,
order-guaranteed OR operator, and think the planner/optimizer should be
taught special-case rules to prevent it from doing pull-up / push-down
or pre-evaluating subclauses for it, I suspect the notion of using
security barrier views or subqueries is still going to be the sanest way
to do it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

