Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-17 Thread Samrat Revagade
On Sun, Jun 16, 2013 at 11:08 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 16 June 2013 17:25, Samrat Revagade revagade.sam...@gmail.com wrote:
 
 
  On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs si...@2ndquadrant.com
 wrote:
 
 
 
  So I strongly object to calling this patch anything to do with
  failback safe. You simply don't have enough data to make such a bold
  claim. (Which is why we call it synchronous replication and not zero
  data loss, for example).
 
  But that's not the whole story. I can see some utility in a patch that
  makes all WAL transfer synchronous, rather than just commits. Some
  name like synchronous_transfer might be appropriate. e.g.
  synchronous_transfer = all | commit (default).
 
 
  I agree with you that, these days, the need for a fresh backup in crash
  recovery seems to be a major problem.
  We might need to change the name of the patch if there are other problems
  with crash recovery too.

 (Sorry don't understand)

  Sorry for the confusion. I will change the name of the patch.


  The idea of another slew of parameters that are very similar to
  synchronous replication but yet somehow different seems weird. I can't
  see a reason why we'd want a second lot of parameters. Why not just
  use the existing ones for sync rep? (I'm surprised the Parameter
  Police haven't visited you in the night...) Sure, we might want to
  expand the design for how we specify multi-node sync rep, but that is
  a different patch.
 
 
  A different set of parameters is needed to differentiate between the
  fail-safe standby and the synchronous standby; the fail-safe standby and
  the standby in synchronous replication can be two different servers.

 Why would they be different? What possible reason would you have for
 that config? There is no *need* for those parameters, the proposal
 could work perfectly well without them.

 Let's make this patch fulfill the stated objectives, not add in
 optional extras, especially ones that don't appear well thought
 through. If you wish to enhance the design for the specification of
 multi-node sync rep, make that a separate patch, later.

 I agree with you. I will remove the extra parameters in the next version of
the patch if they are not required.

-- 
Regards,

Samrat Revagade


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-17 Thread KONDO Mitsumasa

Thank you for the comments, and for reviewing my patch!

(2013/06/16 23:27), Heikki Linnakangas wrote:

On 10.06.2013 13:51, KONDO Mitsumasa wrote:

I create patch which is improvement of checkpoint IO scheduler for
stable transaction responses.

* Problem with checkpoint IO scheduling under heavy transaction load
When the database is under a heavy transaction load, I think the PostgreSQL
checkpoint scheduler has two problems, at the start and at the end of a
checkpoint. One problem is heavy IO when the checkpoint starts writing in
each checkpoint cycle. It is caused by full-page writes, which produce a
burst of WAL IO for pages modified soon after the checkpoint starts writing.
As a result, the WAL-based checkpoint scheduler wrongly judges that the
checkpoint is behind schedule because of the full-page writes, even though
it is not actually late. This hurts transaction response times. I think the
WAL-based checkpoint scheduler does not behave properly at the start of a
checkpoint.


Yeah, the checkpoint scheduling logic doesn't take into account the heavy WAL
activity caused by full page images. That's an interesting phenomenon, but did
you actually see that causing a problem in your tests?  I couldn't tell from the
results you posted what the impact of that was. Could you repeat the tests
separately with the two separate patches you posted later in this thread?

OK, I will try to test with the two separate patches. The results I sent
previously indicate high WAL throughput (write_size_per_sec) and a high
transaction rate during checkpoints. Please see the following HTML files,
in which I have set anchor links and a 'checkpoint highlight switch' button.


* With my patched PG
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/patchedPG-report.html#transaction_statistics
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/patchedPG-report.html#wal_statistics

* Plain PG
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/plainPG-report.html#transaction_statistics
http://pgstatsinfo.projects.pgfoundry.org/dbt2_result/report/plainPG-report.html#wal_statistics

In the WAL statistics results, I think the high WAL throughput at checkpoint
start indicates that checkpoint IO does not disturb the IO of other executing
transactions.



Rationalizing a bit, I could even argue to myself that it's a *good* thing. At
the beginning of a checkpoint, the OS write cache should be relatively empty, as
the checkpointer hasn't done any writes yet. So it might make sense to write a
burst of pages at the beginning, to partially fill the write cache first, before
starting to throttle. But this is just handwaving - I have no idea what the
effect is in real life.
Yes, I think so. If we want to change IO throttling, we change OS parameters
such as '/proc/sys/vm/dirty_background_ratio' or '/proc/sys/vm/dirty_ratio'.
But these parameters affect every application on the system, so they are
difficult to change and are not intuitive to set. I also think database
tuning should be done through database parameters rather than OS parameters;
that makes server tuning clearer.



Another thought is that rather than trying to compensate for that effect in the
checkpoint scheduler, could we avoid the sudden rush of full-page images in the
first place? The current rule for when to write a full page image is
conservative: you don't actually need to write a full page image when you modify
a buffer that's sitting in the buffer cache, if that buffer hasn't been flushed
to disk by the checkpointer yet, because the checkpointer will write and fsync
it later. I'm not sure how much it would smoothen WAL write I/O, but it would
be interesting to try.
That would be the right method in an ideal implementation, but I have no idea 
how to implement it. It seems very difficult...
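
As a standalone illustration of the decision rule described above (all names
here are hypothetical stand-ins, not PostgreSQL symbols; this is a sketch of
Heikki's proposed refinement, not the actual patch):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t xlog_ptr_t;    /* stand-in for XLogRecPtr */

/* Current conservative rule: log a full page image for any buffer whose
 * last WAL touch is at or before the checkpoint redo pointer.  The proposed
 * refinement skips the image while the checkpointer has not yet flushed the
 * buffer this cycle, because the checkpointer will write and fsync it later. */
static bool
need_full_page_image(xlog_ptr_t page_lsn, xlog_ptr_t redo_ptr,
                     bool flushed_since_checkpoint_start)
{
    if (page_lsn > redo_ptr)
        return false;   /* a full image was already logged this cycle */
    return flushed_since_checkpoint_start;
}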




Second problem is fsync freeze problem in end of checkpoint.
Normally, checkpoint write is executed in background by OS's IO
scheduler. But when it does not correctly work, end of checkpoint
fsync was caused IO freeze and slower transactions. Unexpected slow
transaction will cause monitor error in HA-cluster and decrease
user-experience in application service. It is especially serious
problem in cloud and virtual server database system which does not
have IO performance. However we don't have solution in
postgresql.conf parameter very much. We prefer checkpoint time to
fast response transactions. In fact checkpoint time is short, and it
becomes little bit long that is not problem. You may think that
checkpoint_segments and checkpoint_timeout are set larger value,
however large checkpoint_segments affects file-cache which is not
read and is wasted, and large checkpoint_timeout was caused
long-time crash-recovery.


A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp.
He posted very promising performance numbers, but it was dropped because Tom
couldn't reproduce the numbers, and because sorting requires allocating a large
array, which has the risk of running out of memory, which would be bad when
you're trying to checkpoint.

Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Simon Riggs
On 17 June 2013 00:43, Kevin Grittner kgri...@ymail.com wrote:

 Especially when one is known to be better than the other already.

 What is the hypothetical technique you're arguing is inferior?  For
 my own part, I haven't gotten beyond the phase of knowing that to
 meet all requests for the feature, it would need to be available at
 about the same point that AFTER EACH STATEMENT triggers fire, but
 that it should not involve any user-written triggers.  Have you
 implemented something similar to what you think I might be
 considering?  Do you have benchmark results?  Can you share
 details?

Recording the changeset required by replication is known to be more
efficient using WAL based extraction than using triggers. WAL writes
are effectively free and using WAL concentrates the reads to avoid
random I/O in large databases. That would be the most suitable
approach for continuously updated matviews, or for frequent updates.

Extraction using multiple snapshots is also possible, using a
technique similar to the CONCURRENTLY mechanism. That would require
re-scanning the whole table which might be overkill depending upon the
number of changes. That would work for reasonably infrequent updates.

 Given that we also want to do concurrent CLUSTER and ALTER TABLE
 ... SET TABLESPACE using changeset extraction, I think it's time
 that discussion happened on hackers.

 No objections to that here; but please don't hijack this thread for
 that discussion.

There are multiple features all requiring efficient change set
extraction. It seems extremely relevant to begin discussing what that
mechanism might be in each case, so we don't develop 2 or even 3
different ones while everybody ignores each other. As you said, we
should be helping each other and working together, and I agree.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-17 Thread Pavan Deolasee
On Mon, Jun 17, 2013 at 2:18 AM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote:

  A long time ago, Itagaki wrote a patch to sort the checkpoint writes:
 www.postgresql.org/message-id/flat/20070614153758.6a62.itagaki.takah...@oss.ntt.co.jp
 .
  He posted very promising performance numbers, but it was dropped because
 Tom
  couldn't reproduce the numbers, and because sorting requires allocating a
  large array, which has the risk of running out of memory, which would be
 bad
  when you're trying to checkpoint.

 Hm. We could allocate the array early on since the number of buffers
 doesn't change. Sure that would be pessimistic, but that seems fine.

 Alternatively I can very well imagine that it would still be beneficial
 to sort the dirty buffers in shared buffers. I.e. scan till we found 50k
 dirty pages, sort them and only then write them out.


Without knowing that Itagaki had done something similar in the past, a couple
of months back I tried exactly the same thing, i.e. sort the shared buffers
in chunks and then write them out at once. But I did not get any
significant performance gain except when shared buffers were 3/4th (or
some such fraction) or more of the available RAM. I will see if I can pull
out the patch and the numbers. But if memory serves well, I concluded that
the kernel is already utilising its buffer cache to achieve the same thing
and it does not help beyond a point.
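
For illustration, a minimal standalone C sketch of the chunked sort-then-write
idea being discussed (simplified structs and a caller-supplied write callback,
not PostgreSQL's buffer manager; per Andres's suggestion, the array would be
allocated once up front so no allocation happens during the checkpoint):

#include <stdint.h>
#include <stdlib.h>

typedef struct DirtyPage
{
    uint32_t rel_id;            /* stand-in for a full buffer tag */
    uint32_t block_no;
} DirtyPage;

/* Order pages by relation, then by block number, so writes hit each
 * file sequentially. */
static int
dirty_page_cmp(const void *a, const void *b)
{
    const DirtyPage *pa = a;
    const DirtyPage *pb = b;

    if (pa->rel_id != pb->rel_id)
        return pa->rel_id < pb->rel_id ? -1 : 1;
    if (pa->block_no != pb->block_no)
        return pa->block_no < pb->block_no ? -1 : 1;
    return 0;
}

/* Sort one chunk of collected dirty pages and write them in file order. */
static void
write_chunk(DirtyPage *pages, size_t n, void (*write_page)(const DirtyPage *))
{
    qsort(pages, n, sizeof(DirtyPage), dirty_page_cmp);
    for (size_t i = 0; i < n; i++)
        write_page(&pages[i]);
}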

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

2013-06-17 Thread KONDO Mitsumasa
(2013/06/17 5:48), Andres Freund wrote:
On 2013-06-16 17:27:56 +0300, Heikki Linnakangas wrote:

 If we don't mind scanning the buffer cache several times, we don't
 necessarily even need to sort the writes for that. Just scan the buffer
 cache for all buffers belonging to relation A, then fsync it. Then scan the
 buffer cache again, for all buffers belonging to relation B, then fsync
 that, and so forth.

 That would end up with quite a lot of scans on reasonably sized
 machines. Not to speak of those that have a million+ relations. That
 doesn't seem to be a good idea for bigger shared_buffers. C.f. the stuff
 we did for 9.3 to make it cheaper to drop a bunch of relations at once
 by only scanning shared_buffers once.
As I wrote in my reply to Heikki, I think an exact buffer sort, which is 
expensive, is unnecessary. To solve this problem, we only need a sort 
accurate enough to be optimized by the OS IO scheduler. We normally have 
two optimizing IO scheduler layers: the OS layer and the RAID controller 
layer. I think performance will improve if the sort is just accurate enough 
for those layers to optimize. The computational cost required is a single 
sequential scan of the buffer descriptors to sort the buffers roughly. I will 
try to study this implementation, too.
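
As a rough illustration of that "one scan, approximate sort" idea (all names
hypothetical, a sketch rather than an implementation): bucket each dirty page
by the high bits of its block number in a single pass, then write bucket by
bucket, leaving the fine-grained ordering to the OS and RAID-controller
elevators:

#include <stdint.h>

#define NBUCKETS     256
#define MAXPERBUCKET 4096

/* One bucket per range of block numbers; roughly 4 MB total, so it would
 * be allocated once, not per checkpoint.  A real implementation would have
 * to write or spill pages when a bucket fills rather than drop them. */
typedef struct RoughBuckets
{
    uint32_t blocks[NBUCKETS][MAXPERBUCKET];
    uint32_t nused[NBUCKETS];
} RoughBuckets;

/* Single pass over the buffer descriptors: drop each dirty block into the
 * bucket chosen by the high bits of its block number.  O(1) per page. */
static void
rough_add(RoughBuckets *rb, uint32_t block_no)
{
    unsigned b = (block_no >> 16) & (NBUCKETS - 1);

    if (rb->nused[b] < MAXPERBUCKET)
        rb->blocks[b][rb->nused[b]++] = block_no;
}

/* Write bucket by bucket: blocks come out approximately ordered, and the
 * lower IO scheduler layers finish the sorting within each bucket. */
static void
rough_write_all(const RoughBuckets *rb, void (*write_block)(uint32_t))
{
    for (unsigned b = 0; b < NBUCKETS; b++)
        for (uint32_t i = 0; i < rb->nused[b]; i++)
            write_block(rb->blocks[b][i]);
}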


Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Simon Riggs
On 17 June 2013 02:05, Josh Berkus j...@agliodbs.com wrote:

 I agree that the FSM behaviour shouldn't be linked to index existence.
 IMHO that should be a separate table parameter, WITH (fsm_mode = append)
 Index only scans would also benefit from that.

 -1 ... I cannot believe that such a parameter would ever get turned on
 in production by anyone.  If your table has a significant update rate,
 the resulting table bloat would make such behavior completely
 infeasible.  If you have few enough updates to make such a behavior
 practical, then you can live with the expensive index updates instead.

 I'm also thinking that if a table is really append-only, then there are
 never any middle-of-the-table pages in the FSM, no?

 Where this falls down is the table which is
 mostly-append-but-occasionally-needs-an-update-or-delete.  I think the
 answer there is to look for a way to make updating the index block range
 faster, not ways to modify how we append to the heap.  If we told users
 "tables with Minmax indexes will be very expensive to update" then I
 think they'd live with it; dropping and re-adding an index to enable fast
 updates is something which is already familiar.

This feature is using a similar technique to enhance SeqScans as we
already use on VACUUM. We don't really care about whether we have 100%
scan avoidance because we were likely to be using a WHERE clause that
doesn't give perfect constraint elimination anyway. So on a large
table we don't care about the fact that we still have to scan 1-5% of
the table - we are still 20 times faster than a full seqscan.

So there isn't a fall down thing here. We expect the recently
loaded/updated data to be scanned and that's OK.

Having the minmax index updated greedily is just adding extra work for
fast diminishing returns. We can always add that later if really
needed, but I doubt it will be needed - in just the same way as mat
views aren't greedily updated.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] SLRU

2013-06-17 Thread Soroosh Sardari
Hey

I was reading the multi-transaction log manager, multixact.c,
and I didn't get what SLRU does.

I want to understand the goal of this module and why we use it.
I'm kind of a newbie, so be patient with me ;)

Regards
Soroosh


Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-17 Thread Pavan Deolasee
On Sun, Jun 16, 2013 at 5:10 PM, Simon Riggs si...@2ndquadrant.com wrote:



 My perspective is that if the master crashed, assuming that you know
 everything about that and suddenly jumping back on seems like a recipe
 for disaster. Attempting that is currently blocked by the technical
 obstacles you've identified, but that doesn't mean they are the only
 ones - we don't yet understand what all the problems lurking might be.
 Personally, I won't be following you onto that minefield anytime soon.


Would it be fair to say that a user will be willing to trust her crashed
master in all scenarios where she would have done so in a single-instance
setup? IOW, without the replication setup, AFAIU users have traditionally
trusted the WAL recovery to recover from failed instances. This would
include some common failures such as power outages and hardware failures,
but may not include others such as on-disk corruption.


 So I strongly object to calling this patch anything to do with
 failback safe. You simply don't have enough data to make such a bold
 claim. (Which is why we call it synchronous replication and not zero
 data loss, for example).


I agree. We should probably find a better name for this. Any suggestions?


 But that's not the whole story. I can see some utility in a patch that
 makes all WAL transfer synchronous, rather than just commits. Some
 name like synchronous_transfer might be appropriate. e.g.
 synchronous_transfer = all | commit (default).


It's an interesting idea, but I think there is some difference here. For
example, the proposed feature allows a backend to wait at points other than
commit. Since commits are more foreground in nature, and this feature
does not require us to wait during common foreground activities, we want a
configuration where the master can wait for synchronous transfers at points
other than commits. Maybe we can solve that by adding more granular control
to the said parameter?


 The idea of another slew of parameters that are very similar to
 synchronous replication but yet somehow different seems weird. I can't
 see a reason why we'd want a second lot of parameters. Why not just
 use the existing ones for sync rep? (I'm surprised the Parameter
 Police haven't visited you in the night...) Sure, we might want to
 expand the design for how we specify multi-node sync rep, but that is
 a different patch.


How would we then distinguish between a synchronous standby and the new kind
of standby? I am told one of the most popular setups for DR is to have one
local sync standby and one async standby (possibly cascaded from the local
sync one). Since this new feature is more useful for DR, because taking a
fresh backup over a slower link is even more challenging, IMHO we should
support such setups.



 I'm worried to see that adding this feature and yet turning it off
 causes a measurable drop in performance. I don't think we want that
 at all. That clearly needs more work and thought.


I agree. We need to repeat those tests. I don't trust that merely adding the
feature, turned off, causes a 1-2% drop. In one of the tests, turning the
feature on shows a better number than having it turned off. That's
clearly noise, or it needs a concrete argument to be convincing.


 I also think your performance results are somewhat bogus. Fast
 transaction workloads were already mostly commit waits -


But not in the case of an async standby, right?


 measurements
 of what happens to large loads, index builds etc would likely reveal
 something quite different.


I agree. I also feel we need tests where FlushBuffer gets called more
often by normal backends, to see how much the added wait in that code path
hurts performance. Another important thing to test would be to see
how it works over slower/high-latency links.


 I'm tempted by the thought that we should put the WaitForLSN inside
 XLogFlush, rather than scatter additional calls everywhere and then
 have us inevitably miss one.


That indeed seems cleaner.
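
A minimal sketch of that suggestion, with hypothetical names
(XLogFlush_sketch, flush_wal_locally, synchronous_transfer_all), just to show
the single choke point it would create:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t xlog_ptr_t;                    /* stand-in for XLogRecPtr */

extern void flush_wal_locally(xlog_ptr_t upto); /* the existing local flush */
extern void WaitForLSN(xlog_ptr_t upto);        /* wait for standby confirm */
extern bool synchronous_transfer_all;           /* hypothetical GUC */

void
XLogFlush_sketch(xlog_ptr_t record)
{
    flush_wal_locally(record);

    /* One choke point: every code path that flushes WAL also waits for the
     * standby, instead of scattering WaitForLSN calls and risking a miss. */
    if (synchronous_transfer_all)
        WaitForLSN(record);
}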

Thanks,
Pavan


Re: [HACKERS] SLRU

2013-06-17 Thread Pavan Deolasee
On Mon, Jun 17, 2013 at 1:22 PM, Soroosh Sardari
soroosh.sard...@gmail.com wrote:

 Hey

 I was reading the multi-transaction log manager, multixact.c,
 and I didn't get what SLRU does.

 I want to understand the goal of this module and why we use it.
 I'm kind of a newbie, so be patient with me ;)


Did you look at src/backend/access/transam/slru.c? The first paragraph in
that file is quite explanatory:

* We use a simple least-recently-used scheme to manage a pool of page
 * buffers.  Under ordinary circumstances we expect that write
 * traffic will occur mostly to the latest page (and to the just-prior
 * page, soon after a page transition).  Read traffic will probably touch
 * a larger span of pages, but in any case a fairly small number of page
 * buffers should be sufficient.  So, we just search the buffers using plain
 * linear search; there's no need for a hashtable or anything fancy.
 * The management algorithm is straight LRU except that we will never swap
 * out the latest page (since we know it's going to be hit again
eventually).
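
As a toy illustration of the scheme that comment describes (a small pool of
page buffers, plain linear search, straight LRU eviction), here is a
self-contained C sketch; it deliberately omits slru.c details such as never
evicting the latest page, and all names are illustrative:

#include <stdint.h>
#include <string.h>

#define POOL_SIZE  8
#define PAGE_BYTES 8192

typedef struct SlruPool
{
    int64_t  page_no[POOL_SIZE];    /* which page each slot holds; -1 = empty */
    uint64_t last_used[POOL_SIZE];  /* recency stamp for LRU eviction */
    uint64_t clock;
    char     data[POOL_SIZE][PAGE_BYTES];
} SlruPool;

static void
slru_init(SlruPool *p)
{
    memset(p, 0, sizeof(*p));
    for (int i = 0; i < POOL_SIZE; i++)
        p->page_no[i] = -1;
}

/* Return the slot holding page_no, reading the page in (after evicting the
 * least recently used slot) on a miss.  Plain linear search: the pool is
 * small, so nothing fancier is needed. */
static int
slru_lookup(SlruPool *p, int64_t page_no,
            void (*read_page)(int64_t, char *),
            void (*write_page)(int64_t, const char *))
{
    int victim = 0;

    for (int i = 0; i < POOL_SIZE; i++)
    {
        if (p->page_no[i] == page_no)
        {
            p->last_used[i] = ++p->clock;   /* hit: just bump recency */
            return i;
        }
        if (p->last_used[i] < p->last_used[victim])
            victim = i;                     /* remember the LRU slot */
    }

    if (p->page_no[victim] >= 0)            /* miss: write back, then replace */
        write_page(p->page_no[victim], p->data[victim]);
    read_page(page_no, p->data[victim]);
    p->page_no[victim] = page_no;
    p->last_used[victim] = ++p->clock;
    return victim;
}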

Thanks,
Pavan

-- 
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee


Re: [HACKERS] Hard limit on WAL space used (because PANIC sucks)

2013-06-17 Thread Dimitri Fontaine
Peter Eisentraut pete...@gmx.net writes:
 I suspect that there are actually only about 5 or 6 common ways to do
 archiving (say, local, NFS, scp, rsync, S3, ...).  There's no reason why
 we can't fully specify and/or script what to do in each of these cases.

And provide either fully reliable contrib scripts or internal archive
commands ready to use for those common cases. I can't think of other
common use cases, by the way.

+1

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib

2013-06-17 Thread Dimitri Fontaine
Hi,

Peter Eisentraut pete...@gmx.net writes:
 2. confuse users
 3. produce broken external extension modules that take contrib as an example

I agree that having both cases (sections) in the Makefile is a bad idea.
Still, why should we keep the in-tree build instructions?

Would it be possible instead to instruct PGXN to work with a non-installed
server source tree? And how much do we really need that?

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] Add more regression tests for CREATE OPERATOR

2013-06-17 Thread Szymon Guz
On 23 May 2013 00:34, Robins Tharakan thara...@gmail.com wrote:

 Hi,

 Please find attached a patch to take code-coverage of CREATE OPERATOR
 (src/backend/commands/operatorcmds.c) from 56% to 91%.

 Any and all feedback is welcome.
 --
 Robins Tharakan




Hi,
there is one commented-out test. I think it should either be run or deleted;
there is no use for commented-out SQL code that is never run.

What do you think?

regards,
Szymon


Re: [HACKERS] MD5 aggregate

2013-06-17 Thread Dean Rasheed
On 15 June 2013 10:22, Dean Rasheed dean.a.rash...@gmail.com wrote:
 There seem to be 2 separate directions that this could go, which
 really meet different requirements:

 1). Produce an unordered sum for SQL to compare 2 tables regardless of
 the order in which they are scanned. A possible approach to this might
 be something like an aggregate

 md5_total(text/bytea) returns text

 that returns the sum of the md5 values of each input value, treating
 each md5 value as an unsigned 128-bit integer, and then producing the
 hexadecimal representation of the final sum. This should out-perform a
 solution based on numeric addition, and in typical cases, the result
 wouldn't be much longer than a regular md5 sum, and so would be easy
 to eyeball for differences.


I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
  FROM (SELECT * FROM foo ORDER BY id) foo;

 50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

 02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row order, making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
   sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

  md5_agg   |  md5_total
-------------+-------------
 8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().
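
For illustration, here is a minimal C sketch of the order-independent summing
step described above: each 16-byte MD5 digest is treated as a big-endian
unsigned integer and added into a slightly wider running total, so carries
never overflow. The helper names and the 20-byte total width are assumptions
for illustration (the 40-hex-digit example output above happens to be 20
bytes), not the patch's actual code:

#include <stddef.h>
#include <stdint.h>

#define SUM_BYTES 20            /* 128-bit digests plus headroom for carries */

/* total and digest are big-endian byte arrays; add digest into total,
 * treating both as unsigned integers. */
static void
md5_total_add(uint8_t total[SUM_BYTES], const uint8_t digest[16])
{
    unsigned carry = 0;

    for (int i = 0; i < SUM_BYTES; i++)
    {
        unsigned d = (i < 16) ? digest[15 - i] : 0;  /* low-order bytes first */
        unsigned s = total[SUM_BYTES - 1 - i] + d + carry;

        total[SUM_BYTES - 1 - i] = (uint8_t) (s & 0xFF);
        carry = s >> 8;
    }
}

/* Bytes pushed through the core MD5 block function for an input of length
 * len, using the same rounding approximation as the SQL query above. */
static size_t
md5_padded_len(size_t len)
{
    return ((len + 63) / 64) * 64;
}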

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Regards,
Dean


[Attachment: md5_agg_v2.patch]


[Attachment: md5-100m-row-test.sql]



Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Heikki Linnakangas

On 14.06.2013 19:05, Kevin Grittner wrote:

Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY for
9.4 CF1.  The goal of this patch is to allow a refresh without
interfering with concurrent reads, using transactional semantics.

It is my hope to get this committed during this CF to allow me to
focus on incremental maintenance for the rest of the release cycle.


I must say this seems a bit pointless on its own. But if it's a stepping 
stone to incremental maintenance, I have no objections.



I didn't need to touch very much outside of matview-specific files
for this.  My biggest concern is that I needed two small functions
which did *exactly* what some static functions in ri_triggers.c
were doing and couldn't see where the best place to share them from
was.  For the moment I just duplicated them, but my hope would be
that they could be put in a suitable location and called from both
places, rather than duplicating the 30-some lines of code.  The
function signatures are:

void quoteOneName(char *buffer, const char *name)
void quoteRelationName(char *buffer, Relation rel)


I'd just use quote_identifier and quote_qualified_identifier instead.

I didn't understand this error message:

+   if (!foundUniqueIndex)
+       ereport(ERROR,
+               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("concurrent refresh requires a unique index on just columns for all rows of the materialized view")));

What does that mean?

- Heikki




[HACKERS] Review: Display number of changed rows since last analyze

2013-06-17 Thread Albe Laurenz
This is a review of the patch in 5192d7d2.8020...@catalyst.net.nz

The patch applies cleanly (with the exception of catversion.h of course),
compiles without warnings and passes the regression tests.

It contains enough documentation, though I'd prefer
"Estimated number of rows modified since the table was last analyzed"
to
"Estimated number of row changes (inserts + updates + deletes) since the last
analyze"

The patch works as it should, and I think that this is a
useful addition.  It only exposes a value that is already
available internally, so there shouldn't be any penalties.

I think that the column name is ok as it is, even if it
is a bit long - I cannot come up with a more succinct
idea.  Perhaps n_changed_since_analyze could be shortened
to n_mod_since_analyze, but that's not much of an improvement.

This is a very simple change, and I'll mark this patch ready for committer.

Yours,
Laurenz Albe



Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-17 Thread Simon Riggs
On 17 June 2013 09:03, Pavan Deolasee pavan.deola...@gmail.com wrote:

 I agree. We should probably find a better name for this. Any suggestions ?

err, I already made one...

 But that's not the whole story. I can see some utility in a patch that
 makes all WAL transfer synchronous, rather than just commits. Some
 name like synchronous_transfer might be appropriate. e.g.
 synchronous_transfer = all | commit (default).

 Since commits are more foreground in nature, and this feature
 does not require us to wait during common foreground activities, we want a
 configuration where the master can wait for synchronous transfers at points
 other than commits. Maybe we can solve that by adding more granular control
 to the said parameter?


 The idea of another slew of parameters that are very similar to
 synchronous replication but yet somehow different seems weird. I can't
 see a reason why we'd want a second lot of parameters. Why not just
 use the existing ones for sync rep? (I'm surprised the Parameter
 Police haven't visited you in the night...) Sure, we might want to
 expand the design for how we specify multi-node sync rep, but that is
 a different patch.


 How would we then distinguish between a synchronous standby and the new
 kind of standby?

That's not the point. The point is "Why would we have a new kind of
standby?" and therefore "Why do we need new parameters?"

 I am told, one of the very popular setups for DR is to have one
 local sync standby and one async (may be cascaded by the local sync). Since
 this new feature is more useful for DR because taking a fresh backup on a
 slower link is even more challenging, IMHO we should support such setups.

...which still doesn't make sense to me. Let's look at that in detail.

Take 3 servers, A, B, C, with A and B linked by sync rep, and C
being a safety standby at a distance.

Either A or B is master, except in disaster. So if A is master, then B
would be the failover target. If A fails, then you want to failover to
B. Once B is the target, you want to failback to A as the master. C
needs to follow the new master, whichever it is.

Suppose you set up sync rep between A and B, and this new mode between A
and C. When B becomes the master, you want to failback from B to A, but
you can't, because the new mode applied between A and C only, so you
have to failback from C to A. So having the new mode not match with
sync rep means you are forcing people to failback using the slow link
in the common case.

You might observe that having the two modes match causes problems if A
and B fail, so you are forced to go to C as master and then eventually
failback to A or B across a slow link. That case is less common and
could be solved by extending sync transfer to more/multi nodes.

It definitely doesn't make sense to have sync rep on anything other
than a subset of sync transfer. So while it may be sensible in the
future to make sync transfer a superset of sync rep nodes, it makes
sense to make them the same config for now.

Phew

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] MD5 aggregate

2013-06-17 Thread Marko Kreen
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
 On 15 June 2013 10:22, Dean Rasheed dean.a.rash...@gmail.com wrote:
  There seem to be 2 separate directions that this could go, which
  really meet different requirements:
 
  1). Produce an unordered sum for SQL to compare 2 tables regardless of
  the order in which they are scanned. A possible approach to this might
  be something like an aggregate
 
  md5_total(text/bytea) returns text
 
  that returns the sum of the md5 values of each input value, treating
  each md5 value as an unsigned 128-bit integer, and then producing the
  hexadecimal representation of the final sum. This should out-perform a
  solution based on numeric addition, and in typical cases, the result
  wouldn't be much longer than a regular md5 sum, and so would be easy
  to eyeball for differences.
 
 
 I've been playing around with the idea of an aggregate that computes
 the sum of the md5 hashes of each of its inputs, which I've called
 md5_total() for now, although I'm not particularly wedded to that
 name. Comparing it with md5_agg() on a 100M row table (see attached
 test script) produces interesting results:
 
 SELECT md5_agg(foo.*::text)
   FROM (SELECT * FROM foo ORDER BY id) foo;
 
  50bc42127fb9b028c9708248f835ed8f
 
 Time: 92960.021 ms
 
 SELECT md5_total(foo.*::text) FROM foo;
 
  02faea7fafee4d253fc94cfae031afc43c03479c
 
 Time: 96190.343 ms
 
 Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
 result is longer) but it seems like it would be very useful for
 quickly comparing data in SQL, since its value is not dependent on the
 row-order making it easier to use and better performing if there is no
 usable index for ordering.
 
 Note, however, that if there is an index that can be used for
 ordering, the performance is not necessarily better than md5_agg(), as
 this example shows. There is a small additional overhead per row for
 initialising the MD5 sums, and adding the results to the total, but I
 think the biggest factor is that md5_total() is processing more data.
 The reason is that MD5 works on 64-byte blocks, so the total amount of
  data going through the core MD5 algorithm is each row's size
 rounded up to a multiple of 64. In this simple case it ends up
 processing around 1.5 times as much data:
 
 SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
 
   md5_agg   |  md5_total
  -------------+-------------
  8103815438 | 12799909248
 
 although of course that overhead won't be as large on wider tables,
 and even in this case the overall performance is still on a par with
 md5_agg().
 
 ISTM that both aggregates are potentially useful in different
 situations. I would probably typically use md5_total() because of its
 simplicity/order-independence and consistent performance, but
 md5_agg() might also be useful when comparing with external data.

Few notes:

- Index-scan over the whole table is *very* bad for larger tables (a few
  times bigger than available RAM).  If you want to promote such use, you
  should also warn against using it on a loaded server.

- It's pointless to worry about overflow on 128-bit ints.  Just let it
  happen.  Adding complexity for that does not bring any advantage.

- Using some faster 128-bit hash may be useful - check out CityHash
  or SpookyHash.  You can get C implementations from pghashlib.

-- 
marko





Re: [HACKERS] Add regression tests for SET xxx

2013-06-17 Thread Szymon Guz
On 26 May 2013 19:56, Robins Tharakan thara...@gmail.com wrote:

 Hi,

 Please find attached a patch to take code-coverage of SET (SESSION / SEED
 / TRANSACTION / DATESTYLE / TIME ZONE) (src/backend/commands/variable.c)
 from 65% to 82%.

 Any and all feedback is welcome.
 --
 Robins Tharakan




Hi,
the patch applies cleanly on current trunk; however, there are failing
tests (diff attached).

regards
Szymon


[Attachment: regression.diffs]



Re: [HACKERS] MVCC catalog access

2013-06-17 Thread Andres Freund
On 2013-06-03 14:57:12 -0400, Robert Haas wrote:
 On Thu, May 30, 2013 at 1:39 AM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  +1.
 
 Here's a more serious patch for MVCC catalog access.  This one
 involves more data copying than the last one, I think, because the
 previous version did not register the snapshots it took, which I think
 is not safe.  So this needs to be re-tested for performance, which I
 have so far made no attempt to do.

Ok, I am starting to take a more serious look.

Minor issues I noticed:
* index.c:index_constraint_create() - comments need to be updated
* index.c:IndexCheckExclusion() - why do we still use a SnapshotNow? I'd
  rather not use *Now if it isn't necessary.
* the *CONCURRENTLY infrastructure should be simplified once this has
  been applied, but I think it makes sense to keep that separate.
* index.c:reindex_index() - SnapshotNow comment should be updated

I still think that renaming SnapshotNow to something like
SnapshotPerTuple to force everyone to reevaluate their usage would be
good.

So, the biggest issue with the patch seems to be performance worries. I
tried to create a worst case scenario:
postgres (patched and HEAD) running with:
-c shared_buffers=4GB \
-c max_connections=2000 \
-c maintenance_work_mem=2GB \
-c checkpoint_segments=300 \
-c wal_buffers=64MB \
-c synchronous_commit=off \
-c autovacuum=off \
-p 5440

With one background pgbench running:
pgbench -p 5440 -h /tmp -f /tmp/readonly-busy.sql -c 1000 -j 10 -T 100 postgres
readonly-busy.sql:
BEGIN;
SELECT txid_current();
SELECT pg_sleep(0.0001);
COMMIT;

I measured the performance of one other pgbench:
pgbench -h /tmp -p 5440 postgres -T 10 -c 100 -j 100 -n -f /tmp/simplequery.sql 
-C
simplequery.sql:
SELECT * FROM af1, af2 WHERE af1.x = af2.x;
tables:
create table af1 (x) as select g from generate_series(1,4) g;
create table af2 (x) as select g from generate_series(4,7) g;

With that setup one can create quite a noticeable overhead for the mvcc
patch (best of 5):

master-optimize:
tps = 1261.629474 (including connections establishing)
tps = 15121.648834 (excluding connections establishing)

dev-optimize:
tps = 773.719637 (including connections establishing)
tps = 2804.239979 (excluding connections establishing)

Most of the time in both, patched and unpatched is by far spent in
GetSnapshotData. I think the reason this shows a far higher overhead
than what you previously measured is that a) in your test the other
backends were idle, in mine they actually modify PGXACT which causes
noticeable cacheline bouncing, and b) I have a higher number of connections
(max_connections)

A quick test shows that even with max_connection=600, 400 background,
and 100 foreground pgbenches there's noticeable overhead:
master-optimize:
tps = 2221.226711 (including connections establishing)
tps = 31203.259472 (excluding connections establishing)
dev-optimize:
tps = 1629.734352 (including connections establishing)
tps = 4754.449726 (excluding connections establishing)

Now I grant that's a somewhat harsh test for postgres, but I don't
think it's entirely unreasonable and the performance impact is quite
stark.

 It strikes me as rather unfortunate that the snapshot interface is
 designed in such a way as to require so much data copying.  It seems
 we always take a snapshot by copying from PGXACT/PGPROC into
 CurrentSnapshotData or SecondarySnapshotData, and then copying data a
 second time from there to someplace more permanent.  It would be nice
 to avoid that, at least in common cases.

Sounds doable. But let's do one thing at a time ;). That copy wasn't
visible in the rather extreme workload from above btw...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:

 There are multiple features all requiring efficient change set
 extraction. It seems extremely relevant to begin discussing what
 that mechanism might be in each case

Changeset extraction has nothing to do with this patch, and cannot
possibly be useful for it.  Please keep discussion which is
completely unrelated to this patch off this thread.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Simon Riggs
On 17 June 2013 12:13, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 14.06.2013 19:05, Kevin Grittner wrote:

 Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY for
 9.4 CF1.  The goal of this patch is to allow a refresh without
 interfering with concurrent reads, using transactional semantics.

 It is my hope to get this committed during this CF to allow me to
 focus on incremental maintenance for the rest of the release cycle.


 I must say this seems a bit pointless on its own. But if it's a stepping
 stone to incremental maintenance, I have no objections.

There are generally 4 kinds of mat view

1. Transactionally updated
2. Incremental update, eventually consistent
3. Incremental update, regular refresh
4. Full refresh

At the moment we only have type 4 and it holds a full lock while it
runs. We definitely need a CONCURRENTLY option and this is it.

Implementing the other types won't invalidate what we currently have,
so this makes sense to me.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Michael Paquier
On Mon, Jun 17, 2013 at 5:23 AM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-06-17 04:20:03 +0900, Fujii Masao wrote:
  On Thu, Jun 6, 2013 at 1:29 PM, Michael Paquier
  michael.paqu...@gmail.com wrote:
   Hi all,
  
   Please find attached the latest versions of REINDEX CONCURRENTLY for
 the 1st
   commit fest of 9.4:
   - 20130606_1_remove_reltoastidxid_v9.patch, removing reltoastidxid, to
 allow
   a toast relation to have multiple indexes running in parallel (extra
 indexes
   could be created by a REINDEX CONCURRENTLY processed)
   - 20130606_2_reindex_concurrently_v26.patch, correcting some comments
 and
   fixed a lock in index_concurrent_create on an index relation not
 released at
   the end of a transaction
 
  Could you let me know how this patch has something to do with MVCC
 catalog
  access patch? Should we wait for MVCC catalog access patch to be
 committed
  before starting to review this patch?

 I wondered the same. The MVCC catalog patch, if applied, would make it
 possible to make the actual relfilenode swap concurrently instead of
 requiring to take access exclusive locks which obviously is way nicer. On
 the other hand, that function is only a really small part of this patch,
 so it seems quite possible to make another pass at it before relying on
 mvcc catalog scans.

As mentioned by Andres, the only thing that the MVCC catalog patch can
improve here is the index swap phase (index_concurrent_swap in index.c),
where the relfilenodes of the old and new indexes are exchanged. Currently
an AccessExclusiveLock is taken on the 2 relations being swapped; with MVCC
catalog access, I think we could lower that to a ShareUpdateExclusiveLock.

Also, with the MVCC catalog patch in, we could add some isolation tests for
REINDEX CONCURRENTLY (there were some tests in one of the previous versions),
which is currently not possible due to the exclusive lock taken at the swap
phase.

Btw, those are minor things in the patch, so I think it would be better not
to wait for the MVCC catalog patch. Even if you think it would be better to
wait for it, you could begin with the 1st patch, allowing a toast relation
to have multiple indexes (removal of reltoastidxid), which does not depend
on it at all.

Thanks,
-- 
Michael


Re: [HACKERS] Department of Redundancy Department: makeNode(FuncCall) division

2013-06-17 Thread Stephen Frost
* David Fetter (da...@fetter.org) wrote:
 On Mon, Feb 11, 2013 at 10:48:38AM -0800, David Fetter wrote:
  On Sun, Feb 10, 2013 at 10:09:19AM -0500, Tom Lane wrote:
   David Fetter da...@fetter.org writes:
Per suggestions and lots of help from Andrew Gierth, please find
attached a patch to clean up the call sites for FuncCall nodes, which
I'd like to expand centrally rather than in each of the 37 (or 38, but
I only redid 37) places where it's called.  The remaining one is in
src/backend/nodes/copyfuncs.c, which has to be modified for any
changes in that struct anyhow.
   
   TBH, I don't think this is an improvement.
   
   The problem with adding a new field to any struct is that you have to
   run around and examine (and, usually, modify) every place that
   manufactures that type of struct.  With a makeFuncCall defined like
   this, you'd still have to do that; it would just become a lot easier
   to forget to do so.

I don't really see how finding all callers of makeFuncCall is
particularly harder than finding the callers of makeNode(Func).  If
there were cases where we still wanted to use makeNode(Func), perhaps
that would be annoying since you'd have to look for both- but, iiuc,
this patch changes all of the callers to use makeFuncCall and it seems
reasonable for all callers to do so in the future as well.

It looks to me like the advantage of this patch is exactly that you
*don't* have to run around and modify things which are completely
unrelated to the new feature being added (eg: FILTER).  Add the new
field, set up the default/no-op case in makeFuncCall() and then only
change those places that actually need to worry about your new field.

   If the subroutine were defined like most other makeXXX subroutines,
   ie you have to supply *all* the fields, that argument would go away,
   but the notational advantage is then dubious.

Having to supply all the fields certainly wouldn't make things any
better.  Providing the base set of fields which are required to be set
for any FuncCall node does make sense though, which is what the patch
does.  The rest of the fields are all special cases for which a default
value works perfectly fine (when the field isn't involved in the
specific case being handled).
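
To make that concrete, here is a simplified sketch of such a constructor
(stand-in types, and calloc instead of the real makeNode/palloc machinery);
new fields default to zero, so unrelated call sites stay untouched:

#include <stdbool.h>
#include <stdlib.h>

typedef struct List List;       /* opaque stand-ins for the real parse types */
typedef struct WindowDef WindowDef;

typedef struct FuncCall
{
    List      *funcname;
    List      *args;
    List      *agg_order;
    bool       agg_star;
    bool       agg_distinct;
    bool       func_variadic;
    WindowDef *over;
    int        location;
} FuncCall;

/* Centralized constructor: set the fields every call site supplies and
 * zero everything else, so a newly added field only needs attention at
 * the sites that actually use it. */
static FuncCall *
makeFuncCall(List *name, List *args, int location)
{
    FuncCall *n = calloc(1, sizeof(FuncCall));

    n->funcname = name;
    n->args = args;
    n->location = location;
    return n;
}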

   The bigger-picture point is that you're proposing to make the coding
   conventions for building FuncCalls different from what they are for
   any other grammar node.  I don't think that's a great idea; it will
   mostly foster confusion.
  
  The major difference between FuncCalls and others is that while most
  raw-parsetree nodes are constructed only in their own syntax
  productions, FuncCall is constructed in many places unrelated to
  actual function call syntax.

Yeah, FuncCall's are already rather special and they're built all over
the place.

That's my 2c on it anyhow.  I don't see it as some kind of major
milestone, but it looks like an improvement to me and likely to make things
a bit easier on patch authors and reviewers, who otherwise have to ponder
a bunch of repeated 'x->q = false;' statements in areas which are
completely unrelated to the new feature itself.

Thanks,

Stephen




Re: [HACKERS] Add regression tests for DISCARD

2013-06-17 Thread Marko Kreen
On Mon, May 13, 2013 at 2:58 AM, Robins Tharakan thara...@gmail.com wrote:
 Please find attached a patch that adds basic regression tests for DISCARD
 command.

 Any and all feedback is obviously welcome.

Perhaps existing tests in guc.sql should be merged into it?

-- 
marko




[HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist

2013-06-17 Thread Josh Kupershmidt
On Fri, Mar 8, 2013 at 11:58 AM, Pavel Stehule pavel.steh...@gmail.com wrote:

 I'll see - please, stay tuned to 9.4 first commitfest

Hi Pavel,
Just a reminder, I didn't see this patch in the current commitfest. I
would be happy to spend some more time reviewing if you wish to pursue
the patch.

Josh




Re: [HACKERS] GIN improvements part 1: additional information

2013-06-17 Thread Alexander Korotkov
On Fri, Jun 14, 2013 at 12:09 AM, Alexander Korotkov
aekorot...@gmail.com wrote:

 Revised version of patch for additional information storage in GIN is
 attached. Changes are mostly bug fixes.


New version of patch is attached with some more refactoring and bug fixes.

--
With best regards,
Alexander Korotkov.


[Attachment: ginaddinfo.5.patch.gz]



Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Simon Riggs
On 17 June 2013 13:15, Kevin Grittner kgri...@ymail.com wrote:
 Simon Riggs si...@2ndquadrant.com wrote:

 There are multiple features all requiring efficient change set
 extraction. It seems extremely relevant to begin discussing what
 that mechanism might be in each case

 Changeset extraction has nothing to do with this patch, and cannot
 possibly be useful for it.  Please keep discussion which is
 completely unrelated to this patch off this thread.

Kevin,

You mentioned incremental maintenance in your original post and I
have been discussing it. Had you not mentioned it, I doubt I would
have thought of it.

But since you did mention it, and it's clearly an important issue, it
seems relevant to have discussed it here and now.

I'm happy to wait for you to start the thread discussing it directly though.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] GIN improvements part2: fast scan

2013-06-17 Thread Alexander Korotkov
On Sat, Jun 15, 2013 at 2:55 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 attached patch implementing fast scan technique for GIN. This is second
 patch of GIN improvements, see the 1st one here:

 http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com
 This patch allows skipping parts of posting trees when scanning them is not
 necessary. In particular, it solves the frequent_term & rare_term problem of
 FTS.
 It introduces a new interface method, pre_consistent, which behaves like
 consistent, but:
 1) allows false positives on input (check[])
 2) is allowed to return false positives

 Some example: frequent_term & rare_term becomes pretty fast.

 create table test as (select to_tsvector('english', 'bbb') as v from
 generate_series(1,100));
 insert into test (select to_tsvector('english', 'ddd') from
 generate_series(1,10));
 create index test_idx on test using gin (v);

 postgres=# explain analyze select * from test where v @@
 to_tsquery('english', 'bbb & ddd');
   QUERY PLAN

 ---
  Bitmap Heap Scan on test  (cost=942.75..7280.63 rows=5000 width=17)
 (actual time=0.458..0.461 rows=10 loops=1)
    Recheck Cond: (v @@ '''bbb'' & ''ddd'''::tsquery)
    ->  Bitmap Index Scan on test_idx  (cost=0.00..941.50 rows=5000
 width=0) (actual time=0.449..0.449 rows=10 loops=1)
          Index Cond: (v @@ '''bbb'' & ''ddd'''::tsquery)
  Total runtime: 0.516 ms
 (5 rows)


Attached version of patch has some refactoring and bug fixes.

--
With best regards,
Alexander Korotkov.


[Attachment: gin_fast_scan.2.patch.gz]



Re: [HACKERS] GIN improvements part 3: ordering in index

2013-06-17 Thread Alexander Korotkov
On Sat, Jun 15, 2013 at 3:02 AM, Alexander Korotkov aekorot...@gmail.com wrote:

 attached patch implementing ordering inside GIN index. This is third patch
 of GIN improvements, see previous two:

 http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com

 http://www.postgresql.org/message-id/CAPpHfdvftaJq7www381naLw1=4u0h+qpxgwvnhceb9hmvyw...@mail.gmail.com

 This patch introduces new interface method of GIN which takes same
 arguments as consistent but returns float8.
 float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32
 nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool
 nullFlags[], Datum addInfo[], bool addInfoIsNull[])
 This patch implements the gingettuple method, which can return ordering data
 using the KNN infrastructure. It also introduces a >< operator for FTS which
 supports ordering in the GIN index. Some example:

 postgres=# explain analyze select * from dblp_titles2 where tsvector @@
 to_tsquery('english', 'statistics') order by tsvector ><
 to_tsquery('english', 'statistics') limit 10;
QUERY
 PLAN

 -
  Limit  (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120
 rows=10 loops=1)
 ->  Index Scan using dblp_titles2_idx on dblp_titles2
  (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115
 rows=10 loops=1)
  Index Cond: (tsvector @@ '''statist'''::tsquery)
  Order By: (tsvector >< '''statist'''::tsquery)
  Total runtime: 7.556 ms
 (5 rows)


Attached version of patch has some refactoring and bug fixes.

--
With best regards,
Alexander Korotkov.


[Attachment: gin_ordering.2.patch.gz]



Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Kevin Grittner
Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 14.06.2013 19:05, Kevin Grittner wrote:
 Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY
 for 9.4 CF1.  The goal of this patch is to allow a refresh
 without interfering with concurrent reads, using transactional
 semantics.

 It is my hope to get this committed during this CF to allow me
 to focus on incremental maintenance for the rest of the release
 cycle.

 I must say this seems a bit pointless on its own.

I completely disagree.  When I read what people were posting about
the materialized view creation that went into 9.3, there were many
comments by people that they can't use it until the materialized
views can be refreshed without blocking readers.  There is a clear
need for this.  It doesn't do much to advance incremental
maintenance, but it is a much smaller patch which will make
matviews usable by a lot of people who can't use the initial
feature set.

 I didn't understand this error message:

 + if (!foundUniqueIndex)
 +     ereport(ERROR,
 +             (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 +              errmsg("concurrent refresh requires a unique index on just columns for all rows of the materialized view")));

 What does that mean?

It means that the REFRESH MATERIALIZED VIEW CONCURRENTLY command
cannot be used on a materialized view unless it has at least one
UNIQUE index which is not partial (i.e., there is no WHERE clause)
and is not indexing on an expression (i.e., the index is entirely
on bare column names).  Set logic to do the diff is hard to get
right if the tables are not proper sets (i.e., they contain
duplicate rows).  I can see at least three ways it *could* be done,
but all of them are much more complex and significantly slower. 
With a UNIQUE index on some set of columns in all rows the correct
guarantees exist to use fast set logic.  It isn't that it's needed
for access; it is needed to provide a guarantee that there is no
row without NULLs that exactly duplicates another row.
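
For concreteness, a sketch (not the patch's actual internals) of the
kind of set logic involved, assuming a matview mv(id, val) with a
unique index on id, and its freshly computed contents in newdata(id, val):

SELECT n.id, n.val              -- rows to insert or update
  FROM newdata n
  LEFT JOIN mv m ON m.id = n.id
 WHERE m.id IS NULL OR m.val IS DISTINCT FROM n.val;

SELECT m.id                     -- rows to delete
  FROM mv m
  LEFT JOIN newdata n ON n.id = m.id
 WHERE n.id IS NULL;

Without the unique key, this kind of pairwise matching breaks down as
soon as the contents contain duplicate rows.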

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] GIN improvements part2: fast scan

2013-06-17 Thread Heikki Linnakangas

On 17.06.2013 15:55, Alexander Korotkov wrote:

On Sat, Jun 15, 2013 at 2:55 AM, Alexander Korotkov aekorot...@gmail.com wrote:


attached patch implementing fast scan technique for GIN. This is second
patch of GIN improvements, see the 1st one here:

http://www.postgresql.org/message-id/capphfduxv-il7aedwpw0w5fxrwgakfxijwm63_hzujacrxn...@mail.gmail.com
This patch allows skipping parts of posting trees when their scan is not
necessary. In particular, it solves the frequent_term & rare_term problem of
FTS.
It introduces a new interface method pre_consistent which behaves like
consistent, but:
1) allows false positives on input (check[])
2) is allowed to return false positives

 Some example: frequent_term & rare_term becomes pretty fast.

create table test as (select to_tsvector('english', 'bbb') as v from
generate_series(1,100));
insert into test (select to_tsvector('english', 'ddd') from
generate_series(1,10));
create index test_idx on test using gin (v);

postgres=# explain analyze select * from test where v @@
to_tsquery('english', 'bbb & ddd');
                                   QUERY PLAN
 ---------------------------------------------------------------------------
  Bitmap Heap Scan on test  (cost=942.75..7280.63 rows=5000 width=17)
(actual time=0.458..0.461 rows=10 loops=1)
    Recheck Cond: (v @@ '''bbb'' & ''ddd'''::tsquery)
    ->  Bitmap Index Scan on test_idx  (cost=0.00..941.50 rows=5000
 width=0) (actual time=0.449..0.449 rows=10 loops=1)
          Index Cond: (v @@ '''bbb'' & ''ddd'''::tsquery)
  Total runtime: 0.516 ms
(5 rows)



Attached version of patch has some refactoring and bug fixes.


Good timing, I just started looking at this.

I think you'll need to explain how this works. There are no docs, and 
almost no comments.


(and this shows how poorly I understand this, but) Why does this require 
the additional information patch? What extra information do you store 
on-disk, in the additional information?


The pre-consistent method is like the consistent method, but it allows 
false positives. I think that's because during the scan, before having 
scanned for all the keys, the gin AM doesn't yet know if the tuple 
contains all of the keys. So it passes the keys it doesn't yet know 
about as 'true' to pre-consistent. Could that be generalized, to pass a 
tri-state instead of a boolean for each key to the pre-consistent 
method? For each key, you would pass true, false, or don't know. I 
think you could then also speed up queries like !english & bbb.


- Heikki




Re: [HACKERS] Batch API for After Triggers

2013-06-17 Thread Simon Riggs
On 9 June 2013 12:58, Craig Ringer cr...@2ndquadrant.com wrote:
 On 06/09/2013 04:58 PM, Simon Riggs wrote:
 There are also difficulties in semantics, since when
 we have OLD and NEW at row level we know we are discussing the same
 row. With sets of OLD and NEW we'd need to be able to link the
 relations back together somehow, which couldn't be done by PK since
 that could change.

 We don't currently have OLD and NEW relations so we're free to define
 how this works pretty freely.

 Rather than having OLD and NEW as separate relations, we could just have
 one OLD_AND_NEW relation. In that relation we exploit Pg's composite
 types to nest the old and new tuples in a single outer change record.

 OLD_AND_NEW would look to PL/PgSQL as if it were:

 CREATE TEMPORARY TABLE OLD_AND_NEW (
 OLD tabletype NOT NULL,
 NEW tabletype NOT NULL
 );

 ...though presumably without the ability to create indexes on it and the
 other things you can do to a real temp table. Though I can see cases
 where that'd be awfully handy too.

 For DELETE and INSERT we'd either provide different relations named OLD
 and NEW respectively, or we'd use OLD_AND_NEW with one field or the
 other blank. I'm not sure which would be best.

 Alternately, we could break the usual rules for relations and define OLD
 and NEW as ordered, so lock-step iteration would always return matching
 pairs of rows. That's useless in SQL since there's no way to achieve
 lock-step iteration, but if we provide a
 for_each_changed_row('some_function'::regproc) that scans them in
 lock-step and invokes `some_function` for each one...? (I haven't yet
 done enough in the core to have any idea if this approach is completely
 and absurdly impossible, or just ugly. Figured I'd throw it out there
 anyway.)


I think the best way, if we did do this, would be to have a number of
different relations defined:

OLD
NEW
INSERTED
DELETED
all of which would be defined same as main table

and also one called
UPDATED
which would have two row vars called OLD and NEW
so you would access it like e.g. IF UPDATED.OLD.id = 7
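
For illustration only, a statement-level trigger body under that scheme
might look like this; the UPDATED relation and its paired OLD/NEW row
variables are purely hypothetical:

CREATE FUNCTION check_pk_stable() RETURNS trigger AS $$
BEGIN
    -- UPDATED with row variables OLD and NEW is the hypothetical part
    IF EXISTS (SELECT 1 FROM UPDATED u WHERE (u.OLD).id <> (u.NEW).id) THEN
        RAISE EXCEPTION 'primary key changed by update';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;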

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Peter Eisentraut
On 6/17/13 8:23 AM, Michael Paquier wrote:
 As mentioned by Andres, the only thing that the MVCC catalog patch can
 improve here is the index swap phase (index_concurrent_swap:index.c)
 where the relfilenodes of the old and new indexes are exchanged. Now an
 AccessExclusiveLock is taken on the 2 relations being swapped; we could
 downgrade that to ShareUpdateExclusiveLock with the MVCC catalog
 access, I think.

Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is
not really concurrent, at least not concurrent to the standard set by
CREATE and DROP INDEX CONCURRENTLY.





Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 Kevin Grittner kgri...@ymail.com wrote:

 Changeset extraction has nothing to do with this patch, and
 cannot possibly be useful for it.  Please keep discussion which
 is completely unrelated to this patch off this thread.

 You mentioned incremental maintenance in your original post and
 I have been discussing it. Had you not mentioned it, I doubt I
 would have thought of it.

 But since you did mention it, and it's clearly an important issue,
 it seems relevant to have discussed it here and now.

What I said was that I wanted to get this out of the way before I
started working on incremental maintenance.

 I'm happy to wait for you to start the thread discussing it
 directly though.

Cool.

-Kevin

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Andres Freund
On 2013-06-17 09:12:12 -0400, Peter Eisentraut wrote:
 On 6/17/13 8:23 AM, Michael Paquier wrote:
  As mentioned by Andres, the only thing that the MVCC catalog patch can
  improve here is the index swap phase (index_concurrent_swap:index.c)
  where the relfilenodes of the old and new indexes are exchanged. Now an
  AccessExclusiveLock is taken on the 2 relations being swapped; we could
  downgrade that to ShareUpdateExclusiveLock with the MVCC catalog
  access, I think.
 
 Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is
 not really concurrent, at least not concurrent to the standard set by
 CREATE and DROP INDEX CONCURRENTLY.

Well, it still does the main body of work in a concurrent fashion, so I
still don't see how that argument holds that much water. But anyway, the
argument was only whether we could continue reviewing before the mvcc
stuff goes in, not whether it can get committed before.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Nicolas Barbier
2013/6/17 Heikki Linnakangas hlinnakan...@vmware.com:

 +errmsg(concurrent refresh requires a
 unique index on just columns for all rows of the materialized view)));

Maybe my english is failing me here, but I don’t understand the “just” part.

Nicolas

--
A. Because it breaks the logical sequence of discussion.
Q. Why is top posting bad?




[HACKERS] PQConnectPoll, connect(2), EWOULDBLOCK and somaxconn

2013-06-17 Thread Andres Freund
Hi,

When postgres on linux receives connections at a high rate, client
connections sometimes error out with:
could not send data to server: Transport endpoint is not connected
could not send startup packet: Transport endpoint is not connected

To reproduce, start something like this on a server with sufficiently high
max_connections:
pgbench -h /tmp -p 5440 -T 10 -c 400 -j 400 -n -f /tmp/simplequery.sql

Now that's strange since that error should happen at connect(2) time,
not when sending the startup packet. Some investigation led me to
fe-connect.c's PQconnectPoll:

if (connect(conn->sock, addr_cur->ai_addr,
            addr_cur->ai_addrlen) < 0)
{
if (SOCK_ERRNO == EINPROGRESS ||
SOCK_ERRNO == EWOULDBLOCK ||
SOCK_ERRNO == EINTR ||
SOCK_ERRNO == 0)
{
/*
 * This is fine - we're in non-blocking mode, and
 * the connection is in progress.  Tell caller to
 * wait for write-ready on socket.
 */
                conn->status = CONNECTION_STARTED;
return PGRES_POLLING_WRITING;
}
/* otherwise, trouble */
}

So, we're accepting EWOULDBLOCK as a valid return value for
connect(2). Which it isn't. EAGAIN in contrast is on some BSDs and on
linux. Unfortunately POSIX allows those two to share the same value...

My manpage tells me:
EAGAIN No more free local ports or insufficient entries in the routing
       cache.  For AF_INET see the description of
       /proc/sys/net/ipv4/ip_local_port_range ip(7) for information on
       how to increase the number of local ports.

So, the problem is that we took a failed connection as having been
initially successful but in progress.

Not accepting EWOULDBLOCK in the above if() results in:
could not connect to server: Resource temporarily unavailable
  Is the server running locally and accepting
  connections on Unix domain socket /tmp/.s.PGSQL.5440?

which makes more sense.

Trivial patch attached.

Now, the question is why we cannot complete connections on unix sockets.
Some code reading of net/unix/af_unix.c:unix_stream_connect()
shows:
if (unix_recvq_full(other)) {
err = -EAGAIN;
if (!timeo)
goto out_unlock;
So, if we're in nonblocking mode - which we are - and the receive queue
is full we return EAGAIN. The receive queue for unix sockets is defined
as
static inline int unix_recvq_full(struct sock const *sk)
{
return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
}
Where sk_max_ack_backlog is whatever has been passed to the
listen(backlog) on the listening side.

Question: But postgres does listen(fd, MaxBackends * 2), how can that be
a problem?
Answer:
   If the backlog argument is greater than the value in
   /proc/sys/net/core/somaxconn, then it is silently truncated to that
   value; the default value in this file is 128.  In kernels before
   2.4.25, this limit was a hard coded value, SOMAXCONN, with the
   value 128.

Setting somaxconn to something higher indeed makes the problem go away.

I'd guess that pretty much the same holds true for tcp connections,
although I didn't verify that; it would explain some previous reports
on the lists.

TLDR: Increase /proc/sys/net/core/somaxconn

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From da5cfb7d237a4a07b146fb9d255f0de72207de10 Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Mon, 17 Jun 2013 16:00:58 +0200
Subject: [PATCH] libpq: Handle connect(2) returning EAGAIN/EWOULDBLOCK
 correctly

libpq used to accept EWOULDBLOCK - which is allowed to have the same value as
EAGAIN by posix - as a valid return code to connect(2) indicating that a
connection is in progress. While posix doesn't specify either as a valid return
code, BSD based systems and linux use it to indicate temporary resource
exhaustion.
Accepting either as an in-progress connection attempt leads to hard-to-diagnose
errors when sending the startup packet:
could not send data to server: Transport endpoint is not connected
could not send startup packet: Transport endpoint is not connected

Treating it as an error results in:
could not connect to server: Resource temporarily unavailable
  Is the server running locally and accepting
  connections on Unix domain socket ...?
which is more accurate.
---
 src/interfaces/libpq/fe-connect.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 0d729c8..c17c303 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -1780,7 +1780,6 @@ keep_going:		/* We will come back to here until there is
 				addr_cur->ai_addrlen) < 0)
 	{
 		if (SOCK_ERRNO == EINPROGRESS ||
-			SOCK_ERRNO == EWOULDBLOCK ||
 			SOCK_ERRNO == EINTR ||
 			SOCK_ERRNO == 0)
 		{
-- 

[HACKERS] matview incremental maintenance

2013-06-17 Thread Kevin Grittner
Since there seems to be interest in discussing incremental
maintenance of materialized views *now*, I'm starting this thread
to try to avoid polluting unrelated threads with the discussion.  I
don't intend to spend a lot of time on it until the CF in progress
completes, but at that point the work will start in earnest.  So
I'll say where I'm at, and welcome anyone who has time to spare
outside of the CF to comment or contribute ideas.

The paper at the core of the discussion can be found by searching
for "maintaining views incrementally gupta mumick subrahmanian" --
it's on both the ACM and CiteSeerX websites.  Of course, one
doesn't need to understand that paper to discuss techniques for
capturing the base deltas, but I'm hoping that's not what takes up
most of the discussion.  I expect the most important discussions to
be around how best to handle the count(t) (or count_t) column,
what form should be use for intermediate results, how to modify or
add execution nodes which know how to deal with the count, how to
generate set operations to use those nodes, and how to modify the
planner to choose the best plan for these operations.  Whether to
pull the deltas off the WAL stream or stuff them into a tuplestore
as they are written seems to me to be a relatively minor point.  If
properly abstracted, the performance and complexity of alternatives
can be compared.

The one thing that seems somewhat clear to me at the moment is that
the complex set algebra needed to use the counting algorithm for
incremental maintenance is not going to be something I want to
handle by dynamically building up execution nodes.  That way lies
madness. SPI or something very similar to it should be used,
probably with a layer or two above it to simplify working with the
algebra separately from diddling around with strings for the query
fragments.

At the developer meeting last month, we talked about the special
new count column for a bit, and everyone seemed to agree that
adding such an animal, and creating execution nodes which were
aware of it, would best be done on top of the patch Álvaro has been
working on to replace attnum with three columns: a logical ID
number for each column, the physical order of the attribute within
the tuple image, and the display order (for SELECT *, INSERT
without a column list, and similar cases).  We seemed to have
consensus that the count_t column would not display by default, but
could be explicitly called out by a query, similar to the current
handling of system columns.  Nobody wanted to have a negative
column number for the count or add it to the tuple header
structure.  Unfortunately I have heard from Álvaro that the patch
is not complete and is not on his list of things to work on in the
near future.

Long term, timings for incremental maintenance that people would
like to see (from most eager to least eager) are:

 - as part of completing each statement, so that the effect on the
matview is immediately visible to the transaction which modifies a
supporting table, and becomes visible at commit to other
transactions

 - at transaction commit time, so that other transactions see the
changes to the base tables and the referencing matviews at the same
point in time

 - from a FIFO queue which is processed by a background process
whenever data is present (possibly with pacing)

 - from a FIFO queue based on a schedule, so that matviews are
stable between applications and/or to avoid burdening the machine
during peak periods

 - incremental update, or even full refresh, on an attempt to query
a stale matview

 - explicit request to apply incremental updates or refresh

Incremental maintenance of a materialized view is a heuristic, to
refresh contents more quickly than might happen by re-running the
query which defines the matview.  There will always be cases where
the changes are so extensive that applying the delta will be slower
than a refresh.  At some point we should have a cost-based way to
recognize when we have crossed that threshold, and fall back to the
refresh technique.  That's not for this release, though.

In previous discussion there seemed to be a consensus that before
incremental maintenance for a materialized view could be turned on,
the matview would need to be populated and all referenced tables
would need to be flagged as generating delta information, through a
new ALTER TABLE option.
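
(Shape purely invented for illustration; no such option exists, and the
name below is made up:)

ALTER TABLE orders SET (generate_deltas = true);  -- hypothetical option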

While I have yet to look in detail at the mechanism for capturing
the initial delta on the base tables, the two fairly obvious
candidates are to stuff the before and after images into a
tuplestore or temp table as base table changes are written,
somewhere around the point that triggers would be fired, or to use
the WAL stream in some way.  The advantages of the former are that
it would be hard to find a lower overhead way to capture the data,
nor a more certain way to get exactly the right data.  The latter,
which Simon has been arguing is better than using triggers, would
have the advantage of not 

Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Peter Eisentraut
On 6/17/13 9:19 AM, Andres Freund wrote:
 Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is
 not really concurrent, at least not concurrent to the standard set by
 CREATE and DROP INDEX CONCURRENTLY.
 
 Well, it still does the main body of work in a concurrent fashion, so I
 still don't see how that argument holds that much water.

The reason we added DROP INDEX CONCURRENTLY is so that you don't get
stuck in a lock situation like

long-running-transaction -> DROP INDEX -> everything else

If we accepted REINDEX CONCURRENTLY as currently proposed, then it would
have the same problem.

I don't think we should accept a REINDEX CONCURRENTLY implementation
that is worse in that respect than a manual CREATE INDEX CONCURRENTLY +
DROP INDEX CONCURRENTLY combination.
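
(With hypothetical names, that manual recipe for a plain, non-constraint
index is roughly:)

CREATE INDEX CONCURRENTLY idx_tmp ON t (col);
DROP INDEX CONCURRENTLY idx;
ALTER INDEX idx_tmp RENAME TO idx;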





Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Andres Freund
On 2013-06-17 11:03:35 -0400, Peter Eisentraut wrote:
 On 6/17/13 9:19 AM, Andres Freund wrote:
  Without getting rid of the AccessExclusiveLock, REINDEX CONCURRENTLY is
  not really concurrent, at least not concurrent to the standard set by
  CREATE and DROP INDEX CONCURRENTLY.
  
  Well, it still does the main body of work in a concurrent fashion, so I
  still don't see how that argument holds that much water.
 
 The reason we added DROP INDEX CONCURRENTLY is so that you don't get
 stuck in a lock situation like
 
  long-running-transaction -> DROP INDEX -> everything else
 
 If we accepted REINDEX CONCURRENTLY as currently proposed, then it would
 have the same problem.
 
 I don't think we should accept a REINDEX CONCURRENTLY implementation
 that is worse in that respect than a manual CREATE INDEX CONCURRENTLY +
 DROP INDEX CONCURRENTLY combination.

Well, it can do lots of stuff that DROP/CREATE CONCURRENTLY can't:
* reindex primary keys
* reindex keys referenced by foreign keys
* reindex exclusion constraints
* reindex toast tables
* do all that for a whole database
so I don't think that comparison is fair. Having it would have made
several previous point releases far less painful (e.g. 9.1.6/9.2.1).
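
(For reference, usage as the patch proposes it; this syntax is not
committed anywhere:)

REINDEX INDEX CONCURRENTLY some_table_pkey;   -- a primary key
REINDEX TABLE CONCURRENTLY some_table;        -- includes its toast table
REINDEX DATABASE CONCURRENTLY some_database;  -- a whole database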

But anyway, as I said, the argument was only whether we could
continue reviewing before the mvcc stuff goes in, not whether it can get
committed before.

I don't think we have a need to decide whether REINDEX CONCURRENTLY can
go in with the short exclusive lock unless we find unresolvable
problems with the mvcc patch. Which I very, very much hope not to be the
case.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] matview incremental maintenance

2013-06-17 Thread Stefan Drees

On 2013-06-17 16:41 +02:00, Kevin Grittner wrote:

Since there seems to be interest in discussing incremental
maintenance of materialized views *now*, I'm starting this thread
to try to avoid polluting unrelated threads with the discussion.  I
don't intend to spend a lot of time on it until the CF in progress
completes, but at that point the work will start in earnest.  So
I'll say where I'm at, and welcome anyone who has time to spare
outside of the CF to comment or contribute ideas.

The paper at the core of the discussion can be found by searching
for "maintaining views incrementally gupta mumick subrahmanian" --
it's on both the ACM and CiteSeerX websites.  Of course, one


i.e.
Ashish Gupta, Inderpal Singh Mumick, and V. S. Subrahmanian. 1993. 
Maintaining views incrementally. In Proceedings of the 1993 ACM SIGMOD 
international conference on Management of data (SIGMOD '93), Peter 
Buneman and Sushil Jajodia (Eds.). ACM, New York, NY, USA, 157-166. 
DOI=10.1145/170035.170066 http://doi.acm.org/10.1145/170035.170066


just in case a direct reference might come in handy :-)


...


All the best,
Stefan.




Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Kevin Grittner
Nicolas Barbier nicolas.barb...@gmail.com wrote:
 2013/6/17 Heikki Linnakangas hlinnakan...@vmware.com:


 +errmsg("concurrent refresh requires a
 unique index on just columns for all rows of the materialized view")));

 Maybe my english is failing me here, but I don’t understand the “just” part.

It means that the index must not use any expressions in the list of
what it's indexing on -- only column names.  Suggestions for better
wording would be welcome.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Request for Patch Feedback: Lag Lead Window Functions Can Ignore Nulls

2013-06-17 Thread Robert Haas
On Sat, Jun 15, 2013 at 9:37 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Nicholas White escribió:

 For the parsing changes, it seems I can either make RESPECT and IGNORE
 reserved keywords, or add a lookahead to construct synthetic RESPECT NULLS
 and IGNORE NULLS keywords. The grammar wouldn't compile if RESPECT and
 IGNORE were just normal unreserved keywords due to ambiguities after a
 function definition (e.g. select abs(1) respect; - which is currently a
 valid statement).

 Well, making them reserved keywords is not that great, so maybe the
 lookahead thingy is better.  However, this patch introduces the third
 and fourth uses of the save the lookahead token pattern in the
 default switch cases.  Can we refactor that bit somehow, to avoid so
 many duplicates?  (For a minute I thought that Andrew Gierth's WITH
 ORDINALITY patch would add another one, but it seems not.)

Making things reserved keywords is painful and I don't like it, but
I've started to become skeptical of shifting the problem to the lexer,
too.  Sometimes special case logic in the lexer about token combining
can have surprising consequences in other parts of the grammar.  For
example, with a lexer hack, what will happen if someone has a column
named RESPECT and does SELECT ... ORDER BY respect NULLS LAST?  I
haven't studied the code in detail so maybe it's fine, but it's
something to think about.
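
To make the collision concrete (the window-function syntax below is the
patch's proposal, not accepted grammar):

CREATE TABLE grades (respect int, val int);

-- "respect" as a plain column reference:
SELECT val FROM grades ORDER BY respect NULLS LAST;

-- RESPECT as part of the proposed syntax:
SELECT lead(val) RESPECT NULLS OVER (ORDER BY val) FROM grades;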

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Robert Haas
On Mon, Jun 17, 2013 at 11:21 AM, Kevin Grittner kgri...@ymail.com wrote:
 Nicolas Barbier nicolas.barb...@gmail.com wrote:
 2013/6/17 Heikki Linnakangas hlinnakan...@vmware.com:


 +errmsg("concurrent refresh requires a
 unique index on just columns for all rows of the materialized view")));

 Maybe my english is failing me here, but I don’t understand the “just” part.

 It means that the index must not use any expressions in the list of
 what it's indexing on -- only column names.  Suggestions for better
 wording would be welcome.

Random idea:

ERROR: materialized view "%s" does not have a unique key

Perhaps augmented with:

HINT: Create a UNIQUE btree index with no WHERE clause on one or more
columns of the materialized view.
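
For example, assuming a matview mv(id, name), the first of these would
qualify and the other two would not:

CREATE UNIQUE INDEX mv_id ON mv (id);                 -- qualifies
CREATE UNIQUE INDEX mv_part ON mv (id) WHERE id > 0;  -- partial: no
CREATE UNIQUE INDEX mv_expr ON mv (lower(name));      -- expression: no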

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib

2013-06-17 Thread David E. Wheeler
On Jun 16, 2013, at 9:20 AM, Cédric Villemain ced...@2ndquadrant.com wrote:

 Then instead of the above you'd just be able to say something like
 
 MODULETEST = test
 
 or REGRESSDIR ?

Yeah, that sounds perfect.

 Also I suggest to remove the need to set REGRESS at all, and default to all 
 sql files in REGRESSDIR/sql (if REGRESSDIR is set)

Yeah, that would be nice. If one has different file names or something, then 
one should still be able to set REGRESS.

Best,

David





Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib

2013-06-17 Thread Alvaro Herrera
Joe Conway wrote:

 On 06/15/2013 11:28 AM, Alvaro Herrera wrote:

  This use case seems too narrow to me to justify the burden of
  keeping PGXS-enabled makefiles in contrib.
 
 What was the burden of it?

Per 
http://www.postgresql.org/message-id/1371093408.309.5.ca...@vanquo.pezone.net :

: 1. take up space
: 2. confuse users
: 3. produce broken external extension modules that take contrib as an example
: 4. break builds of PostgreSQL when users try to fix 3. by exporting USE_PGXS

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] matview incremental maintenance

2013-06-17 Thread Simon Riggs
On 17 June 2013 15:41, Kevin Grittner kgri...@ymail.com wrote:

 Since there seems to be interest in discussing incremental
 maintenance of materialized views *now*

Since your earlier complaint, I specifically said I was happy to wait
to discuss that. Why have you raised this now?

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Remove useless USE_PGXS support in contrib

2013-06-17 Thread Cédric Villemain
On Monday 17 June 2013 at 18:41:32, Alvaro Herrera wrote:
 Joe Conway wrote:
  On 06/15/2013 11:28 AM, Alvaro Herrera wrote:
   This use case seems too narrow to me to justify the burden of
   keeping PGXS-enabled makefiles in contrib.
  
  What was the burden of it?
 
 Per http://www.postgresql.org/message-id/1371093408.309.5.ca...@vanquo.pezone.net :
 : 1. take up space
 : 2. confuse users
 : 3. produce broken external extension modules that take contrib as an example
 : 4. break builds of PostgreSQL when users try to fix 3. by exporting USE_PGXS

But:
4. can be fixed (see patches I sent), so it is not an excuse.

I agree with the other points.
My only grief is to lose the perfect regression tests for PGXS that
those contribs are.

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation




Re: [HACKERS] matview incremental maintenance

2013-06-17 Thread Simon Riggs
On 17 June 2013 15:41, Kevin Grittner kgri...@ymail.com wrote:

 While I have yet to look in detail at the mechanism for capturing
 the initial delta on the base tables, the two fairly obvious
 candidates are to stuff the before and after images into a
 tuplestore or temp table as base table changes are written,
 somewhere around the point that triggers would be fired, or to use
 the WAL stream in some way.  The advantages of the former are that
 it would be hard to find a lower overhead way to capture the data,
 nor a more certain way to get exactly the right data.  The latter,
 which Simon has been arguing is better than using triggers, would
 have the advantage of not directly slowing down a process writing
 to base tables, although for more eager modes transactions would
 need to block waiting for the data to flow through the walsender,
 be filtered and assembled as data of interest, and communicated
 back to the transaction somehow before it could proceed.  Assuming
 that it can provide the changeset prior to the commit, and that it
 can include before images, it could work, but the timing sure
 seems dubious for the more eager modes.

It isn't an unconditionally true statement to say it would be hard to
find a lower overhead way to capture the data, since there is strong
experimental evidence from work on replication that shows that the WAL
is a very effective mechanism for changeset extraction.

There is nothing to say the changeset must occur through the
WalSender. That is just where it currently occurs, but it could easily
occur elsewhere, if the requirement existed. Similarly, changeset
extraction doesn't currently allow access to uncommitted rows, but it
could do so, if required. Before images of change could be provided by
direct access to prior versions via their tid, just as they are with
triggers.

There are other advantages to using WAL that you don't mention, such
as the avoidance of the need for the trigger queue to spill to disk,
avoidance of memory overhead for large transactions and avoidance of
random I/O.

ISTM that using WAL has to be properly considered as a viable option
which is why open discussion makes sense.

The timing of that discussion doesn't need to be immediate but
certainly it should happen before any options are precluded because of
the progress of other events. Let me know when that's appropriate,
so we can discuss.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Batch API for After Triggers

2013-06-17 Thread Pavel Stehule
2013/6/17 Simon Riggs si...@2ndquadrant.com:
 On 9 June 2013 12:58, Craig Ringer cr...@2ndquadrant.com wrote:
 On 06/09/2013 04:58 PM, Simon Riggs wrote:
 There are also difficulties in semantics, since when
 we have OLD and NEW at row level we know we are discussing the same
 row. With sets of OLD and NEW we'd need to be able to link the
 relations back together somehow, which couldn't be done by PK since
 that could change.

 We don't currently have OLD and NEW relations so we're free to define
 how this works pretty freely.

 Rather than having OLD and NEW as separate relations, we could just have
 one OLD_AND_NEW relation. In that relation we exploit Pg's composite
 types to nest the old and new tuples in a single outer change record.

 OLD_AND_NEW would look to PL/PgSQL as if it were:

 CREATE TEMPORARY TABLE OLD_AND_NEW (
 OLD tabletype NOT NULL,
 NEW tabletype NOT NULL
 );

 ...though presumably without the ability to create indexes on it and the
 other things you can do to a real temp table. Though I can see cases
 where that'd be awfully handy too.

 For DELETE and INSERT we'd either provide different relations named OLD
 and NEW respectively, or we'd use OLD_AND_NEW with one field or the
 other blank. I'm not sure which would be best.

 Alternately, we could break the usual rules for relations and define OLD
 and NEW as ordered, so lock-step iteration would always return matching
 pairs of rows. That's useless in SQL since there's no way to achieve
 lock-step iteration, but if we provide a
 for_each_changed_row('some_function'::regproc) that scans them in
 lock-step and invokes `some_function` for each one...? (I haven't yet
 done enough in the core to have any idea if this approach is completely
 and absurdly impossible, or just ugly. Figured I'd throw it out there
 anyway.)


 I think the best way, if we did do this, would be to have a number of
 different relations defined:

 OLD
 NEW
 INSERTED
 DELETED
 all of which would be defined same as main table

 and also one called
 UPDATED
 which would have two row vars called OLD and NEW
 so you would access it like e.g. IF UPDATED.OLD.id = 7


nice idea

+1

Pavel

 --
  Simon Riggs   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services






Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Fujii Masao
On Mon, Jun 17, 2013 at 9:23 PM, Michael Paquier
michael.paqu...@gmail.com wrote:



 On Mon, Jun 17, 2013 at 5:23 AM, Andres Freund and...@2ndquadrant.com
 wrote:

 On 2013-06-17 04:20:03 +0900, Fujii Masao wrote:
  On Thu, Jun 6, 2013 at 1:29 PM, Michael Paquier
  michael.paqu...@gmail.com wrote:
   Hi all,
  
   Please find attached the latest versions of REINDEX CONCURRENTLY for
   the 1st
   commit fest of 9.4:
   - 20130606_1_remove_reltoastidxid_v9.patch, removing reltoastidxid, to
   allow
   a toast relation to have multiple indexes running in parallel (extra
   indexes
   could be created by a REINDEX CONCURRENTLY processed)
   - 20130606_2_reindex_concurrently_v26.patch, correcting some comments
   and
   fixed a lock in index_concurrent_create on an index relation not
   released at
   the end of a transaction
 
  Could you let me know how this patch has something to do with MVCC
  catalog
  access patch? Should we wait for MVCC catalog access patch to be
  committed
  before starting to review this patch?

 I wondered the same. The MVCC catalog patch, if applied, would make it
 possible to make the actual relfilenode swap concurrently instead of
  requiring to take access exclusive locks, which obviously is way nicer. On
 the other hand, that function is only a really small part of this patch,
 so it seems quite possible to make another pass at it before relying on
 mvcc catalog scans.

 As mentioned by Andres, the only thing that the MVCC catalog patch can
 improve here is the index swap phase (index_concurrent_swap:index.c)
 where the relfilenodes of the old and new indexes are exchanged. Now an
 AccessExclusiveLock is taken on the 2 relations being swapped; we could
 downgrade that to ShareUpdateExclusiveLock with the MVCC catalog
 access, I think.

 Also, with the MVCC catalog patch in, we could add some isolation tests for
 REINDEX CONCURRENTLY (there were some tests in one of the previous
 versions),
 which is currently not possible due to the exclusive lock taken at the swap
 phase.

 Btw, those are minor things in the patch, so I think that it would be better
 to not wait
 for the MVCC catalog patch. Even if you think that it would be better to
 wait for it,
 you could even begin with the 1st patch allowing a toast relation to have
 multiple
 indexes (removal of reltoastidxid) which does not depend at all on it.

Here are the review comments of the removal_of_reltoastidxid patch.
I've not completed the review yet, but I'd like to post the current comments
before going to bed ;)

*** a/src/backend/catalog/system_views.sql
-pg_stat_get_blocks_fetched(X.oid) -
-pg_stat_get_blocks_hit(X.oid) AS tidx_blks_read,
-pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit
+pg_stat_get_blocks_fetched(X.indrelid) -
+pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_read,
+pg_stat_get_blocks_hit(X.indrelid) AS tidx_blks_hit

ISTM that X.indrelid indicates the TOAST table not the TOAST index.
Shouldn't we use X.indexrelid instead of X.indrelid?

You changed some SQL because of the removal of reltoastidxid.
Could you check again that the original SQL and the changed one return
the same value?

doc/src/sgml/diskusage.sgml
 There will be one index on the
 <acronym>TOAST</> table, if present.

I'm not sure if multiple indexes on TOAST table are viewable by a user.
If it's viewable, we need to correct the above description.
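
(One way to check, with a made-up table name:)

SELECT i.indexrelid::regclass, i.indisvalid
  FROM pg_class c
  JOIN pg_index i ON i.indrelid = c.reltoastrelid
 WHERE c.oid = 'my_table'::regclass;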

doc/src/sgml/monitoring.sgml
 <entry><structfield>tidx_blks_read</></entry>
 <entry><type>bigint</></entry>
 <entry>Number of disk blocks read from this table's TOAST table index (if
 any)</entry>
</row>
<row>
 <entry><structfield>tidx_blks_hit</></entry>
 <entry><type>bigint</></entry>
 <entry>Number of buffer hits in this table's TOAST table index (if
 any)</entry>

For the same reason as the above, don't we need to change "index" to
"indexes" in these descriptions?

*** a/src/bin/pg_dump/pg_dump.c
+ "SELECT c.reltoastrelid, t.indexrelid "
  "FROM pg_catalog.pg_class c LEFT JOIN "
- "pg_catalog.pg_class t ON (c.reltoastrelid = t.oid) "
- "WHERE c.oid = '%u'::pg_catalog.oid;",
+ "pg_catalog.pg_index t ON (c.reltoastrelid = t.indrelid) "
+ "WHERE c.oid = '%u'::pg_catalog.oid AND t.indisvalid "
+ "LIMIT 1",

Is there the case where TOAST table has more than one *valid* indexes?
If yes, is it really okay to choose just one index by using LIMIT 1?
If no, i.e., TOAST table should have only one valid index, we should get rid
of LIMIT 1 and check that only one row is returned from this query.
Fortunately, ISTM this check has already been done by the subsequent
call of ExecuteSqlQueryForSingleRow(). Thoughts?

Regards,

-- 
Fujii Masao



Re: [HACKERS] GIN improvements part 3: ordering in index

2013-06-17 Thread Heikki Linnakangas

On 17.06.2013 15:56, Alexander Korotkov wrote:

On Sat, Jun 15, 2013 at 3:02 AM, Alexander Korotkov aekorot...@gmail.com wrote:


This patch introduces new interface method of GIN which takes same
arguments as consistent but returns float8.
float8 gin_ordering(bool check[], StrategyNumber n, Datum query, int32
nkeys, Pointer extra_data[], bool *recheck, Datum queryKeys[], bool
nullFlags[], Datum addInfo[], bool addInfoIsNull[])
This patch implements gingettuple method which can return ordering data
using KNN infrastructure. Also it introduces  operator for fts which
support ordering in GIN index. Some example:

postgres=# explain analyze select * from dblp_titles2 where tsvector @@
to_tsquery('english', 'statistics') order by tsvector <->
to_tsquery('english', 'statistics') limit 10;
                                    QUERY PLAN
-----------------------------------------------------------------------------
  Limit  (cost=12.00..48.22 rows=10 width=136) (actual time=6.999..7.120
rows=10 loops=1)
    ->  Index Scan using dblp_titles2_idx on dblp_titles2
  (cost=12.00..43003.03 rows=11868 width=136) (actual time=6.996..7.115
rows=10 loops=1)
  Index Cond: (tsvector @@ '''statist'''::tsquery)
          Order By: (tsvector <-> '''statist'''::tsquery)
  Total runtime: 7.556 ms
(5 rows)



Attached version of patch has some refactoring and bug fixes.


Thanks. There are no docs changes and not many comments; that needs to
be fixed, but I think I understand how it works:


On the first call to gingettuple, the index is first scanned for all the 
matches, which are collected in an array in memory. Then, the array is 
sorted with qsort(), and the matches are returned one by one from the 
sorted array.


That has some obvious limitations. First of all, you can run out of 
memory. Secondly, is that really any faster than the plan you get 
without this patch? Ie. scan all the matches with a bitmap index scan, 
and sort the results on the rank function. If it is faster, why?


What parts of the previous two patches does this rely on?

- Heikki




Re: [HACKERS] [9.4 CF 1] Commit Fest has started

2013-06-17 Thread Josh Berkus
Amit,

 I am interested in assisting you for this CF.
 Kindly let me know how can I add value for CommitFest management.

Thank you for the offer!  However, you're currently signed up to review
several patches, and I'd rather have you doing that than sending out
reminder emails.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-17 Thread Sawada Masahiko
On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila amit.kap...@huawei.com wrote:
 On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
 On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila amit.kap...@huawei.com wrote:

 On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
 On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila amit.kap...@huawei.com wrote:
 On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
 Hello,

 We have already started a discussion on pgsql-hackers for the problem of
 taking a fresh backup during the failback operation; here is the link for
 that:


 http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtb
 jgwrfu513...@mail.gmail.com

 Let me again summarize the problem we are trying to address.


   How will you take care of extra WAL on old master during recovery. If it
 plays the WAL which has not reached new-master, it can be a problem.

 you mean that it is possible that the old master's data is ahead of the new
 master's data.

   What I mean to say is that the WAL of the old master can be ahead of the
 new master's. I understood that
   the data files of the old master can't be ahead, but I think the WAL can be ahead.

 so there is inconsistent data between those servers at failback, right?
 if so, that inconsistency is not possible, because if you use the GUC option
 as he proposes (i.e., failback_safe_standby_mode = remote_flush),
 while the old master is working fine, no file system level changes are
 done before the WAL is replicated.

 Would the proposed patch take care that the old master's WAL is also not
 ahead in some way?
 If yes, I think i am missing some point.

 yes, it can happen that the old master's WAL is ahead of the new master's
 WAL, as you said,
 but I think that we can solve that by deleting all WAL files when the old
 master starts as a new standby.

 I think ideally, it should reset WAL location at the point where new master 
 has forked off.
 In such a scenario it would be difficult for user who wants to get a dump of 
 some data in
 old master which hasn't gone to new master. I am not sure if such a need is 
 there for real users, but if it
 is there, then providing this solution will have some drawbacks.
I think that we can dump the data before deleting all the WAL files.  The
WAL files are deleted when the old master starts as a new standby.


Regards,

---
Sawada Masahiko




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Josh Berkus

 So there isn't a fall down thing here. We expect the recently
 loaded/updated data to be scanned and that's OK.
 
 Having the minmax index updated greedily is just adding extra work for
 fast diminishing returns. We can always add that later if really
 needed, but I doubt it will be needed - in just the same way as mat
 views aren't greedily updated.

Ok, in that case, can we add the patch without messing with the FSM
logic?  It'll work out-of-the-box for append-only tables, and that's a
pretty solid use case.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] refresh materialized view concurrently

2013-06-17 Thread Josh Berkus
On 06/17/2013 04:13 AM, Heikki Linnakangas wrote:
 On 14.06.2013 19:05, Kevin Grittner wrote:
 Attached is a patch for REFRESH MATERIALIZED VIEW CONCURRENTLY for
 9.4 CF1.  The goal of this patch is to allow a refresh without
 interfering with concurrent reads, using transactional semantics.

 It is my hope to get this committed during this CF to allow me to
 focus on incremental maintenance for the rest of the release cycle.
 
 I must say this seems a bit pointless on its own. But if it's a stepping
 stone to incremental maintenance, I have no objections.

Actually, CONCURRENTLY was cited as the #1 deficiency for the matview
feature for 9.3.  With it, the use-case for Postgres matviews is
broadened considerably.  So it's very valuable on its own, even if (for
example) INCREMENTAL didn't get done for 9.3.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Josh Berkus

 Well, it can do lots stuff that DROP/CREATE CONCURRENTLY can't:
 * reindex primary keys
 * reindex keys referenced by foreign keys
 * reindex exclusion constraints
 * reindex toast tables
 * do all that for a whole database
 so I don't think that comparison is fair. Having it would have made
 several previous point releases far less painful (e.g. 9.1.6/9.2.1).

FWIW, I have a client who needs this implementation enough that we're
backporting it to 9.1 for them.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Batch API for After Triggers

2013-06-17 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 On 9 June 2013 12:58, Craig Ringer cr...@2ndquadrant.com wrote:

 We don't currently have OLD and NEW relations so we're free to
 define how this works pretty freely.

 I think the best way, if we did do this, would be to have a
 number of different relations defined:

 OLD
 NEW
 INSERTED
 DELETED
 all of which would be defined same as main table

 and also one called
 UPDATED
 which would have two row vars called OLD and NEW
 so you would access it like e.g. IF UPDATED.OLD.id = 7

Well, there is the SQL standard, which has a couple paragraphs on
the topic which we might want to heed.  For a delete there is just
an old table; for an insert just a new one.  For an update you have
both, with the same cardinality.  The rows in the old and new
tables have a correspondence, but that is only visible to FOR EACH
ROW triggers.  For something like RI, why would you need to
establish correspondence?  A row with the referenced key either
exists after the statement completes, or it doesn't -- why would we
care whether it is an updated version of the same row?

Syntax for how to refer to these is defined by the standard.

As usual, I don't object to adding capabilities as long as the
standard syntax is also supported with standard semantics.
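
For reference, the standard's spelling uses a REFERENCING clause on the
trigger. A sketch (PostgreSQL does not accept this today; table and
function names are invented):

CREATE FUNCTION enforce_ref() RETURNS trigger AS $$
BEGIN
    -- new_rows is the standard's "new" transition table
    IF EXISTS (SELECT 1 FROM new_rows n
               WHERE NOT EXISTS (SELECT 1 FROM parent p
                                 WHERE p.id = n.parent_id)) THEN
        RAISE EXCEPTION 'missing referenced key';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER child_ref AFTER INSERT OR UPDATE ON child
    REFERENCING NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE enforce_ref();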

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Andres Freund
On 2013-06-17 12:52:36 -0700, Josh Berkus wrote:
 
  Well, it can do lots stuff that DROP/CREATE CONCURRENTLY can't:
  * reindex primary keys
  * reindex keys referenced by foreign keys
  * reindex exclusion constraints
  * reindex toast tables
  * do all that for a whole database
  so I don't think that comparison is fair. Having it would have made
  several previous point releases far less painful (e.g. 9.1.6/9.2.1).
 
 FWIW, I have a client who needs this implementation enough that we're
 backporting it to 9.1 for them.

Wait. What? Unless you break catalog compatibility that's not safely
possible using this implementation.

Greetings,

Andres Freund

PS: Josh, minor thing, but could you please not trim the CC list, at
least when I am on it?

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Alvaro Herrera
Greg Stark wrote:
 On Fri, Jun 14, 2013 at 11:28 PM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:
  Re-summarization is relatively expensive, because the complete page range 
  has
  to be scanned.
 
 That doesn't sound too bad to me. It just means there's a downside to
 having larger page ranges. I would expect the page ranges to be
 something in the ballpark of 32 pages --  scanning 32 pages to
 resummarize doesn't sound that painful but sounds like it's large
 enough that the resulting index would be a reasonable size.

Actually, I'm thinking that a range is more like, say, 1280 pages, or 10
MB.  My goal is to consider tables in the 10 TB magnitude; if I store
one index tuple for every 32 pages, I would end up with too large an
index.  With 10 MBs per range I can index the whole table with an index
of 50 MB, which seems reasonable to scan.  But anyway my intention is
that page range size is configurable.

  But I don't understand why an insert would invalidate a tuple. An insert
 can just update the min and max incrementally. It's a delete that
 invalidates the range but as you note it doesn't really invalidate it,
 just mark it as needing a refresh -- and even then only if the value
 being deleted is equal to either the min or max.

No, I don't intend to update the index tuple with each heap insert.  I
think this will end up being too slow.  The validity map is there to
hold a single bit for each page saying whether the page range is known
valid or not; one insert into any page in the range invalidates the
range (and any scan of the table needs to scan that range as a whole).
This way I can wait until the storm of inserts has passed from a range,
and only then do the summarization for that range.  This avoids having
to summarize the range over and over.

Alternatively, I could consider the index tuple always valid, and update
it online as soon as we do an insert or update (i.e. examine the min/max
values in the index tuple, and update it to match if the new value is
out of bounds).  This seems too slow, so I won't try.

Also, a delete does not invalidate a range either.  As Simon said
elsewhere in a thread, if the range is not minimal, this is not a
problem; we might have to scan some more ranges than absolutely
necessary, but it's not a correctness problem.  The heap scan recheck
node will get rid of the unwanted tuples anyway.

  Same-size page ranges?
  Current related literature seems to consider that each index entry in a
  minmax index must cover the same number of pages.  There doesn't seem to be 
  a
 
 I assume the reason for this in the literature is the need to quickly
 find the summary for a given page when you're handling an insert or
 delete.

Yeah, that's one thing to keep in mind.  I haven't gotten too much into
this; I only added these two entries at the end for my own future
reference, because I will need to consider them at some point.

Right now my intention is to have each index have a fixed page range
size, which is defined at index creation time.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Alvaro Herrera
Josh Berkus wrote:

  Value changes in columns that are part of a minmax index, and tuple 
  insertion
  in summarized pages, would invalidate the stored min/max values.  To support
  this, each minmax index has a validity map; a range can only be considered 
  in a
  scan if it hasn't been invalidated by such changes (A range not 
  considered in
  the scan needs to be returned in whole regardless of the stored min/max 
  values,
  that is, it cannot be pruned per query quals).  The validity map is very
  similar to the visibility map in terms of performance characteristics: quick
  enough that it's not contentious, allowing updates and insertions to proceed
  even when data values violate the minmax index conditions.  An invalidated
  range can be made valid by re-summarization (see below).
 
 This begins to sound like these indexes are only useful on append-only
 tables.  Not that there aren't plenty of those, but ...

But what?  This is a useful use-case; one that's not served by any other
type of index.  Sure, you can have btrees over append-only tables, but
they are really large and have huge maintenance costs.  A smaller lossy
index is useful if it's low-maintenance and if it keeps the cost of
scanning the table low enough.

These indexes can be considered a sort of partitioning of a large table.
Only instead of having to define CHECK (insert_date = 'a month') for
each partition, you just create the index on the insert_date column, and
the index will allow a seqscan of the table to skip pages that don't
match the query's WHERE clause on that column.
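
A sketch of the intended usage (the minmax access method and this syntax
are the proposal's, not an existing feature):

CREATE TABLE events (insert_date date, payload text);
CREATE INDEX events_mm ON events USING minmax (insert_date);

-- a scan can then skip any page range whose stored [min,max] for
-- insert_date cannot satisfy the qual:
SELECT * FROM events WHERE insert_date >= '2013-06-01';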

  Re-summarization is relatively expensive, because the complete page range 
  has
  to be scanned.
 
 Why?  Why can't we just update the affected pages in the index?

The page range has to be scanned in order to find out the min/max values
for the indexed columns on the range; and then, with these data, update
the index.

  To avoid this, a table having a minmax index would be configured so
  that inserts only go to the page(s) at the end of the table; this
  avoids frequent invalidation of ranges in the middle of the table.  We
  provide a table reloption that tweaks the FSM behavior, so that
  summarized pages are not candidates for insertion.
 
 We haven't had an index type which modifies table insertion behavior
 before, and I'm not keen to start now; imagine having two indexes on the
 same table each with their own, conflicting, requirements.

This is not a requirement.  It merely makes the index more effective.

 If we're going to start adding reloptions for specific table behavior,
 I'd rather think of all of the optimizations we might have for a
 prospective append-only table and bundle those, rather than tying it
 to whether a certain index exists or not.

Eh?  Sure, my intention for this reloption is for the user to be able to
state their intention for the table, and each feature that has
append-only table optimization does its thing.  I wasn't thinking in
anything automatic.

 Also, I hate the name ...

Feel free to propose other names; that way I can hate your proposals
too (or maybe not).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Andres Freund
On 2013-06-17 16:23:40 -0400, Alvaro Herrera wrote:
   Re-summarization is relatively expensive, because the complete page range 
   has
   to be scanned.
  
  Why?  Why can't we just update the affected pages in the index?
 
 The page range has to be scanned in order to find out the min/max values
 for the indexed columns on the range; and then, with these data, update
 the index.

Why? Assuming the initial summarization has been performed you can check
the current min/max value, check whether it's still smaller/bigger than
the value you're inserting and if not update the index accordingly with
the new value. You don't even need to wait for the new value to become
visible since the ranges don't need to be minimal.
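
In C-like terms the suggestion amounts to something like the sketch
below; the types are made up for illustration, and the real work would
of course happen on the index page under the appropriate lock:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical in-memory form of one range's summary. */
typedef struct RangeSummary
{
    int min;
    int max;
} RangeSummary;

/*
 * Insert path: compare the incoming value against the stored bounds and
 * widen them only when they are violated.  Widening does not need the
 * new tuple to be visible yet, because ranges may be non-minimal.
 */
static bool
maybe_widen(RangeSummary *s, int newval)
{
    if (newval >= s->min && newval <= s->max)
        return false;       /* common case: no index write at all */
    if (newval < s->min)
        s->min = newval;
    if (newval > s->max)
        s->max = newval;
    return true;            /* the only case needing an exclusive lock */
}

int
main(void)
{
    RangeSummary s = {10, 90};

    printf("widened: %d -> [%d, %d]\n", maybe_widen(&s, 95), s.min, s.max);
    return 0;
}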

I think the contention this possibly causes may be a better argument, but
I am not sure how much of a problem that really becomes if we find a
deadlock-free way to only lock the minmax pages exclusively if the range
is violated.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Alvaro Herrera
Tom Lane wrote:

 We've talked a lot about index-organized tables in the past.  How much
 of the use case for this would be subsumed by a feature like that?

IOTs are not flexible enough.  You can only have one index that you
index-organize the table on; and you can search only based on a prefix
of the index key.  If you want to change the key, ... um. I don't even
know what you'd do.

With minmax indexes, on the other hand, you can create one or several,
and they let you scan the table based on any of the indexed columns.  So
you can create a minmax index on creation_date, insertion_date, ship_date;
and have it serve queries that use any of these columns.  (You probably
don't add key column(s) to the minmax index because you already have
btrees on them.)

On the other hand, IOTs are expensive to insert into.  For each tuple to
insert you need to start from the root and descend the tree, insert your
tuple, then propagate splits upwards.  If you have a 10 TB table, you
cannot afford to have to do all that for each and every tuple.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Josh Berkus

 This begins to sound like these indexes are only useful on append-only
 tables.  Not that there aren't plenty of those, but ...
 
 But what?  

... but the other comments further down in my email.  Also, my
successive comments in other emails.

 Why?  Why can't we just update the affected pages in the index?
 
 The page range has to be scanned in order to find out the min/max values
 for the indexed columns on the range; and then, with these data, update
 the index.

Seems like you could incrementally update the range, at least for
inserts.  If you insert a row which doesn't decrease the min or increase
the max, you can ignore it, and if it does increase/decrease, you can
change the min/max.  No?

For updates, things are more complicated.  If the row you're updating
was the min/max, in theory you should update it to adjust that, but you
can't verify that it was the ONLY min/max row without doing a full scan.
 My suggestion would be to add a dirty flag which would indicate that
that block could use a rescan next VACUUM, and otherwise ignore changing
the min/max.  After all, the only defect of having min too low or max too
high for a block would be scanning too many blocks.  Which you'd do
anyway with it marked invalid.
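
A small sketch of that bookkeeping, with a hypothetical dirty flag added
to an invented summary struct (none of these names are from the patch):

#include <stdbool.h>

/* Hypothetical summary carrying the proposed dirty flag. */
typedef struct RangeSummary
{
    int     min;
    int     max;
    bool    dirty;      /* bounds may be wider than needed; rescan at VACUUM */
} RangeSummary;

/*
 * Update/delete path: removing a row that carries the current min or max
 * cannot tighten the bounds without rescanning the whole range, so only
 * remember that the next VACUUM may want to re-summarize.  Bounds that
 * are too wide merely cost extra block reads; they are never wrong.
 */
static void
note_removed_value(RangeSummary *s, int oldval)
{
    if (oldval == s->min || oldval == s->max)
        s->dirty = true;
}

int
main(void)
{
    RangeSummary s = {10, 90, false};

    note_removed_value(&s, 90);     /* a max-carrying row went away */
    return s.dirty ? 0 : 1;
}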

 This is not a requirement.  It merely makes the index more effective.

Right.  So I'm saying let's do this index without the FSM modifications,
and then consider those as their own, separate patch, if we even do them.

 Eh?  Sure, my intention for this reloption is for the user to be able to
 state their intention for the table, and each feature that has
 append-only table optimization does its thing.  I wasn't thinking in
 anything automatic.

99.7% of our users have no idea what to do with reloptions.  We'd have
to expose it with an ALTER TABLE SET append_only=true.

 Also, I hate the name ...
 
 Feel free to propose other names; that way I can hate your proposals
 too (or maybe not).

Well, my first thought was block-range indexing, which I think is the
best description, but that isn't exactly an exciting name for a feature
which will likely be worthy of short-listing for 9.4.  I'd prefer it
over minmax, which users will think only works on aggregates, but it's
still not a great name.  Summary Index also comes to mind, but really
isn't a lot more exciting.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Alvaro Herrera
Andres Freund wrote:

 PS: Josh, minor thing, but could you please not trim the CC list, at
 least when I am on it?

Yes, it's annoying.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Josh Berkus
On 06/17/2013 01:40 PM, Alvaro Herrera wrote:
 Andres Freund wrote:
 
 PS: Josh, minor thing, but could you please not trim the CC list, at
 least when I am on it?
 
 Yes, it's annoying.

I also get private comments from people who don't want me to cc them
when they are already on the list.  I can't satisfy everyone.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-06-17 Thread Andres Freund
On 2013-06-17 13:46:07 -0700, Josh Berkus wrote:
 On 06/17/2013 01:40 PM, Alvaro Herrera wrote:
  Andres Freund wrote:
  
  PS: Josh, minor thing, but could you please not trim the CC list, at
  least when I am on it?
  
  Yes, it's annoying.
 
 I also get private comments from people who don't want me to cc them
 when they are already on the list.  I can't satisfy everyone.

Given that nobody but you trims the CC list I don't find that a
convincing argument.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Add transforms feature

2013-06-17 Thread Peter Eisentraut
On 6/14/13 11:48 PM, Craig Ringer wrote:
 I wonder if that should be extended to install headers for hstore,
 ltree, and while we're at it, intarray as well?

Sure, if someone wants to go through and check which headers are
independently usable, and do the necessary cleanups where necessary.





Re: [HACKERS] [PATCH] Add transforms feature

2013-06-17 Thread Alvaro Herrera
Peter Eisentraut wrote:
 A transform is an SQL object that supplies two functions for converting
 between data types and procedural languages.  For example, a transform
 could arrange that hstore is converted to an appropriate hash or
 dictionary object in PL/Perl or PL/Python.
 
 Externally visible changes:

This is a large patch.  Do you intend to push the whole thing as a
single commit, or split it?



-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [RFC] Minmax indexes

2013-06-17 Thread Tomas Vondra
Hi!

This sounds really interesting, so a few quick comments.

On 15.6.2013 00:28, Alvaro Herrera wrote:

 In each index tuple (corresponding to one page range), we store:
 - first block this tuple applies to
 - last block this tuple applies to
 - for each indexed column:
   * min() value across all tuples in the range
   * max() value across all tuples in the range
   * nulls present in any tuple?

What about adding a counter for how many times the min/max value is
present in the range? The other messages in this thread suggest that the
refresh after UPDATE/DELETE is one of the concerns - as Greg Stark
mentioned, the range invalidation may only happen when running DELETE on
a row matching the min/max value. I believe having a counter would be an
improvement - a refresh would be needed only if it reaches 0.
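
A standalone sketch of the counter idea (field names invented for the
example): a DELETE forces a rescan of the range only when the last row
carrying a boundary value disappears.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical summary extended with occurrence counters. */
typedef struct RangeSummary
{
    int min;
    int max;
    int min_count;      /* rows in the range equal to min */
    int max_count;      /* rows in the range equal to max */
} RangeSummary;

/* Returns true only when a rescan of the range is actually needed. */
static bool
note_delete(RangeSummary *s, int deleted)
{
    if (deleted == s->min && --s->min_count == 0)
        return true;
    if (deleted == s->max && --s->max_count == 0)
        return true;
    return false;
}

int
main(void)
{
    RangeSummary s = {10, 90, 2, 1};

    printf("rescan after deleting 10: %d\n", note_delete(&s, 10));  /* 0 */
    printf("rescan after deleting 90: %d\n", note_delete(&s, 90));  /* 1 */
    return 0;
}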

 Summarization
 -------------
 
 At index creation time, the whole table is scanned; for each page
 range the min() and max() values and nulls bitmap are collected, and
 stored in the index. The possibly-incomplete range at the end of the
 table is also included.

Would it make sense to do this using an existing (b-tree) index for very
large tables? Clearly it doesn't make sense to create a b-tree index and
then minmax on the same column, but for very large databases upgraded
using pg_upgrade this might be interesting.

 Once in a while, it is necessary to summarize a bunch of unsummarized
 pages (because the table has grown since the index was created), or
 re-summarize a range that has been marked invalid.  This is simple:
 scan the page range calculating the min() and max() for each indexed
 column, then insert the new index entry at the end of the index.  The
 main interesting questions are:
 
 a) when to do it The perfect time to do it is as soon as a complete
 page range of the configured range size has been filled (assuming
 page ranges are constant size).
 
 b) who does it (what process) It doesn't seem a good idea to have a
 client-connected process do it; it would incur unwanted latency.
 Three other options are (i) to spawn a specialized process to do it,
 which perhaps can be signalled by a client-connected process that
 executes a scan and notices the need to run summarization; or (ii) to
 let autovacuum do it, as a separate new maintenance task.  This seems
 simple enough to bolt on top of already existing autovacuum
 infrastructure.  The timing constraints of autovacuum might be
 undesirable, though.  (iii) wait for user command.

 
 The easiest way to go around this seems to have vacuum do it.  That
 way we can simply do re-summarization on the amvacuumcleanup routine.
 Other answers would mean we need a separate AM routine, which appears
 unwarranted at this stage.

I don't think this should be added to the autovacuum daemon. It's quite
tricky to tune autovacuum to be aggressive just enough, i.e. not to run
too frequently / not to lag. I'm afraid this would add a task consuming
unpredictable amounts of time, which would make the autovacuum tuning
even trickier.

I can live with non-summarized minmax indexes, but I can't live with
non-vacuumed tables.

 Open questions
 --------------
 
 * Same-size page ranges? Current related literature seems to consider
 that each index entry in a minmax index must cover the same number
 of pages.  There doesn't seem to be a hard reason for this to be so;
 it might make sense to allow the index to self-tune so that some
 index entries cover smaller page ranges, if this allows the
 min()/max() values to be more compact.  This would incur larger space
 overhead for the index itself, but might allow better pruning of page
 ranges during scan.  In the limit of one index tuple per page, the
 index itself would occupy too much space, even though we would be 
 able to skip reading the most heap pages, because the min()/max() 
 ranges are tight; in the opposite limit of a single tuple that 
 summarizes the whole table, we wouldn't be able to prune anything
 from the seqscan even though the index is very small.

I see no particular reason not to allow that, and having variable range
size would be a very nice feature IMHO.

Do you have an idea on how the self-tuning might work?

I was thinking about something like this:

(1) estimate the initial range size, trying to keep good selectivity
while not having very small ranges, using

   (a) average row length
   (b) number of distinct values in the column
   (c) MCV/histogram/correlation

(2) once the index is built, running a merge on the ranges - merging
the adjacent pages if the resulting selectivity is not significantly
worse (again, this might be evaluated using MCV, histogram)

But it should be possible to override this (initial range size, max
range size) or disable the heuristics completely, as having large ranges
probably makes the resummarizing more expensive.  Although having the
counter might fix that, as it makes it more likely there's another row
with the min/max value.
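
For what it's worth, a toy C sketch of what step (1) could look like;
the formula and the target of roughly 32 distinct values per range are
pure guesses for illustration, not derived from the proposal:

#include <stdio.h>

#define BLCKSZ 8192

static int
initial_pages_per_range(int avg_row_width, double ndistinct, double ntuples)
{
    double rows_per_page = (double) BLCKSZ / avg_row_width;
    double target_rows = 32.0 * (ntuples / ndistinct);  /* ~32 distinct/range */
    int pages = (int) (target_rows / rows_per_page);

    if (pages < 1)
        pages = 1;              /* never below one page */
    if (pages > 1024)
        pages = 1024;           /* cap so re-summarization stays cheap */
    return pages;
}

int
main(void)
{
    /* e.g. 100-byte rows, 1M distinct values, 1G rows */
    printf("%d pages per range\n", initial_pages_per_range(100, 1e6, 1e9));
    return 0;
}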

BTW could this be used to pre-sort the table, so that the 

Re: [HACKERS] Batch API for After Triggers

2013-06-17 Thread Craig Ringer
On 06/18/2013 01:25 AM, Pavel Stehule wrote:
  and also one called
  UPDATED
  which would have two row vars called OLD and NEW
  so you would access it like e.g. IF UPDATED.OLD.id = 7
 
 nice idea

 +1
Much better naming than OLD_AND_NEW.

I'm not so sure about

OLD
NEW
INSERTED
DELETED

in that I imagine we'd want to pick one pair and stick with it. Since
using INSERTED / DELETED makes UPDATED make sense, and since OLD
and NEW are already used to refer to the magic variables of those
names in FOR EACH ROW triggers, I think INSERTED / UPDATED / DELETED is
the way to go.

INSERTED and DELETED could just be views of the same data as UPDATED,
showing only the NEW or only the OLD composite type fields
respectively. That'd allow you to write a trigger without TG_OP tests in
many cases, as UPDATED would always contain what you wanted. It seems
slightly weird to have INSERTED and DELETED populated for an UPDATE, but
when an UPDATE is logically an INSERT+DELETE anyway...

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] [PATCH] Add transforms feature

2013-06-17 Thread Craig Ringer
On 06/18/2013 04:58 AM, Peter Eisentraut wrote:
 On 6/14/13 11:48 PM, Craig Ringer wrote:
 I wonder if that should be extended to install headers for hstore,
 ltree, and while we're at it, intarray as well?
 Sure, if someone wants to go through and check which headers are
 independently usable, and do the necessary cleanups where necessary.
I can do that if there are no objections. It's only tangential to this
work really, so I'll post a separate thread when I get on to it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





[HACKERS] [9.4 CF 1] Added in missing patches

2013-06-17 Thread Josh Berkus
Folks,

At this stage, all of the patches which were not already added into CF1
should be there.  So look carefully and make sure *all* of your patches
are there.

Amusingly, it's not the new submitters who forgot to add their patch to
the CF, but rather experienced contributors, and even a committer.

In some cases, I may have accidentally added a patch whose status is
already resolved; if so, that's because I couldn't determine the
resolution status.  Better to add some unnecessary patch entries than to
miss a patch.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] dynamic background workers

2013-06-17 Thread Christopher Browne
BTW, one of the ideas that popped up in the unConference session on
replication was why couldn't we use a background worker as a replication
agent?

The main reason pointed out was 'because that means you have to restart the
postmaster to add a replication agent.'  (e.g. - like a Slony slon
process)

There may well be other better reasons not to do so, but it would be nice
to eliminate this reason.  It seems seriously limiting to the bg-worker
concept for them to be thus restricted.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, How would the Lone Ranger handle this?


[HACKERS] [9.4 CF 1] What the 5-day Deadline Means

2013-06-17 Thread Josh Berkus
Hackers,

I got a question on RRR which I thought should be addressed on this
list.  Basically, the questioner asked me: "I have a day job, I can't
promise to review all of these patches in 5 days."

The answer is: only put your name down on patches which you *can* review
in the next 5 days.  Don't claim every patch you might want to review
sometime before July 15th; instead, claim patch reviews as you get to
actually working on them.

Of course, sometimes life gets in the way.  At which point, I ask people
to *remove* their own names as reviewers as soon as possible when they
know they won't have time to review something.  It doesn't benefit the
CF process to have names down of people who aren't actually doing reviews.

Thanks!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [9.4 CF 1] Added in missing patches

2013-06-17 Thread Michael Paquier
On Tue, Jun 18, 2013 at 7:41 AM, Josh Berkus j...@agliodbs.com wrote:
 Folks,

 At this stage, all of the patches which were not already added into CF1
 should be there.  So look carefully and make sure *all* of your patches
 are there.

 Amusingly, it's not the new submitters who forgot to add their patch to
 the CF, but rather experienced contributors, and even a committer.

 In some cases, I may have accidentally added a patch whose status is
 already resolved; if so, that's because I couldn't determine the
 resolution status.  Better to add some unnecessary patch entries than to
 miss a patch.
This is great. Thanks!
Just wondering, how many patches did you add? 8? I saw a total of 98
patches a couple of days ago, now up to 106.
--
Michael




Re: [HACKERS] [9.4 CF 1] Added in missing patches

2013-06-17 Thread Josh Berkus

 Just wondering, how many patches did you add? 8? I saw a total of 98
 patches a couple of days ago, now up to 106.

Then it must be 8.  That sounds about right.  Mind you, I immediately
marked 2 as already committed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] [9.4 CF 1] Added in missing patches

2013-06-17 Thread Michael Paquier
On Tue, Jun 18, 2013 at 9:00 AM, Josh Berkus j...@agliodbs.com wrote:

 Just wondering, how many patches did you add? 8? I saw a total of 98
 patches a couple of days ago, now up to 106.

 Then it must be 8.  That sounds a about right.  Mind you, I immediately
 marked 2 as already committed.
Just did the same with the patch about removal of pageinspect.sql; it
has already been committed by Heikki.
--
Michael




[HACKERS] How do we track backpatches?

2013-06-17 Thread Josh Berkus
Contributors,

While going through this mailing list looking for missing 9.4 patches, I
realized that we don't track backpatches (that is, fixes to prior
versions) at all anywhere.  Where backpatches are submitted by
committers this isn't an issue, but we had a couple major ones (like the
autovacuum fix) which were submitted by general contributors.  The same
goes for beta fixes.

Should we add a prior version category to the CF to make sure these
don't get dropped?  Or do something else?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Add regression tests for SET xxx

2013-06-17 Thread Robins Tharakan
Thanks !

PFA the updated patch.  Also removed a trailing whitespace at the end of
the SQL script.

--
Robins Tharakan


On 17 June 2013 17:29, Szymon Guz mabew...@gmail.com wrote:

 On 26 May 2013 19:56, Robins Tharakan thara...@gmail.com wrote:

 Hi,

 Please find attached a patch to take code-coverage of SET (SESSION / SEED
 / TRANSACTION / DATESTYLE / TIME ZONE) (src/backend/commands/variable.c)
 from 65% to 82%.

 Any and all feedback is welcome.
 --
 Robins Tharakan


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


 Hi,
 the patch applies cleanly on current trunk; however, there are failing
 tests, diff attached.

 regards
 Szymon



regress_variable_v3.patch
Description: Binary data



Re: [HACKERS] Patch to add support of IF NOT EXISTS to others CREATE statements

2013-06-17 Thread Peter Eisentraut
On Wed, 2013-06-12 at 16:31 -0300, Fabrízio de Royes Mello wrote:
  Btw., I also want REPLACE BUT DO NOT CREATE.
 
 Can you explain more about it?
 
Replace/alter the object if it already exists, but fail if it does not
exist.

The complete set of variants is:

- object does not exist:
  - proceed (normal CREATE)
  - error (my above description)

- object exists:
  - replace (CREATE OR REPLACE)
  - skip (CREATE IF NOT EXISTS)
  - error (normal CREATE)






Re: [HACKERS] dynamic background workers

2013-06-17 Thread Peter Eisentraut
On Fri, 2013-06-14 at 17:00 -0400, Robert Haas wrote:
 Alvaro's work on 9.3, we now have the ability to configure background
 workers via shared_preload_libraries.  But if you don't have the right
 library loaded at startup time, and subsequently wish to add a
 background worker while the server is running, you are out of luck.

We could tweak shared_preload_libraries so that it reacts sensibly to
reloads.  I basically gave up on that by writing
session_preload_libraries, but if there is more general use for that, we
could try.

(That doesn't invalidate your work, but it's a thought.)






Re: [HACKERS] How do we track backpatches?

2013-06-17 Thread Peter Eisentraut
On Mon, 2013-06-17 at 17:11 -0700, Josh Berkus wrote:
 Contributors,
 
 While going through this mailing list looking for missing 9.4 patches, I
 realized that we don't track backpatches (that is, fixes to prior
 versions) at all anywhere.  Where backpatches are submitted by
 committers this isn't an issue, but we had a couple major ones (like the
 autovacuum fix) which were submitted by general contributors.  The same
 goes for beta fixes.
 
 Should we add a prior version category to the CF to make sure these
 don't get dropped?  Or do something else?

A separate commit fest for tracking proposed backpatches might be
useful.






[HACKERS] spurious wrap-around shutdown

2013-06-17 Thread Jeff Janes
On Sun, Jun 16, 2013 at 11:54 AM, Jeff Janes jeff.ja...@gmail.com wrote:


 In 9.3 HEAD I am getting what seems to be spurious wrap-around shutdowns.


 postgres=# SELECT datname, datfrozenxid, age(datfrozenxid) FROM
 pg_database;

   datname  | datfrozenxid |    age
 -----------+--------------+-----------
  template1 |   2621759843 |         0
  template0 |   2621759843 |         0
  postgres  |   2571759843 |      5000
  jjanes    |   2437230921 | 184528922



While the behavior is weird, it is not a regression (also present in 9.2
with suitable changes in timing) and the shutdown is not spurious.

If I execute the above query immediately after the shutdown, I see what I
would expect: jjanes has an age of about 2^31.

The one table that is holding everything back is already getting autovac
for wraparound at that point, and eventually that vacuum finishes.  When
done, pg_class and pg_database are updated (I don't know how they get
updated without trying to assign another transaction), and then I get the
above query results.

I would think the database would re-allow new transactions at this point,
but it does not.  I don't know why.

Since this isn't a regression in 9.3, I probably won't pursue it any more
at this time, unless encouraged to.

Cheers,

Jeff


Re: [HACKERS] SET work_mem = '1TB';

2013-06-17 Thread Jeff Janes
On Tuesday, May 21, 2013, Simon Riggs wrote:

 I worked up a small patch to support Terabyte setting for memory.
 Which is OK, but it only works for 1TB, not for 2TB or above.


I've incorporated my review into a new version, attached.

Added TB to the docs, added the macro KB_PER_TB, and made SHOW print
1TB rather than 1024GB.

I tested several of the memory settings to see that they can be set and
retrieved.  I haven't tested actual execution as I don't have that kind
of RAM.

I don't see how it could have a performance impact; it passes make check
etc., and I don't think it warrants a new regression test.

I'll set it to ready for committer.

Cheers,

Jeff
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
new file mode 100644
index c7d84b5..940ed6e
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 39,45 ****
   For convenience,
   a different unit can also be specified explicitly.  Valid memory units
   are <literal>kB</literal> (kilobytes), <literal>MB</literal>
!  (megabytes), and <literal>GB</literal> (gigabytes); valid time units
   are <literal>ms</literal> (milliseconds), <literal>s</literal>
   (seconds), <literal>min</literal> (minutes), <literal>h</literal>
   (hours), and <literal>d</literal> (days).  Note that the multiplier
--- 39,45 ----
   For convenience,
   a different unit can also be specified explicitly.  Valid memory units
   are <literal>kB</literal> (kilobytes), <literal>MB</literal>
!  (megabytes), <literal>GB</literal> (gigabytes), and <literal>TB</literal> (terabytes); valid time units
   are <literal>ms</literal> (milliseconds), <literal>s</literal>
   (seconds), <literal>min</literal> (minutes), <literal>h</literal>
   (hours), and <literal>d</literal> (days).  Note that the multiplier
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
new file mode 100644
index ea16c64..0e5b0c9
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 105,110 ****
--- 105,111 ----
  
  #define KB_PER_MB (1024)
  #define KB_PER_GB (1024*1024)
+ #define KB_PER_TB (1024*1024*1024)
  
  #define MS_PER_S 1000
  #define S_PER_MIN 60
*************** parse_int(const char *value, int *result
*** 4837,4843 ****
  		{
  			/* Set hint for use if no match or trailing garbage */
  			if (hintmsg)
! 				*hintmsg = gettext_noop("Valid units for this parameter are \"kB\", \"MB\", and \"GB\".");
  
  #if BLCKSZ < 1024 || BLCKSZ > (1024*1024)
  #error BLCKSZ must be between 1KB and 1MB
--- 4838,4844 ----
  		{
  			/* Set hint for use if no match or trailing garbage */
  			if (hintmsg)
! 				*hintmsg = gettext_noop("Valid units for this parameter are \"kB\", \"MB\", \"GB\" and \"TB\".");
  
  #if BLCKSZ < 1024 || BLCKSZ > (1024*1024)
  #error BLCKSZ must be between 1KB and 1MB
*************** parse_int(const char *value, int *result
*** 4891,4896 ****
--- 4892,4913 ----
  					break;
  				}
  			}
+ 			else if (strncmp(endptr, "TB", 2) == 0)
+ 			{
+ 				endptr += 2;
+ 				switch (flags & GUC_UNIT_MEMORY)
+ 				{
+ 					case GUC_UNIT_KB:
+ 						val *= KB_PER_TB;
+ 						break;
+ 					case GUC_UNIT_BLOCKS:
+ 						val *= KB_PER_TB / (BLCKSZ / 1024);
+ 						break;
+ 					case GUC_UNIT_XBLOCKS:
+ 						val *= KB_PER_TB / (XLOG_BLCKSZ / 1024);
+ 						break;
+ 				}
+ 			}
  		}
  		else if (flags & GUC_UNIT_TIME)
  		{
*************** _ShowOption(struct config_generic * reco
*** 7384,7390 ****
  				break;
  		}
  
! 		if (result % KB_PER_GB == 0)
  		{
  			result /= KB_PER_GB;
  			unit = "GB";
--- 7401,7412 ----
  				break;
  		}
  
! 		if (result % KB_PER_TB == 0)
! 		{
! 			result /= KB_PER_TB;
! 			unit = "TB";
! 		}
! 		else if (result % KB_PER_GB == 0)
  		{
  			result /= KB_PER_GB;
  			unit = "GB";
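
For anyone who wants to sanity-check the arithmetic outside the server,
here is a small standalone model of the unit conversion; it is not the
guc.c code, and flag handling and overflow checks are simplified away:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define KB_PER_MB (INT64_C(1024))
#define KB_PER_GB (INT64_C(1024) * 1024)
#define KB_PER_TB (INT64_C(1024) * 1024 * 1024)

/* Simplified model of parse_int's memory-unit handling for a kB GUC. */
static int64_t
memory_setting_in_kb(int64_t val, const char *unit)
{
    if (strcmp(unit, "kB") == 0)
        return val;
    if (strcmp(unit, "MB") == 0)
        return val * KB_PER_MB;
    if (strcmp(unit, "GB") == 0)
        return val * KB_PER_GB;
    if (strcmp(unit, "TB") == 0)
        return val * KB_PER_TB;
    return -1;                  /* unknown unit */
}

int
main(void)
{
    printf("1TB = %lld kB\n", (long long) memory_setting_in_kb(1, "TB"));
    return 0;
}

Since the kB value of a memory GUC is stored in a 32-bit int, 2TB would
be 2^31 kB and overflow; that matches the observation upthread that the
setting works for 1TB but not for 2TB or above.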



Re: [HACKERS] Patch for fail-back without fresh backup

2013-06-17 Thread Amit Kapila
On Tuesday, June 18, 2013 12:18 AM Sawada Masahiko wrote:
 On Sun, Jun 16, 2013 at 2:00 PM, Amit kapila amit.kap...@huawei.com wrote:
  On Saturday, June 15, 2013 8:29 PM Sawada Masahiko wrote:
   On Sat, Jun 15, 2013 at 10:34 PM, Amit kapila amit.kap...@huawei.com wrote:
    On Saturday, June 15, 2013 1:19 PM Sawada Masahiko wrote:
     On Fri, Jun 14, 2013 at 10:15 PM, Amit Kapila amit.kap...@huawei.com wrote:
      On Friday, June 14, 2013 2:42 PM Samrat Revagade wrote:
       Hello,

       We have already started a discussion on pgsql-hackers for the
       problem of taking fresh backup during the failback operation;
       here is the link for that:

       http://www.postgresql.org/message-id/CAF8Q-Gxg3PQTf71NVECe-6OzRaew5pWhk7yQtbjgwrfu513...@mail.gmail.com

       Let me again summarize the problem we are trying to address.

      How will you take care of extra WAL on the old master during
      recovery?  If it replays WAL which has not reached the new
      master, it can be a problem.

     You mean that it is possible for the old master's data to be
     ahead of the new master's data?

    I mean to say that the WAL of the old master can be ahead of the
    new master.  I understood that the data files of the old master
    can't be ahead, but I think the WAL can be.

     So there is inconsistent data between those servers when failing
     back, right?  If so, no inconsistency is possible, because if you
     use the GUC option as he proposes (i.e. failback_safe_standby_mode
     = remote_flush), then while the old master is working fine, no
     file-system-level changes are done before the WAL is replicated.

    Would the proposed patch also take care that the old master's WAL
    is not ahead in some way?  If yes, I think I am missing some point.

   Yes, it will happen that the old master's WAL is ahead of the new
   master's WAL, as you said.  But I think we can solve that by
   deleting all WAL files when the old master starts as the new
   standby.

  I think ideally it should reset the WAL location to the point where
  the new master has forked off.  In such a scenario it would be
  difficult for a user who wants to get a dump of some data on the old
  master which hasn't gone to the new master.  I am not sure whether
  such a need exists for real users, but if it does, this solution
  will have some drawbacks.

 I think that we can dump the data before deleting all WAL files.  All
 WAL file deletion is done when the old master starts as the new
 standby.

Can we dump data without starting the server?

With Regards,
Amit Kapila.  


