Re: [HACKERS] 9.2.3 crashes during archive recovery

2013-03-06 Thread Heikki Linnakangas

On 05.03.2013 14:09, KONDO Mitsumasa wrote:

Hi,

Horiguchi's patch does not seem to record minRecoveryPoint in ReadRecord();
the attached patch records minRecoveryPoint.
[crash recovery -> record minRecoveryPoint in control file -> archive
recovery]
I think that this is the original intention of Heikki's patch.


Yeah. That fix isn't right, though; XLogPageRead() is supposed to return 
true on success, and false on error, and the patch makes it return 
'true' on error, if archive recovery was requested but we're still in 
crash recovery. The real issue here is that I missed the two return 
NULL;s in ReadRecord(), so the code that I put in the 
next_record_is_invalid codepath isn't run if XLogPageRead() doesn't find 
the file at all. Attached patch is the proper fix for this.



I also found a bug in the latest 9.2_stable. It does not get the latest
timeline and recovery history file in archive recovery when the master and
standby timelines are different.


Works for me.. Can you create a test script for that? Remember to set 
recovery_target_timeline='latest'.


- Heikki
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 92adc4e..74a54f6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4010,7 +4010,7 @@ ReadRecord(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt)
 retry:
 	/* Read the page containing the record */
 	if (!XLogPageRead(RecPtr, emode, fetching_ckpt, randAccess))
-		return NULL;
+		goto next_record_is_invalid;
 
 	pageHeaderSize = XLogPageHeaderSize((XLogPageHeader) readBuf);
 	targetRecOff = RecPtr->xrecoff % XLOG_BLCKSZ;
@@ -4168,7 +4168,7 @@ retry:
 			}
 			/* Wait for the next page to become available */
 			if (!XLogPageRead(pagelsn, emode, false, false))
-				return NULL;
+				goto next_record_is_invalid;
 
 			/* Check that the continuation record looks valid */
 			if (!(((XLogPageHeader) readBuf)->xlp_info & XLP_FIRST_IS_CONTRECORD))



Re: [HACKERS] Materialized views WIP patch

2013-03-06 Thread Simon Riggs
On 5 March 2013 22:02, Tom Lane t...@sss.pgh.pa.us wrote:

 FWIW, my opinion is that doing anything like this in the planner is
 going to be enormously expensive.  Index matching is already pretty
 expensive, and that has the saving grace that we only do it once per
 base relation.  Your sketch above implies trying to match to MVs once
 per considered join relation, which will be combinatorially worse.
 Even with a lot of sweat over reducing the cost of the matching, it
 will hurt.

As we already said: no MVs = zero overhead = no problem. It costs in
the cases where time savings are possible and not otherwise. The type
of queries and their typical run times are different to the OLTP case,
so any additional planning time is likely to be acceptable. I'm sure
we can have a deep discussion about how to optimise the planner for
this; I'm pretty sure there are reasonable answers to the not-small
difficulties along that road.

Most importantly, I want to make sure we don't swing the door shut on
improvements here, especially if you (Tom) are not personally
convinced enough to do the work yourself.

Making transactional MVs work would be in part justified by the
possible existence of automatic lookaside planning, so yes, the two
things are not linked but certainly closely related.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Simon Riggs
On 5 March 2013 09:35, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 Are there objectors?


 In addition to my hostility towards this patch in general, there are some
 specifics in the patch I'd like to raise (read out in a grumpy voice):

;-)  We all want to make the right choice here, so all viewpoints
gratefully received so we can decide.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Simon Riggs
On 5 March 2013 18:02, Jeff Davis pg...@j-davis.com wrote:

 Fletcher is probably significantly faster than CRC-16, because I'm just
 doing int32 addition in a tight loop.

 Simon originally chose Fletcher, so perhaps he has more to say.

IIRC the research showed Fletcher was significantly faster for only a
small loss in error detection rate.

It was sufficient to make our error detection > 1 million times
better, possibly more. That seems sufficient to enable early detection
of problems, since if we missed the first error, a second is very
likely to be caught (etc). So I am assuming that we're trying to catch
a pattern of errors early, rather than guarantee we can catch the very
first error.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Andres Freund
On 2013-03-06 13:21:27 +0900, Michael Paquier wrote:
 Please find attached updated patch realigned with your comments. You can
 find my answers inline...
 The only thing that needs clarification is the comment about
 UNIQUE_CHECK_YES/UNIQUE_CHECK_NO. Except that all the other things are
 corrected or adapted to what you wanted. I am also including now tests for
 matviews.
 
 On Wed, Mar 6, 2013 at 1:49 AM, Andres Freund and...@2ndquadrant.comwrote:
 
  On 2013-03-05 22:35:16 +0900, Michael Paquier wrote:
 
+ for (count = 0; count < num_indexes; count++)
+     index_insert(toastidxs[count], t_values, t_isnull,
+                  &(toasttup->t_self),
+                  toastrel,
+                  toastidxs[count]->rd_index->indisunique ?
+                  UNIQUE_CHECK_YES : UNIQUE_CHECK_NO);
   
The indisunique check looks like a copy & pasto to me, albeit not
yours...
   
   Yes it is the same for all the indexes normally, but it looks more solid
  to
   me to do that as it is. So unchanged.
 
  Hm, if the toast indexes aren't unique anymore loads of stuff would be
  broken. Anyway, not your fault.
 
 I definitely cannot understand where you are going here. Could you be more
 explicit? Why could this be a problem? Without my patch a similar check is
 used for toast indexes.

There's no problem. I just dislike the pointless check which caters for
a situation that doesn't exist...
Forget it, sorry.

+ if (count == 0)
+     snprintf(NewToastName, NAMEDATALEN,
+              "pg_toast_%u_index", OIDOldHeap);
+ else
+     snprintf(NewToastName, NAMEDATALEN,
+              "pg_toast_%u_index_cct%d", OIDOldHeap, count);
+ RenameRelationInternal(lfirst_oid(lc), NewToastName);
+ count++;
+ }
   
Hm. It seems wrong that this layer needs to know about _cct.
   
Any other idea? For the time being I removed cct and added only a suffix
   based on the index number...
 
  Hm. It seems like throwing an error would be sufficient, that path is
  only entered for shared catalogs, right? Having multiple toast indexes
  would be a bug.
 
 Don't think so. Even if now those APIs are used only for catalog tables, I
 do not believe that this function has been designed to be used only with
 shared catalogs. Removing the cct suffix makes sense though...

Forget what I said.

+ /*
+  * Index is considered as a constraint if it is PRIMARY KEY or EXCLUSION.
+  */
+ isconstraint = indexRelation->rd_index->indisprimary ||
+                indexRelation->rd_index->indisexclusion;
   
unique constraints aren't mattering here?
   
   No they are not. Unique indexes are not counted as constraints in the
  case
   of index_create. Previous versions of the patch did that but there are
   issues with unique indexes using expressions.
 
  Hm. index_create's comment says:
   * isconstraint: index is owned by PRIMARY KEY, UNIQUE, or EXCLUSION
  constraint
 
  There are unique indexes that are constraints and some that are
  not. Looking at ->indisunique is not sufficient to determine whether it's
  one or not.
 
 Hum... OK. I changed that using a method based on get_index_constraint for
 a given index. So if the constraint Oid is invalid, it means that this
 index has no constraints and its concurrent entry won't create an index in
 consequence. It is more stable this way.

Sounds good. Just to make that clear:
To get a unique index without constraint:
CREATE TABLE table_u(id int, data int);
CREATE UNIQUE INDEX table_u__data ON table_u(data);
To get a constraint:
ALTER TABLE table_u ADD CONSTRAINT table_u__id_unique UNIQUE(id);
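
As a side note for readers, a minimal sketch of the get_index_constraint()-based
check Michael describes above; get_index_constraint() is an existing backend
function (catalog/dependency.h), while the wrapper name is invented for
illustration and is not part of the patch:

#include "postgres.h"
#include "catalog/dependency.h"     /* get_index_constraint() */

/*
 * Illustrative helper: an index backs a constraint (PRIMARY KEY, UNIQUE or
 * EXCLUSION) iff get_index_constraint() returns a valid OID; a plain
 * CREATE UNIQUE INDEX yields InvalidOid here.
 */
static bool
index_is_constraint_backed(Oid indexOid)
{
    return OidIsValid(get_index_constraint(indexOid));
}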

 + /*
 +  * Phase 3 of REINDEX CONCURRENTLY
 +  *
 +  * During this phase the concurrent indexes catch up with the
INSERT that
 +  * might have occurred in the parent table and are marked as
  valid
once done.
 +  *
 +  * We once again wait until no transaction can have the table
  open
with
 +  * the index marked as read-only for updates. Each index
validation is done
 +  * with a separate transaction to avoid opening transaction
  for an
 +  * unnecessary too long time.
 +  */
   
Maybe I am being dumb because I have the feeling I said differently in
the past, but why do we not need a WaitForMultipleVirtualLocks() here?
The comment seems to say we need to do so.
   
   Yes you said the contrary in a previous review. The purpose of this
   function is to first gather the locks and then 

Re: [HACKERS] WIP: index support for regexp search

2013-03-06 Thread Alexander Korotkov
On Wed, Jan 23, 2013 at 7:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 23.01.2013 09:36, Alexander Korotkov wrote:
  On Wed, Jan 23, 2013 at 6:08 AM, Tom Lanet...@sss.pgh.pa.us  wrote:
  The biggest problem is that I really don't care for the idea of
  contrib/pg_trgm being this cozy with the innards of regex_t.

  The only option I see now is to provide a method like export_cnfa
 which
  would export corresponding CNFA in fixed format.

  Yeah, I think that makes sense. The transformation code in trgm_regexp.c
  would probably be more readable too, if it didn't have to deal with the
  regex guts representation of the CNFA. Also, once you have intermediate
  representation of the original CNFA, you could do some of the
  transformation work on that representation, before building the
  transformed graph containing trigrams. You could eliminate any
  non-alphanumeric characters, joining states connected by arcs with
  non-alphanumeric characters, for example.

 It's not just the CNFA though; the other big API problem is with mapping
 colors back to characters.  Right now, that not only knows way too much
 about a part of the regex internals we have ambitions to change soon,
 but it also requires pg_wchar2mb_with_len() and lowerstr(), neither of
 which should be known to the regex library IMO.  So I'm not sure how we
 divvy that up sanely.  To be clear: I'm not going to insist that we have
 to have a clean API factorization before we commit this at all.  But it
 worries me if we don't even know how we could get to that, because we
 are going to need it eventually.


Now, we probably don't have enough time before 9.3 to solve an API
problem :(. It's likely we have to choose either to commit to 9.3 without
clean API factorization or to postpone it to 9.4.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] Writable foreign tables: how to identify rows

2013-03-06 Thread Andres Freund
On 2013-03-05 19:30:53 -0500, Tom Lane wrote:
 One of the core problems for a writable-foreign-tables feature is how
 to identify a previously-fetched row for UPDATE or DELETE actions.
 In an ordinary Postgres table, we use the ctid system column for that,
 but a remote table doesn't necessarily have such a thing.
 ...
 For postgres_fdw, that would really be enough, since it could just
 cause a ctid column to be created with the usual definition.  Then
 it could put the remote ctid into the usual t_self field in returned
 tuples.
 
 Supporting magic identifiers that aren't of the TID data type is
 considerably harder, mainly because it's not clear how heap_getsysattr()
 could know how to fetch the column value.  I have some rough ideas
 about that, but I suggest that we might want to punt on that extension
 for the time being.

What about just making it a bytea in FDWs? Now it's not nice to waste
space for a (probably 1-byte) bytea header, but it's not too bad either.
The FDW author then needs to ensure only the correct data ends up in
that system column.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [GENERAL] Floating point error

2013-03-06 Thread Albe Laurenz
Maciek Sakrejda wrote:
 On Tue, Mar 5, 2013 at 12:03 AM, Albe Laurenz laurenz.a...@wien.gv.at wrote:
 I don't think that it is about looking nice.
 C doesn't promise you more than FLT_DIG or DBL_DIG digits of
 precision, so PostgreSQL cannot either.

 If you allow more, that would mean that if you store the same
 number on different platforms and query it, it might come out
 differently.  Among other things, that would be a problem for
 the regression tests.
 
 Thank you: I think this is what I was missing, and what wasn't clear
 from the proposed doc patch. But then how can pg_dump assume that it's
 always safe to set extra_float_digits = 3? Why the discrepancy between
 default behavior and what pg_dump gets? It can't know whether the dump
 is to be restored into the same system or a different one (and AFAICT,
 there's not even an option to tweak extra_float_digits there).

How about this elaboration?

Yours,
Laurenz Albe


float-2.patch
Description: float-2.patch



Re: [HACKERS] Writable foreign tables: how to identify rows

2013-03-06 Thread Shigeru Hanada
On Wed, Mar 6, 2013 at 12:35 PM, Pavan Deolasee
pavan.deola...@gmail.com wrote:
 In the context of postgres_fdw, I am not sure if we need an additional
 system column like a node_id. Would there be a possibility where
 tuples to-be-modified are coming from different foreign tables and at
 runtime we need to decide where to execute the UPDATE/DELETE operation
 ? If we start supporting inheritance involving foreign tables as child
 tables, this will become a reality.

Foreign tables have a tableoid system column which holds the pg_class.oid of
the foreign table.  It seems sufficient to determine the source server.

-- 
Shigeru HANADA




[HACKERS] Materialized View patch broke pg_dump

2013-03-06 Thread Bernd Helmle
It looks like the recent matview patch broke pg_dump in a way which makes
it impossible to dump 9.1 and 9.2 databases.


it fails with

pg_dump: [Archivierer (DB)] query failed: ERROR:  function 
pg_relation_is_scannable(oid) does not exist


Looking into this issue, it seems the version check in getTables() of 
pg_dump.c is wrong. Shouldn't the check be


if (fout->remoteVersion >= 90300)
{

}

since this is where pg_relation_is_scannable() is introduced?

--
Thanks

Bernd




Re: [HACKERS] Performance Improvement by reducing WAL for Update Operation

2013-03-06 Thread Amit Kapila
On Wednesday, March 06, 2013 2:57 AM Heikki Linnakangas wrote:
 On 04.03.2013 06:39, Amit Kapila wrote:
  On Sunday, March 03, 2013 8:19 PM Craig Ringer wrote:
  On 02/05/2013 11:53 PM, Amit Kapila wrote:
  Performance data for the patch is attached with this mail.
  Conclusions from the readings (these are same as my previous
 patch):
 
 
 I've been doing investigating the pglz option further, and doing
 performance comparisons of the pglz approach and this patch. I'll begin
 with some numbers:
 
 unpatched (63d283ecd0bc5078594a64dfbae29276072cdf45):
 
          testname                        | wal_generated |     duration
 -----------------------------------------+---------------+------------------
  two short fields, no change             |    1245525360 | 9.94613695144653
  two short fields, one changed           |    1245536528 | 10.146910905838
  two short fields, both changed          |    1245523160 | 11.2332470417023
  one short and one long field, no change |    1054926504 | 5.90477800369263
  ten tiny fields, all changed            |    1411774608 | 13.4536008834839
  hundred tiny fields, all changed        |     635739680 | 7.57448387145996
  hundred tiny fields, half changed       |     636930560 | 7.56888699531555
  hundred tiny fields, half nulled        |     573751120 | 6.68991994857788
 
 Amit's wal_update_changes_v10.patch:
 
          testname                        | wal_generated |     duration
 -----------------------------------------+---------------+------------------
  two short fields, no change             |    1249722112 | 13.0558869838715
  two short fields, one changed           |    1246145408 | 12.9947438240051
  two short fields, both changed          |    1245951056 | 13.0262880325317
  one short and one long field, no change |     678480664 | 5.70031690597534
  ten tiny fields, all changed            |    1328873920 | 20.0167419910431
  hundred tiny fields, all changed        |     638149416 | 14.4236788749695
  hundred tiny fields, half changed       |     635560504 | 14.8770561218262
  hundred tiny fields, half nulled        |     558468352 | 16.2437210083008
 
 pglz-with-micro-optimizations-1.patch:
 
          testname                        | wal_generated |     duration
 -----------------------------------------+---------------+------------------
  two short fields, no change             |    1245519008 | 11.6702048778534
  two short fields, one changed           |    1245756904 | 11.3233819007874
  two short fields, both changed          |    1249711088 | 11.6836447715759
  one short and one long field, no change |     664741392 | 6.44810795783997
  ten tiny fields, all changed            |    1328085568 | 13.9679481983185
  hundred tiny fields, all changed        |     635974088 | 9.15514206886292
  hundred tiny fields, half changed       |     636309040 | 9.13769292831421
  hundred tiny fields, half nulled        |     496396448 | 8.77351498603821

For some of the tests, it doesn't even execute the main part of
compression/encoding.
The reason is that the length of the tuple is less than the strategy's
minimum length, so it returns from the check below in pglz_delta_encode():
if (strategy->match_size_good <= 0 ||
    slen < strategy->min_input_size ||
    slen > strategy->max_input_size)
    return false;

The tests for which it doesn't execute encoding are below:
two short fields, no change
two short fields, one changed 
two short fields, both changed
ten tiny fields, all changed 


For the above cases, the difference in timings between both approaches and
the original is probably because this check is done only after some
processing. So I think if we check the length in log_heap_update, then
there should not be a timing difference for the above test scenarios. I can
check that once.

This optimization helps only when the tuple length is > 128~200 bytes and
up to 1800 bytes (till it turns to TOAST); otherwise it could result in
overhead without any major WAL reduction.
In fact, I think one of my initial patches had a check: if the length of the
tuple is greater than 128 bytes, then perform the optimization.

I shall try to run both patches for cases when the length of the tuple is
> 128~200 bytes, as this optimization has benefits in those cases.
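
For illustration only, a strategy along those lines could look like the
following, assuming the standard PGLZ_Strategy struct from
utils/pg_lzcompress.h; the 128/2000-byte bounds echo the numbers above, the
remaining members are guesses in the spirit of PGLZ_strategy_default, and
pglz_delta_encode() refers to the patch under discussion:

#include "postgres.h"
#include "utils/pg_lzcompress.h"

/*
 * Hypothetical strategy for delta encoding: skip tuples shorter than ~128
 * bytes (little to gain) and longer than ~2000 bytes (headed to TOAST
 * anyway).  All values are illustrative, not measured.
 */
static const PGLZ_Strategy delta_strategy_data = {
    128,        /* min_input_size */
    2000,       /* max_input_size */
    25,         /* min_comp_rate */
    1024,       /* first_success_by */
    128,        /* match_size_good */
    10          /* match_size_drop */
};

const PGLZ_Strategy *const wal_delta_strategy = &delta_strategy_data;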

 
 In each test, a table is created with a large number of identical rows,
 and fillfactor=50. Then a full-table UPDATE is performed, and the
 UPDATE is timed. Duration is the time spent in the UPDATE (lower is
 better), and wal_generated is the amount of WAL generated by the
 updates (lower is better).
 
 The summary is that Amit's patch is a small win in terms of CPU usage,
 in the best case where the table has few columns, with one 

Re: [HACKERS] Writable foreign tables: how to identify rows

2013-03-06 Thread Pavan Deolasee


On 06-Mar-2013, at 4:12 PM, Shigeru Hanada shigeru.han...@gmail.com wrote:

 On Wed, Mar 6, 2013 at 12:35 PM, Pavan Deolasee
 pavan.deola...@gmail.com wrote:
 In the context of postgres_fdw, I am not sure if we need an additional
 system column like a node_id. Would there be a possibility where
 tuples to-be-modified are coming from different foreign tables and at
 runtime we need to decide where to execute the UPDATE/DELETE operation
 ? If we start supporting inheritance involving foreign tables as child
 tables, this will become a reality.
 
 Foreign tables have tableoid system column which holds pg_class.oid of
 the foreign table.  It seems sufficient to determine source server.
 

I think you are right. In the context of foreign tables, tableoid might be 
enough to get the source. Unfortunately, Postgres-XC had not yet leveraged 
foreign tables fully and hence special handling was required.

Thanks,
Pavan


 -- 
 Shigeru HANADA




Re: [HACKERS] Writable foreign tables: how to identify rows

2013-03-06 Thread Shigeru Hanada
On Wed, Mar 6, 2013 at 9:30 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 For postgres_fdw, that would really be enough, since it could just
 cause a ctid column to be created with the usual definition.  Then
 it could put the remote ctid into the usual t_self field in returned
 tuples.

I'm not sure whether postgres_fdw should support it, but updatable views
have no system columns, including ctid.  So we need a magic identifier,
perhaps a set of primary key value(s), to support updating
remote updatable views via foreign tables.

Just a heads up.

-- 
Shigeru HANADA




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Andres Freund
On 2013-03-06 20:59:37 +0900, Michael Paquier wrote:
 OK. Patches updated... Please see attached.
 With all the work done on those patches, I suppose this is close to being
 something clean...

Yes, it's looking good. There are loads of improvements possible but
those can very well be made incrementally.
  I have the feeling we are talking past each other. Unless I miss
  something *there is no* WaitForMultipleVirtualLocks between phase 2 and
  3. But one WaitForMultipleVirtualLocks for all would be totally
  sufficient.
 
 OK, sorry for the confusion. I added a call to WaitForMultipleVirtualLocks
 also before phase 3.
 Honestly, I am still not very comfortable with the fact that the ShareLock
 wait on parent relation is done outside each index transaction for build
 and validation... Changed as requested though...

Could you detail your concerns a bit? I tried to think it through
multiple times now and I still can't see a problem. The lock only
ensures that nobody has the relation open with the old index definition
in mind...

Andres

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Michael Paquier
On Wed, Mar 6, 2013 at 9:09 PM, Andres Freund and...@2ndquadrant.comwrote:

 On 2013-03-06 20:59:37 +0900, Michael Paquier wrote:
  OK. Patches updated... Please see attached.
  With all the work done on those patches, I suppose this is close to being
  something clean...

 Yes, its looking good. There are loads of improvements possible but
 those can very well be made incrementally.
   I have the feeling we are talking past each other. Unless I miss
   something *there is no* WaitForMultipleVirtualLocks between phase 2 and
   3. But one WaitForMultipleVirtualLocks for all would be totally
   sufficient.
  
  OK, sorry for the confusion. I added a call to
 WaitForMultipleVirtualLocks
  also before phase 3.
  Honestly, I am still not very comfortable with the fact that the
 ShareLock
  wait on parent relation is done outside each index transaction for build
  and validation... Changed as requested though...

 Could you detail your concerns a bit? I tried to think it through
 multiple times now and I still can't see a problem. The lock only
 ensures that nobody has the relation open with the old index definition
 in mind...

I am making a comparison with CREATE INDEX CONCURRENTLY where the ShareLock
wait is made inside the build and validation transactions. Was there any
particular reason why CREATE INDEX CONCURRENTLY wait is done inside a
transaction block?
That's my only concern.
-- 
Michael


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Andres Freund
On 2013-03-06 21:19:57 +0900, Michael Paquier wrote:
 On Wed, Mar 6, 2013 at 9:09 PM, Andres Freund and...@2ndquadrant.comwrote:
 
  On 2013-03-06 20:59:37 +0900, Michael Paquier wrote:
   OK. Patches updated... Please see attached.
   With all the work done on those patches, I suppose this is close to being
   something clean...
 
  Yes, its looking good. There are loads of improvements possible but
  those can very well be made incrementally.
I have the feeling we are talking past each other. Unless I miss
something *there is no* WaitForMultipleVirtualLocks between phase 2 and
3. But one WaitForMultipleVirtualLocks for all would be totally
sufficient.
   
   OK, sorry for the confusion. I added a call to
  WaitForMultipleVirtualLocks
   also before phase 3.
   Honestly, I am still not very comfortable with the fact that the
  ShareLock
   wait on parent relation is done outside each index transaction for build
   and validation... Changed as requested though...
 
  Could you detail your concerns a bit? I tried to think it through
  multiple times now and I still can't see a problem. The lock only
  ensures that nobody has the relation open with the old index definition
  in mind...
 
 I am making a comparison with CREATE INDEX CONCURRENTLY where the ShareLock
 wait is made inside the build and validation transactions. Was there any
 particular reason why CREATE INDEX CONCURRENTLY wait is done inside a
 transaction block?
 That's my only concern.

Well, it needs to be executed in a transaction because it needs a valid
resource owner and a previous CommitTransactionCommand() will leave that
at NULL. And there is no reason in the single-index case of CREATE INDEX
CONCURRENTLY to do it in a separate transaction.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: index support for regexp search

2013-03-06 Thread Stephen Frost
* Alexander Korotkov (aekorot...@gmail.com) wrote:
 Now, we probably don't have enough of time before 9.3 to solve an API
 problem :(. It's likely we have to choose either commit to 9.3 without
 clean API factorization or postpone it to 9.4.

As much as I'd like this to get in, I don't think there's a question
here -- it should be punted to 9.4.  These don't look like small issues to
me.

Thanks,

Stephen




Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Heikki Linnakangas

On 06.03.2013 10:41, Simon Riggs wrote:

On 5 March 2013 18:02, Jeff Davispg...@j-davis.com  wrote:


Fletcher is probably significantly faster than CRC-16, because I'm just
doing int32 addition in a tight loop.

Simon originally chose Fletcher, so perhaps he has more to say.


IIRC the research showed Fletcher was significantly faster for only a
small loss in error detection rate.

It was sufficient to make our error detection > 1 million times
better, possibly more. That seems sufficient to enable early detection
of problems, since if we missed the first error, a second is very
likely to be caught (etc). So I am assuming that we're trying to catch
a pattern of errors early, rather than guarantee we can catch the very
first error.


Fletcher's checksum is good in general, I was mainly worried about 
truncating the Fletcher-64 into two 8-bit values. I can't spot any 
obvious weakness in it, but if it's indeed faster and as good as a 
straightforward Fletcher-16, I wonder why that method is not more widely 
used.


Another thought is that perhaps something like CRC32C would be faster to 
calculate on modern hardware, and could be safely truncated to 16-bits 
using the same technique you're using to truncate the Fletcher's 
Checksum. Greg's tests showed that the overhead of CRC calculation is 
significant in some workloads, so it would be good to spend some time to 
optimize that. It'd be difficult to change the algorithm in a future 
release without breaking on-disk compatibility, so let's make sure we 
pick the best one.
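
As a rough illustration of that idea (not proposed PostgreSQL code), a
CRC32C over a buffer using the SSE4.2 intrinsic, folded down to 16 bits;
alignment, trailing bytes and runtime CPU detection are ignored, and the
particular 16-bit fold is just one possible choice:

#include <nmmintrin.h>      /* SSE4.2 intrinsics; build with -msse4.2 */
#include <stddef.h>
#include <stdint.h>

static uint16_t
crc32c_16(const void *data, size_t len)
{
    const uint64_t *p = (const uint64_t *) data;
    uint64_t    crc = 0xFFFFFFFF;

    for (size_t i = 0; i < len / 8; i++)
        crc = _mm_crc32_u64(crc, p[i]);     /* 8 input bytes per instruction */

    crc ^= 0xFFFFFFFF;
    return (uint16_t) (crc ^ (crc >> 16));  /* fold 32 bits down to 16 */
}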


- Heikki




Re: [HACKERS] Materialized views WIP patch

2013-03-06 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 5 March 2013 22:02, Tom Lane t...@sss.pgh.pa.us wrote:
 FWIW, my opinion is that doing anything like this in the planner is
 going to be enormously expensive.

 As we already said: no MVs = zero overhead = no problem.

Well, in the first place that statement is false on its face: we'll
still spend cycles looking for relevant MVs, or at least maintaining a
complexly-indexed cache that helps us find out that there are none in
a reasonable amount of time.  In the second place, even if it were
approximately true it wouldn't help the people who were using MVs.

 It costs in
 the cases where time savings are possible and not otherwise.

And that is just complete nonsense: matching costs whether you find a
match or not.  Could we have a little less Pollyanna-ish optimism and
a bit more realism about the likely cost of such a feature?

regards, tom lane




Re: [HACKERS] [GENERAL] Floating point error

2013-03-06 Thread Florian Weimer

On 03/05/2013 07:23 PM, Tom Lane wrote:

Maciek Sakrejda m.sakre...@gmail.com writes:

Thank you: I think this is what I was missing, and what wasn't clear
from the proposed doc patch. But then how can pg_dump assume that it's
always safe to set extra_float_digits = 3?


It's been proven (don't have a link handy, but the paper is at least
a dozen years old) that 3 extra digits are sufficient to accurately
reconstruct any IEEE single or double float value, given properly
written conversion functions in libc.  So that's where that number comes
from.  Now, if either end is not using IEEE floats, you may or may not
get equivalent results --- but it's pretty hard to make any guarantees
at all in such a case.


There's also gdtoa, which returns the shortest decimal representation 
which rounds back to the same binary value.  It would print 0.1 as 0.1, but 
0.1 + 0.2 as 0.30000000000000004.
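
To make the round-trip point concrete, a small standalone C illustration
(not PostgreSQL code) of why DBL_DIG digits can lose the value while
DBL_DIG + 3 digits, i.e. the extra_float_digits = 3 that pg_dump uses,
restores it exactly on IEEE-754 platforms:

#include <float.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    double  x = 0.1 + 0.2;      /* 0.30000000000000004... */
    char    buf[64];

    snprintf(buf, sizeof(buf), "%.*g", DBL_DIG, x);     /* 15 digits: "0.3" */
    printf("%-22s round-trips: %s\n", buf,
           strtod(buf, NULL) == x ? "yes" : "no");      /* prints "no" */

    snprintf(buf, sizeof(buf), "%.*g", DBL_DIG + 3, x); /* 18 digits */
    printf("%-22s round-trips: %s\n", buf,
           strtod(buf, NULL) == x ? "yes" : "no");      /* prints "yes" */
    return 0;
}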


--
Florian Weimer / Red Hat Product Security Team




Re: [HACKERS] Optimizing pglz compressor

2013-03-06 Thread Joachim Wieland
On Tue, Mar 5, 2013 at 8:32 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 With these tweaks, I was able to make pglz-based delta encoding perform
 roughly as well as Amit's patch.

Out of curiosity, do we know how pglz compares with other algorithms, e.g. lz4 ?




Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Andres Freund
On 2013-03-06 13:34:21 +0200, Heikki Linnakangas wrote:
 On 06.03.2013 10:41, Simon Riggs wrote:
 On 5 March 2013 18:02, Jeff Davispg...@j-davis.com  wrote:
 
 Fletcher is probably significantly faster than CRC-16, because I'm just
 doing int32 addition in a tight loop.
 
 Simon originally chose Fletcher, so perhaps he has more to say.
 
 IIRC the research showed Fletcher was significantly faster for only a
 small loss in error detection rate.
 
 It was sufficient to make our error detection > 1 million times
 better, possibly more. That seems sufficient to enable early detection
 of problems, since if we missed the first error, a second is very
 likely to be caught (etc). So I am assuming that we're trying to catch
 a pattern of errors early, rather than guarantee we can catch the very
 first error.
 
 Fletcher's checksum is good in general, I was mainly worried about
 truncating the Fletcher-64 into two 8-bit values. I can't spot any obvious
 weakness in it, but if it's indeed faster and as good as a straightforward
 Fletcher-16, I wonder why that method is not more widely used.

I personally am not that convinced that Fletcher is such a good choice given
that it afaik doesn't distinguish between all-zero and all-one runs that are
long enough.
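
To illustrate that concern with a textbook mod-255 Fletcher-16 (an
assumption here; the patch folds 32-bit sums differently): 0x00 and 0xFF
both reduce to 0 modulo 255, so an all-zero block and an all-0xFF block of
the same length get the same checksum:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static uint16_t
fletcher16(const uint8_t *data, size_t len)
{
    uint16_t    sum1 = 0, sum2 = 0;

    for (size_t i = 0; i < len; i++)
    {
        sum1 = (sum1 + data[i]) % 255;  /* 0x00 and 0xFF both contribute 0 */
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t) ((sum2 << 8) | sum1);
}

int
main(void)
{
    uint8_t     zeros[8192], ones[8192];

    memset(zeros, 0x00, sizeof(zeros));
    memset(ones, 0xFF, sizeof(ones));
    /* both print 0: the two blocks are indistinguishable */
    printf("%u %u\n", fletcher16(zeros, sizeof(zeros)),
           fletcher16(ones, sizeof(ones)));
    return 0;
}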

 Another thought is that perhaps something like CRC32C would be faster to
 calculate on modern hardware, and could be safely truncated to 16-bits using
 the same technique you're using to truncate the Fletcher's Checksum. Greg's
 tests showed that the overhead of CRC calculation is significant in some
 workloads, so it would be good to spend some time to optimize that. It'd be
 difficult to change the algorithm in a future release without breaking
 on-disk compatibility, so let's make sure we pick the best one.

I had implemented a noticeably faster CRC32 implementation somewhere
around 201005202227.49990.and...@anarazel.de . I have since repeatedly
seen pg's CRC32 implementation being a major limitation, so I think
brushing up that patch would be a good idea.
We might think about switching the polynomial for WAL at the same time,
given, as you say, CRC32c is available in hardware. The bigger problem
is probably stuff like the control file et al.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Materialized views WIP patch

2013-03-06 Thread Kevin Grittner
Tatsuo Ishii is...@postgresql.org wrote:

 Was the remaining work on docs done? I would like to test MVs and
 am waiting for the docs completed.

I think they are done.  If you notice anything missing or in need
of clarification please let me know.  At this point missing docs
would be a bug in need of a fix.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Materialized View patch broke pg_dump

2013-03-06 Thread Kevin Grittner
Bernd Helmle maili...@oopsware.de wrote:

 It looks like the recent matview patch broke pg_dump in a way which makes it
 impossible to dump 9.1 and 9.2 databases.

 it fails with

 pg_dump: [Archivierer (DB)] query failed: ERROR:  function
 pg_relation_is_scannable(oid) does not exist

 Looking into this issue, it seems the version check in getTables() of 
 pg_dump.c
 is wrong. Shouldn't the check be

 if (fout->remoteVersion >= 90300)
 {

 }

 since this is where pg_relation_is_scannable() is introduced?

Right.  Will fix.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Optimizing pglz compressor

2013-03-06 Thread Merlin Moncure
On Wed, Mar 6, 2013 at 8:32 AM, Joachim Wieland j...@mcknight.de wrote:
 On Tue, Mar 5, 2013 at 8:32 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
 With these tweaks, I was able to make pglz-based delta encoding perform
 roughly as well as Amit's patch.

 Out of curiosity, do we know how pglz compares with other algorithms, e.g. 
 lz4 ?

This has been a subject of much recent discussion. It compares very
poorly, but installing a new compressor tends to be problematic due to
patent concerns (something which I disagree with but it's there).  All
that said, Heikki's proposed changes seem to be low risk and quite
fast.

merlin




Re: [HACKERS] Materialized View patch broke pg_dump

2013-03-06 Thread Kevin Grittner
Bernd Helmle maili...@oopsware.de wrote:

 Looking into this issue, it seems the version check in getTables() of 
 pg_dump.c
 is wrong. Shouldn't the check be

 if (fout->remoteVersion >= 90300)
 {

 }

 since this is where pg_relation_is_scannable() is introduced?

Fixed.

Thanks for the report!

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Materialized views WIP patch

2013-03-06 Thread Peter Eisentraut
Kevin,

I haven't seen a reply to this.  Were you able to give my notes below
any consideration?


On 2/15/13 12:44 PM, Peter Eisentraut wrote:
 On 1/25/13 1:00 AM, Kevin Grittner wrote:
 New patch rebased, fixes issues raised by Thom Brown, and addresses
 some of your points.
 
 This patch doesn't apply anymore, so I just took a superficial look.  I
 think the intended functionality and the interfaces look pretty good.
 Documentation looks complete, tests are there.
 
 I have a couple of notes:
 
 * What you call WITH [NO] DATA, Oracle calls BUILD IMMEDIATE/DEFERRED.
 It might be better to use that as well then.
 
 * You use fields named relkind in the parse nodes, but they don't
 actually contain relkind values, which is confusing.  I'd just name the
 field is_matview or something.
 
 * More generally, I wouldn't be so fond of combining the parse handling
 of CREATE TABLE AS and CREATE MATERIALIZED VIEW.  They are similar, but
 then again so are a lot of other things.
 
 * Some of the terminology is inconsistent.  A materialized view is
 sometimes called valid, populated, or built, with approximately the same
 meaning.  Personally, I would settle on built, as per above, but it
 should be one term only.
 
 * I find the name of the relisvalid column a bit confusing.  Especially
 because it only applies to materialized views, and there is already a
 meaning of valid for indexes.  (Recall that indexes are also stored in
 pg_class, but they are concerned about indisvalid.)  I would name it
 something like relmvbuilt.
 
 
 Btw., half of the patch seems to consist of updating places referring to
 relkind.  Is something wrong with the meaning of relkind that this sort
 of thing is required?  Maybe these places should be operating in terms
 of features, not accessing relkind directly.
 
 
 





Re: [HACKERS] Materialized views WIP patch

2013-03-06 Thread Greg Stark
On Tue, Mar 5, 2013 at 9:08 PM, Robert Haas robertmh...@gmail.com wrote:
 All that having been said, it's hard for me to imagine that anyone
 really cares about any of this until we have an incremental update
 feature, which right now we don't.  Actually, I'm betting that's going
 to be significantly harder than automatic-query-rewrite, when all is
 said and done.  Automatic-query-rewrite, if and when we get it, will
 not be easy and will require a bunch of work from someone with a good
 understanding of the planner, but it strikes me as the sort of thing
 that might work out to one large project and then it's done.  Whereas,
 incremental update sounds to me like a series of projects over a
 series of releases targeting various special cases, where we can
 always point to some improvements vs. release N-1 but we're never
 actually done and able to move on to the next thing.  As a roadmap
 goes, I think that's OK.  Even a reasonably simplistic and partial
 implementation of incremental update will benefit a lot of users.  But
 in terms of relative difficulty, it's not at all obvious to me that
 that's the easier part of the project.

While true, that's true for a lot of Postgres features. The only ones
that are one-shot projects are buried deep in the internals. Anything
with UI implications inevitably has limitations, and then other people
come along and work on removing or extending those features.

I do agree with Tom though -- the most frequently asked for
materialized view in the past has always been "select count(*) from
tab". People assume we already do this and are surprised when we
don't. The cookie-cutter solution for it is basically exactly what an
incrementally updated materialized view solution would look like
(with a queue of updates with transaction information that are
periodically flattened into the aggregate). Rewriting this might be a
bit tricky and require heuristics to determine just how much work to
expend trying to match materialized views, but this type of view would
be where most of the win would be.

I also can't see implementing query rewriting for
non-transactionally-accurate materialized views. If people want a
snapshot of the data that may be out of date, that's great. I can see
tons of use cases for that. But then surely they won't be surprised to
have to query the snapshot explicitly. I can't see going to all this
trouble to implement transactions and snapshots and WAL logging and so
on and then silently rewriting queries to produce data that is not up
to date. I think users would be surprised to find bog-standard SQL
occasionally producing incorrect results.

That said, there are cases where snapshots might be up to date even
though we don't implement any incremental updates. If the underlying
data is read-only or hasn't received any update commits since the
snapshot was taken then it might still be useful. There are tons of
ETL applications where you load the data once and then build MVs for
it and  never touch the underlying data again.

-- 
greg




Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Garick Hamlin
On Wed, Mar 06, 2013 at 01:34:21PM +0200, Heikki Linnakangas wrote:
 On 06.03.2013 10:41, Simon Riggs wrote:
 On 5 March 2013 18:02, Jeff Davispg...@j-davis.com  wrote:

 Fletcher is probably significantly faster than CRC-16, because I'm just
 doing int32 addition in a tight loop.

 Simon originally chose Fletcher, so perhaps he has more to say.

 IIRC the research showed Fletcher was significantly faster for only a
 small loss in error detection rate.

 It was sufficient to make our error detection > 1 million times
 better, possibly more. That seems sufficient to enable early detection
 of problems, since if we missed the first error, a second is very
 likely to be caught (etc). So I am assuming that we're trying to catch
 a pattern of errors early, rather than guarantee we can catch the very
 first error.

 Fletcher's checksum is good in general, I was mainly worried about  
 truncating the Fletcher-64 into two 8-bit values. I can't spot any  
 obvious weakness in it, but if it's indeed faster and as good as a  
 straightforward Fletcher-16, I wonder why that method is not more widely  
 used.

I was wondering about the effectiveness of this resulting truncated hash
function as well.

 Another thought is that perhaps something like CRC32C would be faster to  
 calculate on modern hardware, and could be safely truncated to 16-bits  
 using the same technique you're using to truncate the Fletcher's  
 Checksum. Greg's tests showed that the overhead of CRC calculation is  
 significant in some workloads, so it would be good to spend some time to  
 optimize that. It'd be difficult to change the algorithm in a future  
 release without breaking on-disk compatibility, so let's make sure we  
 pick the best one.

If picking a CRC, why not a short optimal one rather than truncating CRC32C?

I've been reading about optimal checksums for small messages for other
reasons and found this paper quite good.

http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

I was interested in small messages and small checksums so this paper may not be
as much help here.

Other than CRCs and Fletcher sums, Pearson hashing with a 16-bit block might
be worth considering.  Either a Pearson hash or a 16-bit CRC is small enough
to implement with a lookup table rather than a formula.
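
As an illustration of the table-driven approach (the table below is only
partially filled in and must really be a fixed permutation of 0..255, and
widening to 16 bits by running the 8-bit hash twice with different seeds is
just one common option, not a proposal):

#include <stddef.h>
#include <stdint.h>

static const uint8_t T[256] = { 98, 6, 85, 150, 36, 23, 112, 164 /* ... */ };

static uint8_t
pearson8(const uint8_t *data, size_t len, uint8_t seed)
{
    uint8_t     h = seed;

    for (size_t i = 0; i < len; i++)
        h = T[h ^ data[i]];     /* one table lookup per input byte */
    return h;
}

static uint16_t
pearson16(const uint8_t *data, size_t len)
{
    return (uint16_t) ((pearson8(data, len, 0) << 8) | pearson8(data, len, 1));
}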

I've been wondering what kind of errors we expect.  Single bit flips?  Large
swaths of bytes corrupted?  Are we more worried about collisions (the odds
that total garbage has the same checksum) or the odds that we detect a flip
of n bits?  I would think that since the message is large and a write to the
wrong location seems about as likely as a bit flip, a Pearson hash would be good.

Any choice seems like it would be a nice improvement for noticing a storage
stack problem.  The difference would be subtle: can I accurately estimate the
odds of undetected corruption that occurred since the condition was first
detected, or does the checksum/hash perform poorly?

Garick







Re: [HACKERS] Optimizing pglz compressor

2013-03-06 Thread Andres Freund
On 2013-03-06 09:36:19 -0600, Merlin Moncure wrote:
 On Wed, Mar 6, 2013 at 8:32 AM, Joachim Wieland j...@mcknight.de wrote:
  On Tue, Mar 5, 2013 at 8:32 AM, Heikki Linnakangas
  hlinnakan...@vmware.com wrote:
  With these tweaks, I was able to make pglz-based delta encoding perform
  roughly as well as Amit's patch.
 
  Out of curiosity, do we know how pglz compares with other algorithms, e.g. 
  lz4 ?
 
 This has been a subject of much recent discussion. It compares very
 poorly, but installing a new compressor tends to be problematic due to
 patent concerns (something which I disagree with but it's there).  All
 that said, Heikki's proposed changes seem to be low risk and quite
 fast.

Imo the licensing part is by far the smaller one. The interesting part
is making a compatible change to the way toast compression works that
supports multiple compression schemes. Afaics nobody has done that work.
After that the choice of to-be-integrated compression schemes needs to
be discussed, sure.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Andres Freund
On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:
 If picking a CRC why not a short optimal one rather than truncate CRC32C?

CRC32C is available in hardware since SSE4.2.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Writable foreign tables: how to identify rows

2013-03-06 Thread Tom Lane
Shigeru Hanada shigeru.han...@gmail.com writes:
 I'm not sure whether postgres_fdw should support, but updatable views
 have no system column including ctid.  So, we need magic identifier,
 perhaps it would be set of primary key value(s), to support updating
 remote updatable views via foreign tables.

Yeah, I considered that.  I thought seriously about proposing that we
forget magic row identifiers altogether, and instead make postgres_fdw
require a remote primary key for a foreign table to be updatable.  That
would solve the immediate problem since we'd no longer need any system
columns at all.  On balance though I think it's better for postgres_fdw
to use ctid for this purpose, at least for now: ctid is
better-performing and more reliable than a PK (since someone might
mis-describe the PK, or change a row's PK value underneath us).
Perhaps in a future release we could add a table option to use PK row
identification, but I don't want to try to implement both methods in
postgres_fdw right now.

On the other hand, I don't have a problem with decreeing that
non-Postgres FDWs need to use PK row identification in the first
release; which would be the consequence if we don't do anything about
allowing new system columns in 9.3.  We will certainly need that style
of row identification to be written and tested anyway.  It won't stop
us from extending things later.

regards, tom lane




Re: [HACKERS] Optimizing pglz compressor

2013-03-06 Thread Jeff Janes
On Wed, Mar 6, 2013 at 8:53 AM, Andres Freund and...@2ndquadrant.comwrote:

 On 2013-03-06 09:36:19 -0600, Merlin Moncure wrote:
  On Wed, Mar 6, 2013 at 8:32 AM, Joachim Wieland j...@mcknight.de wrote:
   On Tue, Mar 5, 2013 at 8:32 AM, Heikki Linnakangas
   hlinnakan...@vmware.com wrote:
   With these tweaks, I was able to make pglz-based delta encoding
 perform
   roughly as well as Amit's patch.
  
   Out of curiosity, do we know how pglz compares with other algorithms,
 e.g. lz4 ?
 
  This has been a subject of much recent discussion. It compares very
  poorly, but installing a new compressor tends to be problematic due to
  patent concerns (something which I disagree with but it's there).  All
  that said, Heikki's proposed changes seem to be low risk and quite
  fast.

 Imo the licensing part is by far the smaller one. The interesting part
 is making a compatible change to the way toast compression works that
 supports multiple compression schemes. Afaics nobody has done that work.
 After that the choice of to-be-integrated compression schemes needs to
 be discussed, sure.



Another thing to consider would be some way of recording an exemplar value
for each column which is used to seed whatever compression algorithm is
used.  I think there is often a lot of redundancy that does not appear within
any given value, but does appear when viewing all the values of a given
column.  Finding some way to take advantage of that could give a big
improvement in compression ratio.
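
One way to picture that idea, using zlib's preset-dictionary API purely as
an illustration (pglz has no equivalent today, and the function below is
invented for the example):

#include <stddef.h>
#include <string.h>
#include <zlib.h>

/* Compress "value" after seeding the compressor with a per-column exemplar. */
static int
compress_with_exemplar(const char *exemplar, const char *value,
                       unsigned char *out, unsigned outsize,
                       unsigned long *outlen)
{
    z_stream    zs;

    memset(&zs, 0, sizeof(zs));
    if (deflateInit(&zs, Z_DEFAULT_COMPRESSION) != Z_OK)
        return -1;

    /* shared context: repeats of exemplar material in "value" compress well */
    deflateSetDictionary(&zs, (const Bytef *) exemplar, (uInt) strlen(exemplar));

    zs.next_in = (Bytef *) value;
    zs.avail_in = (uInt) strlen(value);
    zs.next_out = out;
    zs.avail_out = outsize;

    if (deflate(&zs, Z_FINISH) != Z_STREAM_END)
    {
        deflateEnd(&zs);
        return -1;
    }
    *outlen = zs.total_out;
    deflateEnd(&zs);
    return 0;
}

The decompressor would of course have to be handed the exact same exemplar
(via inflateSetDictionary), which is also the crux of the caveat Andres
raises in his reply: the exemplar can never change once stored values
depend on it.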

Cheers,

Jeff


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Fujii Masao
On Wed, Mar 6, 2013 at 8:59 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 OK. Patches updated... Please see attached.

I found odd behavior. After I made REINDEX CONCURRENTLY fail twice,
I found that the index which was not marked as INVALID remained unexpectedly.

=# CREATE TABLE hoge (i int primary key);
CREATE TABLE
=# INSERT INTO hoge VALUES (generate_series(1,10));
INSERT 0 10
=# SET statement_timeout TO '1s';
SET
=# REINDEX TABLE CONCURRENTLY hoge;
ERROR:  canceling statement due to statement timeout
=# \d hoge
     Table "public.hoge"
 Column |  Type   | Modifiers
--------+---------+-----------
 i      | integer | not null
Indexes:
    "hoge_pkey" PRIMARY KEY, btree (i)
    "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID

=# REINDEX TABLE CONCURRENTLY hoge;
ERROR:  canceling statement due to statement timeout
=# \d hoge
     Table "public.hoge"
 Column |  Type   | Modifiers
--------+---------+-----------
 i      | integer | not null
Indexes:
    "hoge_pkey" PRIMARY KEY, btree (i)
    "hoge_pkey_cct" PRIMARY KEY, btree (i) INVALID
    "hoge_pkey_cct1" PRIMARY KEY, btree (i) INVALID
    "hoge_pkey_cct_cct" PRIMARY KEY, btree (i)


+The recommended recovery method in such cases is to drop the concurrent
+index and try again to perform <command>REINDEX CONCURRENTLY</>.

If an invalid index depends on a constraint like a primary key, "drop
the concurrent index" cannot actually drop the index. In this case, you
need to issue "alter table ... drop constraint ..." to recover the
situation. I think this information should be documented.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Optimizing pglz compressor

2013-03-06 Thread Andres Freund
On 2013-03-06 09:08:10 -0800, Jeff Janes wrote:
 On Wed, Mar 6, 2013 at 8:53 AM, Andres Freund and...@2ndquadrant.comwrote:
 
  On 2013-03-06 09:36:19 -0600, Merlin Moncure wrote:
   On Wed, Mar 6, 2013 at 8:32 AM, Joachim Wieland j...@mcknight.de wrote:
On Tue, Mar 5, 2013 at 8:32 AM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
With these tweaks, I was able to make pglz-based delta encoding
  perform
roughly as well as Amit's patch.
   
Out of curiosity, do we know how pglz compares with other algorithms,
  e.g. lz4 ?
  
   This has been a subject of much recent discussion. It compares very
   poorly, but installing a new compressor tends to be problematic due to
   patent concerns (something which I disagree with but it's there).  All
   that said, Heikki's proposed changes seem to be low risk and quite
   fast.
 
  Imo the licensing part is by far the smaller one. The interesting part
  is making a compatible change to the way toast compression works that
  supports multiple compression schemes. Afaics nobody has done that work.
  After that the choice of to-be-integrated compression schemes needs to
  be discussed, sure.
 
 
 
 Another thing to consider would be some way of recording an exemplar value
 for each column which is used to seed whatever compression algorithm is
  used.  I think there is often a lot of redundancy that does not appear within
 any given value, but does appear when viewing all the values of a given
 column.  Finding some way to take advantage of that could give a big
 improvement in compression ratio.

But then that value cannot be changed/removed because existing values
depend on it. So either it's a one-off thing - which means you can only
compute it after the table is filled to some extent - or you need to
have a growing list of such values somewhere (refcounting it would be
hard).
I think the more reasonable route for such a thing would be to try and
get page-level compression working. Which is a pretty hard project, I
admit.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Andres Freund
On 2013-03-07 02:09:49 +0900, Fujii Masao wrote:
 On Wed, Mar 6, 2013 at 8:59 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  OK. Patches updated... Please see attached.
 
 I found odd behavior. After I made REINDEX CONCURRENTLY fail twice,
 I found that the index which was not marked as INVALID remained unexpectedly.

That's to be expected. Indexes need to be valid *before* we can drop the
old one. So if you abort at the right moment you will see those, and that's
imo fine.
 
 =# CREATE TABLE hoge (i int primary key);
 CREATE TABLE
 =# INSERT INTO hoge VALUES (generate_series(1,10));
 INSERT 0 10
 =# SET statement_timeout TO '1s';
 SET
 =# REINDEX TABLE CONCURRENTLY hoge;
 ERROR:  canceling statement due to statement timeout
 =# \d hoge
  Table public.hoge
  Column |  Type   | Modifiers
 +-+---
  i  | integer | not null
 Indexes:
 hoge_pkey PRIMARY KEY, btree (i)
 hoge_pkey_cct PRIMARY KEY, btree (i) INVALID
 
 =# REINDEX TABLE CONCURRENTLY hoge;
 ERROR:  canceling statement due to statement timeout
 =# \d hoge
  Table public.hoge
  Column |  Type   | Modifiers
 +-+---
  i  | integer | not null
 Indexes:
 hoge_pkey PRIMARY KEY, btree (i)
 hoge_pkey_cct PRIMARY KEY, btree (i) INVALID
 hoge_pkey_cct1 PRIMARY KEY, btree (i) INVALID
 hoge_pkey_cct_cct PRIMARY KEY, btree (i)

Huh, why did that go through? It should have errored out?
 
 +The recommended recovery method in such cases is to drop the concurrent
 +index and try again to perform commandREINDEX CONCURRENTLY/.
 
 If an invalid index depends on the constraint like primary key, drop
 the concurrent
 index cannot actually drop the index. In this case, you need to issue
 alter table
 ... drop constraint ... to recover the situation. I think this
 information should be
 documented.

I think we just shouldn't set ->isprimary on the temporary indexes. Now
we switch only the relfilenodes and not the whole index, that should be
perfectly fine.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] transforms

2013-03-06 Thread Peter Eisentraut
On 3/5/13 5:52 PM, Josh Berkus wrote:
 More on this: the problem appears to be that the symbols for hstore are
 loaded only if I've just created the extension in that database:

I see.  This is a problem with any kind of dynamically loadable module
that uses another module.  The other module is only loaded either at
creation time or when a function from it is first used (or if a preload
mechanism is used).

At run time, this will sort itself out, because all the required modules
will be loaded.  The problem is when you create the glue extension and
haven't invoked any code from any of the dependent extensions in the
session.  Abstractly, the possible solutions are either not to check the
functions when the extension is created (possibly settable by a flag) or
to somehow force a load of all dependent extensions when the new
extension is created.  (I say extension here even though dynamically
loadable modules are attached to functions, which makes this even more
confusing.)

In normal programming languages, this is normally addressed by placing
explicit load/require/import statements before you do anything else.
What we are doing here is more like an autoload functionality that some
environments have.  Those have the same problem.
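
To make that concrete, here is a minimal sketch of the kind of session being
described, using the hstore/PL/Perl glue extension from this thread as the
example; the LOAD commands at the end are only a guess at the "explicit
load/require" approach mentioned above, not something verified here:

-- hstore is already installed in this database, but this backend has not
-- loaded hstore.so yet, so creating the glue extension can fail while its
-- functions are being checked:
CREATE EXTENSION hstore_plperl;

-- Possible workaround to test: force the dependent modules into the
-- current session first, then create the glue extension.
LOAD 'hstore';
LOAD 'plperl';
CREATE EXTENSION hstore_plperl;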



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizing pglz compressor

2013-03-06 Thread Merlin Moncure
On Wed, Mar 6, 2013 at 10:53 AM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-03-06 09:36:19 -0600, Merlin Moncure wrote:
 On Wed, Mar 6, 2013 at 8:32 AM, Joachim Wieland j...@mcknight.de wrote:
  On Tue, Mar 5, 2013 at 8:32 AM, Heikki Linnakangas
  hlinnakan...@vmware.com wrote:
  With these tweaks, I was able to make pglz-based delta encoding perform
  roughly as well as Amit's patch.
 
  Out of curiosity, do we know how pglz compares with other algorithms, e.g. 
  lz4 ?

 This has been a subject of much recent discussion. It compares very
 poorly, but installing a new compressor tends to be problematic due to
 patent concerns (something which I disagree with but it's there).  All
 that said, Heikki's proposed changes seem to be low risk and quite
 fast.

 Imo the licensing part is by far the smaller one. The interesting part
 is making a compatible change to the way toast compression works that
 supports multiple compression schemes. Afaics nobody has done that work.
 After that the choice of to-be-integrated compression schemes needs to
 be discussed, sure.

That wasn't the conversation I remember.  lz4 completely smokes pglz
(Heikki's changes notwithstanding) and is bsd licensed so AFAICS there
is no reason to support multiple compression technologies except for
migration purposes (and even if you did want to, a plausible way to do
that was identified).

...but that's a separate topic for another day.  Heikki's proposal
seems like a win either way.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Fujii Masao
On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund and...@2ndquadrant.com wrote:
 Indexes:
 hoge_pkey PRIMARY KEY, btree (i)
 hoge_pkey_cct PRIMARY KEY, btree (i) INVALID
 hoge_pkey_cct1 PRIMARY KEY, btree (i) INVALID
 hoge_pkey_cct_cct PRIMARY KEY, btree (i)

 Huh, why did that go through? It should have errored out?

I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should
be marked as invalid, I think.

 +The recommended recovery method in such cases is to drop the concurrent
 +index and try again to perform commandREINDEX CONCURRENTLY/.

 If an invalid index depends on the constraint like primary key, drop
 the concurrent
 index cannot actually drop the index. In this case, you need to issue
 alter table
 ... drop constraint ... to recover the situation. I think this
 information should be
 documented.

 I think we just shouldn't set ->isprimary on the temporary indexes. Now
 we switch only the relfilenodes and not the whole index, that should be
 perfectly fine.

Sounds good. But, what about other constraint case like unique constraint?
Those other cases also can be resolved by not setting ->isprimary?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Writable foreign tables: how to identify rows

2013-03-06 Thread Tom Lane
Kohei KaiGai kai...@kaigai.gr.jp writes:
 2013/3/6 Tom Lane t...@sss.pgh.pa.us:
 I think if we're going to support magic row identifiers, they need to
 be actual system columns, complete with negative attnums and entries
 in pg_attribute.

 Sorry, -1 for me.

 The proposed design tried to kill two birds with one stone.
 The reason why the new GetForeignRelWidth() can reserve multiple slot
 for pseudo-columns is, that intends to push-down complex calculation in
 target-list to the remote computing resource.

[ shrug... ]  That's just pie in the sky: there is no such capability
in the submitted patch, nor is it close to being implementable.  When
weighing that against the very real probability that this patch won't
get into 9.3 at all, I have no problem tossing that overboard to try to
get to something committable.

The larger issue here is that the patch is confusing the set of
possibly-computed columns that could be returned by a scan node with
the catalog-specified set of columns for a table.  I don't think that
changing the (representation of the) latter on the fly within the
planner is a tenable approach.  You've had to put in an unreasonable
number of crude hacks already to make that sort-of work, but I have
exactly zero confidence that you've hacked everyplace that would have
to change.  I also have no confidence that there aren't unfixable
problems where the same TupleDesc would need to be in both states
to satisfy the expectations of different bits of code.  (An example
of the risks here is that, IIRC, parts of the planner use the relation's
TupleDesc as a guide to what relname.* means.  So I'm pretty
suspicious that the existing patch breaks behavior for whole-row Vars
in some cases.)  Really the idea is a bit broken in this form anyway,
because if we did have a plan that involved calculating (x-y)^2 at
the scan level, we'd want that expression to be represented using Vars
for x and y, not some made-up Var whose meaning is not apparent from
the system catalogs.

Also, the right way to deal with this desire is to teach the planner in
general, not just FDWs, about pushing expensive calculations down the
plan tree --- basically, resurrecting Joe Hellerstein's thesis work,
which we ripped out more than ten years ago.  I don't think there's that
much that would need to change about the planner's data structures, but
deciding where is cheapest to evaluate an expression is a pretty hard
problem.  Trying to handle that locally within FDWs is doomed to failure
IMO.

So my intention is to get rid of GetForeignRelWidth() and make use of
the existing ctid system column for returning remote TIDs in
postgres_fdw.  (On review I notice that we're creating ctid columns
for foreign tables already, so we don't even need the proposed hook
to let FDWs control that; though we will certainly want one in future
if we are to support non-TID magic row identifiers.)
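
A rough SQL-level sketch of what the ctid-based approach amounts to, for
readers following along (illustrative only; these are not the exact
statements postgres_fdw generates, and the column names are made up):

-- the foreign scan fetches the remote TID alongside the user columns:
SELECT a, b, ctid FROM remote_table;
-- the later update or delete targets each row by that TID ($1 and $2 are
-- parameters bound from the local modify node):
UPDATE remote_table SET b = $1 WHERE ctid = $2;
DELETE FROM remote_table WHERE ctid = $1;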

If you find that unacceptable, I'm quite willing to mark this patch
Returned With Feedback and get on with dealing with some of the other
forty-odd patches in the CF queue.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] transforms

2013-03-06 Thread Josh Berkus
Peter,

 At run time, this will sort itself out, because all the required modules
 will be loaded.  The problem is when you create the glue extension and
 haven't invoked any code from any of the dependent extension in the
 session.  

Just invoking code doesn't seem to be enough.  I tried just using the
Hstore data type, and then loading hstore_plperl, but that still didn't
work.  It seems like only CREATE EXTENSION loads *all* the symbols.

 Abstractly, the possible solutions are either not to check the
 functions when the extension is created (possibly settable by a flag) or
 to somehow force a load of all dependent extensions when the new
 extension is created. 

The latter would be ideal, but I don't know that we currently have any
mechanism for it.

--Josh

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizing pglz compressor

2013-03-06 Thread Andres Freund
On 2013-03-06 11:31:06 -0600, Merlin Moncure wrote:
 On Wed, Mar 6, 2013 at 10:53 AM, Andres Freund and...@2ndquadrant.com wrote:
  On 2013-03-06 09:36:19 -0600, Merlin Moncure wrote:
  On Wed, Mar 6, 2013 at 8:32 AM, Joachim Wieland j...@mcknight.de wrote:
   On Tue, Mar 5, 2013 at 8:32 AM, Heikki Linnakangas
   hlinnakan...@vmware.com wrote:
   With these tweaks, I was able to make pglz-based delta encoding perform
   roughly as well as Amit's patch.
  
   Out of curiosity, do we know how pglz compares with other algorithms, 
   e.g. lz4 ?
 
  This has been a subject of much recent discussion. It compares very
  poorly, but installing a new compressor tends to be problematic due to
  patent concerns (something which I disagree with but it's there).  All
  that said, Heikki's proposed changes seem to be low risk and quite
  fast.
 
  Imo the licensing part is by far the smaller one. The interesting part
  is making a compatible change to the way toast compression works that
  supports multiple compression schemes. Afaics nobody has done that work.
  After that the choice of to-be-integrated compression schemes needs to
  be discussed, sure.
 
 That wasn't the conversation I remember.  lz4 completely smokes pglz
 (Heikki's changes notwithstanding) and is bsd licensed so AFAICS there
 is no reason to support multiple compression technologies except for
 migration purposes (and even if you did want to, a plausible way to do
 that was identified).

Well, we need to permanently support at least two technologies -
otherwise we will break pg_upgrade.
And there is absolutely no reason to believe nothing better than lz4
will come along in the future so just supporting two seems like a bad
idea.  And sure, there are rough ideas on how to support different
compression schemes in toast, but nobody has implemented anything
afaics.
It would be possible to reuse what I proposed for indirect toast tuples
for that purpose although I am not sure whether I like using
varatt_external's in multiple types as indicated by
varatt1_be.va_len_1be (renamed to va_type or such) apointing to
different types. Which means that an uncompressible datum would
potentially have multiple representations.

 ...but that's a separate topic for another day.  Heikki's proposal
 seems like a win either way.

Yes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] transforms

2013-03-06 Thread Andres Freund
On 2013-03-06 09:53:29 -0800, Josh Berkus wrote:
 Peter,
 
  At run time, this will sort itself out, because all the required modules
  will be loaded.  The problem is when you create the glue extension and
  haven't invoked any code from any of the dependent extension in the
  session.  
 
 Just invoking code doesn't seem to be enough.  I tried just using the
 Hstore data type, and then loading hstore_plperl, but that still didn't
 work.  It seems like only CREATE EXTENSION loads *all* the symbols.

Your error looks like you're erroring out in plperl, not in hstore?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Robert Haas
On Mon, Mar 4, 2013 at 3:13 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 04.03.2013 20:58, Greg Smith wrote:

 There
 is no such thing as a stable release of btrfs, and no timetable for when
 there will be one. I could do some benchmarks of that but I didn't think
 they were very relevant. Who cares how fast something might run when it
 may not work correctly? btrfs might as well be /dev/null to me right
 now--sure it's fast, but maybe the data won't be there at all.

 This PostgreSQL patch hasn't seen any production use, either. In fact, I'd
 consider btrfs to be more mature than this patch. Unless you think that
 there will be some major changes to the worse in performance in btrfs, it's
 perfectly valid and useful to compare the two.

 A comparison with ZFS would be nice too. That's mature, and has checksums.

We've had a few EnterpriseDB customers who have had fantastically
painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
ZFS block size to the PostgreSQL block size is supposed to make these
problems go away, but in my experience it does not have that effect.
So I think telling people who want checksums go use ZFS is a lot
like telling them oh, I see you have a hangnail, we recommend that
you solve that by cutting your arm off with a rusty saw.

There may be good reasons to reject this patch.  Or there may not.
But I completely disagree with the idea that asking them to solve the
problem at the filesystem level is sensible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange Windows problem, lock_timeout test request

2013-03-06 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Wed, Feb 27, 2013 at 8:58 AM, Stephen Frost sfr...@snowman.net wrote:
 It's still entirely possible to get 99% done and then hit that last
 tuple that you need a lock on and just tip over the lock_timeout_stmt
 limit due to prior waiting and ending up wasting a bunch of work, hence
 why I'm not entirely sure that this is that much better than
 statement_timeout.

 I tend to agree that this should be based on the length of any
 individual lock wait, not the cumulative duration of lock waits.
 Otherwise, it seems like it'll be very hard to set this to a
 meaningful value.  For example, if you set this to 1 minute, and that
 means the length of any single wait, then you basically know that
 it'll only kick in if there is some other, long-running transaction
 that's holding the lock.  But if it means the cumulative length of all
 waits, it's not so clear, because now you might also have this kick in
 if you wait for 100ms on 600 different occasions.  In other words,
 complex queries that lock or update many tuples may get killed even if
 they never wait very long at all for any single lock.  That seems like
 it will be almost indistinguishable from random, unprincipled query
 cancellations.

Yeah.  I'm also unconvinced that there's really much use-case territory
here that statement_timeout doesn't cover well enough.  To have a case
that statement-level lock timeout covers and statement_timeout doesn't,
you need to suppose that you know how long the query can realistically
wait for all locks together, but *not* how long it's going to run in the
absence of lock delays.  That seems a bit far-fetched, particularly when
thinking of row-level locks, whose cumulative timeout would presumably
need to scale with the number of rows the query will visit.

If statement-level lock timeouts were cheap to add, that would be one
thing; but given that they're complicating the code materially, I think
we need a more convincing argument for them.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Strange Windows problem, lock_timeout test request

2013-03-06 Thread Boszormenyi Zoltan

On 2013-03-06 19:53, Tom Lane wrote:

Robert Haas robertmh...@gmail.com writes:

On Wed, Feb 27, 2013 at 8:58 AM, Stephen Frost sfr...@snowman.net wrote:

It's still entirely possible to get 99% done and then hit that last
tuple that you need a lock on and just tip over the lock_timeout_stmt
limit due to prior waiting and ending up wasting a bunch of work, hence
why I'm not entirely sure that this is that much better than
statement_timeout.

I tend to agree that this should be based on the length of any
individual lock wait, not the cumulative duration of lock waits.
Otherwise, it seems like it'll be very hard to set this to a
meaningful value.  For example, if you set this to 1 minute, and that
means the length of any single wait, then you basically know that
it'll only kick in if there is some other, long-running transaction
that's holding the lock.  But if it means the cumulative length of all
waits, it's not so clear, because now you might also have this kick in
if you wait for 100ms on 600 different occasions.  In other words,
complex queries that lock or update many tuples may get killed even if
they never wait very long at all for any single lock.  That seems like
it will be almost indistinguishable from random, unprincipled query
cancellations.

Yeah.  I'm also unconvinced that there's really much use-case territory
here that statement_timeout doesn't cover well enough.  To have a case
that statement-level lock timeout covers and statement_timeout doesn't,
you need to suppose that you know how long the query can realistically
wait for all locks together, but *not* how long it's going to run in the
absence of lock delays.  That seems a bit far-fetched, particularly when
thinking of row-level locks, whose cumulative timeout would presumably
need to scale with the number of rows the query will visit.

If statement-level lock timeouts were cheap to add, that would be one
thing; but given that they're complicating the code materially, I think
we need a more convincing argument for them.


OK, so it's not wanted. Surprise, surprise, it was already dropped
from the patch. Can you _please_ review the last patch and comment
on it instead of its past state?

Thanks and best regards,
Zoltán Böszörményi



regards, tom lane





--
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Josh Berkus

 There may be good reasons to reject this patch.  Or there may not.
 But I completely disagree with the idea that asking them to solve the
 problem at the filesystem level is sensible.

Yes, can we get back to the main issues with the patch?

1) argument over whether the checksum is sufficient to detect most
errors, or if it will give users false confidence.

2) performance overhead.

Based on Smith's report, I consider (2) to be a deal-killer right now.
The level of overhead reported by him would prevent the users I work
with from ever employing checksums on production systems.

Specifically, writing checksums for a read-only query is a defect I
think is prohibitively bad.  When we first talked about this feature for
9.2, we were going to exclude hint bits from checksums, in order to
avoid this issue; what happened to that?

(FWIW, I still support the idea of moving hint bits to a separate
filehandle, as we do with the FSM, but clearly that's not happening for
9.3 ...)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Michael Paquier
On Thu, Mar 7, 2013 at 2:09 AM, Fujii Masao masao.fu...@gmail.com wrote:

 On Wed, Mar 6, 2013 at 8:59 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
  OK. Patches updated... Please see attached.

 I found odd behavior. After I made REINDEX CONCURRENTLY fail twice,
 I found that the index which was not marked as INVALID remained
 unexpectedly.

 =# CREATE TABLE hoge (i int primary key);
 CREATE TABLE
 =# INSERT INTO hoge VALUES (generate_series(1,10));
 INSERT 0 10
 =# SET statement_timeout TO '1s';
 SET
 =# REINDEX TABLE CONCURRENTLY hoge;
 ERROR:  canceling statement due to statement timeout
 =# \d hoge
  Table public.hoge
  Column |  Type   | Modifiers
 +-+---
  i  | integer | not null
 Indexes:
 hoge_pkey PRIMARY KEY, btree (i)
 hoge_pkey_cct PRIMARY KEY, btree (i) INVALID

 =# REINDEX TABLE CONCURRENTLY hoge;
 ERROR:  canceling statement due to statement timeout
 =# \d hoge
  Table public.hoge
  Column |  Type   | Modifiers
 +-+---
  i  | integer | not null
 Indexes:
 hoge_pkey PRIMARY KEY, btree (i)
 hoge_pkey_cct PRIMARY KEY, btree (i) INVALID
 hoge_pkey_cct1 PRIMARY KEY, btree (i) INVALID
 hoge_pkey_cct_cct PRIMARY KEY, btree (i)

Invalid indexes cannot be reindexed concurrently and are simply bypassed
during the process, so _cct_cct has no reason to exist. For example, here is
what I get with a relation having an invalid index:
ioltas=# \d aa
  Table public.aa
 Column |  Type   | Modifiers
+-+---
 a  | integer |
Indexes:
aap btree (a)
aap_cct btree (a) INVALID

ioltas=# reindex table concurrently aa;
WARNING:  cannot reindex concurrently invalid index public.aap_cct,
skipping
REINDEX


 +The recommended recovery method in such cases is to drop the
 concurrent
 +index and try again to perform commandREINDEX CONCURRENTLY/.

 If an invalid index depends on the constraint like primary key, drop
 the concurrent
 index cannot actually drop the index. In this case, you need to issue
 alter table
 ... drop constraint ... to recover the situation. I think this
 information should be
 documented.

You are right. I'll add a note in the documentation about that. Personally
I find it more instinctive to use DROP CONSTRAINT for a primary key as the
image I have of a concurrent index is a twin of the index it rebuilds.
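
To spell out the recovery being discussed, using the names from Fujii's
example and assuming (as reported) that the leftover invalid index is backed
by a constraint of the same name, so that plain DROP INDEX is refused:

-- DROP INDEX hoge_pkey_cct;                      -- refused: index backs a constraint
ALTER TABLE hoge DROP CONSTRAINT hoge_pkey_cct;   -- drops the constraint and its index
REINDEX TABLE CONCURRENTLY hoge;                  -- then retry the rebuild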
-- 
Michael


Re: [HACKERS] Bug in tm2timestamp

2013-03-06 Thread Michael Meskes
On Mon, Mar 04, 2013 at 05:08:26PM -0300, Alvaro Herrera wrote:
 Another point worth considering is that most of this is duplicated by
 ecpg's libpgtypes.  Do we want to fix that one too, or do we just let it
 continue to be broken?  I note that other bugs are already unfixed in
 ecpg's copy.  One other idea to consider is moving these things to

Meaning that a fix wasn't put there, too?

 But in light of this bug and other already fixed date/time bugs, perhaps
 it's warranted?  Opinions please.

I'd love to go to a single source. Most of libpgtypes was taken from the
backend back when it was developed.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in tm2timestamp

2013-03-06 Thread Michael Meskes
On Mon, Mar 04, 2013 at 05:55:26PM -0300, Alvaro Herrera wrote:
 error codes for the caller to figure out. Maybe we could create a layer
 on top of ereport, that gets both the error message, sqlstate etc, and
 ...

Couldn't we just create ecpg's own version of ereport, that does the right
thing for ecpg?

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in tm2timestamp

2013-03-06 Thread Alvaro Herrera
Michael Meskes wrote:
 On Mon, Mar 04, 2013 at 05:08:26PM -0300, Alvaro Herrera wrote:
  Another point worth considering is that most of this is duplicated by
  ecpg's libpgtypes.  Do we want to fix that one too, or do we just let it
  continue to be broken?  I note that other bugs are already unfixed in
  ecpg's copy.  One other idea to consider is moving these things to
 
 Meaning that a fix wasn't put there, too?

Yes, a fix was put there by Tom (which is why I retracted my comment
initially).  I did note that the ecpg code has diverged from the backend
code; it's not unlikely that other bug fixes have not gone to the ecpg
copy.  But I didn't investigate each difference in detail.

  But in light of this bug and other already fixed date/time bugs, perhaps
  it's warranted?  Opinions please.
 
 I'd love to go to a single source. Most of libpgtypes was taken from the
 backend back when it was developed.

I will keep that in mind, if I get back to moving the timestamp/datetime
code to src/common.  It's not a high priority item right now.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Andres Freund
On 2013-03-07 02:34:54 +0900, Fujii Masao wrote:
 On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund and...@2ndquadrant.com wrote:
  Indexes:
  hoge_pkey PRIMARY KEY, btree (i)
  hoge_pkey_cct PRIMARY KEY, btree (i) INVALID
  hoge_pkey_cct1 PRIMARY KEY, btree (i) INVALID
  hoge_pkey_cct_cct PRIMARY KEY, btree (i)
 
  Huh, why did that go through? It should have errored out?
 
 I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should
 be marked as invalid, I think.

Hm. Yea.

I am still not sure why hoge_pkey_cct_cct sprang into existence, but
the fact that hoge_pkey_cct1 springs into existence makes sense.

I see a problem here: there is a moment between phase 3 and 4 where
both the old and the new indexes are valid and ready. That's not good
because if we abort at that moment we essentially have doubled the
number of indexes.

Options:
a) we live with it
b) we only mark the new index as valid within phase 4. That should be
   fine I think?
c) we invent some other state to mark indexes that are in-progress to
   replace another one.

I guess b) seems fine?

  +The recommended recovery method in such cases is to drop the 
  concurrent
  +index and try again to perform commandREINDEX CONCURRENTLY/.
 
  If an invalid index depends on the constraint like primary key, drop
  the concurrent
  index cannot actually drop the index. In this case, you need to issue
  alter table
  ... drop constraint ... to recover the situation. I think this
  information should be
  documented.
 
  I think we just shouldn't set ->isprimary on the temporary indexes. Now
  we switch only the relfilenodes and not the whole index, that should be
  perfectly fine.
 
 Sounds good. But, what about other constraint case like unique constraint?
 Those other cases also can be resolved by not setting ->isprimary?

Unique indexes can exist without a constraint attached, so that's fine. I
need to read a bit more code to see whether it's safe to unset it, although
indisexclusion and indimmediate might be more important.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized views WIP patch

2013-03-06 Thread Kevin Grittner
Kevin Grittner kgri...@ymail.com wrote:
 Tatsuo Ishii is...@postgresql.org wrote:

 Was the remaining work on docs done? I would like to test MVs and
 am waiting for the docs completed.

 I think they are done.  If you notice anything missing or in need
 of clarification please let me know.  At this point missing docs
 would be a bug in need of a fix.

I decided to take another pass through to try to spot anyplace I
might have missed.  I found that I had missed documenting the new
pg_matviews system view, so I have just pushed that.

I also think that something should be done about the documentation
for indexes.  Right now that always refers to a table.  It would
clearly be awkward to change that to table or materialized view
everywhere.  I wonder if most of those should be changed to
relation with a few mentions that the relation could be a table
or a materialized view, or whether some less intrusive change would
be better.  Opinions welcome.
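
As a small illustration of why the wording matters: with this patch the same
CREATE INDEX syntax applies to both kinds of relation (made-up names, a
sketch only):

CREATE TABLE t (id int, payload text);
CREATE MATERIALIZED VIEW mv AS SELECT id, payload FROM t;

CREATE INDEX t_id_idx ON t (id);     -- a "table" in the current wording
CREATE INDEX mv_id_idx ON mv (id);   -- also allowed, but not a table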

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Andres Freund
On 2013-03-07 05:26:31 +0900, Michael Paquier wrote:
 On Thu, Mar 7, 2013 at 2:34 AM, Fujii Masao masao.fu...@gmail.com wrote:
 
  On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund and...@2ndquadrant.com
  wrote:
   Indexes:
   hoge_pkey PRIMARY KEY, btree (i)
   hoge_pkey_cct PRIMARY KEY, btree (i) INVALID
   hoge_pkey_cct1 PRIMARY KEY, btree (i) INVALID
   hoge_pkey_cct_cct PRIMARY KEY, btree (i)
  
   Huh, why did that go through? It should have errored out?
 
  I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should
  be marked as invalid, I think.
 
 CHECK_FOR_INTERRUPTS were not added at each phase and they are needed in
 case the process is interrupted by the user. This has been mentioned in a past
 review but it was missing, so it might have slipped out during a
 refactoring or smth. Btw, I am surprised to see that this *_cct_cct index
 has been created knowing that hoge_pkey_cct is invalid. I tried with the
 latest version of the patch and even the patch attached but couldn't
 reproduce it.

The strange thing about hoge_pkey_cct_cct is that it seems to imply
that an invalid index was reindexed concurrently?

But I don't see how it could happen either. Fujii, can you reproduce it?
 
  +The recommended recovery method in such cases is to drop the
  concurrent
   +index and try again to perform commandREINDEX CONCURRENTLY/.
  
   If an invalid index depends on the constraint like primary key, drop
   the concurrent
   index cannot actually drop the index. In this case, you need to issue
   alter table
   ... drop constraint ... to recover the situation. I think this
   information should be
   documented.
  
   I think we just shouldn't set ->isprimary on the temporary indexes. Now
   we switch only the relfilenodes and not the whole index, that should be
   perfectly fine.
 
  Sounds good. But, what about other constraint case like unique constraint?
  Those other cases also can be resolved by not setting ->isprimary?
 
 We should stick with the concurrent index being a twin of the index it
 rebuilds for consistency.

I don't think it's legal. We cannot simply have two indexes with
'indisprimary' set. Especially not if both are valid.
Also, there will be no pg_constraint row that refers to it, which
violates very valid expectations that both users and pg may have.

 Also, I think that it is important from the session viewpoint to perform a
 swap with 2 valid indexes. If the process fails just before swapping
 indexes, the user might want to do that himself and drop the old index, then use
 the concurrent one.

The most likely outcome will be to rerun REINDEX CONCURRENTLY. Which
will then reindex one more index since it now has the old valid index
and the new valid index. Also, I don't think it's fair game to expose
indexes that used to belong to a constraint without a constraint
supporting it as valid indexes.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized views WIP patch

2013-03-06 Thread David E. Wheeler
On Mar 6, 2013, at 1:51 PM, Kevin Grittner kgri...@ymail.com wrote:

 I also think that something should be done about the documentation
 for indexes.  Right now that always refers to a table.  It would
 clearly be awkward to change that to table or materialized view
 everywhere.  I wonder if most of thosse should be changed to
 relation with a few mentions that the relation could be a table
 or a materialized view, or whether some less intrusive change would
 be better.  Opinions welcome.

Isn’t a materialized view really just a table that gets updated periodically? 
And isn’t a non-materialized view also thought of as a “relation”?

If the answer to both those questions is “yes,” I think the term should remain 
“table,” with a few mentions that the term includes materialized views (and 
excludes foreign tables).

Best,

David

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized views WIP patch

2013-03-06 Thread Tatsuo Ishii
 Kevin Grittner kgri...@ymail.com wrote:
 Tatsuo Ishii is...@postgresql.org wrote:

 Was the remaining work on docs done? I would like to test MVs and
 am waiting for the docs completed.

 I think they are done.  If you notice anything missing or in need
 of clarification please let me know.  At this point missing docs
 would be a bug in need of a fix.

Ok.

 I decided to take another pass through to try to spot anyplace I
 might have missed.  I found that I had missed documenting the new
 pg_matviews system view, so I have just pushed that.
 
 I also think that something should be done about the documentation
 for indexes.  Right now that always refers to a table.  It would
 clearly be awkward to change that to table or materialized view
 everywhere.  I wonder if most of thosse should be changed to
 relation with a few mentions that the relation could be a table
 or a materialized view, or whether some less intrusive change would
 be better.  Opinions welcome.

You might want to add a small description about MVs to the Tutorial
documentation, 3.2 Views. Here is the current description of views in
the doc.

-
3.2. Views

Refer back to the queries in Section 2.6. Suppose the combined listing
of weather records and city location is of particular interest to your
application, but you do not want to type the query each time you need
it. You can create a view over the query, which gives a name to the
query that you can refer to like an ordinary table:

CREATE VIEW myview AS
SELECT city, temp_lo, temp_hi, prcp, date, location
FROM weather, cities
WHERE city = name;

SELECT * FROM myview;

Making liberal use of views is a key aspect of good SQL database
design. Views allow you to encapsulate the details of the structure of
your tables, which might change as your application evolves, behind
consistent interfaces.

Views can be used in almost any place a real table can be
used. Building views upon other views is not uncommon.
-
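
If such a paragraph is added, the materialized-view counterpart of that
example might look something like this (a sketch only, not proposed doc
text):

CREATE MATERIALIZED VIEW mymatview AS
    SELECT city, temp_lo, temp_hi, prcp, date, location
        FROM weather, cities
        WHERE city = name;

SELECT * FROM mymatview;

-- Unlike a regular view, the result is stored; it has to be refreshed
-- to pick up later changes in the underlying tables:
REFRESH MATERIALIZED VIEW mymatview;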
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Josh Berkus
Robert,

 We've had a few EnterpriseDB customers who have had fantastically
 painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
 ZFS block size to the PostgreSQL block size is supposed to make these
 problems go away, but in my experience it does not have that effect.
 So I think telling people who want checksums go use ZFS is a lot
 like telling them oh, I see you have a hangnail, we recommend that
 you solve that by cutting your arm off with a rusty saw.

Wow, what platform are you using ZFS on?

(we have a half-dozen clients on ZFS ...)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Robert Haas
On Wed, Mar 6, 2013 at 2:14 PM, Josh Berkus j...@agliodbs.com wrote:
 Based on Smith's report, I consider (2) to be a deal-killer right now.

I was pretty depressed by those numbers, too.

 The level of overhead reported by him would prevent the users I work
 with from ever employing checksums on production systems.

Agreed.

 Specifically, writing checksums for a read-only query is a defect I
 think is prohibitively bad.

That particular part doesn't bother me so much as some of the others -
but let's step back and look at the larger issue.  I suspect we can
all agree that the performance of this feature is terrible.  The
questions I think we should be asking are:

1. Are the problems fundamental, or things where we can reasonably
foresee future improvement?  The latter situation wouldn't bother me
very much even if the current situation is pretty bad, but if there's
no real hope of improvement, that's more of a problem.

2. Are the performance results sufficiently bad that we think this
would be more of a liability than an asset?

 When we first talked about this feature for
 9.2, we were going to exclude hint bits from checksums, in order to
 avoid this issue; what happened to that?

I don't think anyone ever thought that was a particularly practical
design.  I certainly don't.

 (FWIW, I still support the idea of moving hint bits to a separate
 filehandle, as we do with the FSM, but clearly that's not happening for
 9.3 ...)

Or, most likely, ever.  The whole benefit of hint bits is that the
information you need is available in the same bytes you have to read
anyway.  Moving the information to another fork (not filehandle) would
probably give up most of the benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Robert Haas
On Wed, Mar 6, 2013 at 6:00 PM, Josh Berkus j...@agliodbs.com wrote:
 We've had a few EnterpriseDB customers who have had fantastically
 painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
 ZFS block size to the PostgreSQL block size is supposed to make these
 problems go away, but in my experience it does not have that effect.
 So I think telling people who want checksums go use ZFS is a lot
 like telling them oh, I see you have a hangnail, we recommend that
 you solve that by cutting your arm off with a rusty saw.

 Wow, what platform are you using ZFS on?

 (we have a half-dozen clients on ZFS ...)

Not us, customers.  But as to platform, I have yet to run across
anyone running ZFS on anything but Solaris.  I'd be interested to hear
your experiences.   Mine rhyme with sun a play dreaming.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Joshua D. Drake


On 03/06/2013 03:06 PM, Robert Haas wrote:


On Wed, Mar 6, 2013 at 6:00 PM, Josh Berkus j...@agliodbs.com wrote:

We've had a few EnterpriseDB customers who have had fantastically
painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
ZFS block size to the PostgreSQL block size is supposed to make these
problems go away, but in my experience it does not have that effect.
So I think telling people who want checksums go use ZFS is a lot
like telling them oh, I see you have a hangnail, we recommend that
you solve that by cutting your arm off with a rusty saw.


Wow, what platform are you using ZFS on?

(we have a half-dozen clients on ZFS ...)


Not us, customers.  But as to platform, I have yet to run across
anyone running ZFS on anything but Solaris.  I'd be interested to hear
your experiences.   Mine rhyme with sun a play dreaming.


I would guess he meant on X86_64 or Sparc.

JD







--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:
 If picking a CRC why not a short optimal one rather than truncate CRC32C?

 CRC32C is available in hardware since SSE4.2.

I think that should be at most a fourth-order consideration, since we
are not interested solely in Intel hardware, nor do we have any portable
way of getting at such a feature even if the hardware has it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] odd behavior in materialized view

2013-03-06 Thread Kevin Grittner
Fujii Masao masao.fu...@gmail.com wrote:
 On Tue, Mar 5, 2013 at 7:36 AM, Kevin Grittner kgri...@ymail.com wrote:
 Fujii Masao masao.fu...@gmail.com wrote:

 When I accessed the materialized view in the standby server,

 I got the following ERROR message. Looks odd to me. Is this a bug?

 ERROR:  materialized view hogeview has not been populated
 HINT:  Use the REFRESH MATERIALIZED VIEW command.

 The procedure to reproduce this error message is:

 In the master server:
 CREATE TABLE hoge (i int);
 INSERT INTO hoge VALUES (generate_series(1,100));
 CREATE MATERIALIZED VIEW hogeview AS SELECT * FROM hoge;
 DELETE FROM hoge;
 REFRESH MATERIALIZED VIEW hogeview;
 SELECT count(*) FROM hogeview;

 In the standby server
 SELECT count(*) FROM hogeview;

 SELECT count(*) goes well in the master, and expectedly returns 0.
 OTOH, in the standby, it emits the error message.

 Will investigate.

 Thanks!

 And I found another problem. When I ran the following SQLs in the master,
 PANIC error occurred in the standby.

 CREATE TABLE hoge (i int);
 INSERT INTO hoge VALUES (generate_series(1,100));
 CREATE MATERIALIZED VIEW hogeview AS SELECT * FROM hoge;
 VACUUM ANALYZE;

 The PANIC error messages that I got in the standby are

 WARNING:  page 0 of relation base/12297/16387 is uninitialized
 CONTEXT:  xlog redo visible: rel 1663/12297/16387; blk 0
 PANIC:  WAL contains references to invalid pages
 CONTEXT:  xlog redo visible: rel 1663/12297/16387; blk 0

 base/12297/16387 is the file of the materialized view 'hogeview'.

I was able to replicate both bugs, and they both appear to be fixed
by the attached, which I have just pushed.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

*** a/src/backend/commands/matview.c
--- b/src/backend/commands/matview.c
***************
*** 14,19 ****
--- 14,20 ----
   */
  #include "postgres.h"
  
+ #include "access/heapam_xlog.h"
  #include "access/multixact.h"
  #include "access/relscan.h"
  #include "access/xact.h"
*************** SetRelationIsScannable(Relation relation)
*** 68,77 ****
  	Assert(relation->rd_rel->relkind == RELKIND_MATVIEW);
  	Assert(relation->rd_isscannable == false);
  
- 	RelationOpenSmgr(relation);
  	page = (Page) palloc(BLCKSZ);
  	PageInit(page, BLCKSZ, 0);
  	smgrextend(relation->rd_smgr, MAIN_FORKNUM, 0, (char *) page, true);
  	pfree(page);
  
  	smgrimmedsync(relation->rd_smgr, MAIN_FORKNUM);
--- 69,83 ----
  	Assert(relation->rd_rel->relkind == RELKIND_MATVIEW);
  	Assert(relation->rd_isscannable == false);
  
  	page = (Page) palloc(BLCKSZ);
  	PageInit(page, BLCKSZ, 0);
+ 
+ 	if (RelationNeedsWAL(relation))
+ 		log_newpage(&(relation->rd_node), MAIN_FORKNUM, 0, page);
+ 
+ 	RelationOpenSmgr(relation);
  	smgrextend(relation->rd_smgr, MAIN_FORKNUM, 0, (char *) page, true);
+ 
  	pfree(page);
  
  	smgrimmedsync(relation->rd_smgr, MAIN_FORKNUM);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Small patch for CREATE TRIGGER documentation

2013-03-06 Thread Ian Lawrence Barwick
I found this sentence somewhat confusing:

It is possible for a column's value to change even when the trigger
is not fired,

http://www.postgresql.org/docs/devel/static/sql-createtrigger.html#SQL-CREATETRIGGER-NOTES

More precise would be something along the lines It is possible that
changes to the
column's value will not cause the trigger to be fired.

The attached patch hopefully rewords the entire paragraph for clarity.

Regards

Ian Barwick


create-trigger-doc-notes-2013-03-07.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Craig Ringer
On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
 It'd be difficult to change the algorithm in a future release without
 breaking on-disk compatibility,
On-disk compatibility is broken with major releases anyway, so I don't
see this as a huge barrier.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Andres Freund
On 2013-03-07 08:37:40 +0800, Craig Ringer wrote:
 On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
  It'd be difficult to change the algorithm in a future release without
  breaking on-disk compatibility,
 On-disk compatibility is broken with major releases anyway, so I don't
 see this as a huge barrier.

Uh, pg_upgrade?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/6/13 1:34 PM, Robert Haas wrote:

We've had a few EnterpriseDB customers who have had fantastically
painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
ZFS block size to the PostgreSQL block size is supposed to make these
problems go away, but in my experience it does not have that effect.


There are a couple of major tuning issues you have to get right for good 
ZFS performance, like its tendency to gobble more RAM than is 
necessarily appropriate for a PostgreSQL host.  If you nail down all 
those and carefully set up everything, it can work OK.  When Sun had a 
bunch of good engineers working on the problem they certainly pulled it 
off.  I managed a 3TB database on a ZFS volume for a while myself. 
Being able to make filesystem snapshots cleanly and easily was very nice.


As for the write performance implications of COW, though, at a couple of 
points I was only able to keep that system ingesting data fast enough if 
I turned fsync off :(  It's not as if even ZFS makes all the filesystem 
issues the database worries about go away either.  Take a look at 
http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html 
as an example.  That should leave you with a healthy concern over ZFS 
handling of power interruption and lying drives.  [NTFS and ext3] have 
the same problem, but it has different effects, that aren't as visible 
as in ZFS.  ext4 actually fixed this for most hardware though, and I 
believe ZFS still has the same uberblock concern.  ZFS reliability and 
its page checksums are good, but they're not magic for eliminating torn 
page issues.


Normally I would agree with Heikki's "let's wait a few years and see if 
the filesystem will take care of it" idea.  But for me, the "when do we 
get checksums?" clock started ticking in 2006 when ZFS popularized its 
implementation, and now it's gone off and it keeps ringing at new places.  
I would love it if FreeBSD had caught a massive popularity wave in the 
last few years, so ZFS was running in a lot more places.  Instead what I 
keep seeing is Linux deployments with filesystem choices skewed toward 
the conservative.  Forget about the leading edge--I'd be happy if I could 
get one large customer to migrate off of ext3...


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Michael Paquier
On Thu, Mar 7, 2013 at 7:19 AM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-03-07 05:26:31 +0900, Michael Paquier wrote:
  On Thu, Mar 7, 2013 at 2:34 AM, Fujii Masao masao.fu...@gmail.com
 wrote:
 
   On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund and...@2ndquadrant.com
   wrote:
Indexes:
hoge_pkey PRIMARY KEY, btree (i)
hoge_pkey_cct PRIMARY KEY, btree (i) INVALID
hoge_pkey_cct1 PRIMARY KEY, btree (i) INVALID
hoge_pkey_cct_cct PRIMARY KEY, btree (i)
   
Huh, why did that go through? It should have errored out?
  
   I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should
   be marked as invalid, I think.
  
  CHECK_FOR_INTERRUPTS were not added at each phase and they are needed in
  case the process is interrupted by the user. This has been mentioned in a past
  review but it was missing, so it might have slipped out during a
  refactoring or smth. Btw, I am surprised to see that this *_cct_cct index
  has been created knowing that hoge_pkey_cct is invalid. I tried with the
  latest version of the patch and even the patch attached but couldn't
  reproduce it.

 The strange thing about hoge_pkey_cct_cct is that it seems to imply
 that an invalid index was reindexed concurrently?

 But I don't see how it could happen either. Fujii, can you reproduce it?

Curious about that also.


   +The recommended recovery method in such cases is to drop the
   concurrent
+index and try again to perform commandREINDEX
 CONCURRENTLY/.
   
If an invalid index depends on the constraint like primary key,
 drop
the concurrent
index cannot actually drop the index. In this case, you need to
 issue
alter table
... drop constraint ... to recover the situation. I think this
information should be
documented.
   
I think we just shouldn't set ->isprimary on the temporary indexes.
 Now
we switch only the relfilenodes and not the whole index, that should
 be
perfectly fine.
  
   Sounds good. But, what about other constraint case like unique
 constraint?
   Those other cases also can be resolved by not setting ->isprimary?
  
  We should stick with the concurrent index being a twin of the index it
  rebuilds for consistency.

 I don't think it's legal. We cannot simply have two indexes with
 'indisprimary' set. Especially not if both are valid.
 Also, there will be no pg_constraint row that refers to it which
 violates very valid expectations that both users and pg may have.

So what to do with that?
Mark the concurrent index as valid, then validate it and finally mark it as
invalid inside the same transaction at phase 4?
That's moving 2 lines of code...
-- 
Michael


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Craig Ringer
On 03/07/2013 08:41 AM, Andres Freund wrote:
 On 2013-03-07 08:37:40 +0800, Craig Ringer wrote:
 On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
 It'd be difficult to change the algorithm in a future release without
 breaking on-disk compatibility,
 On-disk compatibility is broken with major releases anyway, so I don't
 see this as a huge barrier.
 Uh, pg_upgrade?
Yeah. I was thinking that pg_upgrade copes with a lot of
incompatibilities already, but this is lower-level. Darn.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Support for REINDEX CONCURRENTLY

2013-03-06 Thread Michael Paquier
On Thu, Mar 7, 2013 at 9:48 AM, Michael Paquier
michael.paqu...@gmail.com wrote:



 On Thu, Mar 7, 2013 at 7:19 AM, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-03-07 05:26:31 +0900, Michael Paquier wrote:
  On Thu, Mar 7, 2013 at 2:34 AM, Fujii Masao masao.fu...@gmail.com
 wrote:
 
   On Thu, Mar 7, 2013 at 2:17 AM, Andres Freund and...@2ndquadrant.com
 
   wrote:
Indexes:
hoge_pkey PRIMARY KEY, btree (i)
hoge_pkey_cct PRIMARY KEY, btree (i) INVALID
hoge_pkey_cct1 PRIMARY KEY, btree (i) INVALID
hoge_pkey_cct_cct PRIMARY KEY, btree (i)
   
Huh, why did that go through? It should have errored out?
  
   I'm not sure why. Anyway hoge_pkey_cct_cct should not appear or should
   be marked as invalid, I think.
  
  CHECK_FOR_INTERRUPTS were not added at each phase and they are needed in
  case the process is interrupted by the user. This has been mentioned in a past
  review but it was missing, so it might have slipped out during a
  refactoring or smth. Btw, I am surprised to see that this *_cct_cct
 index
  has been created knowing that hoge_pkey_cct is invalid. I tried with the
  latest version of the patch and even the patch attached but couldn't
  reproduce it.

 The strange thing about hoge_pkey_cct_cct is that it seems to imply
 that an invalid index was reindexed concurrently?

 But I don't see how it could happen either. Fujii, can you reproduce it?

 Curious about that also.


   +The recommended recovery method in such cases is to drop the
   concurrent
+index and try again to perform <command>REINDEX
 CONCURRENTLY</command>.
   
If an invalid index depends on a constraint like a primary key,
"drop the concurrent index" cannot actually drop the index. In this
case, you need to issue ALTER TABLE ... DROP CONSTRAINT ... to recover
the situation. I think this information should be documented.
   
I think we just shouldn't set ->isprimary on the temporary indexes.
Now we switch only the relfilenodes and not the whole index, that
should be perfectly fine.
  
   Sounds good. But what about other constraint cases like unique
   constraints?
   Can those other cases also be resolved by not setting ->isprimary?
  
  We should stick with the concurrent index being a twin of the index it
  rebuilds for consistency.

 I don't think it's legal. We cannot simply have two indexes with
 'indisprimary'. Especially not if both are valid.
 Also, there will be no pg_constraint row that refers to it, which
 violates very valid expectations that both users and pg may have.

 So what to do with that?
 Mark the concurrent index as valid, then validate it and finally mark it
 as invalid inside the same transaction at phase 4?
 That's moving 2 lines of code...

Sorry phase 4 is the swap phase. Validation happens at phase 3.
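
To make the scenario above concrete, here is a sketch using the names from
the quoted psql output, assuming the REINDEX syntax proposed by this patch;
whether the constraint-based recovery really applies to the temporary index
is exactly the open question in this thread:

REINDEX INDEX CONCURRENTLY hoge_pkey;  -- if interrupted, can leave an
                                       -- INVALID hoge_pkey_cct behind

-- Spot leftover invalid indexes:
SELECT indexrelid::regclass FROM pg_index WHERE NOT indisvalid;

-- The recovery path Fujii asks to have documented, for the case where a
-- plain DROP INDEX is refused because the invalid index is still tied to
-- the primary key constraint:
-- ALTER TABLE hoge DROP CONSTRAINT hoge_pkey_cct;
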
-- 
Michael


Re: [HACKERS] Strange Windows problem, lock_timeout test request

2013-03-06 Thread Robert Haas
On Wed, Feb 27, 2013 at 8:58 AM, Stephen Frost sfr...@snowman.net wrote:
 * Boszormenyi Zoltan (z...@cybertec.at) wrote:
 But unlike statement_timeout,
 with lock_timeout_stmt the statement can still finish after this limit
 as it does useful work besides waiting for locks.

 It's still entirely possible to get 99% done and then hit that last
 tuple that you need a lock on and just tip over the lock_timeout_stmt
 limit due to prior waiting and ending up wasting a bunch of work, hence
 why I'm not entirely sure that this is that much better than
 statement_timeout.

I tend to agree that this should be based on the length of any
individual lock wait, not the cumulative duration of lock waits.
Otherwise, it seems like it'll be very hard to set this to a
meaningful value.  For example, if you set this to 1 minute, and that
means the length of any single wait, then you basically know that
it'll only kick in if there is some other, long-running transaction
that's holding the lock.  But if it means the cumulative length of all
waits, it's not so clear, because now you might also have this kick in
if you wait for 100ms on 600 different occasions.  In other words,
complex queries that lock or update many tuples may get killed even if
they never wait very long at all for any single lock.  That seems like
it will be almost indistinguishable from random, unprincipled query
cancellations.

Now, you could try to work around that by varying the setting based on
the complexity of the query you're about to run, but that seems like a
pain in the neck, to put it mildly.  And it might still not give you
the behavior that you want.  Personally, I'd think a big part of the
appeal of this is to make sure that you don't hang waiting for a
RELATION level lock for too long before giving up.  And for that,
scaling with the complexity of the query would be exactly the wrong
thing to do, even if you could figure out some system for it.
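
To illustrate the per-wait reading with a concrete, made-up example (both
the GUC name and these semantics are exactly what is under discussion here,
so treat this as a sketch only, and "accounts" as a hypothetical table):

SET lock_timeout = '1min';
BEGIN;
-- Under per-wait semantics this errors out only if a single lock
-- acquisition blocks for more than a minute (say, behind a long-running
-- ACCESS EXCLUSIVE lock); 600 separate 100ms waits would never trigger it.
LOCK TABLE accounts IN SHARE ROW EXCLUSIVE MODE;
COMMIT;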

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.3] OAT_POST_ALTER object access hooks

2013-03-06 Thread Robert Haas
On Sun, Jan 27, 2013 at 1:55 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 The part-1 patch adds catalog/objectaccess.c to have entrypoints of
 object_access_hook, instead of simple macro definition, to simplify
 invocations with arguments. It is just a replacement of existing
 OAT_POST_CREATE and OAT_DROP invocation with new style,

Looks good.  Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Jim Nasby

On 3/4/13 7:04 PM, Daniel Farina wrote:

Corruption has easily occupied more than one person-month of time last
year for us.


Just FYI for anyone that's experienced corruption... we've looked into doing 
row-level checksums at work. The only challenge we ran into was how to check 
them when reading data back. I don't remember the details but there was an 
issue with doing this via SELECT rules. It would be possible if you were 
willing to put writable views on all your tables (which isn't actually as 
horrible as it sounds; it wouldn't be hard to write a function to automagically 
do that for you).
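
A rough sketch of what that could look like, with hashtext() standing in
for a real checksum function and all object names made up:

CREATE TABLE widgets_raw (
    id      int PRIMARY KEY,
    payload text NOT NULL,
    crc     int NOT NULL
);

CREATE FUNCTION verify_crc(expected int, payload text) RETURNS text
LANGUAGE plpgsql IMMUTABLE AS $$
BEGIN
    IF expected IS DISTINCT FROM hashtext(payload) THEN
        RAISE EXCEPTION 'row checksum mismatch';
    END IF;
    RETURN payload;
END;
$$;

-- Reads go through the view, so a corrupted row raises an error instead of
-- being returned silently; writes would need INSTEAD OF triggers (the
-- "writable view" part) to populate crc on INSERT and UPDATE.
CREATE VIEW widgets AS
SELECT id, verify_crc(crc, payload) AS payload
FROM widgets_raw;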


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Jim Nasby

On 3/6/13 1:14 PM, Josh Berkus wrote:



There may be good reasons to reject this patch.  Or there may not.
But I completely disagree with the idea that asking them to solve the
problem at the filesystem level is sensible.


Yes, can we get back to the main issues with the patch?

1) argument over whether the checksum is sufficient to detect most
errors, or if it will give users false confidence.

2) performance overhead.

Based on Smith's report, I consider (2) to be a deal-killer right now.
The level of overhead reported by him would prevent the users I work
with from ever employing checksums on production systems.


FWIW, the write workload most likely wouldn't be a problem for us. I am 
concerned about the reported 24-32% hit when reading back in from FS cache... 
that might kill this for us.

I'm working on doing a test to see how bad it actually is for us... but getting 
stuff like that done at work is like pulling teeth, so we'll see...


Specifically, writing checksums for a read-only query is a defect I
think is prohibitively bad.  When we first talked about this feature for
9.2, we were going to exclude hint bits from checksums, in order to
avoid this issue; what happened to that?

(FWIW, I still support the idea of moving hint bits to a separate
filehandle, as we do with the FSM, but clearly that's not happening for
9.3 ...)


+1


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Using indexes for partial index builds

2013-03-06 Thread Jim Nasby

On 2/2/13 4:05 AM, Paul Norman wrote:

Hello,
After a discussion on IRC in #postgresql, I had a feature suggestion and it
was suggested I write it up here.

I have a large (200GB, 1.7b rows) table with a number of columns, but the
two of interest here are an hstore column, tags, and a PostGIS geometry
column, geom.

There is a GIN index on tags and a gist index on geom. These took about
36-48 hours to build in total. Obviously index building on a table this size
is not trivial.

Periodically I want to do a number of specialized queries on objects with a
particular tag or in a particular area. To do this I often want to create a
partial index. For example, I created the index btree ((tags ->
'name_1'::text) text_pattern_ops) WHERE tags ? 'name_1'::text. My
understanding is to create this index PostgreSQL does a scan of the entire
table, even though the GIN index on tags could be used to identify which
rows could belong in the index. Where the WHERE condition selects only a
small portion of the table this is scanning a lot more data than is
necessary.

Another case where it would be useful is when I am conducting a detailed
analysis of some aspect of the rows in a particular city. This leads to all
the queries being of the form SELECT ... FROM ... WHERE
is_in_my_area(geom)[1].

My current project is doing analysis involving addresses. The ability to
create an index like btree((tags -> 'addr:housenumber'), (tags ->
'addr:street'), (tags -> 'addr:city')) WHERE is_in_my_area(geom) in a
reasonable time would allow me to use a view instead of copying the local
area to a temporary table and indexing that table. The local area is about
350k rows, or about 0.02% of the database.

[1] The actual function for determining if it's in my area is long and not
really essential to the point here.


Something worth considering on this... I suspect it's possible to use an 
index-only scan to do this, regardless of whether the heap page is all visible. 
The reason is that the newly created index would just use the same access 
methodology as the original index, so any dead rows would be ignored.

We'd almost certainly need to block vacuums during the build however. Obviously 
not an issue for regular index builds, but it would be for concurrent ones.
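
For reference, the kind of partial index being described, written out as
full DDL (the table name is hypothetical):

CREATE INDEX planet_name_1_idx
    ON planet ((tags -> 'name_1') text_pattern_ops)
    WHERE tags ? 'name_1';

Building that today scans all 1.7b rows even though the existing GIN index
on tags already knows which rows satisfy the WHERE clause, which is the
optimization being requested.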


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal 9.4 plpgsql: allows access to call stack from GET DIAGNOSTICS statement

2013-03-06 Thread Jim Nasby

On 2/2/13 3:23 AM, Pavel Stehule wrote:

Hello

I propose enhancing the GET DIAGNOSTICS statement with a new field,
PG_CONTEXT. It is similar to GET STACKED DIAGNOSTICS'
PG_EXCEPTION_CONTEXT.

The motivation for this proposal is the possibility of getting the call
stack for debugging without raising an exception.

This code is based on cleaned-up code from Orafce, where it has been used
for four years without any error reports.

CREATE OR REPLACE FUNCTION public.inner(integer)
  RETURNS integer
  LANGUAGE plpgsql
AS $function$
declare _context text;
begin
   get diagnostics _context = pg_context;
   raise notice '***%***', _context;
   return 2 * $1;
end;
$function$

postgres=# select outer_outer(10);
NOTICE:  ***PL/pgSQL function inner(integer) line 4 at GET DIAGNOSTICS
PL/pgSQL function outer(integer) line 3 at RETURN
PL/pgSQL function outer_outer(integer) line 3 at RETURN***
CONTEXT:  PL/pgSQL function outer(integer) line 3 at RETURN
PL/pgSQL function outer_outer(integer) line 3 at RETURN


That could *really* stand some indentation in the output instead of the ***
business. But other than that, this definitely seems useful. I had no idea
_context was even an option... :/
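
For anyone wanting to reproduce the quoted output, the two outer wrappers
were not shown in the message; hypothetical definitions consistent with the
NOTICE text would be something like:

CREATE OR REPLACE FUNCTION public.outer(integer) RETURNS integer
LANGUAGE plpgsql AS $function$
begin
   return public.inner($1);
end;
$function$;

CREATE OR REPLACE FUNCTION public.outer_outer(integer) RETURNS integer
LANGUAGE plpgsql AS $function$
begin
   return public.outer($1);
end;
$function$;

SELECT outer_outer(10);  -- emits the NOTICE with the full call stack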


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [v9.3] OAT_POST_ALTER object access hooks

2013-03-06 Thread Robert Haas
On Sun, Jan 27, 2013 at 1:55 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 The part-2 patch adds new OAT_POST_ALTER event type, and
 its relevant permission checks on contrib/sepgsql.

This documentation hunk is unclear:

+On <xref linkend="sql-createfunction">, <literal>install</literal> permission
+will be checked if <literal>leakproof</literal> attribute was given, not only
+<literal>create</literal> on the new function.

Based on previous experience reading your patches, I'm guessing that
what you actually mean is that both things are checked, but the
wording doesn't make that clear.  Also, if permissions are now checked
on functions, doesn't the previous sentence need an update?

+In addition, <literal>add_name</literal> and <literal>remove_name</literal> permission
+will be checked towards relevant schema when we try to rename or set
+new schema on the altered object.

Suggest: In addition, <literal>remove_name</literal> and <literal>add_name</literal>
will be checked on the old and new schemas, respectively, when an
object is moved to a new schema.

+A few additional checks are applied depending on object types.

For certain object types, additional checks are performed.

+On <xref linkend="sql-alterfunction">, <literal>install</literal> permission
+will be checked if <literal>leakproof</literal> attribute was turned on, not
+only <literal>setattr</literal> on the new function.

This is a near-duplicate of the previous hunk and suffers from the
same awkwardness.

+ * is_internal: TRUE if constraint is constructed unless user's intention

I dunno what this means.  What's the difference between an internal
constraint and a non-internal constraint, and why do we need that
distinction?  This seems to percolate to a lot of places; I'd rather
not do that without a real good reason.

+   /* XXX - should be checked at caller side */

XXX should be used only for things that really ought to be revisited
and changed.  See the wording I used in the just-committed part 1
patch.

+#include "catalog/objectaccess.h"

This is the only hunk in collationcmds.c, hence presumably not needed.

+   /* Post create hook of this access method operator */
+   InvokeObjectPostCreateHook(AccessMethodOperatorRelationId,
+  entryoid, 0);

I suggest uniformly adding a blank line before each of these hunks,
rather than adding it for some and not others.  I think, though, that
we could probably ditch the comments throughout.  They don't add
anything, really.

@@ -3330,7 +3342,6 @@ ATExecCmd(List **wqueue, AlteredTableInfo *tab, Relation rel,
 */
break;
case AT_SetTableSpace:  /* SET TABLESPACE */
-
/*
 * Nothing to do here; Phase 3 does the work
 */

Spurious whitespace hunk.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/6/13 6:34 AM, Heikki Linnakangas wrote:

Another thought is that perhaps something like CRC32C would be faster to
calculate on modern hardware, and could be safely truncated to 16-bits
using the same technique you're using to truncate the Fletcher's
Checksum. Greg's tests showed that the overhead of CRC calculation is
significant in some workloads, so it would be good to spend some time to
optimize that. It'd be difficult to change the algorithm in a future
release without breaking on-disk compatibility, so let's make sure we
pick the best one.


Simon sent over his first rev of this using a quick-to-compute 16-bit 
checksum as a reasonable trade-off, one that it's possible to do right 
now.  It's not optimal in a few ways, but it catches single-bit errors 
that are missed right now, and Fletcher-16 computes quickly and without 
a large amount of code.  It's worth double-checking that the code is 
using the best Fletcher-16 approach available.  I've started on that, 
but I'm working on your general performance concerns first, with the 
implementation that's already there.


From what I've read so far, I think picking Fletcher-16 instead of the 
main alternative, CRC-16-IBM AKA CRC-16-ANSI, is a reasonable choice. 
There's a good table showing the main possibilities here at 
https://en.wikipedia.org/wiki/Cyclic_redundancy_check
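
For readers who have not bumped into it before, the Fletcher-16 recurrence
itself is tiny; here is a toy illustration in plpgsql, purely for
readability (the patch implements this in C over 8K pages and may differ in
blocking and modulus details):

CREATE FUNCTION fletcher16(data bytea) RETURNS integer
LANGUAGE plpgsql IMMUTABLE AS $$
DECLARE
    sum1 integer := 0;
    sum2 integer := 0;
    i    integer;
BEGIN
    FOR i IN 0 .. length(data) - 1 LOOP
        sum1 := (sum1 + get_byte(data, i)) % 255;
        sum2 := (sum2 + sum1) % 255;
    END LOOP;
    RETURN sum2 * 256 + sum1;  -- 16 bits: sum2 in the high byte, sum1 low
END;
$$;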


One day I hope that in-place upgrade learns how to do page format 
upgrades, with the sort of background conversion tools and necessary 
tracking metadata we've discussed for that work.  When that day comes, I 
would expect it to be straightforward to upgrade pages from 16 bit 
Fletcher checksums to 32 bit CRC-32C ones.  Ideally we would be able to 
jump on the CRC-32C train today, but there's nowhere to put all 32 bits. 
 Using a Fletcher 16 bit checksum for 9.3 doesn't prevent the project 
from going that way later though, once page header expansion is a solved 
problem.


The problem with running CRC32C in software is that the standard fast 
approach uses a slicing technique that requires a chunk of 
pre-computed data be around, a moderately large lookup table.  I don't 
see that there's any advantage to having all that baggage around if 
you're just going to throw away half of the result anyway.  More on 
CRC32Cs in my next message.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/6/13 1:24 PM, Tom Lane wrote:

Andres Freund and...@2ndquadrant.com writes:

On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:

If picking a CRC why not a short optimal one rather than truncate CRC32C?



CRC32C is available in hardware since SSE4.2.


I think that should be at most a fourth-order consideration, since we
are not interested solely in Intel hardware, nor do we have any portable
way of getting at such a feature even if the hardware has it.


True, but that situation might actually improve.

The Castagnoli CRC-32C that's accelerated on the better Intel CPUs is 
also used to protect iSCSI and SCTP (a streaming protocol).  And there 
is an active project to use a CRC32C to checksum ext4 metadata blocks on 
Linux:  https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums 
https://groups.google.com/forum/?fromgroups=#!topic/linux.kernel/APKfoMzjgdY


Now, that project doesn't make the Postgres feature obsolete, because 
there's nowhere to put checksum data for every block on ext4 without 
whacking block alignment.  The filesystem can't make an extra 32 bits 
appear on every block any more than we can.  It's using a similar trick 
to the PG checksum feature, grabbing some empty space just for the 
metadata then shoving the CRC32C into there.  But the fact that this is 
going on means that there are already Linux kernel modules built with 
both software/hardware accelerated versions of the CRC32C function.  And 
the iSCSI/SCTP use cases means it's not out of the question this will 
show up in other useful forms one day.  Maybe two years from now, there 
will be a common Linux library that autoconf can find to compute the CRC 
for us--with hardware acceleration when available, in software if not.


The first of those ext4 links above even discusses the exact sort of 
issue we're facing.  The author wonders if the easiest way to proceed 
for 16 bit checksums is to compute the CRC32C, then truncate it, simply 
because CRC32C creation is so likely to get hardware help one day.  I 
think that logic doesn't really apply to the PostgreSQL case as strongly 
though, as the time before we can expect a hardware-accelerated 
version to be available is much further off than a Linux kernel 
developer's future.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Stark
On Wed, Mar 6, 2013 at 11:04 PM, Robert Haas robertmh...@gmail.com wrote:
 When we first talked about this feature for
 9.2, we were going to exclude hint bits from checksums, in order to
 avoid this issue; what happened to that?

 I don't think anyone ever thought that was a particularly practical
 design.  I certainly don't.

Really? I thought it was pretty much the consensus for a good while.

The main problem it ran into was that we kept turning up hint bits
that we didn't realize we had. Index line pointers turned out to have
hint bits, page headers have one, and so on. As long as it was just
the heap page per-tuple transaction hint bits it seemed plausible to
just skip them or move them all to a contiguous block. Once it
started to look like the checksumming code had to know about every
data structure on every page it seemed a bit daunting. But that wasn't
something we realized for quite a long time.



-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith
TL;DR summary:  on a system I thought was a fair middle of the road 
server, pgbench tests are averaging about a 2% increase in WAL writes 
and a 2% slowdown when I turn on checksums.  There are a small number of 
troublesome cases where that overhead rises to closer to 20%, an upper 
limit that's shown up in a few tests aiming to stress this feature now.


On 3/4/13 10:09 PM, Jeff Davis wrote:

= Test 2 - worst-case overhead for calculating checksum while reading data =

Jeff saw an 18% slowdown, I get 24 to 32%.  This one bothers me because
the hit is going to happen during the very common situation where data
is shuffling a lot between a larger OS cache and shared_buffers taking a
relatively small fraction.


I believe that test 1 and test 2 can be improved a little, if there is a
need. Right now we copy the page and then calculate the checksum on the
copy. If we instead calculate as we're copying, I believe it will make
it significantly faster.


It's good to know there's at least some ideas for optimizing this one 
further.  I think the situation where someone has:


shared_buffers < database < total RAM

is fairly common for web applications.  For people on Amazon EC2 
instances for example, giving out the performance tuning advice of get 
a bigger instance until the database fits in RAM works amazingly well. 
 If the hotspot of that data set fits in shared_buffers, those people 
will still be in good shape even with checksums enabled.  If the hot 
working set is spread out more randomly, though, it's not impossible to 
see how they could suffer regularly from this ~20% OS cache->shared 
buffers movement penalty.


Regardless, Jeff's three cases are good synthetic exercises to see 
worst-case behavior, but they are magnifying small differences.  To see 
a more general case, I ran through a series of pgbench tests in its 
standard write mode.  In order to be useful, I ended up using a system 
with a battery-backed write cache, but with only a single drive 
attached.  I needed fsync to be fast to keep that from being the 
bottleneck.  But I wanted physical I/O to be slow.  I ran three test 
sets at various size/client loads:  one without the BBWC (which I kept 
here because it gives some useful scale to the graphs), one with the 
baseline 9.3 code, and one with checksums enabled on the cluster.  I did 
only basic postgresql.conf tuning:


 checkpoint_segments | 64
 shared_buffers      | 2GB

There's two graphs comparing sets attached, you can see that the 
slowdown of checksums for this test is pretty minor.  There is a clear 
gap between the two plots, but it's not a very big one, especially if 
you note how much difference a BBWC makes.


I put the numeric results into a spreadsheet, also attached.  There's so 
much noise in pgbench results that I found it hard to get a single 
number for the difference; they bounce around about +/-5% here. 
Averaging across everything gives a solid 2% drop when checksums are on 
that looked detectable above the noise.


Things are worse on the bigger data sets.  At the highest size I tested, 
the drop was more like 7%.  The two larger size / low client count 
results I got were really bad, 25% and 16% drops.  I think this is 
closing in on the range of things:  perhaps only 2% when most of your 
data fits in shared_buffers, more like 10% if your database is bigger, 
and in the worst case 20% is possible.  I don't completely trust those 
25/16% numbers though, I'm going to revisit that configuration.


The other thing I track now in pgbench-tools is how many bytes of WAL 
are written.  Since the total needs to be measured relative to work 
accomplished, the derived number that looks useful there is average 
bytes of WAL per transaction.  On smaller databases this is around 6K, 
while larger databases topped out for me at around 22K WAL 
bytes/transaction.  Remember that the pgbench transaction is several 
statements.  Updates touch different blocks in pgbench_accounts, index 
blocks, and the small tables.


The WAL increase from checksumming is a bit more consistent than the TPS 
rates.  Many cases were 3 to 5%.  There was one ugly case where it hit 
30%, and I want to dig into where that came from more.  On average, 
again it was a 2% increase over the baseline.


Cases where you spew hint bit WAL data where before none were written 
(Jeff's test #3) remain a far worse performer than any of these.  Since 
pgbench does a VACUUM before starting, none of those cases were 
encountered here though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
attachment: clients-sets.png
attachment: scaling-sets.png

Checksum-pgbench.xls
Description: MS-Excel spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Daniel Farina
On Wed, Mar 6, 2013 at 8:17 PM, Greg Smith g...@2ndquadrant.com wrote:
 TL;DR summary:  on a system I thought was a fair middle of the road server,
 pgbench tests are averaging about a 2% increase in WAL writes and a 2%
 slowdown when I turn on checksums.  There are a small number of troublesome
 cases where that overhead rises to closer to 20%, an upper limit that's
 shown up in a few tests aiming to stress this feature now.

I have only done some cursory research, but a cpu-time hit of 20% seems to
be expected for InnoDB's CRC computation[0].  Although a galling number,
this comparison with other systems may be a way to see how much of
that overhead is avoidable or just the price of entry.  It's unclear
how this 20% cpu-time compares to your above whole-system results, but
it's enough to suggest that nothing comes for (nearly) free.

[0]: http://mysqlha.blogspot.com/2009/05/innodb-checksum-performance.html

--
fdr


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Proposal for Allow postgresql.conf values to be changed via SQL [review]

2013-03-06 Thread Greg Smith

On 3/5/13 9:07 AM, Amit Kapila wrote:

In v11 patch, I have changed name of directory to config.
For file name, currently I have not changed, but if you feel it needs to be
changed, kindly suggest any one of above or if any other better you have in
mind.


This seems fine for now.  Hashing out exactly which name is best is not 
going to be the thing that determines whether this can be committed or not.


Your v11 applies cleanly for me too, and the code does look a lot nicer. 
 The code diff itself is down to 1300 lines and not nearly as 
disruptive, the rest is docs or the many regression tests you've added. 
 That's still not the sort of small change Tom was hoping this would be, 
but it's closer to the right range.  It may not really be 
possible to make this much smaller.


I'm out of time for today, but I'll try to do some functional review 
work on this tomorrow if no one beats me to it.  I would find it very 
useful to me personally if this feature were committed, and we know 
there's plenty of user demand for it too.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/7/13 12:15 AM, Daniel Farina wrote:

I have only done some cursory research, but a cpu-time hit of 20% seems to
be expected for InnoDB's CRC computation[0].  Although a galling number,
this comparison with other systems may be a way to see how much of
that overhead is avoidable or just the price of entry.  It's unclear
how this 20% cpu-time compares to your above whole-system results, but
it's enough to suggest that nothing comes for (nearly) free.


That does provide a useful measuring point:  how long does the 
computation take compared to the memcpy that moves the buffer around. 
It looks like they started out with 3.2 memcpy worth of work, and with 
enough optimization ended up at 1.27 worth.


The important thing to keep in mind is that shared_buffers works pretty 
well at holding on to the most frequently accessed information.  A 
typical server I see will show pg_statio information suggesting 90%+ of 
block requests are coming from hits there, the rest misses suggesting a 
mix of OS cache and real disk reads.  Let's say 90% are hits, 5% are 
fetches at this 20% penalty, and 5% are real reads where the checksum 
time is trivial compared to physical disk I/O.  That works out to be a 
real average slowdown of 6%.  I think way more deployments are going to 
be like that case, which matches most of my pgbench runs, than the worse 
case workloads.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Proposal for Allow postgresql.conf values to be changed via SQL [review]

2013-03-06 Thread Amit Kapila
On Thursday, March 07, 2013 10:54 AM Greg Smith wrote:
 On 3/5/13 9:07 AM, Amit Kapila wrote:
  In v11 patch, I have changed name of directory to config.
  For file name, currently I have not changed, but if you feel it needs
 to be
  changed, kindly suggest any one of above or if any other better you
 have in
  mind.
 
 This seems fine for now.  Hashing out exactly which name is best is not
 going to be the thing that determines whether this can be committed or
 not.
 
 Your v11 applies cleanly for me too, and the code does look a lot
 nicer.
   The code diff itself is down to 1300 lines and not nearly as
 disruptive, the rest is docs or the many regression tests you've added.

I also think the regression tests added may be more than required.
Currently they are organized as below:

1. Three types of parameters to use with SET PERSISTENT:
   a. parameters for which a new connection is required (log_connections)
   b. parameters for which a server restart is required
   c. parameters for which a reload is required

   I think in the third category (c) there are many parameters; maybe
3-4 parameters should be sufficient.
   The initial intention in keeping more parameters was to test various
types (Boolean, integer, float, string).

2. Set some of the parameters to default, then try reload and reconnect.
3. Again set those parameters to some value and re-verify by reload and
reconnect.
4. Test the SET PERSISTENT command inside functions and transactions.
5. Set all parameter values back to default.

I think the tests for step 3 can be removed and the tests for step 5 can be
merged into step 2.
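
For illustration, a step 2/3 style check could look like this, using the
syntax from this patch version (the exact command form was still under
review at this point):

SET PERSISTENT log_connections = on;
SELECT pg_reload_conf();
-- reconnect, since log_connections only takes effect for new sessions
SHOW log_connections;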

If you think the above optimizations to reduce the number of tests are okay,
then I will update the patch.

   That's still not the sort of small change Tom was hoping this would be,
  but it's closer to the right range.  It may not really be
 possible to make this much smaller.
 
 I'm out of time for today, but I'll try to do some functional review
 work on this tomorrow if no one beats me to it.  I would find it very
 useful to me personally if this feature were committed, and we know
 there's plenty of user demand for it too.

With Regards,
Amit Kapila.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers