Re: [HACKERS] postgresql.conf archive_command example

2011-09-08 Thread Fujii Masao
On Wed, Sep 7, 2011 at 11:53 PM, Robert Treat r...@xzilla.net wrote:
 On Tue, Sep 6, 2011 at 10:11 PM, Fujii Masao masao.fu...@gmail.com wrote:
 I agree that basically archive_command should not overwrite an existing file.
 But if the size of existing file is less than 16MB, it should do that.
 Otherwise,
 that WAL file would be lost forever.

 I think best practice in this case is that if you ever find an
 existing file with the same name already in place, you should error
 and investigate. We don't ship around partially completed WAL files,
 and finding an existing one probably means something went wrong. (Of
 course, we use rsync instead of copy/move, so we have some better
 guarantees about this).

That's an option. But I don't think that finding an existing file is so serious
problem. The most common cases which cause a partially-filled archived
file are;

1. The server crashes while WAL file is being archived, and then the server
restarts. In this case, the restarted server would find partially-filled
archived file.

2. In replication environment, the master crashes while WAL file is being
archived, and then a failover happens. In this case, new master would
find partially-filled archived file.

In these cases, I don't think it's so unsafe to overwrite an existing file.

OTOH, the practice you explained might fill up an archive area and
pg_xlog directory and then cause a PANIC error. Such a PANIC error
is more serious thing at least for me. So I'd like to overwrite an exiting
file when its size is not 16MB.

 I have another feature request;
 (5) Maybe not in the initial version, but eventually it might be
 nice to support calling posix_fadvise(POSIX_FADV_DONTNEED)
 after copying a WAL file.


 Can you go into more details on how you envision this working. I'm
 mostly curious because I think rsync might already support this, which
 would make it easy to incorporate.

I'm expecting that the executable is written in C, it calls posix_fadvice
against the file descriptor created when opening the WAL file in pg_xlog
directory, just before closing that descriptor.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Simon Riggs
On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Please lets not waste effort on refactoring efforts in mid dev cycle.

 Say what?  When else would you have us do it?

When else would you have us develop?

Major changes happen at start of a dev cycle, after some discussion.
Not in the middle and especially not for low priority items. It's not
even a bug, just code beautification.

We've all accepted the need for some change, but I would like to see
it happen slowly and carefully because of the very high risk of
introducing bugs into stable code that doesn't have a formal
verification mechanism currently. Anybody that could be trusted to
make those changes ought to have something better to do. If they don't
then that is in itself a much more serious issue than this.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql.conf archive_command example

2011-09-08 Thread Simon Riggs
On Thu, Sep 8, 2011 at 7:05 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Sep 7, 2011 at 11:53 PM, Robert Treat r...@xzilla.net wrote:
 On Tue, Sep 6, 2011 at 10:11 PM, Fujii Masao masao.fu...@gmail.com wrote:
 I agree that basically archive_command should not overwrite an existing 
 file.
 But if the size of existing file is less than 16MB, it should do that.
 Otherwise,
 that WAL file would be lost forever.

 I think best practice in this case is that if you ever find an
 existing file with the same name already in place, you should error
 and investigate. We don't ship around partially completed WAL files,
 and finding an existing one probably means something went wrong. (Of
 course, we use rsync instead of copy/move, so we have some better
 guarantees about this).

 That's an option. But I don't think that finding an existing file is so 
 serious
 problem.

The recommendation should be that the archived files are never
overwritten because that prevents a huge range of data loss bugs and
kills them stone dead.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-08 Thread daveg
On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
 daveg da...@sonic.net writes:
  On Wed, Sep 07, 2011 at 07:39:15PM -0400, Tom Lane wrote:
  BTW ... what were the last versions you were running on which you had
  *not* seen the problem?  (Just wondering about the possibility that we
  back-patched some fix that broke things.  It would be good to have
  a version range before trawling the commit logs.)
 
  The first version we saw it on was 8.4.7.
 
 Yeah, you said that.  I was wondering what you'd last run before 8.4.7.

Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
since shortly after it was released. Prior to that we have had all the major
and most of the minor releases since 7.1.
 
-dg

-- 
David Gould   da...@sonic.net  510 536 1443510 282 0869
If simplicity worked, the world would be overrun with insects.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql.conf archive_command example

2011-09-08 Thread Fujii Masao
On Thu, Sep 8, 2011 at 3:26 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, Sep 8, 2011 at 7:05 AM, Fujii Masao masao.fu...@gmail.com wrote:
 That's an option. But I don't think that finding an existing file is so 
 serious
 problem.

 The recommendation should be that the archived files are never
 overwritten because that prevents a huge range of data loss bugs and
 kills them stone dead.

I'm OK with that default behavior of the executable. It's helpful if
the executable
supports overwrite-if-filesize-is-not-16MB option.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_last_xact_insert_timestamp

2011-09-08 Thread Fujii Masao
On Thu, Sep 8, 2011 at 7:06 AM, Chris Redekop ch...@replicon.com wrote:
 Is there anything available to get the last time a transaction
 occurred?like say pg_last_xact_timestamp?  In order to accurately
 calculate how far behind my slave is I need to do something like
 master::pg_last_xact_timestamp() -
 slave::pg_last_xact_replay_timestamp()currently I'm using now() instead
 of the pg_last_xact_timestamp() call, but then when the master is not busy
 the slave appears to lag behind.  I'm considering writing a C module to get
 the last modified file time of the xlog, but I'm hoping there is a better
 alternative that I haven't found yet

The above has been posted in pgsql-general. The reason why Chris thinks
a counterpart of pg_last_xact_replay_timestamp() is required sounds
reasonable to me. So I'd like to propose new function
pg_last_xact_insert_timestamp as the counterpart, which returns the
timestamp of the last inserted commit or abort WAL record. I'll add the
attached patch into the next CF.

Comments?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***
*** 13996,14001  SELECT set_config('log_statement_stats', 'off', false);
--- 13996,14004 
  primarypg_current_xlog_location/primary
 /indexterm
 indexterm
+ primarypg_last_xact_insert_timestamp/primary
+/indexterm
+indexterm
  primarypg_start_backup/primary
 /indexterm
 indexterm
***
*** 14049,14054  SELECT set_config('log_statement_stats', 'off', false);
--- 14052,14064 
/row
row
 entry
+ literalfunctionpg_last_xact_insert_timestamp()/function/literal
+ /entry
+entrytypetimestamp with time zone/type/entry
+entryGet current transaction log insert time stamp/entry
+   /row
+   row
+entry
  literalfunctionpg_start_backup(parameterlabel/ typetext/ optional, parameterfast/ typeboolean/ /optional)/function/literal
  /entry
 entrytypetext/type/entry
***
*** 14175,14180  postgres=# SELECT * FROM pg_xlogfile_name_offset(pg_stop_backup());
--- 14185,14199 
 /para
  
 para
+ functionpg_last_xact_insert_timestamp/ displays the time stamp of last inserted
+ transaction. This is the time at which the commit or abort WAL record for that transaction.
+ This is initialized with the time stamp of last transaction replayed during recovery (i.e.,
+ the return value of functionpg_last_xact_replay_timestamp/). If no transactions
+ have been replayed during recovery, functionpg_last_xact_insert_timestamp/
+ returns NULL until at least one commit or abort WAL record has been inserted.
+/para
+ 
+para
  For details about proper usage of these functions, see
  xref linkend=continuous-archiving.
 /para
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***
*** 867,872  primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 867,881 
   commandps/ command (see xref linkend=monitoring-ps for details).
  /para
  para
+  You can also calculate the lag in time stamp by comparing the last
+  WAL insert time stamp on the primary with the last WAL replay
+  time stamp on the standby. They can be retrieved using
+  functionpg_last_xact_insert_timestamp/ on the primary and
+  the functionpg_last_xact_replay_timestamp/ on the standby,
+  respectively (see xref linkend=functions-admin-backup-table and
+  xref linkend=functions-recovery-info-table for details).
+ /para
+ para
   You can retrieve a list of WAL sender processes via the
   link linkend=monitoring-stats-views-table
   literalpg_stat_replication//link view. Large differences between
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***
*** 1041,1046  RecordTransactionCommit(void)
--- 1041,1049 
  			rdata[lastrdata].next = NULL;
  
  			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT, rdata);
+ 
+ 			/* Save timestamp of latest transaction commit record */
+ 			SetLastInsertXTime(xlrec.xact_time);
  		}
  		else
  		{
***
*** 1064,1069  RecordTransactionCommit(void)
--- 1067,1075 
  			rdata[lastrdata].next = NULL;
  
  			(void) XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT_COMPACT, rdata);
+ 
+ 			/* Save timestamp of latest transaction commit record */
+ 			SetLastInsertXTime(xlrec.xact_time);
  		}
  	}
  
***
*** 1433,1438  RecordTransactionAbort(bool isSubXact)
--- 1439,1447 
  
  	(void) XLogInsert(RM_XACT_ID, XLOG_XACT_ABORT, rdata);
  
+ 	/* Save timestamp of latest transaction abort record */
+ 	SetLastInsertXTime(xlrec.xact_time);
+ 
  	/*
  	 * Report the latest async abort LSN, so that the WAL writer knows to
  	 * flush this abort. 

Re: [HACKERS] pg_last_xact_insert_timestamp

2011-09-08 Thread Simon Riggs
On Thu, Sep 8, 2011 at 9:36 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Thu, Sep 8, 2011 at 7:06 AM, Chris Redekop ch...@replicon.com wrote:
 Is there anything available to get the last time a transaction
 occurred?like say pg_last_xact_timestamp?  In order to accurately
 calculate how far behind my slave is I need to do something like
 master::pg_last_xact_timestamp() -
 slave::pg_last_xact_replay_timestamp()currently I'm using now() instead
 of the pg_last_xact_timestamp() call, but then when the master is not busy
 the slave appears to lag behind.  I'm considering writing a C module to get
 the last modified file time of the xlog, but I'm hoping there is a better
 alternative that I haven't found yet

 The above has been posted in pgsql-general. The reason why Chris thinks
 a counterpart of pg_last_xact_replay_timestamp() is required sounds
 reasonable to me. So I'd like to propose new function
 pg_last_xact_insert_timestamp as the counterpart, which returns the
 timestamp of the last inserted commit or abort WAL record. I'll add the
 attached patch into the next CF.

For reasons stated on the original thread, I don't think we need this.

The OP can calculate what he wants without this.

The code already exists in recovery and has some purpose there. Adding
similar code to the mainline just so somebody can run this function
occasionally is not good. Based on this I will be looking to see if we
can optimise the recovery path some more.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_last_xact_insert_timestamp

2011-09-08 Thread Fujii Masao
On Thu, Sep 8, 2011 at 5:55 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, Sep 8, 2011 at 9:36 AM, Fujii Masao masao.fu...@gmail.com wrote:
 The above has been posted in pgsql-general. The reason why Chris thinks
 a counterpart of pg_last_xact_replay_timestamp() is required sounds
 reasonable to me. So I'd like to propose new function
 pg_last_xact_insert_timestamp as the counterpart, which returns the
 timestamp of the last inserted commit or abort WAL record. I'll add the
 attached patch into the next CF.

 For reasons stated on the original thread, I don't think we need this.

 The OP can calculate what he wants without this.

 The code already exists in recovery and has some purpose there. Adding
 similar code to the mainline just so somebody can run this function
 occasionally is not good. Based on this I will be looking to see if we
 can optimise the recovery path some more.

Okay. Let me explain another use case of pg_last_xact_insert_timestamp().
The existing functions might be enough for checking whether the standby
has already caught up with the master. But I think that new function would be
very useful to calculate *how much* the standby is behind from the master.

Of course, we can do that by using the existing functions. But a problem is
that they return LSN and the calculated delay is represented as the size of
WAL. For users, it's not easy to handle LSN (e.g., there is no function to do
calculation of LSN), and the delay in WAL size is not intuitive. I've sometimes
received such a complaint.

OTOH, new function enables users to monitor the delay as a timestamp.
For users, a timestamp is obviously easier to handle than LSN, and the delay
as a timestamp is more intuitive. So, I think that it's worth adding
something like pg_last_xact_insert_timestamp into core for improvement
of user-friendness.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Back branch update releases this week; beta postponed

2011-09-08 Thread Simon Riggs
On Wed, Sep 7, 2011 at 2:43 AM, Bruce Momjian br...@momjian.us wrote:
 Tom Lane wrote:
 Bernd Helmle maili...@oopsware.de writes:
  --On 10. April 2011 13:53:52 -0400 Tom Lane t...@sss.pgh.pa.us wrote:
  The core team has therefore decided to wrap back-branch
  update releases this Thursday for release Monday 4/18.

  Hmm, I would like to see the patch for
  http://archives.postgresql.org/pgsql-bugs/2011-03/msg00261.php
  going in for 8.4.8.

 Simon, was there a reason you only back-patched that to 9.0?

 Was this addressed?

Fix applied, following positive feedback in production testing.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] EXPLAIN and nfiltered, take two

2011-09-08 Thread Marko Tiikkaja

Hi,

Here's a patch for $SUBJECT based on the feedback here:
http://archives.postgresql.org/message-id/9053.1295888...@sss.pgh.pa.us

I intentionally decided to omit the information for Join Filter, since 
the information can already be deduced from EXPLAIN ANALYZE output, and 
for Left Joins and Anti Joins Rows Removed by Join Filter didn't 
really make much sense.


The Rows Removed by .. information is always shown by default (when 
there is a Filter or Recheck Cond, of course), and I didn't feel like it 
was worth it to add a new option for EXPLAIN to turn that information off.


As for documentation..  I'm really at a loss here.  I tried numerous 
different things for doc/src/sgml/perform.sgml, but I didn't feel like 
any of them added anything.  The EXPLAIN ANALYZE output seems quite 
self-explanatory after all.


Attached are also the tests I used, and an example output.


--
Marko Tiikkajahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services
BEGIN
CREATE TABLE
CREATE TABLE
CREATE TABLE
INSERT 0 20
INSERT 0 10
INSERT 0 10
CREATE INDEX
CREATE SERVER
CREATE FOREIGN TABLE
CREATE FUNCTION
SET
SET
SET
SET
SET
SET
SET
SET
SET
   QUERY PLAN   
 
-
 Index Scan using foo_a_index on big_table  (cost=0.00..2588.27 rows=333 
width=8) (actual time=1.488..47.456 rows=65000 loops=1)
   Index Cond: (a = 2)
   Filter: (b  5000)
   Rows Removed by Filter: 1667
 Total runtime: 57.127 ms
(5 rows)

SET
SET
SET
  QUERY PLAN
  
--
 Bitmap Heap Scan on big_table  (cost=19.86..961.93 rows=333 width=8) (actual 
time=13.673..67.979 rows=65000 loops=1)
   Recheck Cond: (a = 2)
   Rows Removed by Recheck Cond: 62526
   Filter: (b  5000)
   Rows Removed by Filter: 1667
   -  Bitmap Index Scan on foo_a_index  (cost=0.00..19.77 rows=1000 width=0) 
(actual time=12.309..12.309 rows=7 loops=1)
 Index Cond: (a = 2)
 Total runtime: 78.757 ms
(8 rows)

RESET
SET
SET
   QUERY PLAN   


 Seq Scan on foo  (cost=0.00..36.75 rows=713 width=8) (actual time=0.008..0.010 
rows=5 loops=1)
   Filter: (a  5)
   Rows Removed by Filter: 5
 Total runtime: 0.031 ms
(4 rows)

  QUERY PLAN
   
---
 Values Scan on *VALUES*  (cost=0.00..0.15 rows=3 width=4) (actual 
time=0.021..0.080 rows=5 loops=1)
   Filter: (column1  5)
   Rows Removed by Filter: 5
 Total runtime: 0.189 ms
(4 rows)

QUERY PLAN  
  
--
 CTE Scan on t  (cost=3279.93..4052.35 rows=11443 width=4) (actual 
time=0.020..0.033 rows=5 loops=1)
   Filter: (a  5)
   Rows Removed by Filter: 5
   CTE t
 -  Recursive Union  (cost=0.00..3279.93 rows=34330 width=4) (actual 
time=0.009..0.025 rows=10 loops=1)
   -  Function Scan on generate_series i  (cost=0.00..10.00 rows=1000 
width=4) (actual time=0.009..0.009 rows=5 loops=1)
   -  WorkTable Scan on t  (cost=0.00..258.33 rows= width=4) 
(actual time=0.002..0.003 rows=2 loops=2)
 Filter: ((a + 5) = 10)
 Rows Removed by Filter: 2
 Total runtime: 0.080 ms
(10 rows)

  QUERY PLAN
  
--
 Subquery Scan on ss  (cost=0.00..58.15 rows=713 width=8) (actual 
time=0.008..0.010 rows=5 loops=1)
   Filter: (ss.i  5)
   Rows Removed by Filter: 5
   -  Limit  (cost=0.00..31.40 rows=2140 width=8) (actual time=0.004..0.004 
rows=10 loops=1)
 -  Seq Scan on foo  (cost=0.00..31.40 rows=2140 width=8) (actual 
time=0.003..0.003 rows=10 loops=1)
 Total runtime: 0.056 ms
(6 rows)

QUERY PLAN  
   
---
 Function Scan on generate_series i  (cost=0.00..12.50 rows=333 width=4) 

Re: [HACKERS] [PATCH] Don't truncate integer part in to_char for 'FM99.'

2011-09-08 Thread Marti Raudsepp
On Wed, Sep 7, 2011 at 23:48, Tom Lane t...@sss.pgh.pa.us wrote:
 Also, the way yours is set up, I'm dubious that it
 does the right thing when the last '0' specifier is to the left of the
 decimal point.

When the last '0' is left of the decimal point, Num-zero_end is set
to 0, so the branch dealing with that is never executed.

 I'm currently testing this patch:

Looks good to me. It's potentially less efficient, but I'm not worried
about that.

Regards,
Marti

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_last_xact_insert_timestamp

2011-09-08 Thread Robert Haas
On Thu, Sep 8, 2011 at 6:14 AM, Fujii Masao masao.fu...@gmail.com wrote:
 OTOH, new function enables users to monitor the delay as a timestamp.
 For users, a timestamp is obviously easier to handle than LSN, and the delay
 as a timestamp is more intuitive. So, I think that it's worth adding
 something like pg_last_xact_insert_timestamp into core for improvement
 of user-friendness.

It seems very nice from a usability point of view, but I have to agree
with Simon's concern about performance.  Actually, as of today,
WALInsertLock is such a gigantic bottleneck that I suspect the
overhead of this additional bookkeeping would be completely
unnoticeable.  But I'm still reluctant to add more centralized
spinlocks that everyone has to fight over, having recently put a lot
of effort into getting rid of some of the ones we've traditionally
had.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] postgresql.conf archive_command example

2011-09-08 Thread Aidan Van Dyk
On Thu, Sep 8, 2011 at 2:05 AM, Fujii Masao masao.fu...@gmail.com wrote:

 That's an option. But I don't think that finding an existing file is so 
 serious
 problem. The most common cases which cause a partially-filled archived
 file are;

 1. The server crashes while WAL file is being archived, and then the server
    restarts. In this case, the restarted server would find partially-filled
    archived file.

 2. In replication environment, the master crashes while WAL file is being
    archived, and then a failover happens. In this case, new master would
    find partially-filled archived file.

 In these cases, I don't think it's so unsafe to overwrite an existing file.

Personally, I think both of these show examples of why PG should be
looking hard at either providing a simple robust local directory based
archive_command, or very seriously pointing users at properly written
tools like omniptr, or ptrtools, walmgr, etc...

Neither of those cases should ever happen.  If you're copying a file
into the archive, and making it appear non-atomically in your archive,
your doing something wrong.

Period.

No excuses.

a.
-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] concurrent snapshots

2011-09-08 Thread Ants Aasma
Hi,

I have been thinking about how to handle long running transactions with
Robert’s commit sequence number (CSN) idea.
http://archives.postgresql.org/message-id/CA%2BTgmoaAjiq%3Dd%3DkYt3qNj%2BUvi%2BMB-aRovCwr75Ca9egx-Ks9Ag%40mail.gmail.com

I just started to go through transaction management code, so I would
appreciate if anyone could point out any gaping holes in my ideas.

There are two reasons why XID-to-CSN values need to be kept around. One
is that long running transactions need a place to publish their CSN for
CSN based snapshots. The other is that CSN based snapshots need to know
which transactions between xmin and xmax are visible and which are
invisible.

My idea is to use hybrid storage for the XID-to-CSN storage. For recent
transactions shared memory contains a dense XID indexed ring buffer of
CSN values. Head of the buffer moves with nextXid. The tail is also
moved on xid assignment, but done in larger chunks to reduce lock
contention. When moving the tail up, any XID that might still be
invisible to a snapshot is moved to a sparse buffer. To keep the size of
the sparse buffer within limits, too old snapshots are converted into
list-of-running-xids snapshots. This allows us to move up the visible to
all csn-based snapshots horizon (csnmin) and release xids below that
horizon from the sparse buffer. CSNs themselves can be uint32s, this
means that no transaction can be older than 2 billion writing
transactions, which is already enforced by xids being 32 bit values.

Taking a snapshot using this idea would consist of a Xmin, Xmax and CSN.
Snapshot Xmin guarantees that no xid below Xmin can have a CSN bigger
than the snapshots. One way to get Xmin would be to maintain a global
value when doing the tail update of the dense array. Snapshot Xmax
guarantees that no xid above Xmax can have a CSN less than the
snapshots. Xmax can be easily obtained from nextXid. CSN is simply the
current value of the CSN counter. After the snapshot is taken, if
necessary the csnmin of the proc needs to be updated and snapshot added
to the list of acquired snapshots.

All of the guarantees of taking a snapshot can be maintained without
global locking if we have atomic reads/writes of CSN and XID values. We
need to obtain them in the order Xmin - CSN - Xmax, inserting read
barriers between the loads. Write side ordering is guaranteed by the CSN
lock. Updating the proc entry can be done under the per proc lock.

To check visibility of a xid between snapshots xmin and xmax, we first
check if its above the dense xid array tail. If it is in dense, just
read the CSN value and compare it to the snapshot value. If xid falls
under the ringbuffer tail, go through the sparse buffer. If the xid is
found in the sparse buffer, just read the CSN value and compare. If it
is not found, then it is guaranteed to be completed before the snapshot,
just read the clog status for the transaction. This is all done without
locking. When looking something up from the dense array, we have to
issue a read barrier and recheck the tail location to ensure that it
hasn’t moved out from under us. If it has, retry.

To commit a transaction we take a lock on the CSN, stamp all our XIDs
with the value, update clog, a write barrier to ensure that lock free
readers see the changes, increment the csn and relase the lock. For each
of our XIDs we need to check if it is above or below the dense
ringbuffer tail and update the respective value.

Moving of the dense ringbuffer tail should be done when assigning XIDs.
This allows us to enforce a hard limit on the size of the ringbuffer. To
reduce contention on the CSN lock, this should be done in large
increments. 1/8 of the ringbuffer at a time seems like a good compromise
between frequency and buffer space efficiency.

To move the tail, we first find the global minimum value of snapshot
CSNs (csnmin). If every proc maintains its own csnmin under per proc
lock, we can just get the current csn and go through all the procs, take
the per proc lock and fetch the csnmin to get the global value. Then we
go through XIDs between old tail and new tail and find all that are
still running or not visible to csnmin. We insert the XIDs into the
sparse buffer, issue a write barrier so they are visible everyone and
then update the tail location. Because we are trawling through the
procarray here anyway, its also a good time to update global xmin.

Because we want to avoid unbounded growth of the sparse buffer, we need
to get rid of old CSN based snapshots. Tying this into XID assignment is
a good way to get some provable limits, if subtransactions can be
overflowed somehow. For instance, we can keep track of the CSN during
last few tail updates, and convert any too old snapshots - this puts the
cap on the sparse buffer at O(num_backends*non-overflowed-sub-txs +
n_steps*tail-update-step) entries. When go try to find the new csnmin
and discover that a backend has a csnmin that is too old, we go through
the snapshots of that backend 

Re: [HACKERS] postgresql.conf archive_command example

2011-09-08 Thread Kevin Grittner
Aidan Van Dyk ai...@highrise.ca wrote:
 
 If you're copying a file into the archive, and making it appear
 non-atomically in your archive, your doing something wrong.
 
 Period.
 
 No excuses.
 
+1
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cheaper snapshots redux

2011-09-08 Thread Robert Haas
On Tue, Sep 6, 2011 at 11:06 PM, Amit Kapila amit.kap...@huawei.com wrote:
 1. With the above, you want to reduce/remove the concurrency issue between
 the GetSnapshotData() [used at begining of sql command execution] and
 ProcArrayEndTransaction() [used at end transaction]. The concurrency issue
 is mainly ProcArrayLock which is taken by GetSnapshotData() in Shared mode
 and by ProcArrayEndTransaction() in X mode.
 There may be other instances for similar thing, but this the main thing
 which you want to resolve.

Yep.

 2. You want to resolve it by using ring buffer such that readers don't need
 to take any lock.

Yep.  Actually, they're still going to need some spinlocks at least in
the first go round, to protect the pointers.  I'm hoping those can
eventually be eliminated on machines with 8-byte atomic reads using
appropriate memory barrier primitives.

 1. 2 Writers; Won't 2 different sessions who try to commit at same time will
 get the same write pointer.
    I assume it will be protected as even indicated in one of your replies
 as I understood?

Yes, commits have to be serialized.  No way around that.  The best
we'll ever be able to do is shorten the critical section.

 2. 1 Reader, 1 Writter; It might be case that some body has written a new
 snapshot and advanced the stop pointer and at that point of time one reader
 came and read between start pointer and stop pointer. Now the reader will
 see as follows:
   snapshot, few XIDs, snapshot

    So will it handle this situation such that it will only read latest
 snapshot?

In my prototype implementation that can't happen because the start and
stop pointers are protected by a single spinlock and are moved
simultaneously.  But I think we can get rid on machines with 8-byte
atomic writes of that and just move the stop pointer first and then
the start pointer.  If you get more than one snapshot in the middle
you just ignore the first part of the data you read and start with the
beginning of the last snapshot.

 3. How will you detect overwrite.

If the write pointer is greater than the start pointer by more than
the ring size, you've wrapped.

 4. Won't it effect if we don't update xmin everytime and just noting the
 committed XIDs. The reason I am asking is that it is used in tuple
 visibility check
    so with new idea in some cases instead of just returning from begining
 by checking xmin it has to go through the committed XID list.
    I understand that there may be less cases or the improvement by your
 idea can supesede this minimal effect. However some cases can be defeated.

The snapshot xmin has to be up to date.  I'm not planning to break
that because it would be wrong.

RecentGlobalXmin doesn't need to be completely up to date, and in fact
recomputing it on every snapshot becomes prohibitively expensive with
this approach.  I'm still struggling with the best way to handle that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Please lets not waste effort on refactoring efforts in mid dev cycle.

 Say what?  When else would you have us do it?

 When else would you have us develop?

In my eyes that sort of activity *is* development.  I find the
distinction you are drawing entirely artificial, and more calculated to
make sure refactoring never happens than to add any safety.  Any
significant development change carries a risk of breakage.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] concurrent snapshots

2011-09-08 Thread Robert Haas
On Thu, Sep 8, 2011 at 9:26 AM, Ants Aasma ants.aa...@eesti.ee wrote:
 When go try to find the new csnmin
 and discover that a backend has a csnmin that is too old, we go through
 the snapshots of that backend and convert every snapshot under the
 desired csnmin to a traditional snapshot.

I thought about something along these lines (though I didn't flesh out
the details as much as you have here), but rejected it because the
step described above would require all snapshots to be stored in
shared memory.  That's problematic for several reasons:

1. A backend can have lots of snapshots, potentially requiring an
unbounded amount of shared memory.  We can't accommodate that.

2. You'd need to protect all of those snapshots with spinlocks or
something, which would be wicked expensive, because the owning process
would need to take and release that spinlock every time it touched the
snapshot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Bruce Momjian
Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Simon Riggs si...@2ndquadrant.com writes:
  Please lets not waste effort on refactoring efforts in mid dev cycle.
 
  Say what? ?When else would you have us do it?
 
  When else would you have us develop?
 
 In my eyes that sort of activity *is* development.  I find the
 distinction you are drawing entirely artificial, and more calculated to
 make sure refactoring never happens than to add any safety.  Any
 significant development change carries a risk of breakage.

I ran pgrminclude a week ago and that is certainly a larger change than
this.  Early in the development cycle people are merging in their saved
patches, so now is a fine time to do refactoring.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Robert Haas
On Thu, Sep 8, 2011 at 10:29 AM, Bruce Momjian br...@momjian.us wrote:
 Tom Lane wrote:
 Simon Riggs si...@2ndquadrant.com writes:
  On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Simon Riggs si...@2ndquadrant.com writes:
  Please lets not waste effort on refactoring efforts in mid dev cycle.

  Say what?  When else would you have us do it?

  When else would you have us develop?

 In my eyes that sort of activity *is* development.  I find the
 distinction you are drawing entirely artificial, and more calculated to
 make sure refactoring never happens than to add any safety.  Any
 significant development change carries a risk of breakage.

 I ran pgrminclude a week ago and that is certainly a larger change than
 this.  Early in the development cycle people are merging in their saved
 patches, so now is a fine time to do refactoring.

+1.

I'd feel more comfortable refactoring it if we had some automated
testing of those code paths, but I don't see anything wrong with doing
it now from a timing perspective.  We still have 4 months until the
start of the final CommitFest.  I wouldn't be too enthusiastic about
starting a project like this in January, but now seems fine.  A bigger
problem is that I don't hear anyone volunteering to do the work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-09-08 Thread Heikki Linnakangas

On 06.09.2011 01:18, Alexander Korotkov wrote:

Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
doesn't exceed maximal offset number.


I've committed the patch now, including that fix. Thanks for a great 
GSoC project!


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-09-08 Thread Robert Haas
On Thu, Sep 8, 2011 at 10:59 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 06.09.2011 01:18, Alexander Korotkov wrote:

 Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
 doesn't exceed maximal offset number.

 I've committed the patch now, including that fix. Thanks for a great GSoC
 project!

Wow, major congrats, Alexander!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FATAL: lock AccessShareLock on object 0/1260/0 is already held

2011-09-08 Thread Tom Lane
daveg da...@sonic.net writes:
 On Wed, Sep 07, 2011 at 09:02:04PM -0400, Tom Lane wrote:
 daveg da...@sonic.net writes:
 The first version we saw it on was 8.4.7.

 Yeah, you said that.  I was wondering what you'd last run before 8.4.7.

 Sorry, misunderstood. We were previously running 8.4.4, but have been on 8.4.7
 since shortly after it was released. Prior to that we have had all the major
 and most of the minor releases since 7.1.

Well, I groveled through the commit logs from 8.4.4 to 8.4.7, and
I can't find anything that looks like it could possibly be related.
So at this point I'm inclined to think that the bug is older than
that, but your usage patterns changed so that you started to tickle it.

Can you think of any changes in your usage that might date to around
that time, and would somehow be connected to backend startup/shutdown?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] concurrent snapshots

2011-09-08 Thread Ants Aasma
On Thu, Sep 8, 2011 at 5:28 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 8, 2011 at 9:26 AM, Ants Aasma ants.aa...@eesti.ee wrote:
 When go try to find the new csnmin
 and discover that a backend has a csnmin that is too old, we go through
 the snapshots of that backend and convert every snapshot under the
 desired csnmin to a traditional snapshot.

 I thought about something along these lines (though I didn't flesh out
 the details as much as you have here), but rejected it because the
 step described above would require all snapshots to be stored in
 shared memory.  That's problematic for several reasons:

 1. A backend can have lots of snapshots, potentially requiring an
 unbounded amount of shared memory.  We can't accommodate that.

If PostgreSQL gets POSIX shared memory capability at some point in the
future, would it be enough to accommodate snapshots in shared memory?

 2. You'd need to protect all of those snapshots with spinlocks or
 something, which would be wicked expensive, because the owning process
 would need to take and release that spinlock every time it touched the
 snapshot.

It seems to me that converting a transactions type can be done
lock-free. The process that does the converting needs to ensure that
the new transaction type flag is visible before releasing any xids.
For visibility checking, the additional cost is a read barrier, two
volatile reads (recheck snapshot type and dense horizon) and occasional
retry after getting a visibility result.

Maybe I'm missing something. I'll take a deeper look at the snapshot
handling code tonight to see if anything else might have any
implications.

--
Ants Aasma

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] concurrent snapshots

2011-09-08 Thread Tom Lane
Ants Aasma ants.aa...@eesti.ee writes:
 On Thu, Sep 8, 2011 at 5:28 PM, Robert Haas robertmh...@gmail.com wrote:
 1. A backend can have lots of snapshots, potentially requiring an
 unbounded amount of shared memory.  We can't accommodate that.

 If PostgreSQL gets POSIX shared memory capability at some point in the
 future, would it be enough to accommodate snapshots in shared memory?

Don't hold your breath.  Even if we moved over to using mmap() in place
of shmat(), it's a *very* long way from there to being able to
dynamically expand shared memory and have all backends able to access
the additional space.  Doing that in a way that's cheap enough and
transparent enough that we'd actually accept it is even more challenging
--- for instance, it seems hard to impossible to guarantee that all
backends would be able to map the added space at the same addresses.

Most of the interest in mmap is not about any dynamic-expansion ideas,
it's just about getting out from under the ridiculously small limits on
SysV shmem space that are still default on many platforms.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] concurrent snapshots

2011-09-08 Thread Robert Haas
On Thu, Sep 8, 2011 at 11:33 AM, Ants Aasma ants.aa...@eesti.ee wrote:
 On Thu, Sep 8, 2011 at 5:28 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 8, 2011 at 9:26 AM, Ants Aasma ants.aa...@eesti.ee wrote:
 When go try to find the new csnmin
 and discover that a backend has a csnmin that is too old, we go through
 the snapshots of that backend and convert every snapshot under the
 desired csnmin to a traditional snapshot.

 I thought about something along these lines (though I didn't flesh out
 the details as much as you have here), but rejected it because the
 step described above would require all snapshots to be stored in
 shared memory.  That's problematic for several reasons:

 1. A backend can have lots of snapshots, potentially requiring an
 unbounded amount of shared memory.  We can't accommodate that.

 If PostgreSQL gets POSIX shared memory capability at some point in the
 future, would it be enough to accommodate snapshots in shared memory?

By itself, that particular change would not help with this problem.

 2. You'd need to protect all of those snapshots with spinlocks or
 something, which would be wicked expensive, because the owning process
 would need to take and release that spinlock every time it touched the
 snapshot.

 It seems to me that converting a transactions type can be done
 lock-free. The process that does the converting needs to ensure that
 the new transaction type flag is visible before releasing any xids.
 For visibility checking, the additional cost is a read barrier, two
 volatile reads (recheck snapshot type and dense horizon) and occasional
 retry after getting a visibility result.

 Maybe I'm missing something. I'll take a deeper look at the snapshot
 handling code tonight to see if anything else might have any
 implications.

I'm not convinced it's anywhere near that easy.  For one thing, on at
least one big server I'm playing with, memory latency on shared memory
is vastly higher (like 10x!) than on backend-local memory due to NUMA
effects.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] force_not_null option support for file_fdw

2011-09-08 Thread Kohei Kaigai
Hi Hanada-san.

 ISTM that your results are reasonable for each collation setting.
 Former ordering is same as C locale, and in latter case alphabetical order 
 has priority over case
 distinctions.  Do you mean that ordering used in file_fdw is affected from 
 something unexpected, without
 collation or locale setting?
 
 BTW, I found a thread which is related to this issue.
   http://archives.postgresql.org/pgsql-hackers/2011-09/msg00130.php
 
 I changed the test data so that it uses only upper case alphabets, because 
 case distinction is not
 important for that test.  I also removed digits to avoid test failure in some 
 locales which sort
 alphabets before digits.

OK, Thanks for this clarification. This change of regression test case seems to 
me reasonable
to avoid unnecessary false-positive.

I found one other point to be fixed:
On get_force_not_null(), it makes a list of attribute names with force_not_null 
option.

+   foreach (cell, options)
+   {
+   DefElem*def = (DefElem *) lfirst(cell);
+   const char *value = defGetString(def);
+
+   if (strcmp(def-defname, force_not_null) == 0 
+   strcmp(value, true) == 0)
+   {
+   columns = lappend(columns, makeString(NameStr(attr-attname)));
+   elog(DEBUG1, %s: force_not_null, NameStr(attr-attname));
+   }
+
+   }

makeString() does not copy the supplied string itself, so it is not preferable 
to reference
NameStr(attr-attname) across ReleaseSysCache().
I'd like to suggest to supply a copied attname using pstrdup for makeString

Thanks,
--
NEC Europe Ltd, SAP Global Competence Center
KaiGai Kohei kohei.kai...@emea.nec.com


 -Original Message-
 From: Shigeru Hanada [mailto:shigeru.han...@gmail.com]
 Sent: 8. September 2011 06:19
 To: Kohei Kaigai
 Cc: Kohei KaiGai; PostgreSQL-development
 Subject: Re: [HACKERS] force_not_null option support for file_fdw
 
 (2011/09/05 22:05), Kohei Kaigai wrote:
  In my usual environment that test passed, but finally I've reproduced
  the failure with setting $LC_COLLATE to es_ES.UTF-8.  Do you have set 
  any $LC_COLLATE in your
 test environment?
 
  It is not set in my environment.
 
  I checked the behavior of ORDER BY when we set a collation on the regular 
  relation, not a foreign
 table.
  Do we hit something other unexpected bug in collation here?
 
  postgres=# CREATE TABLE t1 (word1 text); CREATE TABLE postgres=#
  INSERT INTO t1 VALUES ('ABC'),('abc'),('123'),('NULL'); INSERT 0 4
  postgres=# ALTER TABLE t1 ALTER word1 TYPE text COLLATE ja_JP.utf8;
  ALTER TABLE postgres=# SELECT * FROM t1 ORDER BY word1;
word1
  ---
123
ABC
NULL
abc
  (4 rows)
 
  postgres=# ALTER TABLE t1 ALTER word1 TYPE text COLLATE en_US.utf8;
  ALTER TABLE postgres=# SELECT * FROM t1 ORDER BY word1;
word1
  ---
123
abc
ABC
NULL
  (4 rows)
 
 Thanks for the checking.  FYI, I mainly use Fedora 15 box with Japanese 
 environment for my development.
 
 ISTM that your results are reasonable for each collation setting.
 Former ordering is same as C locale, and in latter case alphabetical order 
 has priority over case
 distinctions.  Do you mean that ordering used in file_fdw is affected from 
 something unexpected, without
 collation or locale setting?
 
 BTW, I found a thread which is related to this issue.
   http://archives.postgresql.org/pgsql-hackers/2011-09/msg00130.php
 
 I changed the test data so that it uses only upper case alphabets, because 
 case distinction is not
 important for that test.  I also removed digits to avoid test failure in some 
 locales which sort
 alphabets before digits.
 
 Regards,
 --
 Shigeru Hanada
 
 
  Click
 https://www.mailcontrol.com/sr/fB6Wmr8zmMzTndxI!oX7Uo9cpkuWnNqkEgc!P6cHvBhGb4EkL1te5Ky38yYzoE4Mra
 3ljAyIpUlPbv5+FCDqIw==  to report this email as spam.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Fast GiST index build - further improvements

2011-09-08 Thread Heikki Linnakangas
Now that the main GiST index build patch has been committed, there's a 
few further improvements that could make it much faster still:


Better management of the buffer pages on disk. At the moment, the 
temporary file is used as a heap of pages belonging to all the buffers 
in random order. I think we could speed up the algorithm considerably by 
reading/writing the buffer pages sequentially. For example, when an 
internal page is split, and all the tuples in its buffer are relocated, 
that would be a great chance to write the new pages of the new buffers 
in sequential order, instead of writing them back to the pages freed up 
by the original buffer, which can be spread all around the temp file. I 
wonder if we could use a separate file for each buffer? Or at least, a 
separate file for all buffers that are larger than, say 100 MB in size.


Better management of in-memory buffer pages. When we start emptying a 
buffer, we currently flush all the buffer pages in memory to the 
temporary file, to make room for new buffer pages. But that's a waste of 
time, if some of the pages we had in memory belonged to the buffer we're 
about to empty next, or that we empty tuples to. Also, if while emptying 
a buffer, all the tuples go to just one or two lower level buffers, it 
would be beneficial to keep more than one page in-memory for those buffers.


Buffering leaf pages. This I posted on a separate thread already: 
http://archives.postgresql.org/message-id/4e5350db.3060...@enterprisedb.com



Also, at the moment there is one issue with the algorithm that we have 
glossed over this far: For each buffer, we keep some information in 
memory, in the hash table, and in the auxiliary lists. That means that 
the amount of memory needed for the build scales with the size of the 
index. If you're dealing with very large indexes, hopefully you also 
have a lot of RAM in your system, so I don't think this is a problem in 
practice. Still, it would be nice to do something about that. A 
straightforward idea would be to swap some of the information to disk. 
Another idea that, simpler to implement, would be to completely destroy 
a buffer, freeing all the memory it uses, when it becomes completely 
empty. Then, if you're about to run out of memory (as defined by 
maintenance_work_mem), you can empty some low level buffers to disk to 
free up some.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Simon Riggs
On Thu, Sep 8, 2011 at 3:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Please lets not waste effort on refactoring efforts in mid dev cycle.

 Say what?  When else would you have us do it?

 When else would you have us develop?

 In my eyes that sort of activity *is* development.  I find the
 distinction you are drawing entirely artificial, and more calculated to
 make sure refactoring never happens than to add any safety.  Any
 significant development change carries a risk of breakage.

You clearly have the bit between your teeth on this.

That doesn't make it worthwhile or sensible though.

I've offered to do it slowly and carefully over time, but that seems
not enough for some reason. What is the real reason for this?

I assume whoever does it will be spending significant time on testing
and bug fixing afterwards. I foresee lots of while I'm there, I
thought I'd just mend X, so we'll spend lots of time fighting to keep
functionality that's already there. Look at the discussion around
archive_command for an example of that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Bruce Momjian
Simon Riggs wrote:
 On Thu, Sep 8, 2011 at 3:25 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Simon Riggs si...@2ndquadrant.com writes:
  On Wed, Sep 7, 2011 at 9:02 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Simon Riggs si...@2ndquadrant.com writes:
  Please lets not waste effort on refactoring efforts in mid dev cycle.
 
  Say what? ?When else would you have us do it?
 
  When else would you have us develop?
 
  In my eyes that sort of activity *is* development. ?I find the
  distinction you are drawing entirely artificial, and more calculated to
  make sure refactoring never happens than to add any safety. ?Any
  significant development change carries a risk of breakage.
 
 You clearly have the bit between your teeth on this.

So I guess Tom, Robert Haas, and I all have bits then.

 That doesn't make it worthwhile or sensible though.

We are not saying do it now.  We are disputing your suggestion that now
is not a good time --- there is a difference.

 I've offered to do it slowly and carefully over time, but that seems
 not enough for some reason. What is the real reason for this?

Oh, if we told you, then, well, you would know, and the big secret would
be out.  (Hint:  when three people are telling you the same thing, odds
are there is not some secret motive.)

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-09-08 Thread Oleg Bartunov

My congratulations too, Alexander ! Hope to work on SP-GiST together !

Oleg

On Thu, 8 Sep 2011, Heikki Linnakangas wrote:


On 06.09.2011 01:18, Alexander Korotkov wrote:

Small bugfix: in gistBufferingFindCorrectParent check that downlinkoffnum
doesn't exceed maximal offset number.


I've committed the patch now, including that fix. Thanks for a great GSoC 
project!





Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Protecting against multiple instances per cluster

2011-09-08 Thread Thom Brown
Hi all,

I've come across a PostgreSQL set up where there are 2 servers, each
with the same version of PostgreSQL on, both mounting the same SAN
onto their respective file systems.  It was intended that only 1 of
the servers would be running an instance of PostgreSQL at a time as
they both point to the same pgdata.  This was dubbed a high
availability set up, where if one server went down, they could start
PostgreSQL on the other.  (yes, I know what you're thinking)  Now
normally there is protection against 2 instances running only if the
instances on the same server as it would reference shared memory.  But
in this case, neither server has access to the other's shared memory,
so it has to rely on the pid file.  But the pid file isn't enough by
itself.  In this set up, if someone were to inadvertently start up a
Postgres instance on the 2nd server whilst the 1st was still running,
it would do very bad things.

For example, when I set up the same scenario on my own network, it
indeed let me start up the 2nd instance.  I then tried setting up a
table and generating lots of data for it, then... KABOOM:

postgres=# create table things (id serial, things int);
NOTICE:  CREATE TABLE will create implicit sequence things_id_seq
for serial column things.id
CREATE TABLE
postgres=# insert into things (things) select x from
generate_series(1,50) a(x);
LOG:  could not link file pg_xlog/xlogtemp.28426 to
pg_xlog/00010002 (initialization of log file 0,
segment 2): Operation not supported
STATEMENT:  insert into things (things) select x from
generate_series(1,50) a(x);
PANIC:  could not open file pg_xlog/00010002 (log
file 0, segment 2): No such file or directory
STATEMENT:  insert into things (things) select x from
generate_series(1,50) a(x);
PANIC:  could not open file pg_xlog/00010002 (log
file 0, segment 2): No such file or directory
PANIC:  could not open file pg_xlog/00010002 (log
file 0, segment 2): No such file or directory
The connection to the server was lost. Attempting reset: LOG:  server
process (PID 28426) was terminated by signal 6: Abort trap
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
Failed.
! LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2011-09-08 19:04:47 BST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  consistent recovery state reached at 0/1755268
LOG:  redo starts at 0/1755268
LOG:  could not open file pg_xlog/00010002 (log file
0, segment 2): No such file or directory
LOG:  redo done at 0/1A8
LOG:  last completed transaction was at log time 2011-09-08 19:05:14.098496+01
LOG:  could not link file pg_xlog/xlogtemp.28429 to
pg_xlog/00010002 (initialization of log file 0,
segment 2): Operation not supported
PANIC:  could not open file pg_xlog/00010002 (log
file 0, segment 2): No such file or directory
LOG:  startup process (PID 28429) was terminated by signal 6: Abort trap
LOG:  aborting startup due to startup process failure

Now obviously no-one should ever set up their system in such a
fashion, but some have, as I've witnessed it.  I suspect this is
potentially the cause of their continued database corruption and
outages, where expected WAL files aren't in the pg_xlog directory, so
recovery can't finish, and clog files going missing etc.

While I appreciate that this isn't a bug, and that no-one should
actually be setting things up in this way, it does introduce the
ability to hose one's own cluster without realising (until it's
possibly too late).

Would there be a way to prevent this abhorrent scenario from coming
into existence?  One idea is to have a configuration option to be
strict about the presence of a pid file in the data directory, and
force manual intervention, but I'm not sure this would solve the
problem in most cases where this problem exists as someone would have
had to specifically sought out the option and set it.  It might also
encourage some to just delete the pid file thinking that would make
the nasty errors go away.

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: 

Re: [HACKERS] Protecting against multiple instances per cluster

2011-09-08 Thread Tom Lane
Thom Brown t...@linux.com writes:
 I've come across a PostgreSQL set up where there are 2 servers, each
 with the same version of PostgreSQL on, both mounting the same SAN
 onto their respective file systems.  It was intended that only 1 of
 the servers would be running an instance of PostgreSQL at a time as
 they both point to the same pgdata.  This was dubbed a high
 availability set up, where if one server went down, they could start
 PostgreSQL on the other.  (yes, I know what you're thinking)

Multiply by about ten and you might have an idea what I'm thinking.

 Now
 normally there is protection against 2 instances running only if the
 instances on the same server as it would reference shared memory.  But
 in this case, neither server has access to the other's shared memory,
 so it has to rely on the pid file.  But the pid file isn't enough by
 itself.

The pid file is not merely not enough, it's entirely useless, since
the two machines aren't sharing a process ID space.  If somebody starts
a postmaster on machine 2, it will almost certainly see the pid in the
pidfile as not running (on machine 2).  So you have no safety interlock
whatsoever in this configuration.

It is possible to build configurations of this type safely, but you need
some external dead-man-switch or STONITH arrangement that forcibly kills
machine 1 (or at least disconnects it from the SAN) before machine 2 is
allowed to start.  Postgres can't do it on its own, and as you've
undoubtedly already found out, human DBAs can't be trusted to get it
right either.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 You clearly have the bit between your teeth on this.

Personally, I'm neither intending to break up xlog.c right now, nor
asking you to do it.  I'm just objecting to your claim that there
should be some project-policy restriction on when refactoring gets done.
I do have other refactoring-ish plans in mind, eg
http://archives.postgresql.org/message-id/9232.1315441...@sss.pgh.pa.us
and am not going to be happy if you claim I can't do that now.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_dump.c

2011-09-08 Thread Andrew Dunstan


In the refactoring Large C files discussion one of the biggest files 
Bruce mentioned is pg_dump.c. There has been discussion in the past of 
turning lots of the knowledge currently embedded in this file into a 
library, which would make it available to other clients (e.g. psql). I'm 
not sure what a reasonable API for that would look like, though. Does 
anyone have any ideas?


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Fast GiST index build

2011-09-08 Thread Alexander Korotkov
Thanks for congratulations!
Tnanks to Heikki for mentoring and his work on patch!

--
With best regards,
Alexander Korotkov.


[HACKERS] Re: [COMMITTERS] pgsql: Add missing format argument to ecpg_log() call

2011-09-08 Thread Peter Eisentraut
On tor, 2011-09-08 at 15:32 -0400, Tom Lane wrote:
 Peter Eisentraut pete...@gmx.net writes:
  Add missing format argument to ecpg_log() call
 
 Oh, fun.  So why isn't there an __attribute__((format...)) decoration
 on ecpg_log()?

I have a larger patch for that.  I just wanted to get the one piece of
obvious breakage backpatched.

I'm also proposing to add -Wmissing-format-attribute to the standard
options set.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] memory-related bugs

2011-09-08 Thread Daniel Farina
On Tue, Sep 6, 2011 at 12:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 [ Sorry for letting this slip through the cracks ... I think I got
  distracted by collation bugs :-( ]

 Noah Misch n...@leadboat.com writes:
 On Sat, Mar 12, 2011 at 12:44:29PM -0500, Tom Lane wrote:
 Noah Misch n...@leadboat.com writes:
 A suitably-instrumented run of make installcheck-world under valgrind 
 turned
 up a handful of memory-related bugs:

 Hmm, interesting work, but I don't think I believe in the necessity for
 this kluge:

 +     else if (attributeName != (att-attname))
 +             namestrcpy((att-attname), attributeName);

 I'm still of the opinion that there's no real need to avoid memcpy with
 identical source and destination, so I didn't apply this first patch.

I am leery of memcpy with overlapping regions.  I know that it can
cause an infinite loop on ssse3 architectures, having to do with some
backwards-iteration it does:

https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/609290

I have spotted this in the wild in PostgreSQL (which is how I happened
to produce this bug report link so readily), yet very rarely; I mailed
a more detailed report to the security list.

-- 
fdr

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Peter Geoghegan
On 8 September 2011 15:43, Robert Haas robertmh...@gmail.com wrote:
 I wouldn't be too enthusiastic about
 starting a project like this in January, but now seems fine.  A bigger
 problem is that I don't hear anyone volunteering to do the work.

You seem to have a fairly strong opinion on the xlog.c code. It would
be useful to hear any preliminary thoughts that you might have on what
we'd end up with when this refactoring work is finished. If I'm not
mistaken, you think that it is a good candidate for being refactored
not so much because of its size, but for other reasons -  could you
please elaborate on those? In particular, I'd like to know what
boundaries it is envisaged that the code should be refactored along to
increase its conceptual integrity, or to better separate concerns. I
assume that that's the idea, since each new .c file would have to have
a discrete purpose.

On 7 September 2011 19:12, Tom Lane t...@sss.pgh.pa.us wrote:
 The pgrminclude-induced bug you just fixed shows a concrete way in which
 moving code from one file to another might silently break it, ie, it
 still compiles despite lack of definition of some symbol it's intended
 to see.

Of course, the big unknown here is the C preprocessor. There is
nothing in principle that stops a header file from slightly or utterly
altering the semantics of any file it is included in. That said, it
wouldn't surprise me if the nm-diff shell script I proposed (or a
slight variant intended to catch pgrminclude type bugs) caught many of
those problems in practice.

Attached is simple POC nm-diff.sh, intended to detect
pgrminclude-induced bugs (split c file variant may follow if there is
interest). I tested this on object files compiled before and after the
bug fix to catalog.h in this commit:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f81fb4f690355bc88fee69624103956fb4576fe5

or rather, I tested things after that commit with and without the
supposedly unneeded header; I removed catalog version differences,
applying Ockham's razor, so there was exactly one difference. That
didn't change anything; I just saw the same thing as before, which
was:

[peter@localhost commands]$ nm-diff.sh tablespace.o tablespace.old.o
Symbol table differences:
50c50
 05e7 r __func__.15690
---
 05ef r __func__.15690
WARNING: Symbol tables don't match!


I decided to add a bunch of obviously superfluous headers (a bunch of
xlog stuff) to the same file to verify that they *didn't* change
anything, figuring that it was very unlikely that adding these headers
could *really* change something. However, they did, but I don't think
that that proves that the script is fundamentally flawed (they didn't
*really* change anything):

[peter@localhost commands]$ nm-diff.sh tablespace.o tablespace.old.o
Symbol table differences:
41,50c41,50
 06f0 r __func__.15719 -- symbol name differs, offset does not
 06d0 r __func__.15730
 06be r __func__.15750
 06a0 r __func__.15759
 0680 r __func__.15771
 0660 r __func__.15793
 0640 r __func__.15803
 0620 r __func__.15825
 0600 r __func__.15886
 05e7 r __func__.15903  -- value/local offset the same as before
---
 06f0 r __func__.15506
 06d0 r __func__.15517
 06be r __func__.15537
 06a0 r __func__.15546
 0680 r __func__.15558
 0660 r __func__.15580
 0640 r __func__.15590
 0620 r __func__.15612
 0600 r __func__.15673
 05ef r __func__.15690  -- Also, value/local offset the same as 
 before (same inconsistency too)
WARNING: Symbol tables don't match!

My analysis is that the fact that the arbitrary symbol names assigned
within this read-only data section differ likely has no significance
(it looks like the compiler assigns them at an early optimisation
stage like Postgres assigns sequence values, before some are
eliminated at a later stage - I read ELF docs, but couldn't confirm
this, but maybe should have read GCC docs), so the next revision of
this script should simply ignore them using a regex if they look like
this; only the internal offsets/values matter, and they have been
observed to differ based on whether or not the header is included per
Bruce's revision (importantly, the difference in offset/values of the
last __func__ remains just the same regardless of whether the
superfluous xlog headers are included).

Does that seem reasonable? Is there interest in developing this idea further?

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


nm-diff.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] memory-related bugs

2011-09-08 Thread Tom Lane
Daniel Farina dan...@heroku.com writes:
 On Tue, Sep 6, 2011 at 12:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'm still of the opinion that there's no real need to avoid memcpy with
 identical source and destination, so I didn't apply this first patch.

 I am leery of memcpy with overlapping regions.  I know that it can
 cause an infinite loop on ssse3 architectures, having to do with some
 backwards-iteration it does:
 https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/609290

The linked page offers no support for your claim of an infinite loop,
and in any case it's discussing a case involving *overlapping* regions,
not *identical* regions.

The reason why this concern is irrelevant for identical regions is that
no matter what order the memcpy routine copies the bytes in, it's
necessarily storing the exact same data into each byte that was there
before.  The only way that strange results could be produced is if the
routine transiently set some byte to a value other than its final value,
which would mean that it must be intending to store to that location
more than once, which would be silly and inefficient.

So I'm not going to believe that there's a problem here without a lot
more rigorous evidence than anyone has offered.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Protecting against multiple instances per cluster

2011-09-08 Thread Magnus Hagander
On Thu, Sep 8, 2011 at 20:40, Thom Brown t...@linux.com wrote:
 Hi all,

 I've come across a PostgreSQL set up where there are 2 servers, each
 with the same version of PostgreSQL on, both mounting the same SAN
 onto their respective file systems.  It was intended that only 1 of
 the servers would be running an instance of PostgreSQL at a time as
 they both point to the same pgdata.  This was dubbed a high

snip

 Would there be a way to prevent this abhorrent scenario from coming
 into existence?  One idea is to have a configuration option to be
 strict about the presence of a pid file in the data directory, and
 force manual intervention, but I'm not sure this would solve the
 problem in most cases where this problem exists as someone would have
 had to specifically sought out the option and set it.  It might also
 encourage some to just delete the pid file thinking that would make
 the nasty errors go away.

There are plenty of clustering products out there that are really
designed for one thing pimarily, and that's dealing with this kind of
fencing. The proper solution is to use one of those. There's no way we
can do what's required from inside postgresql, and I see no reason why
we should re-invent generic clustering software. (for example, if you
do this, we can't prevent the two kernels from corrupting the
filesystem on the shared storage, which we rely on working..)

Such software is often marketed as an easy way to set up high
availability. It's easy to set up. It requires some actual expertise
to set up *right*. But once you've done that, it works well, and it
prevents this kind of situation to happen.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] concurrent snapshots

2011-09-08 Thread Ants Aasma
On Thu, Sep 8, 2011 at 6:46 PM, Robert Haas robertmh...@gmail.com wrote:
 I'm not convinced it's anywhere near that easy.  For one thing, on at
 least one big server I'm playing with, memory latency on shared memory
 is vastly higher (like 10x!) than on backend-local memory due to NUMA
 effects.

I wonder if both the shared mem and non-local memory issue can be
circumvented by using a slru like mechanism as a side channel to
publish taken snapshots and make concurrent xids available with a
sinval/hasmessages like per proc flag in shared memory to notify of
migrated snapshots.

I'll have to think through the space, locking and performance
considerations. That might take a small while though, I just managed
to contract the flu and can't really think straight.

Sorry to waste your time if this whole approach is completely untenable.
It seemed like a interesting topic to sink my teeth in, but in hindsight
seems a bit too much for a first try.

--
Ants Aasma

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Josh Berkus
Simon, Robert, Bruce, Tom,

  Say what?  When else would you have us do it?
 
  When else would you have us develop?
 
  In my eyes that sort of activity *is* development.  I find the
  distinction you are drawing entirely artificial, and more calculated to
  make sure refactoring never happens than to add any safety.  Any
  significant development change carries a risk of breakage.
 You clearly have the bit between your teeth on this.
 
 That doesn't make it worthwhile or sensible though.

The above really seems like a non-argument over trivial wording of
stuff.  Can we please give each other the benefit of the doubt and not
read objectionable content into offhand comments?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_last_xact_insert_timestamp

2011-09-08 Thread Chris Redekop
Thanks for all the feedback guys.  Just to throw another monkey wrench in
here - I've been playing with Simon's proposed solution of returning 0 when
the WAL positions match, and I've come to the realizatiion that even if
using pg_last_xact_insert_timestamp, although it would help, we still
wouldn't be able to get a 100% accurate how far behind? counternot
that this is a big deal, but I know my ops team is going to bitch to me
about it :).take this situation: there's a lull of 30 seconds where
there are no transactions committed on the serverthe slave is totally
caught up, WAL positions match, I'm reporting 0, everything is happy.  Then
a transaction is committed on the masterbefore the slave gets it my
query hits it and sees that we're 30 seconds behind (when in reality we're
1sec behind).Because of this affect my graph is a little spikey...I
mean it's not a huge deal or anything - I can put some sanity checking in my
number reporting (if 1 second ago you were 0 seconds behind, you can't be
more than 1 second behind now sorta thing).  But if we wanted to go for
super-ideal solution there would be a way to get the timestamp of
pg_stat_replication.replay_location+1 (the first transaction that the slave
does not have).


On Thu, Sep 8, 2011 at 7:03 AM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Sep 8, 2011 at 6:14 AM, Fujii Masao masao.fu...@gmail.com wrote:
  OTOH, new function enables users to monitor the delay as a timestamp.
  For users, a timestamp is obviously easier to handle than LSN, and the
 delay
  as a timestamp is more intuitive. So, I think that it's worth adding
  something like pg_last_xact_insert_timestamp into core for improvement
  of user-friendness.

 It seems very nice from a usability point of view, but I have to agree
 with Simon's concern about performance.  Actually, as of today,
 WALInsertLock is such a gigantic bottleneck that I suspect the
 overhead of this additional bookkeeping would be completely
 unnoticeable.  But I'm still reluctant to add more centralized
 spinlocks that everyone has to fight over, having recently put a lot
 of effort into getting rid of some of the ones we've traditionally
 had.

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company



[HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-08 Thread George Barnett
Hi Hackers,


postgresql-writeall.patch
Description: Binary data

I run a number of postgresql installations on NFS and on the whole I find this 
to be very reliable.  I have however run into a few issues when there is 
concurrent writes from multiple processes.

I see errors such as the following:

2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection authorized: 
user=postgres database=postgres
2011-07-31 22:13:35 ESTERROR:  could not write block 1 of relation 
global/2671: wrote only 4096 of 8192 bytes
2011-07-31 22:13:35 ESTHINT:  Check free disk space.
2011-07-31 22:13:35 ESTCONTEXT:  writing block 1 of relation global/2671
2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: 
host=[local]

I have also seen similar errors coming out of the WAL writer, however they 
occur at the level PANIC, which is a little more drastic.

After spending some time with debug logging turned on and even more time 
staring at strace, I believe this occurs when one process was writing to a data 
file and it received a SIGINT from another process, eg:
(These logs are from another similar run)

[pid  1804] ... fsync resumed )   = 0
[pid 10198] kill(1804, SIGINT unfinished ...
[pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
[pid  1804] write(3, 
c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0..., 32768 
unfinished ...
[pid 10198] ... kill resumed )= 0
[pid  1804] ... write resumed )   = 4096
[pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
[pid  1804] rt_sigreturn(0x2)   = 4096
[pid  1804] write(2, \0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999..., 260) = 260
[pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  unfinished ...
[pid  1802] ... select resumed )  = 1 (in [5], left {0, 999000})
[pid  1804] ... rt_sigprocmask resumed NULL, 8) = 0
[pid  1804] tgkill(1804, 1804, SIGABRT) = 0
[pid  1802] read(5,  unfinished ...
[pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
Process 1804 detached

After finding this, I came up with the following test case which easily 
replicated our issue:

#!/bin/bash

name=$1
number=1
while true; do 
  /usr/bin/psql -c CREATE USER \$name$number\ WITH NOSUPERUSER INHERIT 
NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';
  /usr/bin/createdb -E UNICODE -O $name$number $name$number
  if `grep -q PANIC /data/postgresql/data/pg_log/*`; then
exit
  fi
  let number=$number+1
done

When I run a single copy of this script, I have no issues, however when I start 
up a few more copies to simultaneously hit the DB, it crashes quiet quickly - 
usually within 20 or 30 seconds.

After looking through the code I found that when postgres calls write() it 
doesn't retry.  In order to address the issue with the PANIC in the WAL writer 
I set the sync method to o_sync which solved the issue in that part of the 
code, however I was still seeing failures in other areas of the code (such as 
the FileWrite function).  Following this, I spoke to an NFS guru who pointed 
out that writes under linux are not guaranteed to complete unless you open up 
O_SYNC or similar on the file handle.  I had a look in the libc docs and found 
this:

http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html


The write function writes up to size bytes from buffer to the file with 
descriptor filedes. The data in buffer is not necessarily a character string 
and a null character is output like any other character.

The return value is the number of bytes actually written. This may be size, but 
can always be smaller. Your program should always call write in a loop, 
iterating until all the data is written.


After finding this, I checked a number of other pieces of software that we see 
no issues with on NFS (such as the JVM) for their usage of write().  I 
confirmed they write in a while loop and set about patching the postgres source.

I have made this patch against 8.4.8 and confirmed that it fixes the issue we 
see on our systems.  I have also checked that make check still passes. 

As my C is terrible, I would welcome any comments on the implementation of this 
patch.

Best regards,

George





 
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to improve reliability of postgresql on linux nfs

2011-09-08 Thread Josh Berkus
George,

I'm quoting you here because in the version of your email which got
posted to the list your whole explanation got put below the patch text,
making it hard to find the justification for the patch.  Follows:

 I run a number of postgresql installations on NFS and on the whole I find 
 this to be very reliable.  I have however run into a few issues when there is 
 concurrent writes from multiple processes.
 
 I see errors such as the following:
 
 2011-07-31 22:13:35 EST postgres postgres [local] LOG:  connection 
 authorized: user=postgres database=postgres
 2011-07-31 22:13:35 ESTERROR:  could not write block 1 of relation 
 global/2671: wrote only 4096 of 8192 bytes
 2011-07-31 22:13:35 ESTHINT:  Check free disk space.
 2011-07-31 22:13:35 ESTCONTEXT:  writing block 1 of relation global/2671
 2011-07-31 22:13:35 EST [unknown] [unknown]  LOG:  connection received: 
 host=[local]
 
 I have also seen similar errors coming out of the WAL writer, however they 
 occur at the level PANIC, which is a little more drastic.
 
 After spending some time with debug logging turned on and even more time 
 staring at strace, I believe this occurs when one process was writing to a 
 data file and it received a SIGINT from another process, eg:
 (These logs are from another similar run)
 
 [pid  1804] ... fsync resumed )   = 0
 [pid 10198] kill(1804, SIGINT unfinished ...
 [pid  1804] lseek(3, 4915200, SEEK_SET) = 4915200
 [pid  1804] write(3, 
 c\320\1\0\1\0\0\0\0\0\0\0\0\0K\2\6\1\0\0\0\0\373B\0\0\0\0\2\0m\0..., 32768 
 unfinished ...
 [pid 10198] ... kill resumed )= 0
 [pid  1804] ... write resumed )   = 4096
 [pid  1804] --- SIGINT (Interrupt) @ 0 (0) ---
 [pid  1804] rt_sigreturn(0x2)   = 4096
 [pid  1804] write(2, \0\0\373\0\f\7\0\0t2011-08-30 20:29:52.999..., 260) = 
 260
 [pid  1804] rt_sigprocmask(SIG_UNBLOCK, [ABRT],  unfinished ...
 [pid  1802] ... select resumed )  = 1 (in [5], left {0, 999000})
 [pid  1804] ... rt_sigprocmask resumed NULL, 8) = 0
 [pid  1804] tgkill(1804, 1804, SIGABRT) = 0
 [pid  1802] read(5,  unfinished ...
 [pid  1804] --- SIGABRT (Aborted) @ 0 (0) ---
 Process 1804 detached
 
 After finding this, I came up with the following test case which easily 
 replicated our issue:
 
 #!/bin/bash
 
 name=$1
 number=1
 while true; do 
   /usr/bin/psql -c CREATE USER \$name$number\ WITH NOSUPERUSER INHERIT 
 NOCREATEROLE NOCREATEDB LOGIN PASSWORD 'pass';
   /usr/bin/createdb -E UNICODE -O $name$number $name$number
   if `grep -q PANIC /data/postgresql/data/pg_log/*`; then
 exit
   fi
   let number=$number+1
 done
 
 When I run a single copy of this script, I have no issues, however when I 
 start up a few more copies to simultaneously hit the DB, it crashes quiet 
 quickly - usually within 20 or 30 seconds.
 
 After looking through the code I found that when postgres calls write() it 
 doesn't retry.  In order to address the issue with the PANIC in the WAL 
 writer I set the sync method to o_sync which solved the issue in that part of 
 the code, however I was still seeing failures in other areas of the code 
 (such as the FileWrite function).  Following this, I spoke to an NFS guru who 
 pointed out that writes under linux are not guaranteed to complete unless you 
 open up O_SYNC or similar on the file handle.  I had a look in the libc docs 
 and found this:
 
 http://www.gnu.org/s/libc/manual/html_node/I_002fO-Primitives.html
 
 
 The write function writes up to size bytes from buffer to the file with 
 descriptor filedes. The data in buffer is not necessarily a character string 
 and a null character is output like any other character.
 
 The return value is the number of bytes actually written. This may be size, 
 but can always be smaller. Your program should always call write in a loop, 
 iterating until all the data is written.
 
 
 After finding this, I checked a number of other pieces of software that we 
 see no issues with on NFS (such as the JVM) for their usage of write().  I 
 confirmed they write in a while loop and set about patching the postgres 
 source.
 
 I have made this patch against 8.4.8 and confirmed that it fixes the issue we 
 see on our systems.  I have also checked that make check still passes. 
 
 As my C is terrible, I would welcome any comments on the implementation of 
 this patch.
 
 Best regards,
 
 George


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Robert Haas
On Thu, Sep 8, 2011 at 4:45 PM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 On 8 September 2011 15:43, Robert Haas robertmh...@gmail.com wrote:
 I wouldn't be too enthusiastic about
 starting a project like this in January, but now seems fine.  A bigger
 problem is that I don't hear anyone volunteering to do the work.

 You seem to have a fairly strong opinion on the xlog.c code. It would
 be useful to hear any preliminary thoughts that you might have on what
 we'd end up with when this refactoring work is finished. If I'm not
 mistaken, you think that it is a good candidate for being refactored
 not so much because of its size, but for other reasons -  could you
 please elaborate on those? In particular, I'd like to know what
 boundaries it is envisaged that the code should be refactored along to
 increase its conceptual integrity, or to better separate concerns. I
 assume that that's the idea, since each new .c file would have to have
 a discrete purpose.

I'm not sure how strong my opinions are, but one thing that I think
could be improved is that StartupXLOG() is really, really long, and it
does a lot of different stuff:

- Read the control file and sanity check it.
- Read the backup label if there is one or otherwise inspect the
control file's checkpoint information.
- Set up for REDO (including Hot Standby) if required.
- Main REDO loop.
- Check whether we reached consistency and/or whether we need a new TLI.
- Prepare to write WAL.
- Post-recovery cleanup.
- Initialize for normal running.

It seems to me that we could improve the readability of that function
by separating out some of the larger chunks of functionality into
their own static functions.  That wouldn't make xlog.c any shorter,
but it would make that function easier to understand.

Another gripe I have is that recoveryStopsHere() has non-obvious side
effects.  Not sure what to do about that, exactly, but I found it
extremely surprising when first picking my way through this code.

One pretty easy thing to pull out of this file wold be all the
user-callable functions.  pg_xlog_recovery_replay_{pause,resume},
pg_is_xlog_recovery_paused, pg_last_xact_replay_timestamp,
pg_is_in_recovery, pg_{start,stop}_backup(), pg_switch_xlog(),
pg_create_restore_point(), pg_current_xlog_{insert_,}location,
pg_last_xlog_{receive,replay}_location(), pg_xlogfile_name_offset(),
pg_xlogfile_name().

Another chunk that seems like it would be pretty simple to separate
out is the checkpoint-related stuff: LogCheckpointStart(),
LogCheckpointEnd(), CreateCheckPoint(), CheckPointGuts(),
RecoveryRestartPoint(), CreateRestartPoint(), KeepLogSeg().  Not a lot
of code, maybe, but it seems clearly distinguishable from what the
rest of the file is about.

I'm sure there are other good ways to do it, too...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_last_xact_insert_timestamp

2011-09-08 Thread Fujii Masao
On Thu, Sep 8, 2011 at 10:03 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Sep 8, 2011 at 6:14 AM, Fujii Masao masao.fu...@gmail.com wrote:
 OTOH, new function enables users to monitor the delay as a timestamp.
 For users, a timestamp is obviously easier to handle than LSN, and the delay
 as a timestamp is more intuitive. So, I think that it's worth adding
 something like pg_last_xact_insert_timestamp into core for improvement
 of user-friendness.

 It seems very nice from a usability point of view, but I have to agree
 with Simon's concern about performance.  Actually, as of today,
 WALInsertLock is such a gigantic bottleneck that I suspect the
 overhead of this additional bookkeeping would be completely
 unnoticeable.

The patch I've posted adds one acquisition and release of spinlock per
a commit or abort. But it's done outside of WALInsertLock. So I don't
think that the patch degrades a performance.

 But I'm still reluctant to add more centralized
 spinlocks that everyone has to fight over, having recently put a lot
 of effort into getting rid of some of the ones we've traditionally
 had.

You are against adding new acquisition and release of spinlock itself
even if it has nothing to do with WALInsertLock? The patch uses
XLogCtl-info_lck spinlock to save the last insert timestamp in the
shmem. XLogCtl-info_lck already protects many shmem variables
related to XLOG. So using XLogCtl-info_lck additionally might make
its contention heavier and degrade a performance. If the patch
defines new spinlock and uses it to save the timestamp to avoid
such a contention, you feel satisfied with the patch?

Another idea to avoid spinlock contention is save the timestamp in
PgBackendStatus (which contains information for pg_stat_activity).
This enables us to write and read the timestamp without spinlock.
Comments?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] force_not_null option support for file_fdw

2011-09-08 Thread Shigeru Hanada
Thanks for the review, Kaigai-san.

(2011/09/09 0:47), Kohei Kaigai wrote:
 I found one other point to be fixed:
 On get_force_not_null(), it makes a list of attribute names with 
 force_not_null option.
 
 +   foreach (cell, options)
 +   {
 +   DefElem*def = (DefElem *) lfirst(cell);
 +   const char *value = defGetString(def);
 +
 +   if (strcmp(def-defname, force_not_null) == 0
 +   strcmp(value, true) == 0)
 +   {
 +   columns = lappend(columns, 
 makeString(NameStr(attr-attname)));
 +   elog(DEBUG1, %s: force_not_null, NameStr(attr-attname));
 +   }
 +
 +   }
 
 makeString() does not copy the supplied string itself, so it is not 
 preferable to reference
 NameStr(attr-attname) across ReleaseSysCache().
 I'd like to suggest to supply a copied attname using pstrdup for makeString

Oops, fixed.
[ I should check some of my projects for this issue... ]

Attached patch also includes some cosmetic changes such as removing
useless blank lines.

Regards,
-- 
Shigeru Hanada
diff --git a/contrib/file_fdw/data/text.csv b/contrib/file_fdw/data/text.csv
index ...09827e7 .
*** a/contrib/file_fdw/data/text.csv
--- b/contrib/file_fdw/data/text.csv
***
*** 0 
--- 1,4 
+ AAA,AAA
+ XYZ,XYZ
+ NULL,NULL
+ ABC,ABC
diff --git a/contrib/file_fdw/file_fdw.c b/contrib/file_fdw/file_fdw.c
index 224e74f..4174919 100644
*** a/contrib/file_fdw/file_fdw.c
--- b/contrib/file_fdw/file_fdw.c
***
*** 23,30 
--- 23,32 
  #include foreign/fdwapi.h
  #include foreign/foreign.h
  #include miscadmin.h
+ #include nodes/makefuncs.h
  #include optimizer/cost.h
  #include utils/rel.h
+ #include utils/syscache.h
  
  PG_MODULE_MAGIC;
  
*** static struct FileFdwOption valid_option
*** 57,73 
{escape, ForeignTableRelationId},
{null, ForeignTableRelationId},
{encoding, ForeignTableRelationId},
  
/*
 * force_quote is not supported by file_fdw because it's for COPY TO.
 */
  
-   /*
-* force_not_null is not supported by file_fdw.  It would need a parser
-* for list of columns, not to mention a way to check the column list
-* against the table.
-*/
- 
/* Sentinel */
{NULL, InvalidOid}
  };
--- 59,70 
{escape, ForeignTableRelationId},
{null, ForeignTableRelationId},
{encoding, ForeignTableRelationId},
+   {force_not_null, AttributeRelationId},/* specified as boolean 
value */
  
/*
 * force_quote is not supported by file_fdw because it's for COPY TO.
 */
  
/* Sentinel */
{NULL, InvalidOid}
  };
*** static void fileGetOptions(Oid foreignta
*** 112,118 
  static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
   const char *filename,
   Cost *startup_cost, Cost *total_cost);
! 
  
  /*
   * Foreign-data wrapper handler function: return a struct with pointers
--- 109,115 
  static void estimate_costs(PlannerInfo *root, RelOptInfo *baserel,
   const char *filename,
   Cost *startup_cost, Cost *total_cost);
! static List * get_force_not_null(Oid relid);
  
  /*
   * Foreign-data wrapper handler function: return a struct with pointers
*** file_fdw_validator(PG_FUNCTION_ARGS)
*** 145,150 
--- 142,148 
List   *options_list = untransformRelOptions(PG_GETARG_DATUM(0));
Oid catalog = PG_GETARG_OID(1);
char   *filename = NULL;
+   char   *force_not_null = NULL;
List   *other_options = NIL;
ListCell   *cell;
  
*** file_fdw_validator(PG_FUNCTION_ARGS)
*** 198,204 
 buf.data)));
}
  
!   /* Separate out filename, since ProcessCopyOptions won't allow 
it */
if (strcmp(def-defname, filename) == 0)
{
if (filename)
--- 196,205 
 buf.data)));
}
  
!   /*
!* Separate out filename and force_not_null, since 
ProcessCopyOptions
!* won't allow them.
!*/
if (strcmp(def-defname, filename) == 0)
{
if (filename)
*** file_fdw_validator(PG_FUNCTION_ARGS)
*** 207,212 
--- 208,227 
 errmsg(conflicting or 
redundant options)));
filename = defGetString(def);
}
+   else if (strcmp(def-defname, force_not_null) == 0)
+   {
+   if (force_not_null)
+   ereport(ERROR,
+   

Re: [HACKERS] pg_dump.c

2011-09-08 Thread Pavel Golub
Hello, Andrew.

You wrote:


AD In the refactoring Large C files discussion one of the biggest files
AD Bruce mentioned is pg_dump.c. There has been discussion in the past of
AD turning lots of the knowledge currently embedded in this file into a 
AD library, which would make it available to other clients (e.g. psql).

+1
It would be great to have library with such functionality!

AD I'm
AD not sure what a reasonable API for that would look like, though. Does 
AD anyone have any ideas?




AD cheers

AD andrew




-- 
With best wishes,
 Pavel  mailto:pa...@gf.microolap.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Large C files

2011-09-08 Thread Heikki Linnakangas

On 08.09.2011 23:45, Peter Geoghegan wrote:

On 8 September 2011 15:43, Robert Haasrobertmh...@gmail.com  wrote:

I wouldn't be too enthusiastic about
starting a project like this in January, but now seems fine.  A bigger
problem is that I don't hear anyone volunteering to do the work.


You seem to have a fairly strong opinion on the xlog.c code. It would
be useful to hear any preliminary thoughts that you might have on what
we'd end up with when this refactoring work is finished. If I'm not
mistaken, you think that it is a good candidate for being refactored
not so much because of its size, but for other reasons -  could you
please elaborate on those? In particular, I'd like to know what
boundaries it is envisaged that the code should be refactored along to
increase its conceptual integrity, or to better separate concerns. I
assume that that's the idea, since each new .c file would have to have
a discrete purpose.


I'd like to see it split into routines involved in writing WAL, and 
those involved in recovery. And maybe a third file for archiving-related 
stuff.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers