Re: [HACKERS] On-the-fly index tuple deletion vs. hot_standby

2010-12-09 Thread Simon Riggs
On Fri, 2010-12-03 at 21:43 +0200, Heikki Linnakangas wrote:
 On 29.11.2010 08:10, Noah Misch wrote:
  I have a hot_standby system and use it to bear the load of various reporting
  queries that take 15-60 minutes each.  In an effort to avoid long pauses in
  recovery, I set a vacuum_defer_cleanup_age constituting roughly three hours 
  of
  the master's transactions.  Even so, I kept seeing recovery pause for the
  duration of a long-running query.  In each case, the culprit record was an
  XLOG_BTREE_DELETE arising from on-the-fly deletion of an index tuple.  The
  attached test script demonstrates the behavior (on HEAD); the index tuple
  reclamation conflicts with a concurrent SELECT pg_sleep(600) on the 
  standby.
 
  Since this inserting transaction aborts, HeapTupleSatisfiesVacuum reports
  HEAPTUPLE_DEAD independent of vacuum_defer_cleanup_age.  We go ahead and 
  remove
  the index tuples.  On the standby, btree_xlog_delete_get_latestRemovedXid 
  does
  not regard the inserting-transaction outcome, so btree_redo proceeds to 
  conflict
  with snapshots having visibility over that transaction.  Could we correctly
  improve this by teaching btree_xlog_delete_get_latestRemovedXid to ignore 
  tuples
  of aborted transactions and tuples inserted and deleted within one 
  transaction?

@Noah Easily the best bug report submitted in a long time. Thanks.

 Seems reasonable. HeapTupleHeaderAdvanceLatestRemovedXid() will need 
 similar treatment. Actually, btree_xlog_delete_get_latestRemovedXid() 
 could just call HeapTupleHeaderAdvanceLatestRemovedXid().

Yes, it applies to other cases also. Thanks for the suggestion.
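
For anyone double-checking, the shape of the change is roughly this (a
sketch of the idea only, not the committed diff):

/*
 * A tuple whose inserting transaction never committed, or that was
 * deleted by its own inserting transaction, was never visible to any
 * snapshot, so its removal must not advance latestRemovedXid.
 */
TransactionId xmin = HeapTupleHeaderGetXmin(tuple);
TransactionId xmax = HeapTupleHeaderGetXmax(tuple);

if (TransactionIdDidCommit(xmin) && xmax != xmin &&
	TransactionIdFollows(xmax, *latestRemovedXid))
	*latestRemovedXid = xmax;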

Fix committed. Please double-check my work, committed early since I'm
about to jump on a plane.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] To Signal The postmaster

2010-12-09 Thread Fujii Masao
On Wed, Dec 8, 2010 at 6:22 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Dec 8, 2010 at 4:59 PM, Heikki Linnakangas
 heikki.linnakan...@enterprisedb.com wrote:
 For 9.1, we should think of a better way to do this, perhaps using SIGUSR1
 to wake up. Maybe we won't even need the trigger file anymore.

 If we use SIGUSR1, we seem to need a mechanism that lets users specify the
 event type. For example, should we make the SIGUSR1 handler check not only
 the shmem (i.e., PMSignalStat) but also a file?

What I'm thinking is to make something like "pg_ctl promote" create an event
file specifying the standby promotion in $PGDATA/pg_event or elsewhere, and
send SIGUSR1 to the postmaster. Then, when SIGUSR1 arrives, the postmaster
checks whether that event file exists. If it does, it removes the file and
signals the startup process to perform standby promotion.
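
A rough sketch of the postmaster side of that flow (the file name, handler
shape, and helper calls here are hypothetical, just to illustrate the
proposal):

/*
 * Hypothetical sketch, not actual code: on SIGUSR1 the postmaster looks
 * for an event file dropped by "pg_ctl promote" and, if present,
 * consumes it and signals the startup process.
 */
static void
sigusr1_handler(SIGNAL_ARGS)
{
	char		path[MAXPGPATH];

	snprintf(path, sizeof(path), "%s/pg_event/promote", DataDir);
	if (access(path, F_OK) == 0)
	{
		unlink(path);				/* consume the promotion event */
		kill(StartupPID, SIGUSR2);	/* ask the startup process to promote */
	}
}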

Thoughts?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hot Standby tuning for btree_xlog_vacuum()

2010-12-09 Thread Simon Riggs
Just wanted to say thanks for the review, since I haven't yet managed to
fix and commit this. I expect to later this month.

On Mon, 2010-09-27 at 23:06 -0400, Robert Haas wrote:
 On Thu, Apr 29, 2010 at 4:12 PM, Simon Riggs si...@2ndquadrant.com wrote:
  Simple tuning of btree_xlog_vacuum() using an idea I had a while back,
  just never implemented. XXX comments removed.
 
  Allows us to avoid reading in blocks during VACUUM replay that are only
  required for correctness of index scans.
 
 Review:
 
 1. The block comment in XLogConfirmBufferIsUnpinned appears to be
 copied from somewhere else, and doesn't really seem appropriate for a
 new function since it refers to the original coding of this routine.
  I think you could just delete the parenthesized portion of the
 comment.
 
 2. This bit from ConfirmBufferIsUnpinned looks odd to me.
 
 + /*
 +  * Found it.  Now, pin/unpin the buffer to prove it's unpinned.
 +  */
 + if (PinBuffer(buf, NULL))
 + UnpinBuffer(buf, false);
 
 I don't think pinning and unpinning the buffer is sufficient to
 prove that it isn't otherwise pinned.  If the buffer isn't in shared
 buffers at all, it seems clear that no one can have it pinned.  But if
 it's present in shared buffers, it seems like you still need
 LockBufferForCleanup().
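
For reference, the stronger check being described would look roughly like
this (a sketch, assuming the buffer was found in shared buffers):

Buffer		buffer = ReadBuffer(rel, blkno);	/* take our own pin */

/* LockBufferForCleanup blocks until ours is the only pin, which is
 * what actually demonstrates that no one else has the page pinned. */
LockBufferForCleanup(buffer);
UnlockReleaseBuffer(buffer);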

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow BLOBs restoring

2010-12-09 Thread Robert Haas
On Thu, Dec 9, 2010 at 12:28 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Vlad Arkhipov arhi...@dc.baikal.ru writes:
 08.12.2010 22:46, Tom Lane writes:
 Are you by any chance restoring from an 8.3 or older pg_dump file made
 on Windows?  If so, it's a known issue.

 No, I tried Linux only.

 OK, then it's not the missing-data-offsets issue.

 I think you can reproduce it. First I created a database full of many
 BLOBs on Postgres 8.4.5. Then I created a dump:

 Oh, you should have said how many was "many".  I had tried with several
 thousand large blobs yesterday and didn't see any problem.  However,
 with several hundred thousand small blobs, indeed it gets pretty slow
 as soon as you use -j.

 oprofile shows all the time is going into reduce_dependencies during the
 first loop in restore_toc_entries_parallel (ie, before we've actually
 started doing anything in parallel).  The reason is that for each blob,
 we're iterating through all of the several hundred thousand TOC entries,
 uselessly looking for anything that depends on the blob.  And to add
 insult to injury, because the blobs are all marked as SECTION_PRE_DATA,
 we don't get to parallelize at all.  I think we won't get to parallelize
 the blob data restoration either, since all the blob data is hidden in a
 single TOC entry :-(

 So the short answer is don't bother to use -j in a mostly-blobs restore,
 because it isn't going to help you in 9.0.

 One fairly simple, if ugly, thing we could do about this is skip calling
 reduce_dependencies during the first loop if the TOC object is a blob;
 effectively assuming that nothing could depend on a blob.  But that does
 nothing about the point that we're failing to parallelize blob
 restoration.  Right offhand it seems hard to do much about that without
 some changes to the archive representation of blobs.  Some things that
 might be worth looking at for 9.1:

 * Add a flag to TOC objects saying "this object has no dependencies",
 to provide a generalized and principled way to skip the
 reduce_dependencies loop.  This is only a good idea if pg_dump knows
 that or can cheaply determine it at dump time, but I think it can.

 * Mark BLOB TOC entries as SECTION_DATA, or somehow otherwise make them
 parallelizable.  Also break the BLOBS data item apart into an item per
 BLOB, so that that part's parallelizable.  Maybe we should combine the
 metadata and data for each blob into one TOC item --- if we don't, it
 seems like we need a dependency, which will put us back behind the
 eight-ball.  I think the reason it's like this is we didn't originally
 have a separate TOC item per blob; but now that we added that to support
 per-blob ACL data, the monolithic BLOBS item seems pretty pointless.
 (Another thing that would have to be looked at here is the dependency
 between a BLOB and any BLOB COMMENT for it.)

 Thoughts?

Is there any use case for restoring a BLOB but not the BLOB COMMENT or
BLOB ACLs?  Can we just smush everything together into one section?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PS display and standby query conflict

2010-12-09 Thread Fujii Masao
Hi,

When I created a conflict between recovery and many read-only
transactions in the standby server for test purposes, I found that the
keyword "waiting" disappeared from the PS display for just a moment
even though the conflict had not been resolved yet. This seems
strange to me.

This problem happens because ResolveRecoveryConflictWithVirtualXIDs
resets the PS display for each read-only transaction that recovery
waits for. Why do we need to reset it each time even though
the conflict has not been resolved yet? The attached patch
suppresses such needless resets. Comments?
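
The change boils down to something like this (a sketch with a hypothetical
flag variable; the real change is in the attached patch):

/* Append " waiting" to the PS display once, on the first transaction
 * we wait for, instead of resetting it for every conflicting vxid. */
if (!waiting_shown)
{
	int			len;
	const char *old_status = get_ps_display(&len);
	char	   *new_status = palloc(len + 9);	/* " waiting" + NUL */

	memcpy(new_status, old_status, len);
	strcpy(new_status + len, " waiting");
	set_ps_display(new_status, false);
	waiting_shown = true;
}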

BTW, ResolveRecoveryConflictWithVirtualXIDs calls
pgstat_report_waiting(), which is also needless since the startup
process doesn't have the shared memory entry (i.e., MyBEEntry)
for pg_stat_activity. The attached patch removes that call.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


ps_display_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Jie Li
Hi all,

I'm new to window functions. Recently I ran some simple queries and was
surprised to find percent_rank so much slower than rank; could anybody
tell me why?

The table schema:
test=# \d inventory1
      Table "public.inventory1"
        Column        |  Type   | Modifiers
----------------------+---------+-----------
 inv_date_sk          | integer | not null
 inv_item_sk          | integer | not null
 inv_warehouse_sk     | integer | not null
 inv_quantity_on_hand | integer |

test=# \dt+ inventory1
                     List of relations
 Schema |    Name    | Type  |  Owner   |  Size   | Description
--------+------------+-------+----------+---------+-------------
 public | inventory1 | table | workshop | 8880 kB |

The rank query result:
test=# explain analyze select inv_date_sk,inv_item_sk, rank()over(partition
by inv_date_sk order by inv_item_sk) from inventory1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=631.947..1361.158 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=631.924..771.990 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.055..198.948 rows=189000 loops=1)
 Total runtime: 1500.193 ms
(6 rows)

The percent_rank result:
test=# explain analyze select inv_date_sk,inv_item_sk,
percent_rank()over(partition by inv_date_sk order by inv_item_sk) from
inventory1;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 WindowAgg  (cost=19563.99..23343.99 rows=189000 width=8) (actual time=766.432..32924.804 rows=189000 loops=1)
   ->  Sort  (cost=19563.99..20036.49 rows=189000 width=8) (actual time=756.320..905.407 rows=189000 loops=1)
         Sort Key: inv_date_sk, inv_item_sk
         Sort Method:  quicksort  Memory: 12218kB
         ->  Seq Scan on inventory1  (cost=0.00..3000.00 rows=189000 width=8) (actual time=0.102..224.607 rows=189000 loops=1)
 Total runtime: 33152.188 ms
(6 rows)

One special thing is that all the values of the partition key(inv_date_sk)
are the same, that is, there is only one window partition. I find that
percent_rank needs to buffer all the tuples to get the total number of rows.
But why is it so expensive?

I use 8.4.4, and I only increased work_mem to 100MB, leaving other
parameters untouched.

Thanks,
Li Jie


Re: [HACKERS] [PERFORM] Slow BLOBs restoring

2010-12-09 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Dec 9, 2010 at 12:28 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 * Mark BLOB TOC entries as SECTION_DATA, or somehow otherwise make them
 parallelizable.  Also break the BLOBS data item apart into an item per
 BLOB, so that that part's parallelizable.  Maybe we should combine the
 metadata and data for each blob into one TOC item --- if we don't, it
 seems like we need a dependency, which will put us back behind the
 eight-ball.  I think the reason it's like this is we didn't originally
 have a separate TOC item per blob; but now that we added that to support
 per-blob ACL data, the monolithic BLOBS item seems pretty pointless.
 (Another thing that would have to be looked at here is the dependency
 between a BLOB and any BLOB COMMENT for it.)

 Is there any use case for restoring a BLOB but not the BLOB COMMENT or
 BLOB ACLs?  Can we just smush everything together into one section?

The ACLs are already part of the main TOC entry for the blob.  As for
comments, I'd want to keep the handling of those the same as they are
for every other kind of object.  But that just begs the question of why
comments are separate TOC entries in the first place.  We could
eliminate this problem if they became fields of object entries across
the board.  Which would be a non-backwards-compatible change in dump
file format, but doing anything about the other issues mentioned above
will require that anyway.

I'm not certain however about whether it's safe to treat the
object-metadata aspects of a blob as SECTION_DATA rather than
SECTION_PRE_DATA.  That will take a bit of investigation.  It seems like
there shouldn't be any fundamental reason for it not to work, but that
doesn't mean there's not any weird assumptions buried someplace in
pg_dump ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow BLOBs restoring

2010-12-09 Thread Tom Lane
I wrote:
 One fairly simple, if ugly, thing we could do about this is skip calling
 reduce_dependencies during the first loop if the TOC object is a blob;
 effectively assuming that nothing could depend on a blob.  But that does
 nothing about the point that we're failing to parallelize blob
 restoration.  Right offhand it seems hard to do much about that without
 some changes to the archive representation of blobs.  Some things that
 might be worth looking at for 9.1:

 * Add a flag to TOC objects saying "this object has no dependencies",
 to provide a generalized and principled way to skip the
 reduce_dependencies loop.  This is only a good idea if pg_dump knows
 that or can cheaply determine it at dump time, but I think it can.

I had further ideas about this part of the problem.  First, there's no
need for a file format change to fix this: parallel restore is already
groveling over all the dependencies in its fix_dependencies step, so it
could count them for itself easily enough.  Second, the real problem
here is that reduce_dependencies processing is O(N^2) in the number of
TOC objects.  Skipping it for blobs, or even for all dependency-free
objects, doesn't make that very much better: the kind of people who
really need parallel restore are still likely to bump into unreasonable
processing time.  I think what we need to do is make fix_dependencies
build a reverse lookup list of all the objects dependent on each TOC
object, so that the searching behavior in reduce_dependencies can be
eliminated outright.  That will take O(N) time and O(N) extra space,
which is a good tradeoff because you won't care if N is small, while if
N is large you have got to have it anyway.
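
For illustration, the reverse-lookup structure amounts to something like
this (sketched field names; push_ready is a hypothetical stand-in for
marking an entry ready to restore):

typedef struct _tocEntry
{
	/* ... existing TocEntry fields ... */
	struct _tocEntry **revDeps;	/* entries that depend on this one */
	int			nRevDeps;
	int			depCount;		/* unsatisfied dependencies remaining */
} TocEntry;

/* when entry "te" finishes restoring: */
int			i;

for (i = 0; i < te->nRevDeps; i++)
{
	TocEntry   *otherte = te->revDeps[i];

	if (--otherte->depCount == 0)
		push_ready(otherte);	/* hypothetical: queue it for a worker */
}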

Barring objections, I will do this and back-patch into 9.0.  There is
maybe some case for trying to fix 8.4 as well, but since 8.4 didn't
make a separate TOC entry for each blob, it isn't as exposed to the
problem.  We didn't back-patch the last round of efficiency hacks in
this area, so I'm thinking it's not necessary here either.  Comments?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Solving sudoku using SQL

2010-12-09 Thread Alvaro Herrera
Excerpts from Jan Urbański's message of Wed Dec 08 17:11:44 -0300 2010:

 I'm pleasantly surprised that with the SA code as it stands today, setting
 the equilibrium factor to 8 and the temperature reduction factor to 0.4, the
 query takes 1799.662 ms in total.

That's 5x better than Oracle :-)

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow BLOBs restoring

2010-12-09 Thread Robert Haas
On Thu, Dec 9, 2010 at 10:05 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 I wrote:
 One fairly simple, if ugly, thing we could do about this is skip calling
 reduce_dependencies during the first loop if the TOC object is a blob;
 effectively assuming that nothing could depend on a blob.  But that does
 nothing about the point that we're failing to parallelize blob
 restoration.  Right offhand it seems hard to do much about that without
 some changes to the archive representation of blobs.  Some things that
 might be worth looking at for 9.1:

 * Add a flag to TOC objects saying "this object has no dependencies",
 to provide a generalized and principled way to skip the
 reduce_dependencies loop.  This is only a good idea if pg_dump knows
 that or can cheaply determine it at dump time, but I think it can.

 I had further ideas about this part of the problem.  First, there's no
 need for a file format change to fix this: parallel restore is already
 groveling over all the dependencies in its fix_dependencies step, so it
 could count them for itself easily enough.  Second, the real problem
 here is that reduce_dependencies processing is O(N^2) in the number of
 TOC objects.  Skipping it for blobs, or even for all dependency-free
 objects, doesn't make that very much better: the kind of people who
 really need parallel restore are still likely to bump into unreasonable
 processing time.  I think what we need to do is make fix_dependencies
 build a reverse lookup list of all the objects dependent on each TOC
 object, so that the searching behavior in reduce_dependencies can be
 eliminated outright.  That will take O(N) time and O(N) extra space,
 which is a good tradeoff because you won't care if N is small, while if
 N is large you have got to have it anyway.

 Barring objections, I will do this and back-patch into 9.0.  There is
 maybe some case for trying to fix 8.4 as well, but since 8.4 didn't
 make a separate TOC entry for each blob, it isn't as exposed to the
 problem.  We didn't back-patch the last round of efficiency hacks in
 this area, so I'm thinking it's not necessary here either.  Comments?

Ah, that sounds like a much cleaner solution.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow BLOBs restoring

2010-12-09 Thread Andrew Dunstan



On 12/09/2010 10:05 AM, Tom Lane wrote:

I think what we need to do is make fix_dependencies
build a reverse lookup list of all the objects dependent on each TOC
object, so that the searching behavior in reduce_dependencies can be
eliminated outright.  That will take O(N) time and O(N) extra space,
which is a good tradeoff because you won't care if N is small, while if
N is large you have got to have it anyway.

Barring objections, I will do this and back-patch into 9.0.  There is
maybe some case for trying to fix 8.4 as well, but since 8.4 didn't
make a separate TOC entry for each blob, it isn't as exposed to the
problem.  We didn't back-patch the last round of efficiency hacks in
this area, so I'm thinking it's not necessary here either.  Comments?





Sounds good. Re 8.4: at a pinch people could probably use the 9.0 
pg_restore with their 8.4 dump.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Solving sudoku using SQL

2010-12-09 Thread Dimitri Fontaine
Merlin Moncure mmonc...@gmail.com writes:
 On Wed, Dec 8, 2010 at 8:57 AM, Tatsuo Ishii is...@postgresql.org wrote:

 Then execute the huge SELECT:
 http://codezine.jp/static/images/article/1629/html/sql.html

 benchmark what you've got against this (ported to postgres by marcin mank):
 http://www.pastie.org/684163

Is it this one?

http://archives.postgresql.org/message-id/e08cc0400911042333o5361b21cu2c9438f82b1e5...@mail.gmail.com

-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Solving sudoku using SQL

2010-12-09 Thread Merlin Moncure
On Thu, Dec 9, 2010 at 11:56 AM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 Merlin Moncure mmonc...@gmail.com writes:
 On Wed, Dec 8, 2010 at 8:57 AM, Tatsuo Ishii is...@postgresql.org wrote:

 Then execute the huge SELECT:
 http://codezine.jp/static/images/article/1629/html/sql.html

 benchmark what you've got against this (ported to postgres by marcin mank):
 http://www.pastie.org/684163

 It that this one ?

 http://archives.postgresql.org/message-id/e08cc0400911042333o5361b21cu2c9438f82b1e5...@mail.gmail.com

sure is -- I missed the formatted version.  The original query is also
an Oracle original.  If you remove the 'where ind = 0', you can watch
the database solve the puzzle, which is pretty neat.

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimize PL/Perl function argument passing [PATCH]

2010-12-09 Thread Tim Bunce
On Wed, Dec 08, 2010 at 09:21:05AM -0800, David E. Wheeler wrote:
 On Dec 8, 2010, at 9:14 AM, Tim Bunce wrote:
 
  Do you have any more improvements in the pipeline?
  
  I'd like to add $arrayref = decode_array_literal('{2,3}') and
  maybe $hashref = decode_hstore_literal('x=>1, y=>2').
  I don't know how much work would be involved in those though.
 
 Those would be handy, but for arrays, at least, I'd rather have a GUC
 to turn on so that arrays are passed to PL/perl functions as array
 references.

Understood. At this stage I don't know what the issues are so I'm
nervous of over promising (plus I don't know how much time I'll have).
It's possible a blessed ref with string overloading would avoid
backwards compatibility issues.

Tim.

  Another possible improvement would be rewriting encode_array_literal()
  in C, so returning arrays would be much faster.
 
 +1
 
 Best,
 
 David
 
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] initdb failure with Postgres 8.4.4

2010-12-09 Thread BRUSSER Michael
Initdb fails for me when the host machine has no access to the Postgres build 
location.

LOG:  could not open directory .../install/share/timezone: No such file or 
directory
LOG:  could not open directory .../install/share/timezone: No such file or 
directory
WARNING:  could not open directory .../install/share/timezonesets: No such 
file or directory
HINT:  This may indicate an incomplete PostgreSQL installation,
   or that the file .../bin.Linux/postgres has been moved away from its 
proper location.
FATAL:  invalid value for parameter timezone_abbreviations: Default
DEBUG:  shmem_exit(1): 0 callbacks to make
DEBUG:  proc_exit(1): 0 callbacks to make
DEBUG:  exit(1)

The postgres executable is in the proper place.
The truncated paths (.../install/share/timezone, etc.) point to the original
build location.

The timezone and other files are available in the local directory:
share % ls -1
conversion_create.sql
information_schema.sql
pg_hba.conf.sample
pg_ident.conf.sample
pg_service.conf.sample
postgres.bki
postgres.description
postgresql.conf.sample
postgres.shdescription
psqlrc.sample
recovery.conf.sample
snowball_create.sql
sql_features.txt
system_views.sql
timezone
timezonesets
tsearch_data

I call initdb with the -L option pointing to that directory; from the initdb
log I can see that it is indeed being used for certain things -

POSTGRES_BKI=.../share/postgres.bki
POSTGRES_DESCR=.../share/postgres.description
POSTGRES_SHDESCR=.../share/postgres.shdescription
POSTGRESQL_CONF_SAMPLE=.../share/postgresql.conf.sample
PG_HBA_SAMPLE=.../share/pg_hba.conf.sample
PG_IDENT_SAMPLE=.../share/pg_ident.conf.sample

I am wondering why the timezone files cannot be found there as well?
This is quite urgent and any help will be greatly appreciated.
Michael.






Re: XLog vs SSD [Was: Re: [HACKERS] random write in xlog?]

2010-12-09 Thread James Cloos
 JJ == Jeff Janes jeff.ja...@gmail.com writes:

JJ So PG always writing 8K at a time is unlikely to make a difference
JJ than if it wrote a smaller amount.

Ah.

Somehow I was thinking of the xlog files' 16MB file size rather than the
internal 8k block size.

If it is only writing 8k blocks then there is probably little hope of
making efficient use of flash.

-JimC
-- 
James Cloos cl...@jhcloos.com OpenPGP: 1024D/ED7DAEA6

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to add a primary key using an existing index

2010-12-09 Thread Gurjeet Singh
On Sun, Dec 5, 2010 at 2:09 PM, Peter Eisentraut pete...@gmx.net wrote:

 On fre, 2010-12-03 at 15:27 -0500, Robert Haas wrote:
  On Fri, Dec 3, 2010 at 2:56 PM, r t pg...@xzilla.net wrote:
   What exactly was the objection to the following --
   ALTER TABLE table_name ADD PRIMARY KEY (column_list) USING index_name;
   Is the objection that you might have been trying to specify a
 constraint
   named using ? I'm willing to make that option more difficult. :-)
 
  I think it's that someone might expect the word after USING to be the
  name of an index AM.

 That could be avoided by writing

 USING INDEX name


Allowing USING INDEX along with USING INDEX TABLESPACE is causing
shift/reduce conflicts.

I liked the proposal upthread of providing an alternate syntax where the user
does not have to specify the column list and the system picks it up from the
index.

ALTER TABLE table_name ADD [CONSTRAINT cons_name] PRIMARY KEY (column_list)
[WITH (...)] [USING INDEX TABLESPACE tblspcname];
ALTER TABLE table_name ADD [CONSTRAINT cons_name] PRIMARY KEY [WITH (...)]
[USING INDEX index_name];

This would also help avoid the bug that Itagaki found, where the user wants
to use an existing index, and also specifies USING INDEX TABLESPACE.

But I still hold a bias towards renaming the index to match the constraint
name (with a NOTICE), rather than requiring that the constraint name match
the index name, because the constraint name is optional, and when it is not
provided the system has to generate a name and we have to rename the index
anyway to maintain consistency.

Following are the gram.y changes that I am going to start with:

 %type <boolean> constraints_set_mode
-%type <str>	OptTableSpace OptConsTableSpace OptTableSpaceOwner
+%type <str>	OptTableSpace OptConsTableSpace OptConsIndex OptTableSpaceOwner
 %type <list>	opt_check_option

[...]
			| UNIQUE '(' columnList ')' opt_definition OptConsTableSpace
				ConstraintAttributeSpec
				{
					Constraint *n = makeNode(Constraint);
					n->contype = CONSTR_UNIQUE;
					n->location = @1;
					n->keys = $3;
					n->options = $5;
					n->indexspace = $6;
					n->deferrable = ($7 & 1) != 0;
					n->initdeferred = ($7 & 2) != 0;
					$$ = (Node *)n;
				}
+			| UNIQUE opt_definition OptConsIndex ConstraintAttributeSpec
+				{
+					Constraint *n = makeNode(Constraint);
+					n->contype = CONSTR_UNIQUE;
+					n->location = @1;
+					n->options = $2;
+					n->indexname = $3;
+					n->deferrable = ($4 & 1) != 0;
+					n->initdeferred = ($4 & 2) != 0;
+					$$ = (Node *)n;
+				}
			| PRIMARY KEY '(' columnList ')' opt_definition OptConsTableSpace
				ConstraintAttributeSpec
				{
					Constraint *n = makeNode(Constraint);
					n->contype = CONSTR_PRIMARY;
					n->location = @1;
					n->keys = $4;
					n->options = $6;
					n->indexspace = $7;
					n->deferrable = ($8 & 1) != 0;
					n->initdeferred = ($8 & 2) != 0;
					$$ = (Node *)n;
				}
+			| PRIMARY KEY opt_definition OptConsIndex ConstraintAttributeSpec
+				{
+					Constraint *n = makeNode(Constraint);
+					n->contype = CONSTR_PRIMARY;
+					n->location = @1;
+					n->options = $3;
+					n->indexname = $4;
+					n->deferrable = ($5 & 1) != 0;
+					n->initdeferred = ($5 & 2) != 0;
+					$$ = (Node *)n;
+				}
			| EXCLUDE access_method_clause '(' ExclusionConstraintList ')'

[...]
 OptConsTableSpace:	USING INDEX TABLESPACE name	{ $$ = $4; }
			| /*EMPTY*/				{ $$ = NULL; }
		;

+OptConsIndex:	USING INDEX name			{ $$ = $3; }
+			| /*EMPTY*/				{ $$ = NULL; }
+		;
+

Regards,
-- 
gurjeet.singh
@ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com

singh.gurj...@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device


Re: [HACKERS] Patch to add a primary key using an existing index

2010-12-09 Thread Tom Lane
Gurjeet Singh singh.gurj...@gmail.com writes:
 But I still hold a bias towards renaming the index to match constraint name
 (with a NOTICE), rather than require that the constraint name match the
 index name, because the constraint name is optional and when it is not
 provided system has to generate a name and we have to rename the index
 anyway to maintain consistency.

No.  If the constraint name is not specified, we should certainly use
the existing index name, not randomly rename it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to add a primary key using an existing index

2010-12-09 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 If the constraint name is not specified, we should certainly use
 the existing index name, not randomly rename it.
 
+1
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BufFreelistLock

2010-12-09 Thread Jim Nasby
On Dec 8, 2010, at 11:44 PM, Jeff Janes wrote:
 For the clock sweep algorithm, I think you could access
 nextVictimBuffer without any type of locking.
 
 This is wrong, mainly because you wouldn't have any security against two
 processes decrementing the usage count of the same buffer because they'd
 fetched the same value of nextVictimBuffer.  That would probably happen
 often enough to severely compromise the accuracy of the usage counts and
 thus the accuracy of the LRU eviction behavior.  See above.
 
 Ah, I hadn't considered that.

Ideally, the clock sweep would be run by the bgwriter and not by individual backends. 
In that case it shouldn't matter much what the performance of the sweep is. To 
do that I think we'd want the bgwriter to target there being X number of 
buffers on the free list instead of (or in addition to) targeting how many 
dirty buffers need to be written. This would mirror what operating systems do; 
they strive to keep X number of pages on the free list so that when a process 
needs memory it can get it quickly.
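
Roughly, the targeting loop I have in mind would look like this (all function
names hypothetical, just to illustrate the idea):

/* Keep roughly target_free buffers on the freelist so that backends
 * rarely have to run the clock sweep themselves. */
while (FreeListSize() < target_free)
{
	int			buf_id = ClockSweepAdvance();	/* find an evictable buffer */

	if (BufferIsDirty(buf_id))
		FlushOneBuffer(buf_id);					/* write it out first */
	AddToFreeList(buf_id);
}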
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb failure with Postgres 8.4.4

2010-12-09 Thread Tom Lane
BRUSSER Michael michael.brus...@3ds.com writes:
 Initdb fails for me when host machine has no access to the Postgres build 
 location.

 LOG:  could not open directory .../install/share/timezone: No such file or 
 directory

Moving the install tree works for me.  Did you do something odd with the
--with-system-tzdata configuration option?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+

2010-12-09 Thread Josh Berkus
On 12/6/10 6:13 PM, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
 OK, patch coming then.  Right now test_fsync aborts when O_DIRECT fails.
  What should I have it do instead?
 
 Report that it fails, and keep testing the other methods.

Patch attached.  Includes a fair amount of comment cleanup, since
existing comments did not meet our current project standards.  Tests all
6 of the methods we support separately.

Some questions, though:

(1) Why are we doing the open_sync different-size write test?  AFAIK,
this doesn't match any behavior which PostgreSQL has.

(2) In this patch, I'm stepping down the number of loops which
fsync_writethrough does by 90%.  The reason for that was that on the
platforms where I tested writethrough (desktop machines), doing 10,000
loops took 15-20 *minutes*, which seems hard on the user.  Would be easy
to revert if you think it's a bad idea.
Possibly auto-sizing the number of loops based on the first fsync test
might be a good idea, but seems like going a bit too far.

(3) Should the multi-descriptor test be using writethrough on platforms
which support it?

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com
diff --git a/src/tools/fsync/Makefile b/src/tools/fsync/Makefile
index 252c087..2ddbbe9 100644
*** a/src/tools/fsync/Makefile
--- b/src/tools/fsync/Makefile
***
*** 4,10 
  #
  # Copyright (c) 2003-2010, PostgreSQL Global Development Group
  #
! # src/tools/fsync/Makefile
  #
  #-------------------------------------------------------------------------
  
--- 4,10 
  #
  # Copyright (c) 2003-2010, PostgreSQL Global Development Group
  #
! # $PostgreSQL: pgsql/src/tools/fsync/Makefile,v 1.9 2010/07/05 18:54:38 tgl Exp $
  #
  #-------------------------------------------------------------------------
  
*** override CPPFLAGS := -I$(libpq_srcdir) $
*** 16,24 
  
  OBJS= test_fsync.o
  
! all: test_fsync
  
! test_fsync: test_fsync.o | submake-libpq submake-libpgport
  	$(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
  
  clean distclean maintainer-clean:
--- 16,24 
  
  OBJS= test_fsync.o
  
! all: submake-libpq submake-libpgport test_fsync
  
! test_fsync: test_fsync.o $(libpq_builddir)/libpq.a
  	$(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X)
  
  clean distclean maintainer-clean:
diff --git a/src/tools/fsync/README b/src/tools/fsync/README
index 6d9acd3..5b45581 100644
*** a/src/tools/fsync/README
--- b/src/tools/fsync/README
***
*** 1,4 
! src/tools/fsync/README
  
  fsync
  =
--- 1,4 
! $PostgreSQL: pgsql/src/tools/fsync/README,v 1.5 2009/11/28 15:04:54 momjian Exp $
  
  fsync
  =
*** fsync
*** 6,11 
  This program tests fsync.  The tests are described as part of the program output.
  
  	Usage:	test_fsync [-f filename] [loops]
  
! Loops defaults to 5000.  The default output file is /var/tmp/test_fsync.out.
! Consider that /tmp or /var/tmp might be memory-based file systems.
--- 6,25 
  This program tests fsync.  The tests are described as part of the program output.
  
  	Usage:	test_fsync [-f filename] [loops]
+ 	
+ test_fsync is intended to give you a reasonable idea of what the fastest 
+ fsync_method is on your specific system, as well as supplying diagnostic 
+ information in the event of an identified I/O problem.  However, differences
+ shown by test_fsync may not make any difference in real database throughput,
+ especially since many database servers are not speed-limited by their
+ transaction logs.
  
! Filename defaults to test_fsync.out in the current directory. test_fsync
! should be run on the same filesystem where your transaction log currently
! resides.
! 
! Loops default to 1, except for writethrough tests, where there are 1/10 of
! that in order to make the user not wait forever.  You should lower loops if you
! have a slow system and the tests are taking more than 5 minutes each.  You should
! raise loops if your system is faster than 5000/second, in order to get useful
! statistics.
diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index 28c2119..5980b70 100644
*** a/src/tools/fsync/test_fsync.c
--- b/src/tools/fsync/test_fsync.c
***
*** 3,9 
   *
   *
   *	test_fsync.c
!  *		test various fsync() methods
   */
  
  #include "postgres.h"
--- 3,9 
   *
   *
   *	test_fsync.c
!  *		tests all supported fsync() methods
   */
  
  #include "postgres.h"
***
*** 22,55 
  #include <unistd.h>
  #include <string.h>
  
! 
! #ifdef WIN32
  #define FSYNC_FILENAME	"./test_fsync.out"
- #else
- /* /tmp might be a memory file system */
- #define FSYNC_FILENAME	"/var/tmp/test_fsync.out"
- #endif
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */
  
  #define LABEL_FORMAT	"\t%-30s"
  
  int			loops = 1;
  
  

Re: [HACKERS] initdb failure with Postgres 8.4.4

2010-12-09 Thread BRUSSER Michael
No, we do not use --with-system-tzdata  option.
I looked at the makefile and at the output of pg_config. We may need to do some 
cleanup there, but I did not pick up any clues. 
The problem occurs on all our UNIX platforms.
Is there anything I could do to shed more light on it? 
I can post the output of pg_config if it helps, or try to run initdb with truss 
or strace.

Thanks,
Michael.


-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Thursday, December 09, 2010 2:56 PM
To: BRUSSER Michael
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] initdb failure with Postgres 8.4.4 

BRUSSER Michael michael.brus...@3ds.com writes:
 Initdb fails for me when host machine has no access to the Postgres build 
 location.

 LOG:  could not open directory .../install/share/timezone: No such file or 
 directory

Moving the install tree works for me.  Did you do something odd with the
--with-system-tzdata configuration option?

regards, tom lane



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to add a primary key using an existing index

2010-12-09 Thread Robert Haas
On Thu, Dec 9, 2010 at 2:51 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Tom Lane t...@sss.pgh.pa.us wrote:
 If the constraint name is not specified, we should certainly use
 the existing index name, not randomly rename it.

 +1

+1

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Tom Lane
Jie Li jay23j...@gmail.com writes:
 I'm new to window functions. Recently I run some simple queries but
 surprised to find percent_rank is so slower than rank, could anybody tell me
 why?

Huh, interesting.  I can reproduce this with toy data, such as

create table inventory1 (inv_date_sk int, inv_item_sk int);
insert into inventory1 select 1, random()* 10 from 
generate_series(1,189000);
explain analyze select inv_date_sk,inv_item_sk, percent_rank()over(partition by 
inv_date_sk order by inv_item_sk) from inventory1;

The example is *not* particularly slow if you leave work_mem at default.
But if you bump up work_mem enough so that the WindowAgg's internal
tuplestore fits into memory, it slows down like crazy.  A bit of quality
time with oprofile shows that all the time is going into this memmove()
in tuplestore_trim():

/*
 * Slide the array down and readjust pointers.  This may look pretty
 * stupid, but we expect that there will usually not be very many
 * tuple-pointers to move, so this isn't that expensive; and it keeps a
 * lot of other logic simple.
 *
 * In fact, in the current usage for merge joins, it's demonstrable that
 * there will always be exactly one non-removed tuple; so optimize that
 * case.
 */
if (nremove + 1 == state->memtupcount)
    state->memtuples[0] = state->memtuples[nremove];
else
    memmove(state->memtuples, state->memtuples + nremove,
            (state->memtupcount - nremove) * sizeof(void *));

We're throwing away one tuple at a time as we advance forward through
the tuplestore, and moving 10+ tuple pointers each time.  Ugh.
This code was all right when written, because (IIRC) the mergejoin
case was actually the only caller.  But it's not all right for
WindowAgg's less-predictable usage patterns.

I thought for a bit about changing things around so that the first-used
tuple slot isn't necessarily state-memtuples[0], but just like the
comment says, that complicates a lot of other logic.  And there isn't
any easy place to reclaim the wasted slots later.

What seems like the best bet is to put in a heuristic to make
tuplestore_trim simply not do anything until nremove reaches some
reasonably large amount, perhaps 10% of the number of stored tuples.
This wastes up to 10% of the allotted memory, but that seems tolerable.
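Concretely, something like this at the top of tuplestore_trim (a sketch of
the heuristic only, not a tested patch):

/* Ignore trim requests until the removable prefix is a sizable
 * fraction of the stored tuples; bookkeeping details elided. */
if (nremove < state->memtupcount / 10)
	return;
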
We could complicate things a bit more by remembering that so-and-so
many slots are authorized to be removed, and forcing a trim operation
to discard them if we find ourselves in memory trouble.  I'm not sure
that extra complication is worthwhile though.  Comments?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb failure with Postgres 8.4.4

2010-12-09 Thread Andrew Dunstan




On 12/09/2010 03:36 PM, BRUSSER Michael wrote:

No, we do not use --with-system-tzdata  option.
I looked at the makefile and at the output of pg_config. We may need to do some 
cleanup there, but I did not pick any clues.
The problem occurs on all our UNIX platforms.
Is there anything I could do to shed more light on it?
I can post the output of pg_config if it helps, or try to run initdb with truss 
or strace.



Yes, please show us the pg_config (and please don't top-post).

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extensions, patch v16

2010-12-09 Thread David E. Wheeler
On Dec 9, 2010, at 12:34 PM, Dimitri Fontaine wrote:

 - add support for 'relocatable' boolean property in the control file,
   as discussed on list
 
   this controls what happens at create extension time, by doing a
   relocation of the extension objects when the extension is relocatable
   and the asked schema isn't the first of the search_path (you can't
   relocate an object to a schema where it already is)
 
   when the extension is not relocatable, the mechanism used is the
   @extschema@ replacement in the script so that the user still has a
   say, but at create time only

This still isn't ideal, but I think it's a big improvement. Thanks.

 - nothing is done for the psql commands \dx and \dx+, here's an idea:
 
   \dx lists only installed extensions
   \dx+ extension lists the objects, calling pg_extension_objects()
   \dX lists available extensions (and installed too)

+1 I think that's much more like existing psql commands.

 - we still depend on extension authors providing a control file. Do we
   want to spend some effort on trying to get rid of this file? I know
   David desperately wants to, but that's at the cost of making it much
   harder to manage more than one extension in a single directory, for
   one, and the Makefile mechanisms to make that happen (include a rule
   depending on the presence of some variables, keep track of it for the
   cleaning, etc) don't seem to me to be worth it.

I don't think it makes it any harder to manage multiple extensions in a single 
directory, because one can create the control file explicitly (or perhaps rely 
on .control.in for that), just as they do now. Everyone else can do less work.

So:

* If $extension.control.in exists, use that
* If it doesn't, generate $extension.control from the Makefile variables
* Always remove $extension.control in the `clean` targets

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Tom Lane
I wrote:
 We're throwing away one tuple at a time as we advance forward through
 the tuplestore, and moving 10+ tuple pointers each time.  Ugh.
 This code was all right when written, because (IIRC) the mergejoin
 case was actually the only caller.  But it's not all right for
 WindowAgg's less-predictable usage patterns.

 I thought for a bit about changing things around so that the first-used
 tuple slot isn't necessarily state-memtuples[0], but just like the
 comment says, that complicates a lot of other logic.  And there isn't
 any easy place to reclaim the wasted slots later.

 What seems like the best bet is to put in a heuristic to make
 tuplestore_trim simply not do anything until nremove reaches some
 reasonably large amount, perhaps 10% of the number of stored tuples.
 This wastes up to 10% of the allotted memory, but that seems tolerable.

On reflection I think just not doing anything isn't a very good idea.
The problem with that is that a mis-coded caller could try to fetch
tuples that it had already told the tuplestore could be trimmed away;
and this would work, most of the time, until you got unlucky and the
trim operation had actually deleted them.  I think it's pretty important
for bug-catching purposes that the tuplestore enforce that those tuples
are not available anymore.

Hence the attached patch, which combines the two ideas by recycling
tuples immediately but not sliding the pointer array until a reasonable
amount of movement has occurred.  This fixes the complained-of
performance problem AFAICT.

I'm not sure whether or not to back-patch this into 9.0 and 8.4.  The
code in tuplestore.c hasn't changed at all since 8.4, so there's not
much risk of cross-version bugs, but if I did miss anything we could
be shipping a buggy version next week.  Thoughts?

regards, tom lane

diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 9bbaba43771f495fdf24e9f2afd545b69a22ecbd..8c8139c897679892e0d4ad13e69ae8d814484206 100644
*** a/src/backend/utils/sort/tuplestore.c
--- b/src/backend/utils/sort/tuplestore.c
*************** struct Tuplestorestate
*** 145,152 ****
--- 145,159 ----
  	/*
  	 * This array holds pointers to tuples in memory if we are in state INMEM.
  	 * In states WRITEFILE and READFILE it's not used.
+ 	 *
+ 	 * When memtupdeleted > 0, the first memtupdeleted pointers are already
+ 	 * released due to a tuplestore_trim() operation, but we haven't expended
+ 	 * the effort to slide the remaining pointers down.  These unused pointers
+ 	 * are set to NULL to catch any invalid accesses.  Note that memtupcount
+ 	 * includes the deleted pointers.
  	 */
  	void	  **memtuples;		/* array of pointers to palloc'd tuples */
+ 	int			memtupdeleted;	/* the first N slots are currently unused */
  	int			memtupcount;	/* number of tuples currently present */
  	int			memtupsize;		/* allocated length of memtuples array */
*************** tuplestore_begin_common(int eflags, bool
*** 252,257 ****
--- 259,265 ----
  	state->context = CurrentMemoryContext;
  	state->resowner = CurrentResourceOwner;
  
+ 	state->memtupdeleted = 0;
  	state->memtupcount = 0;
  	state->memtupsize = 1024;	/* initial guess */
  	state->memtuples = (void **) palloc(state->memtupsize * sizeof(void *));
*************** tuplestore_clear(Tuplestorestate *state)
*** 401,407 ****
  	state->myfile = NULL;
  	if (state->memtuples)
  	{
! 		for (i = 0; i < state->memtupcount; i++)
  		{
  			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
  			pfree(state->memtuples[i]);
--- 409,415 ----
  	state->myfile = NULL;
  	if (state->memtuples)
  	{
! 		for (i = state->memtupdeleted; i < state->memtupcount; i++)
  		{
  			FREEMEM(state, GetMemoryChunkSpace(state->memtuples[i]));
  			pfree(state->memtuples[i]);
*************** tuplestore_clear(Tuplestorestate *state)
*** 409,414 ****
--- 417,423 ----
  	}
  	state->status = TSS_INMEM;
  	state->truncated = false;
+ 	state->memtupdeleted = 0;
  	state->memtupcount = 0;
  	readptr = state->readptrs;
  	for (i = 0; i < state->readptrcount; readptr++, i++)
*************** tuplestore_end(Tuplestorestate *state)
*** 432,438 ****
  		BufFileClose(state->myfile);
  	if (state->memtuples)
  	{
! 		for (i = 0; i < state->memtupcount; i++)
  			pfree(state->memtuples[i]);
  		pfree(state->memtuples);
  	}
--- 441,447 ----
  		BufFileClose(state->myfile);
  	if (state->memtuples)
  	{
! 		for (i = state->memtupdeleted; i < state->memtupcount; i++)
  			pfree(state->memtuples[i]);
  		pfree(state->memtuples);
  	}
*************** tuplestore_gettuple(Tuplestorestate *sta
*** 774,787 ****
  				}
  				else
  				{
! 					if (readptr->current <= 0)
  					{
  						Assert(!state->truncated);
  						return NULL;
  					}
  					readptr->current--; /* last returned tuple */
  				}
! 				if (readptr->current <= 0)
  				{
  					Assert(!state->truncated);
  					return NULL;
--- 783,796 ----
  				}
  				else
  				{
! 					if (readptr->current <= state->memtupdeleted)
  					{
  

Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 I'm not sure whether or not to back-patch this into 9.0 and 8.4. 
 The code in tuplestore.c hasn't changed at all since 8.4, so
 there's not much risk of cross-version bugs, but if I did miss
 anything we could be shipping a buggy version next week. 
 Thoughts?
 
Is there a performance regression involved, or is it a new feature
which hasn't performed well on this type of query until your patch? 
If the latter, I'd be inclined to give it some time on HEAD and
release it in the *following* minor release unless you're *very*
confident it couldn't break anything.
 
It's an uphill battle to convince managers that it's safe to apply
minor upgrades with minimal testing.  It doesn't take too many slips
for the boulder to roll all the way back to the bottom of that hill.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Tom Lane t...@sss.pgh.pa.us wrote:
 I'm not sure whether or not to back-patch this into 9.0 and 8.4. 
 The code in tuplestore.c hasn't changed at all since 8.4, so
 there's not much risk of cross-version bugs, but if I did miss
 anything we could be shipping a buggy version next week. 
 Thoughts?
 
 Is there a performance regression involved, or is it a new feature
 which hasn't performed well on this type of query until your patch? 

Well, since window functions didn't exist before 8.4, it's difficult to
argue that there was a regression.  It's certainly a performance bug
though: nobody would expect that giving a query *more* work_mem would
cause it to run many times slower.

 If the latter, I'd be inclined to give it some time on HEAD and
 release it in the *following* minor release unless you're *very*
 confident it couldn't break anything.

Well, I'm reasonably confident in the patch, and it does pass regression
tests.  But I've been wrong before.

I'm not terribly thrilled with that suggestion though.  Do you have
reason to think that anybody is likely to exercise window functions in
HEAD, beyond what the regression tests do, in the next couple of months?
Slipping the application of the patch to back branches by a little bit
doesn't make a lot of management sense to me.  I think either we trust
it and put it into back branches, or we don't trust it and put it only
in HEAD, so it goes through a beta cycle before hitting production.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Kevin Grittner
Tom Lane t...@sss.pgh.pa.us wrote:
 
 Do you have reason to think that anybody is likely to exercise
 window functions in HEAD, beyond what the regression tests do, in
 the next couple of months?
 
Not specifically, no.  From the description (not having read the
patch) I was somewhat concerned that it might affect something
outside that narrow use case.  If that's not possible, then I'm more
comfortable putting it in a release that soon after it hits the
repository.
 
It's a judgment call, and you're clearly in the best position to
make it.  You asked for thoughts, so I shared my concerns.  :-)
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Why percent_rank is so slower than rank?

2010-12-09 Thread Kenneth Marshall
On Thu, Dec 09, 2010 at 05:18:57PM -0500, Tom Lane wrote:
 I wrote:
  We're throwing away one tuple at a time as we advance forward through
  the tuplestore, and moving 10+ tuple pointers each time.  Ugh.
  This code was all right when written, because (IIRC) the mergejoin
  case was actually the only caller.  But it's not all right for
  WindowAgg's less-predictable usage patterns.
 
  I thought for a bit about changing things around so that the first-used
  tuple slot isn't necessarily state-memtuples[0], but just like the
  comment says, that complicates a lot of other logic.  And there isn't
  any easy place to reclaim the wasted slots later.
 
  What seems like the best bet is to put in a heuristic to make
  tuplestore_trim simply not do anything until nremove reaches some
  reasonably large amount, perhaps 10% of the number of stored tuples.
  This wastes up to 10% of the allotted memory, but that seems tolerable.
 
 On reflection I think just not doing anything isn't a very good idea.
 The problem with that is that a mis-coded caller could try to fetch
 tuples that it had already told the tuplestore could be trimmed away;
 and this would work, most of the time, until you got unlucky and the
 trim operation had actually deleted them.  I think it's pretty important
 for bug-catching purposes that the tuplestore enforce that those tuples
 are not available anymore.
 
 Hence the attached patch, which combines the two ideas by recycling
 tuples immediately but not sliding the pointer array until a reasonable
 amount of movement has occurred.  This fixes the complained-of
 performance problem AFAICT.
 
 I'm not sure whether or not to back-patch this into 9.0 and 8.4.  The
 code in tuplestore.c hasn't changed at all since 8.4, so there's not
 much risk of cross-version bugs, but if I did miss anything we could
 be shipping a buggy version next week.  Thoughts?
 
   regards, tom lane
 

+1 for back patching.

Ken


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Final(?) proposal for wal_sync_method changes

2010-12-09 Thread Greg Smith
Since any Windows refactoring has been postponed for now (I'll get back 
to performance checks on that platform later), during my testing time 
this week instead I did a round of pre-release review of the change Tom 
has now committed.  All looks good to me, including the docs changes.


I confirmed that:

-Ubuntu system with an older kernel still has the same wal_sync_method 
(fdatasync) and performance after pulling the update
-RHEL6 system changes as planned from using open_datasync to fdatasync 
once I updated to a HEAD after the commit


On the RHEL6 system, I also checked the commit rate using pgbench with 
the attached INSERT-only script, rather than relying on test_fsync.  
This is a 7200 RPM drive, so the theoretical max is 120 commits/second 
on ext4; this is the same test setup I described in more detail back in 
http://archives.postgresql.org/message-id/4ce2ebf8.4040...@2ndquadrant.com


$ psql -c "show wal_sync_method"
 wal_sync_method
-----------------
 fdatasync
(1 row)

$ pgbench -i -s 10 pgbench

[gsm...@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 10
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 6733
tps = 112.208795 (including connections establishing)
tps = 112.216904 (excluding connections establishing)

And then manually switched over to test performance of the troublesome 
old default:


[gsm...@meddle ~]$ psql -c "show wal_sync_method"
 wal_sync_method
-----------------
 open_datasync
(1 row)

[gsm...@meddle ~]$ pgbench -s 10 -f insert.sql -c 1 -T 60 pgbench
starting vacuum...end.
transaction type: Custom query
scaling factor: 10
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 6672
tps = 111.185802 (including connections establishing)
tps = 111.195089 (excluding connections establishing)

This is interesting, because test_fsync consistently reported a rate of 
about half this when using open_datasync, rather than the equal 
performance I'm getting from the database.  I'll see if I can reproduce 
that further, but I don't think it's a reason to be concerned about the 
change that's been made.  It's just more evidence that test_fsync has 
quirks left to be sorted out.  But that's not back-branch material; it 
should be part of the 9.1-only refactoring, already in progress via the 
patch Josh submitted.  There's a bit of time left to get that done.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books

\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 10 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Instrument checkpoint sync calls

2010-12-09 Thread Greg Smith

Jeff Janes wrote:

In my test cases, the syncs that the backends were doing were almost
always to the same file that the checkpoint writer was already choking
on (so they are entangled simply by virtue of that).  So very quickly
all the backends hit the same wall and thunked to a halt.  This is
probably an artifact of trying to use pgbench as the basis for a very
artificial model.


Yes--pgbench has some problems like you describe, ones that are a bit 
different from the way I've seen fsync writes get in each other's way in 
the production systems I've looked at.  That's good if you really want 
to provoke this behavior, which is one reason why I've used it as an 
example for my patches so far (the other being that it's already 
available in everyone's installation).  But it's tough to get it to act 
more like real-world systems, which don't have quite so many localized 
updates, without cranking the scale way up.  And that then tends to 
aggravate other problems too.


The 8.3 checkpoint spreading work also got some useful results using the 
dbt-2 benchmark.  I'm at the point where I think I need to return to 
that test program for what I'm doing now.  I'd encourage you to try that 
out too if you get a chance.


Thanks for the feedback and the review.  I hope you appreciate now why I 
suggested you wait for the stuff I was submitting before getting back 
into the sorted checkpoint topic again.  That should be a lot easier to 
make sense of with this instrumentation in place.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Postgresql 9.1 pg_last_xact_replay_timestamp limitations

2010-12-09 Thread Fujii Masao
On Fri, Dec 10, 2010 at 1:24 AM, Gabi Julien gabi.jul...@broadsign.com wrote:
 On Wednesday 08 December 2010 21:58:46 you wrote:
 On Thu, Dec 9, 2010 at 1:37 AM, Gabi Julien gabi.jul...@broadsign.com 
 wrote:
  slave# /etc/init.d/postgresql start
  slave# psql -hlocalhost my_db -c "select pg_last_xact_replay_timestamp(),
  now() as not_modified_since;"
   pg_last_xact_replay_timestamp |      not_modified_since
  -------------------------------+-------------------------------
                                 | 2010-12-08 16:06:09.920219+00

 We should return the timestamp of last valid checkpoint rather than NULL in 
 that
 case?

 Well, I think this behavior would be more appreciated by postgresql users in 
 general. The case where the slave can be restarted after a clean shutdown is 
 rare but we need to consider it nonetheless. In my case I implemented a 
 custom function that reads the last returned timestamp from a new file on 
 disk. This is not a perfect solution, since the value returned might be older 
 than the actual state of the replication, but it's good enough for my needs.

The second question is: what should be returned when the server has been
started normally, without recovery? NULL? The timestamp of the last valid
checkpoint?

The third question is: what should be returned while replaying WAL records
that exist between the REDO starting point and the checkpoint? In this case,
it seems bad to return the timestamp of the checkpoint whenever there is no
replayed transaction yet, since the result timestamp would go backwards once
at least one transaction has been replayed before reaching the checkpoint
record.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Final(?) proposal for wal_sync_method changes

2010-12-09 Thread Josh Berkus
Greg,

 This is interesting, because test_fsync consistently reported a rate of
 about half this when using open_datasync instead of the equal
 performance I'm getting from the database.  I'll see if I can reproduce
 that further, but it's no reason to be concerned about the change that's
 been made I think.  Just more evidence that test_fsync has quirks left
 to be sorted out.  But that's not backbranch material, it should be part
 of 9.1 only refactoring, already in progress via the patch Josh
 submitted.  There's a bit of time left to get that done.

Did you rerun test_fsync with O_DIRECT enabled, using my patch?  The
figures you had from test_fsync earlier were without O_DIRECT.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Final(?) proposal for wal_sync_method changes

2010-12-09 Thread Greg Smith

Josh Berkus wrote:

Did you rerun test_fsync with O_DIRECT enabled, using my patch?  The
figures you had from test_fsync earlier were without O_DIRECT.
  


No--I was just focused on testing the changes that had already been 
committed.  The use of O_DIRECT in the server but not test_fsync could 
very well be the reason for the difference; don't know yet.  I'm trying 
to get through this CF before I start getting distracted by newer 
patches, I'll get to yours soon I hope.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Supportwww.2ndQuadrant.us
PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] SynchRep; wait-forever and shutdown

2010-12-09 Thread Fujii Masao
Hi,

In previous discussions, some people wanted a wait-forever option, which
blocks all the transactions on the master until a sync'd standby has appeared,
in order to reduce the risk of data loss in synchronous replication.

What I'm not clear on is: how does a smart or fast shutdown proceed while all
the transactions are being blocked?

1. Should shutdown wait for all the transactions to end, by waiting for a
 sync'd standby to appear?
 * Problem is that shutdown could take very long.

2. Should shutdown commit all the blocked transactions?
 * Problem is that a client would think those transactions have successfully
been committed even though they have not been replicated to the
standby.

3. Should shutdown abort all the blocked transactions?
 * Problem is that a client would think those transactions have been aborted
even though their WAL records have been written on the master. But
this is a very common problem for any DBMS, so we don't need to worry about
it in the context of replication.

ISTM smart and fast shutdown fit in with #1 and #3, respectively. Thoughts?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
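
[Editor's note: a minimal sketch of the mapping proposed above, assuming
smart shutdown maps to option #1 and fast shutdown to option #3. The
function names are hypothetical, not actual postmaster code.]

/* Hypothetical hooks -- illustration only. */
extern void WaitForSyncStandbyThenDrainTransactions(void);
extern void AbortBlockedTransactions(void);

typedef enum { SMART_SHUTDOWN, FAST_SHUTDOWN } ShutdownMode;

static void
handle_shutdown_sketch(ShutdownMode mode)
{
    switch (mode)
    {
        case SMART_SHUTDOWN:
            /* #1: wait for a sync'd standby so the blocked transactions
             * can finish replicating, however long that takes. */
            WaitForSyncStandbyThenDrainTransactions();
            break;

        case FAST_SHUTDOWN:
            /* #3: abort the blocked transactions; their WAL may already
             * be written locally, but that ambiguity is the usual DBMS
             * crash story, not a replication-specific problem. */
            AbortBlockedTransactions();
            break;
    }
}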

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] serializable read only deferrable

2010-12-09 Thread Dan Ports
On Tue, Dec 07, 2010 at 10:14:24AM -0600, Kevin Grittner wrote:
  Essentially, instead of adding dependencies as you go along and
  abort once you hit a conflict, SERIALIZABLE READ ONLY DEFERRED
  transactions would assume the worst case from the start and thus
  be able to bypass the more detailed checks later on.
  
 Right -- such a transaction, having acquired a good snapshot, could
 release all SSI resources and run without any of the SSI overhead.

Yes, this makes sense. If no running transaction has ever read, and
will never read before COMMIT, any value that's modified by a
concurrent transaction, then no snapshot anomalies can arise, and the
current snapshot has a place in the serial ordering.

  With this scheme, you'd at least stand some chance of eventually
  acquiring a consistent snapshot, even in the case of an endless
  stream of overlapping READ WRITE transactions.
  
 Yeah, I'd been twisting ideas around trying to find a good way to do
 this; you've got it right at the conceptual level, I think.

The only thing I'm worried about here is how much risk of starvation
remains. You'd need to wait until there are no running r/w transactions
accessing overlapping data sets; for some applications that might not
be any better than waiting for the system to be idle. But I think
there's no way around that, it's just the price you have to pay to get
a snapshot that can never see an anomaly.

 Pseudo-code of idea (conveniently ignoring locking issues and
 non-serializable transactions):

This seems reasonable to me. Let me know if you need help implementing
it; I have some spare cycles right now.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/
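
[Editor's note: Kevin's pseudo-code was trimmed from the quote above, so
here is a rough sketch of the loop being discussed, with entirely
hypothetical names: keep taking snapshots until one begins with no
overlapping read/write serializable transactions running, then drop all
SSI bookkeeping. The wait in each iteration is where the starvation risk
Dan mentions lives.]

#include <stdbool.h>

typedef struct SketchSnapshot SketchSnapshot;

/* Hypothetical helpers -- illustration only, not the actual SSI patch. */
extern SketchSnapshot *TakeSnapshot(void);
extern bool NoOverlappingReadWriteSerializableXacts(SketchSnapshot *snap);
extern void DiscardSnapshot(SketchSnapshot *snap);
extern void ReleaseAllSSIResources(void);
extern void WaitForSomeSerializableXactToFinish(void);

static SketchSnapshot *
GetDeferredSafeSnapshot(void)
{
    for (;;)
    {
        SketchSnapshot *snap = TakeSnapshot();

        if (NoOverlappingReadWriteSerializableXacts(snap))
        {
            /* No anomaly can ever become visible to this snapshot, so
             * the transaction can run without any SSI overhead. */
            ReleaseAllSSIResources();
            return snap;
        }

        /* Overlapping r/w work is still in flight: wait and retry. */
        DiscardSnapshot(snap);
        WaitForSomeSerializableXactToFinish();
    }
}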

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Anyone for SSDs?

2010-12-09 Thread Vaibhav Kaushal
Hi all,

Most of you already know I am new to this list and newer still to OSS
development. However, while browsing the source code (of 9.0.1) I found
that there is only one way to store relations on disk - the magnetic
disk.

This came to mind suddenly, so I am asking the experts here.

Considering the fact that SSDs will be common (at least for the
enterprise) in the coming years because of (of course you know the
reason) their lower seek times and higher transfer rates, is
anyone trying for an ssd.c? In almost all cases, even using md.c,
the kernel will handle it effectively, but would it not be better to be
well prepared to ask the kernel for more?

Or has such an attempt already begun?

- Vaibhav (*_*)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] To Signal The postmaster

2010-12-09 Thread aaliya zarrin
Hi All,

Can anybody tell me what steps Postgres follows after finding the trigger
file?

When and how will it set recovery mode to false?


On Thu, Dec 9, 2010 at 3:51 PM, Fujii Masao masao.fu...@gmail.com wrote:

 On Wed, Dec 8, 2010 at 6:22 PM, Fujii Masao masao.fu...@gmail.com wrote:
  On Wed, Dec 8, 2010 at 4:59 PM, Heikki Linnakangas
  heikki.linnakan...@enterprisedb.com wrote:
  For 9.1, we should think of a better way to do this, perhaps using
 SIGUSR1
  to wake up. Maybe we won't even need the trigger file anymore.
 
  If we use SIGUSR1, the mechanism to allow the users to specify the event
 type
  seems to be required. For example, we should make the SIGUSR1 handler
  check not only the shmem (i.e., PMSignalStat) but also the file?

 What I'm thinking is to make something like pg_ctl promote create the
 event
 file specifying the standby promotion in $PGDATA/pg_event or elsewhere,
 and send SIGUSR1 to postmaster. OTOH, when SIGUSR1 arrives, postmaster
 checks whether that event file exists. If it does, postmaster removes it
 and
 sends the signal to startup process for standby promotion.

 Thought?

 Regards,

 --
 Fujii Masao
 NIPPON TELEGRAPH AND TELEPHONE CORPORATION
 NTT Open Source Software Center




-- 
Thanks  Regards,

Aaliya Zarrin
(+91)-9160665888
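
[Editor's note: a minimal sketch of the postmaster side of the proposal
quoted above, to make the flow concrete. The event-file path and names
here are hypothetical, not actual PostgreSQL code: "pg_ctl promote"
would create the file and send SIGUSR1; the handler consumes the file
and forwards the request to the startup process.]

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

#define PROMOTE_EVENT_FILE "pg_event/promote"   /* relative to $PGDATA */

static pid_t StartupPID;        /* pid of the startup process */

static void
sigusr1_handler_sketch(int sig)
{
    (void) sig;

    /* If the event file exists, this SIGUSR1 requests promotion. */
    if (access(PROMOTE_EVENT_FILE, F_OK) == 0)
    {
        unlink(PROMOTE_EVENT_FILE);         /* consume the event */
        if (StartupPID != 0)
            kill(StartupPID, SIGUSR2);      /* tell startup to promote */
        return;
    }

    /* ...otherwise fall through to the existing shared-memory checks. */
}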


[HACKERS] Fwd: Extended query protocol and exact types matches.

2010-12-09 Thread Dmitriy Igrishin
-- Forwarded message --
From: Dmitriy Igrishin dmit...@gmail.com
Date: 2010/12/10
Subject: Fwd: Extended query protocol and exact types matches.
To: postgres list pgsql-...@postgresql.org


Hey sql@,


-- Forwarded message --
From: Dmitriy Igrishin dmit...@gmail.com
Date: 2010/12/9
Subject: Extended query protocol and exact types matches.
To: pgsql-gene...@postgresql.org


Hey general@,

Just to be sure, and for my own peace of mind.

Problem:

1. CREATE TABLE test_tab (id integer, dat varchar(64));

2. INSERT INTO test_tab VALUES($1, $2) via PQexecParams,
 where paramTypes[0] == OID of bigint,
   paramTypes[1] == OID of text.

Questions:

Does this case fall under
http://www.postgresql.org/docs/9.0/static/typeconv-query.html ?

Are such cases safe, or is it recommended (best) to specify
OIDs which are exact matches?

PS.

I know that queries like SELECT $1 do not work without
specifying an OID or rewriting them to e.g. SELECT $1::text.

Thanks.
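
[Editor's note: a self-contained libpq sketch of exactly this case --
OID 20 (int8/bigint) bound against the integer column and OID 25 (text)
against the varchar column from step 1. The connection string is a
placeholder; the OIDs are the well-known values from pg_type.h. Under
the conversion rules on the page linked above, this insert goes through
assignment casts (int8 to integer, text to varchar), so it is safe as
long as the values actually fit the column types.]

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn     *conn = PQconnectdb("dbname=test");      /* placeholder */
    Oid         paramTypes[2] = {20, 25};               /* int8, text */
    const char *paramValues[2] = {"42", "some data"};
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    res = PQexecParams(conn,
                       "INSERT INTO test_tab VALUES ($1, $2)",
                       2, paramTypes, paramValues,
                       NULL, NULL,      /* text-format parameters */
                       0);              /* text-format result */
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        fprintf(stderr, "%s", PQresultErrorMessage(res));

    PQclear(res);
    PQfinish(conn);
    return 0;
}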

-- 
// Dmitriy.