Re: [HACKERS] SSI rw-conflicts and 2PC

2012-02-14 Thread Heikki Linnakangas

On 14.02.2012 04:57, Dan Ports wrote:

Looking over the SSI 2PC code recently, I noticed that I overlooked a
case that could lead to non-serializable behavior after a crash.

When we PREPARE a serializable transaction, we store part of the
SERIALIZABLEXACT in the statefile (in addition to the list of SIREAD
locks). One of the pieces of information we record is whether the
transaction had any conflicts in or out. The problem is that that can
change if a new conflict occurs after the transaction has prepared.

Here's an example of the problem (based on the receipt-report test):

-- Setup
CREATE TABLE ctl (k text NOT NULL PRIMARY KEY, deposit_date date NOT NULL);
INSERT INTO ctl VALUES ('receipt', DATE '2008-12-22');
CREATE TABLE receipt (receipt_no int NOT NULL PRIMARY KEY, deposit_date date 
NOT NULL, amount numeric(13,2));

-- T2
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT INTO receipt VALUES (3, (SELECT deposit_date FROM ctl WHERE k = 
'receipt'), 4.00);
PREPARE TRANSACTION 't2';

-- T3
BEGIN ISOLATION LEVEL SERIALIZABLE;
UPDATE ctl SET deposit_date = DATE '2008-12-23' WHERE k = 'receipt';
COMMIT;

-- T1
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM ctl WHERE k = 'receipt';
SELECT * FROM receipt WHERE deposit_date = DATE '2008-12-22';
COMMIT;

Running this sequence of transactions normally, T1 will be rolled back
because of the pattern of conflicts T1 -> T2 -> T3, as we'd expect. This
should still be true even if we restart the database before executing
the last transaction -- but it's not. The problem is that, when T2
prepared, it had no conflicts, so we recorded that in the statefile.
The T2 -> T3 conflict happened later, so we didn't know about it during
recovery.

I discussed this a bit with Kevin and we agreed that this is important
to fix, since it's a false negative that violates serializability. The
question is how to fix it. There are a couple of options...

The easiest answer would be to just treat every prepared transaction
found during recovery as though it had a conflict in and out. This
is roughly a one-line change, and it's certainly safe. But the
downside is that this is pretty restrictive: after recovery, we'd
have to abort any serializable transaction that tries to read
anything that a prepared transaction wrote, or modify anything that
it read, until that transaction is either committed or rolled back.


+1 for this solution.


To do better than that, we want to know accurately whether the prepared
transaction had a conflict with a transaction that prepared or
committed before the crash. We could do this if we had a way to append
a record to the 2PC statefile of an already-prepared transaction --
then we'd just add a new record indicating the conflict. Of course, we
don't have a way to do that. It'd be tricky to add support for this,
since it has to be crash-safe, so the question is whether the improved
precision justifies the complexity it would require.


Not worth the complexity, IMO.

Perhaps it would be simpler to add the extra information to the commit 
records of the transactions that commit after the first transaction is 
prepared. In the commit record, you would include a list of prepared 
transactions that this transaction conflicted with. During recovery, you 
would collect those lists in memory, and use them at the end of recovery 
to flag the conflicts in prepared transactions that are still in 
prepared state.
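
To make the shape of that concrete, here is a minimal sketch of what such
a commit-record payload could look like (all of these names are
hypothetical, not existing PostgreSQL symbols):

#include <stdint.h>

typedef uint32_t TransactionId;        /* stand-in for the real typedef */

/*
 * Hypothetical extra payload on a commit record: the prepared
 * transactions this committing transaction had rw-conflicts with.
 * Recovery would collect these, and at end of recovery set the summary
 * conflict flags on any xid still in prepared state.
 */
typedef struct SsiCommitConflicts
{
    int             nprepared;         /* number of conflicting prepared xacts */
    TransactionId   prepared_xids[1];  /* variable-length array follows */
} SsiCommitConflicts;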



A third option is to observe that the only conflicts *in* that
matter from a recovered prepared transaction are from other prepared
transactions. So we could have prepared transactions include in
their statefile the xids of any prepared transactions they conflicted
with at prepare time, and match them up during recovery to
reconstruct the graph. This is a middle ground between the other two
options. It doesn't require modifying the statefile after prepare.
However, conflicts *out* to non-prepared transactions do matter, and
this doesn't record those, so we'd have to do the conservative thing
-- which means that after recovery, no one can read anything a
prepared transaction wrote.


This would be fairly simple to do, but I'm not sure it's worth it, 
either. The nasty thing about this whole thing is precisely that 
no-one can read anything the prepared transaction wrote, so making the 
conflict-in bookkeeping more accurate doesn't seem very helpful.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-14 Thread Etsuro Fujita
(2012/02/14 15:15), Shigeru Hanada wrote:
 (2012/02/13 20:50), Etsuro Fujita wrote:
 The patches have been applied, but role-related regression tests failed
 in my environment.  I fixed it in a fashion similar to
 /src/test/regress/sql/foreign_data.sql.  Please find attached an updated
 patch for the regression tests.
 
 Good catch, thanks.  I'll revise the pgsql_fdw tests a little more.
 
 BTW, What do you think about this?

 http://archives.postgresql.org/pgsql-hackers/2012-01/msg00229.php
 
 I'm sorry that I've left the thread unfinished...  I've given up on
 proposing join push-down of foreign tables for 9.2, because it will take a
 while to achieve general semantics mapping for join push-down and WHERE
 clause push-down.  For 9.2, I'm proposing pgsql_fdw, which has WHERE
 clause push-down for built-in elements which are free from collation.
 I'd like to go back to that item after 9.2 development enters beta or
 RC, hopefully :)

OK.  But my question was about the PlanForeignScan API.  As discussed in
that thread, we would have to change the PlanForeignScan API to let the
FDW generate multiple paths and dump them all to add_path instead of
returning a FdwPlan struct.  With this change, I think we would also
have to add a new FDW API that is called from create_foreignscan_plan()
and lets the FDW generate a ForeignScan plan for the base relation scanned
by the best path chosen by the postgres optimizer.  What do you
think about it?

Best regards,
Etsuro Fujita


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory usage during sorting

2012-02-14 Thread Hitoshi Harada
On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
 On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 The attached patch allows it to reuse that memory.  On my meager
 system it reduced the building of an index on an integer column in a
 skinny 200 million row totally randomly ordered table by about 3% from
 a baseline of 25 minutes.


 Just to give a standard review, this patch is one line change and
 applies cleanly, builds ok.

 I'm not quite sure what exactly you're trying to accomplish, but it
 seems to me that it's avoiding the first dumptuples cycle by forcing
 availMem = 0 even if it's negative.

 Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
 tape-sort, it starts by calling WRITETUP a large number of times
 consecutively, to work off the memory deficit incurred by the 3 blocks
 per tape of tape overhead, and then after that calls WRITETUP about
 once per puttuple.  Under my patch, it would only call WRITETUP
 about once per puttuple, right from the beginning.
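
For reference, a sketch of the change being discussed (the real patch is a
one-line change in src/backend/utils/sort/tuplesort.c; the exact placement
shown here is assumed, not taken from the patch):

    /*
     * When switching to run building, the 3-blocks-per-tape overhead can
     * push availMem negative; clamp it to zero instead of dumping tuples
     * until the deficit is repaid.
     */
    if (state->availMem < 0)
        state->availMem = 0;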

 I read your comments as saying it would avoid alternating
 reading/writing back and forth with scattered memory, defeating the
 memory cache, during the merge phase -- but actually it doesn't affect
 the merge phase, only the init-dump phase, does it?

 It affects the building of the runs.  But this building of the runs is
 not a simple dump, it is itself a mini merge phase, in which it merges
 the existing in-memory priority queue against the still-incoming
 tuples from the node which invoked the sort.  By using less memory
 than it could, this means that the resulting runs are smaller than
 they could be, and so will sometimes necessitate an additional layer
 of merging later on.   (This effect is particularly large for the very
 first run being built.  Generally by merging incoming tuples into the
 memory-tuples, you can create runs that are 1.7 times the size of what
 fits in memory.  By wasting some memory, we are getting 1.7 times the
 size of a smaller starting point.  But for the first run, it is worse than that.
  Most of the benefit that leads to that 1.7 multiplier comes at the
 very early stage of each run-build.  But by initially using the full
 memory, then writing out a bunch of tuples without doing any merge of
 the incoming, we have truncated the part that gives the most benefit.)

 My analysis that the freed memory is never reused (because we refuse
 to reuse it ourselves and it is too fragmented to be reused by anyone
 else, like the palloc or VM system) only applies to the run-building
 phase.  So "never" was a bit of an overstatement.  By the time the last
 initial run is completely written out to tape, the heap used for the
 priority queue should be totally empty.  So at this point the
 allocator would have the chance to congeal all of the fragmented
 memory back into larger chunks, or maybe it parcels the allocations
 back out again in an order so that the unused space is contiguous and
 could be meaningfully paged out.

 But is it worth worrying about how much we fragment memory and whether
 we overshoot our promises by 10 or 20%?

 If so,
 I'm not so convinced your benchmark gave a 3% gain by this change.
 Correct me, as I'm probably wrong.

 I've now done more complete testing.  Building an index on an
 200,000,000 row table with an integer column populated in random order
 with integers from 1..500,000,000, non-unique, on a machine with 2GB
 of RAM and 600MB of shared_buffers.

 It improves things by 6-7 percent at the low end of work-mem sizes; the
 rest are in the noise, except at 77936 KB, where it reproducibly makes
 things 4% worse, for reasons I haven't figured out.  So maybe the best
 thing to do is, rather than micromanaging memory usage, simply don't
 set maintenance_work_mem way too low.  (But that is the default.)

I've tested here with only a million rows, a mix of integer/text (table
size is 80MB or so), with default settings, running CREATE INDEX
continuously, but couldn't find a performance improvement.  The number
varies from -2% to +2%, which I think is just noise.

While I agree with your point that avoiding the first dump would make
sense, I guess it depends on the situation; if the dump goes out with
lots of tuples (which should happen when availMem is big), writing many
tuples at a time will be faster than writing little by little later.

I'm not sure about the conclusion, but given this discussion, I'm
inclined to mark this Returned with Feedback.


Thanks,
-- 
Hitoshi Harada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)

2012-02-14 Thread Kohei KaiGai
2012/2/14 Robert Haas robertmh...@gmail.com:
 On Mon, Feb 13, 2012 at 7:51 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 I rebased the patch due to the updates of pg_proc.h.

 Please see the newer one. Thanks,

 Thanks, committed.  I think, though, that some further adjustment is
 needed here, because you currently can't do ALTER FUNCTION ... NO
 LEAKPROOF, which seems unacceptable.  It's fairly clear why not,
 though: you get a grammar conflict, because the parser allows this:

 create or replace function z() returns int as $$select 1$$ language
 sql set transaction not deferrable;

 However, since that syntax doesn't actually work, I'm thinking we
 could just refactor things a bit to reject that at the parser stage.
 The attached patch adopts that approach.  Anyone have a better idea?

I could not find the origin of the grammar conflict, although
it does not conflict with any options within ALTER FUNCTION.

Do you think the idea of ALTER ... NOT LEAKPROOF should also be
integrated within the v9.2 timeline?

 I also think we ought to stick create_function_3 into one of the
 parallel groups in the regression tests, if possible.  Can you
 investigate that?

Not yet. This test has no dependencies on other tests,
so I'm optimistic that create_function_3 can run concurrently.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-14 Thread Shigeru Hanada
(2012/02/14 17:40), Etsuro Fujita wrote:
 OK.  But my question was about the PlanForeignScan API.

Sorry for misunderstanding. :(

  As discussed in
 that thread, we would have to change the PlanForeignScan API to let the
 FDW generate multiple paths and dump them all to add_path instead of
 returning a FdwPlan struct.  With this change, I think we would also
 have to add a new FDW API that is called from create_foreignscan_plan()
 and lets the FDW generate a ForeignScan plan for the base relation scanned
 by the best path chosen by the postgres optimizer.  What do you
 think about it?

Though I have only random thoughts about this issue at the moment...

Letting an FDW provide multiple valuable Paths for a scan of a foreign
table would be useful, but changing PlanForeignScan to return a list of
FdwPlans in 9.2 seems too hasty.  It would need more consideration of a
general interface for the possible kinds of output, such as:

* Full output (no WHERE push-down) is expensive on both remote and transfer.
* Filtered output (WHERE push-down) has cheap total cost when only a few
rows come through the filter.
* Ordered output (ORDER BY push-down) is expensive on remote, but has a
chance to omit an upper Sort node.
* Aggregated output (GROUP BY push-down) is expensive on remote, but has a
chance to omit an upper Agg node, and reduces data transfer.
* Limited output (LIMIT/OFFSET push-down) can reduce data transfer, and
has a chance to omit an upper Limit node.

Currently FDWs can consider only the first two, AFAIK.  If an FDW
generated multiple FdwPlans (full and filtered) and set different
start-up and total costs on them (maybe the former has higher start-up
and lower total cost than the latter), the planner could choose the
better one for the whole plan.

In addition to changing FdwRoutine, it seems worth changing FdwPlan too,
so that FDWs can return more information to the planner, such as pathkeys
and rows, for each possible path.

In short, I have some ideas to enhance foreign table scans, but IMO they
are half-baked and we don't have enough time to achieve them for 9.2.

Regards,
-- 
Shigeru Hanada

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-14 Thread Etsuro Fujita
(2012/02/14 19:42), Shigeru Hanada wrote:
 (2012/02/14 17:40), Etsuro Fujita wrote:
   As discussed in
  that thread, we would have to change the PlanForeignScan API to let the
  FDW generate multiple paths and dump them all to add_path instead of
  returning a FdwPlan struct.  With this change, I think we would also
  have to add a new FDW API that is called from create_foreignscan_plan()
  and lets the FDW generate a ForeignScan plan for the base relation scanned
  by the best path chosen by the postgres optimizer.  What do you
  think about it?

 In short, I have some ideas to enhance foreign table scans, but IMO they
 are half-baked and we don't have enough time to achieve them for 9.2.

OK.  Thank you for your answer.

Best regards,
Etsuro Fujita

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stats_recovery view

2012-02-14 Thread Alvaro Herrera

Excerpts from Jaime Casanova's message of mar feb 14 04:10:58 -0300 2012:
 On Thu, Feb 2, 2012 at 2:32 AM, Magnus Hagander mag...@hagander.net wrote:
 
  I haven't looked through the code in detail, but one direct comment:
  do we really need/want to send this through the stats collector? It
  will only ever have one sender - perhaps we should just either store
  it in shared memory or store it locally and only send it on demand?
 
 
 fyi, i intend to send a reworked patch later today; it will store the
 info locally and send it on demand.
 about the _short_desc functions, i added them because i wanted to
 understand what was happening during recovery, and the wal_record_type
 (xl_info) being a number is not that clear

Maybe it'd be clearer if you display it in hex and filter out just the
bits that are interesting for this use?  IIRC xl_info carries some other
bits than the ones to identify the record type, which could be
confusing.
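
Something along these lines, perhaps (XLR_INFO_MASK is the real mask from
access/xlog.h; the surrounding code is only illustrative):

/* Show only the rmgr-specific record type bits; the low-order bits of
 * xl_info (XLR_INFO_MASK) are xlog-internal flags and would just be
 * confusing in the display. */
uint8   info = record->xl_info & ~XLR_INFO_MASK;

elog(DEBUG1, "wal record type: 0x%02X", info);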

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)

2012-02-14 Thread Robert Haas
On Tue, Feb 14, 2012 at 4:55 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 I could not find the origin of the grammar conflict, although
 it does not conflict with any options within ALTER FUNCTION.

 Do you think the idea of ALTER ... NOT LEAKPROOF should also be
 integrated within the v9.2 timeline?

Yes.  Did you notice that I attached a patch to make that work?  I'll
commit that today or tomorrow unless someone comes up with a better
solution.

 I also think we ought to stick create_function_3 into one of the
 parallel groups in the regression tests, if possible.  Can you
 investigate that?

 Not yet. This test has no dependencies on other tests,
 so I'm optimistic that create_function_3 can run concurrently.

Me, too.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [v9.2] LEAKPROOF attribute of FUNCTION (Re: [HACKERS] [v9.2] Fix Leaky View Problem)

2012-02-14 Thread Kohei KaiGai
2012/2/14 Robert Haas robertmh...@gmail.com:
 On Tue, Feb 14, 2012 at 4:55 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 I could not find the origin of the grammar conflict, although
 it does not conflict with any options within ALTER FUNCTION.

 Do you think the idea of ALTER ... NOT LEAKPROOF should also be
 integrated within the v9.2 timeline?

 Yes.  Did you notice that I attached a patch to make that work?  I'll
 commit that today or tomorrow unless someone comes up with a better
 solution.

Yes. I'll be available to work on the feature based on this patch.
It had been a headache of mine to implement an ALTER statement to
add/remove the leakproof attribute.

 I also think we ought to stick create_function_3 into one of the
 parallel groups in the regression tests, if possible.  Can you
 investigate that?

 Not yet. This test has no dependencies on other tests,
 so I'm optimistic that create_function_3 can run concurrently.

 Me, too.

I tried to move create_function_3 into the group of create_view and
create_index, and it works correctly.

Thanks,
-- 
KaiGai Kohei kai...@kaigai.gr.jp

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgsql_fdw, FDW for PostgreSQL server

2012-02-14 Thread Tom Lane
Shigeru Hanada shigeru.han...@gmail.com writes:
 (2012/02/14 17:40), Etsuro Fujita wrote:
  As discussed in
  that thread, we would have to change the PlanForeignScan API to let the
  FDW generate multiple paths and dump them all to add_path instead of
  returning a FdwPlan struct.

 Letting an FDW provide multiple valuable Paths for a scan of a foreign
 table would be useful, but changing PlanForeignScan to return a list of
 FdwPlans in 9.2 seems too hasty.

I would really like to see that happen in 9.2, because the longer we let
that mistake live, the harder it will be to change.  More and more FDWs
are getting written.  I don't think it's that hard to do: we just have
to agree that PlanForeignScan should return void and call add_path for
itself, possibly more than once.  If we do that, I'm inclined to think
we could get rid of the separate Node type FdwPlan, and just incorporate
List *fdw_private into ForeignPath and ForeignScan.

This does mean that FDWs will be a bit more tightly coupled to the
planner, because they'll have to change whenever we add new fields to
struct Path; but that is not really something that happens often.
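
For illustration, an FDW's planning hook might then end up looking roughly
like this (make_foreign_path() and the cost numbers are invented for the
sketch; add_path() is the real planner entry point):

static void
pgsqlPlanForeignScan(Oid foreigntableid, PlannerInfo *root,
                     RelOptInfo *baserel)
{
    ForeignPath *path;

    /* Path 1: fetch all rows from the remote, filter locally. */
    path = make_foreign_path(root, baserel,
                             10.0 /* startup */ , 1000.0 /* total */ );
    add_path(baserel, (Path *) path);

    /* Path 2: push the WHERE clause to the remote side instead. */
    path = make_foreign_path(root, baserel, 25.0, 200.0);
    add_path(baserel, (Path *) path);
}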

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI rw-conflicts and 2PC

2012-02-14 Thread Kevin Grittner
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
 On 14.02.2012 04:57, Dan Ports wrote:
 Looking over the SSI 2PC code recently, I noticed that I
 overlooked a case that could lead to non-serializable behavior
 after a crash.

 When we PREPARE a serializable transaction, we store part of the
 SERIALIZABLEXACT in the statefile (in addition to the list of
 SIREAD locks). One of the pieces of information we record is
 whether the transaction had any conflicts in or out. The problem
 is that that can change if a new conflict occurs after the
 transaction has prepared.
 
 I discussed this a bit with Kevin and we agreed that this is
 important to fix, since it's a false negative that violates
 serializability. The question is how to fix it. There are a
 couple of options...

 The easiest answer would be to just treat every prepared
 transaction found during recovery as though it had a conflict in
 and out. This is roughly a one-line change, and it's certainly
 safe.
 
Dan, could you post such a patch, please?
 
 But the downside is that this is pretty restrictive: after
 recovery, we'd have to abort any serializable transaction that
 tries to read anything that a prepared transaction wrote, or
 modify anything that it read, until that transaction is either
 committed or rolled back.
 
 +1 for this solution.
 
+1 for 9.2 and backpatching this; with the notion that we might be
able to do better in some later release.  (A TODO entry?)
 
Should we add anything to the docs to warn people that if they crash
with serializable prepared transactions pending, they will see this
behavior until the prepared transactions are either committed or
rolled back, either by the transaction manager or through manual
intervention?
 
 Perhaps it would be simpler to add the extra information to the
 commit records of the transactions that commit after the first
 transaction is prepared. In the commit record, you would include a
 list of prepared transactions that this transaction conflicted
 with. During recovery, you would collect those lists in memory,
 and use them at the end of recovery to flag the conflicts in
 prepared transactions that are still in prepared state.
 
That indeed seems simpler.  I'm not even sure that you would need to
build a list and process it at the end; couldn't this be done as the
commit records are replayed?  Keep in mind that if the prepared
transaction is not still pending, the information can be safely
ignored, and if it *is* still pending you don't need to know *which*
transaction it had the conflict with, because it will certainly have
committed before the start of any post-recovery transaction.
 
 A third option is to observe that the only conflicts *in* that
 matter from a recovered prepared transaction are from other
 prepared transactions. So we could have prepared transactions
 include in their statefile the xids of any prepared transactions
 they conflicted with at prepare time, and match them up during
 recovery to reconstruct the graph. This is a middle ground
 between the other two options. It doesn't require modifying the
 statefile after prepare. However, conflicts *out* to non-prepared
 transactions do matter, and this doesn't record those, so we'd
 have to do the conservative thing -- which means that after
 recovery, no one can read anything a prepared transaction wrote.
 
 This would be fairly simple to do, but I'm not sure it's worth
  it, either. The nasty thing about this whole thing is precisely
 that no-one can read anything the prepared transaction wrote, so
 making the conflict-in bookkeeping more accurate doesn't seem very
 helpful.
 
Yeah, the benefit of this would be marginal without solving the
other side of the problem; but if we're adding TODO entries for this
area, perhaps they should be two separate entries, because either
side of this could be done without touching the other.
 
To summarize the above discussion, there is a bug that can be hit
when using both SSI and 2PC if a crash or shutdown occurs while any
serializable prepared transactions are pending and certain other
conditions are met.  The proposed quick fix would be to cause a
serialization failure after recovery on any attempt by a
serializable transaction to read data written by a serializable
prepared transaction that was pending when a crash or shutdown
occurred, and on any attempt by a serializable transaction to do a
write which conflicts with a predicate lock acquired by such a
prepared transaction.  This would tend to be more than a little
inconvenient until the prepared statements pending at crash or
shutdown were all committed or rolled back.  A more sophisticated
solution is available that could be implemented in 9.3 or later.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI rw-conflicts and 2PC

2012-02-14 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 This would tend to be more than a little inconvenient until the
 prepared statements pending at crash or shutdown were all
 committed or rolled back.
 
[sigh]
 
Probably obvious, but to avoid confusion:
 
s/prepared statements/prepared transactions/
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_test_fsync performance

2012-02-14 Thread Bruce Momjian
On Mon, Feb 13, 2012 at 09:54:06PM -0500, Bruce Momjian wrote:
 On Mon, Feb 13, 2012 at 08:28:03PM -0500, Tom Lane wrote:
  Robert Haas robertmh...@gmail.com writes:
   Instead of or in addition to a fixed number of operations per test, maybe
   we should cut off each test after a certain amount of wall-clock time,
   like 15 seconds.
  
  +1, I was about to suggest the same thing.  Running any of these tests
  for a fixed number of iterations will result in drastic degradation of
  accuracy as soon as the machine's behavior changes noticeably from what
  you were expecting.  Run them for a fixed time period instead.  Or maybe
  do a few, then check elapsed time and estimate a number of iterations to
  use, if you're worried about the cost of doing gettimeofday after each
  write.
 
 Good idea, and it worked out very well.  I changed the -o loops
 parameter to -s seconds which calls alarm() after (default) 2 seconds,
 and then once the operation completes, computes a duration per
 operation.

Updated patch applied, with an additional fix for the usage message, and
use of macros for start/stop testing.

I like this method much better because not only does it speed up the
test, but it also allows the write test, which completes very quickly,
to run longer and report more accurate numbers.
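
The core of the scheme, as a self-contained sketch (the real pg_test_fsync
does a write()/fsync() where the counter increments here):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

static volatile sig_atomic_t alarm_triggered = 0;

static void
trigger_alarm(int sig)
{
    alarm_triggered = 1;
}

int
main(void)
{
    long        ops = 0;
    double      elapsed;
    struct timeval start, stop;

    signal(SIGALRM, trigger_alarm);
    alarm(2);                    /* stop after ~2 seconds */
    gettimeofday(&start, NULL);
    while (!alarm_triggered)
        ops++;                   /* real test: one write+fsync per loop */
    gettimeofday(&stop, NULL);

    elapsed = (stop.tv_sec - start.tv_sec) +
        (stop.tv_usec - start.tv_usec) / 1000000.0;
    printf("%ld ops in %.3f s, %.3f usecs/op\n",
           ops, elapsed, elapsed * 1000000.0 / ops);
    return 0;
}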

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Initial 9.2 pgbench write results

2012-02-14 Thread Greg Smith
Last year at this time, I was investigating things like ext3 vs xfs, how 
well Linux's dirty_bytes parameter worked, and how effective a couple of 
patches were on throughput  latency.  The only patch that ended up 
applied for 9.1 was for fsync compaction.  That was measurably better in 
terms of eliminating backend syncs altogether, and it also pulled up 
average TPS a bit on the database scales I picked out to test it on.  
That rambling group of test sets is available at 
http://highperfpostgres.com/pgbench-results/index.htm


For the first round of 9.2 testing under a write-heavy load, I started 
with 9.0 via the yum.postgresql.org packages for SL6, upgraded to 9.1 
from there, and then used a source code build of 9.2 HEAD as of Feb 11 
(58a9596ed4a509467e1781b433ff9c65a4e5b5ce).  Attached is an Excel 
spreadsheet showing the major figures, along with a CSV formatted copy 
of that data too.  Results that are ready so far are available at 
http://highperfpostgres.com/results-write-9.2-cf4/index.htm


Most of that is good; here are the best and worst parts of the news in 
compact form:


scale=500, db is 46% of RAM
Version Avg TPS
9.0  1961
9.1  2255
9.2  2525

scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0  535
9.1  491 (-8.4% relative to 9.0)
9.2  338 (-31.2% relative to 9.1)

There's usually a tipping point with pgbench results, where the 
characteristics change quite a bit as the database exceeds total RAM 
size.  You can see the background writer statistics change quite a bit 
around there too.  Last year the sharpest part of that transition 
happened when exceeding total RAM; now it's happening just below that.


This test set takes about 26 hours to run in the stripped down form I'm 
comparing, which doesn't even bother trying larger than RAM scales like 
2000 or 3000 that might also be helpful.  Most of the runtime is 
spent on the larger scale database tests, which unfortunately are the 
interesting ones this year.  I'm torn at this point between chasing down 
where this regression came from, moving forward with testing the new 
patches proposed for this CF, and seeing if this regression also holds 
with SSD storage.  Obvious big commit candidates to bisect this over are 
the bgwriter/checkpointer split (Nov 1) and the group commit changes 
(Jan 30).  Now I get to pay for not having set this up to run 
automatically each week since earlier in the 9.2 development cycle.


If someone else wants to try and replicate the bad part of this, best 
guess for how is using the same minimal postgresql.conf changes I have 
here, and picking your database scale so that the test database just 
barely fits into RAM.  pgbench gives roughly 16MB of data per unit of 
scale, and scale=1000 is 15GB; percentages above are relative to the 
16GB of RAM in my server.  Client count should be small, number of 
physical cores is probably a good starter point (that's 4 in my system, 
I didn't test below that).  At higher client counts, the general 
scalability improvements in 9.2 negate some of this downside.


= Server config =

The main change to the 8 hyperthreaded core test server (Intel i7-870) 
for this year is bumping it from 8GB to 16GB of RAM, which effectively 
doubles the scale I can reach before things slow dramatically.  It's 
also been updated to run Scientific Linux 6.0, giving a slightly later 
kernel.  That kernel does have different defaults for 
dirty_background_ratio and dirty_ratio, they're 10% and 20% now 
(compared to 5%/10% in last year's tests).


Drive set for tests I'm publishing so far is basically the same:  4-port 
Areca card with 256MB battery-backed cache, 3 disk RAID0 for the 
database, single disk for the WAL, all cheap 7200 RPM drives.  The OS is 
a separate drive, not connected to the caching controller.  That's also 
where the pgbench latency data is written.  The idea is that this will be 
similar to having around 10 drives in a production server, where you'll 
also be using RAID1 for redundancy.  I have some numbers brewing for 
this system running with an Intel 320 series SSD, too, but they're not 
ready yet.


= Test setup =

pgbench-tools has been upgraded to break down its graphs per test set 
now, and there's even a configuration option to use client-side 
Javascript to put that into a tab-like interface.  Thanks to 
Ben Bleything for that one.


Minimal changes were made to the postgresql.conf.  shared_buffers=2GB, 
checkpoint_segments=64, and I left wal_buffers at its default so that 
9.1 got credit for that going up.  See 
http://highperfpostgres.com/results-write-9.2-cf4/541/pg_settings.txt 
for a full list of changes, drive mount options, and important kernel 
settings.  Much of that data wasn't collected in last year's 
pgbench-tools runs.


= Results commentary =

For the most part the 9.2 results are quite good.  The increase at high 
client counts is solid, as expected from all the lock refactoring this 
release has gotten.  The smaller than RAM 

Re: [HACKERS] Initial 9.2 pgbench write results

2012-02-14 Thread Greg Smith

On 02/14/2012 01:45 PM, Greg Smith wrote:

scale=1000, db is 94% of RAM; clients=4
Version TPS
9.0  535
9.1  491 (-8.4% relative to 9.0)
9.2  338 (-31.2% relative to 9.1)


A second pass through this data noted that the maximum number of buffers 
cleaned by the background writer is <=2785 in 9.0/9.1, while it goes as 
high as 17345 in 9.2.  The background writer is so busy now that it 
hits the max_clean limit around 147 times in the slower[1] of the 9.2 
runs.  That's an average of once every 4 seconds, quite frequent.  
Whereas max_clean rarely happens in the comparable 9.0/9.1 results.  
This is starting to point my finger more toward this being an unintended 
consequence of the background writer/checkpointer split.


Thinking out loud, about solutions before the problem is even nailed 
down, I wonder if we should consider lowering bgwriter_lru_maxpages now 
in the default config?  In older versions, the page cleaning work had at 
most a 50% duty cycle; it was only running when checkpoints were not.  
If we wanted to keep the ceiling on background writer cleaning at the 
same level in the default configuration, that would require dropping 
bgwriter_lru_maxpages from 100 to 50.  That would be roughly the same 
maximum amount of churn.  It's obviously more complicated than that, but 
I think there's a defensible position along those lines to consider.


As a historical aside, I wonder how much this behavior might have been 
to blame for my failing to get spread checkpoints to show a positive 
outcome during 9.1 development.  The way that was written also kept the 
cleaner running during checkpoints.  I didn't measure those two changes 
individually as much as I did the combination.


[1] I normally do 3 runs of every scale/client combination, and find 
that more useful than a single run lasting 3X as long.  The first out of 
each of the 3 runs I do at any scale is usually a bit faster than the 
later two, presumably due to table and/or disk fragmentation.  I've 
tried to make this less of a factor in pgbench-tools by iterating 
through all requested client counts first, before beginning a second run 
of those scale/client combination.  So if the two client counts were 4 
and 8, it would be 4/8/4/8/4/8, which works much better than 4/4/4/8/8/8 
in terms of fragmentation impacting the average result.  Whether it 
would be better or worse to eliminate this difference by rebuilding the 
whole database multiple times for each scale is complicated.  I happen 
to like seeing the results with a bit more fragmentation mixed in, see 
how they compare with the fresh database.  Since more rebuilds would 
also make these tests take much longer than they already do, that's the 
tie-breaker that's led to the current testing schedule being the 
preferred one.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] random_page_cost vs seq_page_cost

2012-02-14 Thread Bruce Momjian
On Tue, Feb 07, 2012 at 07:58:28PM -0500, Bruce Momjian wrote:
 I was initially concerned that tuning advice in this part of the docs
 would look out of place, but now see the 25% shared_buffers
 recommendation, and it looks fine, so we are OK.  (Should we caution
 against more than 8GB of shared buffers?  I don't see that in the docs.)
 
 I agree we are overdue for a better explanation of random page cost, so
 I agree with your direction.  I did a little word-smithing to tighten up
 your text;  feel free to discard what you don't like:
 
   Random access to mechanical disk storage is normally much more expensive
   than four times sequential access.  However, a lower default is used
   (4.0) because the majority of random accesses to disk, such as indexed
   reads, are assumed to be in cache.  The default value can be thought of
   as modeling random access as 40 times slower than sequential, while
   expecting 90% of random reads to be cached.
   
   If you believe a 90% cache rate is an incorrect assumption
   for your workload, you can increase random_page_cost to better
   reflect the true cost of random storage reads. Correspondingly,
   if your data is likely to be completely in cache, such as when
   the database is smaller than the total server memory, decreasing
   random_page_cost can be appropriate.  Storage that has a low random
   read cost relative to sequential, e.g. solid-state drives, might
   also be better modeled with a lower value for random_page_cost.
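
(Spelling out the model in that first paragraph: if 90% of random reads hit
cache at effectively zero cost and the remaining 10% pay the full 40x
penalty, the expected cost is 0.90 * 0 + 0.10 * 40 = 4.0 times a
sequential read, which is exactly the 4.0 default.)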

Patch applied for random_page_cost docs.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-14 Thread Alexander Korotkov
ISTM I found the problem. This piece of code is triggering an error. It
assumes each page of a corresponding level has an initialized buffer. That
should be true because we're inserting index tuples from the top down while
splits propagate from the bottom up.

if (!found)
{
/*
 * Node buffer should exist at this point. If it didn't exist before,
 * the insertion that caused the page to split should've created it.
 */
elog(ERROR, "node buffer of page being split (%u) does not exist",
 blocknum);
}

But this assumption becomes false when we turn buffering off for the root
page. So, the root page can produce pages without initialized buffers when it splits.

/*
 * Does specified level have buffers? (Beware of multiple evaluation of
 * arguments.)
 */
#define LEVEL_HAS_BUFFERS(nlevel, gfbb) \
((nlevel) != 0 && (nlevel) % (gfbb)->levelStep == 0 && \
 (nlevel) != (gfbb)->rootitem->level)

So, I think we should just do a silent return from the function instead of
triggering an error. Patch is attached.

--
With best regards,
Alexander Korotkov.
*** a/src/backend/access/gist/gistbuildbuffers.c
--- b/src/backend/access/gist/gistbuildbuffers.c
***
*** 607,617  gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb, GISTSTATE *giststate,
  	if (!found)
  	{
  		/*
! 		 * Node buffer should exist at this point. If it didn't exist before,
! 		 * the insertion that caused the page to split should've created it.
  		 */
! 		elog(ERROR, "node buffer of page being split (%u) does not exist",
! 			 blocknum);
  	}
  
  	/*
--- 607,616 
  	if (!found)
  	{
  		/*
! 		 * A page without a buffer can be produced by a split of the root
! 		 * page. So we just have nothing to do here when there is no buffer.
  		 */
! 		return;
  	}
  
  	/*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] When do we lose column names?

2012-02-14 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 On 02/13/2012 11:00 AM, Tom Lane wrote:
 This is surely fixable by passing a bit more information down.  If you
 (Andrew) have something that covers everything but this issue, pass it
 over and I'll take a whack at it.

 What I have fixes one of the three cases I have identified that appear 
 to need fixing, the one that caused the json tests to crash with your 
 original partial patch. See 
 https://bitbucket.org/adunstan/pgdevel/changesets/tip/rowexprs. I 
 won't have time to return to this for a few days. The two remaining 
 cases should be fairly independent I think. If you don't get to this 
 before then I'll look at the flatten_join_alias_vars_mutator case again 
 and try to fix it based on the above, and then give you what I've got.

OK, I fixed this up and committed it.  I made some cosmetic changes
(the most notable being that the definition of RowExpr is really
changing here, and so should its comment).  The adjust_appendrel_attrs
situation was fixed by passing in the PlannerInfo, which is something
that probably should have been made available all along --- there are
very few nontrivial functions in the planner that don't need it.

I'm still a bit annoyed by the behavior I mentioned here,
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01031.php
that we don't get real column names from an unflattened VALUES RTE.
Might be worth looking into that, but I don't have time for it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bugs/slowness inserting and indexing cubes

2012-02-14 Thread Tom Lane
Alexander Korotkov aekorot...@gmail.com writes:
 ISTM I found the problem. This piece of code is triggering an error. It
 assumes each page of a corresponding level has an initialized buffer. That
 should be true because we're inserting index tuples from the top down
 while splits propagate from the bottom up.
 But this assumption becomes false when we turn buffering off for the root
 page. So, the root page can produce pages without initialized buffers when it splits.

Hmm ... can we tighten the error check rather than just remove it?  It
feels less than safe to assume that a hash-entry-not-found condition
*must* reflect a corner-case situation like that.  At the very least
I'd like to see it verify that we'd turned off buffering before deciding
this is OK.  Better, would it be practical to make dummy entries in the
hash table even after turning buffers off, so that the logic here
becomes

if (!found) error;
else if (entry is dummy) return without doing anything;
else proceed;

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_test_fsync performance

2012-02-14 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 On Mon, Feb 13, 2012 at 08:28:03PM -0500, Tom Lane wrote:
 +1, I was about to suggest the same thing.  Running any of these tests
 for a fixed number of iterations will result in drastic degradation of
 accuracy as soon as the machine's behavior changes noticeably from what
 you were expecting.  Run them for a fixed time period instead.  Or maybe
 do a few, then check elapsed time and estimate a number of iterations to
 use, if you're worried about the cost of doing gettimeofday after each
 write.

 Good idea, and it worked out very well.  I changed the -o loops
 parameter to -s seconds which calls alarm() after (default) 2 seconds,
 and then once the operation completes, computes a duration per
 operation.

I was kind of wondering how portable alarm() is, and the answer
according to the buildfarm is that it isn't.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_test_fsync performance

2012-02-14 Thread Marko Kreen
On Tue, Feb 14, 2012 at 05:59:06PM -0500, Tom Lane wrote:
 Bruce Momjian br...@momjian.us writes:
  On Mon, Feb 13, 2012 at 08:28:03PM -0500, Tom Lane wrote:
  +1, I was about to suggest the same thing.  Running any of these tests
  for a fixed number of iterations will result in drastic degradation of
  accuracy as soon as the machine's behavior changes noticeably from what
  you were expecting.  Run them for a fixed time period instead.  Or maybe
  do a few, then check elapsed time and estimate a number of iterations to
  use, if you're worried about the cost of doing gettimeofday after each
  write.
 
  Good idea, and it worked out very well.  I changed the -o loops
  parameter to -s seconds which calls alarm() after (default) 2 seconds,
  and then once the operation completes, computes a duration per
  operation.
 
 I was kind of wondering how portable alarm() is, and the answer
 according to the buildfarm is that it isn't.

I'm using the following simplistic alarm() implementation for win32:

  https://github.com/markokr/libusual/blob/master/usual/signal.c#L21

This works with the fake sigaction()/SIGALARM hack below, to remember
the function to call.

Good enough for simple stats printing, and avoids win32-specific
code spreading around.

-- 
marko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI rw-conflicts and 2PC

2012-02-14 Thread Dan Ports
On Tue, Feb 14, 2012 at 10:04:15AM +0200, Heikki Linnakangas wrote:
 Perhaps it would be simpler to add the extra information to the commit 
 records of the transactions that commit after the first transaction is 
 prepared. In the commit record, you would include a list of prepared 
 transactions that this transaction conflicted with. During recovery, you 
 would collect those lists in memory, and use them at the end of recovery 
 to flag the conflicts in prepared transactions that are still in 
 prepared state.

Yeah, doing it that way might be a better strategy if we wanted to go
that route. I hadn't really considered it because I'm not that familiar
with the xlog code (plus, the commit record already contains a variable
length field, making it that much more difficult to add another).

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI rw-conflicts and 2PC

2012-02-14 Thread Dan Ports
On Tue, Feb 14, 2012 at 09:27:58AM -0600, Kevin Grittner wrote:
 Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
  On 14.02.2012 04:57, Dan Ports wrote:
  The easiest answer would be to just treat every prepared
  transaction found during recovery as though it had a conflict in
  and out. This is roughly a one-line change, and it's certainly
  safe.
  
 Dan, could you post such a patch, please?

Sure. See attached.

 Should we add anything to the docs to warn people that if they crash
 with serializable prepared transactions pending, they will see this
 behavior until the prepared transactions are either committed or
 rolled back, either by the transaction manager or through manual
 intervention?

Hmm, it occurs to me that if we have to abort a transaction due to
serialization failure involving a prepared transaction, we might want
to include the prepared transaction's gid in the errdetail.

This seems like it'd be especially useful for prepared transactions. We
can generally pick the transaction to abort to ensure the safe retry
property -- if that transaction is immediately retried, it won't
fail with the same conflict -- but we can't always guarantee that when
prepared transactions are involved. And prepared transactions already
have a convenient, user-visible ID we can report.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index b75b73a..b102e19 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -4733,14 +4733,11 @@ AtPrepare_PredicateLocks(void)
 	xactRecord->flags = MySerializableXact->flags;
 
 	/*
-	 * Tweak the flags. Since we're not going to output the inConflicts and
-	 * outConflicts lists, if they're non-empty we'll represent that by
-	 * setting the appropriate summary conflict flags.
+	 * Note that we don't include the lists of conflicts in or out in
+	 * the statefile, because new conflicts can be added even after the
+	 * transaction prepares. We'll just make a conservative assumption
+	 * during recovery instead.
 	 */
-	if (!SHMQueueEmpty(&MySerializableXact->inConflicts))
-		xactRecord->flags |= SXACT_FLAG_SUMMARY_CONFLICT_IN;
-	if (!SHMQueueEmpty(&MySerializableXact->outConflicts))
-		xactRecord->flags |= SXACT_FLAG_SUMMARY_CONFLICT_OUT;
 
 	RegisterTwoPhaseRecord(TWOPHASE_RM_PREDICATELOCK_ID, 0,
 		   &record, sizeof(record));
@@ -4875,15 +4872,6 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
 
 		sxact->SeqNo.lastCommitBeforeSnapshot = RecoverySerCommitSeqNo;
 
-
-		/*
-		 * We don't need the details of a prepared transaction's conflicts,
-		 * just whether it had conflicts in or out (which we get from the
-		 * flags)
-		 */
-		SHMQueueInit(&(sxact->outConflicts));
-		SHMQueueInit(&(sxact->inConflicts));
-
 		/*
 		 * Don't need to track this; no transactions running at the time the
 		 * recovered xact started are still active, except possibly other
@@ -4905,6 +4893,17 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
    (MaxBackends + max_prepared_xacts));
 		}
 
+		/*
+		 * We don't know whether the transaction had any conflicts or
+		 * not, so we'll conservatively assume that it had both a
+		 * conflict in and a conflict out, and represent that with the
+		 * summary conflict flags.
+		 */
+		SHMQueueInit(&(sxact->outConflicts));
+		SHMQueueInit(&(sxact->inConflicts));
+		sxact->flags |= SXACT_FLAG_SUMMARY_CONFLICT_IN;
+		sxact->flags |= SXACT_FLAG_SUMMARY_CONFLICT_OUT;
+		
 		/* Register the transaction's xid */
 		sxidtag.xid = xid;
 		sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CUDA Sorting

2012-02-14 Thread Gaetano Mendola

On 13/02/2012 08:26, Greg Smith wrote:

On 02/11/2012 08:14 PM, Gaetano Mendola wrote:

The trend is to have servers capable of running CUDA, providing GPUs via
external hardware (a PCI Express interface with PCI Express switches);
look for example at the PowerEdge C410x PCIe Expansion Chassis from DELL.


The C410X adds 16 PCIe slots to a server, housed inside a separate 3U
enclosure. That's a completely sensible purchase if your goal is to
build a computing cluster, where a lot of work is handed off to a set of
GPUs. I think that's even less likely to be a cost-effective option for
a database server. Adding a single dedicated GPU installed in a server
to accelerate sorting is something that might be justifiable, based on
your benchmarks. This is a much more expensive option than that though.
Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for
anyone who wants to see just how big this external box is.


I did some experiments timing the sort done with CUDA and the sort
done with pg_qsort:

                      CUDA      pg_qsort
33 million integers:  ~900 ms   ~6000 ms
1 million integers:   ~21 ms    ~162 ms
100k integers:        ~2 ms     ~13 ms

The CUDA time already includes the copy operations (host->device,
device->host).
As GPU I was using a C2050, and the CPU doing the pg_qsort was an
Intel(R) Xeon(R) CPU X5650 @ 2.67GHz


That's really interesting, and the X5650 is by no means a slow CPU. So
this benchmark is providing a lot of CPU power yet still seeing over a
6X speedup in sort times. It sounds like the PCI Express bus has gotten
fast enough that the time to hand data over and get it back again can
easily be justified for medium to large sized sorts.

It would be helpful to take this patch and confirm whether it scales
when used in parallel. The easiest way to do that would be to use the
pgbench -f feature, which allows running an arbitrary number of some
query at once. Seeing whether this acceleration continued to hold as the
number of clients increases is a useful data point.

Is it possible for you to break down where the time is being spent? For
example, how much of this time is consumed in the GPU itself, compared
to time spent transferring data between CPU and GPU? I'm also curious
where the bottleneck is with this approach. If it's the speed of the
PCI-E bus for smaller data sets, adding more GPUs may never be
practical. If the bus can handle quite a few of these at once before it
saturates, it might be possible to overload a single GPU. That seems
like it would be really hard to reach for database sorting, though; I
can't really justify my gut feeling that that's true.


There you go (times are in ms):

Size      H->D      SORT      D->H      TOTAL
64        0.209824  0.479392  0.013856  0.703072
128       0.098144  0.41744   0.01312   0.528704
256       0.096832  0.420352  0.013696  0.53088
512       0.097568  0.3952    0.014464  0.507232
1024      0.09872   0.396608  0.014624  0.509952
2048      0.101344  0.56224   0.016896  0.68048
4096      0.106176  0.562976  0.02016   0.689312
8192      0.116512  0.571264  0.02672   0.714496
16384     0.136096  0.587584  0.040192  0.763872
32768     0.179296  0.658112  0.066304  0.903712
65536     0.212352  0.84816   0.118016  1.178528
131072    0.317056  1.1465    0.22784   1.691396
262144    0.529376  1.82237   0.42512   2.776866
524288    0.724032  2.39834   0.64576   3.768132
1048576   1.11162   3.51978   1.12176   5.75316
2097152   1.95939   5.93434   2.06992   9.96365
4194304   3.76192   10.6011   4.10614   18.46916
8388608   7.16845   19.9245   7.93741   35.03036
16777216  13.8693   38.7413   15.4073   68.0179
33554432  27.3017   75.6418   30.6646   133.6081
67108864  54.2171   151.192   60.327    265.7361

pg_sort

64        0.01
128       0.01
256       0.021000
512       0.128000
1024      0.092000
2048      0.196000
4096      0.415000
8192      0.883000
16384     1.881000
32768     3.96
65536     8.432000
131072    17.951000
262144    37.14
524288    78.32
1048576   163.276000
2097152   339.118000
4194304   693.223000
8388608   1423.142000
16777216  2891.218000
33554432  5910.851000
67108864  11980.93

As you can see, the times with CUDA are lower than the timings I
reported in my previous post, because back then the server was doing
something else in the meanwhile; I have repeated those benchmarks with
the server completely unused.

And this is the speedup, computed as pg_sort/cuda:

64        0.0142232943
128       0.018914175
256       0.039556962
512       0.2070058671
1024      0.1804091365
2048      0.2880319774
4096      0.6078524674
8192      1.2372357578
16384     2.4637635625
32768     4.4106972133
65536     7.1742037525
131072    10.5090706139
262144    13.3719091955
524288    20.5834084369
1048576   28.2516043357
2097152   33.9618513296
4194304   37.5247168794
8388608   40.5135716561
16777216  42.4743633661
33554432  44.2394809896
67108864  45.1499777411



 I've never seen a PostgreSQL server capable of running CUDA, and I
 don't expect that to change.

That sounds like:

"I think there is a world market for maybe five computers."

Re: [HACKERS] pg_test_fsync performance

2012-02-14 Thread Bruce Momjian
On Wed, Feb 15, 2012 at 01:35:05AM +0200, Marko Kreen wrote:
 On Tue, Feb 14, 2012 at 05:59:06PM -0500, Tom Lane wrote:
  Bruce Momjian br...@momjian.us writes:
   On Mon, Feb 13, 2012 at 08:28:03PM -0500, Tom Lane wrote:
   +1, I was about to suggest the same thing.  Running any of these tests
   for a fixed number of iterations will result in drastic degradation of
   accuracy as soon as the machine's behavior changes noticeably from what
   you were expecting.  Run them for a fixed time period instead.  Or maybe
   do a few, then check elapsed time and estimate a number of iterations to
   use, if you're worried about the cost of doing gettimeofday after each
   write.
  
   Good idea, and it worked out very well.  I changed the -o loops
   parameter to -s seconds which calls alarm() after (default) 2 seconds,
   and then once the operation completes, computes a duration per
   operation.
  
  I was kind of wondering how portable alarm() is, and the answer
  according to the buildfarm is that it isn't.
 
 I'm using the following simplistic alarm() implementation for win32:
 
   https://github.com/markokr/libusual/blob/master/usual/signal.c#L21
 
 This works with the fake sigaction()/SIGALRM hack below, to remember
 the function to call.
 
 It's good enough for simple stats printing, and avoids win32-specific
 code spreading around.

Wow, I wasn't even aware this compiled in Win32;  I thought it was
ifdef'ed out.  Anyway, I am looking at SetTimer as a way of making this
work.  (Me wonders if the GoGrid Windows images have compilers.)

I see backend/port/win32/timer.c, so I might go with a simple
create-a-thread, sleep(2), set-flag, exit solution.
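
That approach might look something like this (a minimal sketch; the
flag and function names are hypothetical, not from timer.c):

#include <windows.h>

/* Hypothetical flag polled by the test loop; set once time is up. */
static volatile long test_expired = 0;

static DWORD WINAPI
alarm_thread(LPVOID arg)
{
	DWORD		seconds = (DWORD) (UINT_PTR) arg;

	Sleep(seconds * 1000);					/* sleep for the test duration */
	InterlockedExchange(&test_expired, 1);	/* set the flag */
	return 0;								/* and exit */
}

static void
start_test_timer(DWORD seconds)
{
	HANDLE		h = CreateThread(NULL, 0, alarm_thread,
								 (LPVOID) (UINT_PTR) seconds, 0, NULL);

	if (h != NULL)
		CloseHandle(h);						/* let the thread run detached */
}

/* The test loop then becomes: while (!test_expired) { do_write(); ops++; } */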

-- 
  Bruce Momjian  br...@momjian.us  http://momjian.us
  EnterpriseDB   http://enterprisedb.com

  + It's impossible for everything to be true. +



[HACKERS] client performance v.s. server statistics

2012-02-14 Thread Zhou Han
Hi,

I am checking a performance problem encountered after porting an old
embedded DB to PostgreSQL. The system is real-time sensitive, so we are
concerned about per-query cost. In our environment a sequential scan
(select * from ...) over a table with tens of thousands of records takes
1-2 seconds, whether measured through the ODBC driver or via the timing
shown in the psql client (which, in turn, relies on libpq). However,
according to EXPLAIN ANALYZE, or the statistics in the pg_stat_statement
view, the query costs less than 100 ms.

So, is the cost of the client interface (ODBC, libpq) mainly due to TCP?
Do pg_stat_statement and EXPLAIN ANALYZE include the cost of copying
tuples from shared buffers to the result set?
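
(As a minimal sketch, not from the original post, of how the client-side
number is obtained; the connection string and table name are
placeholders:)

#include <stdio.h>
#include <sys/time.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=test");	/* placeholder conninfo */
	PGresult   *res;
	struct timeval t0, t1;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	gettimeofday(&t0, NULL);
	res = PQexec(conn, "SELECT * FROM some_table");	/* placeholder query */
	gettimeofday(&t1, NULL);

	if (PQresultStatus(res) == PGRES_TUPLES_OK)
		printf("%d rows in %.3f ms (client side: includes execution,\n"
			   "network transfer and result construction)\n",
			   PQntuples(res),
			   (t1.tv_sec - t0.tv_sec) * 1000.0 +
			   (t1.tv_usec - t0.tv_usec) / 1000.0);

	PQclear(res);
	PQfinish(conn);
	return 0;
}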

Could you experts share your views on this big gap? And do you have any
suggestions for optimisation?

P.S. Our original embedded DB provided a fastpath interface that reads
records directly from shared memory, giving extremely fast real-time
access (at the cost, of course, of some other features, such as
consistency).

Best regards,
Han


Re: [HACKERS] When do we lose column names?

2012-02-14 Thread Andrew Dunstan



On 02/14/2012 05:39 PM, Tom Lane wrote:

OK, I fixed this up and committed it.  I made some cosmetic changes
(the most notable being that the definition of RowExpr is really
changing here, and so should its comment).  The adjust_appendrel_attrs
situation was fixed by passing in the PlannerInfo, which is something
that probably should have been made available all along --- there are
very few nontrivial functions in the planner that don't need it.



Great, many thanks for finishing this up.




I'm still a bit annoyed by the behavior I mentioned here,
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01031.php
that we don't get real column names from an unflattened VALUES RTE.
Might be worth looking into that, but I don't have time for it.




A TODO maybe?

cheers

andrew



Re: [HACKERS] Progress on fast path sorting, btree index creation time

2012-02-14 Thread Robert Haas
On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan pe...@2ndquadrant.com wrote:
 [ new patch ]

I spent quite a bit of time looking at this today - the patch
specifically, and the issue of making quicksort fast more generally.
It seemed to me that if we're going to have separate copies of the
quicksort code for tuple sorting, we might as well go whole hog and
specialize those copies to the particular needs of tuplesort.c as much
as possible.  Accordingly, I whacked the code around so that it knows
that it is sorting SortTuple objects and need not conditionalize at
runtime on the size of the objects being swapped.  You suggested
upthread that this might be worthwhile, and it seems that it is, so I
think we should do it.

Your patch removes the CHECK_FOR_INTERRUPTS() call from
comparetup_heap, which is no good.  However, since I'd already decided
to specialize the copies of quicksort intended for sorting as much as
possible, it made sense to me to refactor things so that the qsort
routine itself, rather than the comparator, is responsible for calling
CHECK_FOR_INTERRUPTS().  This slightly reduces the number of times we
CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons
before doing it.

I find that your pg_always_inline macro is equivalent to just plain
inline on my system (MacOS X v10.6.8, gcc 4.2.1).  It seems to need
something like this:

+#elif __GNUC__
+#define pg_always_inline inline __attribute__((always_inline))

...but I'm not very happy about relying on that, because I don't know
that it will work on every gcc version (never mind non-gcc compilers),
and I'm not convinced it's actually improving performance even on this
one.  The documentation seems to indicate that this is intended to
force inlining even when not optimizing, which may have something to
do with the lack of effect: that's not really the point here anyway.
What I did instead is to replace template_qsort_arg.h with a script
called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c
that tuplesort.c then #includes.  This seems more flexible to me than
the macro-based approach.  In particular, it allows me to generate
versions of qsort with different call signatures.  The attached patch
generates two:

static void qsort_tuple(SortTuple *a, size_t n, SortTupleComparator
cmp_tuple, Tuplesortstate *state);
static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);

The first of these is a drop-in replacement for qsort_arg() - any
tuplesort can use it, not just heap sorts.  But it is faster than
qsort_arg() because of the specializations for the SortTuple data
type.  The second handles the special case where we are sorting by a
single key that has an associated SortSupport object.  In this case we
don't need to carry the overhead of passing around the Tuplesortstate
and dereferencing it, nor do we need the SortTupleComparator: we can
just pass the SortSupport itself.  Maybe there's a way to get this
effect using macros, but I couldn't figure it out.  At any rate, at
least for the single-key case, this approach effectively forces the
comparator to be inlined without requiring pg_always_inline.
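
To make the shape of that concrete, here is a much-simplified sketch of
the single-key path; it is not the output of gen_qsort_tuple.pl (an
insertion sort stands in for the real quicksort, and both structs are
stand-ins for the tuplesort.c definitions):

#include <stddef.h>

typedef struct
{
	long		key;			/* stand-in for SortTuple.datum1 */
	void	   *tuple;			/* the tuple proper */
} SortTuple;

typedef struct
{
	int			(*comparator) (long a, long b); /* simplified SortSupport */
} SortSupport;

static inline int
cmp_ssup(const SortTuple *a, const SortTuple *b, SortSupport *ssup)
{
	/* single-key fast path: no Tuplesortstate to pass or dereference */
	return ssup->comparator(a->key, b->key);
}

static void
qsort_ssup(SortTuple *a, size_t n, SortSupport *ssup)
{
	size_t		i;

	for (i = 1; i < n; i++)
	{
		SortTuple	tmp = a[i];
		size_t		j = i;

		/*
		 * CHECK_FOR_INTERRUPTS() would go here, in the sort routine
		 * itself rather than in the comparator.
		 */
		while (j > 0 && cmp_ssup(&a[j - 1], &tmp, ssup) > 0)
		{
			a[j] = a[j - 1];
			j--;
		}
		a[j] = tmp;
	}
}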

With this patch, I get the following results, as compared with your
2012-02-10 version and master, using the same test cases I tested
before.

select * from nodups order by g offset 10001;
tps on master: 289.471274, 289.967984, 289.595958
tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
tps on attached version: 388.212793, 386.085083, 386.867478

select * from twocol order by a, b offset 1;
tps on master: 261.676611, 260.440886, 259.529362
tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
tps on attached version: 283.146463, 278.344827, 280.727798

select * from f8 order by g offset 1;
tps on master: 228.299924, 222.650355, 227.408506
tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
tps on attached version: 276.985299, 275.341575, 274.428095

There's some variation (which I can't account for) between the results
on master now and the results on master before - possibly just code
shifting around between cache lines due to unrelated changes, or maybe
some inadvertent change in my test setup.  But it looks to me like
your 2012-02-10 version, without any type-specific optimizations, does
pretty much just as well on multi-key sorting as your previous
version, which had them - or if there is a difference, it's pretty
small.

Overall, I think the numbers for the version I'm attaching here look
pretty good: the single-key performance is clearly better than your
last version, and the multi-key performance is very slightly worse.  I
think that slight worsening is a good trade-off, though, because this
version can use qsort_tuple() for all kinds of tuplesorts, not just
heap tuplesorts.  Still, it seems like we ought to be able to do even
better: the multi-key specialization that you had in your patch can be
coded in this framework, too, and in theory those are independent of
the 

Re: [HACKERS] client performance v.s. server statistics

2012-02-14 Thread Zhou Han
Hi,

I have tried a unix domain socket and the performance is similar to the
TCP socket. It is a MIPS architecture, so memory copies to/from the
kernel can take a lot of time, and apparently a unix domain socket is no
different from TCP in terms of memory copying.

But the ten-fold gap between the client-side and the server-side
statistics is still hard to believe. I want to know exactly which
operations are included in the server-side statistics shown by EXPLAIN
ANALYZE; I may check the code later when I get time.

For the query itself, it was just for performance comparison. There are
other index-based queries, which are of course much faster, but they
still show a similar ten-fold time gap between the client-side and
server-side statistics.

I am thinking of a client interface that does not involve the kernel;
is there such an option, or do I have to develop one from scratch?

Best regards,
Han

On Wed, Feb 15, 2012 at 1:23 PM, Amit Kapila amit.kap...@huawei.com wrote:

 So, is the cost of the client interface (ODBC, libpq) mainly due to TCP?

 The difference you are seeing compared to your embedded DB seems to be
 mainly due to TCP.

 One optimization you can try is PostgreSQL's Unix-domain socket mode.
 You can refer to the unix_socket_directory parameter in postgresql.conf
 and other related parameters.

 I am suggesting this because you were previously using an embedded DB,
 so your client and server should be on the same machine. If that is no
 longer the case, it will not work.

 Can you please clarify a few more things:

 1. After the sequential scan, do you need all of the scanned records on
 the client? If only a few are needed, why have you not created an index?

 2. What is the exact scenario for fetching records?

 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Zhou Han
 Sent: Wednesday, February 15, 2012 9:30 AM
 To: pgsql-hackers@postgresql.org
 Subject: [HACKERS] client performance v.s. server statistics
