Re: [HACKERS] Road map to study about fetching a set of tuples - novice!

2013-05-19 Thread Soroosh Sardari
Thanks, it's helpful.

The flowchart is a very good point to start.

Cheers,
Soroosh


On Sat, May 18, 2013 at 7:56 PM, Atri Sharma atri.j...@gmail.com wrote:



 Sent from my iPad

 On 18-May-2013, at 20:49, Dickson S. Guedes lis...@guedesoft.net wrote:

  On 18-05-2013 11:40, Atri Sharma wrote:
  On 18-May-2013, at 20:01, Soroosh Sardari soroosh.sard...@gmail.com wrote:
 
  Hi
 
   I was tracing a simple SELECT query to find out how PG fetches
   tuples, but I'm totally lost in the code. Could you help me
   understand what happens under the hood? I know about the parsing and
   planning parts; my actual problem is the executor. If you show me a
   road map to study, I would appreciate it.
 
 
 
  You can probably try:
 
 
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/executor/README;h=8afa1e3e4a7596475cbf19a76c88d48a04aeef02;hb=HEAD
 
 
 
  There is a cool flowchart too:
 
  http://www.postgresql.org/developer/backend/
 

 Oh yes, I am in love with that flow chart. It is so easy and expressive!

 Regards,

 Atri


Re: [HACKERS] Better LWLocks with compare-and-swap (9.4)

2013-05-19 Thread Heikki Linnakangas

On 18.05.2013 03:52, Dickson S. Guedes wrote:

>> pgbench -S is such a workload. With 9.3beta1, I'm seeing this
>> profile, when I run pgbench -S -c64 -j64 -T60 -M prepared on a
>> 32-core Linux machine:

>> -  64.09%  postgres  postgres           [.] tas
>>    - tas
>>       - 99.83% s_lock
>>          - 53.22% LWLockAcquire
>>             + 99.87% GetSnapshotData
>>          - 46.78% LWLockRelease
>>               GetSnapshotData
>>             + GetTransactionSnapshot
>> +   2.97%  postgres  postgres           [.] tas
>> +   1.53%  postgres  libc-2.13.so       [.] 0x119873
>> +   1.44%  postgres  postgres           [.] GetSnapshotData
>> +   1.29%  postgres  [kernel.kallsyms]  [k] arch_local_irq_enable
>> +   1.18%  postgres  postgres           [.] AllocSetAlloc
>> ...


> I'd like to test this here, but I couldn't reproduce that perf output
> on 64-core or 24-core machines. Could you post the changes to
> postgresql.conf and the perf arguments that you used?


Sure, here are the non-default postgresql.conf settings:


            name            |             current_setting
----------------------------+------------------------------------------
 application_name           | psql
 autovacuum                 | off
 checkpoint_segments        | 70
 config_file                | /home/hlinnakangas/data/postgresql.conf
 data_directory             | /home/hlinnakangas/data
 default_text_search_config | pg_catalog.english
 hba_file                   | /home/hlinnakangas/data/pg_hba.conf
 ident_file                 | /home/hlinnakangas/data/pg_ident.conf
 lc_collate                 | en_US.UTF-8
 lc_ctype                   | en_US.UTF-8
 log_timezone               | US/Pacific
 logging_collector          | on
 max_connections            | 100
 max_stack_depth            | 2MB
 server_encoding            | UTF8
 shared_buffers             | 2GB
 synchronous_commit         | off
 TimeZone                   | US/Pacific
 transaction_deferrable     | off
 transaction_isolation      | read committed
 transaction_read_only      | off
 wal_buffers                | 16MB

While pgbench was running, I ran this:

perf record -p 6050 -g -e cpu-clock

to connect to one of the backends. (I used cpu-clock, because the 
default cpu-cycles event didn't work on the box)
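
To spell out the whole workflow (the pid is whatever backend you pick; 6050
was just mine, and the report step is the obvious follow-up):

    # while pgbench runs, attach to one backend and sample it
    perf record -p 6050 -g -e cpu-clock
    # after the run, browse the call-graph profile
    perf report -g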


- Heikki




Re: [HACKERS] pgbench vs. SERIALIZABLE

2013-05-19 Thread Fabien COELHO



>> Should it give up trying under some conditions, say when there are more
>> errors than transactions?


> I don't really see the point of that.  I can't think of a scenario where
> you would get too many serialization errors to even finish the pgbench
> test.


My point is really to avoid, in principle, a potential infinite loop under
option -t in these conditions, for instance if all transactions are failing
because of a table lock. If pgbench is just killed, I'm not sure you get a
report.


> At any rate, as proposed, this would fail horribly if the very first
> transaction fails, or the second transaction fails twice, etc.


Yep. Or maybe some more options to control the expected behavior on
transaction failures? --stop-client-on-fail (current behavior),
--keep-trying-indefinitely, --stop-client-after=nfails... or nothing, if
this is not a problem :-)
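
For what it's worth, the kind of per-client cap I have in mind would look
something like this untested sketch (the option and field names are made
up; this is not actual pgbench code):

    /* on a transaction error, retry up to --stop-client-after times */
    if (PQresultStatus(res) == PGRES_FATAL_ERROR)
    {
        st->nfails++;
        if (stop_client_after > 0 && st->nfails >= stop_client_after)
            finish_client(st);      /* give up, but still emit the report */
        else
            retry_transaction(st);  /* re-run the script from the top */
    }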


--
Fabien.




[HACKERS] Block write statistics WIP

2013-05-19 Thread Greg Smith
I have some time now for working on the mystery of why there are no 
block write statistics in the database.  I hacked out the statistics 
collection and reporting side already, rough patch attached if you want 
to see the boring parts.


I guessed that there had to be a gotcha behind why this wasn't done 
before now, and I've found one so far.  All of the read statistics are 
collected with a macro that expects to know a Relation number.  Callers 
to ReadBuffer pass one.  On the write side, the two places that 
increment the existing write counters (pg_stat_statements, vacuum) are 
MarkBufferDirty and MarkBufferDirtyHint.  Neither of those is given a 
Relation though.  Inspecting the Buffer passed can only find the buffer 
tag's RelFileNode.
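
Concretely, all that MarkBufferDirty can see is roughly this (a sketch of
the current buffer internals, not new code):

    /* inside bufmgr.c: the buffer header only carries a tag */
    BufferDesc *bufHdr = &BufferDescriptors[buffer - 1];
    RelFileNode rnode = bufHdr->tag.rnode;  /* tablespace, database, and
                                             * relfilenode OIDs -- but no
                                             * Relation, no pg_class OID */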


I've thought of two paths to get a block write count out of that so far:

-Provide a function to find the Relation from the RelFileNode.  There is 
a warning about the perils of assuming you can map that way from a 
buftag value in buf_internals.h though:


Note: the BufferTag data must be sufficient to determine where to write 
the block, without reference to pg_class or pg_tablespace entries.  It's 
possible that the backend flushing the buffer doesn't even believe the 
relation is visible yet (its xact may have started before the xact that 
created the rel).  The storage manager must be able to cope anyway.


-Modify every caller of MarkDirty* to include a relation when that 
information is available.  There are about 200 callers of those 
functions around, so that won't be fun.  I noted that many of them are 
in the XLog replay functions, which won't have the relation--but those 
aren't important to count anyway.
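
A rough sketch of what that second path could look like (the write-side
pgstat counter is hypothetical; only pgstat_count_buffer_read() exists
today):

    /* hypothetical variant letting callers pass the Relation when they
     * have one; NULL keeps the current behavior */
    static inline void
    MarkBufferDirtyRel(Buffer buffer, Relation rel)
    {
        MarkBufferDirty(buffer);
        if (rel != NULL)
            pgstat_count_buffer_dirtied(rel);   /* hypothetical counterpart
                                                 * of pgstat_count_buffer_read() */
    }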


Neither of these options feels very good to me, so selecting between the 
two feels like picking the lesser of two evils.  I'm fine with chugging 
through all of the callers to modify MarkDirty*, but I didn't want to do 
that only to have the whole approach rejected as wrong.  That's why I 
stopped here to get some feedback.


The way that MarkDirty requires this specific underlying storage manager 
behavior to work properly strikes me as a bit of a layering violation 
too.  I'd like the read and write paths to have a similar API, but here 
they don't even operate on the same type of inputs.  Addressing that is 
probably harder than just throwing a hack on the existing code though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index a03bfa6..1773d59 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -467,15 +467,19 @@ CREATE VIEW pg_statio_all_tables AS
 pg_stat_get_blocks_fetched(C.oid) -
 pg_stat_get_blocks_hit(C.oid) AS heap_blks_read,
 pg_stat_get_blocks_hit(C.oid) AS heap_blks_hit,
+pg_stat_get_blocks_dirtied(C.oid) AS heap_blks_dirtied,
 sum(pg_stat_get_blocks_fetched(I.indexrelid) -
 pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_read,
 sum(pg_stat_get_blocks_hit(I.indexrelid))::bigint AS idx_blks_hit,
+sum(pg_stat_get_blocks_dirtied(I.indexrelid))::bigint AS idx_blks_dirtied,
 pg_stat_get_blocks_fetched(T.oid) -
 pg_stat_get_blocks_hit(T.oid) AS toast_blks_read,
 pg_stat_get_blocks_hit(T.oid) AS toast_blks_hit,
+pg_stat_get_blocks_dirtied(T.oid) AS toast_blks_dirtied,
 pg_stat_get_blocks_fetched(X.oid) -
 pg_stat_get_blocks_hit(X.oid) AS tidx_blks_read,
-pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit
+pg_stat_get_blocks_hit(X.oid) AS tidx_blks_hit,
+pg_stat_get_blocks_dirtied(X.oid) AS tidx_blks_dirtied
 FROM pg_class C LEFT JOIN
 pg_index I ON C.oid = I.indrelid LEFT JOIN
 pg_class T ON C.reltoastrelid = T.oid LEFT JOIN
@@ -530,6 +534,7 @@ CREATE VIEW pg_statio_all_indexes AS
 pg_stat_get_blocks_fetched(I.oid) -
 pg_stat_get_blocks_hit(I.oid) AS idx_blks_read,
-pg_stat_get_blocks_hit(I.oid) AS idx_blks_hit
+pg_stat_get_blocks_hit(I.oid) AS idx_blks_hit,
+pg_stat_get_blocks_dirtied(I.oid) AS idx_blks_dirtied
 FROM pg_class C JOIN
 pg_index X ON C.oid = X.indrelid JOIN
 pg_class I ON I.oid = X.indexrelid
@@ -554,6 +559,7 @@ CREATE VIEW pg_statio_all_sequences AS
 pg_stat_get_blocks_fetched(C.oid) -
 pg_stat_get_blocks_hit(C.oid) AS blks_read,
-pg_stat_get_blocks_hit(C.oid) AS blks_hit
+pg_stat_get_blocks_hit(C.oid) AS blks_hit,
+pg_stat_get_blocks_dirtied(C.oid) AS blks_dirtied
 FROM pg_class C
 LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
 WHERE C.relkind = 'S';
@@ -622,6 +628,7 @@ CREATE VIEW pg_stat_database AS
 pg_stat_get_db_blocks_fetched(D.oid) -
 pg_stat_get_db_blocks_hit(D.oid) AS 

Re: [HACKERS] fast promotion and log_checkpoints

2013-05-19 Thread Simon Riggs
On 1 May 2013 10:05, Fujii Masao masao.fu...@gmail.com wrote:

 In HEAD, when the standby is promoted, recovery requests the checkpoint
 but doesn't wait for its completion. I found that the checkpoint starting
 log message of this checkpoint looks odd, as follows:

 LOG:  checkpoint starting:

 I think something like the following is better.

 LOG:  checkpoint starting: end-of-recovery

 In 9.2 or before, the end-of-recovery part is logged. Even if we changed
 the behavior of the end-of-recovery checkpoint, I think that it's more
 intuitive to label it as end-of-recovery. Thoughts?

The checkpoint isn't an end-of-recovery checkpoint, it's just the
first checkpoint after the end of recovery. I don't think it should
say end-of-recovery.

The problem is that we've now changed the code to trigger a checkpoint
in a place that wasn't part of the original design, so the checkpoint
called at that point isn't supplied with a reason and so has nothing
to print.

It would be possible to redesign this with a special new reason, or we
could just use time as the reason, or we could just leave it.

Doing nothing is easy, though so are the other options, so we can choose
anything we want. What do we want it to say?
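
For reference, the log message is assembled from the flag bits, roughly
like this (a simplified sketch of LogCheckpointStart() in xlog.c; with no
reason flags set, everything after the colon comes out empty):

    elog(LOG, "checkpoint starting:%s%s%s%s",
         (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
         (flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
         (flags & CHECKPOINT_FORCE) ? " force" : "",
         (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");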

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Fast promotion failure

2013-05-19 Thread Simon Riggs
On 7 May 2013 10:57, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 While testing the bug from the Assertion failure at standby promotion, I
 bumped into a different bug in fast promotion. When the first checkpoint
 after fast promotion is performed, there is no guarantee that the
 checkpointer process is running with the correct, new, ThisTimeLineID. In
 CreateCheckPoint(), we have this:

 /*
  * An end-of-recovery checkpoint is created before anyone is allowed to
  * write WAL. To allow us to write the checkpoint record, temporarily
  * enable XLogInsertAllowed.  (This also ensures ThisTimeLineID is
  * initialized, which we need here and in AdvanceXLInsertBuffer.)
  */
 if (flags & CHECKPOINT_END_OF_RECOVERY)
     LocalSetXLogInsertAllowed();


 That ensures that ThisTimeLineID is updated when performing an
 end-of-recovery checkpoint, but it doesn't get executed with fast promotion.
 The consequence is that the checkpoint is created with the old timeline, and
 subsequent recovery from it will fail.

LocalSetXLogInsertAllowed() is called by CreateEndOfRecoveryRecord(),
but in fast promotion this is called by Startup process, not
Checkpointer process. So there is no longer an explicit call to
InitXLOGAccess() in the checkpointer process via a checkpoint with
flag CHECKPOINT_END_OF_RECOVERY.

However, there is a call to RecoveryInProgress() at the top of the
main loop of the checkpointer, which does explicitly state that it
initializes TimeLineID if it's not set yet. The checkpointer makes
the decision about whether to run a restartpoint or a checkpoint
directly from the answer to that, modified only in the case of a
CHECKPOINT_END_OF_RECOVERY checkpoint.
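
In other words, the decision is made roughly like this (a condensed
sketch of the checkpointer's main loop):

    do_restartpoint = RecoveryInProgress(); /* also initializes the local
                                             * ThisTimeLineID if unset */
    if (flags & CHECKPOINT_END_OF_RECOVERY)
        do_restartpoint = false;

    if (do_restartpoint)
        CreateRestartPoint(flags);
    else
        CreateCheckPoint(flags);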

So the appropriate call is made, and I don't agree with the statement
that this doesn't get executed with fast promotion, or with the title
of the thread.

So while I believe that the checkpointer might have an incorrect TLI
and that you've seen a bug, what isn't clear is that the checkpointer
is the only process that would see an incorrect TLI, or why such
processes see an incorrect TLI. It seems equally likely at this point
that the TLI may be set incorrectly somehow and that is why it is
being read incorrectly.

I see that the comment in InitXLOGAccess() is incorrect
(ThisTimeLineID doesn't change so we need no lock to copy it), which
seems worth correcting since there's no need to save time in a
once-only function.

Continuing to investigate further.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Assertion failure when promoting node by deleting recovery.conf and restart node

2013-05-19 Thread Simon Riggs
On 25 March 2013 19:14, Heikki Linnakangas hlinnakan...@vmware.com wrote:
 On 15.03.2013 04:25, Michael Paquier wrote:

 Hi,

 When trying to *promote* a slave as master by removing recovery.conf and
 restarting the node, I found an assertion failure on the master branch:
 LOG:  database system was shut down in recovery at 2013-03-15 10:22:27 JST
 TRAP: FailedAssertion("!(ControlFile->minRecoveryPointTLI != 1)", File:
 "xlog.c", Line: 4954)
 (gdb) bt
 #0  0x7f95af03b2c5 in raise () from /usr/lib/libc.so.6
 #1  0x7f95af03c748 in abort () from /usr/lib/libc.so.6
 #2  0x0086ce71 in ExceptionalCondition (conditionName=0x8f2af0
 "!(ControlFile->minRecoveryPointTLI != 1)", errorType=0x8f0813
 "FailedAssertion", fileName=0x8f076b "xlog.c",
  lineNumber=4954) at assert.c:54
 #3  0x004fe499 in StartupXLOG () at xlog.c:4954
 #4  0x006f9d34 in StartupProcessMain () at startup.c:224
 #5  0x0050ef92 in AuxiliaryProcessMain (argc=2,
 argv=0x7fffa6fc3d20) at bootstrap.c:423
 #6  0x006f8816 in StartChildProcess (type=StartupProcess) at
 postmaster.c:4956
 #7  0x006f39e9 in PostmasterMain (argc=6, argv=0x1c950a0) at
 postmaster.c:1237
 #8  0x0065d59b in main (argc=6, argv=0x1c950a0) at main.c:197
 Ok, this is not the cleanest way to promote a node, as it doesn't do any
 safety checks related to promotion, but 9.2 and previous versions allowed
 doing that properly.

 The assertion was introduced by commit 3f0ab05 in order to properly
 record minRecoveryPointTLI in the control file at the end of recovery in
 the case of a crash.
 However, in the case of a slave node properly shut down in recovery,
 which is then restarted as a master, the code path of this assertion is
 taken. What do you think of the patch attached? It avoids updating
 recoveryTargetTLI and recoveryTargetIsLatest if the node has been shut
 down while in recovery.
 Another possibility could be to add to the assertion some conditions
 based on the state of controlFile, but I think it is more consistent
 simply not to update those fields.


 Simon, can you comment on this? ISTM we could just remove the assertion and
 update the comment to mention that this can happen. If there is a min
 recovery point, surely we always need to recover to the timeline containing
 that point, so setting recoveryTargetTLI to minRecoveryPointTLI seems
 sensible.

Fixed using the latest TLI available and removing the assertion.

--
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Parallel Sort

2013-05-19 Thread Hitoshi Harada
On Wed, May 15, 2013 at 11:11 AM, Noah Misch n...@leadboat.com wrote:

 On Wed, May 15, 2013 at 08:12:34AM +0900, Michael Paquier wrote:
  The concept of clause parallelism for backend workers is close to the
  concept of clause shippability introduced in Postgres-XC. In the case
  of XC, the equivalent of the master backend is a backend located on a
  node called Coordinator that merges and organizes results fetched in
  parallel from remote nodes where data scans occur (on nodes called
  Datanodes). The backends used for tuple scans across Datanodes share
  the same data visibility, as they use the same snapshot and
  transaction ID as the backend on Coordinator. This is different from
  the proposed parallelism, as there is no idea of snapshot import to
  worker backends.

 Worker backends would indeed share snapshot and XID.

  However, the code in the XC planner used for clause shippability
  evaluation is definitely worth looking at, just considering the many
  similarities it shares with parallelism when evaluating whether a given
  clause can be executed on a worker backend or not. It would be a waste
  to implement the same thing twice if there is code already available.

 Agreed.  Local parallel query is very similar to distributed query; the
 specific IPC cost multipliers differ, but that's about it.  I hope we can
 benefit from XC's experience in this area.


I believe parallel execution is much easier to achieve if the data is
partitioned.  Of course it is possible to make only the sort operation
parallel, but then the question would be how to split and pass each tuple
to workers.  XC and Greenplum use the notion of a hash-distributed table
that enables parallel sort (XC doesn't perform parallel sort on
replicated tables, I guess).  For postgres, I don't think a
hash-distributed table is a foreseeable option, but MergeAppend over
inheritance is a good choice to run in parallel.  You won't even need to
modify many lines of the sort execution code if you dispatch the work
correctly, as it's just a matter of splitting the query plan and
assigning its subnodes to workers.  Transactions and locks will be tricky
though, and we might end up introducing a small set of snapshot-sharing
infrastructure for the former, and a notion of session ID rather than
process ID for the latter.  I don't think SnapshotNow is the problem, as
the executor is reading catalogs with it today anyway.
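
To illustrate the plan shape I mean, ordinary inheritance partitioning is
enough (a sketch; the table and index names are made up):

    -- with per-child indexes, the planner can produce a Merge Append over
    -- per-child index scans; each child subnode could go to a worker
    CREATE TABLE t (id int, v text);
    CREATE TABLE t_1 (CHECK (id < 1000000)) INHERITS (t);
    CREATE TABLE t_2 (CHECK (id >= 1000000)) INHERITS (t);
    CREATE INDEX t_1_id_idx ON t_1 (id);
    CREATE INDEX t_2_id_idx ON t_2 (id);

    EXPLAIN SELECT * FROM t ORDER BY id;  -- Merge Append over two index scans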

Thanks,
Hitoshi


-- 
Hitoshi Harada


[HACKERS] psql \watch versus \timing

2013-05-19 Thread Jeff Janes
I'd like to run the same query repeatedly and see how long it takes each time.

I thought \watch would be excellent for this, but it turns out that using
\watch suppresses the output of \timing.

Is this intentional, or unavoidable?

Also, is it just me, or does the inability to watch more frequently than
once a second make it a lot less useful than it could be?
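
Something like this session shows the problem (paraphrasing; the table
name is arbitrary):

    \timing on
    SELECT count(*) FROM pgbench_accounts;  -- prints a "Time: ..." line once
    \watch 1
    -- the query re-runs every second, but no "Time: ..." lines appear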


Cheers,

Jeff


Re: [HACKERS] psql \watch versus \timing

2013-05-19 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 I'd like to run the same query repeatedly and see how long it takes each time.
 I thought \watch would be excellent for this, but it turns out that using
 \watch suppresses the output of \timing.

 Is this intentional, or unavoidable?

\watch uses PSQLexec, not SendQuery; the latter implements \timing, which
I agree is arguably useful here, but also autocommit/auto-savepoint
behavior, which probably isn't a good idea.

It might be a good idea to refactor those two routines into one routine
with some sort of bitmap flags argument to control the various add-on
behaviors, but that seems like not 9.3 material anymore.
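
Something like this hypothetical sketch is what I have in mind (not
actual psql source; the names are made up):

    /* flags for a unified query-execution routine in psql */
    #define EXEC_QUERY_TIMING      0x01    /* honor \timing */
    #define EXEC_QUERY_TRANSACTION 0x02    /* autocommit/auto-savepoint */

    extern PGresult *ExecQueryFlags(const char *query, int flags);

    /* \watch would then call ExecQueryFlags(query, EXEC_QUERY_TIMING) */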

 Also, is it just me, or does the inability to watch more frequently than
 once a second make it a lot less useful than it could be?

It did not seem that exciting to me.  In particular, we've already found
out that \watch with zero delay is a pretty bad idea, so you'd have to
make a case for what smaller minimum to use if it's not to be 1 second.

regards, tom lane




[HACKERS] ASYNC Privileges proposal

2013-05-19 Thread Chris Farmiloe
Hey all,

I find the current LISTEN / NOTIFY rather limited in the context of
databases with multiple roles. As it stands it is not possible to restrict
the use of LISTEN or NOTIFY to specific roles, and therefore notifications
(and their payloads) cannot really be trusted as coming from any particular
source.

If the payloads of notifications could be trusted, then applications could
make better use of them, without fear of leaking any sensitive information
to anyone who shouldn't be able to see it.

I'd like to propose a new ASYNC database privilege that would control
whether a role can use LISTEN, NOTIFY and UNLISTEN statements and the
associated pg_notify function.

ie:
GRANT ASYNC ON DATABASE <database> TO bob;
REVOKE ASYNC ON DATABASE <database> FROM bob;

SECURITY DEFINER functions could then be used anywhere that finer-grained
access control was required.
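
For example, something along these lines (an illustrative sketch; the
channel and role names are made up):

    -- bob can't NOTIFY directly, but can go through the wrapper,
    -- so payloads on the 'orders' channel can be trusted
    CREATE FUNCTION notify_orders(payload text) RETURNS void AS $$
    BEGIN
        PERFORM pg_notify('orders', payload);
    END;
    $$ LANGUAGE plpgsql SECURITY DEFINER;

    REVOKE ALL ON FUNCTION notify_orders(text) FROM PUBLIC;
    GRANT EXECUTE ON FUNCTION notify_orders(text) TO bob;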

I had a quick play to see what might be involved [attached], and would
like to hear people's thoughts: good idea, bad idea, not like that!, etc.

Chris.


async_privileges_r0.patch
Description: Binary data



Re: [HACKERS] ASYNC Privileges proposal

2013-05-19 Thread Tom Lane
Chris Farmiloe chrisfa...@gmail.com writes:
 I find the current LISTEN / NOTIFY rather limited in the context of
 databases with multiple roles. As it stands it is not possible to restrict
 the use of LISTEN or NOTIFY to specific roles, and therefore notifications
 (and their payloads) cannot really be trusted as coming from any particular
 source.

TBH, nobody has complained about this in the fifteen-plus years that
LISTEN has been around.  I'm dubious about adding privilege-checking
overhead for everybody to satisfy a complaint from one person.

 I'd like to propose a new ASYNC database privilege that would control
 whether a role can use LISTEN, NOTIFY and UNLISTEN statements and the
 associated pg_notify function.

... and if I did think that there were an issue here, I doubt I'd think
that a privilege as coarse-grained as that would fix it.  Surely you'd
want per-channel privileges if you were feeling paranoid about this,
not to mention separate read and write privileges.  But the demand for
that just isn't out there.

regards, tom lane




Re: [HACKERS] ASYNC Privileges proposal

2013-05-19 Thread Chris Farmiloe
In fairness, NOTIFY has only had a payload since 9.0, and the issue of
trust is mainly tied to data leaking from the payload, so I suspect I
won't be the last person to request this as people re-visit NOTIFY :)
...but I totally get your point, of course.

My first thought was also that having control at the channel level would
be ideal, but that would be a huge implementation change by the looks of
things, would certainly affect performance a great deal more, and would
not really give much more benefit than could be attained with
database-level control plus a SECURITY DEFINER function.


Chris

