Re: [HACKERS] pg_stat_lwlock wait time view
2016-08-25 13:46 GMT+09:00 Haribabu Kommi <kommi.harib...@gmail.com>:
> On Thu, Aug 25, 2016 at 6:57 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Wed, Aug 24, 2016 at 4:23 AM, Haribabu Kommi
>> <kommi.harib...@gmail.com> wrote:
>>>
>>> Based on the performance impact with the additional timing calculations,
>>> we can decide the view default behavior. Are there any objections to the
>>> concept?
>>
>> There have been some other recent threads on extending the wait event
>> stuff. If you haven't already read those, you should, because the
>> issues are closely related. I think that timing LWLock waits will be
>> quite expensive. I believe that what the Postgres Pro folks want to
>> do is add up the wait times or maybe keep a history of waits (though
>> maybe I'm wrong about that), but showing them in pg_stat_activity is
>> another idea. That's likely to add some synchronization overhead
>> which might be even greater in this case than for a feature that just
>> publishes accumulated times, but maybe it's all a drop in the bucket
>> compared to the cost of calling gettimeofday() in the first place.
>
> Yes, I agree this is an issue for the cases where the wait time is smaller
> than the cost of the logic that is added to calculate the wait time. Even if
> we use clock_gettime() with CLOCK_REALTIME_COARSE there will be some
> overhead; that clock method is about 8 times faster than gettimeofday()
> but less accurate. Maybe we can use clock_gettime() instead of
> gettimeofday() in this case, as we may not need the fine-grained value.

Is there any other option (rather than gettimeofday()) to measure elapsed time with lower overhead? I've heard about the RDTSC feature (hardware counter) supported by recent processors, and have found a few articles [1] [2] on its lower overhead compared to gettimeofday().
[1] http://stackoverflow.com/questions/15623343/using-cpu-counters-versus-gettimeofday
[2] http://stackoverflow.com/questions/6498972/faster-equivalent-of-gettimeofday

I'm not sure how we can benefit from it so far, because I'm not familiar with this facility and of course I don't have the numbers. In addition to that, I guess it would bring some portability issues. But I'm still curious to learn more about it. Does anyone have experience with it?

>> Personally, my preferred solution is still to have a background worker
>> that samples the published wait events and rolls up statistics, but
>> I'm not sure I've convinced anyone else. It could report the number
>> of seconds since it detected a wait event other than the current one,
>> which is not precisely the same thing as tracking the length of the
>> current wait but it's pretty close. I don't know for sure what's best
>> here - I think some experimentation and dialog is needed.
>
> Yes, using a background worker can reduce the load of adding all the
> wait time calculations in the main backend. I can give it a try by modifying
> the direct calculation approach and a background worker (maybe pg_stat_collector)
> to find the wait time based on the stat messages that are received from the
> main backend related to wait start and wait end.
>
> I am not sure, without getting any signal or message from the main backend,
> how accurate the data gathered from a background worker can be.

It looks like an accuracy-performance trade-off. So, I think the use cases matter here to get a better design. I guess that's the reason why Ilya is looking for feature requests from DBAs in another thread [3].

[3] https://www.postgresql.org/message-id/cag95seuaqvj09kzlwu%2bz1b-gqdmqerzekpfr3hn0q88xzmq...@mail.gmail.com

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Wait events monitoring future development
2016/08/10 23:22 "Bruce Momjian":
> On Wed, Aug 10, 2016 at 05:14:52PM +0300, Alexander Korotkov wrote:
> > On Tue, Aug 9, 2016 at 5:37 AM, Bruce Momjian wrote:
> >
> >     On Tue, Aug 9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
> >     > I hope wait event monitoring will be on by default even if the overhead is not
> >     > almost zero, because the data needs to be readily available for faster
> >     > troubleshooting. IMO, the benefit would be worth even 10% overhead. If you
> >     > disable it by default because of overhead, how can we convince users to enable
> >     > it in production systems to solve some performance problem? I'm afraid severe
> >     > users would say "we can't change any setting that might cause more trouble, so
> >     > investigate the cause with existing information."
> >
> >     If you want to know why people are against enabling this monitoring by
> >     default, above is the reason. What percentage of people do you think
> >     would be willing to take a 10% performance penalty for monitoring like
> >     this? I would bet very few, but the argument above doesn't seem to
> >     address the fact it is a small percentage.
> >
> > Just two notes from me:
> >
> > 1) 10% overhead from monitoring wait events is just an idea without any proof so far.
> > 2) We already have functionality which trades insight into the database for way
> > more overhead. auto_explain.log_analyze = true can slow down queries *several
> > times over*. Do you think we should remove it?
>
> The point is not removing it, the point is whether
> auto_explain.log_analyze = true should be enabled by default, and I
> think no one wants to do that.

Agreed. If people are facing some difficult situation in terms of performance, they may accept some (one-time) overhead to resolve the issue. But if they don't have (or don't recognize) any issue, they may not. That's one of the realities in my experience.

Regards,
Re: [HACKERS] Wait events monitoring future development
2016-08-07 21:03 GMT+09:00 Ilya Kosmodemiansky <ilya.kosmodemian...@postgresql-consulting.com>:
> I've summarized the Wait events monitoring discussion at the Developer
> unconference in Ottawa this year on the wiki:
>
> https://wiki.postgresql.org/wiki/PgCon_2016_Developer_Unconference/Wait_events_monitoring
>
> (Thanks to Alexander Korotkov for patiently pushing me to make this thing
> finally done)

Thanks for your effort to help us move forward.

> If you attended, feel free to point out anything I missed, and I will put
> it on the wiki too.
>
> Wait event monitoring looks once again stuck on its way through community
> approval, in spite of the huge progress made last year in that direction. The
> importance of the topic is beyond discussion now; if you talk to any
> PostgreSQL person about implementing such a tool in Postgres and the
> person does not get excited, you are probably talking to a full-time PostgreSQL
> developer ;-) Obviously it needs a better design, both the user interface and
> the implementation, and perhaps this is why full-time developers are still
> sceptical.
>
> In order to move forward, IMHO we need at least the following steps, which can
> be done in parallel:
>
> 1. Further requirements need to be collected from DBAs.
>
>    If you are a PostgreSQL DBA with Oracle experience and use perf for
> troubleshooting Postgres, you are an ideal person to share your experience,
> but everyone is welcome.
>
> 2. Further pg_wait_sampling performance testing is needed, in different
> environments.
>
>    According to the developers, the overhead is small, but many people have
> doubts that it can be much more significant for intensive workloads. Obviously,
> it is not an easy task to test, because you need to put doubtfully
> non-production-ready code into mission-critical production for such tests.
>    As a result it will become clear whether this design should be abandoned
> and we need to think about less-invasive solutions, or this design is
> acceptable.
>
> Any thoughts?
Seems like a good starting point. I'm interested in both, and I would like to contribute by running (or writing) several tests.

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
Re: [HACKERS] Wait events monitoring future development
2016-08-09 11:49 GMT+09:00 Joshua D. Drake <j...@commandprompt.com>:
> On 08/08/2016 07:37 PM, Bruce Momjian wrote:
>>
>> On Tue, Aug 9, 2016 at 02:06:40AM +, Tsunakawa, Takayuki wrote:
>>>
>>> I hope wait event monitoring will be on by default even if the overhead
>>> is not almost zero, because the data needs to be readily available for
>>> faster troubleshooting. IMO, the benefit would be worth even 10% overhead.
>>> If you disable it by default because of overhead, how can we convince
>>> users to enable it in production systems to solve some performance
>>> problem? I'm afraid severe users would say "we can't change any setting
>>> that might cause more trouble, so investigate the cause with existing
>>> information."
>>
>> If you want to know why people are against enabling this monitoring by
>> default, above is the reason. What percentage of people do you think
>> would be willing to take a 10% performance penalty for monitoring like
>> this? I would bet very few, but the argument above doesn't seem to
>> address the fact it is a small percentage.
>
> I would argue it is zero. There are definitely users for this feature but to
> enable it by default is looking for trouble. *MOST* users do not need this.

I used to think that this kind of feature should be enabled by default, because when I was working at my previous company, I had only a few ways to understand what was happening inside PostgreSQL by observing production databases. I needed those features enabled in the production databases when I was called in.

However, now I have another opinion. When we release the next major version, say 10.0, with wait monitoring, many people will start their benchmark tests with a configuration using *the default values*, and if they see some performance decrease, for example around 10%, they will talk about it as a performance decrease in PostgreSQL 10.0. That means PostgreSQL would be facing a difficult reputation.
So, I agree that the feature should be disabled by default for a while.

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
Re: [HACKERS] DROP OWNED BY ... CASCADE & "could not open relation with OID" error
2016-07-21 13:53 GMT+09:00 Alvaro Herrera <alvhe...@2ndquadrant.com>:
> Satoshi Nagayasu wrote:
>> Hi,
>>
>> I have been trying MADlib [1], a machine-learning library for PostgreSQL,
>> and when I was trying it on 9.5 and 9.6beta2, I often got the following
>> error on my box.
>>
>> madpack.py : ERROR : SQL command failed:
>> SQL: DROP OWNED BY madlib_19_installcheck CASCADE;
>> ERROR: could not open relation with OID 147056
>>
>> I wonder whether this could be a PostgreSQL bug.
>> AFAIK, this kind of error should not occur during regular SQL command
>> execution, so I guess this is a PostgreSQL bug.
>
> Agreed.
>
>> Should I dig into it deeper to reproduce it in a simple procedure?
>
> Yes, please.

OK. I will try it.

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
[HACKERS] DROP OWNED BY ... CASCADE & "could not open relation with OID" error
Hi,

I have been trying MADlib [1], a machine-learning library for PostgreSQL, and when I was trying it on 9.5 and 9.6beta2, I often got the following error on my box.

madpack.py : ERROR : SQL command failed:
SQL: DROP OWNED BY madlib_19_installcheck CASCADE;
ERROR: could not open relation with OID 147056

I wonder whether this could be a PostgreSQL bug. AFAIK, this kind of error should not occur during regular SQL command execution, so I guess this is a PostgreSQL bug.

Should I dig into it deeper to reproduce it in a simple procedure?

Regards,

[1] http://madlib.incubator.apache.org/
--
Satoshi Nagayasu <sn...@uptime.jp>
[HACKERS] PARALLEL SAFE/UNSAFE question
Hi all,

I was trying to write my own parallel aggregation functions on 9.6beta2. During that, we found a somewhat confusing behavior with the SAFE/UNSAFE options.

Once a PARALLEL UNSAFE function f1_unsafe() is wrapped by a PARALLEL SAFE function f1_safe_unsafe(), calling f1_safe_unsafe() produces a parallel execution plan even though it implicitly calls the UNSAFE function f1_unsafe(). Is this intentional?

Please take a look at our example here:

https://gist.github.com/snaga/362a965683fb2581bc693991b1fcf721

According to the manual [1], it is described as:

> the presence of such a function in an SQL statement forces a serial execution plan.

So, I expected that calling UNSAFE functions would produce a non-parallel execution plan even when wrapped in SAFE functions.

Is this a sort of limitation? Or is this working correctly as expected?

Regards,

[1] https://www.postgresql.org/docs/9.6/static/sql-createfunction.html
--
Satoshi Nagayasu <sn...@uptime.jp>
Re: [HACKERS] Actual row count of Parallel Seq Scan in EXPLAIN ANALYZE
Hi,

2016-06-20 15:42 GMT+09:00 Amit Kapila <amit.kapil...@gmail.com>:
> On Mon, Jun 20, 2016 at 11:48 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>>
>> Hi all,
>>
>> My colleague noticed that the output of EXPLAIN ANALYZE doesn't work
>> fine for parallel seq scan.
>>
>> postgres(1)=# explain analyze verbose select count(*) from pgbench_accounts ;
>> QUERY PLAN
>> -
>> Finalize Aggregate (cost=217018.55..217018.56 rows=1 width=8) (actual time=2640.015..2640.015 rows=1 loops=1)
>>    Output: count(*)
>>    -> Gather (cost=217018.33..217018.54 rows=2 width=8) (actual time=2639.064..2640.002 rows=3 loops=1)
>>          Output: (PARTIAL count(*))
>>          Workers Planned: 2
>>          Workers Launched: 2
>>          -> Partial Aggregate (cost=216018.33..216018.34 rows=1 width=8) (actual time=2632.714..2632.715 rows=1 loops=3)
>>                Output: PARTIAL count(*)
>>                Worker 0: actual time=2632.583..2632.584 rows=1 loops=1
>>                Worker 1: actual time=2627.517..2627.517 rows=1 loops=1
>>                -> Parallel Seq Scan on public.pgbench_accounts (cost=0.00..205601.67 rows=417 width=0) (actual time=0.042..1685.542 rows=333 loops=3)
>>                      Worker 0: actual time=0.033..1657.486 rows=3457968 loops=1
>>                      Worker 1: actual time=0.039..1702.979 rows=3741069 loops=1
>> Planning time: 1.026 ms
>> Execution time: 2640.225 ms
>> (15 rows)
>>
>> For example, the above result shows,
>> Parallel Seq Scan : actual rows = 333
>> worker 0 : actual rows = 3457968
>> worker 1 : actual rows = 3741069
>> Summation of these is 10532370, but actual total rows is 1000.
>> I think that Parallel Seq Scan should show actual rows =
>> 1000 (total rows) or actual rows = 2800963 (rows collected by
>> itself). (1000 maybe better)
>
> You have to read the rows at Parallel Seq Scan nodes as the total count of
> rows, but you have to consider the loops parameter as well.

IMHO, "actual rows" of "Parallel Seq Scan" should not be divided by the loops, because the total rows processed here is 1000, not 333 * 3.
I think the actual row value shown here, "333", is a bit confusing and tricky for users.

>> After spending time investigating this behaviour, ISTM that the problem
>> is nloops of Parallel Seq Scan.
>> Parallel Seq Scan is done only once, but nloops is incremented to 3.
>
> nloops here indicates that it is done by 2 workers and the master backend.

Right, but I'm not sure "loops" is an appropriate word for this context.

>> So its "actual rows" is calculated as 333 (1000 / 3) at
>> explain.c:L1223.
>
> That's how it should be considered. You might want to compare the behaviour
> with other cases where the value of nloops is used.
>
>> Is it a bug?
>
> I don't think so.

I would like to propose a few modifications for parallel queries to make it more clear.

>>    -> Parallel Seq Scan on public.pgbench_accounts
>>       (cost=0.00..205601.67 rows=417 width=0) (actual time=0.042..1685.542 rows=1000 workers=2)
>>          Parent: actual time=0.033..1657.486 rows=2800963 loops=1
>>          Worker 0: actual time=0.033..1657.486 rows=3457968 loops=1
>>          Worker 1: actual time=0.039..1702.979 rows=3741069 loops=1

How about this?

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
Re: [HACKERS] pg_stat_statements query jumbling question
On 2015/10/03 6:18, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 7:41 PM, Satoshi Nagayasu <sn...@uptime.jp> wrote:
>> I know this still needs to be discussed, but I would like to submit a
>> patch for further discussion at the next CF, 2015-11.
>
> I think I already expressed this in my explanation of the current
> behavior, but to be clear: -1 from me to this proposal.

I would like to introduce queryId to the pgaudit and sql_firewall extensions to determine query groups. queryId could be useful if it were available in those modules.

I think users may want to do that based on object names, because they issue queries with object names, not internal object IDs. Changing the queryId after re-creating the same table may confuse users, because the query string they issue has not changed.

At least, I would like to give users some options to choose from. Is that possible and/or reasonable?

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>
Re: [HACKERS] pg_stat_statements query jumbling question
On 2015/09/01 14:39, Satoshi Nagayasu wrote: > On 2015/09/01 14:01, Tom Lane wrote: >> Satoshi Nagayasu <sn...@uptime.jp> writes: >>> On 2015/09/01 13:41, Peter Geoghegan wrote: >>>> If you want to use the queryId field directly, which I recall you >>>> mentioning before, then that's harder. There is simply no contract >>>> among extensions for "owning" a queryId. But when the fingerprinting >>>> code is moved into core, then I think at that point queryId may cease >>>> to be even a thing that pg_stat_statements theoretically has the right >>>> to write into. Rather, it just asks the core system to do the >>>> fingerprinting, and finds it within queryId. At the same time, other >>>> extensions may do the same, and don't need to care about each other. >>>> >>>> Does that work for you? >> >>> Yes. I think so. >> >>> I need some query fingerprint to determine query group. I want queryid >>> to keep the same value when query strings are the same (except literal >>> values). >> >> The problem I've got with this is the unquestioned assumption that every >> application for query IDs will have exactly the same requirements for >> what the ID should include or ignore. > > I'm not confident about that too, but at least, I think we will be able > to collect most common use cases as of today. (aka best guess. :) > > And IMHO it would be ok to change the spec in future release. I know this still needs to be discussed, but I would like to submit a patch for further discussion at the next CF, 2015-11. 
Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 59b8a2e..2f6fb99 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -76,6 +76,8 @@
 #include "tcop/utility.h"
 #include "utils/builtins.h"
 #include "utils/memutils.h"
+#include "utils/rel.h"
+#include "utils/relcache.h"
 
 PG_MODULE_MAGIC;
@@ -2286,6 +2288,7 @@ static void
 JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 {
 	ListCell   *lc;
+	Relation	rel;
 
 	foreach(lc, rtable)
 	{
@@ -2296,7 +2299,9 @@ JumbleRangeTable(pgssJumbleState *jstate, List *rtable)
 		switch (rte->rtekind)
 		{
 			case RTE_RELATION:
-				APP_JUMB(rte->relid);
+				rel = RelationIdGetRelation(rte->relid);
+				APP_JUMB_STRING(RelationGetRelationName(rel));
+				RelationClose(rel);
 				JumbleExpr(jstate, (Node *) rte->tablesample);
 				break;
 			case RTE_SUBQUERY:
[HACKERS] pg_stat_statements query jumbling question
Hi,

I have a question on jumbling queries in pg_stat_statements.

I found that JumbleRangeTable() uses the relation oid in RangeTblEntry. Obviously, this results in a different queryid when the table gets re-created (dropped and created).

Why don't we use the relation name (looking it up in the catalog) for query jumbling? For performance reasons?

Regards,
--
NAGAYASU Satoshi
Re: [HACKERS] pg_stat_statements query jumbling question
On 2015/09/01 13:41, Peter Geoghegan wrote:
> On Mon, Aug 31, 2015 at 9:29 PM, Satoshi Nagayasu <sn...@uptime.jp> wrote:
>> BTW, I'm interested in improving the queryid portability now because
>> I'd like to use it in other extensions. :) That's the reason why I'm
>> looking at query jumbling here.
>
> Are you interested in having the query fingerprinting/jumbling
> infrastructure available to all backend code? That seems like a good
> idea to me generally.

Yes. I've been working on the sql_firewall extension [1], which is totally built on top of pg_stat_statements.

[1] http://pgsnaga.blogspot.jp/2015/08/postgresql-sql-firewall.html

As of today, sql_firewall has duplicated code for query jumbling. But if it goes into the core, that would be fantastic.

> I would like to be able to put queryId in log_line_prefix, or to
> display it within EXPLAIN, and have it available everywhere. I like
> the idea of per-query log_min_duration_statement settings.

Sounds cool. :)

> If you want to use the queryId field directly, which I recall you
> mentioning before, then that's harder. There is simply no contract
> among extensions for "owning" a queryId. But when the fingerprinting
> code is moved into core, then I think at that point queryId may cease
> to be even a thing that pg_stat_statements theoretically has the right
> to write into. Rather, it just asks the core system to do the
> fingerprinting, and finds it within queryId. At the same time, other
> extensions may do the same, and don't need to care about each other.
>
> Does that work for you?

Yes. I think so.

I need some query fingerprint to determine query groups. I want queryid to keep the same value when query strings are the same (except for literal values). Another reason is just that I need to import/export query ids.

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>
Re: [HACKERS] pg_stat_statements query jumbling question
On 2015/09/01 14:01, Tom Lane wrote:
> Satoshi Nagayasu <sn...@uptime.jp> writes:
>> On 2015/09/01 13:41, Peter Geoghegan wrote:
>>> If you want to use the queryId field directly, which I recall you
>>> mentioning before, then that's harder. There is simply no contract
>>> among extensions for "owning" a queryId. But when the fingerprinting
>>> code is moved into core, then I think at that point queryId may cease
>>> to be even a thing that pg_stat_statements theoretically has the right
>>> to write into. Rather, it just asks the core system to do the
>>> fingerprinting, and finds it within queryId. At the same time, other
>>> extensions may do the same, and don't need to care about each other.
>>>
>>> Does that work for you?
>
>> Yes. I think so.
>
>> I need some query fingerprint to determine query group. I want queryid
>> to keep the same value when query strings are the same (except literal
>> values).
>
> The problem I've got with this is the unquestioned assumption that every
> application for query IDs will have exactly the same requirements for
> what the ID should include or ignore.

I'm not confident about that too, but at least, I think we will be able to collect most common use cases as of today. (aka best guess. :)

And IMHO it would be OK to change the spec in a future release.

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>
Re: [HACKERS] pg_stat_statements query jumbling question
On 2015/09/01 12:36, Peter Geoghegan wrote:
> On Mon, Aug 31, 2015 at 8:32 PM, Satoshi Nagayasu <sn...@uptime.jp> wrote:
>> Why don't we use relation name (with looking up the catalog) on query
>> jumbling? For performance reason?
>
> I think that there is a good case for preferring this behavior. While
> it is a little confusing that pg_stat_statements does not change the
> representative query string, renaming a table does not make it a
> substantively different table.
>
> There is, IIRC, one case where a string is jumbled directly (CTE
> name). It's usually not the right thing, IMV.

Thanks for the comment. I've never considered that. Interesting.

From the user's point of view, IMHO, it would avoid confusion if the queryid were determined only by visible values: userid, dbid and the query string itself.

BTW, I'm interested in improving the queryid portability now because I'd like to use it in other extensions. :) That's the reason why I'm looking at query jumbling here.

Thoughts?

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>
Re: [HACKERS] Assert in pg_stat_statements
2015-08-10 0:04 GMT+09:00 Robert Haas <robertmh...@gmail.com>:
> On Sun, Aug 9, 2015 at 1:36 AM, Satoshi Nagayasu <sn...@uptime.jp> wrote:
>> On 2015/08/08 22:32, Robert Haas wrote:
>>> On Sat, Aug 8, 2015 at 5:00 AM, Satoshi Nagayasu <sn...@uptime.jp> wrote:
>>>> I just found that pg_stat_statements causes an assert when queryId is
>>>> set by another module, which is loaded prior to pg_stat_statements in
>>>> the shared_preload_libraries parameter.
>>>>
>>>> Theoretically, queryId in the shared memory could be set by another
>>>> module by design. So, IMHO, pg_stat_statements should not cause an
>>>> assert even if queryId already has a non-zero value --- my module has
>>>> this problem now.
>>>>
>>>> Is my understanding correct? Any comments?
>>>
>>> Seems like an ERROR would be more appropriate than an Assert.
>>
>> Well, it's not that simple, and I'm afraid that it may not fix the issue.
>>
>> Here, let's assume that we have two modules (foo and pg_stat_statements)
>> which calculate, use and store Query->queryId independently. What we may
>> see when the two are enabled is the following.
>>
>> (1) Enable the two modules in shared_preload_libraries.
>>
>>     shared_preload_libraries = 'foo,pg_stat_statements'
>>
>> (2) Once a query comes in, a callback function in module foo associated
>> with post_parse_analyze_hook calculates and stores queryId in
>> Query->queryId.
>>
>> (3) After that, a callback function (pgss_post_parse_analyze) in
>> pg_stat_statements associated with post_parse_analyze_hook detects that
>> Query->queryId has a non-zero value, and asserts on it.
>>
>> As a result, we can not have two or more modules that use queryId at the
>> same time. But I think it should be possible, because one of the
>> advantages of PostgreSQL is its extensibility.
>>
>> So, I think the problem here is the fact that pg_stat_statements treats
>> a non-zero value in queryId as an error even if it is exactly the same
>> value the other module computed.
>
> I'm not too excited about supporting the use case where there are two
> people using queryId but it just so happens that they always set exactly
> the same value. That seems like a weird setup.
I think so too, but unfortunately, I have to do that to work with 9.4 and 9.5.

> Wouldn't that mean both modules were applying the same jumble algorithm,
> and therefore you're doing the work of computing the jumble twice per
> query?

Basically yes, because both modules should work not only together, but also solely. So, my module needs to have the capability to calculate queryId itself.

More precisely, if we have the following configuration,

    shared_preload_libraries = 'foo,pg_stat_statements'

my module foo calculates queryId itself, and pg_stat_statements would do the same (and crashes when built with --enable-cassert).

If we have the following configuration,

    shared_preload_libraries = 'pg_stat_statements,foo'

my module foo can skip the queryId calculation because pg_stat_statements has already done it.

Of course, my module foo should be able to work solely, as follows.

    shared_preload_libraries = 'foo'

> It would be better to have the second module have an option not to
> compute the query ID and just do whatever else it does. Then, if you
> want to use that other module with pg_stat_statements, flip the GUC.

Yes, that's exactly what I'm doing in my module, and what I intended in my first patch for pg_stat_statements on this thread. :)

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>
Re: [HACKERS] Assert in pg_stat_statements
2015-08-10 2:23 GMT+09:00 Tom Lane <t...@sss.pgh.pa.us>:
> Robert Haas <robertmh...@gmail.com> writes:
>> I'm not too excited about supporting the use case where there are two
>> people using queryId but it just so happens that they always set exactly
>> the same value. That seems like a weird setup. Wouldn't that mean both
>> modules were applying the same jumble algorithm, and therefore you're
>> doing the work of computing the jumble twice per query?
>
> Not only that, but you'd need to keep the two modules in sync, which
> would be one of the greatest recipes for bugs we've thought of yet.
>
> If there's actually a use case for that sort of thing, I would vote for
> moving the jumble-calculation code into core and making both of the
> interested extensions do
>
>     /* Calculate query hash if it's not been done yet */
>     if (query->queryId == 0)
>         calculate_query_hash(query);

I vote for this too. Jumble calculation is the smartest way to identify query groups, so several modules can take advantage of it if it is in the core.

Regards,
--
Satoshi Nagayasu <sn...@uptime.jp>
[HACKERS] Assert in pg_stat_statements
Hi,

I just found that pg_stat_statements causes an assert when queryId is set by another module, which is loaded prior to pg_stat_statements in the shared_preload_libraries parameter.

Theoretically, queryId in the shared memory could be set by another module by design. So, IMHO, pg_stat_statements should not cause an assert even if queryId already has a non-zero value --- my module has this problem now.

Is my understanding correct? Any comments?

Regards,
--
NAGAYASU Satoshi <sn...@uptime.jp>

commit b975d7c2fe1b36a3ded1e0960be191466704e0fc
Author: Satoshi Nagayasu <sn...@uptime.jp>
Date:   Sat Aug 8 08:51:45 2015 +

    Fix pg_stat_statements to avoid assert failure when queryId already
    has a non-zero value.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 59b8a2e..84c5200 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -776,8 +776,9 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query)
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query);
 
-	/* Assert we didn't do this already */
-	Assert(query->queryId == 0);
+	/* Assume that other module has calculated and set queryId */
+	if (query->queryId > 0)
+		return;
 
 	/* Safety check... */
 	if (!pgss || !pgss_hash)
[HACKERS] pg_filedump patch for 9.5
Hi, I have created a patch for pg_filedump to work with 9.5. Here is a list of changes.

* Fix to rename CRC32 macros to work with 9.5.
* Fix to add missing DBState: DB_SHUTDOWNED_IN_RECOVERY.
* Fix to add missing page flags for Btree and GIN.
* Update copyright date.

Please take a look. Any comments are welcome. Regards, -- NAGAYASU Satoshi sn...@uptime.jp

diff --git a/Makefile b/Makefile
index c64dfbf..d80a3a3 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 # View README.pg_filedump first
 
 # note this must match version macros in pg_filedump.h
-FD_VERSION=9.3.0
+FD_VERSION=9.5.0
 
 CC=gcc
 CFLAGS=-g -O -Wall -Wmissing-prototypes -Wmissing-declarations
diff --git a/README.pg_filedump b/README.pg_filedump
index 3a04d59..13d51ff 100644
--- a/README.pg_filedump
+++ b/README.pg_filedump
@@ -2,7 +2,7 @@
 pg_filedump - Display formatted contents of a PostgreSQL heap, index,
 or control file.
 
 Copyright (c) 2002-2010 Red Hat, Inc.
-Copyright (c) 2011-2014, PostgreSQL Global Development Group
+Copyright (c) 2011-2015, PostgreSQL Global Development Group
 
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
diff --git a/pg_filedump.c b/pg_filedump.c
index 6cd02a9..1dfc524 100644
--- a/pg_filedump.c
+++ b/pg_filedump.c
@@ -3,7 +3,7 @@
  *	formatting heap (data), index and control files.
  *
  * Copyright (c) 2002-2010 Red Hat, Inc.
- * Copyright (c) 2011-2014, PostgreSQL Global Development Group
+ * Copyright (c) 2011-2015, PostgreSQL Global Development Group
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
@@ -24,7 +24,7 @@
 
 #include "pg_filedump.h"
 
-#include "utils/pg_crc_tables.h"
+#include "utils/pg_crc.h"
 
 /* checksum_impl.h uses Assert, which doesn't work outside the server */
 #undef Assert
@@ -91,7 +91,7 @@ DisplayOptions(unsigned int validOptions)
 		printf
 			("\nVersion %s (for %s)"
 			 "\nCopyright (c) 2002-2010 Red Hat, Inc."
-			 "\nCopyright (c) 2011-2014, PostgreSQL Global Development Group\n",
+			 "\nCopyright (c) 2011-2015, PostgreSQL Global Development Group\n",
 			 FD_VERSION, FD_PG_VERSION);
 
 	printf
@@ -1132,6 +1132,8 @@ FormatSpecial()
 			strcat(flagString, "SPLITEND|");
 		if (btreeSection->btpo_flags & BTP_HAS_GARBAGE)
 			strcat(flagString, "HASGARBAGE|");
+		if (btreeSection->btpo_flags & BTP_INCOMPLETE_SPLIT)
+			strcat(flagString, "INCOMPLETESPLIT|");
 		if (strlen(flagString))
 			flagString[strlen(flagString) - 1] = '\0';
@@ -1216,6 +1218,10 @@ FormatSpecial()
 			strcat(flagString, "LIST|");
 		if (ginSection->flags & GIN_LIST_FULLROW)
 			strcat(flagString, "FULLROW|");
+		if (ginSection->flags & GIN_INCOMPLETE_SPLIT)
+			strcat(flagString, "INCOMPLETESPLIT|");
+		if (ginSection->flags & GIN_COMPRESSED)
+			strcat(flagString, "COMPRESSED|");
 		if (strlen(flagString))
 			flagString[strlen(flagString) - 1] = '\0';
 		printf("  GIN Index Section:\n"
@@ -1340,9 +1346,9 @@ FormatControl()
 	char	   *dbState;
 
 	/* Compute a local copy of the CRC to verify the one on disk */
-	INIT_CRC32(crcLocal);
-	COMP_CRC32(crcLocal, buffer, offsetof(ControlFileData, crc));
-	FIN_CRC32(crcLocal);
+	INIT_CRC32C(crcLocal);
+	COMP_CRC32C(crcLocal, buffer, offsetof(ControlFileData, crc));
+	FIN_CRC32C(crcLocal);
 
 	/* Grab a readable version of the database state */
 	switch (controlData->state)
@@ -1353,6 +1359,9 @@ FormatControl()
 		case DB_SHUTDOWNED:
 			dbState = "SHUTDOWNED";
 			break;
+		case DB_SHUTDOWNED_IN_RECOVERY:
+			dbState = "SHUTDOWNED_IN_RECOVERY";
+			break;
 		case DB_SHUTDOWNING:
 			dbState = "SHUTDOWNING";
 			break;
@@ -1400,7 +1409,7 @@ FormatControl() Maximum Index Keys: %u\n TOAST Chunk Size: %u\n Date and Time Type Storage:
Re: [HACKERS] Assert in pg_stat_statements
On 2015/08/08 22:32, Robert Haas wrote: On Sat, Aug 8, 2015 at 5:00 AM, Satoshi Nagayasu sn...@uptime.jp wrote: I just found that pg_stat_statements causes an assert failure when queryId is set by another module, which is loaded prior to pg_stat_statements in the shared_preload_libraries parameter. Theoretically, queryId in shared memory could be set by another module by design. So, IMHO, pg_stat_statements should not assert even if queryId already has a non-zero value --- my module has this problem now. Is my understanding correct? Any comments? Seems like an ERROR would be more appropriate than an Assert. Well, it's not that simple, and I'm afraid that it may not fix the issue. Here, let's assume that we have two modules (foo and pg_stat_statements) which calculate, use and store Query->queryId independently. What we may see when the two are enabled is the following.

(1) Enable the two modules in shared_preload_libraries.

    shared_preload_libraries = 'foo,pg_stat_statements'

(2) Once a query comes in, a callback function in module foo associated with post_parse_analyze_hook calculates and stores the queryId in Query->queryId.

(3) After that, the callback function (pgss_post_parse_analyze) in pg_stat_statements associated with post_parse_analyze_hook detects that Query->queryId has a non-zero value, and the Assert fires.

As a result, we cannot have two or more modules that use queryId at the same time. But I think that should be possible, because one of the advantages of PostgreSQL is its extensibility. So, I think the problem here is that pg_stat_statements treats a non-zero value in queryId as an error even if it is exactly the same value the other module would compute. Regards, -- NAGAYASU Satoshi sn...@uptime.jp
Re: [HACKERS] Fix to expose a GUC variable, Log_disconnections, to outside of postgres.c
On 2015/07/09 15:30, Heikki Linnakangas wrote: On 07/09/2015 07:19 AM, Satoshi Nagayasu wrote: On 2015/07/09 13:06, Tom Lane wrote: Satoshi Nagayasu sn...@uptime.jp writes: I just found that Log_disconnections value has not been exposed to outside of postgres.c. Despite that, Log_connections has already been exposed. Why would an external module need to touch either one? To check settings of GUC variables from a shared preload library. For example, I'd like to print WARNING in _PG_init() when some GUC variable is disabled on postmaster startup. I still have a hard time seeing why an extension would be interested in Log_disconnections. But hey, extensions do strange things.. Definitely. :) You could use GetConfigOption(). I'm not necessarily opposed to exposing Log_disconnections, but with GetConfigOption() it will work with earlier versions too. That's it! This is what I'm looking for. Thanks a lot. :) Regards, -- NAGAYASU Satoshi sn...@uptime.jp
[HACKERS] Fix to expose a GUC variable, Log_disconnections, to outside of postgres.c
Hi, I just found that the Log_disconnections value has not been exposed to outside of postgres.c, even though Log_connections has already been exposed. So, I'd like to add the attached fix to allow other modules to touch the Log_disconnections value. Any comments? Regards, -- NAGAYASU Satoshi sn...@uptime.jp

commit afffca193b69ea5eb42f5e7273d48a6a82eb7e37
Author: Satoshi Nagayasu sn...@uptime.jp
Date: Thu Jul 9 03:42:31 2015 +

    Fix to expose Log_disconnections GUC variable to the outside of
    postgres.c.

diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index d160304..4c76f30 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -25,6 +25,7 @@ extern bool ClientAuthInProgress;
 extern int	PreAuthDelay;
 extern int	AuthenticationTimeout;
 extern bool Log_connections;
+extern bool Log_disconnections;
 extern bool log_hostname;
 extern bool enable_bonjour;
 extern char *bonjour_name;
Re: [HACKERS] Fix to expose a GUC variable, Log_disconnections, to outside of postgres.c
On 2015/07/09 13:06, Tom Lane wrote: Satoshi Nagayasu sn...@uptime.jp writes: I just found that Log_disconnections value has not been exposed to outside of postgres.c. Despite that, Log_connections has already been exposed. Why would an external module need to touch either one? To check settings of GUC variables from a shared preload library. For example, I'd like to print WARNING in _PG_init() when some GUC variable is disabled on postmaster startup. Regards, -- NAGAYASU Satoshi sn...@uptime.jp
Re: [HACKERS] pg_rewind in contrib
Hi, On 2014/12/12 23:13, Heikki Linnakangas wrote: Hi, I'd like to include pg_rewind in contrib. I originally wrote it as an external project so that I could quickly get it working with the existing versions, and because I didn't feel it was quite ready for production use yet. Now, with the WAL format changes in master, it is a lot more maintainable than before. Many bugs have been fixed since the first prototypes, and I think it's fairly robust now. I propose that we include pg_rewind in contrib/ now. Attached is a patch for that. It just includes the latest sources from the current pg_rewind repository at https://github.com/vmware/pg_rewind. It is released under the PostgreSQL license. For those who are not familiar with pg_rewind, it's a tool that allows repurposing an old master server as a new standby server, after promotion, even if the old master was not shut down cleanly. That's a very often requested feature. I'm looking into pg_rewind with a very first scenario. My scenario is here. https://github.com/snaga/pg_rewind_test/blob/master/pg_rewind_test.sh At least, I think the file descriptor srcfd should be closed before exiting copy_file_range(). I got a "can't open file" error with "too many open files" while running pg_rewind.

diff --git a/contrib/pg_rewind/copy_fetch.c b/contrib/pg_rewind/copy_fetch.c
index bea1b09..5a8cc8e 100644
--- a/contrib/pg_rewind/copy_fetch.c
+++ b/contrib/pg_rewind/copy_fetch.c
@@ -280,6 +280,8 @@ copy_file_range(const char *path, off_t begin, off_t end, bool trunc)
 		write_file_range(buf, begin, readlen);
 		begin += readlen;
 	}
+
+	close(srcfd);
 }
 
 /*

And I have one question here. pg_rewind assumes that the source PostgreSQL has had, at least, one checkpoint after getting promoted. I think the target timeline id in the pg_control file to be read is only available after the first checkpoint. Right?
Regards, - Heikki -- NAGAYASU Satoshi sn...@uptime.jp
Re: [HACKERS] pg_rewind in contrib
On 2014/12/16 18:37, Heikki Linnakangas wrote: On 12/16/2014 11:23 AM, Satoshi Nagayasu wrote: pg_rewind assumes that the source PostgreSQL has, at least, one checkpoint after getting promoted. I think the target timeline id in the pg_control file to be read is only available after the first checkpoint. Right? Yes, it does assume that the source server (= old standby, new master) has had at least one checkpoint after promotion. It probably should be more explicit about it: If there hasn't been a checkpoint, you will currently get an error source and target cluster are both on the same timeline, which isn't very informative. Yes, I got the message, so I could find the checkpoint thing. It could be more informative, or some hint message could be added. I assume that by target timeline ID you meant the timeline ID of the source server, i.e. the timeline that the target server should be rewound to. Yes. Target timeline I mean here is the timeline id coming from the promoted master (== source server == old standby). I got it. Thanks. I'm going to look into more details. Regards, - Heikki -- NAGAYASU Satoshi sn...@uptime.jp
Re: [HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read
(2013/10/11 7:32), Mark Kirkwood wrote: On 11/10/13 11:09, Mark Kirkwood wrote: On 16/09/13 16:20, Satoshi Nagayasu wrote: (2013/09/15 11:07), Peter Eisentraut wrote: On Sat, 2013-09-14 at 16:18 +0900, Satoshi Nagayasu wrote: I'm looking forward to seeing more feedback on this approach, in terms of design and performance improvement. So, I have submitted this for the next CF. Your patch fails to build: pgstattuple.c: In function ‘pgstat_heap_sample’: pgstattuple.c:737:13: error: ‘SnapshotNow’ undeclared (first use in this function) pgstattuple.c:737:13: note: each undeclared identifier is reported only once for each function it appears in Thanks for checking. Fixed to eliminate SnapshotNow. This seems like a cool idea! I took a quick look, and initally replicated the sort of improvement you saw: bench=# explain analyze select * from pgstattuple('pgbench_accounts'); QUERY PLAN Function Scan on pgstattuple (cost=0.00..0.01 rows=1 width=72) (actual time=786.368..786.369 rows=1 loops=1) Total runtime: 786.384 ms (2 rows) bench=# explain analyze select * from pgstattuple2('pgbench_accounts'); NOTICE: pgstattuple2: SE tuple_count 0.00, tuple_len 0.00, dead_tuple_count 0.00, dead_tuple_len 0.00, free_space 0.00 QUERY PLAN Function Scan on pgstattuple2 (cost=0.00..0.01 rows=1 width=72) (actual time=12.004..12.005 rows=1 loops=1) Total runtime: 12.019 ms (2 rows) I wondered what sort of difference eliminating caching would make: $ sudo sysctl -w vm.drop_caches=3 Repeating the above queries: bench=# explain analyze select * from pgstattuple('pgbench_accounts'); QUERY PLAN Function Scan on pgstattuple (cost=0.00..0.01 rows=1 width=72) (actual time=9503.774..9503.776 rows=1 loops=1) Total runtime: 9504.523 ms (2 rows) bench=# explain analyze select * from pgstattuple2('pgbench_accounts'); NOTICE: pgstattuple2: SE tuple_count 0.00, tuple_len 0.00, dead_tuple_count 0.00, dead_tuple_len 0.00, free_space 0.00 QUERY PLAN Function Scan on pgstattuple2 (cost=0.00..0.01 rows=1 
width=72) (actual time=12330.630..12330.631 rows=1 loops=1) Total runtime: 12331.353 ms (2 rows) So the sampling code seems *slower* when the cache is completely cold - is that expected? (I have not looked at how the code works yet - I'll dive in later if I get a chance)! Thanks for testing that. It would be very helpful to improve the performance. Quietly replying to myself - looking at the code the sampler does 3000 random page reads... I guess this is slower than 163935 (number of pages in pgbench_accounts) sequential page reads thanks to os readahead on my type of disk (WD Velociraptor). Tweaking the number of random reads (i.e. the sample size) down helps - but obviously that can impact estimation accuracy. Thinking about this a bit more, I guess the elapsed runtime is not the *only* thing to consider - the sampling code will cause way less disruption to the os page cache (3000 pages vs possibly lots more than 3000 for reading an entire relation). Thoughts? I think it could be improved by sorting the sample block numbers *before* the physical block reads, in order to eliminate random access on the disk. pseudo code:

--
for (i = 0; i < SAMPLE_SIZE; i++)
{
    sample_block[i] = random();
}

qsort(sample_block);

for (i = 0; i < SAMPLE_SIZE; i++)
{
    buf = ReadBuffer(rel, sample_block[i]);
    do_some_stats_stuff(buf);
}
--

I guess it would be helpful for reducing the random access. Any comments? -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [PATCH] Statistics collection for CLUSTER command
(2013/08/08 20:52), Vik Fearing wrote: As part of routine maintenance monitoring, it is interesting for us to have statistics on the CLUSTER command (timestamp of last run, and number of runs since stat reset) like we have for (auto)ANALYZE and (auto)VACUUM. Patch against today's HEAD attached. I would add this to the next commitfest but I seem to be unable to log in with my community account (I can log in to the wiki). Help appreciated. I have reviewed the patch. It builds with the latest HEAD and passes the regression tests. It looks good enough, and I'd like to add a test case here, not only for the view definition but also to check that it works correctly. Please take a look at the attached one. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 56bace1..ba614b2 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -28,7 +28,13 @@ SELECT pg_sleep(2.0);
 CREATE TEMP TABLE prevstats AS
 SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
        (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
-       (b.idx_blks_read + b.idx_blks_hit) AS idx_blks
+       (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
+       coalesce(t.last_vacuum, now()) AS last_vacuum,
+       coalesce(t.last_analyze, now()) AS last_analyze,
+       coalesce(t.last_cluster, now()) AS last_cluster,
+       t.vacuum_count,
+       t.analyze_count,
+       t.cluster_count
   FROM pg_catalog.pg_stat_user_tables AS t,
        pg_catalog.pg_statio_user_tables AS b
  WHERE t.relname='tenk2' AND b.relname='tenk2';
@@ -111,4 +117,27 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
  t        | t
 (1 row)
 
+-- table maintenance stats
+ANALYZE tenk2;
+VACUUM tenk2;
+CLUSTER tenk2 USING tenk2_unique1;
+SELECT pg_sleep(1.0);
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SELECT st.last_vacuum > pr.last_vacuum,
+       st.last_analyze > pr.last_analyze,
+       st.last_cluster > pr.last_cluster,
+       st.vacuum_count > pr.vacuum_count,
+       st.analyze_count > pr.analyze_count,
+       st.cluster_count > pr.cluster_count
+  FROM pg_stat_user_tables AS st, prevstats pr
+ WHERE st.relname='tenk2';
+ ?column? | ?column? | ?column? | ?column? | ?column? | ?column? 
+----------+----------+----------+----------+----------+----------
+ t        | t        | t        | t        | t        | t
+(1 row)
+
 -- End of Stats Test
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index bb349b2..71e5e27 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -22,7 +22,13 @@ SELECT pg_sleep(2.0);
 CREATE TEMP TABLE prevstats AS
 SELECT t.seq_scan, t.seq_tup_read, t.idx_scan, t.idx_tup_fetch,
        (b.heap_blks_read + b.heap_blks_hit) AS heap_blks,
-       (b.idx_blks_read + b.idx_blks_hit) AS idx_blks
+       (b.idx_blks_read + b.idx_blks_hit) AS idx_blks,
+       coalesce(t.last_vacuum, now()) AS last_vacuum,
+       coalesce(t.last_analyze, now()) AS last_analyze,
+       coalesce(t.last_cluster, now()) AS last_cluster,
+       t.vacuum_count,
+       t.analyze_count,
+       t.cluster_count
   FROM pg_catalog.pg_stat_user_tables AS t,
        pg_catalog.pg_statio_user_tables AS b
  WHERE t.relname='tenk2' AND b.relname='tenk2';
@@ -81,4 +87,20 @@ SELECT st.heap_blks_read + st.heap_blks_hit >= pr.heap_blks + cl.relpages,
   FROM pg_statio_user_tables AS st, pg_class AS cl, prevstats AS pr
  WHERE st.relname='tenk2' AND cl.relname='tenk2';
 
+-- table maintenance stats
+ANALYZE tenk2;
+VACUUM tenk2;
+CLUSTER tenk2 USING tenk2_unique1;
+
+SELECT pg_sleep(1.0);
+
+SELECT st.last_vacuum > pr.last_vacuum,
+       st.last_analyze > pr.last_analyze,
+       st.last_cluster > pr.last_cluster,
+       st.vacuum_count > pr.vacuum_count,
+       st.analyze_count > pr.analyze_count,
+       st.cluster_count > pr.cluster_count
+  FROM pg_stat_user_tables AS st, prevstats pr
+ WHERE st.relname='tenk2';
+
 -- End of Stats Test
Re: [HACKERS] Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
(2013/07/06 1:16), Pavel Stehule wrote: I am sending a patch that removes strict requirements for DROP IF EXISTS statements. This behavior is similar to our ALTER IF EXISTS behavior now.

postgres=# DROP CAST IF EXISTS (sss AS public.casttesttype);
NOTICE: types sss and public.casttesttype does not exist, skipping
DROP CAST
postgres=# DROP FUNCTION IF EXISTS public.pt_in_widget(point, widget);
NOTICE: function public.pt_in_widget(point,widget) does not exist, skipping
DROP FUNCTION
postgres=# DROP OPERATOR IF EXISTS public.% (point, widget);
NOTICE: operator public.% does not exist, skipping
DROP OPERATOR
postgres=# DROP TRIGGER test_trigger_exists ON no_such_table;
ERROR: relation no_such_table does not exist
postgres=# DROP TRIGGER IF EXISTS test_trigger_exists ON no_such_table;
NOTICE: trigger test_trigger_exists for table no_such_table does not exist, skipping
DROP TRIGGER

This functionality is necessary for a correct, quiet reload from a dump without possible warnings. I'm looking at this patch, and I have a question here. Should DROP TRIGGER IF EXISTS ignore the error for both a non-existing trigger and a non-existing table? Or only for a non-existing trigger? I've not understood the pg_restore issue precisely so far, but IMHO DROP TRIGGER IF EXISTS means if the _trigger_ exists, not if the _table_ exists. Is this a correct and/or an expected behavior? Sorry if I missed some consensus which we already made. Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read
(2013/09/15 11:07), Peter Eisentraut wrote: On Sat, 2013-09-14 at 16:18 +0900, Satoshi Nagayasu wrote: I'm looking forward to seeing more feedback on this approach, in terms of design and performance improvement. So, I have submitted this for the next CF. Your patch fails to build: pgstattuple.c: In function ‘pgstat_heap_sample’: pgstattuple.c:737:13: error: ‘SnapshotNow’ undeclared (first use in this function) pgstattuple.c:737:13: note: each undeclared identifier is reported only once for each function it appears in Thanks for checking. Fixed to eliminate SnapshotNow. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp diff --git a/contrib/pgstattuple/pgstattuple--1.1--1.2.sql b/contrib/pgstattuple/pgstattuple--1.1--1.2.sql index 2783a63..8ebec6f 100644 --- a/contrib/pgstattuple/pgstattuple--1.1--1.2.sql +++ b/contrib/pgstattuple/pgstattuple--1.1--1.2.sql @@ -37,3 +37,17 @@ CREATE FUNCTION pg_relpages(IN relname regclass) RETURNS BIGINT AS 'MODULE_PATHNAME', 'pg_relpagesbyid' LANGUAGE C STRICT; + +CREATE FUNCTION pgstattuple2(IN relname regclass, +OUT table_len BIGINT, -- physical table length in bytes +OUT tuple_count BIGINT,-- number of live tuples +OUT tuple_len BIGINT, -- total tuples length in bytes +OUT tuple_percent FLOAT8, -- live tuples in % +OUT dead_tuple_count BIGINT, -- number of dead tuples +OUT dead_tuple_len BIGINT, -- total dead tuples length in bytes +OUT dead_tuple_percent FLOAT8, -- dead tuples in % +OUT free_space BIGINT, -- free space in bytes +OUT free_percent FLOAT8) -- free space in % +AS 'MODULE_PATHNAME', 'pgstattuple2' +LANGUAGE C STRICT; + diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.2.sql index e5fa2f5..963eb00 100644 --- a/contrib/pgstattuple/pgstattuple--1.2.sql +++ b/contrib/pgstattuple/pgstattuple--1.2.sql @@ -77,3 +77,17 @@ CREATE FUNCTION pg_relpages(IN relname regclass) RETURNS BIGINT AS 'MODULE_PATHNAME', 'pg_relpagesbyid' LANGUAGE C STRICT; 
+ +CREATE FUNCTION pgstattuple2(IN relname regclass, +OUT table_len BIGINT, -- physical table length in bytes +OUT tuple_count BIGINT,-- number of live tuples +OUT tuple_len BIGINT, -- total tuples length in bytes +OUT tuple_percent FLOAT8, -- live tuples in % +OUT dead_tuple_count BIGINT, -- number of dead tuples +OUT dead_tuple_len BIGINT, -- total dead tuples length in bytes +OUT dead_tuple_percent FLOAT8, -- dead tuples in % +OUT free_space BIGINT, -- free space in bytes +OUT free_percent FLOAT8) -- free space in % +AS 'MODULE_PATHNAME', 'pgstattuple2' +LANGUAGE C STRICT; + diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c index f9ba0a6..2b29d44 100644 --- a/contrib/pgstattuple/pgstattuple.c +++ b/contrib/pgstattuple/pgstattuple.c @@ -41,9 +41,22 @@ PG_MODULE_MAGIC; PG_FUNCTION_INFO_V1(pgstattuple); PG_FUNCTION_INFO_V1(pgstattuplebyid); +PG_FUNCTION_INFO_V1(pgstattuple2); extern Datum pgstattuple(PG_FUNCTION_ARGS); extern Datum pgstattuplebyid(PG_FUNCTION_ARGS); +extern Datum pgstattuple2(PG_FUNCTION_ARGS); + +#define SAMPLE_SIZE 3000 + +typedef struct pgstattuple_block_stats +{ + uint64 tuple_count; + uint64 tuple_len; + uint64 dead_tuple_count; + uint64 dead_tuple_len; + uint64 free_space; /* free/reusable space in bytes */ +} pgstattuple_block_stats; /* * struct pgstattuple_type @@ -66,8 +79,9 @@ typedef void (*pgstat_page) (pgstattuple_type *, Relation, BlockNumber, static Datum build_pgstattuple_type(pgstattuple_type *stat, FunctionCallInfo fcinfo); -static Datum pgstat_relation(Relation rel, FunctionCallInfo fcinfo); +static Datum pgstat_relation(Relation rel, FunctionCallInfo fcinfo, bool enable_sample); static Datum pgstat_heap(Relation rel, FunctionCallInfo fcinfo); +static Datum pgstat_heap_sample(Relation rel, FunctionCallInfo fcinfo); static void pgstat_btree_page(pgstattuple_type *stat, Relation rel, BlockNumber blkno, BufferAccessStrategy bstrategy); @@ -81,6 +95,11 @@ static Datum pgstat_index(Relation rel, 
BlockNumber start, pgstat_page pagefn, FunctionCallInfo fcinfo); static void pgstat_index_page(pgstattuple_type *stat, Page page, OffsetNumber minoff, OffsetNumber maxoff); +static void compute_parameters(pgstattuple_block_stats *block_stats, + BlockNumber sample_size, BlockNumber nblocks, + uint64 *tuple_count, uint64 *tuple_len, + uint64 *dead_tuple_count
Re: [HACKERS] pg_system_identifier()
Hi, I'm catching up with the discussion as a reviewer... (2013/08/27 5:35), Tom Lane wrote: Andres Freund and...@2ndquadrant.com writes: On 2013-08-26 10:10:54 -0700, Josh Berkus wrote: I'm going to reverse my vote, and vote against this patch. The reason why is that I think we should instead have a function: pg_controldata(parameter text) ... which would report *all* strings in pg_controldata. Hence, you'd do instead: pg_controldata('system identifier') This will hopefully spare us from 15 patches incrementally adding all of the individual items in controldata. If anything but the proposed feature, it should be an SRF - passing in text parameters isn't very discoverable. I'm not pleased with the idea that we'd have to dumb all the relevant datatypes down to text so that we could push them through this single function. Also, what about i18n? pg_controldata localizes all the strings it prints, but I doubt that's a great idea for either the input or the output of this proposed function. How about adding a new system view, backed by a new function which returns a single pg_controldata value as text, and using a cast for each column in the view definition?

CREATE VIEW pg_catalog.pg_controldata AS
  SELECT pg_controldata('control_version')::integer AS control_version,
         pg_controldata('catalog_version')::integer AS catalog_version,
         pg_controldata('system_identifier')::bigint AS system_identifier,
         ...
         pg_controldata('next_xlog_file')::char(25) AS next_xlog_file,
         ...
         pg_controldata('encoding')::text AS encoding;

Given that the view can work like an SRF, it allows us to retrieve all the values of pg_controldata, with appropriate types, in a single record from the view:

select * from pg_catalog.pg_controldata;

To get the system identifier value:

select system_identifier from pg_catalog.pg_controldata;

Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC.
http://www.uptime.jp
Re: [HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read
(2013/07/23 20:02), Greg Smith wrote: On 7/23/13 2:16 AM, Satoshi Nagayasu wrote: I've been working on a new pgstattuple function to allow block sampling [1] in order to reduce block reads while scanning a table. A PoC patch is attached. Take a look at all of the messages linked in https://commitfest.postgresql.org/action/patch_view?id=778 Jaime and I tried to do what you're working on then, including a random block sampling mechanism modeled on the stats_target mechanism. We didn't do that as part of pgstattuple though, which was a mistake. Noah created some test cases as part of his thorough review that were not computing the correct results. Getting the results correct for all of the various types of PostgreSQL tables and indexes ended up being much harder than the sampling part. See http://www.postgresql.org/message-id/20120222052747.ge8...@tornado.leadboat.com in particular for that. Thanks for the info. I have read the previous discussion. I'm looking forward to seeing more feedback on this approach, in terms of design and performance improvement. So, I have submitted this for the next CF. This new function, pgstattuple2(), samples only 3,000 blocks (which accounts for 24MB) from the table randomly, and estimates several parameters of the entire table. There should be an input parameter to the function for how much sampling to do, and if it's possible to make the scale for it look like ANALYZE that's helpful too. I have a project for this summer that includes reviving this topic and making sure it works on some real-world systems. If you want to work on this too, I can easily combine that project into what you're doing. Yeah, I'm interested in that. Is there something that can be shared? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] New statistics for WAL buffer dirty writes
Thanks for checking. Revised one attached. (2013/09/10 6:43), Peter Eisentraut wrote: On 9/6/13 11:32 PM, Satoshi Nagayasu wrote: The revised patch for wal buffer statistics is attached. A test script is also attached. Please take a look. You have duplicate OIDs. Run the script duplicate_oids to find them. -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml index 23ebc11..cdced7f 100644 --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1878,6 +1878,13 @@ include 'filename' results in most cases. /para + para +When you see pg_stat_walwriter.dirty_write, which means number +of buffer flushing at buffer full, is continuously increasing +in your running server, you may need to enlarge this buffer +size. + /para + /listitem /varlistentry diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml index 4ec6981..15d9202 100644 --- a/doc/src/sgml/monitoring.sgml +++ b/doc/src/sgml/monitoring.sgml @@ -278,6 +278,14 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re /row row + entrystructnamepg_stat_walwriter/indextermprimarypg_stat_walwriter/primary/indexterm/entry + entryOne row only, showing statistics about the wal writer + process's activity. See xref linkend=pg-stat-walwriter-view + for details. + /entry + /row + + row entrystructnamepg_stat_database/indextermprimarypg_stat_database/primary/indexterm/entry entryOne row per database, showing database-wide statistics. See xref linkend=pg-stat-database-view for details. @@ -735,6 +743,39 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re single row, containing global data for the cluster. 
Re: [HACKERS] New statistics for WAL buffer dirty writes
(2013/09/10 22:48), Peter Eisentraut wrote:
> On 9/10/13 3:37 AM, Satoshi Nagayasu wrote:
>> Thanks for checking. Revised one attached.
>
> Please fix compiler warning:
>
> walwriter.c: In function ‘WalWriterMain’:
> walwriter.c:293:3: warning: implicit declaration of function
> ‘pgstat_send_walwriter’ [-Wimplicit-function-declaration]

Thanks. Fixed.

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 23ebc11..cdced7f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1878,6 +1878,13 @@ include 'filename'
         results in most cases.
        </para>
+      <para>
+       When pg_stat_walwriter.dirty_writes, the number of WAL buffer
+       flushes forced by the buffer being full, is continuously
+       increasing on your running server, you may need to enlarge
+       this buffer size.
+      </para>
+
      </listitem>
     </varlistentry>

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4ec6981..15d9202 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -278,6 +278,14 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
      </row>

      <row>
+      <entry><structname>pg_stat_walwriter</><indexterm><primary>pg_stat_walwriter</primary></indexterm></entry>
+      <entry>One row only, showing statistics about the wal writer
+       process's activity. See <xref linkend="pg-stat-walwriter-view">
+       for details.
+      </entry>
+     </row>
+
+     <row>
       <entry><structname>pg_stat_database</><indexterm><primary>pg_stat_database</primary></indexterm></entry>
       <entry>One row per database, showing database-wide statistics. See
        <xref linkend="pg-stat-database-view"> for details.
@@ -735,6 +743,39 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
    single row, containing global data for the cluster.
   </para>
+  <table id="pg-stat-walwriter-view" xreflabel="pg_stat_walwriter">
+   <title><structname>pg_stat_walwriter</structname> View</title>
+
+   <tgroup cols="3">
+    <thead>
+     <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry><structfield>dirty_writes</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of dirty writes, i.e. WAL buffer flushes forced
+       because the buffers were full.</entry>
+     </row>
+     <row>
+      <entry><structfield>stats_reset</></entry>
+      <entry><type>timestamp with time zone</type></entry>
+      <entry>Time at which these statistics were last reset</entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_walwriter</structname> view will always have a
+   single row, containing global data for the cluster.
+  </para>
+
   <table id="pg-stat-database-view" xreflabel="pg_stat_database">
    <title><structname>pg_stat_database</structname> View</title>
    <tgroup cols="3">

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dc47c47..d0e85c9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2467,6 +2467,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 				WriteRqst.Write = OldPageRqstPtr;
 				WriteRqst.Flush = 0;
 				XLogWrite(WriteRqst, false);
+				WalWriterStats.m_xlog_dirty_writes++;
 				LWLockRelease(WALWriteLock);
 				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 			}

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 575a40f..12a2ed0 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -686,6 +686,11 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

+CREATE VIEW pg_stat_walwriter AS
+    SELECT
+        pg_stat_get_xlog_dirty_writes() AS dirty_writes,
+        pg_stat_get_wal_stat_reset_time() AS stats_reset;
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid       AS umid,

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b5ce2f6..8c56af5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -129,6 +129,14 @@ char	   *pgstat_stat_tmpname = NULL;
  */
 PgStat_MsgBgWriter BgWriterStats;

+/*
+ * WalWriter statistics counter.
+ * This counter is incremented by each XLogWrite call,
+ * both in the wal writer process and each backend.
+ * And then, sent to the stat collector process.
+ */
+PgStat_MsgWalWriter WalWriterStats;
+
 /* --
  * Local data
  * --
@@ -293,6 +301,7 @@ static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg
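The counter the patch adds is bumped at exactly one place: when AdvanceXLInsertBuffer() finds no clean WAL buffer and has to call XLogWrite() itself. A minimal sketch of that condition, using a hypothetical Python model of a fixed ring of WAL pages (all names here are illustrative; the real logic is C inside xlog.c):

```python
class WalBuffers:
    """Illustrative model: a fixed ring of WAL pages. When a record
    needs a fresh page and the oldest page has not been written out
    yet, it must be flushed first -- that forced flush is one
    'dirty write'."""

    def __init__(self, n_pages):
        self.n_pages = n_pages
        self.unwritten = 0        # pages filled but not yet written out
        self.dirty_writes = 0     # counter shown in pg_stat_walwriter

    def advance_page(self):
        """Claim the next WAL page for insertion."""
        if self.unwritten == self.n_pages:
            # Buffer full: write the oldest page before reusing it.
            self.unwritten -= 1
            self.dirty_writes += 1   # WalWriterStats.m_xlog_dirty_writes++
        self.unwritten += 1

    def background_write(self):
        """The walwriter draining the buffers in time: no dirty write."""
        self.unwritten = 0


buf = WalBuffers(n_pages=8)
for _ in range(20):          # 20 pages of WAL, no background writes
    buf.advance_page()
print(buf.dirty_writes)      # 12 forced flushes: 20 pages - 8 buffers
```

When the walwriter keeps up (background_write() runs often enough), dirty_writes stays at zero; that is why a steadily increasing counter is the signal that wal_buffers may be too small.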
Re: [HACKERS] [rfc] overhauling pgstat.stat
(2013/09/09 8:19), Tomas Vondra wrote:

On 8.9.2013 23:04, Jeff Janes wrote:

On Tue, Sep 3, 2013 at 10:09 PM, Satoshi Nagayasu sn...@uptime.jp wrote:

Hi,

(2013/09/04 13:07), Alvaro Herrera wrote:

Satoshi Nagayasu wrote:

As you may know, this file could be hundreds of MB in size, because
pgstat.stat holds all access statistics in each database, and it needs
to read/write the entire pgstat.stat frequently. As a result,
pgstat.stat often generates massive I/O operations, particularly when
the database has a large number of tables.

We already changed it:

commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
Author: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Mon Feb 18 17:56:08 2013 -0300

    Split pgstat file in smaller pieces

Thanks for the comments. I forgot to mention that.

Yes, we have already split the single pgstat.stat file into several
pieces. However, we still need to read/write a large amount of
statistics data when we have a large number of tables in a single
database, or multiple databases being accessed. Right?

Do you have a test case for measuring this? I vaguely remember from
when I was testing the split patch, that I thought that after that
improvement the load that was left was so low that there was little
point in optimizing it further.

This is actually a pretty good point. Creating a synthetic test case is
quite simple - just create 1,000,000 tables in a single database - but
I'm wondering if it's actually realistic. Do we have a real-world
example where the current one stat file per database is not enough?

I have several assumptions for that.

- A single shared database contains thousands of customers.
- Each customer has hundreds of tables and indexes.
- Customers are separated by schemas (namespaces) in a single database.
- The application server uses connection pooling for performance reasons.
- The workload (locality of table access) cannot be predicted.

Looks reasonable?

The reason why I worked on the split patch is that our application is
slightly crazy and creates a lot of tables (+ indexes) on the fly, and
as we have up to a thousand databases on each host, we often ended up
with a huge stat file. Splitting the stat file improved that
considerably, although that's partially because we have the stats on a
tmpfs, so I/O is not a problem, and the CPU overhead is negligible
thanks to splitting the stats per database.

I agree that splitting a single large database into several pieces,
like thousands of tiny databases, could be an option in some cases.
However, what I intend here is eliminating those limitations on
database design.

In fact, when considering connection pooling, splitting a database is
not a good idea, because AFAIK many connection poolers manage
connections per database.

So, I'd like to support 100k tables in a single database. Any comments?

Regards,

But AFAIK there are operating systems where creating a filesystem in
RAM is not that simple - e.g. Windows. In such cases even a moderate
number of objects may be a significant issue I/O-wise.

But then again, I can't really think of a reasonable system creating
that many objects in a single database (except for e.g. a shared
database using schemas instead of databases).

Tomas

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
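The "hundreds of MB" figure in this thread is easy to reproduce as a back-of-envelope estimate: the stats file carries one fixed-size record per table and index, and the collector rewrites it wholesale. The ~200-byte entry size below is an illustrative assumption, not the real sizeof(PgStat_StatTabEntry):

```python
# Back-of-envelope: per-database stats file size per rewrite.
# ENTRY_BYTES is an illustrative assumption, not a measured value.
ENTRY_BYTES = 200

def stat_file_mb(n_tables, n_indexes_per_table=2):
    """Rough stats file size (MB) for a database with n_tables,
    assuming each table carries n_indexes_per_table index entries."""
    n_entries = n_tables * (1 + n_indexes_per_table)
    return n_entries * ENTRY_BYTES / (1024 * 1024)

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} tables -> ~{stat_file_mb(n):7.1f} MB per rewrite")
```

Under these assumptions a million-table database alone accounts for over half a gigabyte per rewrite, which is in the range the thread complains about.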
Re: [HACKERS] New statistics for WAL buffer dirty writes
Hi,

The revised patch for the WAL buffer statistics is attached. A test
script is also attached. Please take a look.

Regards,

(2013/07/19 7:49), Satoshi Nagayasu wrote:

Will revise and re-submit for the next CF.

Regards,

2013/07/19 1:06, Alvaro Herrera wrote:

What happened to this patch? We were waiting on an updated version
from you.

Satoshi Nagayasu wrote:

(2012/12/10 3:06), Tomas Vondra wrote:

On 29.10.2012 04:58, Satoshi Nagayasu wrote:

2012/10/24 1:12, Alvaro Herrera wrote:

Satoshi Nagayasu escribió:

With this patch, the walwriter process and each backend process sum up
dirty writes and send them to the stats collector. So, the value can be
saved in the stats file, and can be kept across restarts. The
statistics can be retrieved using the pg_stat_get_xlog_dirty_writes()
function, and can be reset by calling pg_stat_reset_shared('walwriter').

Now, I have one concern. The reset time can be captured in
globalStats.stat_reset_timestamp, but this value is shared with the
bgwriter one. So, once pg_stat_reset_shared('walwriter') is called, the
stats_reset column in pg_stat_bgwriter represents the reset time for
the walwriter, not for the bgwriter. How should we handle this? Should
we split this value? And should we have a new system view for the
walwriter?

I think the answer to the two last questions is yes. It doesn't seem to
make sense, to me, to have a single reset timing for what are
effectively two separate things.

Please submit an updated patch to the next CF. I'm marking this one
returned with feedback.

Thanks. I attached the latest one, which splits the reset_time for the
bgwriter and the walwriter, and provides a new system view, called
pg_stat_walwriter, to show the dirty write counter and the reset time.

I've done a quick review of the v4 patch:

Thanks for the review, and sorry for my delayed response.

1) applies fine on HEAD, compiles fine

2) make installcheck fails because of a difference in the 'rules' test
suite (there's a new view pg_stat_walwriter - see the attached patch
for a fixed version of expected/rules.out)

Ah, I forgot about the regression test. I will fix it. Thanks.

3) I do agree with Alvaro that using the same struct for two separate
components (bgwriter and walwriter) seems a bit awkward. For example
you need to have two separate stat_reset fields, the reset code becomes
much more verbose (because you need to list individual fields) etc. So
I'd vote to either split this into two structures or keep it as a
single structure (although with two views on top of it).

Ok, I will split it into two structs, PgStat_BgWriterGlobalStats and
PgStat_WalWriterGlobalStats, and will modify PgStat_GlobalStats to hold
those two structs in the stats collector.

4) Are there any other fields that might be interesting? Right now
there's just dirty_writes, but I guess there are other values. E.g. how
much data was actually written etc.?

AFAIK, those numbers can be obtained by calling
pg_current_xlog_insert_location() or pg_current_xlog_location(), but if
we need them, I will add them.

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 23ebc11..cdced7f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1878,6 +1878,13 @@ include 'filename'
         results in most cases.
        </para>
+      <para>
+       When pg_stat_walwriter.dirty_writes, the number of WAL buffer
+       flushes forced by the buffer being full, is continuously
+       increasing on your running server, you may need to enlarge
+       this buffer size.
+      </para>
+
      </listitem>
     </varlistentry>

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4ec6981..15d9202 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -278,6 +278,14 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
      </row>

      <row>
+      <entry><structname>pg_stat_walwriter</><indexterm><primary>pg_stat_walwriter</primary></indexterm></entry>
+      <entry>One row only, showing statistics about the wal writer
+       process's activity. See <xref linkend="pg-stat-walwriter-view">
+       for details.
+      </entry>
+     </row>
+
+     <row>
       <entry><structname>pg_stat_database</><indexterm><primary>pg_stat_database</primary></indexterm></entry>
       <entry>One row per database, showing database-wide statistics. See
        <xref linkend="pg-stat-database-view"> for details.
@@ -735,6 +743,39 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
    single row, containing global data for the cluster.
   </para>
+  <table id="pg-stat-walwriter-view" xreflabel="pg_stat_walwriter">
+   <title><structname>pg_stat_walwriter</structname> View</title>
+
+   <tgroup cols="3">
+    <thead>
+     <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
Re: [HACKERS] [rfc] overhauling pgstat.stat
(2013/09/04 15:23), Atri Sharma wrote:

Sent from my iPad

On 04-Sep-2013, at 10:54, Satoshi Nagayasu sn...@uptime.jp wrote:

Hi,

(2013/09/04 12:52), Atri Sharma wrote:

On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu sn...@uptime.jp wrote:

Hi,

I'm considering overhauling pgstat.stat, and would like to know how
many people are interested in this topic.

As you may know, this file could be hundreds of MB in size, because
pgstat.stat holds all access statistics in each database, and it needs
to read/write the entire pgstat.stat frequently. As a result,
pgstat.stat often generates massive I/O operations, particularly when
the database has a large number of tables.

To support multi-tenancy, or just a large number of tables (up to 10k
tables in a single database), I think pgstat.stat needs to be
overhauled. I think using a heap and a btree in pgstat.stat would be
preferable, to reduce read/write volume and to allow updating access
statistics for specific tables in the pgstat.stat file.

Is this good for us?

Hi,

Nice thought. I/O reduction in pgstat can be really helpful.

I am trying to think of our aim here. Would we be looking to split
pgstat per table, so that the I/O write happens for only a portion of
pgstat? Or reduce the I/O in general?

I prefer the latter.

Under the current implementation, DBAs need to split a single database
into many smaller databases, taking the access locality of the tables
into account. That's difficult, and it could change in the future. And
splitting the statistics data into many files (per table, for example)
would cause another performance issue when collecting/showing
statistics at once. Just my guess though.

So, I'm looking for a new way to reduce I/O for the statistics data in
general.

Regards,

If the latter, how would using a BTree help us? I would rather go for a
range tree or something. But again, I may be completely wrong. Please
elaborate a bit more on the solution we are trying to achieve. It seems
really interesting.

Regards,

Atri

Right, thanks.

How would using a heap and a BTree help here? Are we looking at a
priority queue which supports the main storage system of the stats?

For example, when you read only a single block from your table, you
then need to write all the values in your database statistics. That
often generates a large amount of I/O. However, if random access is
allowed in the statistics, you can update only a single record for the
specific table which you read. It would be less than 100 bytes for
each table.

I have no idea how a priority queue could work here so far. However, if
the statistics are overhauled, PostgreSQL would be able to host a much
larger number of customers more easily.

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
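The write amplification described above can be put in rough numbers: today, one counter change eventually rewrites the whole per-database file, whereas in-place random access would touch a single record. Both figures below are illustrative assumptions, not measurements:

```python
# Illustrative write-amplification estimate for one stats update.
ENTRY_BYTES = 100          # assumed size of one per-table stats record
N_TABLES = 100_000         # tables in the database (illustrative)

whole_file_rewrite = N_TABLES * ENTRY_BYTES   # bytes written today
in_place_update = ENTRY_BYTES                 # bytes with random access

amplification = whole_file_rewrite // in_place_update
print(f"rewrite: {whole_file_rewrite / 1024 / 1024:.1f} MB, "
      f"in-place: {in_place_update} B, "
      f"amplification: {amplification:,}x")
```

Even if the real entry sizes differ, the ratio scales linearly with the number of tables, which is why the problem only bites on very large schemas.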
Re: [HACKERS] [rfc] overhauling pgstat.stat
(2013/09/05 3:50), Pavel Stehule wrote:

we very successfully use a tmpfs volume for pgstat files (we use a
backport of the multiple-statfiles patch from 9.3 to 9.1).

It works quite well as long as you have the objects (tables, indexes,
functions) spread across multiple databases. Once you have one database
with a very large number of objects, tmpfs is not as effective. It's
going to help with stats I/O, but it's not going to help with high CPU
usage (you're reading and parsing the stat files over and over), and
every rewrite creates a copy of the file. So if you have 400MB stats,
you will need 800MB tmpfs + some slack (say, 200MB). That means you'll
use ~1GB tmpfs although 400MB would be just fine. And this 600MB won't
be used for page cache etc. OTOH, it's true that if you have that many
objects, 600MB of RAM is not going to help you anyway.

and just an idea - can we use a database for storing these files? It
could use unlogged tables. A second idea - keep one bg worker as a
persistent in-memory key-value database, holding the data in memory
with some optimizations - using anti-caching and similar in-memory
database features.

Yeah, I'm interested in this idea too.

If the stat collector has a dedicated connection to the backend in
order to store statistics into dedicated tables, we can easily take
advantage of index (btree, or hash?) and heap storage.

Is this worth trying?

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [rfc] overhauling pgstat.stat
(2013/09/05 3:59), Alvaro Herrera wrote:

Tomas Vondra wrote:

My idea was to keep the per-database stats, but allow some sort of
random access - updating / deleting the records in place, adding
records etc. The simplest way I could think of was adding a simple
index - a mapping of OID to position in the stat file, i.e. a simple
array of (oid, offset) pairs, stored in oid.stat.index or something
like that.

This would make it quite simple to access an existing record:

1: get the position from the index
2: read sizeof(Entry) from the file
3: if it's an update, just overwrite the bytes; for a delete, set an
   isdeleted flag (needs to be added to all entries)

or to read all the records (just read the whole file as today).

Sounds reasonable. However, I think the index should be a real index,
i.e. have a tree structure that can be walked down, not just a plain
array. If you have a 400 MB stat file, then you must have about 4
million tables, and you will not want to scan such a large array every
time you want to find an entry.

I thought of an array structure at first. But, for now, I think we
should have a real index for the statistics data, because we already
have several index storages, and it will allow us to minimize
read/write operations.

BTW, what kind of index would be preferred for this purpose? btree or
hash? If we use btree, do we need range scans on the statistics tables?
I have no idea so far.

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
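Tomas's (oid, offset) scheme with an isdeleted flag can be sketched compactly. A plain dict stands in for the on-disk index (which, per Alvaro, should really be a tree), and the record layout below is invented for illustration - it is not the real PgStat_StatTabEntry:

```python
import struct

# oid, n_scans, n_tuples, isdeleted flag -- an illustrative layout only.
ENTRY = struct.Struct("<IQQB")

class StatFile:
    """In-place updatable stats 'file' with an oid -> offset index."""

    def __init__(self):
        self.data = bytearray()              # stand-in for the file
        self.index = {}                      # oid -> byte offset

    def upsert(self, oid, n_scans, n_tuples):
        rec = ENTRY.pack(oid, n_scans, n_tuples, 0)
        off = self.index.get(oid)
        if off is None:                      # new entry: append + index it
            self.index[oid] = len(self.data)
            self.data += rec
        else:                                # existing: overwrite in place
            self.data[off:off + ENTRY.size] = rec

    def delete(self, oid):
        off = self.index.pop(oid)            # set isdeleted, keep the bytes
        self.data[off + ENTRY.size - 1] = 1

    def get(self, oid):
        off = self.index[oid]
        _, scans, tuples, _ = ENTRY.unpack_from(self.data, off)
        return scans, tuples


f = StatFile()
for oid in range(1000):
    f.upsert(oid, 0, 0)
size_before = len(f.data)
f.upsert(42, 7, 123)      # touches ENTRY.size bytes, not the whole file
assert len(f.data) == size_before
print(ENTRY.size, f.get(42))
```

The point of the layout is the upsert path: an existing entry is overwritten at its known offset, so updating one table's counters costs ENTRY.size bytes instead of a whole-file rewrite.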
[HACKERS] [rfc] overhauling pgstat.stat
Hi,

I'm considering overhauling pgstat.stat, and would like to know how
many people are interested in this topic.

As you may know, this file could be hundreds of MB in size, because
pgstat.stat holds all access statistics in each database, and it needs
to read/write the entire pgstat.stat frequently. As a result,
pgstat.stat often generates massive I/O operations, particularly when
the database has a large number of tables.

To support multi-tenancy, or just a large number of tables (up to 10k
tables in a single database), I think pgstat.stat needs to be
overhauled. I think using a heap and a btree in pgstat.stat would be
preferable, to reduce read/write volume and to allow updating access
statistics for specific tables in the pgstat.stat file.

Is this good for us? Any comments or suggestions?

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [rfc] overhauling pgstat.stat
Hi,

(2013/09/04 13:07), Alvaro Herrera wrote:

Satoshi Nagayasu wrote:

As you may know, this file could be hundreds of MB in size, because
pgstat.stat holds all access statistics in each database, and it needs
to read/write the entire pgstat.stat frequently. As a result,
pgstat.stat often generates massive I/O operations, particularly when
the database has a large number of tables.

We already changed it:

commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
Author: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Mon Feb 18 17:56:08 2013 -0300

    Split pgstat file in smaller pieces

Thanks for the comments. I forgot to mention that.

Yes, we have already split the single pgstat.stat file into several
pieces. However, we still need to read/write a large amount of
statistics data when we have a large number of tables in a single
database, or multiple databases being accessed. Right?

I think the issue here is that it is necessary to write/read statistics
data even if it's not actually changed. So, I'm wondering how we can
minimize read/write operations on these statistics data files using a
heap and a btree.

Regards,

commit 187492b6c2e8cafc5b39063ca3b67846e8155d24
Author: Alvaro Herrera alvhe...@alvh.no-ip.org
Date: Mon Feb 18 17:56:08 2013 -0300

    Split pgstat file in smaller pieces

    We now write one file per database and one global file, instead of
    having the whole thing in a single huge file. This reduces the I/O
    that must be done when partial data is required -- which is all the
    time, because each process only needs information on its own
    database anyway. Also, the autovacuum launcher does not need data
    about tables and functions in each database; having the global
    stats for all DBs is enough.

    Catalog version bumped because we have a new subdir under PGDATA.

    Author: Tomas Vondra. Some rework by Álvaro
    Testing by Jeff Janes
    Other discussion by Heikki Linnakangas, Tom Lane.

I don't oppose further tweaking, of course, but I wonder if you are
considering these changes.

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] [rfc] overhauling pgstat.stat
Hi,

(2013/09/04 12:52), Atri Sharma wrote:

On Wed, Sep 4, 2013 at 6:40 AM, Satoshi Nagayasu sn...@uptime.jp wrote:

Hi,

I'm considering overhauling pgstat.stat, and would like to know how
many people are interested in this topic.

As you may know, this file could be hundreds of MB in size, because
pgstat.stat holds all access statistics in each database, and it needs
to read/write the entire pgstat.stat frequently. As a result,
pgstat.stat often generates massive I/O operations, particularly when
the database has a large number of tables.

To support multi-tenancy, or just a large number of tables (up to 10k
tables in a single database), I think pgstat.stat needs to be
overhauled. I think using a heap and a btree in pgstat.stat would be
preferable, to reduce read/write volume and to allow updating access
statistics for specific tables in the pgstat.stat file.

Is this good for us?

Hi,

Nice thought. I/O reduction in pgstat can be really helpful.

I am trying to think of our aim here. Would we be looking to split
pgstat per table, so that the I/O write happens for only a portion of
pgstat? Or reduce the I/O in general?

I prefer the latter.

Under the current implementation, DBAs need to split a single database
into many smaller databases, taking the access locality of the tables
into account. That's difficult, and it could change in the future. And
splitting the statistics data into many files (per table, for example)
would cause another performance issue when collecting/showing
statistics at once. Just my guess though.

So, I'm looking for a new way to reduce I/O for the statistics data in
general.

Regards,

If the latter, how would using a BTree help us? I would rather go for a
range tree or something. But again, I may be completely wrong. Please
elaborate a bit more on the solution we are trying to achieve. It seems
really interesting.

Regards,

Atri

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] Compression of full-page-writes
(2013/08/30 11:55), Fujii Masao wrote:

Hi,

The attached patch adds a new GUC parameter 'compress_backup_block'.
When this parameter is enabled, the server just compresses FPWs
(full-page writes) in WAL by using pglz_compress() before inserting
them into the WAL buffers. The compressed FPWs are then decompressed
during recovery. This is a very simple patch.

The purpose of this patch is the reduction of WAL size. Under heavy
write load, the server needs to write a large amount of WAL and this is
likely to be a bottleneck. What's worse, in replication, a large amount
of WAL has a harmful effect not only on WAL writing in the master, but
also on WAL streaming and WAL writing in the standby. Also, we would
need to spend more money on storage to store such large data. I'd like
to alleviate such harmful situations by reducing WAL size.

My idea is very simple: just compress FPWs, because FPWs are a big part
of WAL. I used pglz_compress() as the compression method, but you might
think that another method is better. We can add something like an
FPW-compression hook for that later.

The patch adds a new GUC parameter, but I'm thinking of merging it into
the full_page_writes parameter to avoid increasing the number of GUCs.
That is, I'm thinking of changing full_page_writes so that it can
accept a new value, 'compress'.

I measured how much WAL this patch can reduce, by using pgbench.

* Server spec
  CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
  Mem: 16GB
  Disk: 500GB SSD Samsung 840

* Benchmark
  pgbench -c 32 -j 4 -T 900 -M prepared
  scaling factor: 100
  checkpoint_segments = 1024
  checkpoint_timeout = 5min
  (every checkpoint during the benchmark was triggered by
  checkpoint_timeout)

I believe that the amount of backup blocks in WAL files is affected by
how often checkpoints occur, particularly under such an
update-intensive workload. Under your configuration, checkpoints should
occur quite often. So, you need to increase checkpoint_timeout in order
to determine whether the patch is realistic.

Regards,

* Result
  [tps]
  1386.8 (compress_backup_block = off)
  1627.7 (compress_backup_block = on)

  [the amount of WAL generated during running pgbench]
  4302 MB (compress_backup_block = off)
  1521 MB (compress_backup_block = on)

At least in my test, the patch could reduce the WAL size to one-third!

The patch is still WIP. But I'd like to hear opinions about this idea
before completing it, and then add the patch to the next CF if okay.

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
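The one-third reduction is plausible because full-page images are padded, zero-filled, and internally repetitive. pglz is internal to PostgreSQL, so the sketch below uses zlib as a stand-in, on a synthetic 8 kB "page" (a short header, a run of similar tuples, and a hole of zero bytes); the exact ratio is illustrative only:

```python
import zlib

PAGE = 8192

# Build something shaped like a heap page: a small header, a run of
# short similar tuples, and a large hole of zero bytes at the end.
tuples = b"".join(b"row-%06d|filler|" % i for i in range(150))
page = (b"\x01" * 24 + tuples).ljust(PAGE, b"\x00")

compressed = zlib.compress(page, 6)   # stand-in for pglz_compress()
print(len(compressed), f"{len(compressed) / PAGE:.2%} of the page")
```

Real pages with random, already-compressed, or TOASTed data will compress far less well, which is one reason the thread treats the one-third figure as workload-dependent.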
Re: [HACKERS] Compression of full-page-writes
(2013/08/30 12:07), Satoshi Nagayasu wrote: (2013/08/30 11:55), Fujii Masao wrote: Hi, Attached patch adds new GUC parameter 'compress_backup_block'. When this parameter is enabled, the server just compresses FPW (full-page-writes) in WAL by using pglz_compress() before inserting it to the WAL buffers. Then, the compressed FPW is decompressed in recovery. This is very simple patch. The purpose of this patch is the reduction of WAL size. Under heavy write load, the server needs to write a large amount of WAL and this is likely to be a bottleneck. What's the worse is, in replication, a large amount of WAL would have harmful effect on not only WAL writing in the master, but also WAL streaming and WAL writing in the standby. Also we would need to spend more money on the storage to store such a large data. I'd like to alleviate such harmful situations by reducing WAL size. My idea is very simple, just compress FPW because FPW is a big part of WAL. I used pglz_compress() as a compression method, but you might think that other method is better. We can add something like FPW-compression-hook for that later. The patch adds new GUC parameter, but I'm thinking to merge it to full_page_writes parameter to avoid increasing the number of GUC. That is, I'm thinking to change full_page_writes so that it can accept new value 'compress'. I measured how much WAL this patch can reduce, by using pgbench. * Server spec CPU: 8core, Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz Mem: 16GB Disk: 500GB SSD Samsung 840 * Benchmark pgbench -c 32 -j 4 -T 900 -M prepared scaling factor: 100 checkpoint_segments = 1024 checkpoint_timeout = 5min (every checkpoint during benchmark were triggered by checkpoint_timeout) I believe that the amount of backup blocks in WAL files is affected by how often the checkpoints are occurring, particularly under such update-intensive workload. Under your configuration, checkpoint should occur so often. 
So, you need to increase checkpoint_timeout in order to determine
whether the patch is realistic.

In fact, the following chart shows that checkpoint_timeout=30min also
reduces WAL size to one-third, compared with a 5min timeout, in a
pgbench experiment.

https://www.oss.ecl.ntt.co.jp/ossc/oss/img/pglesslog_img02.jpg

Regards,

Regards,

* Result
  [tps]
  1386.8 (compress_backup_block = off)
  1627.7 (compress_backup_block = on)

  [the amount of WAL generated during running pgbench]
  4302 MB (compress_backup_block = off)
  1521 MB (compress_backup_block = on)

At least in my test, the patch could reduce the WAL size to one-third!

The patch is still WIP. But I'd like to hear opinions about this idea
before completing it, and then add the patch to the next CF if okay.

Regards,

--
Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
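The chart's behaviour follows from the FPW rule: a page produces a full-page image only on its first modification after each checkpoint, so lengthening checkpoint_timeout collapses repeated images of hot pages into one per cycle. A rough model under uniform random page updates (the working-set size and update rate below are invented numbers, not taken from the benchmark above):

```python
PAGE_KB = 8
N_PAGES = 100_000            # hot working set, in pages (illustrative)
RATE = 20_000                # page updates per minute (illustrative)

def fpw_mb_per_min(timeout_min):
    """FPW WAL volume per minute: each page dirtied in a checkpoint
    cycle contributes at most one full-page image per cycle."""
    updates = RATE * timeout_min
    # expected number of distinct pages touched by uniform-random updates
    distinct = N_PAGES * (1 - (1 - 1 / N_PAGES) ** updates)
    return distinct * PAGE_KB / 1024 / timeout_min

for t in (5, 30):
    print(f"checkpoint_timeout={t:2d}min -> "
          f"~{fpw_mb_per_min(t):6.1f} MB/min of FPW")
```

With these invented numbers the 5min-to-30min stretch cuts FPW volume by roughly a factor of four, the same order as the one-third reduction the chart reports.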
Re: [HACKERS] inconsistent state after crash recovery
(2013/08/02 21:19), Robert Haas wrote: On Fri, Aug 2, 2013 at 8:17 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Fri, Jul 26, 2013 at 8:27 AM, Tom Lane t...@sss.pgh.pa.us wrote: would you expect crash recovery to notice the disappearance of a file that was touched nowhere in the replayed actions? Eh, maybe not. But should we try harder to detect the unexpected disappearance of one that is? We do, don't we? The replay stuff should complain unless it sees a drop or truncate covering any unaccounted-for pages. Hmm. Yeah. But the OP seems to think it doesn't work. Yes. I'm afraid that. My attached script shows that crash recovery re-creates the lost table file implicitly, and fills some of those blocks (maybe lower ones) with zero without any notice. We can easily observe it by using pg_filedump. Thus, the table file can lose records, but DBA cannot recognize it because no message is left in the server log. I agree that this is not a PostgreSQL bug. However, DBA still needs to detect this table corruption, brought by several components which PostgreSQL relys on, to consider restoring from database backup. If PostgreSQL can detect and tell something about that, it would be really helpful for DBA to make some critical decision. I think PostgreSQL will be able to do that. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp #!/bin/sh PGHOME=/tmp/pgsql PGDATA=/tmp/pgsql/data PATH=${PGHOME}/bin:${PATH} export PGHOME PGDATA PATH createdb testdb psql -e testdbEOF drop table if exists t1; create table t1 ( c1 varchar(20) ); checkpoint; select count(*) from t1; select relname,relfilenode from pg_class where relname='t1'; select oid,* from pg_database where datname=current_database(); insert into t1 values (trim(to_char(generate_series(1,1000), '')) ); -- 1000 records select count(*) from t1; select pg_relation_size('t1'); -- write lower blocks in the table file. 
checkpoint;
-- write backup blocks and new records in WAL
insert into t1 values (trim(to_char(generate_series(1001,2000), ''))); -- 2000 records
select count(*) from t1;
select pg_relation_size('t1');
EOF

tablefile=`psql -A -t testdb <<EOF
select ( select setting from pg_settings where name='data_directory' )
    || '/' ||
       ( select pg_relation_filepath(oid) from pg_class where relname='t1' );
EOF
`
echo $tablefile

# force a process crash, then recovery after that.
killall -9 postgres

# remove the table file.
rm -rf $tablefile
rm -rf ${PGDATA}/*.pid
rm -rf ${PGDATA}/pg_log/*

pg_ctl -w -D ${PGDATA} start
cat ${PGDATA}/pg_log/*

psql -e testdb <<EOF
-- only a backup block and following wal records can be
-- recovered here, so the table is going to be inconsistent.
-- 1000+ records (inconsistent)
select count(*) from t1;
select pg_relation_size('t1');
EOF

#!/bin/sh

PGHOME=/tmp/pgsql
PGDATA=/tmp/pgsql/data
PATH=${PGHOME}/bin:${PATH}
export PGHOME PGDATA PATH

createdb testdb

psql -e testdb <<EOF
drop table if exists t1;
create table t1 ( c1 varchar(20) );
checkpoint;
select count(*) from t1;
select relname,relfilenode from pg_class where relname='t1';
select oid,* from pg_database where datname=current_database();
insert into t1 values (trim(to_char(generate_series(1,1000), '')));
select count(*) from t1;
select pg_relation_size('t1');
-- write lower blocks in the table file.
checkpoint;
-- write backup blocks and new records in WAL
insert into t1 values (trim(to_char(generate_series(1001,2000), '')));
select count(*) from t1;
select pg_relation_size('t1');
EOF

tablefile=`psql -A -t testdb <<EOF
select ( select setting from pg_settings where name='data_directory' )
    || '/' ||
       ( select pg_relation_filepath(oid) from pg_class where relname='t1' );
EOF
`
echo $tablefile

# force a process crash, then recovery after that.
killall -9 postgres

ls -l $tablefile
pg_filedump -y -S 8192 $tablefile > /tmp/dump_before_drop

# remove the table file.
rm -rf $tablefile
rm -rf ${PGDATA}/*.pid
rm -rf ${PGDATA}/pg_log/*

pg_ctl -w -D ${PGDATA} start -o "-p 5433"
cat ${PGDATA}/pg_log/*

psql -e testdb <<EOF
-- if we execute a checkpoint here, the table would be recovered
-- in an inconsistent state after crash recovery,
-- because only the upper blocks can be recovered.
select count(*) from t1;
select pg_relation_size('t1');
EOF

tablefile=`psql -A -t testdb <<EOF
select ( select setting from pg_settings where name='data_directory' )
    || '/' ||
       ( select pg_relation_filepath(oid) from pg_class where relname='t1' );
EOF
`
ls -l $tablefile
pg_filedump -y -S 8192 $tablefile > /tmp/dump_after_recovery

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
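As a side note on the mechanism: the silent zero-filling reported above is ordinary POSIX sparse-file behavior, not anything PostgreSQL does deliberately. When redo restores a backup block at some offset into a file that no longer exists, the file is re-created, and every skipped-over block reads back as zeros, with no error from either the kernel or the server. A PostgreSQL-free sketch of just that filesystem behavior (the block size and the relfilenode-style filename are arbitrary choices for illustration):

```python
import os
import tempfile

BLOCK_SIZE = 8192

def replay_full_page_write(path: str, blkno: int, page: bytes) -> None:
    """Mimic redo restoring a backup block: open the file (creating it
    if it has vanished), seek to the block's offset, write the image."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY)
    try:
        os.lseek(fd, blkno * BLOCK_SIZE, os.SEEK_SET)
        os.write(fd, page)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as tmpdir:
    # The table file was deleted before the crash; redo only has a
    # full-page image for block 6, so it re-creates the file and writes
    # that single block.
    path = os.path.join(tmpdir, "16397")   # arbitrary relfilenode-like name
    replay_full_page_write(path, 6, b"\xaa" * BLOCK_SIZE)

    with open(path, "rb") as f:
        data = f.read()

    # Blocks 0-5 were never written, yet they read back as zero pages --
    # exactly the "filled with zeros without any notice" symptom.
    assert len(data) == 7 * BLOCK_SIZE
    assert data[:6 * BLOCK_SIZE] == bytes(6 * BLOCK_SIZE)
    assert data[6 * BLOCK_SIZE:] == b"\xaa" * BLOCK_SIZE
```

Detecting this after the fact is hard precisely because nothing fails along the way: the only hint is that pages which should contain tuples are all-zero, which is also what pg_filedump shows for the lower blocks.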
Re: [HACKERS] [9.4 CF 1]Commitfest ... over!
(2013/08/03 8:47), Josh Berkus wrote:
> Given that we can expect to be dealing with more patches per CF in the
> future, I'd like some feedback about what things would make the CF
> process more efficient. For that matter, for the first time we tried
> enforcing some of the rules of CFs this time, and I'd like to hear if
> people think that helped.

The 5-day rule (and the notifications from the CFM) seemed to be working for me. It helped me focus on the patch review.

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
[HACKERS] inconsistent state after crash recovery
Hi,

I received a question about inconsistent state after crash recovery. When a table file is broken (or just lost), PostgreSQL cannot recover the whole table, and shows no notice while recovering. I think this leaves the database in an inconsistent state.

(1) create a table, and fill it with records.
(2) process a checkpoint.
(3) fill in more records.
(4) force a crash, and delete the table file.
(5) run recovery on restarting.
(6) only records after the checkpoint can be recovered.

For example, the attached log shows that PostgreSQL recovers only 1058 records in a table which contained 2000 records before the crash, and says nothing about it in the server log.

--------------------------------------------------
insert into t1 values (trim(to_char(generate_series(1,1000), '')));
INSERT 0 1000
select count(*) from t1;
 count
-------
  1000
(1 row)

checkpoint;
CHECKPOINT
insert into t1 values (trim(to_char(generate_series(1001,2000), '')));
INSERT 0 1000
select count(*) from t1;
 count
-------
  2000
(1 row)

(delete the table file)
(kill postgres)
(restart postgres with recovery)

select count(*) from t1;
 count
-------
  1058
(1 row)
--------------------------------------------------

Is this expected or acceptable? I think, at the least, PostgreSQL should say something about this situation in the server log, because the DBA cannot recognize it if nothing is in the server log.

To reproduce it, please check the attached test script.

Any comments?

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC.
http://www.uptime.jp [snaga@devsv03 postgresql.git]$ sh test_recovery.sh createdb: database creation failed: ERROR: database testdb already exists drop table if exists t1; DROP TABLE create table t1 ( c1 varchar(20) ); CREATE TABLE checkpoint; CHECKPOINT select count(*) from t1; count --- 0 (1 row) select relname,relfilenode from pg_class where relname='t1'; relname | relfilenode -+- t1 | 147456 (1 row) select oid,* from pg_database where datname=current_database(); oid | datname | datdba | encoding | datcollate | datctype | datistemplate | datallowconn | datconnlimit | datlastsysoid | datfrozenxid | datminmxid | dattablespace | datacl ---+-++--++--+---+--+--+---+--++---+ 16384 | testdb | 10 |0 | C | C| f | t| -1 | 12895 | 1799 | 1 | 1663 | (1 row) insert into t1 values (trim(to_char(generate_series(1,1000), '')) ); INSERT 0 1000 select count(*) from t1; count --- 1000 (1 row) select pg_relation_size('t1'); pg_relation_size -- 57344 (1 row) checkpoint; CHECKPOINT insert into t1 values (trim(to_char(generate_series(1001,2000), '')) ); INSERT 0 1000 select count(*) from t1; count --- 2000 (1 row) select pg_relation_size('t1'); pg_relation_size -- 106496 (1 row) /tmp/pgsql/data/base/16384/147456 waiting for server to start2013-07-26 13:33:03 JST [@ 14651] LOG: loaded library pg_stat_statements done server started 2013-07-26 13:33:03 JST [@ 14653] LOG: database system was interrupted; last known up at 2013-07-26 13:33:03 JST 2013-07-26 13:33:03 JST [@ 14653] LOG: database system was not properly shut down; automatic recovery in progress 2013-07-26 13:33:03 JST [@ 14653] LOG: redo starts at 0/1D493A0 2013-07-26 13:33:03 JST [@ 14653] LOG: record with zero length at 0/1D5D958 2013-07-26 13:33:03 JST [@ 14653] LOG: redo done at 0/1D5D928 2013-07-26 13:33:03 JST [@ 14653] LOG: last completed transaction was at log time 2013-07-26 13:33:03.158161+09 2013-07-26 13:33:03 JST [@ 14651] LOG: database system is ready to accept connections 2013-07-26 13:33:03 JST [@ 14657] LOG: 
autovacuum launcher started select count(*) from t1; count --- 1058 (1 row) select pg_relation_size('t1'); pg_relation_size -- 106496 (1 row) createdb: database creation failed: ERROR: database testdb already exists drop table if exists t1; DROP TABLE create table t1 ( c1 varchar(20) ); CREATE TABLE checkpoint; CHECKPOINT select count(*) from t1; count --- 0 (1 row) select relname,relfilenode from pg_class where relname='t1'; relname | relfilenode -+- t1 | 155648 (1 row) select oid,* from pg_database where datname=current_database(); oid | datname | datdba | encoding | datcollate | datctype | datistemplate | datallowconn | datconnlimit | datlastsysoid | datfrozenxid | datminmxid | dattablespace | datacl ---+-++--++--+---+--+--+---+--++---+ 16384
[HACKERS] [PoC] pgstattuple2: block sampling to reduce physical read
Hi,

I've been working on a new pgstattuple function that uses block sampling [1] in order to reduce block reads while scanning a table. A PoC patch is attached.

[1] Re: [RFC] pgstattuple/pgstatindex enhancement
http://www.postgresql.org/message-id/ca+tgmoaxjhgz2c4ayfbr9muuvnhgwu4co-cthqpzrwwdtam...@mail.gmail.com

This new function, pgstattuple2(), randomly samples only 3,000 blocks (which amounts to 24MB) from the table, and estimates several parameters of the entire table. The function calculates the averages of the samples, estimates the parameters (averages and SDs), and shows standard errors (SE) to allow estimating the status of the table with a statistical approach. And, of course, it reduces the number of physical block reads while scanning a bigger table.

The following example shows that the new pgstattuple2 function runs about 100 times faster than the original pgstattuple function, with well-estimated results.

--------------------------------------------------
postgres=# select * from pgstattuple('pgbench_accounts');
-[ RECORD 1 ]------+-----------
table_len          | 1402642432
tuple_count        | 1000
tuple_len          | 121000
tuple_percent      | 86.27
dead_tuple_count   | 182895
dead_tuple_len     | 22130295
dead_tuple_percent | 1.58
free_space         | 21012328
free_percent       | 1.5

Time: 1615.651 ms

postgres=# select * from pgstattuple2('pgbench_accounts');
NOTICE: pgstattuple2: SE tuple_count 2376.47, tuple_len 287552.58, dead_tuple_count 497.63, dead_tuple_len 60213.08, free_space 289752.38
-[ RECORD 1 ]------+-----------
table_len          | 1402642432
tuple_count        | 9978074
tuple_len          | 1207347074
tuple_percent      | 86.08
dead_tuple_count   | 187315
dead_tuple_len     | 22665208
dead_tuple_percent | 1.62
free_space         | 23400431
free_percent       | 1.67

Time: 15.026 ms
postgres=#
--------------------------------------------------

In addition, see the attached chart to see how well pgstattuple2 estimates during a repeated (long-running) pgbench run. I understand that pgbench generates random transactions, and those update operations might not have any skew over the table, so estimating the table status seems to be easy in this test.
However, I'm still curious to know whether it would work on a real-world workload.

Is it worth having this? Any comment or suggestion?

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/contrib/pgstattuple/pgstattuple--1.1--1.2.sql b/contrib/pgstattuple/pgstattuple--1.1--1.2.sql
index 2783a63..8ebec6f 100644
--- a/contrib/pgstattuple/pgstattuple--1.1--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.1--1.2.sql
@@ -37,3 +37,17 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+CREATE FUNCTION pgstattuple2(IN relname regclass,
+OUT table_len BIGINT,          -- physical table length in bytes
+OUT tuple_count BIGINT,        -- number of live tuples
+OUT tuple_len BIGINT,          -- total tuples length in bytes
+OUT tuple_percent FLOAT8,      -- live tuples in %
+OUT dead_tuple_count BIGINT,   -- number of dead tuples
+OUT dead_tuple_len BIGINT,     -- total dead tuples length in bytes
+OUT dead_tuple_percent FLOAT8, -- dead tuples in %
+OUT free_space BIGINT,         -- free space in bytes
+OUT free_percent FLOAT8)       -- free space in %
+AS 'MODULE_PATHNAME', 'pgstattuple2'
+LANGUAGE C STRICT;
+
diff --git a/contrib/pgstattuple/pgstattuple--1.2.sql b/contrib/pgstattuple/pgstattuple--1.2.sql
index e5fa2f5..963eb00 100644
--- a/contrib/pgstattuple/pgstattuple--1.2.sql
+++ b/contrib/pgstattuple/pgstattuple--1.2.sql
@@ -77,3 +77,17 @@ CREATE FUNCTION pg_relpages(IN relname regclass)
 RETURNS BIGINT
 AS 'MODULE_PATHNAME', 'pg_relpagesbyid'
 LANGUAGE C STRICT;
+
+CREATE FUNCTION pgstattuple2(IN relname regclass,
+OUT table_len BIGINT,          -- physical table length in bytes
+OUT tuple_count BIGINT,        -- number of live tuples
+OUT tuple_len BIGINT,          -- total tuples length in bytes
+OUT tuple_percent FLOAT8,      -- live tuples in %
+OUT dead_tuple_count BIGINT,   -- number of dead tuples
+OUT dead_tuple_len BIGINT,     -- total dead tuples length in bytes
+OUT dead_tuple_percent FLOAT8, -- dead tuples in %
+OUT free_space BIGINT,         -- free space in bytes
+OUT free_percent FLOAT8)       -- free space in %
+AS 'MODULE_PATHNAME', 'pgstattuple2'
+LANGUAGE C STRICT;
+
diff --git a/contrib/pgstattuple/pgstattuple.c b/contrib/pgstattuple/pgstattuple.c
index 7f41ec3..c2adb75 100644
--- a/contrib/pgstattuple/pgstattuple.c
+++ b/contrib/pgstattuple/pgstattuple.c
@@ -41,9 +41,22 @@ PG_MODULE_MAGIC;
 PG_FUNCTION_INFO_V1(pgstattuple);
 PG_FUNCTION_INFO_V1(pgstattuplebyid);
+PG_FUNCTION_INFO_V1(pgstattuple2);
 extern Datum pgstattuple(PG_FUNCTION_ARGS);
 extern
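The arithmetic behind the reported SE values is standard simple-random-sampling estimation: take per-block sample counts, scale the sample mean by the total number of blocks to get a table-wide total, and report SE = (sample SD / sqrt(n)) x nblocks. A rough model of that estimation in Python (the function and variable names here are illustrative, not taken from the PoC patch; the real code works on live/dead tuple counts and free space per sampled page):

```python
import math
import random

SAMPLE_BLOCKS = 3000  # pgstattuple2 samples up to 3,000 blocks

def estimate_total(per_block, nblocks, nsamples=SAMPLE_BLOCKS, seed=42):
    """Estimate a table-wide total (e.g. tuple_count) and its standard
    error from a random sample of per-block counts."""
    rng = random.Random(seed)
    n = min(nsamples, len(per_block))
    sample = rng.sample(per_block, n)
    mean = sum(sample) / n
    # Sample variance (n-1 denominator), then SE of the estimated total.
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    total = mean * nblocks
    se = math.sqrt(var / n) * nblocks
    return total, se

# A toy 20,000-block table with exactly 61 tuples in every block: the
# estimate is exact and the standard error collapses to zero.
blocks = [61] * 20_000
total, se = estimate_total(blocks, len(blocks))
assert total == 61 * 20_000
assert se == 0.0
```

With skewed tables (e.g. bloat concentrated at one end of the heap) the SE grows, which is exactly the signal the NOTICE line in the example above is meant to convey.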
Re: [HACKERS] New statistics for WAL buffer dirty writes
Will revise and re-resubmit for the next CF.

Regards,

2013/07/19 1:06, Alvaro Herrera wrote:
> What happened to this patch? We were waiting on an updated version from
> you.
>
> Satoshi Nagayasu wrote:
>> (2012/12/10 3:06), Tomas Vondra wrote:
>>> On 29.10.2012 04:58, Satoshi Nagayasu wrote:
>>>> 2012/10/24 1:12, Alvaro Herrera wrote:
>>>>> Satoshi Nagayasu escribió:
>>>>>> With this patch, the walwriter process and each backend process
>>>>>> would sum up dirty writes, and send them to the stat collector. So,
>>>>>> the value could be saved in the stat file, and could be kept across
>>>>>> restarts. The statistics could be retrieved using the
>>>>>> pg_stat_get_xlog_dirty_writes() function, and could be reset by
>>>>>> calling pg_stat_reset_shared('walwriter').
>>>>>>
>>>>>> Now, I have one concern. The reset time could be captured in
>>>>>> globalStats.stat_reset_timestamp, but this value is the same as the
>>>>>> bgwriter one. So, once pg_stat_reset_shared('walwriter') is called,
>>>>>> the stats_reset column in pg_stat_bgwriter represents the reset time
>>>>>> for the walwriter, not for the bgwriter.
>>>>>>
>>>>>> How should we handle this? Should we split this value? And should we
>>>>>> have a new system view for the walwriter?
>>>>>
>>>>> I think the answer to the two last questions is yes. It doesn't seem
>>>>> to make sense, to me, to have a single reset timing for what are
>>>>> effectively two separate things.
>>>>>
>>>>> Please submit an updated patch to the next CF. I'm marking this one
>>>>> returned with feedback.
>>>>
>>>> Thanks. I attached the latest one, which splits the reset_time for
>>>> bgwriter and walwriter, and provides a new system view, called
>>>> pg_stat_walwriter, to show the dirty write counter and the reset time.
>>>
>>> I've done a quick review of the v4 patch:
>>
>> Thanks for the review, and sorry for my delayed response.
>>
>>> 1) applies fine on HEAD, compiles fine
>>>
>>> 2) make installcheck fails because of a difference in the 'rules' test
>>> suite (there's a new view pg_stat_walwriter - see the attached patch
>>> for a fixed version of expected/rules.out)
>>
>> Ah, I forgot about the regression test. I will fix it. Thanks.
>>> 3) I do agree with Alvaro that using the same struct for two separate
>>> components (bgwriter and walwriter) seems a bit awkward. For example
>>> you need to have two separate stat_reset fields, the reset code
>>> becomes much more verbose (because you need to list individual fields)
>>> etc. So I'd vote to either split this into two structures or keep it
>>> as a single structure (although with two views on top of it).
>>
>> Ok, I will split it into two structs, PgStat_BgWriterGlobalStats and
>> PgStat_WalWriterGlobalStats, and will modify PgStat_GlobalStats to hold
>> those two structs in the stat collector.
>>
>>> 4) Are there any other fields that might be interesting? Right now
>>> there's just dirty_writes but I guess there are other values. E.g. how
>>> much data was actually written etc.?
>>
>> AFAIK, those numbers can be obtained by calling
>> pg_current_xlog_insert_location() or pg_current_xlog_location(), but if
>> we need it, I will add it.

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
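The gain from splitting the struct is that each component then carries its own reset timestamp, so pg_stat_reset_shared('walwriter') no longer clobbers the stats_reset shown in pg_stat_bgwriter. A toy model of that layout (all names here are invented for illustration; the real structs are C structs in the stats collector, not Python objects):

```python
import time
from dataclasses import dataclass, field

@dataclass
class ComponentStats:
    """Per-component counters plus the component's own reset timestamp."""
    counters: dict = field(default_factory=dict)
    stat_reset_timestamp: float = field(default_factory=time.time)

    def reset(self) -> None:
        self.counters.clear()
        self.stat_reset_timestamp = time.time()

@dataclass
class GlobalStats:
    # One sub-struct per component, instead of one shared blob.
    bgwriter: ComponentStats = field(default_factory=ComponentStats)
    walwriter: ComponentStats = field(default_factory=ComponentStats)

stats = GlobalStats()
stats.walwriter.counters["xlog_dirty_writes"] = 42
stats.bgwriter.counters["buffers_clean"] = 7

bgwriter_reset_before = stats.bgwriter.stat_reset_timestamp
stats.walwriter.reset()   # models pg_stat_reset_shared('walwriter')

assert stats.walwriter.counters == {}
assert stats.bgwriter.counters == {"buffers_clean": 7}
# The bgwriter's reset time is untouched by the walwriter reset.
assert stats.bgwriter.stat_reset_timestamp == bgwriter_reset_before
```

Each system view (pg_stat_bgwriter, pg_stat_walwriter) then simply exposes its own sub-struct, including its own stats_reset column.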
Re: [HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
(2013/07/18 2:31), Fujii Masao wrote: On Tue, Jul 16, 2013 at 3:00 PM, Satoshi Nagayasu sn...@uptime.jp wrote: (2013/07/04 3:58), Fujii Masao wrote: For the test, I just implemented the regclass-version of pg_relpages() (patch attached) and tested some cases. But I could not get that problem. SELECT pg_relpages('hoge');-- OK SELECT pg_relpages(oid) FROM pg_class WHERE relname = 'hoge';-- OK SELECT pg_relpages(relname) FROM pg_class WHERE relname = 'hoge';-- OK In the attached patch, I cleaned up three functions to have two types of arguments for each, text and regclass. pgstattuple(text) pgstattuple(regclass) pgstatindex(text) pgstatindex(regclass) pg_relpages(text) pg_relpages(regclass) I still think a regclass argument is more appropriate for passing relation/index name to a function than text-type, but having both arguments in each function seems to be a good choice at this moment, in terms of backward-compatibility. Docs needs to be updated if this change going to be applied. Yes, please. Updated docs and code comments, etc. PFA. Any comments? 'make installcheck' failed in my machine. Hmm, it works on my box... Do we need to remove pgstattuple--1.1.sql and create pgstattuple--1.1--1.2.sql? +/* contrib/pgstattuple/pgstattuple--1.1.sql */ Typo: s/1.1/1.2 Done. You seem to have forgotten to update pgstattuple.c. Should I change something in pgstattuple.c? I just changed CREATE FUNCTION statement for pgstattuple to replace oid input arg with the regclass. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. 
http://www.uptime.jp diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile index fc893d8..957742a 100644 --- a/contrib/pgstattuple/Makefile +++ b/contrib/pgstattuple/Makefile @@ -4,7 +4,7 @@ MODULE_big = pgstattuple OBJS = pgstattuple.o pgstatindex.o EXTENSION = pgstattuple -DATA = pgstattuple--1.1.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql +DATA = pgstattuple--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql REGRESS = pgstattuple diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out index ab28f50..eaba306 100644 --- a/contrib/pgstattuple/expected/pgstattuple.out +++ b/contrib/pgstattuple/expected/pgstattuple.out @@ -11,12 +11,24 @@ select * from pgstattuple('test'::text); 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 (1 row) +select * from pgstattuple('test'::name); + table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent +---+-+---+---+--++++-- + 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 +(1 row) + select * from pgstattuple('test'::regclass); table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ---+-+---+---+--++++-- 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 (1 row) +select * from pgstattuple('test'::regclass::oid); + table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent +---+-+---+---+--++++-- + 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 +(1 row) + select * from pgstatindex('test_pkey'); version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation -+++---+++-+---+--+ diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c index 97f897e..41e90e3 100644 --- a/contrib/pgstattuple/pgstatindex.c +++ b/contrib/pgstattuple/pgstatindex.c @@ -39,12 
+39,24 @@
 #include "utils/rel.h"
+/*
+ * Because of a backward-compatibility issue, we have decided to have
+ * two types of interfaces, with regclass-type input arg and text-type
+ * input arg, for each function.
+ *
+ * Those functions which have a text-type input arg will be deprecated
+ * in a future release.
+ */
 extern Datum pgstatindex(PG_FUNCTION_ARGS);
+extern Datum pgstatindexbyid
Re: [HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
(2013/07/04 3:58), Fujii Masao wrote:
> On Wed, Jun 26, 2013 at 12:39 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Thu, Jun 20, 2013 at 2:32 PM, Fujii Masao <masao.fu...@gmail.com> wrote:
>>> Since pg_relpages(oid) doesn't exist, pg_relpages() is in the same
>>> situation as pgstatindex(), i.e., we cannot just replace
>>> pg_relpages(text) with pg_relpages(regclass) for backward-compatibility.
>>> How do you think we should solve the pg_relpages() problem? Rename?
>>> Just add pg_relpages(regclass)?
>>
>> Adding a function with a new name seems likely to be smoother, since
>> that way you don't have to worry about problems with function calls
>> being thought ambiguous.
>
> Could you show me an example where this problem happens? For the test,
> I just implemented the regclass version of pg_relpages() (patch
> attached) and tested some cases. But I could not reproduce that problem.
>
> SELECT pg_relpages('hoge');    -- OK
> SELECT pg_relpages(oid) FROM pg_class WHERE relname = 'hoge';    -- OK
> SELECT pg_relpages(relname) FROM pg_class WHERE relname = 'hoge';    -- OK

In the attached patch, I cleaned up three functions to have two types of arguments each, text and regclass:

pgstattuple(text)
pgstattuple(regclass)
pgstatindex(text)
pgstatindex(regclass)
pg_relpages(text)
pg_relpages(regclass)

I still think a regclass argument is more appropriate than text for passing a relation/index name to a function, but having both argument types for each function seems to be a good choice at the moment, in terms of backward-compatibility.

Docs need to be updated if this change is going to be applied.

Any comments?

-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC.
http://www.uptime.jp diff --git a/contrib/pgstattuple/Makefile b/contrib/pgstattuple/Makefile index fc893d8..957742a 100644 --- a/contrib/pgstattuple/Makefile +++ b/contrib/pgstattuple/Makefile @@ -4,7 +4,7 @@ MODULE_big = pgstattuple OBJS = pgstattuple.o pgstatindex.o EXTENSION = pgstattuple -DATA = pgstattuple--1.1.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql +DATA = pgstattuple--1.2.sql pgstattuple--1.0--1.1.sql pgstattuple--unpackaged--1.0.sql REGRESS = pgstattuple diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out index ab28f50..eaba306 100644 --- a/contrib/pgstattuple/expected/pgstattuple.out +++ b/contrib/pgstattuple/expected/pgstattuple.out @@ -11,12 +11,24 @@ select * from pgstattuple('test'::text); 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 (1 row) +select * from pgstattuple('test'::name); + table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent +---+-+---+---+--++++-- + 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 +(1 row) + select * from pgstattuple('test'::regclass); table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent ---+-+---+---+--++++-- 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 (1 row) +select * from pgstattuple('test'::regclass::oid); + table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent +---+-+---+---+--++++-- + 0 | 0 | 0 | 0 |0 | 0 | 0 | 0 |0 +(1 row) + select * from pgstatindex('test_pkey'); version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation -+++---+++-+---+--+ diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c index 97f897e..9ec74e7 100644 --- a/contrib/pgstattuple/pgstatindex.c +++ b/contrib/pgstattuple/pgstatindex.c @@ -40,11 
+40,15 @@ extern Datum pgstatindex(PG_FUNCTION_ARGS); +extern Datum pgstatindexbyid(PG_FUNCTION_ARGS); extern Datum pg_relpages(PG_FUNCTION_ARGS); +extern Datum pg_relpagesbyid(PG_FUNCTION_ARGS); extern Datum pgstatginindex(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(pgstatindex); +PG_FUNCTION_INFO_V1(pgstatindexbyid
Re: [HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
Hi Rushabh,

(2013/07/16 14:58), Rushabh Lathia wrote:
> Hello Satoshi,
>
> I assigned myself as the reviewer of this patch. The issue status is
> "waiting on author".

Thank you for picking it up.

> Now looking at the discussion under the thread it seems like we are
> waiting for a suggestion for the new function name, right?

Yes.

> I am wondering why we actually need a new name. Can't we just overload
> the same function and provide two versions of it?

I think the major reason is to avoid some confusion between the old and new function arguments. My thought here is that having both argument types (text and regclass) for each function is a good choice to clean up the interfaces while keeping backward-compatibility.

> In the last thread Fujii just did the same for pg_relpages and it seems
> like a good-to-go approach, isn't it? Am I missing anything here?

I just posted a revised patch to handle the issue in the three functions of the pgstattuple module. Please take a look.

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] Add visibility map information to pg_freespace.
(2013/07/09 19:55), Kyotaro HORIGUCHI wrote:
> Hello,
>
> I've brought visibility map extensions for pg_freespacemap and
> pgstattuple.
>
> At Mon, 08 Jul 2013 16:59:05 +0900 (Tokyo Standard Time), Kyotaro
> HORIGUCHI <horiguchi.kyot...@lab.ntt.co.jp> wrote in
> 20130708.165905.118860769.horiguchi.kyot...@lab.ntt.co.jp
>> I'll come again with the first implementation of it. And as for
>> pg_freespacemap, I'll keep the current direction - adding a column to
>> the present output record format of pg_freespace(). And documentation,
>> if possible.
>
> pg_freespace_vm_v2.patch:
>
> The interface has been changed from the first patch. The version of
> pg_freespace() provided with vm information is named
> pg_freespace_with_vminfo() and shows output like the following.
>
> | postgres=# select * from pg_freespace_with_vminfo('t'::regclass) limit 10;
> |  blkno | avail | is_all_visible
> | -------+-------+----------------
> |      0 |    64 | t
> |      1 |    32 | t
> |      2 |    96 | t
> |      3 |    64 | t
> |      4 |    96 | t
> |      5 |    96 | t
> |      6 |   128 | t
> |      7 |    32 | t
> |      8 |    96 | t

I think we can simply add an is_all_visible column to the existing pg_freespace(), because adding a column would not break backward-compatibility in general. Any other thoughts?

> pgstattuple_vm_v1.patch:
>
> The first version of the VM extension for pgstattuple. According to the
> previous discussion, the added column is named 'all_visible_percent'.
>
> | postgres=# select * from pgstattuple('t');
> | -[ RECORD 1 ]-------+---------
> | table_len           | 71770112
> | tuple_count         | 989859
> | tuple_len           | 31675488
> | tuple_percent       | 44.13
> | dead_tuple_count    | 99
> | dead_tuple_len      | 3168
> | dead_tuple_percent  | 0
> | free_space          | 31886052
> | free_percent        | 44.43
> | all_visible_percent | 99.98

It seems to be working fine. And I added a regression test for pg_freespacemap and additional test cases for pgstattuple. Please take a look.

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC.
http://www.uptime.jp diff --git a/contrib/pg_freespacemap/Makefile b/contrib/pg_freespacemap/Makefile index d794df2..09d6ff8 100644 --- a/contrib/pg_freespacemap/Makefile +++ b/contrib/pg_freespacemap/Makefile @@ -6,6 +6,8 @@ OBJS = pg_freespacemap.o EXTENSION = pg_freespacemap DATA = pg_freespacemap--1.1.sql pg_freespacemap--1.0--1.1.sql pg_freespacemap--unpackaged--1.0.sql +REGRESS = pg_freespacemap + ifdef USE_PGXS PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) diff --git a/contrib/pg_freespacemap/expected/pg_freespacemap.out b/contrib/pg_freespacemap/expected/pg_freespacemap.out new file mode 100644 index 000..cde954d --- /dev/null +++ b/contrib/pg_freespacemap/expected/pg_freespacemap.out @@ -0,0 +1,100 @@ +create extension pg_freespacemap; +create table t1 ( uid integer primary key, uname text not null ); +select * from pg_freespace('t1'); + blkno | avail +---+--- +(0 rows) + +select * from pg_freespace('t1'::regclass); + blkno | avail +---+--- +(0 rows) + +select * from pg_freespace('t1', 1); + pg_freespace +-- +0 +(1 row) + +select * from pg_freespace_with_vminfo('t1'); + blkno | avail | is_all_visible +---+---+ +(0 rows) + +select * from pg_freespace_with_vminfo('t1'::regclass); + blkno | avail | is_all_visible +---+---+ +(0 rows) + +insert into t1 values ( 100, 'postgresql' ); +select * from pg_freespace('t1'); + blkno | avail +---+--- + 0 | 0 +(1 row) + +select * from pg_freespace('t1', 1); + pg_freespace +-- +0 +(1 row) + +select * from pg_freespace_with_vminfo('t1'); + blkno | avail | is_all_visible +---+---+ + 0 | 0 | f +(1 row) + +select * from pg_freespace('t1_pkey'); + blkno | avail +---+--- + 0 | 0 + 1 | 0 +(2 rows) + +select * from pg_freespace('t1_pkey', 1); + pg_freespace +-- +0 +(1 row) + +select * from pg_freespace('t1_pkey', 2); + pg_freespace +-- +0 +(1 row) + +select * from pg_freespace_with_vminfo('t1_pkey'); + blkno | avail | is_all_visible +---+---+ + 0 | 0 | f + 1 | 0 | f +(2 rows) + +vacuum t1; +select * from 
pg_freespace('t1'); + blkno | avail +---+--- + 0 | 8096 +(1 row) + +select * from pg_freespace_with_vminfo('t1'); + blkno | avail | is_all_visible +---+---+ + 0 | 8096 | t +(1 row) + +select * from pg_freespace('t1_pkey'); + blkno | avail +---+--- + 0 | 0 + 1 | 0 +(2 rows) + +select * from pg_freespace_with_vminfo('t1_pkey'); + blkno | avail | is_all_visible +---+---+ + 0 | 0 | f + 1 | 0 | f +(2 rows) + diff --git a/contrib/pg_freespacemap/sql/pg_freespacemap.sql b/contrib/pg_freespacemap/sql/pg_freespacemap.sql new file mode 100644 index 000..79a458d --- /dev/null +++ b/contrib
Re: [HACKERS] pg_filedump 9.3: checksums (and a few other fixes)
Hi,

I have reviewed this patch as a CF reviewer.

(2013/06/27 4:07), Jeff Davis wrote:
> On Mon, 2013-06-24 at 20:34 -0400, Josh Kupershmidt wrote:
>> This patch is in the current CommitFest, does it still need to be
>> reviewed? If so, I notice that the version in pgfoundry's CVS is rather
>> different than the version the patch seems to have been built against
>> (presumably the pg_filedump-9.2.0.tar.gz release), and conflicts in
>> several places with cvs tip.
>
> Rebased against CVS tip; attached.

It looks fine, but I have one question here.

When I run pg_filedump with -k against a database cluster which does not support checksums, pg_filedump produces a checksum error, as follows. Is this expected or acceptable?

-------------------------------------------------------
***********************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 9.3.0
*
* File: /tmp/pgsql/data/base/16384/16397
* Options used: -k
*
* Dump created on: Sat Jul 6 10:32:15 2013
***********************************************************

Block    0 ************************************************
<Header> -----
 Block Offset: 0x00000000         Offsets: Lower     268 (0x010c)
 Block: Size 8192  Version    4            Upper     384 (0x0180)
 LSN:  logid      0 recoff 0x00000000      Special  8192 (0x2000)
 Items:   61                      Free Space:  116
 Checksum: 0x0000  Prune XID: 0x00000000  Flags: 0x0000 ()
 Length (including item array): 268

 Error: checksum failure: calculated 0xf797.

<Data> ------
 Item   1 -- Length:  121  Offset: 8064 (0x1f80)  Flags: NORMAL
 Item   2 -- Length:  121  Offset: 7936 (0x1f00)  Flags: NORMAL
 Item   3 -- Length:  121  Offset: 7808 (0x1e80)  Flags: NORMAL
-------------------------------------------------------

Please check the attached script to reproduce it.

Also, I have updated the help message and README. Please check the attached patch.

Regards,
-- Satoshi Nagayasu <sn...@uptime.jp>
Uptime Technologies, LLC. http://www.uptime.jp

PGHOME=/tmp/pgsql
PGPORT=15433

function build_check {
    echo make PGSQL_INCLUDE_DIR=...
    make PGSQL_INCLUDE_DIR=../../src/include clean
    make PGSQL_INCLUDE_DIR=../../src/include all
    make PGSQL_INCLUDE_DIR=../../src/include install
    make PGSQL_INCLUDE_DIR=../../src/include clean

    echo make -f Makefile.contrib...
make -f Makefile.contrib clean make -f Makefile.contrib all make -f Makefile.contrib install make -f Makefile.contrib clean echo make -f Makefile.contrib USE_PGXS=1... PATH=${PGHOME}/bin:$PATH make -f Makefile.contrib USE_PGXS=1 clean make -f Makefile.contrib USE_PGXS=1 all make -f Makefile.contrib USE_PGXS=1 install make -f Makefile.contrib USE_PGXS=1 clean } function do_builddb { INITDB_OPTS=$1 killall -9 postmaster postgres rm -rf ${PGHOME}/data initdb ${INITDB_OPTS} --no-locale -D ${PGHOME}/data pg_ctl -w -D ${PGHOME}/data start -o -p ${PGPORT} createdb -p ${PGPORT} testdb pgbench -p ${PGPORT} -i testdb psql -A -t -p ${PGPORT} testdbEOF file select '${PGHOME}/data/' || pg_relation_filepath(oid) from pg_class where relname like 'pgbench%'; EOF } function builddb_checksum_enabled { do_builddb -k } function builddb_checksum_disabled { do_builddb } function test_not_verify_checksum { LOG=$1 sed 's/^/pg_filedump /' file _test.sh sh _test.sh $LOG } function test_verify_checksum { LOG=$1 sed 's/^/pg_filedump -k /' file _test.sh sh _test.sh $LOG } build_check builddb_checksum_enabled test_not_verify_checksum test_enabled_not_verify.log test_verify_checksum test_enabled_verify.log builddb_checksum_disabled test_not_verify_checksum test_disabled_not_verify.log test_verify_checksum test_disabled_verify.log diff --git a/contrib/pg_filedump/README.pg_filedump b/contrib/pg_filedump/README.pg_filedump index 63d086f..b3050cd 100644 --- a/contrib/pg_filedump/README.pg_filedump +++ b/contrib/pg_filedump/README.pg_filedump @@ -2,7 +2,7 @@ pg_filedump - Display formatted contents of a PostgreSQL heap, index, or control file. Copyright (c) 2002-2010 Red Hat, Inc. 
-Copyright (c) 2011-2012, PostgreSQL Global Development Group +Copyright (c) 2011-2013, PostgreSQL Global Development Group This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by @@ -59,10 +59,10 @@ not require any manual adjustments of the Makefile. Invocation: -pg_filedump [-abcdfhixy] [-R startblock [endblock]] [-S blocksize] file +pg_filedump [-abcdfhikxy] [-R startblock [endblock]] [-S blocksize] file -Defaults are: relative addressing, range of the entire file, block size - as listed on block 0 in the file +Defaults are: relative addressing, range of the entire file, block
Re: [HACKERS] Block write statistics WIP
Hi, I'm looking into this patch as a reviewer.

(2013/05/24 10:33), Heikki Linnakangas wrote: On 23.05.2013 19:10, Greg Smith wrote: On 5/20/13 7:51 AM, Heikki Linnakangas wrote: It strikes me as kind of weird that the read side and write side are not more symmetrical. It might seem weird if you expect the API to be similar to POSIX read() and write(), where you can read() an arbitrary block at any location, and write() an arbitrary block at any location. A better comparison would be e.g. open() and close(). open() returns a file descriptor, which you pass to other functions to operate on the file. When you're done, you call close(fd). The file descriptor is completely opaque to the user program; you do all the operations through the functions that take the fd as an argument. Similarly, ReadBuffer() returns a Buffer, which is completely opaque to the caller, and you do all the operations through various functions and macros that operate on the Buffer. When you're done, you release the buffer with ReleaseBuffer(). (Sorry for the digression from the original topic; I don't have any problem with adding an optional Relation argument to MarkBuffer if you need that :-) )

Or should we add some pointer, which is associated with the Relation, into BufferDesc? Maybe an OID?

I'm thinking of FlushBuffer() too. How can we count physical writes for each relation in FlushBuffer()? Or is there any other appropriate place?

BTW, Greg's first patch could not be applied to the latest HEAD. I guess it needs some fixes, though I couldn't clarify exactly what.

Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
(2013/06/17 4:02), Fujii Masao wrote: On Sat, Mar 9, 2013 at 3:23 PM, Satoshi Nagayasu sn...@uptime.jp wrote: It is obviously easy to keep two types of function interfaces, one with regclass-type and another with text-type, in the next release for backward-compatibility like below: pgstattuple(regclass) -- safer interface. pgstattuple(text) -- will be deprecated in a future release. So you're thinking to remove pgstattuple(oid) soon?

AFAIK, a regclass-type argument would accept an OID value, which means the regclass type is upward-compatible with the oid type. So, even if the declaration is changed, compatibility could actually be kept. This test case (in sql/pgstattuple.sql) confirms that.

select * from pgstatindex('myschema.test_pkey'::regclass::oid);
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       2 |          0 |       8192 |             1 |              0 |          1 |           0 |             0 |             0.79 |                  0
(1 row)

Having both interfaces for a while would allow users enough time to rewrite their applications. Then, we will be able to obsolete (or just drop) the old interfaces in a future release, maybe 9.4 or 9.5. I think this approach would minimize the impact of such an interface change. So, I think we can clean up the function arguments in the pgstattuple module, and also we can have two interfaces, both regclass and text, for the next release. Any comments? In the document, you should mark old functions as deprecated.

I'm still considering changing the function name as Tom pointed out. How about pgstatbtindex? In fact, pgstatindex supports only BTree indexes, so pgstatbtindex seems more appropriate for this function. We can keep both the (old) pgstatindex and the (new) pgstatbtindex during the next 2-3 major releases, and the old one can be deprecated after that. Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC.
http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Add visibility map information to pg_freespace.
I'm looking into this patch as a reviewer.

(2013/06/19 18:03), Simon Riggs wrote: On 19 June 2013 09:19, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: It should be useful in other aspects, but it seems a bit complicated just to know about visibility bits for certain blocks. With your current patch you can only see the visibility info for blocks in cache, not for all blocks. So while you may think it is useful, it is also unnecessarily limited in scope. Let's just have something that is easy to use that lets us see the visibility state for a block, not just blocks in freespace.

I think we could also have these visibility map statistics in the pgstattuple function (as a new column) for this purpose.

IMHO, we have several modules for different purposes:
- pageinspect provides several functions for debugging purposes.
- pg_freespace provides a view for monitoring purposes.
- pgstattuple provides several functions for collecting specific table/index statistics.

So, we can have similar features in different modules. Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] By now, why PostgreSQL 9.2 don't support SSDs?
2013/03/30 23:31, Bruce Momjian wrote: On Sat, Mar 30, 2013 at 10:08:44PM +0800, 赖文豫 wrote: As we know, SSDs are widely used in various kinds of applications. But the SMGR in PostgreSQL still only supports magnetic disk. How do we make full use of SSDs to improve the performance of PostgreSQL? When the storage manager (SMGR) says magnetic disk, it is talking about read/write media with random access capability, vs. something like write-once media, which was originally supported in the code. Postgres works just fine with SSDs; the only adjustment you might want to make is to reduce random_page_cost.

BTW, would using a larger block size (64kB) improve performance when using an SSD drive?

I found that the configure script supports the --with-blocksize option to change the block size up to 32kB. (The configure script does not support a 64kB block size so far.) But I heard that a larger block size, like 256kB, would take advantage of SSD performance because of the block management within the SSD. So, I'm just curious to know about that.

Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
(2013/03/05 22:46), Robert Haas wrote: On Sun, Mar 3, 2013 at 5:32 PM, Tom Lane t...@sss.pgh.pa.us wrote: Maybe this is acceptable collateral damage. I don't know. But we definitely stand a chance of breaking applications if we change pgstatindex like this. It might be better to invent a differently-named function to replace pgstatindex. If this were a built-in function, we might have to make a painful decision between breaking backward compatibility and leaving this broken forever, but as it isn't, we don't. I think your suggestion of adding a new function is exactly right. We can remove the old one in a future release, and support both in the meantime. It strikes me that if extension versioning is for anything, this is it.

It is obviously easy to keep two types of function interfaces, one with regclass-type and another with text-type, in the next release for backward-compatibility like below:

pgstattuple(regclass) -- safer interface.
pgstattuple(text) -- will be deprecated in a future release.

Having both interfaces for a while would allow users enough time to rewrite their applications. Then, we will be able to obsolete (or just drop) the old interfaces in a future release, maybe 9.4 or 9.5. I think this approach would minimize the impact of such an interface change. So, I think we can clean up the function arguments in the pgstattuple module, and also we can have two interfaces, both regclass and text, for the next release. Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Fix pgstattuple/pgstatindex to use regclass-type as the argument
Hi, As I wrote before, I'd like to clean up pgstattuple functions to allow the same expressions. Re: [HACKERS] [RFC] pgstattuple/pgstatindex enhancement http://www.postgresql.org/message-id/511ee19b.5010...@uptime.jp My goal is to allow specifying a relation/index with several expressions, 'relname', 'schemaname.relname' and oid in all pgstattuple functions. pgstatindex() does not accept oid so far. I have found that the backward-compatibility can be kept when the arguments (text and/or oid) are replaced with regclass type. regclass type seems to be more appropriate here. So, I cleaned up those three functions, pgstattuple(), pgstatindex(), pg_relpages(), to accept a regclass-type argument, instead of using text and/or oid type, as the test cases show. select * from pgstatindex('test_pkey'); select * from pgstatindex('public.test_pkey'); select * from pgstatindex('myschema.test_pkey'); select * from pgstatindex('myschema.test_pkey'::regclass::oid); With attached patch, all functions in the pgstattuple module can accept the same expression to specify the target relation/index. Any comments or suggestions? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp pgstattuple_regclass_v1.diff.gz Description: application/gzip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [RFC] pgstattuple/pgstatindex enhancement
(2013/02/15 1:55), Robert Haas wrote: On Tue, Feb 12, 2013 at 10:22 AM, Satoshi Nagayasu sn...@uptime.jp wrote: (1) Fix pgstatindex arguments to work same as pgstattuple. As the document describes, pgstattuple accepts 'schema.table' expression and oid of the table, but pgstatindex doesn't. (because I didn't add that when I created pgstatindex...) http://www.postgresql.org/docs/devel/static/pgstattuple.html So, I'd like to change pgstatindex arguments to allow schema name and oid. Does it make sense? Not sure. It seems nice, but it's also a backward-compatibility break. So I don't know. Yeah, actually, the backward-compatibility issue is the first thing I have considered, and now I think we can keep it. Now, pgstattuple() function accepts following syntax: pgstattuple('table') -- table name (searches in search_path) pgstattuple('schema.table') -- schema and table name pgstattuple(1234) -- oid and pgstatindex() function only accepts below so far: pgstatindex('index') -- index name (searches in search_path) Then, we can easily add new syntax: pgstatindex('schema.index') -- schema and index name pgstatindex(1234) -- oid I think this would allow us to modify pgstatindex() without breaking the backward-compatibility. (2) Enhance pgstattuple/pgstatindex to allow block sampling. Now, we have large tables and indexes in PostgreSQL, and these are growing day by day. pgstattuple and pgstatindex are both very important to keep database performance well, but doing full-scans on large tables and indexes would generate big performance impact. So, now I think pgstattuple and pgstatindex should support 'block sampling' to collect block statistics with avoiding full-scans. With this block sampling feature, pgstattuple/pgstatindex would be able to collect block statistics from 1~10% of the blocks in the table/index if the table/index is large (maybe 10GB or more). Now that sounds really nice. Thanks. I will try it. 
Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [RFC] pgstattuple/pgstatindex enhancement
Hi, I'm thinking of a pgstattuple/pgstatindex enhancement. There are a few things I need to change, but I'd like to have some comments and suggestions from hackers before tackling them.

(1) Fix pgstatindex arguments to work the same as pgstattuple. As the document describes, pgstattuple accepts the 'schema.table' expression and the oid of the table, but pgstatindex doesn't (because I didn't add that when I created pgstatindex...). http://www.postgresql.org/docs/devel/static/pgstattuple.html So, I'd like to change the pgstatindex arguments to allow a schema name and an oid. Does it make sense?

(2) Enhance pgstattuple/pgstatindex to allow block sampling. Now, we have large tables and indexes in PostgreSQL, and these are growing day by day. pgstattuple and pgstatindex are both very important for keeping database performance healthy, but doing full scans on large tables and indexes would generate a big performance impact. So, I now think pgstattuple and pgstatindex should support 'block sampling' to collect block statistics while avoiding full scans. With this block sampling feature, pgstattuple/pgstatindex would be able to collect block statistics from 1~10% of the blocks in the table/index if the table/index is large (maybe 10GB or more). It would allow us to run pgstattuple/pgstatindex more easily. Is it worth having? Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
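[Editorial note] The block-sampling idea above can be sketched as follows. This is an illustrative Python sketch, not the proposed implementation (which would live in C inside the pgstattuple module); the function names and the free-space example are invented for illustration.

```python
# Sketch: estimate table-level statistics from a random subset of blocks
# instead of a full scan, then scale the result up to the whole table.
import random

def sample_block_numbers(total_blocks: int, fraction: float, seed: int = 0) -> list[int]:
    """Pick roughly `fraction` of the block numbers, uniformly at random."""
    k = max(1, int(total_blocks * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_blocks), k))

def estimate_free_space(per_block_free: dict[int, int], total_blocks: int,
                        fraction: float) -> float:
    """Scale the free space seen in the sampled blocks up to the table size."""
    sampled = sample_block_numbers(total_blocks, fraction)
    seen = sum(per_block_free[b] for b in sampled)
    return seen * (total_blocks / len(sampled))

# Pretend every block of a 10,000-block table has 100 bytes free; a 1%
# sample then extrapolates to the exact total of 1,000,000 bytes.
blocks = {b: 100 for b in range(10_000)}
print(round(estimate_free_space(blocks, 10_000, 0.01)))
```

The trade-off discussed in the thread is visible here: the estimate is exact for uniform tables but only approximate for skewed ones, which is why sampling 1~10% of blocks is proposed only for large relations.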
Re: [HACKERS] psql \l to accept patterns
Hi, I have tried this patch. https://commitfest.postgresql.org/action/patch_view?id=1051

2013/01/29 14:48, Peter Eisentraut wrote: On Mon, 2013-01-07 at 07:14 -0500, Peter Eisentraut wrote: Here is a patch for psql's \l command to accept patterns, like \d commands do. While at it, I also added an S option to show system objects and removed system objects from the default display. This might be a bit controversial, but it's how it was decided some time ago that the \d commands should act. Most people didn't like the S option, so here is a revised patch that just adds the pattern support.

It seems to work well with the latest git master, and I think it's good enough to be committed. BTW, is there any good place to put a new regression test for the psql command? I couldn't find one. Any comment or suggestion? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql \l to accept patterns
(2013/01/30 0:34), Tom Lane wrote: Satoshi Nagayasu sn...@uptime.jp writes: On Mon, 2013-01-07 at 07:14 -0500, Peter Eisentraut wrote: Here is a patch for psql's \l command to accept patterns, like \d BTW, is there any good place to put new regression test for the psql command? I couldn't find it out. As far as a test for this specific feature goes, I'd be against adding one, because it'd be likely to result in random failures in make installcheck mode, depending on what other databases are in the installation. More generally, we've tended to put tests of \d-style psql features together with the relevant backend-side tests.

Yes, I think so too. First of all, I was looking for some regression tests for the CREATE/ALTER/DROP DATABASE commands, but I couldn't find them in the test/regress/sql/ directory. So, I asked the question. I guess these database tests are in pg_regress.c. Right?

The proposed patch to add \gset adds a separate regression test file specifically for psql. I've been debating whether that was worth committing; but if there's near-term interest in adding any more tests for psql features that aren't closely connected to backend features, maybe it's worth having such a file.

Personally, I'm interested in having regression tests whatever the target is, because software tends to become more complicated. So, if we reach consensus on having dedicated tests for the psql command (or other client-side commands), I'd like to contribute to that. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] pgbench to the MAXINT
Hi, I have reviewed this patch. https://commitfest.postgresql.org/action/patch_view?id=1068

2012/12/21 Gurjeet Singh singh.gurj...@gmail.com: The patch is very much what you had posted, except for a couple of differences due to bit-rot. (i) I didn't have to #define MAX_RANDOM_VALUE64 since its cousin MAX_RANDOM_VALUE is not used by code anymore, and (ii) I used a ternary operator in the DDLs[] array to decide when to use bigint vs int columns. Please review. As for tests, I am currently running 'pgbench -i -s 21474' using unpatched pgbench, and am recording the time taken; scale factor 21475 had actually failed to do anything meaningful using unpatched pgbench. Next I'll run with '-s 21475' on the patched version to see if it does the right thing, and in acceptable time compared to '-s 21474'. What tests would you and others like to see, to get some confidence in the patch? The machine that I have access to has 62 GB RAM, 16-core 64-hw-threads, and about 900 GB of disk space.

I have tested this patch, and have confirmed that the columns for aid are switched to bigint, instead of int, when the scale factor is >= 20,000 (aid values would exceed the upper bound of int when sf > 21474).

Also, I added a few fixes on it:
- Fixed to apply to the current git master.
- Fixed to suppress a few more warnings about INT64_FORMAT.
- Minor improvement in the docs (just my suggestion).

I attached the revised one. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC http://www.uptime.jp/ pgbench-64-v7.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
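[Editorial note] The arithmetic behind the scale-factor cutover above can be checked with a short sketch. This is an illustrative Python sketch, not part of the patch; the only fact taken from pgbench is its standard scaling rule of 100,000 pgbench_accounts rows per scale-factor unit, and the `needs_bigint` helper is invented here.

```python
# Sketch: why pgbench needs bigint aid columns at large scale factors.
# pgbench creates scale * 100,000 rows in pgbench_accounts, and aid
# overflows a signed 32-bit int once that product exceeds 2^31 - 1.
ACCOUNTS_PER_SCALE = 100_000
INT32_MAX = 2**31 - 1

def needs_bigint(scale_factor: int) -> bool:
    """True if the largest aid value would overflow a 32-bit int."""
    return scale_factor * ACCOUNTS_PER_SCALE > INT32_MAX

# sf = 21474 still fits (2,147,400,000 <= 2,147,483,647), while
# sf = 21475 overflows -- matching the report that scale factor 21475
# "failed to do anything meaningful" with unpatched pgbench.
print(needs_bigint(21474))  # False
print(needs_bigint(21475))  # True
```

The patch's reviewed behavior of switching DDL to bigint from scale factor 20,000 upward is simply a round, conservative threshold below the exact overflow point.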
Re: [HACKERS] buffer assertion tripping under repeat pgbench load
Hi, I just reviewed this patch. https://commitfest.postgresql.org/action/patch_view?id=1035

2013/1/13 Greg Smith g...@2ndquadrant.com: On 12/26/12 7:23 PM, Greg Stark wrote: It's also possible it's a bad cpu, not bad memory. If it affects decrement or increment in particular it's possible that the pattern of usage on LocalRefCount is particularly prone to triggering it. This looks to be the winning answer. It turns out that under extended multi-hour loads at high concurrency, something related to CPU overheating was occasionally flipping a bit. One round of compressed air for all the fans/vents, a little tweaking of the fan controls, and now the system goes 24 hours with no problems. Sorry about all the noise over this. I do think the improved warning messages that came out of the diagnosis ideas are useful. The reworked code must slow down the checking a few cycles, but if you care about performance these assertions are tacked onto the biggest pig around. I added the patch to the January CF as Improve buffer refcount leak warning messages. The sample I showed with the patch submission was a simulated one. Here's the output from the last crash before resolving the issue, where the assertion really triggered:

WARNING: buffer refcount leak: [170583] (rel=base/16384/16578, blockNum=302295, flags=0x106, refcount=0 1073741824)
WARNING: buffers with non-zero refcount is 1
TRAP: FailedAssertion(!(RefCountErrors == 0), File: bufmgr.c, Line: 1712)

This patch is intended to improve the warning message at AtEOXact_Buffers(), but I guess another function, AtProcExit_Buffers(), needs to be modified as well. Right? With this additional fix, the patch could be applied to the current git master, and could be compiled with the --enable-cassert option. Then, I need some suggestions from hackers to continue this review. How should I reproduce this message for review? This is a debug warning message, so it's not easy for me to reproduce it. Any suggestion?
-- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC http://www.uptime.jp/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New statistics for WAL buffer dirty writes
(2012/11/27 7:42), Alvaro Herrera wrote: Satoshi Nagayasu escribió: I attached the latest one, which splits the reset_time for bgwriter and walwriter, and provides a new system view, called pg_stat_walwriter, to show the dirty write counter and the reset time. Thanks. I gave this a look and I have a couple of comments:

1. The counter is called dirty_write. I imagine that this comes directly from the name of the nearby DTrace probe, TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START. That probe comes from email 494c1565.3060...@sun.com committed in 4ee79fd20d9a. But there wasn't much discussion about the name back then. Maybe that was fine at the time because it was pretty much an internal matter, being so deep in the code. But we're now going to expose it to regular users, so we'd better choose a very good name, because we're going to be stuck with it for a very long time. And the name dirty doesn't ring with me too well; what matters here is not that we're writing a buffer that is dirty, but the fact that we're writing while holding the WalInsertLock, so the name should convey the fact that this is a locked write or something like that. Or at least that's how I see the issue. Note the documentation wording:

+     <entry><structfield>dirty_writes</></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number of dirty writes, which means flushing wal buffers
+      because of its full.</entry>

Yes, dirty_writes came from the probe name, and if it needs to be changed, buffers_flush would make sense to me in this situation, because this counter is intended to show WAL writes due to the wal buffers being full.

2. Should we put bgwriter and walwriter data in separate structs? I don't think this is necessary, but maybe someone opines differently? I tried to minimize the impact of this patch, but if I can change this struct, yes, I'd like to split it into two structs.

3.
+/*
+ * WalWriter global statistics counter.
+ * Despite its name, this counter is actually used not only in walwriter, + * but also in each backend process to sum up xlog dirty writes. + * Those processes would increment this counter in each XLogWrite call, + * then send it to the stat collector process. + */ +PgStat_MsgWalWriter WalWriterStats; Maybe we should use a different name for the struct, to avoid having to excuse ourselves for the name being wrong ... Ok. How about WalBufferStats? I think this name could be accepted in both the wal writer and each backend process. -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New statistics for WAL buffer dirty writes
(2012/12/10 3:06), Tomas Vondra wrote: On 29.10.2012 04:58, Satoshi Nagayasu wrote: 2012/10/24 1:12, Alvaro Herrera wrote: Satoshi Nagayasu escribió: With this patch, the walwriter process and each backend process would sum up dirty writes, and send them to the stat collector. So, the value could be saved in the stat file, and could be kept on restarting. The statistics could be retrieved using the pg_stat_get_xlog_dirty_writes() function, and could be reset by calling pg_stat_reset_shared('walwriter'). Now, I have one concern. The reset time could be captured in globalStats.stat_reset_timestamp, but this value is the same as the bgwriter one. So, once pg_stat_reset_shared('walwriter') is called, the stats_reset column in pg_stat_bgwriter represents the reset time for walwriter, not for bgwriter. How should we handle this? Should we split this value? And should we have a new system view for walwriter? I think the answer to the two last questions is yes. It doesn't seem to make sense, to me, to have a single reset timing for what are effectively two separate things. Please submit an updated patch to the next CF. I'm marking this one returned with feedback. Thanks. I attached the latest one, which splits the reset_time for bgwriter and walwriter, and provides a new system view, called pg_stat_walwriter, to show the dirty write counter and the reset time.

I've done a quick review of the v4 patch: Thanks for the review, and sorry for my delayed response.

1) applies fine on HEAD, compiles fine

2) make installcheck fails because of a difference in the 'rules' test suite (there's a new view pg_stat_walwriter - see the attached patch for a fixed version of expected/rules.out) Ah, I forgot about the regression test. I will fix it. Thanks.

3) I do agree with Alvaro that using the same struct for two separate components (bgwriter and walwriter) seems a bit awkward.
For example you need to have two separate stat_reset fields, the reset code becomes much more verbose (because you need to list individual fields) etc. So I'd vote to either split this into two structures or keeping it as a single structure (although with two views on top of it). Ok, I will split it into two structs, PgStat_BgWriterGlobalStats and PgStat_WalWriterGlobalStats, and will modify PgStat_GlobalStats to hold those two structs in the stat collector. 4) Are there any other fields that might be interesting? Right now there's just dirty_writes but I guess there are other values. E.g. how much data was actually written etc.? AFAIK, I think those numbers can be obtained by calling pg_current_xlog_insert_location() or pg_current_xlog_location(), but if we need it, I will add it. Regards, Tomas -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
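[Editorial note] The struct split agreed above (separate bgwriter and walwriter stats, each with its own stats_reset) can be illustrated with a small sketch. This is an editorial Python model, not the C patch; the field and method names are invented, and only the design point — resetting one component's stats leaves the other untouched — is taken from the thread.

```python
# Sketch: per-component stats records with independent reset timestamps,
# so pg_stat_reset_shared('walwriter') no longer clobbers the bgwriter's
# stats_reset (the problem described in the thread).
from dataclasses import dataclass, field

@dataclass
class BgWriterStats:
    buffers_clean: int = 0
    stats_reset: int = 0   # stand-in for a reset timestamp

@dataclass
class WalWriterStats:
    dirty_writes: int = 0
    stats_reset: int = 0

@dataclass
class GlobalStats:
    bgwriter: BgWriterStats = field(default_factory=BgWriterStats)
    walwriter: WalWriterStats = field(default_factory=WalWriterStats)

    def reset_shared(self, target: str, now: int) -> None:
        # Reset only the named component, recording its own reset time.
        if target == "bgwriter":
            self.bgwriter = BgWriterStats(stats_reset=now)
        elif target == "walwriter":
            self.walwriter = WalWriterStats(stats_reset=now)

g = GlobalStats()
g.walwriter.dirty_writes = 42
g.reset_shared("walwriter", now=100)
print(g.walwriter.dirty_writes, g.walwriter.stats_reset, g.bgwriter.stats_reset)
# 0 100 0
```

With a single shared struct, the last line would have shown the bgwriter's stats_reset changing too, which is exactly the confusion the split avoids.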
Re: [HACKERS] RFC: Timing Events
(2012/11/03 10:44), Josh Berkus wrote: I don't see all that going into core without a much bigger push than I think people will buy. What people really want for all these is a proper trending system, and that means graphs and dashboards and bling--not a history table. Well, I'm particularly thinking of autoconfiguration. For example, to set vacuum_freeze_min_age properly, you have to know the XID burn rate of the server, which is only available via history. I really don't want to be depending on a graphical monitoring utility to find these things out. This whole approach has the assumption that things are going to fall off sometimes. To expand on that theme for a second, right now I'm more worried about the 99% class of problems. Neither pg_stat_statements nor this idea is very good for tracking the rare rogue problem down. They're both aimed to make things that happen a lot more statistically likely to be seen, by giving an easier UI to glare at them frequently. That's not ideal, but I suspect really fleshing out the whole queue consumer - table idea needs to happen to do much better. I'm just concerned that for some types of incidents, it would be much more than 1% *of what you want to look at* which falls off. For example, consider a server which does 95% reads at a very high rate, but has 2% of its writes chronically having lock waits. That's something you want to solve, but it seems fairly probable that these relatively infrequent queries would have fallen off the bottom of pg_stat_statements. Same thing with the relative handful of queries which do large on-disk sorts. The problem I'm worried about is that pg_stat_statements is designed to keep the most frequent queries, but sometimes the thing you really need to look at is not in the list of most frequent queries.

I think auto_explain would help you solve such rare incidents if it could dump several statistics into the server log, including lock waits and block read/write statistics per session, for example.
Do we have something to add to auto_explain? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New statistics for WAL buffer dirty writes
2012/10/24 1:12, Alvaro Herrera wrote: Satoshi Nagayasu escribió: With this patch, the walwriter process and each backend process would sum up dirty writes, and send them to the stat collector. So, the value could be saved in the stat file, and could be kept on restarting. The statistics could be retrieved using the pg_stat_get_xlog_dirty_writes() function, and could be reset by calling pg_stat_reset_shared('walwriter'). Now, I have one concern. The reset time could be captured in globalStats.stat_reset_timestamp, but this value is the same as the bgwriter one. So, once pg_stat_reset_shared('walwriter') is called, the stats_reset column in pg_stat_bgwriter represents the reset time for walwriter, not for bgwriter. How should we handle this? Should we split this value? And should we have a new system view for walwriter? I think the answer to the two last questions is yes. It doesn't seem to make sense, to me, to have a single reset timing for what are effectively two separate things. Please submit an updated patch to the next CF. I'm marking this one returned with feedback.

Thanks. I attached the latest one, which splits the reset_time for bgwriter and walwriter, and provides a new system view, called pg_stat_walwriter, to show the dirty write counter and the reset time.

Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b4fcbaf..0ae885b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1907,6 +1907,13 @@ include 'filename'
         results in most cases.
        </para>
 
+       <para>
+        When you see pg_stat_walwriter.dirty_write, which means number
+        of buffer flushing at buffer full, is continuously increasing
+        in your running server, you may need to enlarge this buffer
+        size.
+       </para>
+
      </listitem>
     </varlistentry>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 39ccfbb..3117f91 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -278,6 +278,14 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
      </row>
 
      <row>
+      <entry><structname>pg_stat_walwriter</><indexterm><primary>pg_stat_walwriter</primary></indexterm></entry>
+      <entry>One row only, showing statistics about the wal writer
+       process's activity. See <xref linkend="pg-stat-walwriter-view">
+       for details.
+      </entry>
+     </row>
+
+     <row>
       <entry><structname>pg_stat_database</><indexterm><primary>pg_stat_database</primary></indexterm></entry>
       <entry>One row per database, showing database-wide statistics. See
        <xref linkend="pg-stat-database-view"> for details.
@@ -735,6 +743,39 @@ postgres: <replaceable>user</> <replaceable>database</> <replaceable>host</> <re
    single row, containing global data for the cluster.
   </para>
 
+  <table id="pg-stat-walwriter-view" xreflabel="pg_stat_walwriter">
+   <title><structname>pg_stat_walwriter</structname> View</title>
+
+   <tgroup cols="3">
+    <thead>
+     <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry><structfield>dirty_writes</></entry>
+      <entry><type>bigint</type></entry>
+      <entry>Number of dirty writes, which means flushing wal buffers
+       because of its full.</entry>
+     </row>
+     <row>
+      <entry><structfield>stats_reset</></entry>
+      <entry><type>timestamp with time zone</type></entry>
+      <entry>Time at which these statistics were last reset</entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_walwriter</structname> view will always have a
+   single row, containing global data for the cluster.
+  </para>
+
   <table id="pg-stat-database-view" xreflabel="pg_stat_database">
    <title><structname>pg_stat_database</structname> View</title>
    <tgroup cols="3">
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d251d08..631a0af 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1347,6 +1347,7 @@ AdvanceXLInsertBuffer(bool new_segment)
 			WriteRqst.Write = OldPageRqstPtr;
 			WriteRqst.Flush = 0;
 			XLogWrite(WriteRqst, false, false);
+			WalWriterStats.m_xlog_dirty_writes++;
 			LWLockRelease(WALWriteLock);
 			TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 		}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 607a72f..40f0c34 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -671,6 +671,11 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_walwriter AS
+    SELECT
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/19 4:36, Robert Haas wrote: On Tue, Oct 16, 2012 at 11:31 AM, Satoshi Nagayasu sn...@uptime.jp wrote: A flight recorder must not be disabled. Collecting performance data must be a top priority for a DBA. This analogy is inapposite, though, because a flight recorder rarely crashes the aircraft. If it did, people might have second thoughts about the "never disable the flight recorder" rule. I have had a couple of different excuses to look into the overhead of timing lately, and it does indeed seem that on many modern Linux boxes even extremely frequent gettimeofday calls produce only very modest amounts of overhead. I agree that such a performance instrument needs to be improved if it has a critical performance impact on production environments. So, I'm still looking for a better implementation to decrease the performance impact. However, the most important question here is: how can we understand PostgreSQL behavior without reading tons of source code and having hacking skills? Recently, I've been having lots of conversations with database specialists (sales guys and DBAs) trying to use PostgreSQL instead of a commercial database, and they are always struggling to understand PostgreSQL behavior, because no one can observe and/or explain it. Sadly, the situation on Windows doesn't look so good. I don't remember the exact numbers but I think it was something like 40 or 60 or 80 times slower on the Windows box one of my colleagues tested than it is on Linux. And it turns out that that overhead really is measurable and does matter if you do it in a code path that gets run frequently. Of course I am enough of a Linux geek that I don't use Windows myself and curse my fate when I do have to use it, but the reality is that we have a huge base of users who only use PostgreSQL at all because it runs on Windows, and we can't just throw those people under the bus.
I think that older platforms like HP/UX likely have problems in this area as well although I confess to not having tested. Do you mean my stat patch should have more performance tests on other platforms? Yes, I agree with that. That having been said, if we're going to do this, this is probably the right approach, because it only calls gettimeofday() in the case where the lock acquisition is contended, and that is a lot cheaper than calling it in all cases. Maybe it's worth finding a platform where pg_test_timing reports that timing is very slow and then measuring how much impact this has on something like a pgbench or pgbench -S workload. We might find that it is in fact negligible. I'm pretty certain that it will be almost if not entirely negligible on Linux but that's not really the case we need to worry about. Thanks for the suggestion of a better implementation. As I mentioned above, I'm always looking for a better idea and solution to meet our purpose. Here, I want to share my concern with you again. PostgreSQL is getting more complicated in order to improve performance and stability, and I think that may be good news. But it is also getting more difficult to understand without deep knowledge nowadays, and that is actually bad news. From my point of view, that's a huge hurdle to educating DBAs and expanding the PostgreSQL user base. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/19 23:48, Fujii Masao wrote: On Wed, Oct 17, 2012 at 12:31 AM, Satoshi Nagayasu sn...@uptime.jp wrote: 2012/10/16 2:40, Jeff Janes wrote: On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane t...@sss.pgh.pa.us wrote: Satoshi Nagayasu sn...@uptime.jp writes: (2012/10/14 13:26), Fujii Masao wrote: The tracing lwlock usage seems to still cause a small performance overhead even if reporting is disabled. I believe some users would prefer to avoid such overhead even if pg_stat_lwlocks is not available. It should be up to a user to decide whether to trace lwlock usage, e.g., by using a trace_lwlock parameter, I think. Frankly speaking, I do not agree with disabling a performance instrument to improve performance. A DBA must *always* monitor the performance metrics under such a heavy workload. This brings up a question that I don't think has been honestly considered, which is exactly whom a feature like this is targeted at. TBH I think it's of about zero use to DBAs (making the above argument bogus). It is potentially of use to developers, but a DBA is unlikely to be able to do anything about lwlock-level contention even if he has the knowledge to interpret the data. Waiting on BufFreelistLock suggests increasing shared_buffers. Waiting on ProcArrayLock perhaps suggests use of a connection pooler (or does it?) WALWriteLock suggests doing something about IO, either moving logs to different disks, or getting BBU, or something. WALInsertLock suggests trying to adapt your data loading process so it can take advantage of the bulk, or maybe increasing wal_buffers. And a lot of waiting on any of the locks gives a piece of information the DBA can use when asking the mailing lists for help, even if it doesn't allow him to take unilateral action. So I feel it isn't something that should be turned on in production builds. I'd vote for enabling it by a non-default configure option, and making sure that it doesn't introduce any overhead when the option is off.
I think hackers would benefit from getting reports from DBAs in the field with concrete data on bottlenecks. If the only way to get this is to do some non-standard compile and deploy it to production, or to create a benchmarking copy of the production database system including a realistic work-load driver and run the non-standard compile there; either of those is going to dramatically cut down on the participation. Agreed. The hardest part of investigating a performance issue is reproducing the situation in an environment different from production. I often see people struggling to reproduce a situation with different hardware and (similar but) different workload. It is very time consuming, and it often fails, too. So, we need to collect any piece of information which would help us understand what's going on within the production PostgreSQL, without any changes to binaries and configurations in the production environment. That's the reason why I stick to a built-in instrument, and I disagree with disabling such an instrument even if it has a minor performance overhead. A flight recorder must not be disabled. Collecting performance data must be a top priority for a DBA. pg_stat_lwlocks does not seem to be an adequate 'flight recorder'. It collects only narrow performance data concerning lwlocks. What we should have as a 'flight recorder' is something like Oracle's wait events, I think. Not only lwlocks but all wait events should be collected for DBAs to investigate performance bottlenecks. That's the reason why I said I accept that it's not enough for DBAs, and I'm going to work on other lock stats. This idea was proposed by Itagaki-san before. Though he implemented the sampling-profiler patch, it failed to be committed. I'm not sure why not. Yeah, I know the previous patch posted by Itagaki-san. So, I'm questioning why (again) this time. I think this is a very important question because it would be critical in order to attract new DBAs to PostgreSQL.
Anyway, I think that this would be the right approach to providing a 'flight recorder' to DBAs. Ok, I guess we have reached a consensus on having some kind of flight recorder. Right? Actually, that seems like great progress from my point of view. :) Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/20 2:45, Tom Lane wrote: Satoshi Nagayasu sn...@uptime.jp writes: Ok, I guess we have reached a consensus on having some kind of flight recorder. Right? I remain unconvinced. I think the argument that this will promote novice understanding is complete hogwash: making any sense of lwlock-level stats will require deep PG understanding, not create it. And despite Jeff's optimistic views upthread, I don't think very many people have or will acquire such understanding. So I'm really resistant to accepting even minimal overhead in pursuit of this goal --- and I do not believe the overhead will be minimal, either. As I wrote earlier, I'm actually not pushing the lwlock stats themselves in this context, though I still believe they would be useful. (I don't think they would be hogwash, because I needed them.) I'm just thinking about how to help DBAs understand PostgreSQL behavior without a deep dive into the code when they face some kind of performance issue, and also how to make it easier for PG experts, including tech support people, to help newbies. As I posted before, I did an investigation on WAL performance when I faced random commit delays, and I found a problem with the storage device by observing WALInsertLock and WALWriteLock statistics using SystemTap. How could I do that without SystemTap on the production database? SystemTap would kill performance (more than my patch), and sometimes kills the process, too. Do you really think that we do not need any kind of lock statistics at all? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/16 2:40, Jeff Janes wrote: On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane t...@sss.pgh.pa.us wrote: Satoshi Nagayasu sn...@uptime.jp writes: (2012/10/14 13:26), Fujii Masao wrote: The tracing lwlock usage seems to still cause a small performance overhead even if reporting is disabled. I believe some users would prefer to avoid such overhead even if pg_stat_lwlocks is not available. It should be up to a user to decide whether to trace lwlock usage, e.g., by using a trace_lwlock parameter, I think. Frankly speaking, I do not agree with disabling a performance instrument to improve performance. A DBA must *always* monitor the performance metrics under such a heavy workload. This brings up a question that I don't think has been honestly considered, which is exactly whom a feature like this is targeted at. TBH I think it's of about zero use to DBAs (making the above argument bogus). It is potentially of use to developers, but a DBA is unlikely to be able to do anything about lwlock-level contention even if he has the knowledge to interpret the data. Waiting on BufFreelistLock suggests increasing shared_buffers. Waiting on ProcArrayLock perhaps suggests use of a connection pooler (or does it?) WALWriteLock suggests doing something about IO, either moving logs to different disks, or getting BBU, or something. WALInsertLock suggests trying to adapt your data loading process so it can take advantage of the bulk, or maybe increasing wal_buffers. And a lot of waiting on any of the locks gives a piece of information the DBA can use when asking the mailing lists for help, even if it doesn't allow him to take unilateral action. So I feel it isn't something that should be turned on in production builds. I'd vote for enabling it by a non-default configure option, and making sure that it doesn't introduce any overhead when the option is off. I think hackers would benefit from getting reports from DBAs in the field with concrete data on bottlenecks.
If the only way to get this is to do some non-standard compile and deploy it to production, or to create a benchmarking copy of the production database system including a realistic work-load driver and run the non-standard compile there; either of those is going to dramatically cut down on the participation. Agreed. The hardest part of investigating a performance issue is reproducing the situation in an environment different from production. I often see people struggling to reproduce a situation with different hardware and (similar but) different workload. It is very time consuming, and it often fails, too. So, we need to collect any piece of information which would help us understand what's going on within the production PostgreSQL, without any changes to binaries and configurations in the production environment. That's the reason why I stick to a built-in instrument, and I disagree with disabling such an instrument even if it has a minor performance overhead. A flight recorder must not be disabled. Collecting performance data must be a top priority for a DBA. Regards, Cheers, Jeff -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
2012/10/15 1:43, Tom Lane wrote: Satoshi Nagayasu sn...@uptime.jp writes: (2012/10/14 13:26), Fujii Masao wrote: The tracing lwlock usage seems to still cause a small performance overhead even if reporting is disabled. I believe some users would prefer to avoid such overhead even if pg_stat_lwlocks is not available. It should be up to a user to decide whether to trace lwlock usage, e.g., by using a trace_lwlock parameter, I think. Frankly speaking, I do not agree with disabling a performance instrument to improve performance. A DBA must *always* monitor the performance metrics under such a heavy workload. This brings up a question that I don't think has been honestly considered, which is exactly whom a feature like this is targeted at. TBH I think it's of about zero use to DBAs (making the above argument bogus). It is potentially of use to developers, but a DBA is unlikely to be able to do anything about lwlock-level contention even if he has the knowledge to interpret the data. Actually, I'm not a developer. I'm just a DBA, and I needed such an instrument when I was asked to investigate strange WAL behavior that produced unexpected/random commit delays under heavy workload. And another patch (the WAL dirty flush statistics patch) I have submitted comes from the same experience. https://commitfest.postgresql.org/action/patch_view?id=893 Unfortunately, since I didn't have such an instrument at that time, I used SystemTap to investigate WAL behaviors, including call counts and waited time, but using SystemTap was really awful, and I thought PostgreSQL needs to have some built-in instrument for DBAs. I needed to determine the bottleneck around WAL, such as lock contention and/or write performance of the device, but I couldn't find anything without an instrument. I accept that I'm focusing only on WAL-related lwlocks, and that it is not enough for ordinary DBAs, but I still need it to understand PostgreSQL behavior without having deep knowledge of and experience with the WAL design and implementation.
So I feel it isn't something that should be turned on in production builds. I'd vote for enabling it by a non-default configure option, and making sure that it doesn't introduce any overhead when the option is off. There is another option to eliminate the performance overhead for this purpose. As I tried in the first patch, instead of reporting through the pgstat collector process, each backend could directly increment lwlock counters allocated in shared memory. http://archives.postgresql.org/message-id/4fe9a6f5.2080...@uptime.jp Here are more benchmark results, including my first patch.

[HEAD]
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

[My first patch]
number of transactions actually processed: 3453745
tps = 57562.196971 (including connections establishing)
tps = 57569.197838 (excluding connections establishing)

Actually, I'm not sure why my patch makes PostgreSQL faster, :D but the performance seems better than with my second patch. I think it still needs some trick to keep the counters in pgstat.stat across restarts, but it would be more acceptable in terms of performance overhead. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
(2012/10/14 13:26), Fujii Masao wrote: On Sun, Oct 14, 2012 at 10:46 AM, Satoshi Nagayasu sn...@uptime.jp wrote:

[HEAD]
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

(snip)

[pg_stat_lwlocks patch (reporting disabled)]
number of transactions actually processed: 3429370
tps = 57155.286475 (including connections establishing)
tps = 57163.996943 (excluding connections establishing)

So, I think some additional hack to reduce reporting is needed. Would it be acceptable in terms of performance? The tracing lwlock usage seems to still cause a small performance overhead even if reporting is disabled. I believe some users would prefer to avoid such overhead even if pg_stat_lwlocks is not available. It should be up to a user to decide whether to trace lwlock usage, e.g., by using a trace_lwlock parameter, I think. Frankly speaking, I do not agree with disabling a performance instrument to improve performance. A DBA must *always* monitor the performance metrics under such a heavy workload. But it's OK to add a parameter to enable/disable it. Any other comments? Another comment: are local_calls/waits/time_ms really required? I'm not sure how that info would help performance debugging. I think there is a need to observe how your test query is affected by workload from other sessions, so splitting local and shared statistics would be nice to have. Just my thought, though. What I don't like is that a session can see only the local stats of its own session. It's hard to monitor local stats. Imagine the case where you'd like to monitor the local stats of each pgbench session. To monitor such stats, you need to modify pgbench so that each of its sessions monitors its own local stats. Even if you run monitoring software, it cannot collect those stats because they don't belong to the session that it uses. Ok.
I'm waiting for more comments from others. Dropping it is easy for me, but are there any other comments? Josh? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
[HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
Hi all, I have fixed my previous patch for the pg_stat_lwlocks view, and as Josh commented, it now supports local and global (shared) statistics in the same system view. Local statistics mean the counters are only effective within the same session, and shared ones mean the counters are shared within the entire cluster. The global statistics are collected via the pgstat collector process like other statistics are. Now, the global statistics struct has been split into two parts for different uses, for bgwriter stats and lwlock stats. Therefore, calling pg_stat_reset_shared('bgwriter') or pg_stat_reset_shared('lwlocks') resets only the dedicated struct, not the entire PgStat_GlobalStats. Comments and review are always welcome. Regards, --

postgres=# SELECT * FROM pg_stat_lwlocks;
 lwlockid | local_calls | local_waits | local_time_ms | shared_calls | shared_waits | shared_time_ms
----------+-------------+-------------+---------------+--------------+--------------+----------------
        0 |           0 |           0 |             0 |         4268 |            0 |              0
        1 |          43 |           0 |             0 |          387 |            0 |              0
        2 |           0 |           0 |             0 |           19 |            0 |              0
        3 |           0 |           0 |             0 |           28 |            0 |              0
        4 |           3 |           0 |             0 |          315 |            0 |              0
        5 |           0 |           0 |             0 |           24 |            0 |              0
        6 |           1 |           0 |             0 |           76 |            0 |              0
        7 |           0 |           0 |             0 |        16919 |            0 |              0
        8 |           0 |           0 |             0 |            0 |            0 |              0
        9 |           0 |           0 |             0 |            0 |            0 |              0
       10 |           0 |           0 |             0 |            0 |            0 |              0
       11 |           0 |           0 |             0 |           75 |            0 |              0
       12 |           0 |           0 |             0 |            0 |            0 |              0
       13 |           0 |           0 |             0 |            0 |            0 |              0
       14 |           0 |           0 |             0 |            0 |            0 |              0
       15 |           0 |           0 |             0 |            0 |            0 |              0
       16 |           0 |           0 |             0 |            0 |            0 |              0
       17 |           0 |           0 |             0 |        61451 |            6 |              0
       18 |           0 |           0 |             0 |            0 |            0 |              0
       19 |           0 |           0 |             0 |            0 |            0 |              0
       20 |           0 |           0 |             0 |            0 |            0 |              0
       21 |           1 |           0 |             0 |            9 |            0 |              0
       22 |           0 |           0 |             0 |            0 |            0 |              0
       23 |           0 |           0 |             0 |            0 |            0 |              0
       24 |           0 |           0 |             0 |            1 |            0 |              0
       25 |           0 |           0 |             0 |            0 |            0 |              0
       26 |           2 |           0 |             0 |           18 |            0 |              0
       27 |           0 |           0 |             0 |            0 |            0 |              0
       28 |           0 |           0 |             0 |            0 |            0 |              0
       29 |           0 |           0 |             0 |            0 |            0 |              0
       30 |           0 |           0 |             0 |            0 |            0 |              0
       31 |           0 |           0 |             0 |            0 |            0 |              0
       32 |           0 |           0 |             0 |            0 |            0 |              0
       33 |           4 |           0 |             0 |       207953 |            0 |              0
       50 |           8 |           0 |             0 |        33388 |            0 |              0
       67 |           0 |           0 |             0 |            0 |            0 |              0
(36 rows)

postgres=# -- 2012/06/26 21:11, Satoshi Nagayasu wrote: Hi all, I've modified the pg_stat_lwlocks patch to be able to work
with the latest PostgreSQL Git code. This patch provides: pg_stat_lwlocks New system view
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
Hi, 2012/10/13 23:05, Satoshi Nagayasu wrote: Hi all, I have fixed my previous patch for the pg_stat_lwlocks view, and as Josh commented, it now supports local and global (shared) statistics in the same system view. Sorry, I found my mistakes. A new, fixed one is attached to this mail. Regards, Local statistics mean the counters are only effective within the same session, and shared ones mean the counters are shared within the entire cluster. The global statistics are collected via the pgstat collector process like other statistics are. Now, the global statistics struct has been split into two parts for different uses, for bgwriter stats and lwlock stats. Therefore, calling pg_stat_reset_shared('bgwriter') or pg_stat_reset_shared('lwlocks') resets only the dedicated struct, not the entire PgStat_GlobalStats. Comments and review are always welcome. Regards, --

postgres=# SELECT * FROM pg_stat_lwlocks;
 lwlockid | local_calls | local_waits | local_time_ms | shared_calls | shared_waits | shared_time_ms
----------+-------------+-------------+---------------+--------------+--------------+----------------
        0 |           0 |           0 |             0 |         4268 |            0 |              0
        1 |          43 |           0 |             0 |          387 |            0 |              0
        2 |           0 |           0 |             0 |           19 |            0 |              0
        3 |           0 |           0 |             0 |           28 |            0 |              0
        4 |           3 |           0 |             0 |          315 |            0 |              0
        5 |           0 |           0 |             0 |           24 |            0 |              0
        6 |           1 |           0 |             0 |           76 |            0 |              0
        7 |           0 |           0 |             0 |        16919 |            0 |              0
        8 |           0 |           0 |             0 |            0 |            0 |              0
        9 |           0 |           0 |             0 |            0 |            0 |              0
       10 |           0 |           0 |             0 |            0 |            0 |              0
       11 |           0 |           0 |             0 |           75 |            0 |              0
       12 |           0 |           0 |             0 |            0 |            0 |              0
       13 |           0 |           0 |             0 |            0 |            0 |              0
       14 |           0 |           0 |             0 |            0 |            0 |              0
       15 |           0 |           0 |             0 |            0 |            0 |              0
       16 |           0 |           0 |             0 |            0 |            0 |              0
       17 |           0 |           0 |             0 |        61451 |            6 |              0
       18 |           0 |           0 |             0 |            0 |            0 |              0
       19 |           0 |           0 |             0 |            0 |            0 |              0
       20 |           0 |           0 |             0 |            0 |            0 |              0
       21 |           1 |           0 |             0 |            9 |            0 |              0
       22 |           0 |           0 |             0 |            0 |            0 |              0
       23 |           0 |           0 |             0 |            0 |            0 |              0
       24 |           0 |           0 |             0 |            1 |            0 |              0
       25 |           0 |           0 |             0 |            0 |            0 |              0
       26 |           2 |           0 |             0 |           18 |            0 |              0
       27 |           0 |           0 |             0 |            0 |            0 |              0
       28 |           0 |           0 |             0 |            0 |            0 |              0
       29 |           0 |           0 |             0 |            0 |            0 |              0
       30 |           0 |           0 |             0 |            0 |            0 |              0
       31 |           0 |           0 |             0 |            0 |            0 |              0
       32 |           0 |           0 |             0 |            0 |            0 |              0
       33 |           4 |           0 |             0 |       207953 |            0 |              0
       50 |           8 |           0 |             0 |        33388 |            0 |              0
       67 |           0 |           0 |             0 |            0 |            0 |              0
(36 rows)
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2
Thanks for the review. 2012/10/14 8:55, Michael Paquier wrote: On Sun, Oct 14, 2012 at 6:00 AM, Fujii Masao masao.fu...@gmail.com wrote: On Sun, Oct 14, 2012 at 3:34 AM, Fujii Masao masao.fu...@gmail.com wrote: On Sat, Oct 13, 2012 at 11:34 PM, Satoshi Nagayasu sn...@uptime.jp wrote: Hi, 2012/10/13 23:05, Satoshi Nagayasu wrote: Hi all, I have fixed my previous patch for the pg_stat_lwlocks view, and as Josh commented, it now supports local and global (shared) statistics in the same system view. Sorry, I found my mistakes. A new, fixed one is attached to this mail. Thanks for revising the patch. Here are the comments: The documentation needs to be updated. The patch caused the following compile warnings on my machine: pgstat.c:1357: warning: no previous prototype for 'pgstat_report_lwlockstat' postgres.c:3922: warning: implicit declaration of function 'pgstat_report_lwlockstat' pgstatfuncs.c:1854: warning: no previous prototype for 'pg_stat_reset_lwlocks' Oops. I just fixed them. Thanks. In my test, this patch caused measurable performance overhead. I created the test database by pgbench -s10 and ran pgbench -c8 -j8 -T60 -S. Results are:

[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

Oops! Obviously I copied and pasted the test result wrongly... Here is the right result.
[HEAD]
number of transactions actually processed: 1401369
tps = 23351.375811 (including connections establishing)
tps = 23355.900043 (excluding connections establishing)

[PATCH]
number of transactions actually processed: 1092400
tps = 18179.498013 (including connections establishing)
tps = 18182.450824 (excluding connections establishing)

Is the performance difference due only to the mutex lock being taken? I think it comes from the high-frequency reporting through the pgstat collector process, which means calling pgstat_report_lwlockstat() at PostgresMain(). diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index 585db1a..5ca2c6f 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -3919,6 +3919,8 @@ PostgresMain(int argc, char *argv[], const char *username) pgstat_report_activity(STATE_IDLE, NULL); } + pgstat_report_lwlockstat(); + ReadyForQuery(whereToSendOutput); send_ready_for_query = false; } When I reduced reporting (or just disabled it), the performance was not affected by this patch. Here are some additional results of the same performance test Fujii-san ran.
[HEAD]
number of transactions actually processed: 3439971
tps = 57331.891602 (including connections establishing)
tps = 57340.932324 (excluding connections establishing)

[pg_stat_lwlocks patch (reporting enabled)]
number of transactions actually processed: 2665550
tps = 44425.038125 (including connections establishing)
tps = 44430.565651 (excluding connections establishing)

[pg_stat_lwlocks patch (reporting disabled)]
number of transactions actually processed: 3429370
tps = 57155.286475 (including connections establishing)
tps = 57163.996943 (excluding connections establishing)

[pg_stat_lwlocks patch (reporting reduced to 1/100)]
number of transactions actually processed: 3421879
tps = 57030.660814 (including connections establishing)
tps = 57038.950498 (excluding connections establishing)

So, I think some additional hack to reduce reporting is needed. Would it be acceptable in terms of performance? Another comment: are local_calls/waits/time_ms really required? I'm not sure how that info would help performance debugging. I think there is a need to observe how your test query is affected by workload from other sessions, so splitting local and shared statistics would be nice to have. Just my thought, though. Regards, -- Fujii Masao -- Michael Paquier
Re: [HACKERS] New statistics for WAL buffer dirty writes
Hi, 2012/08/12 7:11, Jeff Janes wrote: On Sat, Jul 28, 2012 at 3:33 PM, Jeff Janes jeff.ja...@gmail.com wrote: On Sat, Jul 7, 2012 at 9:17 PM, Satoshi Nagayasu sn...@uptime.jp wrote: Hi, Jeff Janes has pointed out that my previous patch could hold the number of dirty writes only in a single local backend, not for the whole cluster, because the counter was allocated in local process memory. That's true, and I have fixed it by moving the counter into shared memory, as a member of XLogCtlWrite, to keep the total dirty writes in the cluster. ... The comment XLogCtrlWrite must be protected with WALWriteLock mis-spells XLogCtlWrite. The final patch will need to add a section to the documentation. Thanks to Robert and Tom for addressing my concerns about the pointer volatility. I think there is enough consensus that this is useful without adding more things to it, like histograms or high water marks. However, I do think we will want to add a way to query for the time of the last reset, as other monitoring features are going that way. Is it OK that the count is reset upon a server restart? pg_stat_bgwriter, for example, does not do that. Unfortunately I think fixing this in an acceptable way will be harder than the entire rest of the patch was. The coding looks OK to me, it applies and builds, and passes make check, and does what it says. I didn't do performance testing, as it is hard to believe it would have a meaningful effect. I'll mark it as waiting on author, for the documentation and reset time. I'd ask a more senior hacker to comment on the durability over restarts. I have rewritten the patch to deal with dirty write statistics through the pgstat collector, as bgwriter does. Yeah, it's a rather bigger rewrite. With this patch, the walwriter process and each backend process sum up dirty writes and send them to the stat collector, so the value can be saved in the stat file and kept across restarts.
The statistics can be retrieved using the pg_stat_get_xlog_dirty_writes() function, and can be reset by calling pg_stat_reset_shared('walwriter'). Now, I have one concern. The reset time could be captured in globalStats.stat_reset_timestamp, but this value is the same as the bgwriter one. So, once pg_stat_reset_shared('walwriter') is called, the stats_reset column in pg_stat_bgwriter would represent the reset time for walwriter, not for bgwriter. How should we handle this? Should we split this value? And should we have a new system view for walwriter? Of course, I will work on documentation next. Regards, Cheers, Jeff -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index ff56c26..234d568 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -1518,6 +1518,7 @@ AdvanceXLInsertBuffer(bool new_segment) WriteRqst.Write = OldPageRqstPtr; WriteRqst.Flush = 0; XLogWrite(WriteRqst, false, false); + WalWriterStats.m_xlog_dirty_writes++; LWLockRelease(WALWriteLock); TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE(); } diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 8389d5c..f031be1 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -125,6 +125,15 @@ char *pgstat_stat_tmpname = NULL; */ PgStat_MsgBgWriter BgWriterStats; +/* + * WalWriter global statistics counter. + * Despite its name, this counter is actually used not only in walwriter, + * but also in each backend process to sum up xlog dirty writes. + * Those processes would increment this counter in each XLogWrite call, + * then send it to the stat collector process.
+ */
+PgStat_MsgWalWriter WalWriterStats;
+
 /* ----------
  * Local data
  * ----------
@@ -279,6 +288,7 @@
 static void pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len);
 static void pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len);
 static void pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len);
 static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
+static void pgstat_recv_walwriter(PgStat_MsgWalWriter *msg, int len);
 static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
 static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
 static void pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len);
@@ -1188,6 +1198,8 @@ pgstat_reset_shared_counters(const char *target)
 	if (strcmp(target, "bgwriter") == 0)
 		msg.m_resettarget = RESET_BGWRITER;
+	else if (strcmp(target, "walwriter") == 0)
+		msg.m_resettarget = RESET_WALWRITER;
 	else
 		ereport(ERROR
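For readers who haven't traced the pgstat flow this rewrite rides on, the accumulate-locally-then-send pattern can be sketched in a few lines of standalone C. The struct and function names below are illustrative stand-ins, not the actual server symbols:

```c
#include <string.h>

/* Illustrative stand-ins for the patch's PgStat_MsgWalWriter message and
 * the collector's saved totals; not actual PostgreSQL symbols. */
typedef struct { unsigned long m_xlog_dirty_writes; } WalWriterMsg;
typedef struct { unsigned long xlog_dirty_writes; } CollectorStats;

WalWriterMsg local_msg;        /* per-process counter: cheap, lock-free bump */
CollectorStats collector;      /* what the stats collector accumulates */

/* Each XLogWrite() call forced by a full WAL buffer bumps the local counter. */
void note_dirty_write(void)
{
    local_msg.m_xlog_dirty_writes++;
}

/* Periodically the process sends its counter to the collector and zeroes it;
 * the collector therefore holds the cluster-wide total, which it can also
 * persist in the stats file across restarts. */
void flush_to_collector(void)
{
    collector.xlog_dirty_writes += local_msg.m_xlog_dirty_writes;
    memset(&local_msg, 0, sizeof(local_msg));
}
```

The real message path goes through the collector's UDP socket rather than a direct struct update, but the shape — local accumulation, periodic flush, durable totals on the collector side — is the same.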
Re: [HACKERS] [PoC] load balancing in libpq
2012/09/24 1:07, Christopher Browne wrote: We historically have connection pooling as an external thing; with the high degree to which people keep implementing and reimplementing this, I think *something* more than we have ought to be built in. This, with perhaps better implementation, might be an apropos start. Parallel with LDAP: it takes very much this approach, where configuration typically offers a list of LDAP servers. I am not certain if OpenLDAP does round robin on the list, or if it tries targets in order, stopping when it succeeds. A decent debate fits in, there. I could see this being implemented instead via something alongside PGSERVICE; that already offers a well-defined way to capture a registry of connection configuration. Specifying a list of service names would allow the command line configuration to remain short and yet very flexible. Thanks for the comment. As you pointed out, I think it would be a starting point to implement new, simple load-balancing stuff. That's what I actually intended. My clients often ask me for an easier way to take advantage of replication and load-balancing. I know there are several considerations to be discussed, such as the API compatibility issue, but it would be worth having in the core (or around the core). And I also know many people are struggling with load-balancing and master-failover things for PostgreSQL replication. If those people are implementing their own load-balancing stuff in their apps again and again, it's time to consider implementing it in the core to deliver, and leverage, the potential of PostgreSQL replication. Regards, On 2012-09-23 10:01 AM, Euler Taveira eu...@timbira.com wrote: On 23-09-2012 07:50, Satoshi Nagayasu wrote: I have just written the first PoC code to enable load balancing in the libpq library. Your POC is totally broken.
Just to point out two problems: (i) semicolon (;) is a valid character for any option in the connection string and (ii) you didn't think about PQsetdb[Login](), PQconnectdbParams() and PQconnectStartParams(). If you want to pursue this idea, you should think of a way to support the same option multiple times (one idea is host1, host2, etc). Isn't it easier to add support in your application or pooling software? -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
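Euler's first point is worth spelling out: a semicolon may legally appear inside a quoted option value (libpq allows single-quoted values with backslash escapes), so a naive strchr()-based split corrupts such strings. At minimum the splitter would need a quote-aware scan; a standalone sketch (illustrative code, not libpq's actual parser):

```c
#include <string.h>

/* Return the offset of the first ';' that lies outside single quotes,
 * or -1 if there is none.  A naive strchr(s, ';') would wrongly split
 *   host=db1 password='ab;cd'
 * because ';' is valid inside a quoted conninfo value. */
int find_separator(const char *s)
{
    int in_quote = 0;
    for (int i = 0; s[i] != '\0'; i++)
    {
        if (s[i] == '\\' && s[i + 1] != '\0')
        {
            i++;                        /* skip a backslash-escaped char */
            continue;
        }
        if (s[i] == '\'')
            in_quote = !in_quote;       /* toggle quoting state */
        else if (s[i] == ';' && !in_quote)
            return i;
    }
    return -1;
}
```

Even with this, Euler's second point stands: the split happens before option parsing, so the PQconnectdbParams()-family entry points never see the list at all.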
Re: [HACKERS] pg_reorg in core?
2012/09/25 0:15, Simon Riggs wrote: On 21 September 2012 08:42, Michael Paquier michael.paqu...@gmail.com wrote: On Fri, Sep 21, 2012 at 1:00 PM, Hitoshi Harada umi.tan...@gmail.com wrote: I'm not familiar with pg_reorg, but I wonder why we need a separate program for this task. I know pg_reorg is ok as an external program per se, but if we could optimize CLUSTER (or VACUUM, which I'm a little pessimistic about) in the same way, it's much nicer than having an additional binary + extension. Isn't it possible to do the same thing above within the CLUSTER command? Maybe CLUSTER .. CONCURRENTLY? CLUSTER might be more adapted in this case as the purpose is to reorder the table. The same technique used by pg_reorg (aka a table coupled with triggers) could lower the lock level on the table. Also, it could be possible to control each sub-operation in the same fashion as CREATE INDEX CONCURRENTLY. By the way, whatever the operation used, VACUUM or CLUSTER, I have a couple of doubts: 1) isn't it too costly for a core operation, as pg_reorg really needs many temporary objects? It could be possible to reduce the number of objects created if added to core, though... 2) Do you think the current CLUSTER is enough, and are there wishes to implement such an optimization directly in core? For me, the Postgres user interface should include * REINDEX CONCURRENTLY * CLUSTER CONCURRENTLY * ALTER TABLE CONCURRENTLY and also that autovacuum would be expanded to include REINDEX and CLUSTER, renaming it to automaint. The actual implementation mechanism for those probably looks something like pg_reorg, but I don't see it as preferable to include the utility directly into core, though potentially some of the underlying code might be. I think it depends on what trade-off we can see. AFAIK, basically, rebuilding tables and/or indexes has a trade-off between lock-freedom and disk space.
So, if we have enough disk space to build a temporary table/index while rebuilding a table/index, a CONCURRENTLY option would be great, and I would love to have it in core. Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PoC] load balancing in libpq
Hi all, I have just written the first PoC code to enable load balancing in the libpq library. This libpq enhancement is intended to allow PostgreSQL users to take advantage of replication in an easier way. With this patch, the PQconnectdb() function accepts multiple connection info strings, and picks one of them on a round-robin basis to connect. The example below shows that PQconnectdb() accepts a connection string that contains four different databases (db1~db4) on different servers (dbhost1~dbhost4), and then connects to them on a round-robin basis. I know there are several things to be considered, but first, I need your feedback or comments on this enhancement. Do you think it would be useful? Regards,

[snaga@devvm03 libpq_repli]$ cat libpq_lb_test.c
#include <stdio.h>
#include <libpq-fe.h>

void _connect()
{
  PGconn *conn;
  PGresult *res;

  conn = PQconnectdb("host=dbhost1 dbname=db1 user=snaga; host=dbhost2 dbname=db2 user=snaga; host=dbhost3 dbname=db3 user=snaga; host=dbhost4 dbname=db4 user=snaga");

  res = PQexec(conn, "SELECT current_database()");
  if ( PQresultStatus(res)==PGRES_TUPLES_OK )
  {
    printf("current_database = %s on %s\n", PQgetvalue(res, 0, 0), PQhost(conn));
  }

  PQfinish(conn);
}

int main(void)
{
  int i;

  for (i=0 ; i<8 ; i++)
    _connect();

  return 0;
}

[snaga@devvm03 libpq_repli]$ ./libpq_lb_test
current_database = db1 on dbhost1
current_database = db2 on dbhost2
current_database = db3 on dbhost3
current_database = db4 on dbhost4
current_database = db1 on dbhost1
current_database = db2 on dbhost2
current_database = db3 on dbhost3
current_database = db4 on dbhost4
[snaga@devvm03 libpq_repli]$

-- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC.
http://www.uptime.jp

diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index 9eaf410..14e31b6 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -569,6 +569,81 @@ PQconnectStartParams(const char *const * keywords,
 	return conn;
 }
 
+struct ConninfoRepli {
+	int n_conninfo;
+	char *conninfo_array[128];
+};
+
+static bool
+parse_conninfo_repli(char *conninfo, size_t len, const char *conninfo_repli)
+{
+	char buf[1024];
+	char *conninfo_ptr;
+	char *end_ptr;
+	struct ConninfoRepli conninfo_r;
+	static int conninfo_idx = 0;
+	int i;
+
+	memset(&conninfo_r, 0, sizeof(struct ConninfoRepli));
+
+	conninfo_r.n_conninfo = 0;
+
+#ifdef REPLI_DEBUG
+	printf("DEBUG: %s\n", conninfo_repli);
+#endif
+
+	conninfo_ptr = (char *)conninfo_repli;
+
+	while (1)
+	{
+		int len;
+
+		end_ptr = strchr(conninfo_ptr, ';');
+		if ( end_ptr )
+			len = end_ptr - conninfo_ptr;
+		else
+			len = strlen(conninfo_ptr);
+
+		memset(buf, 0, sizeof(buf));
+		strncpy(buf, conninfo_ptr, len);
+
+		conninfo_r.conninfo_array[conninfo_r.n_conninfo] = strdup(buf);
+		conninfo_r.n_conninfo++;
+
+		if ( !end_ptr )
+			break;
+
+		conninfo_ptr = end_ptr + 1;
+	}
+
+#ifdef REPLI_DEBUG
+	printf("DEBUG: n_conninfo = %d\n", conninfo_r.n_conninfo);
+
+	for (i=0 ; i<conninfo_r.n_conninfo ; i++)
+	{
+		printf("DEBUG: %s\n", conninfo_r.conninfo_array[i]);
+	}
+#endif
+
+	strncpy(conninfo, conninfo_r.conninfo_array[conninfo_idx], len);
+	conninfo_idx++;
+
+#ifdef REPLI_DEBUG
+	printf("DEBUG: conninfo = %s\n", conninfo);
+#endif
+
+	if ( conninfo_idx >= conninfo_r.n_conninfo )
+		conninfo_idx = 0;
+
+	for (i=0 ; i<conninfo_r.n_conninfo ; i++)
+	{
+		free(conninfo_r.conninfo_array[i]);
+	}
+
+	return true;
+}
+
 /*
  * PQconnectStart
  *
@@ -592,6 +667,7 @@ PGconn *
 PQconnectStart(const char *conninfo)
 {
 	PGconn	   *conn;
+	char conninfo2[1024];
 
 	/*
 	 * Allocate memory for the conn structure
@@ -601,9 +677,15 @@ PQconnectStart(const char *conninfo)
 		return NULL;
 
 	/*
+	 * Parse a user-specified conninfo string that contains
+	 * multiple conninfo strings, and pick one on a round-robin basis.
+	 */
+	parse_conninfo_repli(conninfo2, sizeof(conninfo2), conninfo);
+
+	/*
 	 * Parse the conninfo string
 	 */
-	if (!connectOptions1(conn, conninfo))
+	if (!connectOptions1(conn, conninfo2))
 		return conn;
 
 	/*
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
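Stripped of the string handling, the heart of the PoC is just a rotating index over the split-out conninfo strings. A standalone sketch of that selection logic (illustrative names; in the patch the index is a function-local static, so the rotation is per process):

```c
#include <string.h>

/* Round-robin pick: given n conninfo strings already split out, return the
 * next one and advance the rotating index, wrapping after the last entry.
 * Illustrative sketch of the PoC's selection step, not the patch itself. */
const char *pick_round_robin(const char **conninfos, int n, int *idx)
{
    const char *chosen = conninfos[*idx];

    *idx = (*idx + 1) % n;      /* wrap back to the first entry */
    return chosen;
}
```

Keeping the index per process (as the static in parse_conninfo_repli does) means each client process rotates independently; across many short-lived processes the distribution still evens out, which matches the behavior shown in the test output above.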
Re: [HACKERS] pg_reorg in core?
2012/09/23 12:37, Greg Sabino Mullane wrote: -BEGIN PGP SIGNED MESSAGE- Hash: RIPEMD160 I think it's time to consider some *umbrella project* for maintaining several small projects outside the core. Well, that was pgfoundry, and it didn't work out. I'm not sure that is quite analogous to what was being proposed. I read it as more of let's package a bunch of these small utilities together into a single project, such that installing one installs them all (e.g. aptitude install pg_tools), and they all have a single bug tracker, etc. That tracker could be github, of course. Exactly --- I do not care which SCM system, though. :) I'm not convinced of the merit of that plan, but that's an alternative interpretation that doesn't involve our beloved pgfoundry. :) For example, xlogdump had not been maintained for 5 years when I picked it up last year. And the latest pg_filedump that supports 9.2 has not been released yet. pg_reorg as well. If those tools were in a single project, it would be easier to keep attention on them. Then, developers could easily build *all of them* at once, fix them, and post any patch on a single mailing list. Actually, it would save developers from wasting their time. From my viewpoint, it's not just an SCM or distribution issue. It's about how such small projects around the core can survive, even if they cannot come into the core. Regards, Oh, and -1 for putting it in core. Way too early, and not important enough. - -- Greg Sabino Mullane g...@turnstep.com PGP Key: 0x14964AC8 201209222334 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -BEGIN PGP SIGNATURE- iEYEAREDAAYFAlBeg/AACgkQvJuQZxSWSsjL5ACgimT71B4lSb1ELhgMw5EBzAKs xHIAn08vxGzmM6eSmDfZfxlJDTousq7h =KgXW -END PGP SIGNATURE- -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_reorg in core?
(2012/09/22 11:01), sakamoto wrote: (2012/09/22 10:02), Christopher Browne wrote: If the present project is having a tough time doing enhancements, I should think it mighty questionable to try to draw it into core; that presses it towards a group of already very busy developers. On the other hand, if the present development efforts can be made more public, by having them take place in a more public repository, that at least has the potential to let others in the community see and participate. There are no guarantees, but privacy is liable to hurt. I wouldn't expect any sudden huge influx of developers, but a steady visible stream of development effort would be mighty useful to a merge into core argument. A *lot* of projects are a lot like this. On the Slony project, we have tried hard to maintain this sort of visibility. Steve Singer, Jan Wieck and I do our individual efforts on git repos visible at GitHub to ensure ongoing efforts aren't invisible inside a corporate repo. It hasn't led to any mass of extra developers, but I am always grateful to see Peter Eisentraut's bug reports. Agreed. What the reorg project needs first is transparency, including issue tracking, bugs, a list of todo items, clarified release schedules, quality assurance, and so forth. Only after all that is done can the discussion about putting it into core be started. Until now, reorg has been developed and maintained behind a corporate repository. But now that its activity has slowed down, what I should do as a maintainer is make the development process more public and find someone to cooperate with :) I think it's time to consider some *umbrella project* for maintaining several small projects outside the core. As you pointed out, the problem here is that it's difficult to keep enough eyeballs and development resources on tiny projects outside the core. For example, NTT OSSC has created lots of tools, but they're facing some difficulties in keeping them maintained because of their development resources.
There are different code repositories, different web sites, different issue tracking systems, and different dev mailing lists for different small projects. My xlogdump as well. Actually, that's the reason why it's difficult to keep enough eyeballs on small third-party projects. And also the reason why some developers want to push their tools into the core, isn't it? :) To solve this problem, I would like to have some umbrella project. It would be called pg dba utils, or something like this. This umbrella project may contain several third-party tools (pg_reorg, pg_rman, pg_filedump, xlogdump, etc, etc...) as its sub-modules. And it may also have a single web site, code repository, issue tracking system and developer mailing list, in order to share its development resources for testing, maintaining and releasing. I think it would help third-party projects keep enough eyeballs even outside the core. Of course, if a third-party project has a faster development pace and enough eyeballs to maintain it, it's ok for it to be an independent project. However, when a tool has already matured with fewer eyeballs, it needs to be merged into this umbrella project. Any comments? Sakamoto -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Adding probes for smgr
(2012/07/29 12:14), Tom Lane wrote: Peter Geoghegan pe...@2ndquadrant.com writes: On 28 July 2012 17:15, Tom Lane t...@sss.pgh.pa.us wrote: IMV smgr is pretty vestigial. I wouldn't recommend loading more functionality onto that layer, because it's as likely as not that we'll just get rid of it someday. Agreed. I recently found myself reading a paper written by Stonebraker back in the Berkeley days: http://dislab2.hufs.ac.kr/dislab/seminar/2007/ERL-M87-06.pdf This paper appears to have been published in about 1988, and it shows. It's fairly obvious from reading the opening paragraph that the original rationale for the design of the storage manager doesn't hold these days. Of course, it's also obvious from reading the code, since for example there is only one storage manager module. Yeah. There were actually two storage managers in what we inherited from Berkeley, but we soon got rid of the other one as being useless. (IIRC it was meant for magnetic-core memory ... anybody seen any of that lately?) I remember that I had found mm.c when I started looking around the storage manager code, maybe a decade ago. It seemed very interesting, but I've never seen magnetic-core memory itself. :) Anyway, I'm realizing that it's reasonable to add new probes into the md module at this time. Thanks for the comments. I think basically what happened since then is that the functionality Stonebraker et al imagined as being in per-storage-manager code all migrated into the kernel device drivers, or even down into the hardware itself. (SSDs are *way* smarter than the average '80s storage device, and even those were an order of magnitude smarter than what they'd been ten years previously. I used to do device drivers back in the 80's...) There's no longer any good reason to have anything but md.c, which isn't so much a magnetic disk interface as an interface to something that has a Unix block device driver. 
BTW, I'm still interested in keeping the storage manager architecture theoretically pluggable. From my point of view, the current pluggable architecture was designed to serve different storage managers by bypassing the POSIX file system API. I agree that most recent storage devices can be managed under ordinary file systems, ext3/ext4 or xfs for example. However, I'm still curious to see a new project which intends to extend the storage manager to gain more performance on some special storage devices by bypassing the ordinary file system API. For example, Fusion-IO is offering a dedicated API for their PCI-Express flash storage, and I guess database performance could tremendously benefit from it. (I accept that it's just my curiosity, though. :) It's just a possible option, but keeping the storage architecture pluggable does make sense to me, even if the community ships only one storage manager. Regards, This state of affairs sort of reminds me of mcxt.c . The struct MemoryContextData is described as an abstract type that can have multiple implementations, despite the fact that since 2000 (and perhaps earlier), the underlying type is invariably AllocSetContext. I never investigated if that indirection still needs to exist, but I suspect that it too is a candidate for refactoring. Do you agree? Meh. Having invented the MemoryContext interface, I am probably not the best-qualified person to evaluate it objectively. The original thought was that we might have (a) a context type that could allocate storage in shared memory, and/or (b) a context type that could provide better allocation speed at a loss of storage efficiency (eg, lose the ability to pfree individual chunks). Case (a) has never become practical given the inability of SysV-style shared memory to expand at all. I don't know if that might change when/if we switch to some other shmem API.
The idea of a different allocation strategy for some usages still seems like something we'll want to do someday, though. regards, tom lane -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
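The pluggable shape being discussed here — for both smgr and MemoryContext — is essentially a dispatch table of function pointers, one row per implementation. A minimal standalone sketch of that pattern (illustrative names, not the real f_smgr struct in smgr.c):

```c
/* One row of callbacks per storage manager; a Fusion-IO style manager
 * could register a second row with its own read/write implementations.
 * Illustrative sketch, not PostgreSQL's actual smgr dispatch table. */
typedef struct
{
    int (*smgr_read)(int blockno, char *buffer);
    int (*smgr_write)(int blockno, const char *buffer);
} StorageManagerOps;

/* A trivial implementation standing in for md.c. */
static int md_read(int blockno, char *buffer)
{
    (void) blockno;
    buffer[0] = 'r';            /* pretend we filled the buffer */
    return 0;
}

static int md_write(int blockno, const char *buffer)
{
    (void) blockno;
    (void) buffer;
    return 0;
}

static const StorageManagerOps smgr_table[] = {
    { md_read, md_write },      /* slot 0: the "md" manager */
};

/* Callers go through the table, never directly to md_read(); swapping
 * managers is then just selecting a different row. */
int smgr_dispatch_read(int which, int blockno, char *buffer)
{
    return smgr_table[which].smgr_read(blockno, buffer);
}
```

The cost of keeping this indirection is one pointer dereference per call, which is why the question in the thread is about whether the flexibility is still worth carrying, not about performance.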
[HACKERS] Adding probes for smgr
Hi, I'm thinking of adding new probes to trace smgr activities. While looking into the implementation, I found that md.c has its own probes within it, but I'm wondering why we do not have those probes in the generic smgr routines themselves. Which would be a better choice? Any ideas or comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] XLogReader v2
2012/07/24 1:15, Robert Haas wrote: On Mon, Jul 23, 2012 at 12:13 PM, Andres Freund and...@2ndquadrant.com wrote: Could that be fixed by moving the debugging routines into a separate set of files, instead of having them lumped in with the code that applies those xlog records? It's a major effort. Those functions use elog(), stringinfo and lots of other stuff... I am hesitant to start working on that. On the other hand - I think an in-core xlogdump would be a great and sensible thing; but I can live with using my hacked up version that simply links to the backend... The stringinfo thing has long been an annoyance to me. libpq has PQExpBuffer which is the exact same thing. I don't like that we have two implementations of that in two different code bases, and you have to remember to spell it right depending on where you are. I'm not sure exactly what the best way to fix that is, but it sure is a pain in the neck. Does it make sense to build some static library which can be referenced from both the backend and several client utilities, including libpq? Or would a dynamic link be preferred? Although I do not have a clear idea right now, is it time to start thinking about it? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
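For concreteness, the facility duplicated between StringInfo and PQExpBuffer is just a growable, null-terminated string buffer; a shared library would only need one implementation of something like this (minimal sketch with illustrative names, error handling omitted):

```c
#include <stdlib.h>
#include <string.h>

/* A minimal growable string buffer, standing in for what StringInfo and
 * PQExpBuffer each implement separately.  Illustrative names; malloc
 * failure handling omitted for brevity. */
typedef struct
{
    char   *data;   /* always null-terminated */
    size_t  len;    /* bytes used, excluding the terminator */
    size_t  cap;    /* bytes allocated */
} StrBuf;

void strbuf_init(StrBuf *b)
{
    b->cap = 16;
    b->len = 0;
    b->data = malloc(b->cap);
    b->data[0] = '\0';
}

void strbuf_append(StrBuf *b, const char *s)
{
    size_t n = strlen(s);

    /* Grow geometrically, as both existing implementations do, so that
     * repeated appends stay amortized O(1) per byte. */
    while (b->len + n + 1 > b->cap)
    {
        b->cap *= 2;
        b->data = realloc(b->data, b->cap);
    }
    memcpy(b->data + b->len, s, n + 1);   /* copy including terminator */
    b->len += n;
}
```

The hard part the thread is circling is not this code but the packaging: backend code cannot casually depend on libpq, and frontend code cannot call elog(), so the shared piece would have to live below both.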
Re: [HACKERS] [PATCH] XLogReader v2
2012/07/19 19:29, Andres Freund wrote: Hi, Attached is v2 of the patch. Changes are: * more comments * significantly cleaned/simplified code * crc validation * addition of XLogReaderReadOne Definitely needed are: * better validation of records * customizable error handling The first is just work that needs to be done, nothing complicated. The second is a bit more complicated: - We could have a bool had_error and a static char that contains the error message, which the caller can handle as wanted - We could have a callback for error handling I think I prefer the callback solution. The second attached patch is a very, very preliminary xlog dumping utility which currently is more of a debugging facility (as evidenced by the fact that it needs an existing /tmp/xlog directory for writing out data) for the XLogReader. It reuses the builtin xlog dumping logic and thus has to link with backend code. I couldn't find a really sensible way to do this: xlogdump: $(OBJS) $(shell find ../../backend ../../timezone -name objfiles.txt|xargs cat|tr -s \012|grep -v /main.o|sed 's/^/..\/..\/.. $(CC) $(CFLAGS) $^ $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) Perhaps somebody has a better idea? I think having an xlogdump utility in core/contrib would be a good idea now that it can be done without a huge amount of code duplication. I plan to check Satoshi-san's version of xlogdump to see whether I can crib some of the commandline interface and some code from there. I agree that we need a more sophisticated way to share the code between the backend and several utilities (including xlogdump), but AFAIK, a contrib module must be able to be built *without* the core source tree. Any contrib module must be able to be built with only the header files and the shared libraries when using PGXS. So, it cannot assume that it has the core source tree. (If we need to assume that, I think xlogdump needs to be put into the core/bin directory.)
On the other hand, I have an open issue to improve the maintainability of the duplicated code in the xlogdump project: "Gather all the code which has been copied from the core." https://github.com/snaga/xlogdump/issues/26 So, I agree that we need another way to share the code between the backend and the related utilities. Any good ideas? I have one more concern about putting xlogdump into the core. xlogdump is intended to deliver any new features and enhancements to all the users, not only those on the latest major version, but also those on older major versions maintained by the community, because xlogdump must be a quite important tool when a DBA needs it. In fact, the latest xlogdump now supports 5 major versions, from 8.3 to 9.2. https://github.com/snaga/xlogdump/blob/master/README.xlogdump But AFAIK, putting xlogdump into core/contrib would mean that the source tree of each major version could not receive large modifications after each release (or each code freeze, actually). It would mean that users on older major versions could not take advantage of the new features and enhancements of the latest xlogdump, which is not what I want, actually. Regards, Greetings, Andres -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] New statistics for WAL buffer dirty writes
Hi all, I've created a new patch to get/reset statistics of WAL buffer writes (flushes) caused by WAL buffer full. This patch provides two new functions, pg_stat_get_xlog_dirty_write() and pg_stat_reset_xlog_dirty_write(), which have been designed to determine an appropriate value for the WAL buffer size. If this counter is increasing in the production environment, it would mean that the WAL buffer size is too small to hold the xlog records generated by the transactions. So, you can increase your WAL buffer size to keep xlog records and to reduce WAL writes. I think this patch would not affect WAL write performance, but I'm still paying attention to it. Any comments or suggestions? Regards,

---
[snaga@devvm03 src]$ psql -p 15432 postgres
psql (9.3devel)
Type "help" for help.

postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                            0
(1 row)

postgres=# \q
[snaga@devvm03 src]$ pgbench -p 15432 -s 10 -c 32 -t 1000 postgres
Scale option ignored, using pgbench_branches table count = 10
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 32
number of threads: 1
number of transactions per client: 1000
number of transactions actually processed: 32000/32000
tps = 141.937738 (including connections establishing)
tps = 142.123457 (excluding connections establishing)
[snaga@devvm03 src]$ psql -p 15432 postgres
psql (9.3devel)
Type "help" for help.

postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                            0
(1 row)

postgres=# begin;
BEGIN
postgres=# DELETE FROM pgbench_accounts;
DELETE 100
postgres=# commit;
COMMIT
postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                        19229
(1 row)

postgres=# SELECT pg_stat_reset_xlog_dirty_write();
 pg_stat_reset_xlog_dirty_write
--------------------------------

(1 row)

postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                            0
(1 row)

postgres=# \q
[snaga@devvm03 src]$
---

-- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC.
http://www.uptime.jp

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 642c129..df1e6d4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -280,6 +280,11 @@ static XLogRecPtr RedoRecPtr;
  */
 static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
 
+/*
+ * Counter for WAL dirty buffer writes.
+ */
+static uint64 WalBufferWriteDirtyCount = 0;
+
 /*--
  * Shared-memory data structures for XLOG control
  *
@@ -1513,6 +1518,7 @@ AdvanceXLInsertBuffer(bool new_segment)
 			WriteRqst.Flush = 0;
 			XLogWrite(WriteRqst, false, false);
 			LWLockRelease(WALWriteLock);
+			WalBufferWriteDirtyCount++;
 			TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 		}
 	}
@@ -10492,3 +10498,15 @@ SetWalWriterSleeping(bool sleeping)
 	xlogctl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&xlogctl->info_lck);
 }
+
+uint64
+xlog_dirty_write_counter_get()
+{
+	return WalBufferWriteDirtyCount;
+}
+
+void
+xlog_dirty_write_counter_reset()
+{
+	WalBufferWriteDirtyCount = 0;
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 7c0705a..d544a5b 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -117,6 +117,9 @@ extern Datum pg_stat_reset_shared(PG_FUNCTION_ARGS);
 extern Datum pg_stat_reset_single_table_counters(PG_FUNCTION_ARGS);
 extern Datum pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS);
 
+extern Datum pg_stat_get_xlog_dirty_write(PG_FUNCTION_ARGS);
+extern Datum pg_stat_reset_xlog_dirty_write(PG_FUNCTION_ARGS);
+
 /* Global bgwriter statistics, from bgwriter.c */
 extern PgStat_MsgBgWriter bgwriterStats;
@@ -1700,3 +1703,16 @@ pg_stat_reset_single_function_counters(PG_FUNCTION_ARGS)
 	PG_RETURN_VOID();
 }
+
+Datum
+pg_stat_get_xlog_dirty_write(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_INT64(xlog_dirty_write_counter_get());
+}
+
+Datum
+pg_stat_reset_xlog_dirty_write(PG_FUNCTION_ARGS)
+{
+	xlog_dirty_write_counter_reset();
+	PG_RETURN_VOID();
+}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ec79870..01343b9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -325,6 +325,9 @@
 extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast, char **
 extern XLogRecPtr do_pg_stop_backup(char *labelfile, bool waitforarchive
Re: [HACKERS] New statistics for WAL buffer dirty writes
2012/07/07 22:07, Euler Taveira wrote: On 07-07-2012 09:00, Satoshi Nagayasu wrote: I've created a new patch to get/reset statistics of WAL buffer writes (flushes) caused by WAL buffer full. This new statistic doesn't solve your problem (tune wal_buffers). It doesn't give you the wal_buffers value. It only says hey, I needed more buffers so I wrote those dirty ones. It doesn't say how many. I would like to have something that says hey, you have 1000 buffers available and you are using 100 buffers (10%). This new statistic is only useful for decreasing the WALWriteLock contention. I agree that it would not tell the exact number for wal_buffers, but it would help DBAs understand what's actually happening around the WAL buffers. Also, decreasing the WALWriteLock contention is obviously important for DBAs in terms of improving database performance. Actually, that's the reason why I'm working on another statistic. :) http://archives.postgresql.org/pgsql-hackers/2012-06/msg01489.php Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] New statistics for WAL buffer dirty writes
Hi, Jeff Janes has pointed out that my previous patch could hold a number of the dirty writes only in single local backend, and it could not hold all over the cluster, because the counter was allocated in the local process memory. That's true, and I have fixed it with moving the counter into the shared memory, as a member of XLogCtlWrite, to keep total dirty writes in the cluster. Regards, 2012/07/07 21:00, Satoshi Nagayasu wrote: Hi all, I've created new patch to get/reset statistics of WAL buffer writes (flushes) caused by WAL buffer full. This patch provides two new functions, pg_stat_get_xlog_dirty_write() and pg_stat_reset_xlog_dirty_write(), which have been designed to determine an appropriate value for WAL buffer size. If this counter is increasing in the production environment, it would mean that the WAL buffer size is too small to hold xlog records generated the transactions. So, you can increase your WAL buffer size to keep xlog records and to reduce WAL writes. I think this patch would not affect to WAL write performance, but still paying attention to it. Any comments or suggestions? Regards, --- [snaga@devvm03 src]$ psql -p 15432 postgres psql (9.3devel) Type help for help. postgres=# SELECT pg_stat_get_xlog_dirty_write(); pg_stat_get_xlog_dirty_write -- 0 (1 row) postgres=# \q [snaga@devvm03 src]$ pgbench -p 15432 -s 10 -c 32 -t 1000 postgres Scale option ignored, using pgbench_branches table count = 10 starting vacuum...end. transaction type: TPC-B (sort of) scaling factor: 10 query mode: simple number of clients: 32 number of threads: 1 number of transactions per client: 1000 number of transactions actually processed: 32000/32000 tps = 141.937738 (including connections establishing) tps = 142.123457 (excluding connections establishing) [snaga@devvm03 src]$ psql -p 15432 postgres psql (9.3devel) Type help for help. 
postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                            0
(1 row)

postgres=# begin;
BEGIN
postgres=# DELETE FROM pgbench_accounts;
DELETE 100
postgres=# commit;
COMMIT
postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                        19229
(1 row)

postgres=# SELECT pg_stat_reset_xlog_dirty_write();
 pg_stat_reset_xlog_dirty_write
--------------------------------

(1 row)

postgres=# SELECT pg_stat_get_xlog_dirty_write();
 pg_stat_get_xlog_dirty_write
------------------------------
                            0
(1 row)

postgres=# \q
[snaga@devvm03 src]$
---

-- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 642c129..893acf8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -370,6 +370,11 @@ typedef struct XLogCtlWrite
 {
 	int			curridx;		/* cache index of next block to write */
 	pg_time_t	lastSegSwitchTime;	/* time of last xlog segment switch */
+
+	/*
+	 * Counter for WAL dirty buffer writes.
+	 */
+	uint64		WalBufferWriteDirtyCount;
 } XLogCtlWrite;

 /*
@@ -1504,6 +1509,8 @@ AdvanceXLInsertBuffer(bool new_segment)
 	}
 	else
 	{
+		XLogCtlWrite *Write = &XLogCtl->Write;
+
 		/*
 		 * Have to write buffers while holding insert lock. This is
 		 * not good, so only write as much as we absolutely must.
@@ -1512,6 +1519,10 @@ AdvanceXLInsertBuffer(bool new_segment)
 		WriteRqst.Write = OldPageRqstPtr;
 		WriteRqst.Flush = 0;
 		XLogWrite(WriteRqst, false, false);
+		/*
+		 * XLogCtlWrite must be protected with WALWriteLock.
+		 */
+		Write->WalBufferWriteDirtyCount++;
 		LWLockRelease(WALWriteLock);
 		TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 	}
@@ -10492,3 +10503,26 @@ SetWalWriterSleeping(bool sleeping)
 	xlogctl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&xlogctl->info_lck);
 }
+
+uint64
+xlog_dirty_write_counter_get()
+{
+	XLogCtlWrite *Write = &XLogCtl->Write;
+	uint64		count;
+
+	LWLockAcquire(WALWriteLock, LW_SHARED);
+	count = Write->WalBufferWriteDirtyCount;
+	LWLockRelease
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics
Hi all, I've modified the pg_stat_lwlocks patch to be able to work with the latest PostgreSQL Git code. This patch provides:

pg_stat_lwlocks: a new system view to show lwlock statistics.
pg_stat_get_lwlocks(): a new function to retrieve lwlock statistics.
pg_stat_reset_lwlocks(): a new function to reset lwlock statistics.

Please try it out. Regards,

2012/06/26 5:29, Satoshi Nagayasu wrote: Hi all, I've been working on a new system view, pg_stat_lwlocks, to observe LWLocks, and have just completed my 'proof-of-concept' code that can work with version 9.1. Now, I'd like to know the possibility of this feature for a future release. With this patch, a DBA can easily identify bottlenecks around lwlocks.

postgres=# SELECT * FROM pg_stat_lwlocks ORDER BY time_ms DESC LIMIT 10;
 lwlockid | calls  | waits | time_ms
----------+--------+-------+---------
       49 | 193326 | 32096 |   23688
        8 |   3305 |   133 |    1335
        2 |     21 |     0 |       0
        4 | 135188 |     0 |       0
        5 |  57935 |     0 |       0
        6 |    141 |     0 |       0
        7 |  24580 |     1 |       0
        3 |   3282 |     0 |       0
        1 |     41 |     0 |       0
        9 |      3 |     0 |       0
(10 rows)

postgres=#

In this view, the 'lwlockid' column represents the LWLockId used in the backends. 'calls' represents how many times LWLockAcquire() was called. 'waits' represents how many times LWLockAcquire() needed to wait within it before lock acquisition. 'time_ms' represents how long in total LWLockAcquire() waited on a lwlock. And lwlocks that use an LWLockId range, such as BufMappingLock or LockMgrLock, are grouped and summed up in a single record. For example, lwlockid 49 in the above view represents LockMgrLock statistics. Now, I know there are some considerations. (1) Performance I've measured LWLock performance both with and without the patch, and confirmed that this patch does not affect LWLock performance at all.
pgbench scores with the patch:
tps = 900.906658 (excluding connections establishing)
tps = 908.528422 (excluding connections establishing)
tps = 903.900977 (excluding connections establishing)
tps = 910.470595 (excluding connections establishing)
tps = 909.685396 (excluding connections establishing)

pgbench scores without the patch:
tps = 909.096785 (excluding connections establishing)
tps = 894.868712 (excluding connections establishing)
tps = 910.074669 (excluding connections establishing)
tps = 904.022770 (excluding connections establishing)
tps = 895.673830 (excluding connections establishing)

Of course, this experiment was not I/O bound, and the cache hit ratio was 99.9%. (2) Memory space In this patch, I added three new members to the LWLock structure as uint64 to collect statistics. It means that those members must be held in shared memory, but I'm not sure whether that's appropriate. I think another possible option is holding those statistics values in local (backend) process memory, and sending them through the stats collector process (like other statistics values). (3) LWLock names (or labels) Currently, the pg_stat_lwlocks view shows the LWLockId itself, but an LWLockId is not easy for a DBA to map to an actual lock type. So, I want to show LWLock names (or labels), like 'WALWriteLock' or 'LockMgrLock', but how should I implement it? Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. 
http://www.uptime.jp

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 7cc1d41..f832b45 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ CREATE VIEW pg_stat_bgwriter AS
         pg_stat_get_buf_alloc() AS buffers_alloc,
         pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;

+CREATE VIEW pg_stat_lwlocks AS
+    SELECT
+        S.lwlockid,
+        S.calls,
+        S.waits,
+        S.time_ms
+    FROM pg_stat_get_lwlocks() AS S;
+
 CREATE VIEW pg_user_mappings AS
     SELECT
         U.oid AS umid,
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 95d4b37..2a2c197 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -32,6 +32,7 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
+#include <sys/time.h>

 /* We use the ShmemLock spinlock to protect LWLockAssign */
 extern slock_t *ShmemLock;
@@ -46,6 +47,11 @@ typedef struct LWLock
 	PGPROC	   *head;		/* head of list of waiting PGPROCs */
 	PGPROC	   *tail;		/* tail of list of waiting PGPROCs
[HACKERS] pg_stat_lwlocks view - lwlocks statistics
Hi all, I've been working on a new system view, pg_stat_lwlocks, to observe LWLocks, and have just completed my 'proof-of-concept' code that can work with version 9.1. Now, I'd like to know the possibility of this feature for a future release. With this patch, a DBA can easily identify bottlenecks around lwlocks.

postgres=# SELECT * FROM pg_stat_lwlocks ORDER BY time_ms DESC LIMIT 10;
 lwlockid | calls  | waits | time_ms
----------+--------+-------+---------
       49 | 193326 | 32096 |   23688
        8 |   3305 |   133 |    1335
        2 |     21 |     0 |       0
        4 | 135188 |     0 |       0
        5 |  57935 |     0 |       0
        6 |    141 |     0 |       0
        7 |  24580 |     1 |       0
        3 |   3282 |     0 |       0
        1 |     41 |     0 |       0
        9 |      3 |     0 |       0
(10 rows)

postgres=#

In this view, the 'lwlockid' column represents the LWLockId used in the backends. 'calls' represents how many times LWLockAcquire() was called. 'waits' represents how many times LWLockAcquire() needed to wait within it before lock acquisition. 'time_ms' represents how long in total LWLockAcquire() waited on a lwlock. And lwlocks that use an LWLockId range, such as BufMappingLock or LockMgrLock, are grouped and summed up in a single record. For example, lwlockid 49 in the above view represents LockMgrLock statistics. Now, I know there are some considerations. (1) Performance I've measured LWLock performance both with and without the patch, and confirmed that this patch does not affect LWLock performance at all.
pgbench scores with the patch:
tps = 900.906658 (excluding connections establishing)
tps = 908.528422 (excluding connections establishing)
tps = 903.900977 (excluding connections establishing)
tps = 910.470595 (excluding connections establishing)
tps = 909.685396 (excluding connections establishing)

pgbench scores without the patch:
tps = 909.096785 (excluding connections establishing)
tps = 894.868712 (excluding connections establishing)
tps = 910.074669 (excluding connections establishing)
tps = 904.022770 (excluding connections establishing)
tps = 895.673830 (excluding connections establishing)

Of course, this experiment was not I/O bound, and the cache hit ratio was 99.9%. (2) Memory space In this patch, I added three new members to the LWLock structure as uint64 to collect statistics. It means that those members must be held in shared memory, but I'm not sure whether that's appropriate. I think another possible option is holding those statistics values in local (backend) process memory, and sending them through the stats collector process (like other statistics values). (3) LWLock names (or labels) Currently, the pg_stat_lwlocks view shows the LWLockId itself, but an LWLockId is not easy for a DBA to map to an actual lock type. So, I want to show LWLock names (or labels), like 'WALWriteLock' or 'LockMgrLock', but how should I implement it? Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. 
http://www.uptime.jp

diff -rc postgresql-9.1.2.orig/src/backend/catalog/postgres.bki postgresql-9.1.2/src/backend/catalog/postgres.bki
*** postgresql-9.1.2.orig/src/backend/catalog/postgres.bki 2012-06-20 03:32:46.0 +0900
--- postgresql-9.1.2/src/backend/catalog/postgres.bki 2012-06-26 01:51:52.0 +0900
***
*** 1553,1558
--- 1553,1559
  insert OID = 3071 ( pg_xlog_replay_pause 11 10 12 1 0 0 f f f t f v 0 0 2278 _null_ _null_ _null_ _null_ pg_xlog_replay_pause _null_ _null_ _null_ )
  insert OID = 3072 ( pg_xlog_replay_resume 11 10 12 1 0 0 f f f t f v 0 0 2278 _null_ _null_ _null_ _null_ pg_xlog_replay_resume _null_ _null_ _null_ )
  insert OID = 3073 ( pg_is_xlog_replay_paused 11 10 12 1 0 0 f f f t f v 0 0 16 _null_ _null_ _null_ _null_ pg_is_xlog_replay_paused _null_ _null_ _null_ )
+ insert OID = 3764 ( pg_stat_get_lwlocks 11 10 12 1 100 0 f f f f t s 0 0 2249 {20,20,20,20} {o,o,o,o} {lwlockid,calls,waits,time_ms} _null_ pg_stat_get_lwlocks _null_ _null_ _null_ )
  insert OID = 2621 ( pg_reload_conf 11 10 12 1 0 0 f f f t f v 0 0 16 _null_ _null_ _null_ _null_ pg_reload_conf _null_ _null_ _null_ )
  insert OID = 2622 ( pg_rotate_logfile 11 10 12 1 0 0 f f f t f v 0 0 16 _null_ _null_ _null_ _null_ pg_rotate_logfile _null_ _null_ _null_ )
  insert OID = 2623 ( pg_stat_file 11 10 12 1 0 0 f f f t f v 1 0 2249 25 {25,20,1184,1184,1184,1184,16} {i,o,o,o,o,o,o} {filename,size,access,modification,change,creation,isdir} _null_ pg_stat_file _null_ _null_ _null_ )
diff -rc postgresql-9.1.2.orig/src/backend/catalog/postgres.description postgresql-9.1.2/src/backend/catalog/postgres.description
*** postgresql-9.1.2.orig/src/backend/catalog/postgres.description 2012-06-20 03:32:46.0 +0900
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics
2012/06/26 6:44, Josh Berkus wrote: On 6/25/12 1:29 PM, Satoshi Nagayasu wrote: (1) Performance I've measured LWLock performance both with and without the patch, and confirmed that this patch does not affect LWLock performance at all. This would be my main concern with this patch; it's hard for me to imagine that it has no performance impact *at all*, since trace_lwlocks has quite a noticeable one in my experience. However, the answer to that is to submit the patch and let people test. Thanks. I will submit the patch to the CommitFest page with some fixes to be able to work with the latest PostgreSQL on Git. I will remark that it would be far more useful to me if we could also track lwlocks per session. Overall counts are somewhat useful, but more granular counts are even more useful. What period of time does the table cover? Since last reset? Yes. It has not been implemented yet, since this code is just a PoC, but it is another design issue which needs to be discussed. To implement it, a new array can be added in the local process memory to hold lwlock statistics, and counters can be updated in both the shared memory and the local process memory at once. Then, the session can retrieve 'per-session' statistics from the local process memory via some dedicated function. Does it make sense? Any comments? Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics
2012/06/26 7:04, Kevin Grittner wrote: Josh Berkus j...@agliodbs.com wrote: On 6/25/12 1:29 PM, Satoshi Nagayasu wrote: (1) Performance I've measured LWLock performance both with and without the patch, and confirmed that this patch does not affect LWLock performance at all. This would be my main concern with this patch; it's hard for me to imagine that it has no performance impact *at all*, since trace_lwlocks has quite a noticeable one in my experience. However, the answer to that is to submit the patch and let people test. I think overhead is going to depend quite a bit on the gettimeofday() implementation, since that is called twice per lock wait. Yes. It's one of my concerns, and what I actually want hackers to test. It looked to me like there was nothing to prevent concurrent updates of the counts while gathering the accumulated values for display. Won't this be a problem on 32-bit builds? Actually, I'd like to know how I can improve my code on a 32-bit box. However, unfortunately I don't have any 32-bit (physical) box now, so I want someone to test it if it needs to be tested. Please add this to the Open CommitFest for a proper review: https://commitfest.postgresql.org/action/commitfest_view/open Will submit soon. Thanks. -Kevin Regards, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] log messages for archive recovery progress
2012/01/13 0:13, Robert Haas wrote: On Thu, Jan 12, 2012 at 10:04 AM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, Jan 12, 2012 at 2:13 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, Jan 11, 2012 at 9:01 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu sn...@uptime.jp wrote: However, I'm a bit afraid that it will confuse DBAs if we use "restored" under the pg_xlog replay context, because we have already used "restored" to mean a WAL file was successfully copied (not replayed) from the archive directory into the pg_xlog directory under the archive recovery context. So, to report the status of copying WAL files from the archive directory, I think we can use "restored", or "could not restore" on failure. And to report the status of replaying WAL files in the pg_xlog directory (even if a WAL file was copied from the archive), we have to use "recover" or "replay". Agreed. I can change "restored" to "using", so we have two message types: LOG: restored log file 00080047 from archive LOG: using pre-existing log file 00080047 from pg_xlog "using" seems pretty fuzzy to me. "replaying"? That was my first thought, but the message relates to which file has been selected, and how. Once it has been selected it will be replayed. The idea is to have the two messages look similar. The original message was "restored log file ..." and says nothing about replaying. We could change the old message (ugh! backwards compatibility alert): LOG: replaying log file 00080047 after restore from archive LOG: replaying log file 00080047 already in pg_xlog which doesn't sound much stronger to me... not sure. Hmm, I don't know. But that phrasing does at least have the advantage of being clearly parallel, which I like. It seems difficult to keep backward compatibility. :) Anyway, how about this one?
If we have 47 in archive, and 48 in pg_xlog:

(1) LOG: restored log file 00080047 from archive
(2) LOG: replaying log file 00080047
(3) LOG: could not restore file 00080048 from archive
(4) LOG: replaying log file 00080048

In this case, (4) "replaying" after (3) "could not restore ... from archive" would mean that it has started replaying a WAL file from pg_xlog. And if we have 47 in archive, and do not have 48 in pg_xlog:

(5) LOG: restored log file 00080047 from archive
(6) LOG: replaying log file 00080047
(7) LOG: could not restore file 00080048 from archive
(8) LOG: could not replay file 00080048

In this case, (8) "could not replay file" after (7) "could not restore ... from archive" would mean that 48 was found in neither the archive nor pg_xlog, so the latest WAL would be gone. I just got another option in my mind. Would it make sense to report the two numbers of WAL files applied during archive recovery, one from the archive and one from the pg_xlog directory? How about this one?

LOG: XXX file(s) from archive, YYY file(s) from pg_xlog successfully applied.

Thanks, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] log messages for archive recovery progress
2012/01/11 19:56, Simon Riggs wrote: 2012/1/11 Euler Taveira de Oliveira eu...@timbira.com: On 08-01-2012 11:59, Satoshi Nagayasu / Uptime Technologies, LLC. wrote:

[2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080046 from archive
[2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080046
[2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080047 from archive
[2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080047
cp: cannot stat `/backups/archlog/00080048': No such file or directory
[2011-12-08 15:14:37 JST] 16758: LOG: could not restore file 00080048 from archive
[2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog
[2011-12-08 15:14:37 JST] 16758: LOG: recoverying 00080048

What about just 'restored log file 00080048 from pg_xlog' instead of the last two messages? If you can't read from pg_xlog, emit 'could not restore file 00080048 from pg_xlog'. Yes, simple is better. We already have a message if the file is not present, we just need one if we switch to using pg_xlog. Please look at this. Thanks for the patch. I agree that simple is better. However, I'm a bit afraid that it will confuse DBAs if we use "restored" under the pg_xlog replay context, because we have already used "restored" to mean a WAL file was successfully copied (not replayed) from the archive directory into the pg_xlog directory under the archive recovery context. So, to report the status of copying WAL files from the archive directory, I think we can use "restored", or "could not restore" on failure. And to report the status of replaying WAL files in the pg_xlog directory (even if a WAL file was copied from the archive), we have to use "recover" or "replay". I'll try to revise my proposed log messages again. Thanks, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
[HACKERS] log messages for archive recovery progress
Hi, When I looked into archive recovery deeply from a DBA's point of view, I found that it's difficult to know (1) when the recovery process switched reading WAL segment files from the archive directory to the pg_xlog directory, and (2) whether it succeeded in applying the latest WAL segments in the pg_xlog directory. For example, when I had the 47 WAL segment in the archive directory and the 48 WAL segment in the pg_xlog directory, PostgreSQL said:

[2011-12-08 05:59:03 JST] 9003: LOG: restored log file 00080046 from archive
[2011-12-08 05:59:03 JST] 9003: LOG: restored log file 00080047 from archive
cp: cannot stat `/backups/archlog/00080048': No such file or directory
[2011-12-08 05:59:03 JST] 9003: LOG: record with zero length at 0/489F8B74
[2011-12-08 05:59:03 JST] 9003: LOG: redo done at 0/489F8B38
[2011-12-08 05:59:03 JST] 9003: LOG: last completed transaction was at log time 2011-12-08 05:52:01.507063+09
cp: cannot stat `/backups/archlog/0009.history': No such file or directory
[2011-12-08 05:59:03 JST] 9003: LOG: selected new timeline ID: 9
[2011-12-08 05:59:03 JST] 9003: LOG: restored log file 0008.history from archive
[2011-12-08 05:59:04 JST] 9003: LOG: archive recovery complete

I felt it was difficult to determine whether PostgreSQL had applied the 48 WAL segment in pg_xlog or not. So, I want to propose new log messages to show the progress of archive recovery as it reads WAL segments. With my tiny modification applied, the recovery process reports its progress for each WAL segment file, and also explicitly reports the switch from reading the archive directory to the pg_xlog directory, as shown below.
[2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080046 from archive
[2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080046
[2011-12-08 15:14:36 JST] 16758: LOG: restored log file 00080047 from archive
[2011-12-08 15:14:36 JST] 16758: LOG: recoverying 00080047
cp: cannot stat `/backups/archlog/00080048': No such file or directory
[2011-12-08 15:14:37 JST] 16758: LOG: could not restore file 00080048 from archive
[2011-12-08 15:14:37 JST] 16758: LOG: attempting to look into pg_xlog
[2011-12-08 15:14:37 JST] 16758: LOG: recoverying 00080048
[2011-12-08 15:14:37 JST] 16758: LOG: record with zero length at 0/489F8B74
[2011-12-08 15:14:37 JST] 16758: LOG: redo done at 0/489F8B38
[2011-12-08 15:14:37 JST] 16758: LOG: last completed transaction was at log time 2011-12-08 05:52:01.507063+09
cp: cannot stat `/backups/archlog/0009.history': No such file or directory
[2011-12-08 15:14:37 JST] 16758: LOG: selected new timeline ID: 9
[2011-12-08 15:14:37 JST] 16758: LOG: restored log file 0008.history from archive
[2011-12-08 15:14:38 JST] 16758: LOG: archive recovery complete

Do you think this is useful? Thanks, -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
Re: [HACKERS] LWLOCK_STATS
2012/01/07 16:58, Heikki Linnakangas wrote: On 07.01.2012 00:24, Robert Haas wrote: It's been a while since I did any testing with LWLOCK_STATS defined, so I thought it might be about time to do that again and see how things look. Here's how they looked back in July: http://archives.postgresql.org/pgsql-hackers/2011-07/msg01373.php Interesting. A couple of weeks ago I wrote a little patch that's similar to LWLOCK_STATS, but it prints out % of wallclock time that is spent acquiring, releasing, or waiting for a lock. I find that more useful than the counters. Here's the patch, I hope it's useful to others. It uses timer_create() and timer_settime(), so it probably won't work on all platforms, and requires linking with -lrt. I have just written up a systemtap script to observe lock statistics. It shows acquire counts, wait counts and total waiting time for each lwlock every 5 seconds. Screenshot here: http://twitpic.com/83p2cz Is this useful for pg developers? -- Satoshi Nagayasu sn...@uptime.jp Uptime Technologies, LLC. http://www.uptime.jp
[HACKERS] xlogdump 0.4.0
Hi hackers, I'm pleased to announce the latest release of xlogdump. xlogdump is a tool for extracting data from WAL segment files. Here is the xlogdump README: https://github.com/snaga/xlogdump/blob/master/xlogdump/README.xlogdump xlogdump was originally developed by Tom Lane and Diogo Biazus around five years ago, but it had not been maintained for the last few years. So, I picked it up, fixed some problems, and enhanced it. It still has several issues, but it has started working, so it's time to try it again, folks! https://github.com/snaga/xlogdump/downloads xlogdump can now be compiled against 8.2 through 9.0 (those were tough fixes!), so it could be a fun (and of course useful) tool for many hackers and users. Please visit the new project repository, and try it. https://github.com/snaga/xlogdump If you have any bug reports and/or comments, please feel free to email me. Thanks in advance, p.s. I really appreciate Tom and Diogo for their previous great work. -- NAGAYASU Satoshi satoshi.nagay...@gmail.com
Re: [HACKERS] top for postgresql (ptop?)
Mark, Very interesting. I've been looking for such a tool. Unfortunately, I can't compile it on my Solaris box right now, but I hope it will be shipped with the PostgreSQL distribution. Mark Wong wrote: Hi everyone, I was playing with converting unixtop (the version of top used in FreeBSD) to only show PostgreSQL processes pulled from the pg_stat_activity table. I have a version that kind of works here: http://pgfoundry.org/frs/download.php/1468/ptop-3.6.1-pre6.tar.gz I've tried it on FreeBSD and Linux, not sure about other platforms though. So it looks a lot like top and can currently do a few simple things like display the current_query from pg_stat_activity for a given process, show the locks held by a process and on which tables, and show the query plan for the current query. It is a ways from polished (not really documented, etc.) but I wanted to see what people thought of a text/curses sort of monitoring tool like this. Maybe something to distribute with PostgreSQL? :) Forgive me if I didn't try out pgtop (the CPAN module.) Regards, Mark -- NAGAYASU Satoshi [EMAIL PROTECTED] Phone: +81-50-5546-2496