Re: [HACKERS] gaussian distribution pgbench
And I found another useful point of this feature. The following results are for the '--gaussian=20' case and the '--gaussian=2' case, with the same postgresql settings.

> [mitsu-ko@pg-rex31 pgbench]$ ./pgbench -c8 -j4 --gaussian=20 -T30 -P 5
> starting vacuum...end.
> progress: 5.0 s, 4285.8 tps, lat 1.860 ms stddev 0.425
> progress: 10.0 s, 4249.2 tps, lat 1.879 ms stddev 0.372
> progress: 15.0 s, 4230.3 tps, lat 1.888 ms stddev 0.430
> progress: 20.0 s, 4247.3 tps, lat 1.880 ms stddev 0.400
> LOG: checkpoints are occurring too frequently (12 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 25.0 s, 4269.0 tps, lat 1.870 ms stddev 0.427
> progress: 30.0 s, 4318.1 tps, lat 1.849 ms stddev 0.415
> transaction type: Gaussian distribution TPC-B (sort of)
> scaling factor: 10
> standard deviation threshold: 20.0
> access probability of top 20%, 10% and 5% records: 0.4 0.95450 0.68269
> query mode: simple
> number of clients: 8
> number of threads: 4
> duration: 30 s
> number of transactions actually processed: 128008
> latency average: 1.871 ms
> latency stddev: 0.412 ms
> tps = 4266.266374 (including connections establishing)
> tps = 4267.312022 (excluding connections establishing)
> [mitsu-ko@pg-rex31 pgbench]$ ./pgbench -c8 -j4 --gaussian=2 -T30 -P 5
> starting vacuum...end.
> LOG: checkpoints are occurring too frequently (13 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (1 second apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 5.0 s, 3927.9 tps, lat 2.030 ms stddev 0.691
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (1 second apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 10.0 s, 4045.8 tps, lat 1.974 ms stddev 0.835
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (1 second apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 15.0 s, 4042.5 tps, lat 1.976 ms stddev 0.613
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 20.0 s, 4103.9 tps, lat 1.946 ms stddev 0.540
> LOG: checkpoints are occurring too frequently (1 second apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 25.0 s, 4003.1 tps, lat 1.995 ms stddev 0.526
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (1 second apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 30.0 s, 4025.5 tps, lat 1.984 ms stddev 0.568
> transaction type: Gaussian distribution TPC-B (sort of)
> scaling factor: 10
> standard deviation threshold: 2.0
> access probability of top 20%, 10% and 5% records: 0.32566 0.16608 0.08345
> query mode: simple
> number of clients: 8
> number of threads: 4
> duration: 30 s
> number of transactions actually processed: 120752
> latency average: 1.984 ms
> latency stddev: 0.638 ms
> tps = 4024.823433 (including connections establishing)
> tps = 4025.87 (excluding connections establishing)

In the '--gaussian=2' benchmark, checkpoints happen much more frequently than in the '--gaussian=20' benchmark, because the former updates a wider range of records, so more full-page writes are generated.
Re: [HACKERS] gaussian distribution pgbench
(2014/03/17 23:29), Robert Haas wrote:

On Sat, Mar 15, 2014 at 4:50 AM, Mitsumasa KONDO wrote: There are explanations and computations as comments in the code. If it is about the documentation, I'm not sure that a very precise mathematical definition will help a lot of people, and might rather hinder understanding, so the doc focuses on an intuitive explanation instead.

Yeah, I think we had better explain only the information necessary for using this feature. If we add mathematical theory to the docs, it will be too difficult for users, and it is wasted effort.

Well, if you *don't* include at least *some* mathematical description of what the feature does in the documentation, then users who need to understand it will have to read the source code to figure it out, which is going to be even more difficult.

I have fixed this problem. Please see the v12 patch. It doesn't include a mathematical description, but users will be able to get an intuitive understanding from the explanation in the document.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench
(2014/03/17 22:37), Tom Lane wrote:
> KONDO Mitsumasa writes:
>> (2014/03/17 18:02), Heikki Linnakangas wrote:
>>> On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote:
>>> There is an infinite number of variants of the TPC-B test that we could include in pgbench. If we start adding every one of them, we're quickly going to have hundreds of options to choose the workload. I'd like to keep pgbench simple. These two new test variants, gaussian and exponential, are not that special that they'd deserve to be included in the program itself.
>
>> Well, I added only two options, and they are major distributions, seen in real database systems more often than the uniform distribution. I'm afraid you are too worried; hundreds of options will not be added, and pgbench is still simple.
>
> FWIW, I concur with Heikki on this. Adding new versions of \setrandom is useful functionality. Embedding them in the "standard" test is not, because that just makes it (even) less standard. And pgbench has too darn many switches already.

Hmm, I cooled down and looked over the pgbench options. I can understand his argument: there are many switches already, and the list will only grow larger unless we stop adding new options. However, everyone who added an option in the past thought it would be useful for improving PostgreSQL performance. Now those same options are held against new ones, like my feature, which can create benchmark distributions closer to real systems. I think that is very unfortunate, and it also tends to stall progress on improving PostgreSQL performance, not only in pgbench. And if we remove the command-line options, new features of this kind will tend to be rejected, which is not good either.

By the way, if we remove the command-line option, it becomes difficult to understand the shape of the gaussian distribution, because the threshold parameter is very sensitive, and the printed summary is a very useful feature.
Analyzing and visualizing pgbench_history with SQL is difficult and laborious. What do you think about this problem? It has not been discussed yet.

> [mitsu-ko@pg-rex31 pgbench]$ ./pgbench --gaussian=2
> ~
> access probability of top 20%, 10% and 5% records: 0.32566 0.16608 0.08345
> ~
> [mitsu-ko@pg-rex31 pgbench]$ ./pgbench --gaussian=4
> ~
> access probability of top 20%, 10% and 5% records: 0.57633 0.31086 0.15853
> ~
> [mitsu-ko@pg-rex31 pgbench]$ ./pgbench --gaussian=10
> ~
> access probability of top 20%, 10% and 5% records: 0.95450 0.68269 0.38292
> ~

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/17 18:02), Heikki Linnakangas wrote:

On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote: By the way, you seem to want to remove the --gaussian=NUM and --exponential=NUM command options. Can you tell me the objective reason? I think pgbench is the benchmark tool for PostgreSQL, and the default benchmark is the TPC-B-like benchmark. That is written in the documentation, and the default benchmark was not changed by my patch. So we need not remove the command options; they are one of a variety of benchmark options. Maybe you have some misunderstanding about my patch...

There is an infinite number of variants of the TPC-B test that we could include in pgbench. If we start adding every one of them, we're quickly going to have hundreds of options to choose the workload. I'd like to keep pgbench simple. These two new test variants, gaussian and exponential, are not that special that they'd deserve to be included in the program itself.

Well, I added only two options, and they are major distributions, seen in real database systems more often than the uniform distribution. I'm afraid you are too worried; hundreds of options will not be added, and pgbench is still simple.

pgbench already has a mechanism for running custom scripts, in which you can specify whatever workload you want. Let's use that. If it's missing something you need to specify the workload you want, let's enhance the script language.

I have not seen many users who use pgbench custom scripts. Gaussian and exponential distributions are much better for measuring real system performance, so I'd like to offer them as command-line options. With pgbench as it stands, we can only vary the database size, which is not a realistic situation. We want to forecast the required system by calculating the size of the hot spot or the distribution of the access pattern. I'd really like to see this included, from my heart :) Please...
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/17 17:46), Heikki Linnakangas wrote:

On 03/15/2014 08:53 AM, Fabien COELHO wrote:
>* Does min and max really make sense for gaussian and exponential
>distributions? For gaussian, I would expect mean and standard deviation as
>the parameters, not min/max/threshold.

Yes... and no :-) The aim is to draw an integer primary key from a table, so it must be in a specified range.

Well, I don't agree with that aim. It's useful for choosing a primary key, as in the pgbench TPC-B workload, but a gaussian distributed random number could be used for many other things too. For example:

\setrandom foo ... gaussian
select * from cheese where weight > :foo

And :foo should be a float, not an integer. That's what I was trying to say earlier, when I said that the variable should be a float. If you need an integer, just cast or round it in the query. I realize that the current \setrandom sets the variable to an integer, so gaussian/exponential would be different. But so what? An option to generate uniformly distributed floats would be handy too, though.

Well, that seems like a new feature. If you want to realize it as a double, we could add '\setrandomd' as a double random generator in pgbench. I would agree with that.

This is approximated by drawing a double value with the expected distribution (gaussian or exponential) and projecting it carefully onto integers. If it is out of range, there is a loop and another value is drawn. The minimal threshold constraint (2.0) ensures that the probability of looping is low.

Well, that's one way to constrain it to the given range, but there are many other ways to do it. Like, clamp it to the min/max if it's out of range.

That is too heavy a method; the client-side calculation must be light.

I don't think we need to choose any particular method, you can handle that in the test script.

I think our implementation is the best way to realize it: it is fast, and it is robust because the probability of looping is low. If you have a better idea, please tell us.
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Hi Heikki-san,

(2014/03/17 14:39), KONDO Mitsumasa wrote: (2014/03/15 15:53), Fabien COELHO wrote: Hello Heikki,

A couple of comments:

* There should be an explicit "\setrandom ... uniform" option too, even though you get that implicitly if you don't specify the distribution

Fixed. We can now use "\setrandom val min max uniform" without error messages.

* What exactly does the "threshold" mean? The docs informally explain that "the larger the threshold, the more frequent values close to the middle of the interval are drawn", but that's pretty vague.

There are explanations and computations as comments in the code. If it is about the documentation, I'm not sure that a very precise mathematical definition will help a lot of people, and might rather hinder understanding, so the doc focuses on an intuitive explanation instead.

I added more detailed information to the document. Is it OK? Please confirm it.

* Does min and max really make sense for gaussian and exponential distributions? For gaussian, I would expect mean and standard deviation as the parameters, not min/max/threshold.

Yes... and no :-) The aim is to draw an integer primary key from a table, so it must be in a specified range. This is approximated by drawing a double value with the expected distribution (gaussian or exponential) and projecting it carefully onto integers. If it is out of range, there is a loop and another value is drawn. The minimal threshold constraint (2.0) ensures that the probability of looping is low.

That makes sense. Please see the picture attached the other day.

* How about setting the variable as a float instead of integer? Would seem more natural to me. At least as an option.

Which variable? The values set by setrandom are mostly used for primary keys. We really want integers in a range.

Oh, I see; he was talking about the documentation. The document was mistaken: the threshold parameter must be a double, so I fixed the document.

By the way, you seem to want to remove the --gaussian=NUM and --exponential=NUM command options.
Can you tell me the objective reason? I think pgbench is the benchmark tool for PostgreSQL, and the default benchmark is the TPC-B-like benchmark. That is written in the documentation, and the default benchmark was not changed by my patch. So we need not remove the command options; they are one of a variety of benchmark options. Maybe you have some misunderstanding about my patch...

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***
*** 98,103
  static int pthread_join(pthread_t th, void **thread_return);
--- 98,106
  #define LOG_STEP_SECONDS 5 /* seconds between log messages */
  #define DEFAULT_NXACTS 10 /* default nxacts */
+ #define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */
+ #define MIN_EXPONENTIAL_THRESHOLD 2.0 /* minimum threshold for exp */
+
  int nxacts = 0; /* number of transactions per client */
  int duration = 0; /* duration in seconds */
***
*** 169,174
  bool is_connect; /* establish connection for each transaction */
--- 172,185
  bool is_latencies; /* report per-command latencies */
  int main_pid; /* main process id used in log filename */
+ /* gaussian distribution tests: */
+ double stdev_threshold; /* standard deviation threshold */
+ bool use_gaussian = false;
+
+ /* exponential distribution tests: */
+ double exp_threshold; /* threshold for exponential */
+ bool use_exponential = false;
+
  char *pghost = "";
  char *pgport = "";
  char *login = NULL;
***
*** 330,335
  static char *select_only = {
--- 341,428
  "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ "\\set nbranches " CppAsS
Re: [HACKERS] gaussian distribution pgbench
(2014/03/15 15:53), Fabien COELHO wrote:

Hello Heikki,

A couple of comments:

* There should be an explicit "\setrandom ... uniform" option too, even though you get that implicitly if you don't specify the distribution

Indeed. I agree. I suggested it, but it got lost.

* What exactly does the "threshold" mean? The docs informally explain that "the larger the threshold, the more frequent values close to the middle of the interval are drawn", but that's pretty vague.

There are explanations and computations as comments in the code. If it is about the documentation, I'm not sure that a very precise mathematical definition will help a lot of people, and might rather hinder understanding, so the doc focuses on an intuitive explanation instead.

* Does min and max really make sense for gaussian and exponential distributions? For gaussian, I would expect mean and standard deviation as the parameters, not min/max/threshold.

Yes... and no :-) The aim is to draw an integer primary key from a table, so it must be in a specified range. This is approximated by drawing a double value with the expected distribution (gaussian or exponential) and projecting it carefully onto integers. If it is out of range, there is a loop and another value is drawn. The minimal threshold constraint (2.0) ensures that the probability of looping is low.

* How about setting the variable as a float instead of integer? Would seem more natural to me. At least as an option.

Which variable? The values set by setrandom are mostly used for primary keys. We really want integers in a range.

Oh, I see; he was talking about the documentation. The document currently says:

+ Moreover, set gaussian or exponential with threshold integer value,
+ we can get gaussian or exponential random in integer value between
+ min and max bounds inclusive.

Correctly, it should be:

+ Moreover, set gaussian or exponential with threshold double value,
+ we can get gaussian or exponential random in integer value between
+ min and max bounds inclusive.
And I am going to make the document easier for users to understand.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/13 23:00), Fujii Masao wrote:

On Thu, Mar 13, 2014 at 10:51 PM, Heikki Linnakangas wrote: On 03/13/2014 03:17 PM, Fujii Masao wrote: On Tue, Mar 11, 2014 at 1:49 PM, KONDO Mitsumasa wrote: (2014/03/09 1:49), Fabien COELHO wrote: I'm okay with this UI and its implementation.

OK. Should we have the same discussion for the UI of the command-line options? The patch adds two options --gaussian and --exponential, but this UI seems to be a bit inconsistent with the UI for \setrandom. Instead, we can use something like --distribution=[uniform | gaussian | exponential].

IMHO we should just implement the \setrandom changes, and not add any of these options to modify the standard test workload. If someone wants to run the TPC-B workload with gaussian or exponential distribution, they can implement it as a custom script. The docs include the script for the standard TPC-B workload; just copy-paste that and modify the \setrandom lines.

Well, when we set '--gaussian=NUM' or '--exponential=NUM' on the command line, we can see the access probability of the top N% of records in the final output. The output looks like the following:

[mitsu-ko@localhost pgbench]$ ./pgbench --exponential=10 postgres
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.0
access probability of top 20%, 10% and 5% records: 0.86466 0.63212 0.39347
~

This feature helps the user understand the bias of the distribution when tuning the threshold parameter. Without it, it is difficult to understand the distribution of the access pattern, and it cannot be realized in a custom script, because the range of the distribution (min, max, and SQL pattern) is unknown in a custom script. So I think the present UI is not bad and should not change.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Hi,

(2014/03/14 4:21), Fabien COELHO wrote:

We should do the same discussion for the UI of the command-line options? The patch adds two options --gaussian and --exponential, but this UI seems to be a bit inconsistent with the UI for \setrandom. Instead, we can use something like --distribution=[uniform | gaussian | exponential].

Hmmm. That is possible, obviously. Note that it does not need to resort to a custom script, if one can do something like "--define=exp_threshold=5.6".

Yeah, the threshold parameter is needed by the distribution-generating algorithms in my patch, and it is important that we can control the distribution pattern with this parameter.

If so, maybe one simpler named variable could be used, say "threshold", instead of separate names for each options.

If we use a single generic "threshold" option, I think it becomes difficult to understand which option it depends on. "Threshold" is a very general term, and when we add other new features later, it will be hard to tell which option the parameter belongs to and when it is required.

However there is a catch: currently the option allows to check that the threshold is large enough so as to avoid loops in the generator. So this means moving the check into the generator, and doing it over and over. Possibly this is a good idea, because otherwise a custom script could circumvent the check. Well, the current status is that the check can be avoided with --define... Also, a shorter possibly additional name would be nice, maybe something like: --dist=exp|gauss|uniform? Not sure. I like long options not to be too long.

Well, if we run the standard benchmark in pgbench, we need not set any option, because it is the default benchmark and it uses the uniform distribution. And if we run extra benchmarks in pgbench, such as '-S' or '-N', we need to set an option, because they are non-standard benchmark settings; the same reasoning applies to the gaussian and exponential distributions. So the present UI keeps consistency and follows pgbench history.
> I like long options not to be too long.

Yes, I like that too. The present UI is very simple and useful in combination with options such as '-S' and '--gaussian', so I hope the UI does not change. For example:

pgbench -S --gaussian=5
pgbench -N --exponential=2 --sampling-rate=0.8

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/09 1:49), Fabien COELHO wrote:

Hello Mitsumasa-san,

The new "\setrandom" interface is here:

\setrandom var min max [gaussian threshold | exponential threshold]

The attached patch realizes this interface, but it has slightly ugly coding in executeStatement() and process_commands()..

I think it is not too bad. The "ignore extra arguments on the line" is a little pre-existing mess anyway.

All right.

What do you think?

I'm okay with this UI and its implementation.

OK. The attached patch updates the documentation. I don't like complex sentences, so I use tags a lot. If you like this documentation, please mark it ready for committer.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***
*** 98,103
  static int pthread_join(pthread_t th, void **thread_return);
--- 98,106
  #define LOG_STEP_SECONDS 5 /* seconds between log messages */
  #define DEFAULT_NXACTS 10 /* default nxacts */
+ #define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */
+ #define MIN_EXPONENTIAL_THRESHOLD 2.0 /* minimum threshold for exp */
+
  int nxacts = 0; /* number of transactions per client */
  int duration = 0; /* duration in seconds */
***
*** 169,174
  bool is_connect; /* establish connection for each transaction */
--- 172,185
  bool is_latencies; /* report per-command latencies */
  int main_pid; /* main process id used in log filename */
+ /* gaussian distribution tests: */
+ double stdev_threshold; /* standard deviation threshold */
+ bool use_gaussian = false;
+
+ /* exponential distribution tests: */
+ double exp_threshold; /* threshold for exponential */
+ bool use_exponential = false;
+
  char *pghost = "";
  char *pgport = "";
  char *login = NULL;
***
*** 330,335
  static char *select_only = {
--- 341,428
  "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --exponential with -S case */
+ static char *exponential_select_only = {
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ };
+
+ /* --gaussian case */
+ static char *gaussian_tpc_b = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --gaussian with -N case */
+ static char *gaussian_simple_update = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "
Re: [HACKERS] gaussian distribution pgbench
(2014/03/07 16:02), KONDO Mitsumasa wrote:

And the other cases are classified as follows:

\setrandom var min max gaussian #hoge --> uniform

Oh, that's wrong... It should be:

\setrandom var min max gaussian #hoge --> ERROR

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Hi,

(2014/03/04 17:42), KONDO Mitsumasa wrote:
> (2014/03/04 17:28), Fabien COELHO wrote:
>>> OK. I'm not sure which idea is the best, so I will wait for comments from the community :)
>> Hmmm. Maybe you can do what Tom voted for, he is the committer :-)
> Yeah, but he might change his mind through our discussion. So I will wait until tomorrow, and if there are no comments, I will start implementing what Tom voted for.

I created a patch with the fixed UI. If we agree on this interface, I will also start fixing the documentation. The new "\setrandom" interface is:

\setrandom var min max [gaussian threshold | exponential threshold]

The attached patch realizes this interface, but it has slightly ugly coding in executeStatement() and process_commands().. It looks like the following:

if (argc == 4)
{
    ... /* uniform */
}
else if (argv[4] is gaussian or exponential)
{
    ... /* gaussian or exponential */
}
else
{
    ... /* uniform with extra arguments */
}

This is because the pgbench custom script allows extra comments or extra arguments in its file. For example, the following cases are no problem:

\setrandom var min max #hoge --> uniform random
\setrandom var min max #hoge1 #hoge2 --> uniform random
\setrandom var min max gaussian threshold #hoge --> gaussian random

And the other cases are classified as follows:

\setrandom var min max gaussian #hoge --> uniform
\setrandom var min max max2 gaussian threshold --> uniform
\setrandom var min gaussian #hoge --> ERROR

However, if the grammar in a pgbench custom script is wrong, pgbench outputs an error log on the user's terminal. So I think there is no particular problem. What do you think?
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***
*** 98,103
  static int pthread_join(pthread_t th, void **thread_return);
--- 98,106
  #define LOG_STEP_SECONDS 5 /* seconds between log messages */
  #define DEFAULT_NXACTS 10 /* default nxacts */
+ #define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */
+ #define MIN_EXPONENTIAL_THRESHOLD 2.0 /* minimum threshold for exp */
+
  int nxacts = 0; /* number of transactions per client */
  int duration = 0; /* duration in seconds */
***
*** 169,174
  bool is_connect; /* establish connection for each transaction */
--- 172,185
  bool is_latencies; /* report per-command latencies */
  int main_pid; /* main process id used in log filename */
+ /* gaussian distribution tests: */
+ double stdev_threshold; /* standard deviation threshold */
+ bool use_gaussian = false;
+
+ /* exponential distribution tests: */
+ double exp_threshold; /* threshold for exponential */
+ bool use_exponential = false;
+
  char *pghost = "";
  char *pgport = "";
  char *login = NULL;
***
*** 330,335
  static char *select_only = {
--- 341,428
  "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Hi all, I think this patch has been completely forgotten, which feels very unfortunate :( Min, max, and stddev are basic statistics in general monitoring tools, so I'd like to push it. (2014/02/12 15:45), KONDO Mitsumasa wrote: (2014/01/29 17:31), Rajeev rastogi wrote: No issue, you can share the test cases with me, and I will take the performance report. The attached patch is updated for the latest pg_stat_statements. It includes min, max, and stddev statistics. Could you run a compile test in your Windows environment? I think the compile error has been fixed. We had a discussion in the community about which statistics are useful; I think both kinds have strong and weak points. For a statement executed only a few (2 or 3) times, stddev is close to meaningless because it cannot estimate the true deviation precisely, while min and max work well because they are measured values rather than estimates. On the other hand, for frequently executed statements the positions are reversed: stddev becomes very useful for characterizing the whole set of executions, while min and max may just capture extreme outliers. At the end of the day, these values complement each other when we want to understand the actual statements. And my past experience showed no performance problems with this patch. So I'd like to implement all of these values in pg_stat_statements. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench
(2014/03/04 17:28), Fabien COELHO wrote: OK. I'm not sure which idea is the best. So I will wait for comments from the community :) Hmmm. Maybe you can do what Tom voted for, he is the committer :-) Yeah, but he might change his mind through our discussion. So I will wait until tomorrow, and if there are no comments, I will start implementing what Tom voted for. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/03 16:51), Fabien COELHO wrote:
>>> \setrandom foo 1 10 [uniform]
>>> \setrandom foo 1 :size gaussian 3.6
>>> \setrandom foo 1 100 exponential 7.2
>> It's a good design. I think it will have lower overhead in the parsing part of pgbench, because string comparisons will be reduced (maybe). And I'd like to remove [uniform], because we have to keep compatibility with old scripts, and a random function produces a uniform distribution by default in common programming practice.
>
> I just put "uniform" as an optional default, hence the brackets.

All right, I misunderstood. However, if we select this format, I would still like to remove it: pgbench needs to check the argument count, and allowing the optional keyword makes that check less simple.

> Otherwise, what I would have in mind if this were designed from scratch:
>
> \set foo 124
> \set foo "string value" (?)
> \set foo :variable
> \set foo 12 + :shift
>
> And then
>
> \set foo uniform 1 10
> \set foo gaussian 1 10 4.2
> \set foo exponential 1 100 5.2
>
> or maybe functions could be prepended with something like "&uniform". But that would be for another life :-)

I don't agree with that. It adds more overhead in the parsing part and is more complex for the user.

>> However, the new grammar is a little long in user scripts. It seems to be a trade-off between script readability and user writing cost.
>
> Yep.

OK. I'm not sure which idea is the best. So I will wait for comments from the community :) Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/02 22:32), Fabien COELHO wrote: Alvaro Herrera writes: It seems that in the review so far, Fabien has focused mainly on the mathematical properties of the new random number generation. That seems perfectly fine, but no comment has been made about the chosen UI for the feature. Per the few initial messages in the thread, in the patch as submitted you ask for a gaussian random number by using \setgaussian, and exponential via \setexp. Is this the right UI? I thought it would be both concise and clear to have these as another form of \set*. Yeah, but we have only gotten two or three comments so far. So I agree with discussing the UI. There is an additional argument expected. That would make: \setrandom foo 1 10 [uniform] \setrandom foo 1 :size gaussian 3.6 \setrandom foo 1 100 exponential 7.2 It's a good design. I think it will have lower overhead in the parsing part of pgbench, because string comparisons will be reduced (maybe). And I'd like to remove [uniform], because we have to keep compatibility with old scripts, and a random function produces a uniform distribution by default in common programming practice. However, the new grammar is a little long in user scripts. It seems to be a trade-off between script readability and user writing cost. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] What behavior is in this loop?
(2014/02/27 20:19), Heikki Linnakangas wrote: On 02/27/2014 12:38 PM, KONDO Mitsumasa wrote: I found an interesting "for" and "while" loop in WaitForWALToBecomeAvailable() in xlog.c. Can you explain this behavior? for (;;) { ~ } while (StandbyMode) I confirmed that this code compiles without problems in gcc :) Oh wow :-). That's clearly a thinko, although harmless in this case. Looking at the git history, I made that mistake in commit abf5c5c9a. Before that, there was no "while". That's easier to understand with some extra formatting. It's two loops, like this: /* loop 1 */ for (;;) { ... } /* loop 2 */ while (StandbyMode); The second loop is obviously completely pointless. Thankfully, there are no "break"s inside the first loop (the ones within the switch statements don't count), so the endless while loop is never reached. Yeah, the StandbyMode flag doesn't change in this loop. I'll go fix that... Thanks for the report! Thanks for your kindness! By the way, I found a "cannot promote" problem in PG 9.3.3 in standby mode with archive recovery and crash recovery situations. I am analyzing the problem and fixing it now. Please watch out for it! Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Unfortunate choice of short switch name in pgbench
(2014/02/28 2:39), Tom Lane wrote:
> Fabien COELHO writes:
>>> Yeah, but they don't make -P take an integer argument. It's that little frammish that makes this problem significant.
>
>> I do not see a strong case to make options with arguments case insensitive as a general rule. If this is done for -p/-P, why not -t/-T?

I'll say the same thing. And if we remove the -P short option in pgbench, it implies that -P with an integer argument should be forbidden across postgres commands. Surely we don't want that.

>> If you really feel you must remove -P, please replace it with another one-letter option; I use this option nearly every time I run pgbench.
>
> Meh. If I thought -P would be that popular, I'd expect people to get used to the issue. I don't believe it though.

At least, a user who is interested in postgres performance tuning (including kernel options, etc.) will often use this option. I recommended this feature because it lets us see bottlenecks we could not see before :) I believe you will also come to like it more and more as you use it. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
[HACKERS] What behavior is in this loop?
Hi, I found an interesting "for" and "while" loop in WaitForWALToBecomeAvailable() in xlog.c. Can you explain this behavior? for (;;) { ~ } while (StandbyMode) I confirmed that this code compiles without problems in gcc :) Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/02/17 21:44), Rajeev rastogi wrote: It compiled successfully on Windows. Thank you for checking on Windows! It is very helpful. Can we allow adding all three statistics? I think stddev alone is difficult for users to interpret, but with min and max we can picture each statement's situation more easily. And I don't want to maintain this feature only in my monitoring tool, pg_statsinfo, because it would benefit many pg_stat_statements users and future improvement of postgres performance. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Exposing currentTransactionWALVolume
(2014/02/15 23:04), Andres Freund wrote: Hi Simon, On 2014-01-14 17:12:35 +, Simon Riggs wrote:

/*
- * MarkCurrentTransactionIdLoggedIfAny
+ * ReportTransactionInsertedWAL
 *
- * Remember that the current xid - if it is assigned - now has been wal logged.
+ * Remember that the current xid - if it is assigned - has now inserted WAL
 */
void
-MarkCurrentTransactionIdLoggedIfAny(void)
+ReportTransactionInsertedWAL(uint32 insertedWALVolume)
{
+	currentTransactionWALVolume += insertedWALVolume;
	if (TransactionIdIsValid(CurrentTransactionState->transactionId))
		CurrentTransactionState->didLogXid = true;
}

Not a big fan of combining those two. One works on the toplevel transaction, the other on the current subtransaction... The new name also ignores that it only takes effect if there's actually a transaction in progress. Oh, yes. I don't have a good idea myself, but we need to change the function name or add a new function for accumulating the inserted WAL volume. Once that is fixed, I will mark the patch ready for committer, because I cannot see any other problems in it. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/02/16 7:38), Fabien COELHO wrote: I have updated the patch (v7) based on Mitsumasa's latest v6: - some code simplifications & formula changes. - I've added explicit looping probability computations in comments to show the (low) looping probability of the iterative search. - I've tried to clarify the sgml documentation. - I've removed the 5.0 default value as it was not used anymore. - I've renamed some variables to match the surrounding naming style. Thank you for your detailed review and code fixes! I have checked your modified version; it seems better than the previous version, and the documentation changes are very helpful. * Mathematical soundness I've checked again the mathematical soundness of the methods involved. After further thought, I'm not so sure that no bias is induced by taking the second value, based on "cos", when the first, based on "sin", has failed the test. So I removed the cos computation from the gaussian version and simplified the code accordingly. This means it may be a little less efficient, but I'm more confident that there is no bias. I tried to confirm which method is better; however, at the end of the day it does not matter, because other parts of the implementation have bigger overhead in the pgbench client. We like simple implementations, so I agree with your modified version. I also tested this version: there is no measurable overhead in creating gaussian and exponential random numbers even with the minimum threshold, which is the worst case. * Conclusion If Mitsumasa-san is okay with the changes I have made, I would suggest to accept this patch. The attached patch, based on v7, additionally prints the access probability of records at the end of the pgbench results when the exponential option is used. It is calculated by a definite integral of e^-x. If you check it and see no problem, please mark it ready for committer.
Ishii-san will review this patch:) Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pgbench/pgbench.c --- b/contrib/pgbench/pgbench.c *** *** 98,103 static int pthread_join(pthread_t th, void **thread_return); --- 98,106 #define LOG_STEP_SECONDS 5 /* seconds between log messages */ #define DEFAULT_NXACTS 10 /* default nxacts */ + #define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */ + #define MIN_EXPONENTIAL_THRESHOLD 2.0 /* minimum threshold for exp */ + int nxacts = 0; /* number of transactions per client */ int duration = 0; /* duration in seconds */ *** *** 169,174 bool is_connect; /* establish connection for each transaction */ --- 172,185 bool is_latencies; /* report per-command latencies */ int main_pid; /* main process id used in log filename */ + /* gaussian distribution tests: */ + double stdev_threshold; /* standard deviation threshold */ + booluse_gaussian = false; + + /* exponential distribution tests: */ + double exp_threshold; /* threshold for exponential */ + bool use_exponential = false; + char *pghost = ""; char *pgport = ""; char *login = NULL; *** *** 330,335 static char *select_only = { --- 341,428 "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" }; + /* --exponential case */ + static char *exponential_tpc_b = { + "\\set nbranches " CppAsString2(nbranches) " * :scale\n" + "\\set ntellers " CppAsString2(ntellers) " * :scale\n" + "\\set naccounts " CppAsString2(naccounts) " * :scale\n" + "\\setexponential aid 1 :naccounts :exp_threshold\n" + "\\setrandom bid 1 :nbranches\n" + "\\setrandom tid 1 :ntellers\n" + "\\setrandom delta -5000 5000\n" + "BEGIN;\n" + "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" + "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n" + "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n" + "INSERT INTO pgbench_history (tid, bid, 
aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n" + "END;\n" + }; + + /* --exponential with -N case */ + static char *exponential_simple_update = { + "\\set nbranches " CppAsString2(nbranches) " * :scale\n" + "\\set ntellers " CppAsString2(ntellers) " * :scale\n" + "\\set naccounts " CppAsString2(naccounts) " * :scale\n" + "\\setexponential aid 1 :naccounts :exp_threshold\n" + "\\setrandom bid 1 :nbranches\n" + "\\setrandom tid 1 :ntellers\n" + "\\setrandom delta -5000 5000\n" + "BEGIN;\n" + "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" + "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n" + "END;\n" + }; + + /* --exponential with -S case */ + static char *exponential_select_only = { + "\\set naccounts " CppAsString2(na
Re: [HACKERS] gaussian distribution pgbench
Sorry, previos attached patch has small bug. Please use latest one. > 134 - return min + (int64) (max - min + 1) * rand; > 134 + return min + (int64)((max - min + 1) * rand); Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pgbench/pgbench.c --- b/contrib/pgbench/pgbench.c *** *** 176,181 int progress_nthreads = 0; /* number of threads for progress report */ --- 176,183 bool is_connect; /* establish connection for each transaction */ bool is_latencies; /* report per-command latencies */ int main_pid; /* main process id used in log filename */ + double stdev_threshold = 5; /* standard deviation threshold */ + bool gaussian_option = false; /* use gaussian distribution random generator */ char *pghost = ""; char *pgport = ""; *** *** 338,346 static char *select_only = { --- 340,390 "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" }; + /* --gaussian case */ + static char *gaussian_tpc_b = { + "\\set nbranches " CppAsString2(nbranches) " * :scale\n" + "\\set ntellers " CppAsString2(ntellers) " * :scale\n" + "\\set naccounts " CppAsString2(naccounts) " * :scale\n" + "\\setgaussian aid 1 :naccounts :stdev_threshold\n" + "\\setrandom bid 1 :nbranches\n" + "\\setrandom tid 1 :ntellers\n" + "\\setrandom delta -5000 5000\n" + "BEGIN;\n" + "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" + "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n" + "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n" + "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n" + "END;\n" + }; + + /* --gaussian with -N case */ + static char *gaussian_simple_update = { + "\\set nbranches " CppAsString2(nbranches) " * :scale\n" + "\\set ntellers " CppAsString2(ntellers) " * :scale\n" + "\\set naccounts " CppAsString2(naccounts) " * :scale\n" + "\\setgaussian aid 1 
:naccounts :stdev_threshold\n" + "\\setrandom bid 1 :nbranches\n" + "\\setrandom tid 1 :ntellers\n" + "\\setrandom delta -5000 5000\n" + "BEGIN;\n" + "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" + "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n" + "END;\n" + }; + + /* --gaussian with -S case */ + static char *gaussian_select_only = { + "\\set naccounts " CppAsString2(naccounts) " * :scale\n" + "\\setgaussian aid 1 :naccounts :stdev_threshold\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" + }; + /* Function prototypes */ static void setalarm(int seconds); static void *threadRun(void *arg); + static inline double sqrtd(const double x); static void usage(void) *** *** 381,386 usage(void) --- 425,431 " -v, --vacuum-all vacuum all four standard tables before tests\n" " --aggregate-interval=NUM aggregate data over NUM seconds\n" " --sampling-rate=NUM fraction of transactions to log (e.g. 0.01 for 1%%)\n" + " --gaussian=NUM gaussian distribution with NUM standard deviation threshold\n" "\nCommon options:\n" " -d, --debug print debugging output\n" " -h, --host=HOSTNAME database server host or socket directory\n" *** *** 477,482 getrand(TState *thread, int64 min, int64 max) --- 522,597 return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state)); } + /* random number generator: gaussian distribution from min to max inclusive */ + static int64 + getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold) + { + double stdev; + double rand; + static double rand1; + static double rand2; + static double var_sqrt; + static bool reuse = false; + + /* + * Get user specified random number(-stdev_threshold < stdev <= stdev_threshold) + * in this loop. This loop is executed until appeared ranged number we want. 
+ * However, this loop could not almost go on, because min stdev_threshold is 2 + * then the possibility of retry-loop is under 4 percent. And possibility of + * re-retry-loop is under 1.6 percent. And it doesn't happen frequentry even if + * we also think about the cycle of the trigonometric function. + */ + do + { + /* reuse pre calculation result as possible */ + if(!reuse) + { + /* + * pg_erand48 generates [0,1) random number. However rand1 + * needs (0,1) random number because log(0) cannot calculate. + * And rand2 also needs (0,1) random number in strictly. But + * normalization cost is high and we can substitute (0,1] at + * rand1 and [0,1) at rand2, so we use approximate calculation. + */ + rand1 = 1.0
Re: [HACKERS] gaussian distribution pgbench
Hi Fabien, Thank you very much for your very detailed and useful comments! I have read them and agree with most of your advice :) The attached patch is fixed per your comments. The changes are: - Remove the redundant long option. - "--gaussian=NUM -S" and "--gaussian=NUM -N" now work. - Add sentences to the documentation. - Separate the uniform and gaussian random generation functions. - getGaussianrand() is created. - Fix the random number ranges more strictly, e.g. (0,1) or [0,1). - Please see the source code comments for details :) - Fix typos. - Use cos() and sin() when generating gaussian random numbers. - Add a fast sqrt calculation algorithm. - Reuse the sqrt result and the pre-generated random number to reduce calculation cost. - The measurements below show this method is a little faster than the non-reuse method, and the gaussian distribution is still good. * Settings shared_buffers = 1024MB * Test script pgbench -i -s 1 pgbench --gaussian=2 -T 30 -S -c8 -j4 -n pgbench --gaussian=2 -T 30 -S -c8 -j4 -n pgbench --gaussian=2 -T 30 -S -c8 -j4 -n * Result method | try1 | try2 | try3 | | reuse method | 44189 | 44453 | 44013 | non-reuse method | 43567 | 43635 | 43508 | (2014/02/09 21:32), Fabien COELHO wrote: This is a valuable contribution to enable pgbench to generate more realistic loads, which are seldom uniform in practice. Thanks! However, ISTM that other distributions such as an exponential one would make more sense. I can easily create an exponential distribution. Here I assume the exponential distribution f(x) = lambda * exp(-lambda * x) in its general form. What do you think of the following interface? custom script: \setexp [varname] min max threshold command: --exponential=NUM (threshold) I don't want to expose a lambda variable, to keep the implementation simple, so lambda is always 1; the distribution can be controlled well enough through the threshold alone. The threshold parameter is an f(x) value. The generated distribution is projected onto 'aid' by the same method.
If you think it's OK, I will implement the following tomorrow, and also create the parsing part of this function...

    do {
        rand = 1.0 - pg_erand48(thread->random_state);
        rand = -log(rand);
    } while (rand > exp_threshold);
    return rand / exp_threshold;

...and also the values should be further randomized so that neighboring values are not more likely to be drawn. The latter point is non trivial. That's right, but I worry that gaussian randomness and benchmark reproducibility might disappear if we re-randomize the access pattern, because Postgres storage manages records page by page, and it is difficult to realize access randomness across whole pages rather than individual records. To solve this, we would need a smart shuffling projection function that still preserves the gaussian distribution. I think that will be difficult, and it should be implemented in a separate patch in the future. * Mathematical soundness We want to derive a discrete normal distribution from a uniform one. Well, normal distributions are for continuous variables... Anyway, this is done by computing a continuous normal distribution which is then projected onto integers. I'm basically fine with that. The system uses a Box-Muller transform (1958) to do this transformation. The Ziggurat method seems to be preferred for this purpose, *but* it would require precalculated tables which depend on the target values. So I'm fine with the Box-Muller transform for pgbench. Yes, that's right. I selected a simple and relatively fast algorithm, the Box-Muller transform. The BM method uses 2 uniformly distributed numbers to derive 2 normally distributed numbers. The implementation computes one of these, and loops until one matches a threshold criterion. More explanations, at least in comments, are needed about this threshold and its meaning. It is required to be more than 2. My guess is that it allows limiting the number of iterations of the while loop,
Yes. This loop will almost never iterate, because the minimum stdev_threshold is 2, so the probability of a retry is under 4 percent. It should not be a problem. but in what proportion is unclear. The documentation also does not help the user understand this value and its meaning. Yes, it is a heuristic method, so I added comments about it in the documentation. What I think it is: it is the deviation for the FURTHEST point around the mean, that is the actual deviation associated with the "min" and "max" target values. The minimum value of 2 implies that there are at least 4 stddev lengths between min and max, with the mean most likely in the middle. Correct! If the threshold test fails, one of the 2 uniform numbers is redrawn and a new candidate value is tested. I'm not at ease about why only 1 value is redrawn and not both; some explanations would be welcome. Also, on the other hand,
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Hi Rajeev, (2014/01/29 17:31), Rajeev rastogi wrote: No issue, you can share the test cases with me, and I will take the performance report. The attached patch is updated for the latest pg_stat_statements. It includes min, max, and stddev statistics. Could you run a compile test in your Windows environment? I think the compile error has been fixed. We had a discussion in the community about which statistics are useful; I think both kinds have strong and weak points. For a statement executed only a few (2 or 3) times, stddev is close to meaningless because it cannot estimate the true deviation precisely, while min and max work well because they are measured values rather than estimates. On the other hand, for frequently executed statements the positions are reversed: stddev becomes very useful for characterizing the whole set of executions, while min and max may just capture extreme outliers. At the end of the day, these values complement each other when we want to understand the actual statements. And my past experience showed no performance problems with this patch. So I'd like to implement all of these values in pg_stat_statements.
Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql *** *** 19,24 CREATE FUNCTION pg_stat_statements(IN showtext boolean, --- 19,27 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 41,43 CREATE VIEW pg_stat_statements AS --- 44,51 SELECT * FROM pg_stat_statements(true); GRANT SELECT ON pg_stat_statements TO PUBLIC; + + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; *** a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql *** *** 9,14 RETURNS void --- 9,19 AS 'MODULE_PATHNAME' LANGUAGE C; + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; + CREATE FUNCTION pg_stat_statements(IN showtext boolean, OUT userid oid, OUT dbid oid, *** *** 16,21 CREATE FUNCTION pg_stat_statements(IN showtext boolean, --- 21,29 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 42,44 GRANT SELECT ON pg_stat_statements TO PUBLIC; --- 50,53 -- Don't want this to be available to non-superusers. REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC; + REVOKE ALL ON FUNCTION pg_stat_statements_reset_time() FROM PUBLIC; *** a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql --- b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql *** *** 4,8 --- 4,9 \echo Use "CREATE EXTENSION pg_stat_statements" to load this file. 
\quit ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset(); + ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset_time(); ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements(); ALTER EXTENSION pg_stat_statements ADD view pg_stat_statements; *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *** *** 106,111 static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; --- 106,113 #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ + #define EXEC_TIME_INIT_MIN DBL_MAX /* initial execution min time */ + #define EXEC_TIME_INIT_MAX -DBL_MAX /* initial execution max time */ #define JUMBLE_SIZE1024 /* query serialization buffer size */ *** *** 137,142 typedef struct Counters --- 139,147 { int64 calls; /* # of times executed */ double total_time; /* total execution time, in msec */ + double total_sqtime; /* cumulated square execution time, in msec */ + double min_time; /* maximum execution time, in msec */ + double max_time; /* minimum execution time, in msec */ int64 rows; /* total # of retrieved or affected rows */ int64 shared_blks_hit; /* # of shared buffer hits */ int64 shared_blks_read; /* # of shared disk blocks read */ *** *** 274,283 void _PG_init(void); --- 279,290 void _PG_fini(void); Datum pg_stat_statem
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/30 8:29), Tom Lane wrote:
> Andrew Dunstan writes:
>> I could live with just stddev. Not sure others would be so happy.
>
> FWIW, I'd vote for just stddev, on the basis that min/max appear to add more to the counter update time than stddev does; you've got this:
>
> + e->counters.total_sqtime += total_time * total_time;
>
> versus this:
>
> + if (e->counters.min_time > total_time || e->counters.min_time == EXEC_TIME_INIT)
> +     e->counters.min_time = total_time;
> + if (e->counters.max_time < total_time)
> +     e->counters.max_time = total_time;
>
> I think on most modern machines, a float multiply-and-add is pretty darn cheap; a branch that might or might not be taken, OTOH, is a performance bottleneck.
>
> Similarly, the shared memory footprint hit is more: two new doubles for min/max versus one for total_sqtime (assuming we're happy with the naive stddev calculation).
>
> If we felt that min/max were of similar value to stddev then this would be mere nitpicking. But since people seem to agree they're worth less, I'm thinking the cost/benefit ratio isn't there.

Why are you so worried about the cost in my patch? Do three extra double variables really consume that much memory, or add that much calculation cost? My test results showed that the supposed LWLock bottleneck is an urban legend. If you have a fairer test pattern, please tell me; I'll run it if the community will judge fairly. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/29 17:31), Rajeev rastogi wrote: On 28th January, Mitsumasa KONDO wrote: By the way, the latest pg_stat_statements might affect performance on Windows, because it uses an fflush() system call every time a new entry is created in pg_stat_statements, and it calls fread() many times to warm the file cache. It works well on Linux, but I'm not sure about Windows. If you have time, could you test it on your Windows system? If it affects performance a lot, we can still change it. No issue; you can share the test cases with me, and I will take a performance report. Thank you for your kindness! I posted another opinion on his patch, so please wait a while, so your test time isn't wasted. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/29 16:58), Peter Geoghegan wrote: On Tue, Jan 28, 2014 at 10:51 PM, Tom Lane wrote: KONDO Mitsumasa writes: By the way, latest pg_stat_statement might affect performance in Windows system. Because it uses fflush() system call every creating new entry in pg_stat_statements, and it calls many fread() to warm file cache. This statement doesn't seem to have much to do with the patch as committed. You could make a strong case for the patch improving performance in practice for many users, by allowing us to increase the pg_stat_statements.max default value to 5,000, 5 times the previous value. The real expense of creating a new entry is the exclusive locking of the hash table, which blocks *everything* in pg_stat_statements. That isn't expanded at all, since queries are only written with a shared lock, which only blocks the creation of new entries which was already relatively expensive. In particular, it does not block the maintenance of costs for all already observed entries in the hash table. It's obvious that simply reducing the pressure on the cache will improve matters a lot, which for many users the external texts patch does. Since Mitsumasa has compared that patch and at least one of his proposed pg_stat_statements patches on several occasions, I will re-iterate the difference: any increased overhead from the external texts patch is paid only when a query is first entered into the hash table. Then, there is obviously and self-evidently no additional overhead from contention for any future execution of the same query, no matter what, indefinitely (the exclusive locking section of creating a new entry does not do I/O, except in fantastically unlikely circumstances). So for many of the busy production systems I work with, that's the difference between a cost paid perhaps tens of thousands of times a second, and a cost paid every few days or weeks. 
Even if he is right about a write() taking an unreasonably large amount of time on Windows, the consequences felt as pg_stat_statements LWLock contention would be very limited. I am not opposed in principle to adding new things to the counters struct in pg_stat_statements. I just think that the fact that the overhead of installing the module on a busy production system is currently so low is of *major* value, and therefore any person that proposes to expand that struct should be required to very conclusively demonstrate that there is no appreciable increase in overhead. Having a standard deviation column would be nice, but it's still not that important. Maybe when we have portable atomic addition we can afford to be much more inclusive of that kind of thing.

I'd like to know the truth and the facts about your patch, rather than just arguments :-) So I created a detailed pg_stat_statements benchmark tool using pgbench. This tool generates a file of unique SQL statements, and it is only for measuring pg_stat_statements performance, because it only updates pg_stat_statements data and doesn't write to disk at all. It's a fair benchmark.

[usage]
perl create_sql.pl > file.sql
psql -h xxx.xxx.xxx.xxx mitsu-ko -c 'SELECT pg_stat_statements_reset()'
pgbench -h xxx.xxx.xxx.xxx mitsu-ko -c64 -j32 -n -T 180 -f file.sql

[part of sample sqls file, they are executed in pgbench]
SELECT 1-1/9+5*8*6+5+9-6-3-4/9%7-2%7/5-9/7+2+9/7-1%5%9/3-4/4-9/1+5+5/6/5%2+1*2*3-8/8+5-3-8-4/8+5%2*2-2-5%6+4
SELECT 1%9%7-8/5%6-1%2*2-7+9*6-2*6-9+1-2*9+6+7*8-4-2*1%3/7-1%4%9+4+7/5+4/2-3-5%8/3/7*6-1%8*6*1/7/2%5*6/2-3-9
SELECT 1%3*2/8%6%5-3%1+1/6*6*5/9-2*5+6/6*5-1/2-6%4%8/7%2*7%5%9%5/9%1%1-3-9/2*1+1*6%8/2*4/1+6*7*1+5%6+9*1-9*6
...

I tested several methods: the old pgss, the old pgss with my patch, and the new pgss. The test server and postgresql.conf are the same as in my earlier test in this ML thread.
The test methods and results are here:

[methods]
method 1: old pgss, pg_stat_statements.max=1000 (default)
method 2: old pgss with my patch, pg_stat_statements.max=1000 (default)
peter 1 : new pgss (Peter's patch), pg_stat_statements.max=5000 (default)
peter 2 : new pgss (Peter's patch), pg_stat_statements.max=1000
[for reference]
method 5: old pgss, pg_stat_statements.max=5000
method 6: old pgss with my patch, pg_stat_statements.max=5000

[results]
method   | try1  | try2  | try3  | performance degradation ratio
---------+-------+-------+-------+------------------------------
method 1 | 6.546 | 6.558 | 6.638 |  0% (reference score)
method 2 | 6.527 | 6.556 | 6.574 |  1%
peter 1  | 5.204 | 5.203 | 5.216 | 20%
peter 2  | 4.241 | 4.236 | 4.262 | 35%
method 5 | 5.931 | 5.848 | 5.872 | 11%
method 6 | 5.794 | 5.792 | 5.776 | 12%

The new pgss degrades performance far more than the old pgss, and I cannot see performance scaling. I understand that his argument was "My patch reduces time of LWLock in pg_stat_statements, it is good for performance. However, Kondo's patch will be
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/29 15:51), Tom Lane wrote: > KONDO Mitsumasa writes: >> By the way, the latest pg_stat_statements might affect performance on Windows, >> because it uses an fflush() system call every time a new entry is created in >> pg_stat_statements, and it calls fread() many times to warm the file cache. > This statement doesn't seem to have much to do with the patch as > committed. There are no fflush calls, and no notion of warming the > file cache either. Oh, all right. > We do assume that the OS is smart enough to keep > a frequently-read file in cache ... is Windows too stupid for that? I don't know about that, but I think the Windows cache behavior is poor: it seems to always write/read data to/from disk, despite having a lot of memory... I'd like to see test results on Windows, if we can get them... Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] drop duplicate buffers in OS
Hi, Attached is the latest patch. I changed PinBuffer() in bufmgr.c a little; it will evict the target clean file cache in the OS more precisely.

- if (!(buf->flags & BM_FADVED) && !(buf->flags & BM_JUST_DIRTIED))
+ if (!(buf->flags & BM_DIRTY) && !(buf->flags & BM_FADVED) && !(buf->flags & BM_JUST_DIRTIED))

(2014/01/29 8:20), Jeff Janes wrote: On Wed, Jan 15, 2014 at 10:34 AM, Robert Haas wrote: On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa wrote: > I created a patch that can drop duplicate buffers in the OS using a usage_count > algorithm. I have been developing this patch since last summer. This feature is > a hot topic at the moment, so I am submitting it earlier than I had scheduled. > > When usage_count is high in shared_buffers, buffers are hard to drop from > shared_buffers. However, those buffers aren't needed in the file cache, because > they aren't accessed there by postgres (postgres accesses shared_buffers). > So I created an algorithm that drops file cache pages which have a high usage_count > in shared_buffers and are in a clean state in the OS. If a file cache page is clean in > the OS, executing posix_fadvise DONTNEED frees it from the file cache without writing > to physical disk. This algorithm solves the double-buffering problem and > uses memory more efficiently. > > I am testing the DBT-2 benchmark now... Have you had any luck with it? I have reservations about this approach. Among other reasons, if the buffer is truly nailed in shared_buffers for the long term, the kernel won't see any activity on it and will be able to evict it fairly efficiently on its own. My patch aims to avoid evicting other useful file cache in the OS. If we don't drop the OS copies of pages that are already held in shared_buffers, the kernel will eventually evict them together with other, genuinely useful file cache. By dropping the duplicates as early as possible, the other useful file cache becomes less likely to be evicted.
So I'm reluctant to do a detailed review if the author cannot demonstrate a performance improvement. I'm going to mark it waiting-on-author for that reason. Will you review my patch? Thank you so much! However, my patch's performance is only a little better; the difference might be within the error range. The kernel readahead optimization patch is great. Too much readahead in the OS is harmful and fills the OS with useless file cache. Here is the test result. The plain result was measured earlier (in the readahead patch test).

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB (PC3-10600R-9)
Disk: 146GB (15k) * 4, RAID 1+0
RAID controller: P410i/256MB
OS: RHEL 6.4 (x86_64)
FS: Ext4

* DBT-2 result (WH400, SESSION=100, ideal_score=5160)
Method  | score | average | 90%tile | Maximum
--------+-------+---------+---------+---------
plain   | 3589  | 9.751   | 33.680  |  87.8036
patched | 3799  | 9.914   | 22.451  | 119.4259

* Main settings
shared_buffers = 2458MB
drop_duplicate_buffers = 5 // patched only

I also ran the benchmark with drop_duplicate_buffers = 3 and 4, but I didn't get good results, so I will test with larger shared_buffers and those settings.

[detail settings] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/dbserver/param.out

* Detail results (uploading now; please wait for an hour...)
[plain] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/index_thput.html
[patched] http://pgstatsinfo.projects.pgfoundry.org/drop_os_cache/drop_dupulicate_cache20140129/HTML/index_thput.html

We can see faster response times during OS writeback (probably) and while executing CHECKPOINT, because when these happen, read transactions hit the file cache more often with my patch, so response times are better than plain.
Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1107,1112 include 'filename' --- 1107,1130 + + drop-duplicate-buffers (integer) + +drop-duplicate-buffers configuration parameter + + + + Sets the minimum usage count in shared_buffers at which a buffer's copy is dropped from the OS file cache. + When you use this parameter, set shared_buffers large (about 50% of memory + or higher), and set this parameter to 4 or 5. It needs to be set carefully: if you + use this parameter with small shared_buffers, transaction performance will + decline. This parameter requires the posix_fadvise() system call. + If your system doesn't support posix_fadvise(), it doesn't work at all. + + + + + temp_buffers (integer) *** a/src/backend/storage
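The mechanism the patch relies on, dropping a clean page range from the OS cache via posix_fadvise(POSIX_FADV_DONTNEED), can be sketched standalone (Linux; the function name and parameters here are illustrative, not from the patch):

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: ask the kernel to drop its page cache for a byte range of a file.
 * For a clean (non-dirty) range this frees the cache without any disk
 * write -- the property the patch exploits for pages that are already
 * pinned with a high usage_count in shared_buffers.  len == 0 means
 * "to end of file".  Returns 0 on success. */
static int
drop_file_cache(const char *path, off_t offset, off_t len)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    close(fd);
    return rc;
}
```

Note that POSIX_FADV_DONTNEED is only a hint: the kernel may decline to drop dirty pages, which is why the patch's PinBuffer() change also checks BM_DIRTY before issuing the advice.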
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/28 15:17), Rajeev rastogi wrote: On 27th January, Mitsumasa KONDO wrote: 2014-01-26 Simon Riggs: On 21 January 2014 19:48, Simon Riggs wrote: > On 21 January 2014 12:54, KONDO Mitsumasa wrote: >> Rebased patch is attached. > > Does this fix the Windows bug reported by Kumar on 20/11/2013 ? Sorry, I was misunderstanding: the first name of Mr. Rajeev Rastogi is Kumar! I searched only by e-mail address and title for his name... I don't have a Windows compiler environment, but the attached patch might fix it. Could I ask Mr. Rajeev Rastogi to test my patch again? I tried to test this but I could not apply the patch on the latest git HEAD. This may be because of a recent patch (related to pg_stat_statements: "pg_stat_statements external query text storage"), which got committed on 27th January. Thank you for trying to test my patch. As you say, the recent commit changes pg_stat_statements.c a lot, so I have to revise my patch. Please wait a while. By the way, the latest pg_stat_statements might affect performance on Windows, because it uses an fflush() system call every time a new entry is created in pg_stat_statements, and it calls fread() many times to warm the file cache. It works well on Linux, but I'm not sure about Windows. If you have time, could you test it on your Windows system? If it affects performance a lot, we can still change it. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/26 17:43), Mitsumasa KONDO wrote: > 2014-01-26 Simon Riggs mailto:si...@2ndquadrant.com>> > > On 21 January 2014 19:48, Simon Riggs <mailto:si...@2ndquadrant.com>> wrote: > > On 21 January 2014 12:54, KONDO Mitsumasa > <mailto:kondo.mitsum...@lab.ntt.co.jp>> wrote: > >> Rebased patch is attached. > > > > Does this fix the Windows bug reported by Kumar on 20/11/2013 ? Sorry, I was misunderstanding. First name of Mr. Rajeev Rastogi is Kumar! I searched only e-mail address and title by his name... I don't have windows compiler enviroment, but attached patch might be fixed. Could I ask Mr. Rajeev Rastogi to test my patch again? Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql *** *** 19,24 CREATE FUNCTION pg_stat_statements( --- 19,27 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 41,43 CREATE VIEW pg_stat_statements AS --- 44,52 SELECT * FROM pg_stat_statements(); GRANT SELECT ON pg_stat_statements TO PUBLIC; + + /* New Function */ + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; *** a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql *** *** 9,14 RETURNS void --- 9,19 AS 'MODULE_PATHNAME' LANGUAGE C; + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; + CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, *** *** 16,21 CREATE FUNCTION pg_stat_statements( --- 21,29 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 42,44 GRANT SELECT ON 
pg_stat_statements TO PUBLIC; --- 50,53 -- Don't want this to be available to non-superusers. REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC; + REVOKE ALL ON FUNCTION pg_stat_statements_reset_time() FROM PUBLIC; *** a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql --- b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql *** *** 4,8 --- 4,9 \echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset(); + ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset_time(); ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements(); ALTER EXTENSION pg_stat_statements ADD view pg_stat_statements; *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *** *** 78,84 static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ ! #define JUMBLE_SIZE1024 /* query serialization buffer size */ /* --- 78,85 #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ ! #define EXEC_TIME_INIT_MIN DBL_MAX /* initial execution min time */ ! 
#define EXEC_TIME_INIT_MAX -DBL_MAX /* initial execution max time */ #define JUMBLE_SIZE 1024 /* query serialization buffer size */ /* *** *** 114,119 typedef struct Counters --- 115,123 { int64 calls; /* # of times executed */ double total_time; /* total execution time, in msec */ + double total_sqtime; /* cumulative squared execution time, in msec */ + double min_time; /* minimum execution time, in msec */ + double max_time; /* maximum execution time, in msec */ int64 rows; /* total # of retrieved or affected rows */ int64 shared_blks_hit; /* # of shared buffer hits */ int64 shared_blks_read; /* # of shared disk blocks read */ *** *** 237,245 void _PG_init(void); --- 241,251 void _PG_fini(void); Datum pg_stat_statements_reset(PG_FUNCTION_ARGS); + Datum pg_stat_statements_reset_time(PG_FUNCTION_ARGS); Datum pg_stat_statements(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(pg_stat_
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/23 23:18), Andrew Dunstan wrote: What is more, if the square root calculation is affecting your benchmarks, I suspect you are benchmarking the wrong thing. I ran another test with two pgbench clients at the same time: one runs the select-only query, and the other executes a 'SELECT * FROM pg_stat_statements' query every second. I used the v6 patch in this test.

* Benchmark commands
$bin/pgbench -h xxx.xxx.xxx.xxx mitsu-ko -c64 -j32 -S -T 180 -n &
$bin/pgbench -h xxx.xxx.xxx.xxx mitsu-ko -T 180 -n -f file.sql

** file.sql
SELECT * FROM pg_stat_statements;
\sleep 1s

* Select-only-query result (results are in tps)
method            | try1   | try2   | try3
------------------+--------+--------+--------
with pgss         | 125502 | 125818 | 125809
with patched pgss | 125909 | 125699 | 126040

This result shows my patch has almost the same performance as before. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/23 10:28), Peter Geoghegan wrote: On Wed, Jan 22, 2014 at 5:28 PM, KONDO Mitsumasa wrote: Oh, thanks for informing me. I think the essential problem is that my patch has a bottleneck in the sqrt() function and other division calculations. Well, that's a pretty easy theory to test. Just stop calling them (and do something similar to what we do for current counter fields instead) and see how much difference it makes. What do you mean by "calling them"? I think the heavy part you have in mind is the pg_stat_statements view, which is called by a SELECT query, not the LWLock-protected part that gathers statistics in the hook. Right? I tested my patch with pgbench, but I cannot find a bottleneck in my latest patch. (Sorry, I haven't tested the SELECT query on the pg_stat_statements view yet...) Here is the test result.

* Result (results are in tps)
method            | try1   | try2   | try3
------------------+--------+--------+--------
without pgss      | 130938 | 131558 | 131796
with pgss         | 125164 | 125146 | 125358
with patched pgss | 126091 | 126681 | 126433

* Test settings
shared_buffers = 1024MB
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7

* pgbench script
pgbench -h xxx.xxx.xxx.xxx mitsu-ko -i -s 100
psql -h xxx.xxx.xxx.xxx mitsu-ko -c 'CHECKPOINT'
pgbench -h xxx.xxx.xxx.xxx mitsu-ko -c64 -j32 -S -T 180
pgbench -h xxx.xxx.xxx.xxx mitsu-ko -c64 -j32 -S -T 180
pgbench -h xxx.xxx.xxx.xxx mitsu-ko -c64 -j32 -S -T 180

* Server SPEC
CPU: Xeon E5-2670 1P/8C 2.6GHz  # We don't have a 32-core CPU...
Memory: 24GB
RAID: i420 2GB cache
Disk: 15K * 6 (RAID 1+0)

Attached is the latest development patch. It hasn't been tested much yet, but the sqrt calculation may be faster.
Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql *** *** 19,24 CREATE FUNCTION pg_stat_statements( --- 19,27 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 41,43 CREATE VIEW pg_stat_statements AS --- 44,52 SELECT * FROM pg_stat_statements(); GRANT SELECT ON pg_stat_statements TO PUBLIC; + + /* New Function */ + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; *** a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql *** *** 9,14 RETURNS void --- 9,19 AS 'MODULE_PATHNAME' LANGUAGE C; + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; + CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, *** *** 16,21 CREATE FUNCTION pg_stat_statements( --- 21,29 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 42,44 GRANT SELECT ON pg_stat_statements TO PUBLIC; --- 50,53 -- Don't want this to be available to non-superusers. REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC; + REVOKE ALL ON FUNCTION pg_stat_statements_reset_time() FROM PUBLIC; *** a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql --- b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql *** *** 4,8 --- 4,9 \echo Use "CREATE EXTENSION pg_stat_statements" to load this file. 
\quit ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset(); + ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset_time(); ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements(); ALTER EXTENSION pg_stat_statements ADD view pg_stat_statements; *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *** *** 78,84 static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ ! #define JUMBLE_SIZE1024 /* query serialization buffer size */ /* --- 78,85 #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ ! #define EXEC_TIME_INIT_MIN DBL_MAX /* initial execution min time */ ! #define EXEC_
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/23 12:00), Andrew Dunstan wrote: On 01/22/2014 08:28 PM, KONDO Mitsumasa wrote: (2014/01/22 22:26), Robert Haas wrote: On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa wrote: OK, Kondo, please demonstrate benchmarks that show we have <1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the benchmark's error range might be higher than 1%, so I will show you the detailed HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with >32 cores, I think, and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. Oh, thanks for informing me. I think the essential problem is that my patch has a bottleneck in the sqrt() function and other division calculations. I will replace the sqrt() function from math.h with a faster algorithm, and move unnecessary parts of the calculation out of the LWLock (and other lock) sections. The improvement might take some time, so please wait a while. Umm, I have not read the patch, but are you not using Welford's method? Its per-statement overhead should be absolutely tiny (and should not compute a square root at all per statement - the square root should only be computed when the standard deviation is actually wanted, e.g. when a user examines pg_stat_statements) See for example <http://www.johndcook.com/standard_deviation.html> Thanks for your advice. I read your example roughly; however, I think calculating the variance is not the heavy part of my patch. The double-precision sqrt calculation is the heaviest part, in my mind. And I found a fast square root algorithm that is used in 3D games. http://en.wikipedia.org/wiki/Fast_inverse_square_root This page describes an inverse square root algorithm, but it can also calculate a normal square root, and it is faster than the general algorithm at the price of some precision.
I think we want a fast algorithm, so it will be suitable. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
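The fast inverse square root mentioned above (the well-known Quake III trick) can be sketched as follows; sqrt(x) is then obtained as x * rsqrt(x). This is a generic illustration of the Wikipedia algorithm, not code from the patch; accuracy after one Newton-Raphson iteration is roughly 0.2%.

```c
#include <assert.h>
#include <stdint.h>

/* Fast inverse square root: a bit-level initial estimate refined by one
 * Newton-Raphson step.  Trades precision for speed, as noted in the mail. */
static float
fast_rsqrt(float x)
{
    union { float f; uint32_t i; } u = { .f = x };

    u.i = 0x5f3759df - (u.i >> 1);          /* magic initial estimate */
    u.f *= 1.5f - (0.5f * x * u.f * u.f);   /* one Newton-Raphson iteration */
    return u.f;
}

/* sqrt(x) ~= x * rsqrt(x), avoiding any division. */
static float
fast_sqrt(float x)
{
    return x * fast_rsqrt(x);
}
```

Whether this is actually faster than a hardware sqrt instruction on modern CPUs is doubtful (sqrtsd is cheap on recent x86), which is part of why the per-read sqrt was never really the bottleneck.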
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/22 22:26), Robert Haas wrote: On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa wrote: OK, Kondo, please demonstrate benchmarks that show we have <1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the benchmark's error range might be higher than 1%, so I will show you the detailed HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with >32 cores, I think, and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. Oh, thanks for informing me. I think the essential problem is that my patch has a bottleneck in the sqrt() function and other division calculations. I will replace the sqrt() function from math.h with a faster algorithm, and move unnecessary parts of the calculation out of the LWLock (and other lock) sections. The improvement might take some time, so please wait a while. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/22 9:34), Simon Riggs wrote: AFAICS, all that has happened is that people have given their opinions and we've got almost the identical patch, with a rush-rush comment to commit even though we've waited months. If you submit a patch, then you need to listen to feedback and be clear about what you will do next; if you don't, people will learn to ignore you, and nobody wants that. I think I already replied that it would be heavy. If we implement a histogram in pg_stat_statements, we have to add double-precision arrays for storing the histogram data, and when we update the histogram for each statement, we must search the arrays for where the response time falls. That is a very big cost: it needs a lot of memory, and updating is heavier than the benefit we would get from it. So I just added stddev, to keep it as fast as the latest pg_stat_statements. I got agreement from some people, as you say. On 21 January 2014 21:19, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 11:48 AM, Simon Riggs wrote: I agree with people saying that stddev is better than nothing at all, so I am inclined to commit this, in spite of the above. I could live with stddev. But we really ought to be investing in making pg_stat_statements work well with third-party tools. I am very wary of enlarging the counters structure, because it is protected by a spinlock. There has been no attempt to quantify that cost, nor has anyone even theorized that it is not likely to be appreciable. OK, Kondo, please demonstrate benchmarks that show we have <1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the benchmark's error range might be higher than 1%, so I will show you the detailed HTML results. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Rebased patch is attached. pg_stat_statements in PG9.4dev has already changed its table columns, so I hope this patch will be committed in PG9.4dev. Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql *** *** 19,24 CREATE FUNCTION pg_stat_statements( --- 19,27 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 41,43 CREATE VIEW pg_stat_statements AS --- 44,52 SELECT * FROM pg_stat_statements(); GRANT SELECT ON pg_stat_statements TO PUBLIC; + + /* New Function */ + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; *** a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql *** *** 9,14 RETURNS void --- 9,19 AS 'MODULE_PATHNAME' LANGUAGE C; + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; + CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, *** *** 16,21 CREATE FUNCTION pg_stat_statements( --- 21,29 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 42,44 GRANT SELECT ON pg_stat_statements TO PUBLIC; --- 50,53 -- Don't want this to be available to non-superusers. REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC; + REVOKE ALL ON FUNCTION pg_stat_statements_reset_time() FROM PUBLIC; *** a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql --- b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql *** *** 4,8 --- 4,9 \echo Use "CREATE EXTENSION pg_stat_statements" to load this file.
\quit ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset(); + ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset_time(); ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements(); ALTER EXTENSION pg_stat_statements ADD view pg_stat_statements; *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *** *** 78,83 static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; --- 78,84 #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ + #define EXEC_TIME_INIT (-1) /* initial execution time */ #define JUMBLE_SIZE 1024 /* query serialization buffer size */ *** *** 114,119 typedef struct Counters --- 115,123 { int64 calls; /* # of times executed */ double total_time; /* total execution time, in msec */ + double total_sqtime; /* cumulative squared execution time, in msec */ + double min_time; /* minimum execution time, in msec */ + double max_time; /* maximum execution time, in msec */ int64 rows; /* total # of retrieved or affected rows */ int64 shared_blks_hit; /* # of shared buffer hits */ int64 shared_blks_read; /* # of shared disk blocks read */ *** *** 237,245 void _PG_init(void); --- 241,251 void _PG_fini(void); Datum pg_stat_statements_reset(PG_FUNCTION_ARGS); + Datum pg_stat_statements_reset_time(PG_FUNCTION_ARGS); Datum pg_stat_statements(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(pg_stat_statements_reset); + PG_FUNCTION_INFO_V1(pg_stat_statements_reset_time); PG_FUNCTION_INFO_V1(pg_stat_statements); static void pgss_shmem_startup(void); *** *** 266,271 static pgssEntry *entry_alloc(pgssHashKey *key, const char *query, --- 272,278 int query_len, bool sticky); static void entry_dealloc(void); static void entry_reset(void); + static void entry_reset_time(void); static void AppendJumble(pgssJumbleState
*jstate, const unsigned char *item, Size size); static void JumbleQuery(pgssJumbleState *jstate, Query *query); *** *** 1046,1051 pgss_store(const char *query, uint32 queryId, --- 1053,1059 e->counters.calls += 1; e->counters.total_time += total_time; + e->counters.total_sqtime += total_time * total_time; e->counters.rows += rows; e->counters.shared_blks_hit += bufusage->shared_blks_hit; e->counters.shared_blks_read += bufusage->shared_blks_read; *** *** 1061,1066 pgss_store(const char *query, u
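The stdev_time column the patch adds can be derived at read time from the running sums it maintains (calls, total_time, total_sqtime), with no per-sample storage. A minimal sketch of that arithmetic, with illustrative names rather than the patch's actual code:

```c
/* Accumulators mirroring the Counters fields the patch adds
 * (calls, total_time, total_sqtime); names here are illustrative. */
typedef struct {
    long long calls;
    double total_time;   /* sum of execution times, in msec */
    double total_sqtime; /* sum of squared execution times */
} stat_acc;

static void acc_add(stat_acc *a, double t)
{
    a->calls++;
    a->total_time += t;
    a->total_sqtime += t * t;
}

/* Population variance via E[X^2] - E[X]^2; stdev_time is simply the
 * square root of this. Clamp at 0 to absorb floating-point rounding. */
static double acc_variance(const stat_acc *a)
{
    double mean, var;
    if (a->calls == 0)
        return 0.0;
    mean = a->total_time / a->calls;
    var = a->total_sqtime / a->calls - mean * mean;
    return var < 0.0 ? 0.0 : var;
}
```

Keeping only these two sums in shared memory is what makes the statistic cheap to update on every call; the square root is taken once, when the view is read.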
Re: [HACKERS] drop duplicate buffers in OS
(2014/01/16 3:34), Robert Haas wrote: On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa wrote: I created a patch that can drop duplicate buffers in the OS cache using a usage_count algorithm. I have been developing this patch since last summer. This feature seems to be a hot topic at the moment, so I am submitting it earlier than I had planned. When usage_count is high in shared_buffers, those buffers are hard to evict from shared_buffers, yet the same buffers aren't required in the OS file cache, because they are never accessed there (Postgres accesses shared_buffers instead). So I created an algorithm that drops file cache pages which have a high usage_count in shared_buffers and are in a clean state in the OS. If a file cache page is clean in the OS and we execute posix_fadvise(DONTNEED) on it, it can be freed from the file cache without writing to physical disk. This algorithm solves the double-buffering problem and can use memory more efficiently. I am testing the DBT-2 benchmark now... The thing about this is that our usage counts for shared_buffers don't really work right now; it's common for everything, or nearly everything, to have a usage count of 5. So I'm reluctant to rely on that for much of anything. This patch aims at large shared_buffers settings, so a configuration where shared_buffers is only 10% of memory might not benefit. The patch is experimental and shows one example of how to solve the double-buffering problem. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] drop duplicate buffers in OS
(2014/01/16 21:38), Aidan Van Dyk wrote: Can we just get the backend that dirties the page to do the posix_fadvise DONTNEED? No; posix_fadvise(DONTNEED) can only cheaply remove clean pages from the OS file cache. If the page is dirty, it causes a physical disk write. However, this is an experimental patch, so the behavior might change based on future benchmark testing. Or have another helper that sweeps the shared buffers and does this post-first-dirty? We could add a DropDuplicateOSCache() function to the checkpointer process or another process. We could also change posix_fadvise(DONTNEED) to sync_file_range(), which causes a physical disk write of the target range without freeing the OS file cache. I'm considering sync_file_range() with SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE while executing a checkpoint; it could avoid the fsync-freeze situation in the final part of the checkpoint. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
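The sync_file_range()-then-DONTNEED sequence discussed above can be sketched as follows. This only illustrates the system-call ordering, not the patch itself: the real code operates on PostgreSQL buffer descriptors, sync_file_range() is Linux-specific, and the fdatasync() fallback branch is my assumption for portability.

```c
#define _GNU_SOURCE   /* for sync_file_range() on Linux */
#include <fcntl.h>
#include <unistd.h>

/* Write back [offset, offset+len) and then ask the kernel to drop the
 * now-clean pages. Because the range is clean after the write-back,
 * POSIX_FADV_DONTNEED can evict it without further disk I/O. */
static int writeback_and_drop(int fd, off_t offset, off_t len)
{
#ifdef SYNC_FILE_RANGE_WRITE
    /* Force the dirty range to disk without a full fsync. */
    if (sync_file_range(fd, offset, len,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        return -1;
#else
    if (fdatasync(fd) != 0)          /* portable fallback (assumption) */
        return -1;
#endif
    /* Pages are clean now, so DONTNEED frees them without writing. */
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```

Calling DONTNEED on a still-dirty range would instead trigger write-back, which is exactly the case the mail above wants to avoid.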
[HACKERS] drop duplicate buffers in OS
Hi, I created a patch that can drop duplicate buffers in the OS cache using a usage_count algorithm. I have been developing this patch since last summer. This feature seems to be a hot topic at the moment, so I am submitting it earlier than I had planned. When usage_count is high in shared_buffers, those buffers are hard to evict from shared_buffers, yet the same buffers aren't required in the OS file cache, because they are never accessed there (Postgres accesses shared_buffers instead). So I created an algorithm that drops file cache pages which have a high usage_count in shared_buffers and are in a clean state in the OS. If a file cache page is clean in the OS and we execute posix_fadvise(DONTNEED) on it, it can be freed from the file cache without writing to physical disk. This algorithm solves the double-buffering problem and can use memory more efficiently. I am testing the DBT-2 benchmark now... Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1107,1112 include 'filename' --- 1107,1130 + + drop-duplicate-buffers (integer) + +drop-duplicate-buffers configuration parameter + + + + Sets the minimum usage count in shared_buffers at which a buffer's duplicate copy is dropped from the OS file cache. + When you use this parameter, set shared_buffers to about 50% of memory + or higher, and set this parameter to 4 or 5. It needs to be set carefully. If you + use this parameter with a small shared_buffers, transaction performance will + decline. This parameter requires the posix_fadvise() system call. + If your system doesn't support posix_fadvise(), it has no effect.
+ + + + + temp_buffers (integer) *** a/src/backend/storage/buffer/bufmgr.c --- b/src/backend/storage/buffer/bufmgr.c *** *** 69,74 --- 69,75 /* GUC variables */ bool zero_damaged_pages = false; int bgwriter_lru_maxpages = 100; + int DropDuplicateBuffers = -1; double bgwriter_lru_multiplier = 2.0; bool track_io_timing = false; *** *** 111,116 static volatile BufferDesc *BufferAlloc(SMgrRelation smgr, --- 112,118 static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln); static void AtProcExit_Buffers(int code, Datum arg); static int rnode_comparator(const void *p1, const void *p2); + static void DropDuplicateOSCache(volatile BufferDesc *bufHdr); /* *** *** 835,841 BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, * 1 so that the buffer can survive one clock-sweep pass.) */ buf->tag = newTag; ! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT); if (relpersistence == RELPERSISTENCE_PERMANENT) buf->flags |= BM_TAG_VALID | BM_PERMANENT; else --- 837,843 * 1 so that the buffer can survive one clock-sweep pass.) */ buf->tag = newTag; ! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT | BM_FADVED); if (relpersistence == RELPERSISTENCE_PERMANENT) buf->flags |= BM_TAG_VALID | BM_PERMANENT; else *** *** 1101,1107 PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy) --- 1103,1120 if (strategy == NULL) { if (buf->usage_count < BM_MAX_USAGE_COUNT) + { buf->usage_count++; + + /* + * If the buffer is clean, we can remove duplicate buffers + * from the file cache in OS without physical disk writing, + * which is for more efficient whole memory using. It is needed + * to execute at once per buffers. 
+ */ + if (!(buf->flags & BM_FADVED) && !(buf->flags & BM_JUST_DIRTIED)) + DropDuplicateOSCache(buf); + } } else { *** *** 1953,1958 FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln) --- 1966,1974 */ recptr = BufferGetLSN(buf); + /* mark that this buffer's copy in the OS file cache is no longer clean */ + buf->flags |= BM_FADVED; + /* To check if block content changes while flushing. - vadim 01/17/97 */ buf->flags &= ~BM_JUST_DIRTIED; UnlockBufHdr(buf); *** *** 2101,2106 BufferGetLSNAtomic(Buffer buffer) --- 2117,2144 return lsn; } + /* + * Drop the duplicate copy of a buffer that is cached both in the OS and in + * shared_buffers, for more efficient use of file cache space and easier + * dirty file cache management. + */ + static void + DropDuplicateOSCache(volatile BufferDesc *bufHdr) + { + SMgrRelation reln; + MdfdVec *v; + off_t seekpos; + + /* Drop the OS cache copy of buffers whose usage_count in shared_buffers is high */ + if (DropDuplicateBuffers != -1 && bufHdr->usage_count >= DropDuplicateBuffers) + { + reln = smgropen(bufHdr->tag.rnode, InvalidBackendId); + v = _mdfd_getseg(reln, bufHdr->tag.forkNum, bufHdr->tag.blockNum, false, 0); + seekpos = (off_t) BLC
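The conditions scattered through the hunks above (the GUC guard, the once-per-buffer BM_FADVED flag, and the clean-page requirement) can be summarized as one predicate. A behavioral sketch with illustrative flag values, not the patch's code:

```c
/* Gate for dropping the OS copy of a shared buffer, mirroring the checks
 * in PinBuffer()/DropDuplicateOSCache(). Flag bit values are illustrative. */
#define BM_JUST_DIRTIED (1 << 0)
#define BM_FADVED       (1 << 1)

static int should_drop_os_copy(int drop_duplicate_buffers, /* GUC; -1 = off */
                               int usage_count, unsigned flags)
{
    if (drop_duplicate_buffers == -1)
        return 0;                     /* feature disabled */
    if (flags & BM_FADVED)
        return 0;                     /* already advised once for this buffer */
    if (flags & BM_JUST_DIRTIED)
        return 0;                     /* OS copy may be dirty; DONTNEED would write */
    return usage_count >= drop_duplicate_buffers;
}
```

The BM_FADVED bit is what makes the advice a once-per-buffer action; FlushBuffer() sets it again when a write makes the OS copy dirty.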
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
Hi, I fixed this patch and am submitting it to CF4. My past patch had a significant bug: a mistaken calculation of the offset passed to posix_fadvise() :-( It nevertheless worked without problems in pgbench, because pgbench transactions are always random access... And I tested my patch with the DBT-2 benchmark. The results are as follows.

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB (PC3-10600R-9)
Disk: 146GB (15k) * 4, RAID 1+0
RAID controller: P410i/256MB
OS: RHEL 6.4 (x86_64)
FS: Ext4

* DBT-2 result (WH400, SESSION=100, ideal_score=5160)
Method     | score | average | 90%tile | Maximum
-----------+-------+---------+---------+---------
plain      |  3589 |   9.751 |  33.680 |  87.8036
option=off |  3670 |   9.107 |  34.267 |  79.3773
option=on  |  4222 |   5.140 |   7.619 | 102.473

"option" is the "readahead_strategy" option, and "on" is my proposal. "average", "90%tile", and "Maximum" are latencies. Average latency is 2 times better than with plain!

* Detailed results (uploading now; please wait an hour...)
[plain] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/index_thput.html
[option=off] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_off_20140109/HTML/index_thput.html
[option=on] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_on_20140109/HTML/index_thput.html

We can see some very slow latencies in the test of my proposed method. Active transactions are 20% of the total, and slow transactions are 80%. It might be the Pareto principle in CHECKPOINT ;-) #It's a joke. I will test the performance of some join SQLs and the TPC-3 benchmark this week or next. Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/configure --- b/configure *** *** 11303,11309 fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` !
for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" --- 11303,11309 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" *** a/contrib/pg_prewarm/pg_prewarm.c --- b/contrib/pg_prewarm/pg_prewarm.c *** *** 179,185 pg_prewarm(PG_FUNCTION_ARGS) */ for (block = first_block; block <= last_block; ++block) { ! smgrread(rel->rd_smgr, forkNumber, block, blockbuffer); ++blocks_done; } } --- 179,185 */ for (block = first_block; block <= last_block; ++block) { ! smgrread(rel->rd_smgr, forkNumber, block, blockbuffer, BAS_BULKREAD); ++blocks_done; } } *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1293,1298 include 'filename' --- 1293,1322 + + readahead_strategy (integer) + +readahead_strategyconfiguration parameter + + + + This feature is to select which readahead strategy is used. When we + set off(default), readahead strategy is optimized by OS. On the other + hands, when we set on, readahead strategy is optimized by Postgres. + In typicaly situations, OS readahead strategy will be good working, + however Postgres often knows better readahead strategy before + executing disk access. For example, we can easy to predict access + pattern when we input SQLs, because planner of postgres decides + efficient access pattern to read faster. 
And it might be random access + pattern or sequential access pattern. It will be less disk IO and more + efficient to use file cache in OS. It will be better performance. + However this optimization is not complete now, so it is necessary to + choose it carefully in considering situations. Default setting is off + that is optimized by OS, and whenever it can change it. + + + + *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** *** 9125,9131 copy_relation_data(SMgrRelation src, SMgrRelation dst, /* If we got a cancel signal during the copy of the data, quit */ CHECK_FOR_INTERRUPTS(); ! smgrread(src, forkNum, blkno, buf); if (!PageIsVerified(page, blkno)) ereport(ERROR, --- 9125,9131 /* If we got a cancel signal during th
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/12/17 21:29), Simon Riggs wrote: These are interesting results. Good research. Thanks! They also show that the benefit of this is very specific to the exact task being performed. I can't see any future for a setting that applies to everything or nothing. We must be more selective. This patch still needs some human judgement about whether readahead should be on or off, but it might already be useful for clever users. However, I'd like to implement at least the minimum additional optimization. We also need much better benchmark results, clearly laid out, so they can be reproduced and discussed. I think this feature is a big benefit for OLTP, and it might already be useful for BI. BI queries are mostly complicated, so we will need to test more situations. The printf debugging is very useful for debugging my patch, and it will accelerate the optimization. Please keep working on this. OK, I will keep at it patiently. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] pg_rewarm status
(2013/12/18 5:33), Robert Haas wrote: Sounds like it might be worth dusting the patch off again... I'd like to request that you add an all_index option and a usage_count option. When the all_index option is selected, all indexes are warmed even though the user doesn't give a relation name. And the usage_count option sets the usage_count in shared_buffers, so useful buffers will remain longer and not be evicted easily. I think these are easy to implement and useful, so please add them if you like. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
Hi, I fixed the patch to improve the following points.
- It can now compile on MacOS.
- The GUC name is changed from enable_kernel_readahead to readahead_strategy.
- POSIX_FADV_SEQUENTIAL is changed to POSIX_FADV_NORMAL when the sequential access strategy is selected; the reason is explained below...
I tested two simple access patterns on pgbench tables at scale factor 1000.
A) SELECT count(aid) FROM pgbench_accounts; (Index Only Scan)
B) SELECT count(bid) FROM pgbench_accounts; (Seq Scan)
Before each test I restarted postgres and dropped the OS file cache. Unpatched PG was faster than patched PG for both query A and query B, by about 1.3 times. The result for query A was as expected, because patched PG cannot execute readahead at all, so a cache-cold situation is bad for patched PG. It might still be good in a cache-hot situation, because then nothing is read from disk at all, and the OS can account for file cache usage and know which cache is important. However, the result for query B was unexpected: my patch selects POSIX_FADV_SEQUENTIAL correctly, but it was slow. I cannot understand why, even after reading the kernel source code... Next, I changed POSIX_FADV_SEQUENTIAL to POSIX_FADV_NORMAL in my patch, and query B became as fast as unpatched PG. In heavily random access benchmark tests, namely pgbench and DBT-2, my patched PG is about 1.1 - 1.3 times faster than unpatched PG. But the postgres buffer hint strategy algorithm has not been optimized for readahead strategy yet, and I have not fixed that; it still only covers the ring buffer algorithm in shared_buffers. The attached printf-debug patch will show you which buffer strategy postgres selects internally: when you see "S" it selects the sequential access strategy, and when you see "R" it selects the random access strategy. It might be interesting for you; it's very visual. Example output is here.
[mitsu-ko@localhost postgresql]$ bin/vacuumdb
SSS~~SS
[mitsu-ko@localhost postgresql]$ bin/psql -c "EXPLAIN ANALYZE SELECT count(aid) FROM pgbench_accounts"
QUERY PLAN
-
Aggregate (cost=2854.29..2854.30 rows=1 width=4) (actual time=33.438..33.438 rows=1 loops=1)
  -> Index Only Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.29..2604.29 rows=10 width=4) (actual time=0.072..20.912 rows=10 loops=1)
     Heap Fetches: 0
Total runtime: 33.552 ms
(4 rows)
RRR~~RR
[mitsu-ko@localhost postgresql]$ bin/psql -c "EXPLAIN ANALYZE SELECT count(bid) FROM pgbench_accounts"
SSS~~SS
--
Aggregate (cost=2890.00..2890.01 rows=1 width=4) (actual time=40.315..40.315 rows=1 loops=1)
  -> Seq Scan on pgbench_accounts (cost=0.00..2640.00 rows=10 width=4) (actual time=0.112..23.001 rows=10 loops=1)
Total runtime: 40.472 ms
(3 rows)
That's all for now. Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/configure --- b/configure *** *** 19937,19943 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5 --- 19937,19943 ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5 *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1252,1257 include 'filename' --- 1252,1281 + + readahead_strategy (integer) + +readahead_strategyconfiguration parameter + + + + This feature is to select which readahead strategy is used.
When we + set off(default), readahead strategy is optimized by OS. On the other + hands, when we set on, readahead strategy is optimized by Postgres. + In typicaly situations, OS readahead strategy will be good working, + however Postgres often knows better readahead strategy before + executing disk access. For example, we can easy to predict access + pattern when we input SQLs, because planner of postgres decides + effici
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/12/12 9:30), Claudio Freire wrote: On Wed, Dec 11, 2013 at 3:14 AM, KONDO Mitsumasa wrote: enable_readahead=os|fadvise with os = on, fadvise = off Hmm, fadvise is a method, not a purpose, so I am considering another name for this GUC. Yeah, I was thinking of opening the door for readahead=aio, but whatever clearer than on-off would work ;) I'm very interested in Postgres with libaio, and I'd like to see the performance improvements. I'm not sure about libaio, however; it will face the exclusive-buffer-lock problem in asynchronous IO. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Time-Delayed Standbys
(2013/12/12 18:09), Simon Riggs wrote: On 9 December 2013 10:54, KONDO Mitsumasa wrote: (2013/12/09 19:35), Pavel Stehule wrote: 2013/12/9 KONDO Mitsumasa mailto:kondo.mitsum...@lab.ntt.co.jp>> Hi Fabrízio, I tested your v4 patch; here are my review comments. * Fix typo > 49 -# commited transactions from the master, specify a recovery time delay. > 49 +# committed transactions from the master, specify a recovery time delay. * Fix white space > 177 - if (secs <= 0 && microsecs <=0) > 177 + if (secs <= 0 && microsecs <=0 ) * Add functionality (my proposal) We could allow a negative number in min_standby_apply_delay. I think this feature is for worldwide replication situations. For example, the master server is in Japan and the slave server is in San Francisco. Japan time is ahead of San Francisco time, and if we want a delay in this situation, min_standby_apply_delay may need to be negative. So I propose changing the time-delay conditional branch as follows: > - if (min_standby_apply_delay > 0) > + if (min_standby_apply_delay != 0) What do you think? It should also work correctly. what about using an interval instead of an absolute time? This is because the local time is recorded in the XLOG, and calculating a global time has a big cost. I agree with your request here, but I don't think negative values are the right way to implement that, at least it would not be very usable. I think my proposal is the easiest and simplest way to solve this problem, and I believe that someone who cannot calculate a time-zone difference will not set up a replication cluster across continents. My suggestion would be to add the TZ to the checkpoint record. This way all users of WAL can see the TZ of the master and act accordingly. I'll do a separate patch for that. It is something useful for other situations as well. However, it might lead to a long and complicated discussion... I hope this patch can be committed in this commit-fest or the next, final one.
Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Time-Delayed Standbys
(2013/12/12 7:23), Fabrízio de Royes Mello wrote: On Wed, Dec 11, 2013 at 7:47 PM, Andres Freund * hot_standby=off: Makes delay useable with wal_level=archive (and thus > a lower WAL volume) > * standby_mode=off: Configurations that use tools like pg_standby and > similar simply don't need standby_mode=on. If you want to trigger > failover from within the restore_command you *cannot* set it. > * recovery_target_*: It can still make sense if you use > pause_at_recovery_target. I don't think some of those arguments are quite right... We can just set standby_mode=on when we use "min_standby_apply_delay" with pg_standby and similarly simple tools. However, I tend to agree that we don't need to prohibit anything except standby_mode. So I'd like to propose renaming the "min_standby_apply_delay" parameter to "min_recovery_apply_delay", which is a natural name for this feature. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Time-Delayed Standbys
(2013/12/10 18:38), Andres Freund wrote: "master PITR"? What's that? All PITR is based on recovery.conf and thus not really a "master"? "master PITR" is PITR with "standby_mode = off"; it's just recovery from a base backup. The difference between "master PITR" and "standby" is that the former ends up on an independent timeline ID, while the latter keeps the same timeline ID, following the master server. The purposes are different in the first place. Why should we prohibit using this feature in PITR? I don't see any advantage in doing so. If somebody doesn't want the delay, they shouldn't set it in the configuration file. End of story. Unfortunately, there are a lot of careless users in the world... I think you have such clients, too. There's not really that meaningful a distinction between PITR and replication using archive_command. Especially when using *pause_after. It is meaningless in "master PITR": the server would become a master with a new timeline ID at an unexpected time. I think this feature will be used in a lot of scenarios in which PITR is currently used. We have to judge which is better: gaining some potential, or protecting careless users. And we had better wait for the author's comment... Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/12/10 22:55), Claudio Freire wrote: On Tue, Dec 10, 2013 at 5:03 AM, KONDO Mitsumasa wrote: I revised this patch and re-ran the performance test; it works correctly on Linux with no compiler warnings. I added a GUC for the enable_kernel_readahead option in the new version. When this GUC is on (the default), it uses POSIX_FADV_NORMAL, the general OS readahead. When it is off, it uses POSIX_FADV_RANDOM or POSIX_FADV_SEQUENTIAL as judged from the buffer hint in Postgres, so readahead is optimized by Postgres. We can change this parameter per transaction, anywhere and anytime. I'd change the naming to enable_readahead=os|fadvise with os = on, fadvise = off OK, I also think "on" or "off" naming is not good. But fadvise is a method, not a purpose, so I am considering other names for this GUC. 1) readahead_strategy=os|pg This naming is good for future additional implementations. If we later want a maximum readahead setting that always uses POSIX_FADV_SEQUENTIAL, we can add "max". 2) readahead_optimizer=os|pg or readahead_strategist=os|pg This naming makes it easy to understand who optimizes readahead, but it is not extensible for future implementations. And, if you want to keep the on/off values, I'd reverse them. Because off reads more like "I don't do anything special", and in your patch it's quite the opposite. I understand your feeling. If we adopt an "on|off" setting, I would like the GUC to be optimized_readahead=off|on. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Why we are going to have to go DirectIO
(2013/12/11 10:25), Tom Lane wrote: > Jeff Janes writes: >> On Tue, Dec 3, 2013 at 11:39 PM, Claudio Freire >> wrote: >>> Problem is, Postgres relies on a working kernel cache for checkpoints. >>> Checkpoint logic would have to be heavily reworked to account for an >>> impaired kernel cache. > >> I don't think it would need anything more than a sorted checkpoint. > > Nonsense. We don't have access to the physical-disk-layout information > needed to do reasonable sorting; The OS knows the physical disk layout, as shown below.
> [mitsu-ko@ssd ~]$ filefrag -v .bashrc
> Filesystem type is: ef53
> File size of .bashrc is 124 (1 block, blocksize 4096)
> ext logical physical expected length flags
> 0 0 15761410 1 eof
> .bashrc: 1 extent found
If we need this information, we can get the physical disk layout whenever we want. > to say nothing of doing something > intelligent in a multi-spindle environment, or whenever any other I/O > is going on concurrently. The IO scheduler in the OS knows that best. So I think buffered IO is faster than direct IO in general situations. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
Hi, I revised this patch and re-ran the performance test; it works correctly on Linux with no compiler warnings. I added a GUC for the enable_kernel_readahead option in the new version. When this GUC is on (the default), it uses POSIX_FADV_NORMAL, the general OS readahead. When it is off, it uses POSIX_FADV_RANDOM or POSIX_FADV_SEQUENTIAL as judged from the buffer hint in Postgres, so readahead is optimized by Postgres. We can change this parameter per transaction, anywhere and anytime.

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB (PC3-10600R-9)
Disk: 146GB (15k) * 4, RAID 1+0
RAID controller: P410i/256MB
OS: RHEL 6.4 (x86_64)
FS: Ext4

* Test setting
I use "pgbench -c 8 -j 4 -T 2400 -S -P 10 -a". I also use my accuracy patch in this test. So I executed the following commands before each benchmark.
1. cluster all databases
2. truncate pgbench_history
3. checkpoint
4. sync
5. checkpoint

* postgresql.conf
shared_buffers = 2048MB
maintenance_work_mem = 64MB
wal_level = minimal
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7

* Performance test result
** In-memory database size
s=1000        |   1   |   2   |   3   |  avg
---------------------------------------------
readahead=on  | 39836 | 40229 | 40055 | 40040
readahead=off | 31259 | 29656 | 30693 | 30536
ratio         |  78%  |  74%  |  77%  |  76%

** Over-memory database size
s=2000        |   1   |   2   |   3   |  avg
---------------------------------------------
readahead=on  |  1288 |  1370 |  1367 |  1341
readahead=off |  1683 |  1688 |  1395 |  1589
ratio         |  131% |  123% |  102% |  118%

s=3000        |   1   |   2   |   3   |  avg
---------------------------------------------
readahead=on  |   965 |   862 |   993 |   940
readahead=off |  1113 |  1098 |   935 |  1049
ratio         |  115% |  127% |   94% |  112%

Performance looks good except at scale factor 1000. When the readahead parameter is off, disk IO is kept to the necessary minimum, so it is faster than "readahead=on", which issues useless disk IO. For example, which is faster over many transactions: an 8KB random read or a 12KB random read from disk? It is self-evident that the former is faster.
At scale factor 1000, buffers become hot more slowly than with "readahead=on", so it appears to perform worse. But that is the essence of measuring performance, and you can confirm it in the attached benchmark graphs. We can use this parameter when the buffers are relatively hot. If you want to see other trial graphs, I will send them. And I will add MacOS support and create documentation for this patch this week. #The MacOS machine is at my house. Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/configure --- b/configure *** *** 19937,19943 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5 --- 19937,19943 ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5 *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** *** 9119,9125 copy_relation_data(SMgrRelation src, SMgrRelation dst, /* If we got a cancel signal during the copy of the data, quit */ CHECK_FOR_INTERRUPTS(); ! smgrread(src, forkNum, blkno, buf); if (!PageIsVerified(page, blkno)) ereport(ERROR, --- 9119,9125 /* If we got a cancel signal during the copy of the data, quit */ CHECK_FOR_INTERRUPTS(); !
smgrread(src, forkNum, blkno, buf, (char *) BAS_BULKREAD); if (!PageIsVerified(page, blkno)) ereport(ERROR, *** a/src/backend/storage/buffer/bufmgr.c --- b/src/backend/storage/buffer/bufmgr.c *** *** 41,46 --- 41,47 #include "pg_trace.h" #include "pgstat.h" #include "postmaster/bgwriter.h" + #include "storage/buf.h" #include "storage/buf_internals.h" #include "storage/bufmgr.h" #include "storage/ipc.h" *** *** 451,457 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); ! smgrread(smgr, forkNum, blockNum, (char *) bufBlock); if (track_io_timing) { --- 452,458 if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); ! smgrread(smgr, forkNum, block
Re: [HACKERS] Time-Delayed Standbys
(2013/12/09 20:29), Andres Freund wrote: On 2013-12-09 19:51:01 +0900, KONDO Mitsumasa wrote: Let me add a comment: we have to consider three situations. 1. PITR 2. replication standby 3. replication standby with restore_command I think this patch cannot delay in situation 1. Why? I have three reasons. 1. It is written in the document. Can we remove it? 2. The name of this feature is "Time-delayed *standbys*", not "Time-delayed *recovery*". Can we change it? 3. I think it is unnecessary in master PITR. If it could delay in master PITR, the server would become a master at an unexpected time instead of continuing recovery; that is meaningless. I'd like to ask what you expect from this feature and how you would use it. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Time-Delayed Standbys
(2013/12/09 19:35), Pavel Stehule wrote: 2013/12/9 KONDO Mitsumasa mailto:kondo.mitsum...@lab.ntt.co.jp>> Hi Fabrízio, I tested your v4 patch; here are my review comments. * Fix typo > 49 -# commited transactions from the master, specify a recovery time delay. > 49 +# committed transactions from the master, specify a recovery time delay. * Fix white space > 177 - if (secs <= 0 && microsecs <=0) > 177 + if (secs <= 0 && microsecs <=0 ) * Add functionality (I propose) We could allow a negative number for min_standby_apply_delay. I think this feature is for worldwide replication situations. For example, the master server is in Japan and the slave server is in San Francisco. Japan time is ahead of San Francisco time, so if we want a delay in this situation, min_standby_apply_delay may need a negative number. So I propose changing the time-delay conditional branch as follows. > - if (min_standby_apply_delay > 0) > + if (min_standby_apply_delay != 0) What do you think? It should also work correctly. what about using an interval instead of an absolute time? That is because local time is recorded in the WAL, and calculating a global time would have a big cost. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
(2013/12/09 19:36), KONDO Mitsumasa wrote: * Problem 1 I read the document you wrote. It says "PITR is not affected". However, when I run PITR with min_standby_apply_delay=300, the server cannot start. The log is as follows. [mitsu-ko@localhost postgresql]$ bin/pg_ctl -D data2 start server starting [mitsu-ko@localhost postgresql]$ LOG: database system was interrupted; last known up at 2013-12-08 18:57:00 JST LOG: creating missing WAL directory "pg_xlog/archive_status" cp: cannot stat `../arc/0002.history': LOG: starting archive recovery LOG: restored log file "00010041" from archive LOG: redo starts at 0/4128 LOG: consistent recovery state reached at 0/41F0 LOG: database system is ready to accept read only connections LOG: restored log file "00010042" from archive FATAL: cannot wait on a latch owned by another process LOG: startup process (PID 30501) exited with exit code 1 LOG: terminating any other active server processes We need a recovery flag to control the PITR situation. Adding my comment: we have to consider three situations. 1. PITR 2. replication standby 3. replication standby with restore_command I think this patch cannot delay in situation 1. So I think you should just add a StandbyModeRequested flag to the conditional branch. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
Hi Fabrízio, I tested your v4 patch; here are my review comments. * Fix typo > 49 -# commited transactions from the master, specify a recovery time delay. > 49 +# committed transactions from the master, specify a recovery time delay. * Fix white space > 177 - if (secs <= 0 && microsecs <=0) > 177 + if (secs <= 0 && microsecs <=0 ) * Add functionality (I propose) We could allow a negative number for min_standby_apply_delay. I think this feature is for worldwide replication situations. For example, the master server is in Japan and the slave server is in San Francisco. Japan time is ahead of San Francisco time, so if we want a delay in this situation, min_standby_apply_delay may need a negative number. So I propose changing the time-delay conditional branch as follows. > - if (min_standby_apply_delay > 0) > + if (min_standby_apply_delay != 0) What do you think? It should also work correctly. * Problem 1 I read the document you wrote. It says "PITR is not affected". However, when I run PITR with min_standby_apply_delay=300, the server cannot start. The log is as follows. [mitsu-ko@localhost postgresql]$ bin/pg_ctl -D data2 start server starting [mitsu-ko@localhost postgresql]$ LOG: database system was interrupted; last known up at 2013-12-08 18:57:00 JST LOG: creating missing WAL directory "pg_xlog/archive_status" cp: cannot stat `../arc/0002.history': LOG: starting archive recovery LOG: restored log file "00010041" from archive LOG: redo starts at 0/4128 LOG: consistent recovery state reached at 0/41F0 LOG: database system is ready to accept read only connections LOG: restored log file "00010042" from archive FATAL: cannot wait on a latch owned by another process LOG: startup process (PID 30501) exited with exit code 1 LOG: terminating any other active server processes We need a recovery flag to control the PITR situation. That's all for now. If you are busy, please fix it at your own pace. 
I'm busy too, so I don't mind waiting:-) Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why we are going to have to go DirectIO
(2013/12/05 23:42), Greg Stark wrote: On Thu, Dec 5, 2013 at 8:35 AM, KONDO Mitsumasa wrote: Yes. And using DirectIO efficiently is more difficult than buffered I/O. If we simply switched write() to direct I/O in PostgreSQL, it would produce the worst kind of ugly random I/O. Using DirectIO presumes you're using libaio or threads to implement prefetching and asynchronous I/O scheduling. I think in the long term there are only two ways to go here. Either a) we use DirectIO and implement an I/O scheduler in Postgres or b) We use mmap and use new system calls to give the kernel all the information Postgres has available to it to control the I/O scheduler. I partly agree with method (b). As others have said, I don't think the mmap API is meant for controlling I/O. I think posix_fadvise(), sync_file_range(), and fallocate() are an easier way to realize a better I/O scheduler in Postgres. These system calls cannot cause data corruption, and we can just use the existing implementation; they affect only performance. Here is my survey of posix_fadvise() and sync_file_range(). The rules are simple. #Most of this explanation comes from the Linux man pages:-) * Optimize readahead in the OS [ posix_fadvise() ] These options are mainly for read performance. - POSIX_FADV_SEQUENTIAL -> The OS readahead window is set to its maximum. - POSIX_FADV_RANDOM -> Disables OS readahead; the application can then judge file-cache access frequency and use the file cache efficiently itself. - POSIX_FADV_NORMAL -> The OS readahead window is tuned dynamically for each situation. If you cannot decide on a disk-access strategy, select this option; it should work well in most cases. * Control dirty and clean buffers in the OS [ posix_fadvise() and sync_file_range() ] These options control write and read performance through the OS file cache. - POSIX_FADV_DONTNEED -> Drops the pages from the file cache. If they are dirty, they are written to disk and then dropped; if clean, they are simply dropped from the OS file cache. 
- sync_file_range() -> If you want to write dirty buffers to disk while keeping them in the OS file cache, select this system call. It can also control the amount written per call. - POSIX_FADV_NOREUSE -> If you think the cached pages will not be needed again, set this option; they will be dropped soon. - POSIX_FADV_WILLNEED -> If you think the cached pages will be important, set this option; they will tend to remain in the OS file cache. That's all. The kernel cannot perfectly predict the I/O pattern of each piece of middleware, so it optimizes with general heuristic algorithms. I think that is the right approach. However, PostgreSQL can predict its I/O pattern in the planner, executor, and checkpointer, so we had better set the optimum posix_fadvise() flag or call sync_file_range() before/after the generic I/O system calls. I think they would give us good I/O control and scheduling without unreliable implementations. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why we are going to have to go DirectIO
(2013/12/04 16:39), Claudio Freire wrote: On Wed, Dec 4, 2013 at 4:28 AM, Tatsuo Ishii wrote: Can we avoid the Linux kernel problem by simply increasing our shared buffer size, say up to 80% of memory? That would make swapping more likely. Is that the case? If the system has not enough memory, the kernel buffer will be used for other purpose, and the kernel cache will not work very well anyway. In my understanding, the problem is, even if there's enough memory, the kernel's cache does not work as expected. Problem is, Postgres relies on a working kernel cache for checkpoints. Checkpoint logic would have to be heavily reworked to account for an impaired kernel cache. Really, there's no difference between fixing the I/O problems in the kernel(s) vs in postgres. The only difference is, in the kernel(s), everyone profits, and you've got a huge head start. Yes. And using DirectIO efficiently is more difficult than buffered I/O. If we simply switched write() to direct I/O in PostgreSQL, it would produce the worst kind of ugly random I/O. Communicating more with the kernel (through posix_fadvise, fallocate, aio, iovec, etc...) would probably be good, but it does expose more kernel issues. posix_fadvise, for instance, is a double-edged sword ATM. I do believe, however, that exposing those issues and prompting a fix is far preferable than silently working around them. Agreed. And I believe that well-controlled buffered I/O is faster and easier than perfectly controlled DirectIO. In fact, the Oracle database uses buffered I/O to access small datasets and DirectIO to access big datasets, in order to use the OS file cache more efficiently. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why we are going to have to go DirectIO
(2013/12/04 11:28), Tatsuo Ishii wrote: Magnus Hagander writes: On Tue, Dec 3, 2013 at 11:44 PM, Josh Berkus wrote: Would certainly be nice. Realistically, getting good automated performace tests will require paying someone like Greg S., Mark or me for 6 solid months to develop them, since worthwhile open source performance test platforms currently don't exist. That money has never been available; maybe I should do a kickstarter. So in order to get *testing* we need to pay somebody. But to build a great database server, we can rely on volunteer efforts or sponsorship from companies who are interested in moving the project forward? And even more to the point, volunteers to reinvent the kernel I/O stack can be found on every street corner? And those volunteers won't need any test scaffolding to be sure that *their* version never has performance regressions? (Well, no, they won't, because no such thing will ever be built. But we do need better test scaffolding for real problems.) Can we avoid the Linux kernel problem by simply increasing our shared buffer size, say up to 80% of memory? That would make swapping more likely. I think we should use the newer Linux system calls such as posix_fadvise(), fallocate(), and sync_file_range() when we use Linux buffered I/O. However, PostgreSQL barely uses these system calls. In particular, I think the checkpoint algorithm is very ugly.. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
(2013/12/04 4:00), Andres Freund wrote: On 2013-12-03 13:46:28 -0500, Robert Haas wrote: On Tue, Dec 3, 2013 at 12:36 PM, Fabrízio de Royes Mello wrote: On Tue, Dec 3, 2013 at 2:33 PM, Christian Kruse wrote: Hi Fabrizio, looks good to me. I did some testing on 9.2.4, 9.2.5 and HEAD. It applies and compiles w/o errors or warnings. I set up a master and two hot standbys replicating from the master, one with 5 minutes delay and one without delay. After that I created a new database and generated some test data: CREATE TABLE test (val INTEGER); INSERT INTO test (val) (SELECT * FROM generate_series(0, 100)); The non-delayed standby nearly instantly had the data replicated, the delayed standby was replicated after exactly 5 minutes. I did not notice any problems, errors or warnings. Thanks for your review Christian... So, I proposed this patch previously and I still think it's a good idea, but it got voted down on the grounds that it didn't deal with clock drift. I view that as insufficient reason to reject the feature, but others disagreed. I really fail to see why clock drift should be this patch's responsibility. It's not like the world would go under^W data corruption would ensue if the clocks drift. Your standby would get delayed imprecisely. Big deal. From what I know of potential users of this feature, they would set it to at the very least 30min - that's WAY above the range for acceptable clock-drift on servers. Yes. I think the purpose of this patch is a long delay on the standby server, not a precisely timed short delay. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
(2013/11/30 5:34), Fabrízio de Royes Mello wrote: On Fri, Nov 29, 2013 at 5:49 AM, KONDO Mitsumasa mailto:kondo.mitsum...@lab.ntt.co.jp>> wrote: > * Problem1 > Your patch does not code recovery.conf.sample about recovery_time_delay. > Please add it. Fixed. OK. It looks fine now. > * Problem2 > When I set up a time-delayed standby and start the standby server, I cannot access the standby server with psql, because PG is in its initial recovery phase, which does not accept psql connections. I think a time-delayed standby should only delay the recovery position; it must not affect other functionality. > > I didn't test recovery on a master server with recovery_time_delay. If you have detailed test results for these cases, please send them to me. > Well, I could not reproduce the problem that you described. I ran the following test: 1) Clusters - build master - build slave and attach to the master using SR and config recovery_time_delay to 1min. 2) Stop the slave 3) Run some transactions on the master using pgbench to generate a lot of archives 4) Start the slave and connect to it using psql and in another session I can see all archive recovery log Hmm... I thought I had misread your email, but I reproduced it again. When I set a small recovery_time_delay (=3), it seemed to work correctly. However, when I set a long recovery_time_delay (=300), it did not work. My reproduction log is as follows. [mitsu-ko@localhost postgresql]$ bin/pgbench -T 30 -c 8 -j4 -p5432 starting vacuum...end. 
transaction type: TPC-B (sort of) scaling factor: 10 query mode: simple number of clients: 8 number of threads: 4 duration: 30 s number of transactions actually processed: 68704 latency average: 3.493 ms tps = 2289.196747 (including connections establishing) tps = 2290.175129 (excluding connections establishing) [mitsu-ko@localhost postgresql]$ vim slave/recovery.conf [mitsu-ko@localhost postgresql]$ bin/pg_ctl -D slave start server starting [mitsu-ko@localhost postgresql]$ LOG: database system was shut down in recovery at 2013-12-03 10:26:41 JST LOG: entering standby mode LOG: consistent recovery state reached at 0/5C4D8668 LOG: redo starts at 0/5C4000D8 [mitsu-ko@localhost postgresql]$ FATAL: the database system is starting up FATAL: the database system is starting up FATAL: the database system is starting up FATAL: the database system is starting up FATAL: the database system is starting up [mitsu-ko@localhost postgresql]$ bin/psql -p6543 psql: FATAL: the database system is starting up [mitsu-ko@localhost postgresql]$ bin/psql -p6543 psql: FATAL: the database system is starting up I attached my postgresql.conf and recovery.conf. With them the problem should reproduce. I think your patch needs recovery flags such as ArchiveRecoveryRequested and InArchiveRecovery, because time-delayed standby applies only to the replication situation, and it should not interfere with standby startup or archive recovery. Does that differ from what you intended? I think this patch has a lot of potential, but standby functionality is more important than this feature, and we may need to discuss what the best behavior for this patch is. Regards, -- Mitsumasa KONDO NTT Open Source Software Center conf.tar.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
Hi Royes, I'm sorry for my late review... I see potential in your patch for PG's replication functionality, and it might be useful for many people. I checked your patch and have some comments for improvement. I haven't run detailed tests of unexpected situations yet, but I think Problem 2 below is an important functionality problem, so I ask you to solve it first. * Regression tests No problem. * Problem 1 Your patch does not add recovery_time_delay to recovery.conf.sample. Please add it. * Problem 2 When I set up a time-delayed standby and start the standby server, I cannot access the standby server with psql, because PG is in its initial recovery phase, which does not accept psql connections. I think a time-delayed standby should only delay the recovery position; it must not affect other functionality. I didn't test recovery on a master server with recovery_time_delay. If you have detailed test results for these cases, please send them to me. That is all for my first quick review of your patch. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logging WAL when updating hintbit
(2013/11/15 19:27), Sawada Masahiko wrote: On Thu, Nov 14, 2013 at 7:51 PM, Florian Weimer wrote: On 11/14/2013 07:02 AM, Sawada Masahiko wrote: I attached patch adds new wal_level 'all'. Shouldn't this be a separate setting? It's useful for storage which requires rewriting a partially written sector before it can be read again. Thank you for comment. Actually, I had thought to add separate parameter. I think he meant that if you can prove the amount of WAL is almost the same and there is no performance loss compared to before, you might not need a separate parameter in your patch. Did you measure the WAL volume with your patch? I'd like to know it. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of pg_stat_statement usage about buffer hit ratio
(2013/11/19 12:03), Peter Geoghegan wrote: On Mon, Nov 18, 2013 at 6:12 PM, KONDO Mitsumasa wrote: I read the discussion between Itagaki-san and Mr. Cerdic. He said that raw values are just simple. However, were his changes really simple? I cannot understand his aesthetic sense, or yours either:-( It's too complicated. Do you know how to tune PG from the local_* and temp_* information? At least, I think most users cannot tune from that information, and parts of it may not be useful at all. All of those costs are cumulative aggregates. If we didn't aggregate them, then the user couldn't possibly determine them on their own, to any approximation. That's the difference. If you think the local_* and temp_* aren't very useful, I'm inclined to agree, but it's too late to do anything about that now. I regret the past decision on Itagaki-san's patch, though improving it may no longer be possible. However, we can change it if we have a logical reason to do so. No. It's not only for geek tools and people with pre-packaged solutions at big companies, but also for common DBA tools. I don't think that the tool needs to be expensive. If selecting from the pg_stat_statements view every 1-3 seconds is too expensive for such a tool, we can have a discussion about being smarter, because there certainly are ways to optimize it. Now I understand why you say my patch is heavy! Your monitoring method is radically heavy. In general, we read the pg_stat_statements view every 1-5 minutes, because monitoring SQL must not be heavier than the ordinary workload SQL. When we measure real performance, we don't want the monitoring SQL's cost included. And I still cannot understand why you'd like to collect such drastically detailed per-statement performance. I only want to know the max, average, and stddev of each statement. They are enough, because we can improve the statements using that information. 
Regarding your min/max patch: I'm opposed to adding even more to the spinlock-protected counters struct, so that we can get an exact answer to a question where an approximate answer would very likely be good enough. And as Itagaki-san said 4 years ago, who is to say that what you've done here for buffers (or equally, what you've done in your min/max patch) is more interesting than the same thing but for another cost? The point of having what you've removed from the pg_stat_statements docs about calculating averages is that it is an example that can be generalized from. I certainly think there should be better tooling to make displaying costs over time easier, or characterizing the distribution, but unfortunately this isn't it. Something like pg_stat_statements is allowed to be approximate. That's considered an appropriate trade-off. Most obviously, today there can be hash table collisions, and some of the entries can therefore be plain wrong. Previously, I put the probability of 1 collision in the hash table at about 1% when pg_stat_statements.max is set to 10,000. So if your min/max patch was "implemented in userspace", and an outlier is lost in the noise with just one second of activity, I'm not terribly concerned about that. It's a trade-off, and if you don't think it's the correct one, well then I'm afraid that's just where you and I differ. As I've said many times, if you want to have a discussion about making aggressive snapshotting of the pg_stat_statements view more practical, I think that would be very useful. To summarize your comment: your patch is lower cost than my proposed patch, because my patch has a long-lock problem while yours does not. Is that right? If so, I can understand it in theory, but without real practice or a benchmark I'm not sure whether it actually holds. So we will need benchmark tests of both my patch and yours. I will try it. 
By the way, MySQL and Oracle, which are very popular databases, have these statistics. I think your argument might discourage people who want to migrate from those databases, and will only accelerate their popularity further. I think that there should be better tooling built on top of pg_stat_statements. I don't know what Oracle does, but I'm pretty sure that MySQL has nothing like pg_stat_statements. Please correct me if I'm mistaken. I attended db tech showcase 2013, held in Japan last week. An Oracle speaker introduced the performance schema and similar features in MySQL 5.6. Here are his slides; they are mostly in Japanese, but please look from page 31 onward, where it is written in SQL. http://www.slideshare.net/yoyamasaki/20131110-tuning-onmysql56 In MySQL 5.6, it has information such as sum_time, min_time, avg_time, max_time, and sum_lock_time. I think that is useful for understanding the statements we execute. As I said on the min/max thread, if we're no
Re: [HACKERS] Improvement of pg_stat_statement usage about buffer hit ratio
(2013/11/19 11:12), KONDO Mitsumasa wrote: (2013/11/19 3:56), Peter Geoghegan wrote: On Mon, Nov 18, 2013 at 10:49 AM, Fujii Masao wrote: The same idea was proposed before but not committed because Itagaki thought that pg_stat_statements view should report only raw values. Please read the following thread. I have the same feeling with him. Anyway we should listen to more opinions from other people, though. http://www.postgresql.org/message-id/20091222172719.8b86.52131...@oss.ntt.co.jp I read the discussion between Itagaki-san and Mr. Cerdic. He said that raw values are just simple. However, were his changes really simple? I cannot understand his aesthetic sense, or yours either:-( PG8.4 before: CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, OUT query text, OUT calls int8, OUT total_time float8, OUT rows int8 ) PG9.0 after: CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, OUT query text, OUT calls int8, OUT total_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, OUT shared_blks_written int8, OUT local_blks_hit int8, OUT local_blks_read int8, OUT local_blks_written int8, OUT temp_blks_read int8, OUT temp_blks_written int8 ) It's too complicated. Do you know how to tune PG from the local_* and temp_* information? At least, I think most users cannot tune from that information, and parts of it may not be useful at all. +1 from me. I think that a higher level tool needs to be built on pg_stat_statements to make things easy for those that want a slick, pre-packaged solution. No. It's not only for geek tools and people with pre-packaged solutions at big companies, but also for common DBA tools. By the way, MySQL and Oracle, which are very popular databases, have these statistics. 
I think your argument might discourage people who want to migrate from those databases, and will only accelerate their popularity further. As I said on the min/max thread, if we're not doing enough to help people who would like to build such a tool, we should discuss how we can do better. Could you tell me how to get min/max statistics at low cost? I'm not sure about the details of your patch in the CF, but it seems very high cost. To repeat: I think that if we want drastically detailed statistics, we have to create separate statistics tools, and your patch would be such a tool. However, pg_stat_statements should stay lightweight. Regards, Oh, I forgot to say it... I believe the essence of information technology is to present the most informative data in the least information at low cost. If that is wrong, please let me know the reason. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of pg_stat_statement usage about buffer hit ratio
(2013/11/19 3:56), Peter Geoghegan wrote: On Mon, Nov 18, 2013 at 10:49 AM, Fujii Masao wrote: The same idea was proposed before but not committed because Itagaki thought that pg_stat_statements view should report only raw values. Please read the following thread. I have the same feeling with him. Anyway we should listen to more opinions from other people, though. http://www.postgresql.org/message-id/20091222172719.8b86.52131...@oss.ntt.co.jp I read the discussion between Itagaki-san and Mr. Cerdic. He said that raw values are just simple. However, were his changes really simple? I cannot understand his aesthetic sense, or yours either:-( PG8.4 before: CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, OUT query text, OUT calls int8, OUT total_time float8, OUT rows int8 ) PG9.0 after: CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, OUT query text, OUT calls int8, OUT total_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, OUT shared_blks_written int8, OUT local_blks_hit int8, OUT local_blks_read int8, OUT local_blks_written int8, OUT temp_blks_read int8, OUT temp_blks_written int8 ) It's too complicated. Do you know how to tune PG from the local_* and temp_* information? At least, I think most users cannot tune from that information, and parts of it may not be useful at all. +1 from me. I think that a higher level tool needs to be built on pg_stat_statements to make things easy for those that want a slick, pre-packaged solution. No. It's not only for geek tools and people with pre-packaged solutions at big companies, but also for common DBA tools. By the way, MySQL and Oracle, which are very popular databases, have these statistics. I think your argument might discourage people who want to migrate from those databases, and will only accelerate their popularity further. 
As I said on the min/max thread, if we're not doing enough to help people who would like to build such a tool, we should discuss how we can do better. Could you tell me how to get min/max statistics at low cost? I'm not sure about the details of your patch in the CF, but it seems very high cost. To repeat: I think that if we want drastically detailed statistics, we have to create separate statistics tools, and your patch would be such a tool. However, pg_stat_statements should stay lightweight. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of pg_stat_statement usage about buffer hit ratio
(2013/11/18 20:16), Haribabu kommi wrote: On 18 October 2013 13:35 KONDO Mitsumasa wrote: This patch conflicts with the pg_stat_statement_min_max_exectime patch which I submitted, and the pg_stat_statement_min_max_exectime patch also adds new columns, min_time and max_time. So I'd like to change it in this opportunity. This patch adds another column "shared_blks_hit_percent" to the pg_stat_statements view, which is very beneficial for the user to know what percentage of blocks are hit. All changes are fine and working as described. Marked as ready for committer. Thank you for your review! However, I'd like to add the average time of each statement, too. Attached is my latest patch, adding shared_blks_hit_percent and avg_time. This is the main added code: + total_time / calls::float AS avg_time, If this patch and the min/max and stddev patch are committed, we can see more detailed yet simple information in pg_stat_statements, with lightweight code. Regards, -- Mitsumasa KONDO NTT Open Source Software Center diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql new file mode 100644 index 000..f0a8e0f --- /dev/null +++ b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql @@ -0,0 +1,63 @@ +/* contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. 
\quit + +/* First we have to remove them from the extension */ +ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements; +ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(); + +/* Then we can drop them */ +DROP VIEW pg_stat_statements; +DROP FUNCTION pg_stat_statements(); + +/* Now redefine */ +CREATE FUNCTION pg_stat_statements( +OUT userid oid, +OUT dbid oid, +OUT query text, +OUT calls int8, +OUT total_time float8, +OUT rows int8, +OUT shared_blks_hit int8, +OUT shared_blks_read int8, +OUT shared_blks_dirtied int8, +OUT shared_blks_written int8, +OUT local_blks_hit int8, +OUT local_blks_read int8, +OUT local_blks_dirtied int8, +OUT local_blks_written int8, +OUT temp_blks_read int8, +OUT temp_blks_written int8, +OUT blk_read_time float8, +OUT blk_write_time float8 +) +RETURNS SETOF record +AS 'MODULE_PATHNAME' +LANGUAGE C; + +CREATE VIEW pg_stat_statements AS + SELECT userid, + dbid, + query, + calls, + total_time, + rows, + CASE WHEN shared_blks_hit + shared_blks_read > 0 + THEN 100.0 * (shared_blks_hit::float / (shared_blks_hit + shared_blks_read)) + ELSE 0 END AS shared_blks_hit_percent, + shared_blks_hit, + shared_blks_read, + shared_blks_dirtied, + shared_blks_written, + local_blks_hit, + local_blks_read, + local_blks_dirtied, + local_blks_written, + temp_blks_read, + temp_blks_written, + blk_read_time, + blk_write_time + FROM pg_stat_statements(); + +GRANT SELECT ON pg_stat_statements TO PUBLIC; diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql new file mode 100644 index 000..41dc16b --- /dev/null +++ b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql @@ -0,0 +1,65 @@ +/* contrib/pg_stat_statements/pg_stat_statements--1.2.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit + +-- Register functions. 
+CREATE FUNCTION pg_stat_statements_reset() +RETURNS void +AS 'MODULE_PATHNAME' +LANGUAGE C; + +CREATE FUNCTION pg_stat_statements( +OUT userid oid, +OUT dbid oid, +OUT query text, +OUT calls int8, +OUT total_time float8, +OUT rows int8, +OUT shared_blks_hit int8, +OUT shared_blks_read int8, +OUT shared_blks_dirtied int8, +OUT shared_blks_written int8, +OUT local_blks_hit int8, +OUT local_blks_read int8, +OUT local_blks_dirtied int8, +OUT local_blks_written int8, +OUT temp_blks_read int8, +OUT temp_blks_written int8, +OUT blk_read_time float8, +OUT blk_write_time float8 +) +RETURNS SETOF record +AS 'MODULE_PATHNAME' +LANGUAGE C; + +-- Register a view on the function for ease of use. +CREATE VIEW pg_stat_statements AS + SELECT userid, + dbid, + query, + calls, + total_time, + total_time / calls::float AS avg_time, + rows, + CASE WHEN shared_blks_hit + shared_blks_read > 0 + THEN 100.0 * (shared_blks_hit::float / (shared_blks_hit + shared_blks_read)) + ELSE 0 END AS shared_blks_hit_percent, + shared_blks_hit, + sh
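For reference, a usage sketch of the two columns the 1.2 view adds (assuming the patch above is installed and the extension has been updated with ALTER EXTENSION pg_stat_statements UPDATE; column rounding here is purely illustrative):

```sql
-- Sketch: top 5 statements by total time, using the avg_time and
-- shared_blks_hit_percent columns defined by the patched 1.2 view.
SELECT query,
       calls,
       round(avg_time::numeric, 3)                AS avg_ms,
       round(shared_blks_hit_percent::numeric, 2) AS hit_pct
  FROM pg_stat_statements
 ORDER BY total_time DESC
 LIMIT 5;
```

This replaces the hand-written `100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0)` expression users currently need.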
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/11/18 11:25), Claudio Freire wrote: On Sun, Nov 17, 2013 at 11:02 PM, KONDO Mitsumasa wrote: However, my patch is still in progress and needs more improvement. I am going to add a method of controlling readahead by GUC, so that users can freely select the readahead parameter in their transactions. Rather, I'd try to avoid fadvising consecutive or almost-consecutive blocks. Detecting that is hard at the block level, but maybe you can tie that detection into the planner, and specify a sequential strategy when the planner expects index-heap correlation? I think we had better develop these patches step by step, because it is difficult for readahead optimization to be complete from the very first patch. We need the framework first. Well, problem is, that without those smarts, I don't think this patch can be enabled by default. It will considerably hurt common use cases for postgres. Yes, I had thought the same as you: the default setting will be off (use normal readahead as before). The next version of my patch will do this. But I guess we'll have a better idea about that when we see how much of a performance impact it makes when you run those tests, so no need to guess in the dark. Yes, sure. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/11/15 13:48), Claudio Freire wrote: On Thu, Nov 14, 2013 at 11:13 PM, KONDO Mitsumasa: I use CentOS 6.4, whose kernel version is 2.6.32-358.23.2.el6.x86_64, in this test. That's close to the kernel version I was using, so you should see the same effect. OK. You proposed a readahead-maximizing patch; I think it benefits performance, and that part of your argument is really true. Your patch applies maximum readahead when a SQL statement selects an index range scan. Is that right? Ehm... sorta. I think that your patch assumes that pages are ordered by index data. No. It just knows which pages will be needed, and fadvises them. No guessing involved, except the guess that the scan will not be aborted. There's a heuristic to stop limited scans from attempting to fadvise, and that's that prefetch strategy is applied only from the Nth+ page walk. We may completely optimize kernel readahead in PostgreSQL in the future; however, it is very difficult, and it will take a long time to get there from the beginning. So I propose a GUC switch that users can set in their transactions (I will create this patch in this CF). If someone wants to turn readahead off to use the file cache more efficiently in his transactions, he can run "SET readahead = off". PostgreSQL is open source, and I think it will become clear which cases it is effective for once many people use it. It improves index-only scans the most, but I also attempted to handle heap prefetches. That's where the kernel started conspiring against me, because I used many naturally-clustered indexes, and THERE performance was adversely affected because of that kernel bug. I am also creating a gaussian-distributed pgbench now and will submit it to this CF. It will clarify, at least partially, which situations are effective. You may want to try your patch with more real workloads, and maybe you'll confirm what I found out last time I messed with posix_fadvise. 
If my experience is still relevant, those patterns will have suffered a severe performance penalty with this patch, because it will disable kernel read-ahead on sequential index access. It may still work for sequential heap scans, because the access strategy will tell the kernel to do read-ahead, but many other access methods will suffer. The decisive difference is that my patch uses the buffer hint control architecture, so it can control readahead more intelligently in some cases. Indeed, but it's not enough. See my above comment about naturally clustered indexes. The planner expects that, and plans accordingly. It will notice correlation between a PK and physical location, and will treat an index scan over PK to be almost sequential. With your patch, that assumption will be broken I believe. However, my patch is still in progress and needs more improvement. I am going to add a method of controlling readahead by GUC, so that users can freely select the readahead parameter in their transactions. Rather, I'd try to avoid fadvising consecutive or almost-consecutive blocks. Detecting that is hard at the block level, but maybe you can tie that detection into the planner, and specify a sequential strategy when the planner expects index-heap correlation? I think we had better develop these patches step by step, because it is difficult for readahead optimization to be complete from the very first patch. We need the framework first. Try OLAP-style queries. I have the DBT-3 (TPC-H) benchmark tools. If you don't like TPC-H, could you tell me a good OLAP benchmark tool? I don't really know. Skimming the specs, I'm not sure if those queries generate large index range queries. You could try, maybe with auto_explain? OK, I will. And I will also try simple large index range queries with the EXPLAIN command. 
Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2013/11/15 2:09), Fujii Masao wrote: Agreed. Could you tell me the reason you agree? I am sorry, but I suspect you don't understand this discussion well enough :-( Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2013/11/15 11:31), Peter Geoghegan wrote: On Thu, Nov 14, 2013 at 6:28 PM, KONDO Mitsumasa wrote: Just to confirm: does "this patch" mean my patch? I agree with you about not adding another lock implementation. It would add overhead. Yes, I referred to your patch. I don't want to go down this road, because aggregation and constructing a timeline feels like the job of another tool. I am concerned about local minima and maxima. Even with the ability to reset min/max independently, you can't do so for each entry individually. And this approach won't scale to a histogram or more sophisticated ways of characterizing distribution, particularly not multiplicatively for things other than execution time (blocks hit and so on). I think that pg_stat_statements is a lightweight monitoring tool, not a complete monitoring tool. This feature is very useful for everyone to gather statistics. If you'd like more detailed statistics, I suggest adding another monitoring tool, something like pg_stat_statements_full. It might be heavier, but it could collect more detailed information. Everyone would welcome such a feature. - that spinlock needs to be held for very little time indeed to preserve pg_stat_statements' current low overhead. I'd like to keep pg_stat_statements low overhead. My patch achieves that: I don't add new locks or complicated code. As I said above, lets figure out how to have your tool or a similar tool acquire snapshots inexpensively and frequently instead. We tried to solve this problem using our tool in the past. However, it is difficult without the log output method, i.e. the log_statement=all option. So we are trying to add a new feature to pg_stat_statements that would give DBAs useful statistics without that overhead. That is a discussion I'd be happy to have. IMO pg_stat_statements ought to be as cheap as possible, and do one thing well - aggregate fixed-unit costs cumulatively. I am also happy to have that discussion! 
I'd like to try your new patch and give you my comments. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/11/15 11:17), Peter Geoghegan wrote: On Thu, Nov 14, 2013 at 6:18 PM, KONDO Mitsumasa wrote: I will fix it. Could you tell me your Mac OS version and gcc version? I have only a MacBook Air with Mavericks (10.9). I have an idea that Mac OS X doesn't have posix_fadvise at all. Didn't you use the relevant macros so that the code at least builds on those platforms? Thank you for your nice advice, too. I will try to fix it with the proper macros. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2013/11/14 7:11), Peter Geoghegan wrote: On Wed, Oct 23, 2013 at 8:52 PM, Alvaro Herrera wrote: Hmm, now if we had portable atomic addition, so that we could spare the spinlock ... And adding a histogram or min/max for something like execution time isn't an approach that can be made to work for every existing cost tracked by pg_stat_statements. So, taking all that into consideration, I'm afraid this patch gets a -1 from me. Just to confirm: does "this patch" mean my patch? I agree with you about not adding another lock implementation. It would add overhead. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/11/15 2:03), Fujii Masao wrote: On Thu, Nov 14, 2013 at 9:09 PM, KONDO Mitsumasa wrote: Hi, I created a patch that improves disk reads and OS file caching. It can optimize the kernel readahead parameter using the buffer access strategy and posix_fadvise() in various disk-read situations. When I compiled the HEAD code with this patch on MacOS, I got the following error and warnings. gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -I../../../../src/include -c -o fd.o fd.c fd.c: In function 'BufferHintIOAdvise': fd.c:1182: error: 'POSIX_FADV_SEQUENTIAL' undeclared (first use in this function) fd.c:1182: error: (Each undeclared identifier is reported only once fd.c:1182: error: for each function it appears in.) fd.c:1185: error: 'POSIX_FADV_RANDOM' undeclared (first use in this function) make[4]: *** [fd.o] Error 1 make[3]: *** [file-recursive] Error 2 make[2]: *** [storage-recursive] Error 2 make[1]: *** [install-backend-recurse] Error 2 make: *** [install-src-recurse] Error 2 tablecmds.c:9120: warning: passing argument 5 of 'smgrread' makes pointer from integer without a cast bufmgr.c:455: warning: passing argument 5 of 'smgrread' from incompatible pointer type Thank you for your report! I will fix it. Could you tell me your Mac OS version and gcc version? I have only a MacBook Air with Mavericks (10.9). Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
Hi Claudio, (2013/11/14 22:53), Claudio Freire wrote: On Thu, Nov 14, 2013 at 9:09 AM, KONDO Mitsumasa wrote: I created a patch that improves disk reads and OS file caching. It can optimize the kernel readahead parameter using the buffer access strategy and posix_fadvise() in various disk-read situations. In a general OS, the readahead parameter is decided dynamically from the disk-read situation: if long disk reads happen, the readahead window grows. However, since it is based on an experience-driven heuristic algorithm, it causes wasted disk reads and throws out useful OS file caches in some cases, which is very bad for disk-read performance. It would be relevant to know which kernel did you use for those tests. I use CentOS 6.4, whose kernel version is 2.6.32-358.23.2.el6.x86_64, in this test. A while back, I tried to use posix_fadvise to prefetch index pages. I searched for your past work. Are you talking about this ML thread? Or is there a more recent discussion? I see that your patch is interesting, but it wasn't submitted to a CF and the discussion stopped. http://www.postgresql.org/message-id/CAGTBQpZzf70n0PYJ=VQLd+jb3wJGo=2txmy+skjd6g_vjc5...@mail.gmail.com I ended up finding out that interleaving posix_fadvise with I/O like that severely hinders (ie: completely disables) the kernel's read-ahead algorithm. Your patch applies maximum readahead when a SQL statement selects an index range scan. Is that right? I think that your patch assumes that pages are ordered by index data. This assumption is partially wrong. If your assumption were true, we would not need the CLUSTER command. In actuality, the CLUSTER command performs better than doing nothing. How exactly did you set up those benchmarks? pg_bench defaults? My detailed test settings are as follows. * Server info CPU: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (2U/12C) RAM: 6GB -> I reduced it intentionally via an OS parameter, because large-memory tests take a long time. HDD: SEAGATE Model: ST2000NM0001 @ 7200rpm * 1 RAID: none. 
* postgresql.conf(summarized) shared_buffers = 600MB (10% of RAM = 6GB) work_mem = 1MB maintenance_work_mem = 64MB wal_level = archive fsync = on archive_mode = on checkpoint_segments = 300 checkpoint_timeout = 15min checkpoint_completion_target = 0.7 * pgbench settings pgbench -j 4 -c 32 -T 600 pgbench pg_bench does not exercise heavy sequential access patterns, or long index scans. It performs many single-page index lookups per transaction and that's it. Yes, your argument is right. And it is also a fact that performance becomes better in these situations. You may want to try your patch with more real workloads, and maybe you'll confirm what I found out last time I messed with posix_fadvise. If my experience is still relevant, those patterns will have suffered a severe performance penalty with this patch, because it will disable kernel read-ahead on sequential index access. It may still work for sequential heap scans, because the access strategy will tell the kernel to do read-ahead, but many other access methods will suffer. The decisive difference is that my patch uses the buffer hint control architecture, so it can control readahead more intelligently in some cases. However, my patch is still in progress and needs more improvement. I am going to add a method of controlling readahead by GUC, so that users can freely select the readahead parameter in their transactions. Try OLAP-style queries. I have the DBT-3 (TPC-H) benchmark tools. If you don't like TPC-H, could you tell me a good OLAP benchmark tool? Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Oh! Sorry... I forgot to attach my latest patch. Regards, -- Mitsumasa KONDO NTT Open Source Software Center diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql new file mode 100644 index 000..929d623 --- /dev/null +++ b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql @@ -0,0 +1,51 @@ +/* contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. \quit + +/* First we have to remove them from the extension */ +ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements; +ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(); + +/* Then we can drop them */ +DROP VIEW pg_stat_statements; +DROP FUNCTION pg_stat_statements(); + +/* Now redefine */ +CREATE FUNCTION pg_stat_statements( +OUT userid oid, +OUT dbid oid, +OUT query text, +OUT calls int8, +OUT total_time float8, +OUT min_time float8, +OUT max_time float8, +OUT stddev_time float8, +OUT rows int8, +OUT shared_blks_hit int8, +OUT shared_blks_read int8, +OUT shared_blks_dirtied int8, +OUT shared_blks_written int8, +OUT local_blks_hit int8, +OUT local_blks_read int8, +OUT local_blks_dirtied int8, +OUT local_blks_written int8, +OUT temp_blks_read int8, +OUT temp_blks_written int8, +OUT blk_read_time float8, +OUT blk_write_time float8 +) +RETURNS SETOF record +AS 'MODULE_PATHNAME' +LANGUAGE C; + +CREATE VIEW pg_stat_statements AS + SELECT * FROM pg_stat_statements(); + +GRANT SELECT ON pg_stat_statements TO PUBLIC; + +/* New Function */ +CREATE FUNCTION pg_stat_statements_reset_time() +RETURNS void +AS 'MODULE_PATHNAME' +LANGUAGE C; diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql new file mode 100644 index 000..d590acc --- /dev/null +++ 
b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql @@ -0,0 +1,52 @@ +/* contrib/pg_stat_statements/pg_stat_statements--1.2.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit + +-- Register functions. +CREATE FUNCTION pg_stat_statements_reset() +RETURNS void +AS 'MODULE_PATHNAME' +LANGUAGE C; + +CREATE FUNCTION pg_stat_statements_reset_time() +RETURNS void +AS 'MODULE_PATHNAME' +LANGUAGE C; + +CREATE FUNCTION pg_stat_statements( +OUT userid oid, +OUT dbid oid, +OUT query text, +OUT calls int8, +OUT total_time float8, +OUT min_time float8, +OUT max_time float8, +OUT stddev_time float8, +OUT rows int8, +OUT shared_blks_hit int8, +OUT shared_blks_read int8, +OUT shared_blks_dirtied int8, +OUT shared_blks_written int8, +OUT local_blks_hit int8, +OUT local_blks_read int8, +OUT local_blks_dirtied int8, +OUT local_blks_written int8, +OUT temp_blks_read int8, +OUT temp_blks_written int8, +OUT blk_read_time float8, +OUT blk_write_time float8 +) +RETURNS SETOF record +AS 'MODULE_PATHNAME' +LANGUAGE C; + +-- Register a view on the function for ease of use. +CREATE VIEW pg_stat_statements AS + SELECT * FROM pg_stat_statements(); + +GRANT SELECT ON pg_stat_statements TO PUBLIC; + +-- Don't want this to be available to non-superusers. 
+REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC; +REVOKE ALL ON FUNCTION pg_stat_statements_reset_time() FROM PUBLIC; diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile index e8aed61..5c63940 100644 --- a/contrib/pg_stat_statements/Makefile +++ b/contrib/pg_stat_statements/Makefile @@ -4,8 +4,10 @@ MODULE_big = pg_stat_statements OBJS = pg_stat_statements.o EXTENSION = pg_stat_statements -DATA = pg_stat_statements--1.1.sql pg_stat_statements--1.0--1.1.sql \ - pg_stat_statements--unpackaged--1.0.sql +DATA = pg_stat_statements--1.2.sql \ + pg_stat_statements--1.0--1.1.sql \ + pg_stat_statements--1.1--1.2.sql \ + pg_stat_statements--unpackaged--1.0.sql ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql index 5be281e..5662273 100644 --- a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql +++ b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql @@ -1,7 +1,7 @@ /* contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql */ -- complain if script is sourced in psql, rather than via ALTER EXTENSION -\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.1'" to load this file. \quit +\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. \quit /* First we have to remove them from the extension */ ALTER EXTENSION pg_stat_statements DROP VIEW pg_sta
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2013/10/21 20:17), KONDO Mitsumasa wrote: > (2013/10/18 22:21), Andrew Dunstan wrote: >> If we're going to extend pg_stat_statements, even more than min and max >> I'd like to see the standard deviation in execution time. > OK, I will! I am making some other patches, so please wait a bit longer! I added stddev_time and fixed some sources. The psql result of my latest patch is as follows.

userid              | 10
dbid                | 16384
query               | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;
calls               | 74746
total_time          | 1094.291998
min_time            | 0.007
max_time            | 15.091
stddev_time         | 0.100439187720684
rows                | 74746
shared_blks_hit     | 302346
shared_blks_read    | 6
shared_blks_dirtied | 161
shared_blks_written | 0
local_blks_hit      | 0
local_blks_read     | 0
local_blks_dirtied  | 0
local_blks_written  | 0
temp_blks_read      | 0
temp_blks_written   | 0
blk_read_time       | 0
blk_write_time      | 0

I haven't thought much about the order of the columns in this view; if you have any ideas, please tell me. And thanks for the many comments and the discussion; I am going to refer to them not only for this patch but also for the development of pg_statsinfo and pg_stats_reporter :-) Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Hi All, (2013/10/22 22:26), Stephen Frost wrote: * Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote: In our case, what I keep experiencing with tuning queries is that we have like 99% of them running under acceptable threshold and 1% of them taking more and more time. This is usually described (at least where I come from) as 'rare events', which goes to Tom's point that averages, stddev, etc, are not ideal (though they are still better than nothing). good reason to suppose that query runtime is Gaussian? (I'd bet not; in particular, multimodal behavior seems very likely due to things like plan changes.) If not, how much does that affect the usefulness of a standard-deviation calculation? Oscillating plan changes may fit multimodal but I don't feel that's typical. My experience has been it's either an extremely rare plan difference or it's a shift from one plan to another over time. After all, avg, min, max, and stddev are only numerical values for a predictive model. They don't have the robustness and strictness of something like Write-Ahead Logging; they resemble a weather forecast. But they are still better than nothing. Human judgment is ultimately needed to infer a cause from the numerical values. By the way, we can estimate the probability of a value from the stddev, so even if there is an extreme value in min/max, we can easily judge whether it is normal or not. What I've been gathering from my quick chat this morning is that either you know how to characterize the distribution and then the min max and average are useful on their own, or you need to keep track of an histogram where all the bins are of the same size to be able to learn what the distribution actually is. A histogram would certainly be useful. We may also wish to look into outlier/rare event detection methods and increase the logging we do in those cases (if possible). Of course keeping enough stats seems to always begin with keeping the min, max and average, so we can just begin there. 
We would just be unable to answer interesting questions with just that. It would probably be good to do some research into techniques for outlier detection which minimizes CPU and storage cost. pg_stat_statements is often used on production database systems, so I don't like a high-CPU-usage implementation. Software that lowers performance merely by being installed is too unpleasant to accept. And if we need more detailed information for SQL tuning, it would be better to develop another useful performance tuning and monitoring contrib module that is not intended for use on production database systems. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/22 12:52), Fujii Masao wrote: On Tue, Oct 22, 2013 at 12:47 PM, Amit Kapila wrote: On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa wrote: (2013/10/19 14:58), Amit Kapila wrote: On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa wrote: In general, my thinking is that we should prefer compression to reduce IO (WAL volume), because reducing WAL volume has other benefits as well like sending it to subscriber nodes. I think it will help cases where due to less n/w bandwidth, the disk allocated for WAL becomes full due to high traffic on master and then users need some alternative methods to handle such situations. Are you talking about archived WAL files? One of the points I am talking about is sending data over the network to subscriber nodes for streaming replication, and another is WAL in pg_xlog. Both scenarios benefit if there is less WAL volume. We can easily reduce that volume by adding a compression command to the copy command in archive_command. Okay. I think many users would like to use a method which can reduce WAL volume, and the users who don't find it useful enough in their environments, due to a decrease in TPS or no significant reduction in WAL, have the option to disable it. I favor selecting a compression algorithm for higher performance. If we need to compress WAL files more, in spite of worse performance, we can change the archive copy command to a high-compression algorithm and document how to compress archived WAL files in archive_command. Is that wrong? No, it is not wrong, but there are scenarios as mentioned above where less WAL volume can be beneficial. In fact, many NoSQL systems use snappy for the purpose of higher performance. Okay, you can also check the results with the snappy algorithm, but don't just rely completely on snappy for this patch; you might want to think of another alternative for this patch. 
So, our consensus is to introduce the hooks for FPW compression so that users can freely select their own best compression algorithm? Yes, it will also be good for future improvement. But I think WAL compression for a disaster recovery system should be done in the walsender and walreceiver processes; that is the proper architecture for a DR system. A higher-compression-ratio, higher-CPU-usage algorithm in FPW might hurt performance on the master server. If we can set the compression algorithm in walsender and walreceiver, performance stays the same as before or better, and WAL-send performance will improve. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2013/10/18 22:21), Andrew Dunstan wrote: > If we're going to extend pg_stat_statements, even more than min and max > I'd like to see the standard deviation in execution time. OK, I will! I am making some other patches, so please wait a bit longer! Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/19 14:58), Amit Kapila wrote: > On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa > wrote: > I think in general also snappy is mostly preferred for it's low CPU > usage not for compression, but overall my vote is also for snappy. I think low CPU usage is the most important factor in WAL compression. That is because WAL writes are sequential writes, so a small improvement in compression ratio cannot change PostgreSQL's performance, especially with a RAID card with a writeback cache. Furthermore, PG executes this in a single process, so a high-CPU-usage compression algorithm will cause worse performance. >> I found a compression algorithm test in HBase. I haven't read the details, but it >> indicates the snappy algorithm gets the best performance. >> http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of > > The dataset used for performance is quite different from the data > which we are talking about here (WAL). > "These are the scores for a data which consist of 700kB rows, each > containing a binary image data. They probably won’t apply to things > like numeric or text data." Yes, you are right. We need to test compression algorithms on WAL writes. >> I think it is necessary to make best efforts in community than I do the best >> choice with strict test. > > Sure, it is good to make effort to select the best algorithm, but if > you are combining this patch with inclusion of new compression > algorithm in PG, it can only make the patch to take much longer time. I think that if our direction is clearly decided, it will be easy to make the patch. The compression patch's direction hasn't become clear yet; it could become a troublesome patch, like the sync-rep patch. > In general, my thinking is that we should prefer compression to reduce > IO (WAL volume), because reducing WAL volume has other benefits as > well like sending it to subscriber nodes. 
I think it will help cases > where due to less n/w bandwidth, the disk allocated for WAL becomes > full due to high traffic on master and then users need some > alternative methods to handle such situations. Are you talking about archived WAL files? We can easily reduce that volume by adding a compression command to the copy command in archive_command. > I think many users would like to use a method which can reduce WAL > volume and the users which don't find it enough useful in their > environments due to decrease in TPS or not significant reduction in > WAL have the option to disable it. I favor selecting a compression algorithm for higher performance. If we need to compress WAL files more, in spite of worse performance, we can change the archive copy command to a high-compression algorithm and add documentation on how to compress archived WAL files in archive_command. Is that wrong? In fact, many NoSQL systems use snappy for the purpose of higher performance. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Who is pgFoundery administrator?
(2013/10/02 18:57), Michael Paquier wrote: wrote: Who is the pgFoundry administrator or a board member now? I would like to send them e-mail. At the least, there is no information or support page on the pgFoundry homepage. Why don't you consider github as a potential solution? Because github hosts such a wide variety of software, it is hard to find good PostgreSQL tools there :-( In contrast, pgFoundry makes it easy to find good tools for PostgreSQL :-) When we search for good PG tools on Google, it can be difficult to find our tools, because we don't do search-engine keyword optimization. I hope pgFoundry will become more robust and popular. It is still needed. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
[HACKERS] Improvement of pg_stat_statements usage: buffer hit ratio
Hi,

I am submitting a patch in CF3 that improves pg_stat_statements usability. In pg_stat_statements, the buffer hit ratio is a very important value, but it is difficult to calculate and requires complicated SQL. This patch makes both the usage and the documentation simpler:

> -bench=# SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
> -               nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
> +bench=# SELECT query, calls, total_time, rows, shared_blks_hit_percent
>            FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;

It becomes very simple :-)

This patch conflicts with the pg_stat_statement_min_max_exectime patch that I also submitted, which adds the new columns min_time and max_time. So I would like to adjust that patch as well, taking this opportunity.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index e8aed61..5c63940 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -4,8 +4,10 @@ MODULE_big = pg_stat_statements
 OBJS = pg_stat_statements.o

 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.1.sql pg_stat_statements--1.0--1.1.sql \
-	pg_stat_statements--unpackaged--1.0.sql
+DATA = pg_stat_statements--1.2.sql \
+	pg_stat_statements--1.0--1.1.sql \
+	pg_stat_statements--1.1--1.2.sql \
+	pg_stat_statements--unpackaged--1.0.sql

 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
index 5be281e..5662273 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
@@ -1,7 +1,7 @@
 /* contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql */

 -- complain if script is sourced in psql, rather than via ALTER EXTENSION
-\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.1'" to load this file. \quit
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. \quit

 /* First we have to remove them from the extension */
 ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
new file mode 100644
index 000..f0a8e0f
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
@@ -0,0 +1,63 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements();
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements();
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(
+    OUT userid oid,
+    OUT dbid oid,
+    OUT query text,
+    OUT calls int8,
+    OUT total_time float8,
+    OUT rows int8,
+    OUT shared_blks_hit int8,
+    OUT shared_blks_read int8,
+    OUT shared_blks_dirtied int8,
+    OUT shared_blks_written int8,
+    OUT local_blks_hit int8,
+    OUT local_blks_read int8,
+    OUT local_blks_dirtied int8,
+    OUT local_blks_written int8,
+    OUT temp_blks_read int8,
+    OUT temp_blks_written int8,
+    OUT blk_read_time float8,
+    OUT blk_write_time float8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE VIEW pg_stat_statements AS
+  SELECT userid,
+         dbid,
+         query,
+         calls,
+         total_time,
+         rows,
+         CASE WHEN shared_blks_hit + shared_blks_read > 0
+              THEN 100.0 * (shared_blks_hit::float / (shared_blks_hit + shared_blks_read))
+              ELSE 0 END AS shared_blks_hit_percent,
+         shared_blks_hit,
+         shared_blks_read,
+         shared_blks_dirtied,
+         shared_blks_written,
+         local_blks_hit,
+         local_blks_read,
+         local_blks_dirtied,
+         local_blks_written,
+         temp_blks_read,
+         temp_blks_written,
+         blk_read_time,
+         blk_write_time
+  FROM pg_stat_statements();
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1.sql
deleted file mode 100644
index 42e4d68..000
--- a/contrib/pg_stat_statements/pg_stat_statements--1.1.sql
+++ /dev/null
@@ -1,43 +0,0 @@
-/* contrib/pg_stat_statements/pg_stat_statements--1.1.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit
-
--- Register functions.
-CREATE FUNCTION pg_stat_statement
[HACKERS] Add min and max statement execution time in pg_stat_statements
I am submitting a patch for the next CF that adds minimum and maximum statement execution times to pg_stat_statements. pg_stat_statements already reports execution time, but only as an average, which does not provide much detail. So I add min and max execution time columns to pg_stat_statements. Usage is almost the same as before; however, I also add a pg_stat_statements_reset_time() function so that min_time and max_time can be measured over a specific period. It resets (re-initializes) the previously recorded min and max execution times.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index e8aed61..5c63940 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -4,8 +4,10 @@ MODULE_big = pg_stat_statements
 OBJS = pg_stat_statements.o

 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.1.sql pg_stat_statements--1.0--1.1.sql \
-	pg_stat_statements--unpackaged--1.0.sql
+DATA = pg_stat_statements--1.2.sql \
+	pg_stat_statements--1.0--1.1.sql \
+	pg_stat_statements--1.1--1.2.sql \
+	pg_stat_statements--unpackaged--1.0.sql

 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
index 5be281e..5662273 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
@@ -1,7 +1,7 @@
 /* contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql */

 -- complain if script is sourced in psql, rather than via ALTER EXTENSION
-\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.1'" to load this file. \quit
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. \quit

 /* First we have to remove them from the extension */
 ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1.sql
deleted file mode 100644
index 42e4d68..000
--- a/contrib/pg_stat_statements/pg_stat_statements--1.1.sql
+++ /dev/null
@@ -1,43 +0,0 @@
-/* contrib/pg_stat_statements/pg_stat_statements--1.1.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit
-
--- Register functions.
-CREATE FUNCTION pg_stat_statements_reset()
-RETURNS void
-AS 'MODULE_PATHNAME'
-LANGUAGE C;
-
-CREATE FUNCTION pg_stat_statements(
-    OUT userid oid,
-    OUT dbid oid,
-    OUT query text,
-    OUT calls int8,
-    OUT total_time float8,
-    OUT rows int8,
-    OUT shared_blks_hit int8,
-    OUT shared_blks_read int8,
-    OUT shared_blks_dirtied int8,
-    OUT shared_blks_written int8,
-    OUT local_blks_hit int8,
-    OUT local_blks_read int8,
-    OUT local_blks_dirtied int8,
-    OUT local_blks_written int8,
-    OUT temp_blks_read int8,
-    OUT temp_blks_written int8,
-    OUT blk_read_time float8,
-    OUT blk_write_time float8
-)
-RETURNS SETOF record
-AS 'MODULE_PATHNAME'
-LANGUAGE C;
-
--- Register a view on the function for ease of use.
-CREATE VIEW pg_stat_statements AS
-  SELECT * FROM pg_stat_statements();
-
-GRANT SELECT ON pg_stat_statements TO PUBLIC;
-
--- Don't want this to be available to non-superusers.
-REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql
index e84a3cb..0addba0 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql
@@ -4,5 +4,6 @@
 \echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit

 ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset();
+ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset_time();
 ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements();
 ALTER EXTENSION pg_stat_statements ADD view pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index ea930af..8f9b641 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -76,6 +76,7 @@ static const uint32 PGSS_FILE_HEADER = 0x20120328;
 #define USAGE_DECREASE_FACTOR	(0.99)	/* decreased every entry_dealloc */
 #define STICKY_DECREASE_FACTOR	(0.50)	/* factor for sticky entries */
 #define USAGE_DEALLOC_PERCENT	5		/* free this % of entries at once */
+#define EXEC_TIME_INIT			(-1)	/* initial execution time */

 #define JUMBLE_SIZE				1024	/* query serialization buffer size */

@@ -102,6 +103,8 @@ typedef struct Counters
 {
 	int64		calls;			/* # of times executed */
 	double		total_time;		/* total execution time, in msec */
+	double		min
Re: [HACKERS] Compression of full-page-writes
(2013/10/15 22:01), k...@rice.edu wrote:
> Google's lz4 is also a very nice algorithm with 33% better compression
> performance than snappy and 2X the decompression performance in some
> benchmarks also with a bsd license:
> https://code.google.com/p/lz4/
If we judged on performance alone, we would select lz4. However, we should also weigh other important factors: software robustness, track record, bug-fix history, and so on. If unknown bugs turn up, can we fix them or improve the algorithm? That seems very difficult, because we would only be using it without understanding its internals. Therefore, I think we had better select software that is robust and has a larger user base.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/15 13:33), Amit Kapila wrote:
> Snappy is good mainly for un-compressible data, see the link below:
> http://www.postgresql.org/message-id/CAAZKuFZCOCHsswQM60ioDO_hk12tA7OG3YcJA8v=4yebmoa...@mail.gmail.com
That result was obtained on an ARM architecture, which is not a typical server CPU. Please see this detailed discussion:
http://www.reddit.com/r/programming/comments/1aim6s/lz4_extremely_fast_compression_algorithm/c8y0ew9

I also found a compression algorithm comparison done for HBase. I have not read it in detail, but it indicates that snappy gets the best performance:
http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of
In fact, most modern NoSQL stores use snappy, because it has good performance and a good license (BSD).

It is a bit difficult to prove that any one algorithm is best for all kinds of loads; I think making best efforts in the community matters more than my making the single best choice through strict testing.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Release note fix for timeline item
Sorry for my late reply...

(2013/10/08 23:26), Bruce Momjian wrote:
> First, I want to apologize for not completing the release notes earlier
> so that others could review them.  I started working on the release
> notes on Friday, but my unfamiliarity with the process and fear of
> making a mistake caused many delays.  I have improved the documentation
> on the process which will hopefully help next time.
There isn't anything in particular that I was dissatisfied about.

> You are right that there is alot of details skipped in the release
> note text.  I have developed the attached patch which I think does a
> better job.  Is it OK?
Yes, of course! Thank you for your sincere action!

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/13 0:14), Amit Kapila wrote:
> On Fri, Oct 11, 2013 at 10:36 PM, Andres Freund wrote:
>> But maybe pglz is just not a good fit for this, it really isn't a very
>> good algorithm in this day and age.
+1. To avoid a CPU bottleneck, we need a compression algorithm faster than pglz, which is a general-purpose algorithm. pglz does not perform well; it is something of a fossil among compression algorithms. So we should switch to a modern algorithm for a better future.

> Do you think that if WAL reduction or performance with other
> compression algorithm (for ex. snappy) is better, then chances of
> getting the new compression algorithm in postgresql will be more?
Recent papers on compression algorithms (including snappy) have indicated as much. I think that is enough basis for selecting an algorithm, and it could be good work for postgres as well.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/08 20:13), Haribabu kommi wrote:
> I chosen the sync_commit=off mode because it generates more tps, thus
> it increases the volume of WAL.
I had not thought of that. Sorry...

> I will test with sync_commit=on mode and provide the test results.
OK. Thanks!
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
Hi,

I tested the dbt-2 benchmark on a single instance and with synchronous replication. Unfortunately, my benchmark results show little difference...

* Test server
  Server: HP Proliant DL360 G7
  CPU: Xeon E5640 2.66GHz (1P/4C)
  Memory: 18GB (PC3-10600R-9)
  Disk: 146GB (15k rpm) * 4, RAID 1+0
  RAID controller: P410i/256MB

* Result

** Single instance **
           |  NOTPM  |  90%tile  | Average | S.Deviation
-----------+---------+-----------+---------+------------
no-patched | 3322.93 | 20.469071 |   5.882 |     10.478
patched    | 3315.42 | 19.086105 |   5.669 |      9.108

** Synchronous Replication **
           |  NOTPM  |  90%tile  | Average | S.Deviation
-----------+---------+-----------+---------+------------
no-patched | 3275.55 | 21.332866 |   6.072 |      9.882
patched    | 3318.82 | 18.141807 |   5.757 |      9.829

** Detail of result
http://pgstatsinfo.projects.pgfoundry.org/DBT-2_Fujii_patch/

I set full_page_writes = compress with Fujii's patch in DBT-2, but it does not seem to be effective at reducing the number of WAL files. I will run the DBT-2 benchmark once more, and also try plain pgbench on my test server.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/08 17:33), Haribabu kommi wrote:
> The checkpoint_timeout and checkpoint_segments are increased to make
> sure no checkpoint happens during the test run.
With your workload, checkpoints still occur easily even at checkpoint_segments = 256. I don't know the number of disks in your test server, but on my test server, which has 4 magnetic disks (15k rpm), postgres generates 50 - 100 WAL segments per minute. And I cannot understand your sync_commit = off setting: it tends to cause a CPU bottleneck and data loss, and it is not typical of database usage. Therefore, your test is not a fair comparison for Fujii's patch.

Going back to my DBT-2 benchmark: I have not gotten good performance (almost the same as unpatched), so I am now checking the hunks, my settings, and whether something is wrong in Fujii's patch. I am going to send test results tonight.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Who is pgFoundry administrator?
(2013/10/02 17:37), KONDO Mitsumasa wrote:
> I want to submit a new project to pgFoundry.
Our new project was approved yesterday! Thanks very much to the pgFoundry crew.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
[HACKERS] Who is pgFoundry administrator?
Hi,

I want to register a new project on pgFoundry: a WAL archive copy tool that uses direct I/O. I submitted it through the pgFoundry homepage 2 weeks ago, but it has not been approved, and I have received no response at all :-(

Who is the pgFoundry administrator or a board member now? I would like to send them e-mail. At the least, the pgFoundry homepage has no contact or support page.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/09/30 13:55), Amit Kapila wrote:
> On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao wrote:
>> Yep, please! It's really helpful!
OK! I will test with both a single instance and a synchronous replication configuration.

By the way, you posted a patch for a sync_file_range() WAL writing method 3 years ago. I think it is also good for performance. The reason: reading sync_file_range() and fdatasync() in the latest linux kernel code (3.9.11), fdatasync() writes out the dirty buffers of the whole file, whereas sync_file_range() writes out only the dirty buffers of the given range. In more detail, both end up in the same kernel function: fdatasync() is vfs_fsync_range(file, 0, LLONG_MAX, 1), while sync_file_range() is vfs_fsync_range(file, offset, amount, 1). It is obvious which is more efficient for WAL writing. You can confirm it in the linux kernel's git; I think it will deepen your conviction.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/sync.c?id=refs/tags/v3.11.2

> I think it will be useful if you can get the data for 1 and 2 threads
> (may be with pgbench itself) as well, because the WAL reduction is
> almost sure, but the only thing is that it should not dip tps in some
> of the scenarios.
That's right. I also want to see this patch on an MD environment, because MD is strong at sequential writes, which is exactly the WAL write pattern.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
Hi Fujii-san,

(2013/09/30 12:49), Fujii Masao wrote:
> On second thought, the patch could compress WAL very much because I
> used pgbench. I will do the same measurement by using another
> benchmark.
If you like, I can test this patch with the DBT-2 benchmark at the end of this week. I will use the following test server:

* Test server
  Server: HP Proliant DL360 G7
  CPU: Xeon E5640 2.66GHz (1P/4C)
  Memory: 18GB (PC3-10600R-9)
  Disk: 146GB (15k rpm) * 4, RAID 1+0
  RAID controller: P410i/256MB

This is the PG-REX test server, as you know.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2013/09/27 5:29), Peter Eisentraut wrote:
> This patch no longer applies.
I will rework this patch for the next commitfest. If you have a good idea, please send it to me!

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Sorry for my delayed reply. I was on vacation last week and replied from gmail, but the post stalled on its way to pgsql-hackers :-(

(2013/09/21 7:54), Fabien COELHO wrote:
> However this pattern induces stronger cache effects which are maybe
> not too realistic, because neighboring keys in the middle are more
> likely to be chosen.
I think your opinion is right. However, this is in effect a pseudo-benchmark, so I think such a simple mechanism is also useful.

> Have you considered adding a "randomization" layer, that is once you
> have a key in [1 .. n] centered around n/2, then you perform a
> pseudo-random transformation into the same domain so that key values
> are scattered over the whole domain?
Yes, I have considered that for this patch as well. It can be realized by adding a linear mapping array created by a random generator. However, the current erand48 algorithm is not very accurate and is something of a fossil, so I do not know whether it would work well. To realize it, we may need a more accurate random generator algorithm such as Mersenne Twister.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Sorry for my delayed reply. I was on vacation last week and replied from gmail, but the post stalled on its way to pgsql-hackers :-(

(2013/09/21 6:05), Kevin Grittner wrote:
> You had accidentally added to the CF In Progress.
Oh, I had completely mistaken the CF schedule :-) Maybe Horiguchi-san is in the same situation... However, thanks to your moving it, I have become the first submitter in the next CF. Thank you for moving it!
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] 9.4 regression
(2013/09/05 0:04), Andres Freund wrote:
> I'd vote for adding zeroing *after* the fallocate() first.
+1, with the FALLOC_FL_KEEP_SIZE flag. At least in my in-development sorted checkpoint, fallocate with FALLOC_FL_KEEP_SIZE is faster than doing nothing. I applied it to relation files, though, so I don't know about WAL files. With FALLOC_FL_KEEP_SIZE, fallocate does not change the file size and does not zero-fill; it seems to only reserve contiguous physical disk space in the file system. Here is a draft patch.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dc47c47..85b582c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3409,6 +3409,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));

+#if defined(HAVE_FALLOCATE)
+	fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, XLogSegSize);
+#endif
+
 	/*
 	 * Zero-fill the file. We have to do this the hard way to ensure that all
 	 * the file space has really been allocated --- on platforms that allow
[HACKERS] Add pgbench option: CHECKPOINT before starting benchmark
Hi,

I have added a checkpoint option to pgbench. pgbench is a simple and useful benchmark for every user. However, benchmark results change greatly with circumstances: whether a checkpoint is executing, the number of dirty buffers in shared_buffers, and so on. Because of this, it is customary to run a checkpoint before starting a benchmark, but writing the wrapper script takes time, as in the following:

  psql -U postgres -d pgbench -p5432 -c "CHECKPOINT"
  pgbench -T 600 -c 12 -j4 -U postgres -d pgbench -p 5432

This script also has a problem: it executes "CHECKPOINT" -> "VACUUM" -> "start benchmark". If relations have many dirty pages, VACUUM generates dirty buffers in shared_buffers, which will cause a heavy checkpoint during the run. I think pgbench should be an easier and more accurate benchmark tool for everyone, so I run a checkpoint just before starting the benchmark. This patch's output looks like:

-------------------------------------------------------
[mitsu-ko@localhost pgbench]$ ./pgbench
starting vacuum...end.
starting checkpoint...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
tps = 312.851958 (including connections establishing)
tps = 364.524478 (excluding connections establishing)
-------------------------------------------------------

It executes "VACUUM" -> "CHECKPOINT" -> "start benchmark", which I think is the ideal order for a more accurate benchmark. My patch's option definition is:

[mitsu-ko@localhost pgbench]$ ./pgbench --help
~
 -N, --no-checkpoint      do not run CHECKPOINT after initialization
~

In the latest committed pgbench, -N is "--skip-some-updates  skip updates of pgbench_tellers and pgbench_branches". I cannot understand why -N was chosen for that option, so I moved it to -u and made -N the "do not run CHECKPOINT" option. What do you think?
--
Mitsumasa KONDO
NTT Open Source Software Center

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index ad8e272..523c278 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -347,6 +347,7 @@ usage(void)
 		   "  -i, --initialize         invokes initialization mode\n"
 		   "  -F, --fillfactor=NUM     set fill factor\n"
 		   "  -n, --no-vacuum          do not run VACUUM after initialization\n"
+		   "  -N, --no-checkpoint      do not run CHECKPOINT after initialization\n"
 		   "  -q, --quiet              quiet logging (one message each 5 seconds)\n"
 		   "  -s, --scale=NUM          scaling factor\n"
 		   "  --foreign-keys           create foreign key constraints between tables\n"
@@ -366,7 +367,8 @@ usage(void)
 		   "                           protocol for submitting queries "
 		   "(default: simple)\n"
 		   "  -n, --no-vacuum          do not run VACUUM before tests\n"
-		   "  -N, --skip-some-updates  skip updates of pgbench_tellers and pgbench_branches\n"
+		   "  -N, --no-checkpoint      do not run CHECKPOINT before tests\n"
+		   "  -u, --skip-some-updates  skip updates of pgbench_tellers and pgbench_branches\n"
 		   "  -P, --progress=NUM       show thread progress report every NUM seconds\n"
 		   "  -r, --report-latencies   report average latency per command\n"
 		   "  -R, --rate=SPEC          target rate in transactions per second\n"
@@ -1520,7 +1522,7 @@ disconnect_all(CState *state, int length)

 /* create tables and setup data */
 static void
-init(bool is_no_vacuum)
+init(bool is_no_vacuum, bool is_no_checkpoint)
 {
 /*
  * The scale factor at/beyond which 32bit integers are incapable of storing
  * 64bit values.
@@ -1775,6 +1777,12 @@ init(bool is_no_vacuum)
 		}
 	}

+	/* checkpoint */
+	if (!is_no_checkpoint)
+	{
+		fprintf(stderr, "checkpoint...\n");
+		executeStatement(con, "checkpoint");
+	}
 	fprintf(stderr, "done.\n");

 	PQfinish(con);
@@ -2213,6 +2221,7 @@ main(int argc, char **argv)
 		{"jobs", required_argument, NULL, 'j'},
 		{"log", no_argument, NULL, 'l'},
 		{"no-vacuum", no_argument, NULL, 'n'},
+		{"no-checkpoint", no_argument, NULL, 'N'},
 		{"port", required_argument, NULL, 'p'},
 		{"progress", required_argument, NULL, 'P'},
 		{"protocol", required_argument, NULL, 'M'},
@@ -2220,7 +2229,7 @@ main(int argc, char **argv)
 		{"report-latencies", no_argument, NULL, 'r'},
 		{"scale", required_argument, NULL, 's'},
 		{"select-only", no_argument, NULL, 'S'},
-		{"skip-some-updates", no_argument, NULL, 'N'},
+		{"skip-some-updates", no_argument, NULL, 'u'},
 		{"time", required_argument, NULL, 'T'},
 		{"transactions", required_argument, NULL, 't'},
 		{"username", required_argument, NULL, 'U'},
@@ -2240,6 +2249,7 @@ main(int argc, char **argv)
 	int			nclients = 1;	/* default number of simulated clients */
 	int			nthreads = 1;	/* default number of threads */
 	int			is_init_mode = 0;		/* initialize mode? */
+	int			is_no_checkpoint = 0;	/* no checkpoint at all before testing?