Re: [HACKERS] gaussian distribution pgbench
And I found another useful point of this feature. The following results are for the '--gaussian=20' case and the '--gaussian=2' case, with the same postgresql settings.

> [mitsu-ko@pg-rex31 pgbench]$ ./pgbench -c8 -j4 --gaussian=20 -T30 -P 5
> starting vacuum...end.
> progress: 5.0 s, 4285.8 tps, lat 1.860 ms stddev 0.425
> progress: 10.0 s, 4249.2 tps, lat 1.879 ms stddev 0.372
> progress: 15.0 s, 4230.3 tps, lat 1.888 ms stddev 0.430
> progress: 20.0 s, 4247.3 tps, lat 1.880 ms stddev 0.400
> LOG: checkpoints are occurring too frequently (12 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 25.0 s, 4269.0 tps, lat 1.870 ms stddev 0.427
> progress: 30.0 s, 4318.1 tps, lat 1.849 ms stddev 0.415
> transaction type: Gaussian distribution TPC-B (sort of)
> scaling factor: 10
> standard deviation threshold: 20.0
> access probability of top 20%, 10% and 5% records: 0.4 0.95450 0.68269
> query mode: simple
> number of clients: 8
> number of threads: 4
> duration: 30 s
> number of transactions actually processed: 128008
> latency average: 1.871 ms
> latency stddev: 0.412 ms
> tps = 4266.266374 (including connections establishing)
> tps = 4267.312022 (excluding connections establishing)
> [mitsu-ko@pg-rex31 pgbench]$ ./pgbench -c8 -j4 --gaussian=2 -T30 -P 5
> starting vacuum...end.
> LOG: checkpoints are occurring too frequently (13 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (1 second apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 5.0 s, 3927.9 tps, lat 2.030 ms stddev 0.691
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (1 second apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 10.0 s, 4045.8 tps, lat 1.974 ms stddev 0.835
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (1 second apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 15.0 s, 4042.5 tps, lat 1.976 ms stddev 0.613
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 20.0 s, 4103.9 tps, lat 1.946 ms stddev 0.540
> LOG: checkpoints are occurring too frequently (1 second apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 25.0 s, 4003.1 tps, lat 1.995 ms stddev 0.526
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (1 second apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> LOG: checkpoints are occurring too frequently (2 seconds apart)
> HINT: Consider increasing the configuration parameter "checkpoint_segments".
> progress: 30.0 s, 4025.5 tps, lat 1.984 ms stddev 0.568
> transaction type: Gaussian distribution TPC-B (sort of)
> scaling factor: 10
> standard deviation threshold: 2.0
> access probability of top 20%, 10% and 5% records: 0.32566 0.16608 0.08345
> query mode: simple
> number of clients: 8
> number of threads: 4
> duration: 30 s
> number of transactions actually processed: 120752
> latency average: 1.984 ms
> latency stddev: 0.638 ms
> tps = 4024.823433 (including connections establishing)
> tps = 4025.87 (excluding connections establishing)

In the '--gaussian=2' benchmark, checkpoints happen much more frequently than in the '--gaussian=20' benchmark, because the former updates a wider range of records, so more full-page writes are generated.
Re: [HACKERS] gaussian distribution pgbench
(2014/03/17 23:29), Robert Haas wrote:

On Sat, Mar 15, 2014 at 4:50 AM, Mitsumasa KONDO wrote: There are explanations and computations as comments in the code. If it is about the documentation, I'm not sure that a very precise mathematical definition will help a lot of people, and might rather hinder understanding, so the doc focuses on an intuitive explanation instead.

Yeah, I think we had better explain only the information necessary for using this feature. If we add mathematical theory to the docs, it will be too difficult for users, and it is wasted effort.

Well, if you *don't* include at least *some* mathematical description of what the feature does in the documentation, then users who need to understand it will have to read the source code to figure it out, which is going to be even more difficult.

I have fixed this problem. Please see the v12 patch. It doesn't include a mathematical description, but users will be able to get an intuitive understanding from the explanation in the document.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench
(2014/03/17 22:37), Tom Lane wrote:
> KONDO Mitsumasa writes:
>> (2014/03/17 18:02), Heikki Linnakangas wrote:
>>> On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote:
>>> There is an infinite number of variants of the TPC-B test that we could include in pgbench. If we start adding every one of them, we're quickly going to have hundreds of options to choose the workload. I'd like to keep pgbench simple. These two new test variants, gaussian and exponential, are not that special that they'd deserve to be included in the program itself.
>
>> Well, I added only two options, and they are major distributions, seen in real database systems more often than the uniform distribution. I'm afraid you are too worried; hundreds of options will not be added, and pgbench is still simple.
>
> FWIW, I concur with Heikki on this. Adding new versions of \setrandom is useful functionality. Embedding them in the "standard" test is not, because that just makes it (even) less standard. And pgbench has too darn many switches already.

Hmm, I cooled down and looked over the pgbench options. I can understand his argument: there are many switches already, and the list will only grow larger unless we stop adding new options. However, everyone who added an option in the past thought it would be useful for improving PostgreSQL performance. Now those same options are held against new ones, like my feature, which can create benchmark distributions closer to real systems. I think that is very unfortunate, and it also tends to stall progress on improving PostgreSQL performance, not only in pgbench. And if we remove the command-line options, new features of this kind will tend to be rejected, which is not good either.

By the way, if we remove the command-line option, it becomes difficult to understand the shape of the gaussian distribution, because the threshold parameter is very sensitive, and the printed summary is a very useful feature.
Analyzing and visualizing pgbench_history with SQL is difficult and laborious. What do you think about this problem? It has not been discussed yet.

> [mitsu-ko@pg-rex31 pgbench]$ ./pgbench --gaussian=2
> ~
> access probability of top 20%, 10% and 5% records: 0.32566 0.16608 0.08345
> ~
> [mitsu-ko@pg-rex31 pgbench]$ ./pgbench --gaussian=4
> ~
> access probability of top 20%, 10% and 5% records: 0.57633 0.31086 0.15853
> ~
> [mitsu-ko@pg-rex31 pgbench]$ ./pgbench --gaussian=10
> ~
> access probability of top 20%, 10% and 5% records: 0.95450 0.68269 0.38292
> ~

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/17 18:02), Heikki Linnakangas wrote:

On 03/17/2014 10:40 AM, KONDO Mitsumasa wrote: By the way, you seem to want to remove the --gaussian=NUM and --exponential=NUM command options. Can you tell me the objective reason? I think pgbench is the benchmark tool for PostgreSQL, and the default benchmark is the TPC-B-like benchmark. That is written in the documentation, and the default benchmark was not changed by my patch. So we need not remove the command options; they are one of a variety of benchmark options. Maybe you have some misunderstanding about my patch...

There is an infinite number of variants of the TPC-B test that we could include in pgbench. If we start adding every one of them, we're quickly going to have hundreds of options to choose the workload. I'd like to keep pgbench simple. These two new test variants, gaussian and exponential, are not that special that they'd deserve to be included in the program itself.

Well, I added only two options, and they are major distributions, seen in real database systems more often than the uniform distribution. I'm afraid you are too worried; hundreds of options will not be added, and pgbench is still simple.

pgbench already has a mechanism for running custom scripts, in which you can specify whatever workload you want. Let's use that. If it's missing something you need to specify the workload you want, let's enhance the script language.

I have not seen many users who use pgbench custom scripts. Gaussian and exponential distributions are much better for measuring real system performance, so I'd like to offer them as command-line options. With pgbench as it stands, we can only vary the database size, which is not a realistic situation. We want to forecast the required system by calculating the size of the hot spot or the distribution of the access pattern. I'd really like to see this included, from my heart :) Please...
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/17 17:46), Heikki Linnakangas wrote:

On 03/15/2014 08:53 AM, Fabien COELHO wrote:
>* Does min and max really make sense for gaussian and exponential
>distributions? For gaussian, I would expect mean and standard deviation as
>the parameters, not min/max/threshold.

Yes... and no :-) The aim is to draw an integer primary key from a table, so it must be in a specified range.

Well, I don't agree with that aim. It's useful for choosing a primary key, as in the pgbench TPC-B workload, but a gaussian distributed random number could be used for many other things too. For example:

\setrandom foo ... gaussian
select * from cheese where weight > :foo

And :foo should be a float, not an integer. That's what I was trying to say earlier, when I said that the variable should be a float. If you need an integer, just cast or round it in the query. I realize that the current \setrandom sets the variable to an integer, so gaussian/exponential would be different. But so what? An option to generate uniformly distributed floats would be handy too, though.

Well, that seems like a new feature. If you want to realize it as a double, we could add '\setrandomd' as a double random generator in pgbench. I would agree with that.

This is approximated by drawing a double value with the expected distribution (gaussian or exponential) and projecting it carefully onto integers. If it is out of range, there is a loop and another value is drawn. The minimal threshold constraint (2.0) ensures that the probability of looping is low.

Well, that's one way to constrain it to the given range, but there are many other ways to do it. Like, clamp it to the min/max if it's out of range.

That is too heavy a method; the client-side calculation must be light.

I don't think we need to choose any particular method, you can handle that in the test script.

I think our implementation is the best way to realize it: it is fast, and it is robust because the probability of looping is low. If you have a better idea, please tell us.
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Hi Heikki-san,

(2014/03/17 14:39), KONDO Mitsumasa wrote: (2014/03/15 15:53), Fabien COELHO wrote: Hello Heikki,

A couple of comments:

* There should be an explicit "\setrandom ... uniform" option too, even though you get that implicitly if you don't specify the distribution

Fixed. We can now use "\setrandom val min max uniform" without error messages.

* What exactly does the "threshold" mean? The docs informally explain that "the larger the threshold, the more frequent values close to the middle of the interval are drawn", but that's pretty vague.

There are explanations and computations as comments in the code. If it is about the documentation, I'm not sure that a very precise mathematical definition will help a lot of people, and might rather hinder understanding, so the doc focuses on an intuitive explanation instead.

I added more detailed information to the document. Is it OK? Please confirm it.

* Does min and max really make sense for gaussian and exponential distributions? For gaussian, I would expect mean and standard deviation as the parameters, not min/max/threshold.

Yes... and no :-) The aim is to draw an integer primary key from a table, so it must be in a specified range. This is approximated by drawing a double value with the expected distribution (gaussian or exponential) and projecting it carefully onto integers. If it is out of range, there is a loop and another value is drawn. The minimal threshold constraint (2.0) ensures that the probability of looping is low.

That makes sense. Please see the picture attached the other day.

* How about setting the variable as a float instead of integer? Would seem more natural to me. At least as an option.

Which variable? The values set by setrandom are mostly used for primary keys. We really want integers in a range.

Oh, I see; he was talking about the documentation. The document was mistaken: the threshold parameter must be a double, so I fixed the document.

By the way, you seem to want to remove the --gaussian=NUM and --exponential=NUM command options.
Can you tell me the objective reason? I think pgbench is the benchmark tool for PostgreSQL, and the default benchmark is the TPC-B-like benchmark. That is written in the documentation, and the default benchmark was not changed by my patch. So we need not remove the command options; they are one of a variety of benchmark options. Maybe you have some misunderstanding about my patch...

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***
*** 98,103
  static int pthread_join(pthread_t th, void **thread_return);
--- 98,106
  #define LOG_STEP_SECONDS 5 /* seconds between log messages */
  #define DEFAULT_NXACTS 10 /* default nxacts */
+ #define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */
+ #define MIN_EXPONENTIAL_THRESHOLD 2.0 /* minimum threshold for exp */
+
  int nxacts = 0; /* number of transactions per client */
  int duration = 0; /* duration in seconds */
***
*** 169,174
  bool is_connect; /* establish connection for each transaction */
--- 172,185
  bool is_latencies; /* report per-command latencies */
  int main_pid; /* main process id used in log filename */
+ /* gaussian distribution tests: */
+ double stdev_threshold; /* standard deviation threshold */
+ bool use_gaussian = false;
+
+ /* exponential distribution tests: */
+ double exp_threshold; /* threshold for exponential */
+ bool use_exponential = false;
+
  char *pghost = "";
  char *pgport = "";
  char *login = NULL;
***
*** 330,335
  static char *select_only = {
--- 341,428
  "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ "\\set nbranches " CppAsS
Re: [HACKERS] gaussian distribution pgbench
(2014/03/15 15:53), Fabien COELHO wrote:

Hello Heikki,

A couple of comments:

* There should be an explicit "\setrandom ... uniform" option too, even though you get that implicitly if you don't specify the distribution

Indeed. I agree. I suggested it, but it got lost.

* What exactly does the "threshold" mean? The docs informally explain that "the larger the threshold, the more frequent values close to the middle of the interval are drawn", but that's pretty vague.

There are explanations and computations as comments in the code. If it is about the documentation, I'm not sure that a very precise mathematical definition will help a lot of people, and might rather hinder understanding, so the doc focuses on an intuitive explanation instead.

* Does min and max really make sense for gaussian and exponential distributions? For gaussian, I would expect mean and standard deviation as the parameters, not min/max/threshold.

Yes... and no :-) The aim is to draw an integer primary key from a table, so it must be in a specified range. This is approximated by drawing a double value with the expected distribution (gaussian or exponential) and projecting it carefully onto integers. If it is out of range, there is a loop and another value is drawn. The minimal threshold constraint (2.0) ensures that the probability of looping is low.

* How about setting the variable as a float instead of integer? Would seem more natural to me. At least as an option.

Which variable? The values set by setrandom are mostly used for primary keys. We really want integers in a range.

Oh, I see; he was talking about the documentation. The document currently says:

+ Moreover, set gaussian or exponential with threshold integer value,
+ we can get gaussian or exponential random in integer value between
+ min and max bounds inclusive.

Correctly, it should be:

+ Moreover, set gaussian or exponential with threshold double value,
+ we can get gaussian or exponential random in integer value between
+ min and max bounds inclusive.
And I am going to make the document easier for users to understand.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/13 23:00), Fujii Masao wrote:

On Thu, Mar 13, 2014 at 10:51 PM, Heikki Linnakangas wrote: On 03/13/2014 03:17 PM, Fujii Masao wrote: On Tue, Mar 11, 2014 at 1:49 PM, KONDO Mitsumasa wrote: (2014/03/09 1:49), Fabien COELHO wrote: I'm okay with this UI and its implementation.

OK. Should we have the same discussion for the UI of the command-line options? The patch adds two options --gaussian and --exponential, but this UI seems to be a bit inconsistent with the UI for \setrandom. Instead, we can use something like --distribution=[uniform | gaussian | exponential].

IMHO we should just implement the \setrandom changes, and not add any of these options to modify the standard test workload. If someone wants to run the TPC-B workload with gaussian or exponential distribution, they can implement it as a custom script. The docs include the script for the standard TPC-B workload; just copy-paste that and modify the \setrandom lines.

Well, when we set '--gaussian=NUM' or '--exponential=NUM' on the command line, we can see the access probability of the top N% of records in the final output. The output looks like the following:

[mitsu-ko@localhost pgbench]$ ./pgbench --exponential=10 postgres
starting vacuum...end.
transaction type: Exponential distribution TPC-B (sort of)
scaling factor: 1
exponential threshold: 10.0
access probability of top 20%, 10% and 5% records: 0.86466 0.63212 0.39347
~

This feature helps the user understand the bias of the distribution when tuning the threshold parameter. Without it, it is difficult to understand the distribution of the access pattern, and it cannot be realized in a custom script, because the range of the distribution (min, max, and SQL pattern) is unknown in a custom script. So I think the present UI is not bad and should not change.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Hi,

(2014/03/14 4:21), Fabien COELHO wrote:

We should do the same discussion for the UI of the command-line options? The patch adds two options --gaussian and --exponential, but this UI seems to be a bit inconsistent with the UI for \setrandom. Instead, we can use something like --distribution=[uniform | gaussian | exponential].

Hmmm. That is possible, obviously. Note that it does not need to resort to a custom script, if one can do something like "--define=exp_threshold=5.6".

Yeah, the threshold parameter is needed by the distribution-generating algorithms in my patch, and it is important that we can control the distribution pattern with this parameter.

If so, maybe one simpler named variable could be used, say "threshold", instead of separate names for each options.

If we use a single generic "threshold" option, I think it becomes difficult to understand which option it depends on. "Threshold" is a very general term, and when we add other new features later, it will be hard to tell which option the parameter belongs to and when it is required.

However there is a catch: currently the option allows to check that the threshold is large enough so as to avoid loops in the generator. So this means moving the check into the generator, and doing it over and over. Possibly this is a good idea, because otherwise a custom script could circumvent the check. Well, the current status is that the check can be avoided with --define... Also, a shorter possibly additional name would be nice, maybe something like: --dist=exp|gauss|uniform? Not sure. I like long options not to be too long.

Well, if we run the standard benchmark in pgbench, we need not set any option, because it is the default benchmark and it uses the uniform distribution. And if we run extra benchmarks in pgbench, such as '-S' or '-N', we need to set an option, because they are non-standard benchmark settings; the same reasoning applies to the gaussian and exponential distributions. So the present UI keeps consistency and follows pgbench history.
> I like long options not to be too long.

Yes, I like that too. The present UI is very simple and useful in combination with options such as '-S' and '--gaussian', so I hope the UI does not change. For example:

pgbench -S --gaussian=5
pgbench -N --exponential=2 --sampling-rate=0.8

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/09 1:49), Fabien COELHO wrote:

Hello Mitsumasa-san,

The new "\setrandom" interface is here:

\setrandom var min max [gaussian threshold | exponential threshold]

The attached patch realizes this interface, but it has slightly ugly coding in executeStatement() and process_commands()..

I think it is not too bad. The "ignore extra arguments on the line" is a little pre-existing mess anyway.

All right.

What do you think?

I'm okay with this UI and its implementation.

OK. The attached patch updates the documentation. I don't like complex sentences, so I use tags a lot. If you like this documentation, please mark it ready for committer.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***
*** 98,103
  static int pthread_join(pthread_t th, void **thread_return);
--- 98,106
  #define LOG_STEP_SECONDS 5 /* seconds between log messages */
  #define DEFAULT_NXACTS 10 /* default nxacts */
+ #define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */
+ #define MIN_EXPONENTIAL_THRESHOLD 2.0 /* minimum threshold for exp */
+
  int nxacts = 0; /* number of transactions per client */
  int duration = 0; /* duration in seconds */
***
*** 169,174
  bool is_connect; /* establish connection for each transaction */
--- 172,185
  bool is_latencies; /* report per-command latencies */
  int main_pid; /* main process id used in log filename */
+ /* gaussian distribution tests: */
+ double stdev_threshold; /* standard deviation threshold */
+ bool use_gaussian = false;
+
+ /* exponential distribution tests: */
+ double exp_threshold; /* threshold for exponential */
+ bool use_exponential = false;
+
  char *pghost = "";
  char *pgport = "";
  char *login = NULL;
***
*** 330,335
  static char *select_only = {
--- 341,428
  "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --exponential with -S case */
+ static char *exponential_select_only = {
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ };
+
+ /* --gaussian case */
+ static char *gaussian_tpc_b = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts gaussian :stdev_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --gaussian with -N case */
+ static char *gaussian_simple_update = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "
Re: [HACKERS] gaussian distribution pgbench
(2014/03/07 16:02), KONDO Mitsumasa wrote:

And the other cases are classified as follows:

\setrandom var min max gaussian #hoge --> uniform

Oh, that's wrong... It should be:

\setrandom var min max gaussian #hoge --> ERROR

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Hi,

(2014/03/04 17:42), KONDO Mitsumasa wrote:
> (2014/03/04 17:28), Fabien COELHO wrote:
>>> OK. I'm not sure which idea is the best, so I will wait for comments from the community :)
>> Hmmm. Maybe you can do what Tom voted for, he is the committer :-)
> Yeah, but he might change his mind through our discussion. So I will wait until tomorrow, and if there are no comments, I will start implementing what Tom voted for.

I created a patch with the fixed UI. If we agree on this interface, I will also start fixing the documentation. The new "\setrandom" interface is:

\setrandom var min max [gaussian threshold | exponential threshold]

The attached patch realizes this interface, but it has slightly ugly coding in executeStatement() and process_commands().. It looks like the following:

if (argc == 4)
{
    ... /* uniform */
}
else if (argv[4] is gaussian or exponential)
{
    ... /* gaussian or exponential */
}
else
{
    ... /* uniform with extra arguments */
}

This is because the pgbench custom script allows extra comments or extra arguments in its file. For example, the following cases are no problem:

\setrandom var min max #hoge --> uniform random
\setrandom var min max #hoge1 #hoge2 --> uniform random
\setrandom var min max gaussian threshold #hoge --> gaussian random

And the other cases are classified as follows:

\setrandom var min max gaussian #hoge --> uniform
\setrandom var min max max2 gaussian threshold --> uniform
\setrandom var min gaussian #hoge --> ERROR

However, if the grammar in a pgbench custom script is wrong, pgbench outputs an error log on the user's terminal. So I think there is no particular problem. What do you think?
Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

*** a/contrib/pgbench/pgbench.c
--- b/contrib/pgbench/pgbench.c
***
*** 98,103
  static int pthread_join(pthread_t th, void **thread_return);
--- 98,106
  #define LOG_STEP_SECONDS 5 /* seconds between log messages */
  #define DEFAULT_NXACTS 10 /* default nxacts */
+ #define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */
+ #define MIN_EXPONENTIAL_THRESHOLD 2.0 /* minimum threshold for exp */
+
  int nxacts = 0; /* number of transactions per client */
  int duration = 0; /* duration in seconds */
***
*** 169,174
  bool is_connect; /* establish connection for each transaction */
--- 172,185
  bool is_latencies; /* report per-command latencies */
  int main_pid; /* main process id used in log filename */
+ /* gaussian distribution tests: */
+ double stdev_threshold; /* standard deviation threshold */
+ bool use_gaussian = false;
+
+ /* exponential distribution tests: */
+ double exp_threshold; /* threshold for exponential */
+ bool use_exponential = false;
+
  char *pghost = "";
  char *pgport = "";
  char *login = NULL;
***
*** 330,335
  static char *select_only = {
--- 341,428
  "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
  };
+ /* --exponential case */
+ static char *exponential_tpc_b = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+ "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+ /* --exponential with -N case */
+ static char *exponential_simple_update = {
+ "\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+ "\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+ "\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+ "\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+ "\\setrandom bid 1 :nbranches\n"
+ "\\setrandom tid 1 :ntellers\n"
+ "\\setrandom delta -5000 5000\n"
+ "BEGIN;\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+ "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+ "END;\n"
+ };
+
+
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Hi all, I think this patch has been completely forgotten, which feels very unfortunate :( Min, max, and stddev are basic statistics in general monitoring tools, so I'd like to push it. (2014/02/12 15:45), KONDO Mitsumasa wrote: (2014/01/29 17:31), Rajeev rastogi wrote: No issue, you can share the test cases with me, and I will take the performance report. The attached patch is updated for the latest pg_stat_statements. It includes min, max, and stddev statistics. Could you run a compile test in your Windows environment? I think the compile error has been fixed. We had a discussion in the community about which statistics are useful; I think both kinds have strong and weak points. For a statement executed only a few (2 or 3) times, stddev is close to meaningless because it cannot estimate the true deviation precisely, while min and max work well because they are measured values rather than estimates. On the other hand, for frequently executed statements the positions are reversed: stddev becomes very useful for characterizing the whole set of executions, while min and max may just capture extreme outliers. At the end of the day, these values complement each other when we want to understand the actual statements. And my past experience showed no performance problems with this patch. So I'd like to implement all of these values in pg_stat_statements. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] gaussian distribution pgbench
(2014/03/04 17:28), Fabien COELHO wrote: OK. I'm not sure which idea is the best. So I will wait for comments from the community :) Hmmm. Maybe you can do what Tom voted for, he is the committer :-) Yeah, but he might change his mind through our discussion. So I will wait until tomorrow, and if there are no comments, I will start implementing what Tom voted for. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/03 16:51), Fabien COELHO wrote:
>>> \setrandom foo 1 10 [uniform]
>>> \setrandom foo 1 :size gaussian 3.6
>>> \setrandom foo 1 100 exponential 7.2
>> It's a good design. I think it will have lower overhead in the parsing part of pgbench, because string comparisons will be reduced (maybe). And I'd like to remove [uniform], because we have to keep compatibility with old scripts, and a random function produces a uniform distribution by default in common programming practice.
>
> I just put "uniform" as an optional default, hence the brackets.

All right, I misunderstood. However, if we select this format, I would still like to remove it: pgbench needs to check the argument count, and allowing the optional keyword makes that check less simple.

> Otherwise, what I would have in mind if this were designed from scratch:
>
> \set foo 124
> \set foo "string value" (?)
> \set foo :variable
> \set foo 12 + :shift
>
> And then
>
> \set foo uniform 1 10
> \set foo gaussian 1 10 4.2
> \set foo exponential 1 100 5.2
>
> or maybe functions could be prepended with something like "&uniform". But that would be for another life :-)

I don't agree with that. It adds more overhead in the parsing part and is more complex for the user.

>> However, the new grammar is a little long in user scripts. It seems to be a trade-off between script readability and user writing cost.
>
> Yep.

OK. I'm not sure which idea is the best. So I will wait for comments from the community :) Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/03/02 22:32), Fabien COELHO wrote: Alvaro Herrera writes: It seems that in the review so far, Fabien has focused mainly on the mathematical properties of the new random number generation. That seems perfectly fine, but no comment has been made about the chosen UI for the feature. Per the few initial messages in the thread, in the patch as submitted you ask for a gaussian random number by using \setgaussian, and exponential via \setexp. Is this the right UI? I thought it would be both concise and clear to have these as another form of \set*. Yeah, but we have only gotten two or three comments so far. So I agree with discussing the UI. There is an additional argument expected. That would make: \setrandom foo 1 10 [uniform] \setrandom foo 1 :size gaussian 3.6 \setrandom foo 1 100 exponential 7.2 It's a good design. I think it will have lower overhead in the parsing part of pgbench, because string comparisons will be reduced (maybe). And I'd like to remove [uniform], because we have to keep compatibility with old scripts, and a random function produces a uniform distribution by default in common programming practice. However, the new grammar is a little long in user scripts. It seems to be a trade-off between script readability and user writing cost. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] What behavior is in this loop?
(2014/02/27 20:19), Heikki Linnakangas wrote: On 02/27/2014 12:38 PM, KONDO Mitsumasa wrote: I found an interesting "for" and "while" loop in WaitForWALToBecomeAvailable() in xlog.c. Can you explain this behavior? for (;;) { ~ } while (StandbyMode) I confirmed that this code compiles without problems in gcc :) Oh wow :-). That's clearly a thinko, although harmless in this case. Looking at the git history, I made that mistake in commit abf5c5c9a. Before that, there was no "while". That's easier to understand with some extra formatting. It's two loops, like this: /* loop 1 */ for (;;) { ... } /* loop 2 */ while (StandbyMode); The second loop is obviously completely pointless. Thankfully, there are no "break"s inside the first loop (the ones within the switch statements don't count), so the endless while loop is never reached. Yeah, the StandbyMode flag doesn't change in this loop. I'll go fix that... Thanks for the report! Thanks for your kindness! By the way, I found a "cannot promote" problem in PG 9.3.3 in standby mode with archive recovery and crash recovery situations. I am analyzing the problem and fixing it now. Please watch out for it! Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Unfortunate choice of short switch name in pgbench
(2014/02/28 2:39), Tom Lane wrote:
> Fabien COELHO writes:
>>> Yeah, but they don't make -P take an integer argument. It's that little frammish that makes this problem significant.
>
>> I do not see a strong case to make options with arguments case insensitive as a general rule. If this is done for -p/-P, why not -t/-T?

I'll say the same thing. And if we remove the -P short option in pgbench, it implies that -P with an integer argument should be forbidden across postgres commands. Surely we don't want that.

>> If you really feel you must remove -P, please replace it with another one-letter option; I use this option nearly every time I run pgbench.
>
> Meh. If I thought -P would be that popular, I'd expect people to get used to the issue. I don't believe it though.

At least, a user who is interested in postgres performance tuning (including kernel options, etc.) will often use this option. I recommended this feature because it lets us see bottlenecks we could not see before :) I believe you will also come to like it more and more as you use it. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
[HACKERS] What behavior is in this loop?
Hi, I found an interesting "for" and "while" loop in WaitForWALToBecomeAvailable() in xlog.c. Can you explain this behavior? for (;;) { ~ } while (StandbyMode) I confirmed that this code compiles without problems in gcc :) Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/02/17 21:44), Rajeev rastogi wrote: It compiled successfully on Windows. Thank you for checking on Windows! It is very helpful. Can we allow adding all three statistics? I think stddev alone is difficult for users to interpret, but with min and max we can picture each statement's situation more easily. And I don't want to maintain this feature only in my monitoring tool, pg_statsinfo, because it would benefit many pg_stat_statements users and future improvement of postgres performance. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Exposing currentTransactionWALVolume
(2014/02/15 23:04), Andres Freund wrote: Hi Simon, On 2014-01-14 17:12:35 +, Simon Riggs wrote:

/*
- * MarkCurrentTransactionIdLoggedIfAny
+ * ReportTransactionInsertedWAL
 *
- * Remember that the current xid - if it is assigned - now has been wal logged.
+ * Remember that the current xid - if it is assigned - has now inserted WAL
 */
void
-MarkCurrentTransactionIdLoggedIfAny(void)
+ReportTransactionInsertedWAL(uint32 insertedWALVolume)
{
+	currentTransactionWALVolume += insertedWALVolume;
	if (TransactionIdIsValid(CurrentTransactionState->transactionId))
		CurrentTransactionState->didLogXid = true;
}

Not a big fan of combining those two. One works on the toplevel transaction, the other on the current subtransaction... The new name also ignores that it only takes effect if there's actually a transaction in progress. Oh, yes. I don't have a good idea myself, but we need to change the function name or add a new function for accumulating the inserted WAL volume. Once that is fixed, I will mark the patch ready for committer, because I cannot see any other problems in it. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2014/02/16 7:38), Fabien COELHO wrote: I have updated the patch (v7) based on Mitsumasa's latest v6: - some code simplifications & formula changes. - I've added explicit looping probability computations in comments to show the (low) looping probability of the iterative search. - I've tried to clarify the sgml documentation. - I've removed the 5.0 default value as it was not used anymore. - I've renamed some variables to match the surrounding naming style. Thank you for your detailed review and code fixes! I have checked your modified version; it seems better than the previous version, and the documentation changes are very helpful. * Mathematical soundness I've checked again the mathematical soundness of the methods involved. After further thought, I'm not so sure that no bias is induced by taking the second value, based on "cos", when the first, based on "sin", has failed the test. So I removed the cos computation from the gaussian version and simplified the code accordingly. This means it may be a little less efficient, but I'm more confident that there is no bias. I tried to confirm which method is better; however, at the end of the day it does not matter, because other parts of the implementation have bigger overhead in the pgbench client. We like simple implementations, so I agree with your modified version. I also tested this version: there is no measurable overhead in creating gaussian and exponential random numbers even with the minimum threshold, which is the worst case. * Conclusion If Mitsumasa-san is okay with the changes I have made, I would suggest to accept this patch. The attached patch, based on v7, additionally prints the access probability of records at the end of the pgbench results when the exponential option is used. It is calculated by a definite integral of e^-x. If you check it and see no problem, please mark it ready for committer.
Ishii-san will review this patch:) Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pgbench/pgbench.c --- b/contrib/pgbench/pgbench.c *** *** 98,103 static int pthread_join(pthread_t th, void **thread_return); --- 98,106 #define LOG_STEP_SECONDS 5 /* seconds between log messages */ #define DEFAULT_NXACTS 10 /* default nxacts */ + #define MIN_GAUSSIAN_THRESHOLD 2.0 /* minimum threshold for gauss */ + #define MIN_EXPONENTIAL_THRESHOLD 2.0 /* minimum threshold for exp */ + int nxacts = 0; /* number of transactions per client */ int duration = 0; /* duration in seconds */ *** *** 169,174 bool is_connect; /* establish connection for each transaction */ --- 172,185 bool is_latencies; /* report per-command latencies */ int main_pid; /* main process id used in log filename */ + /* gaussian distribution tests: */ + double stdev_threshold; /* standard deviation threshold */ + booluse_gaussian = false; + + /* exponential distribution tests: */ + double exp_threshold; /* threshold for exponential */ + bool use_exponential = false; + char *pghost = ""; char *pgport = ""; char *login = NULL; *** *** 330,335 static char *select_only = { --- 341,428 "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" }; + /* --exponential case */ + static char *exponential_tpc_b = { + "\\set nbranches " CppAsString2(nbranches) " * :scale\n" + "\\set ntellers " CppAsString2(ntellers) " * :scale\n" + "\\set naccounts " CppAsString2(naccounts) " * :scale\n" + "\\setexponential aid 1 :naccounts :exp_threshold\n" + "\\setrandom bid 1 :nbranches\n" + "\\setrandom tid 1 :ntellers\n" + "\\setrandom delta -5000 5000\n" + "BEGIN;\n" + "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" + "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n" + "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n" + "INSERT INTO pgbench_history (tid, bid, 
aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n" + "END;\n" + }; + + /* --exponential with -N case */ + static char *exponential_simple_update = { + "\\set nbranches " CppAsString2(nbranches) " * :scale\n" + "\\set ntellers " CppAsString2(ntellers) " * :scale\n" + "\\set naccounts " CppAsString2(naccounts) " * :scale\n" + "\\setexponential aid 1 :naccounts :exp_threshold\n" + "\\setrandom bid 1 :nbranches\n" + "\\setrandom tid 1 :ntellers\n" + "\\setrandom delta -5000 5000\n" + "BEGIN;\n" + "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" + "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n" + "END;\n" + }; + + /* --exponential with -S case */ + static char *exponential_select_only = { + "\\set naccounts " CppAsString2(na
Re: [HACKERS] gaussian distribution pgbench
Sorry, previos attached patch has small bug. Please use latest one. > 134 - return min + (int64) (max - min + 1) * rand; > 134 + return min + (int64)((max - min + 1) * rand); Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pgbench/pgbench.c --- b/contrib/pgbench/pgbench.c *** *** 176,181 int progress_nthreads = 0; /* number of threads for progress report */ --- 176,183 bool is_connect; /* establish connection for each transaction */ bool is_latencies; /* report per-command latencies */ int main_pid; /* main process id used in log filename */ + double stdev_threshold = 5; /* standard deviation threshold */ + bool gaussian_option = false; /* use gaussian distribution random generator */ char *pghost = ""; char *pgport = ""; *** *** 338,346 static char *select_only = { --- 340,390 "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" }; + /* --gaussian case */ + static char *gaussian_tpc_b = { + "\\set nbranches " CppAsString2(nbranches) " * :scale\n" + "\\set ntellers " CppAsString2(ntellers) " * :scale\n" + "\\set naccounts " CppAsString2(naccounts) " * :scale\n" + "\\setgaussian aid 1 :naccounts :stdev_threshold\n" + "\\setrandom bid 1 :nbranches\n" + "\\setrandom tid 1 :ntellers\n" + "\\setrandom delta -5000 5000\n" + "BEGIN;\n" + "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" + "UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n" + "UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n" + "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n" + "END;\n" + }; + + /* --gaussian with -N case */ + static char *gaussian_simple_update = { + "\\set nbranches " CppAsString2(nbranches) " * :scale\n" + "\\set ntellers " CppAsString2(ntellers) " * :scale\n" + "\\set naccounts " CppAsString2(naccounts) " * :scale\n" + "\\setgaussian aid 1 
:naccounts :stdev_threshold\n" + "\\setrandom bid 1 :nbranches\n" + "\\setrandom tid 1 :ntellers\n" + "\\setrandom delta -5000 5000\n" + "BEGIN;\n" + "UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" + "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n" + "END;\n" + }; + + /* --gaussian with -S case */ + static char *gaussian_select_only = { + "\\set naccounts " CppAsString2(naccounts) " * :scale\n" + "\\setgaussian aid 1 :naccounts :stdev_threshold\n" + "SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n" + }; + /* Function prototypes */ static void setalarm(int seconds); static void *threadRun(void *arg); + static inline double sqrtd(const double x); static void usage(void) *** *** 381,386 usage(void) --- 425,431 " -v, --vacuum-all vacuum all four standard tables before tests\n" " --aggregate-interval=NUM aggregate data over NUM seconds\n" " --sampling-rate=NUM fraction of transactions to log (e.g. 0.01 for 1%%)\n" + " --gaussian=NUM gaussian distribution with NUM standard deviation threshold\n" "\nCommon options:\n" " -d, --debug print debugging output\n" " -h, --host=HOSTNAME database server host or socket directory\n" *** *** 477,482 getrand(TState *thread, int64 min, int64 max) --- 522,597 return min + (int64) ((max - min + 1) * pg_erand48(thread->random_state)); } + /* random number generator: gaussian distribution from min to max inclusive */ + static int64 + getGaussianrand(TState *thread, int64 min, int64 max, double stdev_threshold) + { + double stdev; + double rand; + static double rand1; + static double rand2; + static double var_sqrt; + static bool reuse = false; + + /* + * Get user specified random number(-stdev_threshold < stdev <= stdev_threshold) + * in this loop. This loop is executed until appeared ranged number we want. 
+ * However, this loop could not almost go on, because min stdev_threshold is 2 + * then the possibility of retry-loop is under 4 percent. And possibility of + * re-retry-loop is under 1.6 percent. And it doesn't happen frequentry even if + * we also think about the cycle of the trigonometric function. + */ + do + { + /* reuse pre calculation result as possible */ + if(!reuse) + { + /* + * pg_erand48 generates [0,1) random number. However rand1 + * needs (0,1) random number because log(0) cannot calculate. + * And rand2 also needs (0,1) random number in strictly. But + * normalization cost is high and we can substitute (0,1] at + * rand1 and [0,1) at rand2, so we use approximate calculation. + */ + rand1 = 1.0
Re: [HACKERS] gaussian distribution pgbench
Hi Fabien, Thank you very much for your very detailed and useful comments! I have read them and agree with most of your advice :) The attached patch is fixed per your comments. The changes are: - Remove the redundant long option. - "--gaussian=NUM -S" and "--gaussian=NUM -N" now work. - Add sentences to the documentation. - Separate the uniform and gaussian random generation functions. - getGaussianrand() is created. - Fix the random number ranges more strictly, e.g. (0,1) or [0,1). - Please see the source code comments for details :) - Fix typos. - Use cos() and sin() when generating gaussian random numbers. - Add a fast sqrt calculation algorithm. - Reuse the sqrt result and the pre-generated random number to reduce calculation cost. - The measurements below show this method is a little faster than the non-reuse method, and the gaussian distribution is still good. * Settings shared_buffers = 1024MB * Test script pgbench -i -s 1 pgbench --gaussian=2 -T 30 -S -c8 -j4 -n pgbench --gaussian=2 -T 30 -S -c8 -j4 -n pgbench --gaussian=2 -T 30 -S -c8 -j4 -n * Result method | try1 | try2 | try3 | | reuse method | 44189 | 44453 | 44013 | non-reuse method | 43567 | 43635 | 43508 | (2014/02/09 21:32), Fabien COELHO wrote: This is a valuable contribution to enable pgbench to generate more realistic loads, which are seldom uniform in practice. Thanks! However, ISTM that other distributions such as an exponential one would make more sense. I can easily create an exponential distribution. Here I assume the exponential distribution f(x) = lambda * exp(-lambda * x) in its general form. What do you think of the following interface? custom script: \setexp [varname] min max threshold command: --exponential=NUM (threshold) I don't want to expose a lambda variable, to keep the implementation simple, so lambda is always 1; the distribution can be controlled well enough through the threshold alone. The threshold parameter is an f(x) value. The generated distribution is projected onto 'aid' by the same method.
If you think it's OK, I will implement the following tomorrow, and also create the parsing part of this function...

    do {
        rand = 1.0 - pg_erand48(thread->random_state);
        rand = -log(rand);
    } while (rand > exp_threshold);
    return rand / exp_threshold;

...and also the values should be further randomized so that neighboring values are not more likely to be drawn. The latter point is non trivial. That's right, but I worry that gaussian randomness and benchmark reproducibility might disappear if we re-randomize the access pattern, because Postgres storage manages records page by page, and it is difficult to realize access randomness across whole pages rather than individual records. To solve this, we would need a smart shuffling projection function that still preserves the gaussian distribution. I think that will be difficult, and it should be implemented in a separate patch in the future. * Mathematical soundness We want to derive a discrete normal distribution from a uniform one. Well, normal distributions are for continuous variables... Anyway, this is done by computing a continuous normal distribution which is then projected onto integers. I'm basically fine with that. The system uses a Box-Muller transform (1958) to do this transformation. The Ziggurat method seems to be preferred for this purpose, *but* it would require precalculated tables which depend on the target values. So I'm fine with the Box-Muller transform for pgbench. Yes, that's right. I selected a simple and relatively fast algorithm, the Box-Muller transform. The BM method uses 2 uniformly distributed numbers to derive 2 normally distributed numbers. The implementation computes one of these, and loops until one matches a threshold criterion. More explanations, at least in comments, are needed about this threshold and its meaning. It is required to be more than 2. My guess is that it allows limiting the number of iterations of the while loop,
Yes. This loop will almost never iterate, because the minimum stdev_threshold is 2, so the probability of a retry is under 4 percent. It should not be a problem. but in what proportion is unclear. The documentation also does not help the user understand this value and its meaning. Yes, it is a heuristic method, so I added comments about it in the documentation. What I think it is: it is the deviation for the FURTHEST point around the mean, that is the actual deviation associated with the "min" and "max" target values. The minimum value of 2 implies that there are at least 4 stddev lengths between min and max, with the mean most likely in the middle. Correct! If the threshold test fails, one of the 2 uniform numbers is redrawn and a new candidate value is tested. I'm not at ease about why only 1 value is redrawn and not both; some explanations would be welcome. Also, on the other hand,
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Hi Rajeev, (2014/01/29 17:31), Rajeev rastogi wrote: No issue, you can share the test cases with me, and I will take the performance report. The attached patch is updated for the latest pg_stat_statements. It includes min, max, and stddev statistics. Could you run a compile test in your Windows environment? I think the compile error has been fixed. We had a discussion in the community about which statistics are useful; I think both kinds have strong and weak points. For a statement executed only a few (2 or 3) times, stddev is close to meaningless because it cannot estimate the true deviation precisely, while min and max work well because they are measured values rather than estimates. On the other hand, for frequently executed statements the positions are reversed: stddev becomes very useful for characterizing the whole set of executions, while min and max may just capture extreme outliers. At the end of the day, these values complement each other when we want to understand the actual statements. And my past experience showed no performance problems with this patch. So I'd like to implement all of these values in pg_stat_statements.
Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql *** *** 19,24 CREATE FUNCTION pg_stat_statements(IN showtext boolean, --- 19,27 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 41,43 CREATE VIEW pg_stat_statements AS --- 44,51 SELECT * FROM pg_stat_statements(true); GRANT SELECT ON pg_stat_statements TO PUBLIC; + + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; *** a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql *** *** 9,14 RETURNS void --- 9,19 AS 'MODULE_PATHNAME' LANGUAGE C; + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; + CREATE FUNCTION pg_stat_statements(IN showtext boolean, OUT userid oid, OUT dbid oid, *** *** 16,21 CREATE FUNCTION pg_stat_statements(IN showtext boolean, --- 21,29 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 42,44 GRANT SELECT ON pg_stat_statements TO PUBLIC; --- 50,53 -- Don't want this to be available to non-superusers. REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC; + REVOKE ALL ON FUNCTION pg_stat_statements_reset_time() FROM PUBLIC; *** a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql --- b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql *** *** 4,8 --- 4,9 \echo Use "CREATE EXTENSION pg_stat_statements" to load this file. 
\quit ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset(); + ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset_time(); ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements(); ALTER EXTENSION pg_stat_statements ADD view pg_stat_statements; *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *** *** 106,111 static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; --- 106,113 #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ + #define EXEC_TIME_INIT_MIN DBL_MAX /* initial execution min time */ + #define EXEC_TIME_INIT_MAX -DBL_MAX /* initial execution max time */ #define JUMBLE_SIZE1024 /* query serialization buffer size */ *** *** 137,142 typedef struct Counters --- 139,147 { int64 calls; /* # of times executed */ double total_time; /* total execution time, in msec */ + double total_sqtime; /* cumulated square execution time, in msec */ + double min_time; /* maximum execution time, in msec */ + double max_time; /* minimum execution time, in msec */ int64 rows; /* total # of retrieved or affected rows */ int64 shared_blks_hit; /* # of shared buffer hits */ int64 shared_blks_read; /* # of shared disk blocks read */ *** *** 274,283 void _PG_init(void); --- 279,290 void _PG_fini(void); Datum pg_stat_statem
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/30 8:29), Tom Lane wrote:
> Andrew Dunstan writes:
>> I could live with just stddev. Not sure others would be so happy.
>
> FWIW, I'd vote for just stddev, on the basis that min/max appear to add more to the counter update time than stddev does; you've got this:
>
> + e->counters.total_sqtime += total_time * total_time;
>
> versus this:
>
> + if (e->counters.min_time > total_time || e->counters.min_time == EXEC_TIME_INIT)
> +     e->counters.min_time = total_time;
> + if (e->counters.max_time < total_time)
> +     e->counters.max_time = total_time;
>
> I think on most modern machines, a float multiply-and-add is pretty darn cheap; a branch that might or might not be taken, OTOH, is a performance bottleneck.
>
> Similarly, the shared memory footprint hit is more: two new doubles for min/max versus one for total_sqtime (assuming we're happy with the naive stddev calculation).
>
> If we felt that min/max were of similar value to stddev then this would be mere nitpicking. But since people seem to agree they're worth less, I'm thinking the cost/benefit ratio isn't there.

Why are you so worried about the cost in my patch? Do three extra double variables really consume that much memory, or add that much calculation cost? My test results showed that the supposed LWLock bottleneck is an urban legend. If you have a fairer test pattern, please tell me; I'll run it if the community will judge fairly. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/29 17:31), Rajeev rastogi wrote: On 28th January, Mitsumasa KONDO wrote: By the way, the latest pg_stat_statements might affect performance on Windows, because it uses an fflush() system call every time a new entry is created in pg_stat_statements, and it calls fread() many times to warm the file cache. It works well on Linux, but I'm not sure about Windows. If you have time, could you test it on your Windows system? If it affects performance a lot, we can still change it. No issue; you can share the test cases with me, and I will take a performance report. Thank you for your kindness! I posted another opinion on his patch, so please wait a while, so your test time isn't wasted. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/29 16:58), Peter Geoghegan wrote: On Tue, Jan 28, 2014 at 10:51 PM, Tom Lane wrote: KONDO Mitsumasa writes: By the way, latest pg_stat_statement might affect performance in Windows system. Because it uses fflush() system call every creating new entry in pg_stat_statements, and it calls many fread() to warm file cache. This statement doesn't seem to have much to do with the patch as committed. You could make a strong case for the patch improving performance in practice for many users, by allowing us to increase the pg_stat_statements.max default value to 5,000, 5 times the previous value. The real expense of creating a new entry is the exclusive locking of the hash table, which blocks *everything* in pg_stat_statements. That isn't expanded at all, since queries are only written with a shared lock, which only blocks the creation of new entries which was already relatively expensive. In particular, it does not block the maintenance of costs for all already observed entries in the hash table. It's obvious that simply reducing the pressure on the cache will improve matters a lot, which for many users the external texts patch does. Since Mitsumasa has compared that patch and at least one of his proposed pg_stat_statements patches on several occasions, I will re-iterate the difference: any increased overhead from the external texts patch is paid only when a query is first entered into the hash table. Then, there is obviously and self-evidently no additional overhead from contention for any future execution of the same query, no matter what, indefinitely (the exclusive locking section of creating a new entry does not do I/O, except in fantastically unlikely circumstances). So for many of the busy production systems I work with, that's the difference between a cost paid perhaps tens of thousands of times a second, and a cost paid every few days or weeks. 
Even if he is right about a write() taking an unreasonably large amount of time on Windows, the consequences felt as pg_stat_statements LWLock contention would be very limited. I am not opposed in principle to adding new things to the counters struct in pg_stat_statements. I just think that the fact that the overhead of installing the module on a busy production system is currently so low is of *major* value, and therefore any person that proposes to expand that struct should be required to very conclusively demonstrate that there is no appreciable increase in overhead. Having a standard deviation column would be nice, but it's still not that important. Maybe when we have portable atomic addition we can afford to be much more inclusive of that kind of thing.

I'd like to know the truth and the facts about your patch, rather than just arguments :-) So I created a detailed pg_stat_statements benchmark tool using pgbench. This tool generates a file of unique SQL statements, and it is only for measuring pg_stat_statements performance, because it only updates pg_stat_statements data and doesn't write to disk at all. It's a fair benchmark.

[usage]
perl create_sql.pl > file.sql
psql -h xxx.xxx.xxx.xxx mitsu-ko -c 'SELECT pg_stat_statements_reset()'
pgbench -h xxx.xxx.xxx.xxx mitsu-ko -c64 -j32 -n -T 180 -f file.sql

[part of sample sqls file, they are executed in pgbench]
SELECT 1-1/9+5*8*6+5+9-6-3-4/9%7-2%7/5-9/7+2+9/7-1%5%9/3-4/4-9/1+5+5/6/5%2+1*2*3-8/8+5-3-8-4/8+5%2*2-2-5%6+4
SELECT 1%9%7-8/5%6-1%2*2-7+9*6-2*6-9+1-2*9+6+7*8-4-2*1%3/7-1%4%9+4+7/5+4/2-3-5%8/3/7*6-1%8*6*1/7/2%5*6/2-3-9
SELECT 1%3*2/8%6%5-3%1+1/6*6*5/9-2*5+6/6*5-1/2-6%4%8/7%2*7%5%9%5/9%1%1-3-9/2*1+1*6%8/2*4/1+6*7*1+5%6+9*1-9*6
...

I tested several methods: the old pgss, the old pgss with my patch, and the new pgss. The test server and postgresql.conf are the same as in my earlier test in this ML thread.
The test methods and results are here:

[methods]
method 1: old pgss, pg_stat_statements.max=1000 (default)
method 2: old pgss with my patch, pg_stat_statements.max=1000 (default)
peter 1 : new pgss (Peter's patch), pg_stat_statements.max=5000 (default)
peter 2 : new pgss (Peter's patch), pg_stat_statements.max=1000
[for reference]
method 5: old pgss, pg_stat_statements.max=5000
method 6: old pgss with my patch, pg_stat_statements.max=5000

[results]
method   | try1  | try2  | try3  | performance degradation ratio
---------+-------+-------+-------+------------------------------
method 1 | 6.546 | 6.558 | 6.638 |  0% (reference score)
method 2 | 6.527 | 6.556 | 6.574 |  1%
peter 1  | 5.204 | 5.203 | 5.216 | 20%
peter 2  | 4.241 | 4.236 | 4.262 | 35%
method 5 | 5.931 | 5.848 | 5.872 | 11%
method 6 | 5.794 | 5.792 | 5.776 | 12%

The new pgss degrades performance far more than the old pgss, and I cannot see performance scaling. I understand that his argument was "My patch reduces time of LWLock in pg_stat_statements, it is good for performance. However, Kondo's patch will be
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/29 15:51), Tom Lane wrote: > KONDO Mitsumasa writes: >> By the way, the latest pg_stat_statements might affect performance on Windows, >> because it uses an fflush() system call every time a new entry is created in >> pg_stat_statements, and it calls fread() many times to warm the file cache. > This statement doesn't seem to have much to do with the patch as > committed. There are no fflush calls, and no notion of warming the > file cache either. Oh, all right. > We do assume that the OS is smart enough to keep > a frequently-read file in cache ... is Windows too stupid for that? I don't know about that, but I think the Windows cache behavior is poor: it seems to always write/read data to/from disk, despite having a lot of memory... I'd like to see test results on Windows, if we can get them... Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] drop duplicate buffers in OS
Hi, Attached is the latest patch. I changed PinBuffer() in bufmgr.c a little; it will evict the target clean file cache in the OS more precisely.

- if (!(buf->flags & BM_FADVED) && !(buf->flags & BM_JUST_DIRTIED))
+ if (!(buf->flags & BM_DIRTY) && !(buf->flags & BM_FADVED) && !(buf->flags & BM_JUST_DIRTIED))

(2014/01/29 8:20), Jeff Janes wrote: On Wed, Jan 15, 2014 at 10:34 AM, Robert Haas wrote: On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa wrote: > I created a patch that can drop duplicate buffers in the OS using a usage_count > algorithm. I have been developing this patch since last summer. This feature is > a hot topic at the moment, so I am submitting it earlier than I had scheduled. > > When usage_count is high in shared_buffers, buffers are hard to drop from > shared_buffers. However, those buffers aren't needed in the file cache, because > they aren't accessed there by postgres (postgres accesses shared_buffers). > So I created an algorithm that drops file cache pages which have a high usage_count > in shared_buffers and are in a clean state in the OS. If a file cache page is clean in > the OS, executing posix_fadvise DONTNEED frees it from the file cache without writing > to physical disk. This algorithm solves the double-buffering problem and > uses memory more efficiently. > > I am testing the DBT-2 benchmark now... Have you had any luck with it? I have reservations about this approach. Among other reasons, if the buffer is truly nailed in shared_buffers for the long term, the kernel won't see any activity on it and will be able to evict it fairly efficiently on its own. My patch aims to avoid evicting other useful file cache in the OS. If we don't drop the OS copies of pages that are already held in shared_buffers, the kernel will eventually evict them together with other, genuinely useful file cache. By dropping the duplicates as early as possible, the other useful file cache becomes less likely to be evicted.
So I'm reluctant to do a detailed review if the author cannot demonstrate a performance improvement. I'm going to mark it waiting-on-author for that reason. Will you review my patch? Thank you so much! However, my patch's performance is only a little better; the difference might be within the error range. The kernel readahead optimization patch is great. Too much readahead in the OS is harmful and fills the OS with useless file cache. Here is the test result. The plain result was measured earlier (in the readahead patch test).

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB (PC3-10600R-9)
Disk: 146GB (15k) * 4, RAID 1+0
RAID controller: P410i/256MB
OS: RHEL 6.4 (x86_64)
FS: Ext4

* DBT-2 result (WH400, SESSION=100, ideal_score=5160)
Method  | score | average | 90%tile | Maximum
--------+-------+---------+---------+---------
plain   | 3589  | 9.751   | 33.680  |  87.8036
patched | 3799  | 9.914   | 22.451  | 119.4259

* Main settings
shared_buffers = 2458MB
drop_duplicate_buffers = 5 // patched only

I also ran the benchmark with drop_duplicate_buffers = 3 and 4, but I didn't get good results, so I will test with larger shared_buffers and those settings.

[detail settings] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/dbserver/param.out

* Detail results (uploading now; please wait for an hour...)
[plain] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/index_thput.html
[patched] http://pgstatsinfo.projects.pgfoundry.org/drop_os_cache/drop_dupulicate_cache20140129/HTML/index_thput.html

We can see faster response times during OS writeback (probably) and while executing CHECKPOINT, because when these happen, read transactions hit the file cache more often with my patch, so response times are better than plain.
Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1107,1112 include 'filename' --- 1107,1130 + + drop-duplicate-buffers (integer) + +drop-duplicate-buffers configuration parameter + + + + Sets the minimum usage count in shared_buffers at which a buffer's copy is dropped from the OS file cache. + When you use this parameter, set shared_buffers large (about 50% of memory + or higher), and set this parameter to 4 or 5. It needs to be set carefully: if you + use this parameter with small shared_buffers, transaction performance will + decline. This parameter requires the posix_fadvise() system call. + If your system doesn't support posix_fadvise(), it doesn't work at all. + + + + + temp_buffers (integer) *** a/src/backend/storage
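The mechanism the patch relies on, dropping a clean page range from the OS cache via posix_fadvise(POSIX_FADV_DONTNEED), can be sketched standalone (Linux; the function name and parameters here are illustrative, not from the patch):

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Sketch: ask the kernel to drop its page cache for a byte range of a file.
 * For a clean (non-dirty) range this frees the cache without any disk
 * write -- the property the patch exploits for pages that are already
 * pinned with a high usage_count in shared_buffers.  len == 0 means
 * "to end of file".  Returns 0 on success. */
static int
drop_file_cache(const char *path, off_t offset, off_t len)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    close(fd);
    return rc;
}
```

Note that POSIX_FADV_DONTNEED is only a hint: the kernel may decline to drop dirty pages, which is why the patch's PinBuffer() change also checks BM_DIRTY before issuing the advice.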
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/28 15:17), Rajeev rastogi wrote: On 27th January, Mitsumasa KONDO wrote: 2014-01-26 Simon Riggs: On 21 January 2014 19:48, Simon Riggs wrote: > On 21 January 2014 12:54, KONDO Mitsumasa wrote: >> Rebased patch is attached. > > Does this fix the Windows bug reported by Kumar on 20/11/2013 ? Sorry, I was misunderstanding: the first name of Mr. Rajeev Rastogi is Kumar! I searched only by e-mail address and title for his name... I don't have a Windows compiler environment, but the attached patch might fix it. Could I ask Mr. Rajeev Rastogi to test my patch again? I tried to test this but I could not apply the patch on the latest git HEAD. This may be because of a recent patch (related to pg_stat_statements: "pg_stat_statements external query text storage"), which got committed on 27th January. Thank you for trying to test my patch. As you say, the recent commit changes pg_stat_statements.c a lot, so I have to revise my patch. Please wait a while. By the way, the latest pg_stat_statements might affect performance on Windows, because it uses an fflush() system call every time a new entry is created in pg_stat_statements, and it calls fread() many times to warm the file cache. It works well on Linux, but I'm not sure about Windows. If you have time, could you test it on your Windows system? If it affects performance a lot, we can still change it. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/26 17:43), Mitsumasa KONDO wrote: > 2014-01-26 Simon Riggs mailto:si...@2ndquadrant.com>> > > On 21 January 2014 19:48, Simon Riggs <mailto:si...@2ndquadrant.com>> wrote: > > On 21 January 2014 12:54, KONDO Mitsumasa > <mailto:kondo.mitsum...@lab.ntt.co.jp>> wrote: > >> Rebased patch is attached. > > > > Does this fix the Windows bug reported by Kumar on 20/11/2013 ? Sorry, I was misunderstanding. First name of Mr. Rajeev Rastogi is Kumar! I searched only e-mail address and title by his name... I don't have windows compiler enviroment, but attached patch might be fixed. Could I ask Mr. Rajeev Rastogi to test my patch again? Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql *** *** 19,24 CREATE FUNCTION pg_stat_statements( --- 19,27 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 41,43 CREATE VIEW pg_stat_statements AS --- 44,52 SELECT * FROM pg_stat_statements(); GRANT SELECT ON pg_stat_statements TO PUBLIC; + + /* New Function */ + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; *** a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql *** *** 9,14 RETURNS void --- 9,19 AS 'MODULE_PATHNAME' LANGUAGE C; + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; + CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, *** *** 16,21 CREATE FUNCTION pg_stat_statements( --- 21,29 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 42,44 GRANT SELECT ON 
pg_stat_statements TO PUBLIC; --- 50,53 -- Don't want this to be available to non-superusers. REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC; + REVOKE ALL ON FUNCTION pg_stat_statements_reset_time() FROM PUBLIC; *** a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql --- b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql *** *** 4,8 --- 4,9 \echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset(); + ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset_time(); ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements(); ALTER EXTENSION pg_stat_statements ADD view pg_stat_statements; *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *** *** 78,84 static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ ! #define JUMBLE_SIZE1024 /* query serialization buffer size */ /* --- 78,85 #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ ! #define EXEC_TIME_INIT_MIN DBL_MAX /* initial execution min time */ ! 
#define EXEC_TIME_INIT_MAX -DBL_MAX /* initial execution max time */ #define JUMBLE_SIZE 1024 /* query serialization buffer size */ /* *** *** 114,119 typedef struct Counters --- 115,123 { int64 calls; /* # of times executed */ double total_time; /* total execution time, in msec */ + double total_sqtime; /* cumulative squared execution time, in msec */ + double min_time; /* minimum execution time, in msec */ + double max_time; /* maximum execution time, in msec */ int64 rows; /* total # of retrieved or affected rows */ int64 shared_blks_hit; /* # of shared buffer hits */ int64 shared_blks_read; /* # of shared disk blocks read */ *** *** 237,245 void _PG_init(void); --- 241,251 void _PG_fini(void); Datum pg_stat_statements_reset(PG_FUNCTION_ARGS); + Datum pg_stat_statements_reset_time(PG_FUNCTION_ARGS); Datum pg_stat_statements(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(pg_stat_
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/23 23:18), Andrew Dunstan wrote: What is more, if the square root calculation is affecting your benchmarks, I suspect you are benchmarking the wrong thing. I ran another test with two pgbench clients at the same time: one runs the select-only query, and the other executes a 'SELECT * FROM pg_stat_statements' query every second. I used the v6 patch in this test.

* Benchmark commands
$bin/pgbench -h xxx.xxx.xxx.xxx mitsu-ko -c64 -j32 -S -T 180 -n &
$bin/pgbench -h xxx.xxx.xxx.xxx mitsu-ko -T 180 -n -f file.sql

** file.sql
SELECT * FROM pg_stat_statements;
\sleep 1s

* Select-only-query result (results are in tps)
method            | try1   | try2   | try3
------------------+--------+--------+--------
with pgss         | 125502 | 125818 | 125809
with patched pgss | 125909 | 125699 | 126040

This result shows my patch has almost the same performance as before. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/23 10:28), Peter Geoghegan wrote: On Wed, Jan 22, 2014 at 5:28 PM, KONDO Mitsumasa wrote: Oh, thanks for informing me. I think the essential problem is that my patch has a bottleneck in the sqrt() function and other division calculations. Well, that's a pretty easy theory to test. Just stop calling them (and do something similar to what we do for current counter fields instead) and see how much difference it makes. What do you mean by "calling them"? I think the heavy part you have in mind is the pg_stat_statements view, which is called by a SELECT query, not the LWLock-protected part that gathers statistics in the hook. Right? I tested my patch with pgbench, but I cannot find a bottleneck in my latest patch. (Sorry, I haven't tested the SELECT query on the pg_stat_statements view yet...) Here is the test result.

* Result (results are in tps)
method            | try1   | try2   | try3
------------------+--------+--------+--------
without pgss      | 130938 | 131558 | 131796
with pgss         | 125164 | 125146 | 125358
with patched pgss | 126091 | 126681 | 126433

* Test settings
shared_buffers = 1024MB
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7

* pgbench script
pgbench -h xxx.xxx.xxx.xxx mitsu-ko -i -s 100
psql -h xxx.xxx.xxx.xxx mitsu-ko -c 'CHECKPOINT'
pgbench -h xxx.xxx.xxx.xxx mitsu-ko -c64 -j32 -S -T 180
pgbench -h xxx.xxx.xxx.xxx mitsu-ko -c64 -j32 -S -T 180
pgbench -h xxx.xxx.xxx.xxx mitsu-ko -c64 -j32 -S -T 180

* Server SPEC
CPU: Xeon E5-2670 1P/8C 2.6GHz  # We don't have a 32-core CPU...
Memory: 24GB
RAID: i420 2GB cache
Disk: 15K * 6 (RAID 1+0)

Attached is the latest development patch. It hasn't been tested much yet, but the sqrt calculation may be faster.
Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql *** *** 19,24 CREATE FUNCTION pg_stat_statements( --- 19,27 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 41,43 CREATE VIEW pg_stat_statements AS --- 44,52 SELECT * FROM pg_stat_statements(); GRANT SELECT ON pg_stat_statements TO PUBLIC; + + /* New Function */ + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; *** a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql *** *** 9,14 RETURNS void --- 9,19 AS 'MODULE_PATHNAME' LANGUAGE C; + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; + CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, *** *** 16,21 CREATE FUNCTION pg_stat_statements( --- 21,29 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 42,44 GRANT SELECT ON pg_stat_statements TO PUBLIC; --- 50,53 -- Don't want this to be available to non-superusers. REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC; + REVOKE ALL ON FUNCTION pg_stat_statements_reset_time() FROM PUBLIC; *** a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql --- b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql *** *** 4,8 --- 4,9 \echo Use "CREATE EXTENSION pg_stat_statements" to load this file. 
\quit ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset(); + ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset_time(); ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements(); ALTER EXTENSION pg_stat_statements ADD view pg_stat_statements; *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *** *** 78,84 static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ ! #define JUMBLE_SIZE1024 /* query serialization buffer size */ /* --- 78,85 #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ ! #define EXEC_TIME_INIT_MIN DBL_MAX /* initial execution min time */ ! #define EXEC_
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/23 12:00), Andrew Dunstan wrote: On 01/22/2014 08:28 PM, KONDO Mitsumasa wrote: (2014/01/22 22:26), Robert Haas wrote: On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa wrote: OK, Kondo, please demonstrate benchmarks that show we have <1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the benchmark's error range might be higher than 1%, so I will show you the detailed HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with >32 cores, I think, and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. Oh, thanks for informing me. I think the essential problem is that my patch has a bottleneck in the sqrt() function and other division calculations. I will replace the sqrt() function from math.h with a faster algorithm, and move unnecessary parts of the calculation out of the LWLock (and other lock) sections. The improvement might take some time, so please wait a while. Umm, I have not read the patch, but are you not using Welford's method? Its per-statement overhead should be absolutely tiny (and should not compute a square root at all per statement - the square root should only be computed when the standard deviation is actually wanted, e.g. when a user examines pg_stat_statements) See for example <http://www.johndcook.com/standard_deviation.html> Thanks for your advice. I read your example roughly; however, I think calculating the variance is not the heavy part of my patch. The double-precision sqrt calculation is the heaviest part, in my mind. And I found a fast square root algorithm that is used in 3D games. http://en.wikipedia.org/wiki/Fast_inverse_square_root This page describes an inverse square root algorithm, but it can also calculate a normal square root, and it is faster than the general algorithm at the price of some precision.
I think we want a fast algorithm, so it will be suitable. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
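The fast inverse square root mentioned above (the well-known Quake III trick) can be sketched as follows; sqrt(x) is then obtained as x * rsqrt(x). This is a generic illustration of the Wikipedia algorithm, not code from the patch; accuracy after one Newton-Raphson iteration is roughly 0.2%.

```c
#include <assert.h>
#include <stdint.h>

/* Fast inverse square root: a bit-level initial estimate refined by one
 * Newton-Raphson step.  Trades precision for speed, as noted in the mail. */
static float
fast_rsqrt(float x)
{
    union { float f; uint32_t i; } u = { .f = x };

    u.i = 0x5f3759df - (u.i >> 1);          /* magic initial estimate */
    u.f *= 1.5f - (0.5f * x * u.f * u.f);   /* one Newton-Raphson iteration */
    return u.f;
}

/* sqrt(x) ~= x * rsqrt(x), avoiding any division. */
static float
fast_sqrt(float x)
{
    return x * fast_rsqrt(x);
}
```

Whether this is actually faster than a hardware sqrt instruction on modern CPUs is doubtful (sqrtsd is cheap on recent x86), which is part of why the per-read sqrt was never really the bottleneck.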
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/22 22:26), Robert Haas wrote: On Wed, Jan 22, 2014 at 3:32 AM, KONDO Mitsumasa wrote: OK, Kondo, please demonstrate benchmarks that show we have <1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the benchmark's error range might be higher than 1%, so I will show you the detailed HTML results. To see any impact from spinlock contention, I think you're pretty much going to need a machine with >32 cores, I think, and lots of concurrency. pgbench -S is probably a better test than DBT-2, because it leaves out all the writing, so percentage-wise more time will be spent doing things like updating the pgss hash table. Oh, thanks for informing me. I think the essential problem is that my patch has a bottleneck in the sqrt() function and other division calculations. I will replace the sqrt() function from math.h with a faster algorithm, and move unnecessary parts of the calculation out of the LWLock (and other lock) sections. The improvement might take some time, so please wait a while. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2014/01/22 9:34), Simon Riggs wrote: AFAICS, all that has happened is that people have given their opinions and we've got almost the identical patch, with a rush-rush comment to commit even though we've waited months. If you submit a patch, then you need to listen to feedback and be clear about what you will do next; if you don't, people will learn to ignore you, and nobody wants that. I think I already replied that it would be heavy. If we implement a histogram in pg_stat_statements, we have to add double-precision arrays for storing the histogram data, and when we update the histogram for each statement, we must search the arrays for where the response time falls. That is a very big cost: it needs a lot of memory, and updating is heavier than the benefit we would get from it. So I just added stddev, to keep it as fast as the latest pg_stat_statements. I got agreement from some people, as you say. On 21 January 2014 21:19, Peter Geoghegan wrote: On Tue, Jan 21, 2014 at 11:48 AM, Simon Riggs wrote: I agree with people saying that stddev is better than nothing at all, so I am inclined to commit this, in spite of the above. I could live with stddev. But we really ought to be investing in making pg_stat_statements work well with third-party tools. I am very wary of enlarging the counters structure, because it is protected by a spinlock. There has been no attempt to quantify that cost, nor has anyone even theorized that it is not likely to be appreciable. OK, Kondo, please demonstrate benchmarks that show we have <1% impact from this change. Otherwise we may need a config parameter to allow the calculation. OK, testing DBT-2 now. However, the benchmark's error range might be higher than 1%, so I will show you the detailed HTML results. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Rebased patch is attached. pg_stat_statements in PG9.4dev has already changed its table columns, so I hope this patch will be committed in PG9.4dev. Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql *** *** 19,24 CREATE FUNCTION pg_stat_statements( --- 19,27 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 41,43 CREATE VIEW pg_stat_statements AS --- 44,52 SELECT * FROM pg_stat_statements(); GRANT SELECT ON pg_stat_statements TO PUBLIC; + + /* New Function */ + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; *** a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql --- b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql *** *** 9,14 RETURNS void --- 9,19 AS 'MODULE_PATHNAME' LANGUAGE C; + CREATE FUNCTION pg_stat_statements_reset_time() + RETURNS void + AS 'MODULE_PATHNAME' + LANGUAGE C; + CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, *** *** 16,21 CREATE FUNCTION pg_stat_statements( --- 21,29 OUT query text, OUT calls int8, OUT total_time float8, + OUT min_time float8, + OUT max_time float8, + OUT stdev_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, *** *** 42,44 GRANT SELECT ON pg_stat_statements TO PUBLIC; --- 50,53 -- Don't want this to be available to non-superusers. REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC; + REVOKE ALL ON FUNCTION pg_stat_statements_reset_time() FROM PUBLIC; *** a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql --- b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql *** *** 4,8 --- 4,9 \echo Use "CREATE EXTENSION pg_stat_statements" to load this file.
\quit ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset(); + ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset_time(); ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements(); ALTER EXTENSION pg_stat_statements ADD view pg_stat_statements; *** a/contrib/pg_stat_statements/pg_stat_statements.c --- b/contrib/pg_stat_statements/pg_stat_statements.c *** *** 78,83 static const uint32 PGSS_PG_MAJOR_VERSION = PG_VERSION_NUM / 100; --- 78,84 #define USAGE_DECREASE_FACTOR (0.99) /* decreased every entry_dealloc */ #define STICKY_DECREASE_FACTOR (0.50) /* factor for sticky entries */ #define USAGE_DEALLOC_PERCENT 5 /* free this % of entries at once */ + #define EXEC_TIME_INIT (-1) /* initial execution time */ #define JUMBLE_SIZE 1024 /* query serialization buffer size */ *** *** 114,119 typedef struct Counters --- 115,123 { int64 calls; /* # of times executed */ double total_time; /* total execution time, in msec */ + double total_sqtime; /* cumulative squared execution time, in msec */ + double min_time; /* minimum execution time, in msec */ + double max_time; /* maximum execution time, in msec */ int64 rows; /* total # of retrieved or affected rows */ int64 shared_blks_hit; /* # of shared buffer hits */ int64 shared_blks_read; /* # of shared disk blocks read */ *** *** 237,245 void _PG_init(void); --- 241,251 void _PG_fini(void); Datum pg_stat_statements_reset(PG_FUNCTION_ARGS); + Datum pg_stat_statements_reset_time(PG_FUNCTION_ARGS); Datum pg_stat_statements(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(pg_stat_statements_reset); + PG_FUNCTION_INFO_V1(pg_stat_statements_reset_time); PG_FUNCTION_INFO_V1(pg_stat_statements); static void pgss_shmem_startup(void); *** *** 266,271 static pgssEntry *entry_alloc(pgssHashKey *key, const char *query, --- 272,278 int query_len, bool sticky); static void entry_dealloc(void); static void entry_reset(void); + static void entry_reset_time(void); static void AppendJumble(pgssJumbleState
*jstate, const unsigned char *item, Size size); static void JumbleQuery(pgssJumbleState *jstate, Query *query); *** *** 1046,1051 pgss_store(const char *query, uint32 queryId, --- 1053,1059 e->counters.calls += 1; e->counters.total_time += total_time; + e->counters.total_sqtime += total_time * total_time; e->counters.rows += rows; e->counters.shared_blks_hit += bufusage->shared_blks_hit; e->counters.shared_blks_read += bufusage->shared_blks_read; *** *** 1061,1066 pgss_store(const char *query, u
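The stdev_time column the patch adds can be derived at read time from the running sums it maintains (calls, total_time, total_sqtime), with no per-sample storage. A minimal sketch of that arithmetic, with illustrative names rather than the patch's actual code:

```c
/* Accumulators mirroring the Counters fields the patch adds
 * (calls, total_time, total_sqtime); names here are illustrative. */
typedef struct {
    long long calls;
    double total_time;   /* sum of execution times, in msec */
    double total_sqtime; /* sum of squared execution times */
} stat_acc;

static void acc_add(stat_acc *a, double t)
{
    a->calls++;
    a->total_time += t;
    a->total_sqtime += t * t;
}

/* Population variance via E[X^2] - E[X]^2; stdev_time is simply the
 * square root of this. Clamp at 0 to absorb floating-point rounding. */
static double acc_variance(const stat_acc *a)
{
    double mean, var;
    if (a->calls == 0)
        return 0.0;
    mean = a->total_time / a->calls;
    var = a->total_sqtime / a->calls - mean * mean;
    return var < 0.0 ? 0.0 : var;
}
```

Keeping only these two sums in shared memory is what makes the statistic cheap to update on every call; the square root is taken once, when the view is read.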
Re: [HACKERS] drop duplicate buffers in OS
(2014/01/16 3:34), Robert Haas wrote: On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa wrote: I created a patch that can drop duplicate buffers in the OS cache using a usage_count algorithm. I have been developing this patch since last summer. This feature seems to be a hot topic at the moment, so I am submitting it earlier than I had planned. When usage_count is high in shared_buffers, those buffers are hard to evict from shared_buffers, yet the same buffers aren't required in the OS file cache, because they are never accessed there (Postgres accesses shared_buffers instead). So I created an algorithm that drops file cache pages which have a high usage_count in shared_buffers and are in a clean state in the OS. If a file cache page is clean in the OS and we execute posix_fadvise(DONTNEED) on it, it can be freed from the file cache without writing to physical disk. This algorithm solves the double-buffering problem and can use memory more efficiently. I am testing the DBT-2 benchmark now... The thing about this is that our usage counts for shared_buffers don't really work right now; it's common for everything, or nearly everything, to have a usage count of 5. So I'm reluctant to rely on that for much of anything. This patch aims at large shared_buffers settings, so a configuration where shared_buffers is only 10% of memory might not benefit. The patch is experimental and shows one example of how to solve the double-buffering problem. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] drop duplicate buffers in OS
(2014/01/16 21:38), Aidan Van Dyk wrote: Can we just get the backend that dirties the page to do the posix_fadvise DONTNEED? No; posix_fadvise(DONTNEED) can only cheaply remove clean pages from the OS file cache. If the page is dirty, it causes a physical disk write. However, this is an experimental patch, so the behavior might change based on future benchmark testing. Or have another helper that sweeps the shared buffers and does this post-first-dirty? We could add a DropDuplicateOSCache() function to the checkpointer process or another process. We could also change posix_fadvise(DONTNEED) to sync_file_range(), which causes a physical disk write of the target range without freeing the OS file cache. I'm considering sync_file_range() with SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE while executing a checkpoint; it could avoid the fsync-freeze situation in the final part of the checkpoint. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
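The sync_file_range()-then-DONTNEED sequence discussed above can be sketched as follows. This only illustrates the system-call ordering, not the patch itself: the real code operates on PostgreSQL buffer descriptors, sync_file_range() is Linux-specific, and the fdatasync() fallback branch is my assumption for portability.

```c
#define _GNU_SOURCE   /* for sync_file_range() on Linux */
#include <fcntl.h>
#include <unistd.h>

/* Write back [offset, offset+len) and then ask the kernel to drop the
 * now-clean pages. Because the range is clean after the write-back,
 * POSIX_FADV_DONTNEED can evict it without further disk I/O. */
static int writeback_and_drop(int fd, off_t offset, off_t len)
{
#ifdef SYNC_FILE_RANGE_WRITE
    /* Force the dirty range to disk without a full fsync. */
    if (sync_file_range(fd, offset, len,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        return -1;
#else
    if (fdatasync(fd) != 0)          /* portable fallback (assumption) */
        return -1;
#endif
    /* Pages are clean now, so DONTNEED frees them without writing. */
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```

Calling DONTNEED on a still-dirty range would instead trigger write-back, which is exactly the case the mail above wants to avoid.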
[HACKERS] drop duplicate buffers in OS
Hi, I created a patch that can drop duplicate buffers in the OS cache using a usage_count algorithm. I have been developing this patch since last summer. This feature seems to be a hot topic at the moment, so I am submitting it earlier than I had planned. When usage_count is high in shared_buffers, those buffers are hard to evict from shared_buffers, yet the same buffers aren't required in the OS file cache, because they are never accessed there (Postgres accesses shared_buffers instead). So I created an algorithm that drops file cache pages which have a high usage_count in shared_buffers and are in a clean state in the OS. If a file cache page is clean in the OS and we execute posix_fadvise(DONTNEED) on it, it can be freed from the file cache without writing to physical disk. This algorithm solves the double-buffering problem and can use memory more efficiently. I am testing the DBT-2 benchmark now... Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1107,1112 include 'filename' --- 1107,1130 + + drop-duplicate-buffers (integer) + +drop-duplicate-buffers configuration parameter + + + + Sets the minimum usage count in shared_buffers at which a buffer's duplicate copy is dropped from the OS file cache. + When you use this parameter, set shared_buffers to about 50% of memory + or higher, and set this parameter to 4 or 5. It needs to be set carefully. If you + use this parameter with a small shared_buffers, transaction performance will + decline. This parameter requires the posix_fadvise() system call. + If your system doesn't support posix_fadvise(), it has no effect.
+ + + + + temp_buffers (integer) *** a/src/backend/storage/buffer/bufmgr.c --- b/src/backend/storage/buffer/bufmgr.c *** *** 69,74 --- 69,75 /* GUC variables */ bool zero_damaged_pages = false; int bgwriter_lru_maxpages = 100; + int DropDuplicateBuffers = -1; double bgwriter_lru_multiplier = 2.0; bool track_io_timing = false; *** *** 111,116 static volatile BufferDesc *BufferAlloc(SMgrRelation smgr, --- 112,118 static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln); static void AtProcExit_Buffers(int code, Datum arg); static int rnode_comparator(const void *p1, const void *p2); + static void DropDuplicateOSCache(volatile BufferDesc *bufHdr); /* *** *** 835,841 BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, * 1 so that the buffer can survive one clock-sweep pass.) */ buf->tag = newTag; ! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT); if (relpersistence == RELPERSISTENCE_PERMANENT) buf->flags |= BM_TAG_VALID | BM_PERMANENT; else --- 837,843 * 1 so that the buffer can survive one clock-sweep pass.) */ buf->tag = newTag; ! buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT | BM_FADVED); if (relpersistence == RELPERSISTENCE_PERMANENT) buf->flags |= BM_TAG_VALID | BM_PERMANENT; else *** *** 1101,1107 PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy) --- 1103,1120 if (strategy == NULL) { if (buf->usage_count < BM_MAX_USAGE_COUNT) + { buf->usage_count++; + + /* + * If the buffer is clean, we can remove duplicate buffers + * from the file cache in OS without physical disk writing, + * which is for more efficient whole memory using. It is needed + * to execute at once per buffers. 
+ */ + if (!(buf->flags & BM_FADVED) && !(buf->flags & BM_JUST_DIRTIED)) + DropDuplicateOSCache(buf); + } } else { *** *** 1953,1958 FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln) --- 1966,1974 */ recptr = BufferGetLSN(buf); + /* mark that this buffer's copy in the OS file cache is no longer clean */ + buf->flags |= BM_FADVED; + /* To check if block content changes while flushing. - vadim 01/17/97 */ buf->flags &= ~BM_JUST_DIRTIED; UnlockBufHdr(buf); *** *** 2101,2106 BufferGetLSNAtomic(Buffer buffer) --- 2117,2144 return lsn; } + /* + * Drop the duplicate copy of a buffer that is cached both in the OS and in + * shared_buffers, for more efficient use of file cache space and easier + * dirty file cache management. + */ + static void + DropDuplicateOSCache(volatile BufferDesc *bufHdr) + { + SMgrRelation reln; + MdfdVec *v; + off_t seekpos; + + /* Drop the OS cache copy of buffers whose usage_count in shared_buffers is high */ + if (DropDuplicateBuffers != -1 && bufHdr->usage_count >= DropDuplicateBuffers) + { + reln = smgropen(bufHdr->tag.rnode, InvalidBackendId); + v = _mdfd_getseg(reln, bufHdr->tag.forkNum, bufHdr->tag.blockNum, false, 0); + seekpos = (off_t) BLC
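The conditions scattered through the hunks above (the GUC guard, the once-per-buffer BM_FADVED flag, and the clean-page requirement) can be summarized as one predicate. A behavioral sketch with illustrative flag values, not the patch's code:

```c
/* Gate for dropping the OS copy of a shared buffer, mirroring the checks
 * in PinBuffer()/DropDuplicateOSCache(). Flag bit values are illustrative. */
#define BM_JUST_DIRTIED (1 << 0)
#define BM_FADVED       (1 << 1)

static int should_drop_os_copy(int drop_duplicate_buffers, /* GUC; -1 = off */
                               int usage_count, unsigned flags)
{
    if (drop_duplicate_buffers == -1)
        return 0;                     /* feature disabled */
    if (flags & BM_FADVED)
        return 0;                     /* already advised once for this buffer */
    if (flags & BM_JUST_DIRTIED)
        return 0;                     /* OS copy may be dirty; DONTNEED would write */
    return usage_count >= drop_duplicate_buffers;
}
```

The BM_FADVED bit is what makes the advice a once-per-buffer action; FlushBuffer() sets it again when a write makes the OS copy dirty.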
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
Hi, I fixed this patch and am submitting it to CF4. My past patch had a significant bug: a mistaken calculation of the offset passed to posix_fadvise() :-( It nevertheless worked without problems in pgbench, because pgbench transactions are always random access... And I tested my patch with the DBT-2 benchmark. The results are as follows.

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB (PC3-10600R-9)
Disk: 146GB (15k) * 4, RAID 1+0
RAID controller: P410i/256MB
OS: RHEL 6.4 (x86_64)
FS: Ext4

* DBT-2 result (WH400, SESSION=100, ideal_score=5160)
Method     | score | average | 90%tile | Maximum
-----------+-------+---------+---------+---------
plain      |  3589 |   9.751 |  33.680 |  87.8036
option=off |  3670 |   9.107 |  34.267 |  79.3773
option=on  |  4222 |   5.140 |   7.619 | 102.473

"option" is the "readahead_strategy" option, and "on" is my proposal. "average", "90%tile", and "Maximum" are latencies. Average latency is 2 times better than with plain!

* Detailed results (uploading now; please wait an hour...)
[plain] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/index_thput.html
[option=off] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_off_20140109/HTML/index_thput.html
[option=on] http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_on_20140109/HTML/index_thput.html

We can see some very slow latencies in the test of my proposed method. Active transactions are 20% of the total, and slow transactions are 80%. It might be the Pareto principle in CHECKPOINT ;-) #It's a joke. I will test the performance of some join SQLs and the TPC-3 benchmark this week or next. Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/configure --- b/configure *** *** 11303,11309 fi LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` !
for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" --- 11303,11309 LIBS_including_readline="$LIBS" LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do : as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var" *** a/contrib/pg_prewarm/pg_prewarm.c --- b/contrib/pg_prewarm/pg_prewarm.c *** *** 179,185 pg_prewarm(PG_FUNCTION_ARGS) */ for (block = first_block; block <= last_block; ++block) { ! smgrread(rel->rd_smgr, forkNumber, block, blockbuffer); ++blocks_done; } } --- 179,185 */ for (block = first_block; block <= last_block; ++block) { ! smgrread(rel->rd_smgr, forkNumber, block, blockbuffer, BAS_BULKREAD); ++blocks_done; } } *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1293,1298 include 'filename' --- 1293,1322 + + readahead_strategy (integer) + +readahead_strategyconfiguration parameter + + + + This feature is to select which readahead strategy is used. When we + set off(default), readahead strategy is optimized by OS. On the other + hands, when we set on, readahead strategy is optimized by Postgres. + In typicaly situations, OS readahead strategy will be good working, + however Postgres often knows better readahead strategy before + executing disk access. For example, we can easy to predict access + pattern when we input SQLs, because planner of postgres decides + efficient access pattern to read faster. 
And it might be random access + pattern or sequential access pattern. It will be less disk IO and more + efficient to use file cache in OS. It will be better performance. + However this optimization is not complete now, so it is necessary to + choose it carefully in considering situations. Default setting is off + that is optimized by OS, and whenever it can change it. + + + + *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** *** 9125,9131 copy_relation_data(SMgrRelation src, SMgrRelation dst, /* If we got a cancel signal during the copy of the data, quit */ CHECK_FOR_INTERRUPTS(); ! smgrread(src, forkNum, blkno, buf); if (!PageIsVerified(page, blkno)) ereport(ERROR, --- 9125,9131 /* If we got a cancel signal during th
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/12/17 21:29), Simon Riggs wrote: These are interesting results. Good research. Thanks! They also show that the benefit of this is very specific to the exact task being performed. I can't see any future for a setting that applies to everything or nothing. We must be more selective. This patch still needs some human judgement about whether readahead should be on or off, but it might already be useful for clever users. However, I'd like to implement at least the minimum additional optimization. We also need much better benchmark results, clearly laid out, so they can be reproduced and discussed. I think this feature is a big benefit for OLTP, and it might already be useful for BI. BI queries are mostly complicated, so we will need to test more situations. The printf debugging is very useful for debugging my patch, and it will accelerate the optimization. Please keep working on this. OK, I will keep at it patiently. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] pg_rewarm status
(2013/12/18 5:33), Robert Haas wrote: Sounds like it might be worth dusting the patch off again... I'd like to request that you add an all_index option and a usage_count option. When the all_index option is selected, all indexes are warmed even though the user doesn't give a relation name. And the usage_count option sets the usage_count in shared_buffers, so useful buffers will remain longer and not be evicted easily. I think these are easy to implement and useful, so please add them if you like. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
Hi, I fixed the patch to improve the following points.
- It can now compile on MacOS.
- The GUC name is changed from enable_kernel_readahead to readahead_strategy.
- POSIX_FADV_SEQUENTIAL is changed to POSIX_FADV_NORMAL when the sequential access strategy is selected; the reason is explained below...
I tested two simple access patterns on pgbench tables at scale factor 1000.
A) SELECT count(aid) FROM pgbench_accounts; (Index Only Scan)
B) SELECT count(bid) FROM pgbench_accounts; (Seq Scan)
Before each test I restarted postgres and dropped the OS file cache. Unpatched PG was faster than patched PG for both query A and query B, by about 1.3 times. The result for query A was as expected, because patched PG cannot execute readahead at all, so a cache-cold situation is bad for patched PG. It might still be good in a cache-hot situation, because then nothing is read from disk at all, and the OS can account for file cache usage and know which cache is important. However, the result for query B was unexpected: my patch selects POSIX_FADV_SEQUENTIAL correctly, but it was slow. I cannot understand why, even after reading the kernel source code... Next, I changed POSIX_FADV_SEQUENTIAL to POSIX_FADV_NORMAL in my patch, and query B became as fast as unpatched PG. In heavily random access benchmark tests, namely pgbench and DBT-2, my patched PG is about 1.1 - 1.3 times faster than unpatched PG. But the postgres buffer hint strategy algorithm has not been optimized for readahead strategy yet, and I have not fixed that; it still only covers the ring buffer algorithm in shared_buffers. The attached printf-debug patch will show you which buffer strategy postgres selects internally: when you see "S" it selects the sequential access strategy, and when you see "R" it selects the random access strategy. It might be interesting for you; it's very visual. Example output is here.
[mitsu-ko@localhost postgresql]$ bin/vacuumdb
SSS~~SS
[mitsu-ko@localhost postgresql]$ bin/psql -c "EXPLAIN ANALYZE SELECT count(aid) FROM pgbench_accounts"
QUERY PLAN
-
Aggregate (cost=2854.29..2854.30 rows=1 width=4) (actual time=33.438..33.438 rows=1 loops=1)
  -> Index Only Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.29..2604.29 rows=10 width=4) (actual time=0.072..20.912 rows=10 loops=1)
     Heap Fetches: 0
Total runtime: 33.552 ms
(4 rows)
RRR~~RR
[mitsu-ko@localhost postgresql]$ bin/psql -c "EXPLAIN ANALYZE SELECT count(bid) FROM pgbench_accounts"
SSS~~SS
--
Aggregate (cost=2890.00..2890.01 rows=1 width=4) (actual time=40.315..40.315 rows=1 loops=1)
  -> Seq Scan on pgbench_accounts (cost=0.00..2640.00 rows=10 width=4) (actual time=0.112..23.001 rows=10 loops=1)
Total runtime: 40.472 ms
(3 rows)
That's all for now. Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/configure --- b/configure *** *** 19937,19943 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5 --- 19937,19943 ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5 *** a/doc/src/sgml/config.sgml --- b/doc/src/sgml/config.sgml *** *** 1252,1257 include 'filename' --- 1252,1281 + + readahead_strategy (integer) + +readahead_strategyconfiguration parameter + + + + This feature is to select which readahead strategy is used.
When we + set off(default), readahead strategy is optimized by OS. On the other + hands, when we set on, readahead strategy is optimized by Postgres. + In typicaly situations, OS readahead strategy will be good working, + however Postgres often knows better readahead strategy before + executing disk access. For example, we can easy to predict access + pattern when we input SQLs, because planner of postgres decides + effici
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/12/12 9:30), Claudio Freire wrote: On Wed, Dec 11, 2013 at 3:14 AM, KONDO Mitsumasa wrote: enable_readahead=os|fadvise with os = on, fadvise = off Hmm, fadvise is a method, not a purpose, so I am considering another name for this GUC. Yeah, I was thinking of opening the door for readahead=aio, but whatever clearer than on-off would work ;) I'm very interested in Postgres with libaio, and I'd like to see the performance improvements. I'm not sure about libaio, however; it will face the exclusive-buffer-lock problem in asynchronous IO. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Time-Delayed Standbys
(2013/12/12 18:09), Simon Riggs wrote: On 9 December 2013 10:54, KONDO Mitsumasa wrote: (2013/12/09 19:35), Pavel Stehule wrote: 2013/12/9 KONDO Mitsumasa mailto:kondo.mitsum...@lab.ntt.co.jp>> Hi Fabrízio, I tested your v4 patch; here are my review comments. * Fix typo > 49 -# commited transactions from the master, specify a recovery time delay. > 49 +# committed transactions from the master, specify a recovery time delay. * Fix white space > 177 - if (secs <= 0 && microsecs <=0) > 177 + if (secs <= 0 && microsecs <=0 ) * Add functionality (my proposal) We could allow a negative number in min_standby_apply_delay. I think this feature is for worldwide replication situations. For example, the master server is in Japan and the slave server is in San Francisco. Japan time is ahead of San Francisco time, and if we want a delay in this situation, min_standby_apply_delay may need to be negative. So I propose changing the time-delay conditional branch as follows: > - if (min_standby_apply_delay > 0) > + if (min_standby_apply_delay != 0) What do you think? It should also work correctly. what about using an interval instead of an absolute time? This is because the local time is recorded in the XLOG, and calculating a global time has a big cost. I agree with your request here, but I don't think negative values are the right way to implement that, at least it would not be very usable. I think my proposal is the easiest and simplest way to solve this problem, and I believe that someone who cannot calculate a time-zone difference will not set up a replication cluster across continents. My suggestion would be to add the TZ to the checkpoint record. This way all users of WAL can see the TZ of the master and act accordingly. I'll do a separate patch for that. It is something useful for other situations as well. However, it might lead to a long and complicated discussion... I hope this patch can be committed in this commit-fest or the next, final one.
Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Time-Delayed Standbys
(2013/12/12 7:23), Fabrízio de Royes Mello wrote: On Wed, Dec 11, 2013 at 7:47 PM, Andres Freund * hot_standby=off: Makes delay useable with wal_level=archive (and thus > a lower WAL volume) > * standby_mode=off: Configurations that use tools like pg_standby and > similar simply don't need standby_mode=on. If you want to trigger > failover from within the restore_command you *cannot* set it. > * recovery_target_*: It can still make sense if you use > pause_at_recovery_target. I don't think some of those arguments are quite right... We can just set standby_mode=on when we use "min_standby_apply_delay" with pg_standby and similarly simple tools. However, I tend to agree that we don't need to prohibit anything except standby_mode. So I'd like to propose renaming the "min_standby_apply_delay" parameter to "min_recovery_apply_delay", which is a natural name for this feature. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Time-Delayed Standbys
(2013/12/10 18:38), Andres Freund wrote: "master PITR"? What's that? All PITR is based on recovery.conf and thus not really a "master"? "master PITR" is PITR with "standby_mode = off"; it's just recovery from a base backup. The difference between "master PITR" and "standby" is that the former ends up on an independent timeline ID, while the latter keeps the same timeline ID, following the master server. The purposes are different in the first place. Why should we prohibit using this feature in PITR? I don't see any advantage in doing so. If somebody doesn't want the delay, they shouldn't set it in the configuration file. End of story. Unfortunately, there are a lot of careless users in the world... I think you have such clients, too. There's not really that meaningful a distinction between PITR and replication using archive_command. Especially when using *pause_after. It is meaningless in "master PITR": the server would become a master with a new timeline ID at an unexpected time. I think this feature will be used in a lot of scenarios in which PITR is currently used. We have to judge which is better: gaining some potential, or protecting careless users. And we had better wait for the author's comment... Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/12/10 22:55), Claudio Freire wrote: On Tue, Dec 10, 2013 at 5:03 AM, KONDO Mitsumasa wrote: I revised this patch and re-ran the performance test; it works correctly on Linux with no compiler warnings. I added a GUC for the enable_kernel_readahead option in the new version. When this GUC is on (the default), it uses POSIX_FADV_NORMAL, the general OS readahead. When it is off, it uses POSIX_FADV_RANDOM or POSIX_FADV_SEQUENTIAL as judged from the buffer hint in Postgres, so readahead is optimized by Postgres. We can change this parameter per transaction, anywhere and anytime. I'd change the naming to enable_readahead=os|fadvise with os = on, fadvise = off OK, I also think "on" or "off" naming is not good. But fadvise is a method, not a purpose, so I am considering other names for this GUC. 1) readahead_strategy=os|pg This naming is good for future additional implementations. If we later want a maximum readahead setting that always uses POSIX_FADV_SEQUENTIAL, we can add "max". 2) readahead_optimizer=os|pg or readahead_strategist=os|pg This naming makes it easy to understand who optimizes readahead, but it is not extensible for future implementations. And, if you want to keep the on/off values, I'd reverse them. Because off reads more like "I don't do anything special", and in your patch it's quite the opposite. I understand your feeling. If we adopt an "on|off" setting, I would like the GUC to be optimized_readahead=off|on. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Why we are going to have to go DirectIO
(2013/12/11 10:25), Tom Lane wrote: > Jeff Janes writes: >> On Tue, Dec 3, 2013 at 11:39 PM, Claudio Freire >> wrote: >>> Problem is, Postgres relies on a working kernel cache for checkpoints. >>> Checkpoint logic would have to be heavily reworked to account for an >>> impaired kernel cache. > >> I don't think it would need anything more than a sorted checkpoint. > > Nonsense. We don't have access to the physical-disk-layout information > needed to do reasonable sorting; The OS knows the physical disk layout, as shown below.
> [mitsu-ko@ssd ~]$ filefrag -v .bashrc
> Filesystem type is: ef53
> File size of .bashrc is 124 (1 block, blocksize 4096)
> ext logical physical expected length flags
> 0 0 15761410 1 eof
> .bashrc: 1 extent found
If we need this information, we can get the physical disk layout whenever we want. > to say nothing of doing something > intelligent in a multi-spindle environment, or whenever any other I/O > is going on concurrently. The IO scheduler in the OS knows that best. So I think buffered IO is faster than direct IO in general situations. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
Hi, I revised this patch and re-ran the performance test; it works correctly on Linux with no compiler warnings. I added a GUC for the enable_kernel_readahead option in the new version. When this GUC is on (the default), it uses POSIX_FADV_NORMAL, the general OS readahead. When it is off, it uses POSIX_FADV_RANDOM or POSIX_FADV_SEQUENTIAL as judged from the buffer hint in Postgres, so readahead is optimized by Postgres. We can change this parameter per transaction, anywhere and anytime.

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB (PC3-10600R-9)
Disk: 146GB (15k) * 4, RAID 1+0
RAID controller: P410i/256MB
OS: RHEL 6.4 (x86_64)
FS: Ext4

* Test setting
I use "pgbench -c 8 -j 4 -T 2400 -S -P 10 -a". I also use my accuracy patch in this test. So I executed the following commands before each benchmark.
1. cluster all databases
2. truncate pgbench_history
3. checkpoint
4. sync
5. checkpoint

* postgresql.conf
shared_buffers = 2048MB
maintenance_work_mem = 64MB
wal_level = minimal
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7

* Performance test result
** In-memory database size
s=1000        |   1   |   2   |   3   |  avg
---------------------------------------------
readahead=on  | 39836 | 40229 | 40055 | 40040
readahead=off | 31259 | 29656 | 30693 | 30536
ratio         |  78%  |  74%  |  77%  |  76%

** Over-memory database size
s=2000        |   1   |   2   |   3   |  avg
---------------------------------------------
readahead=on  |  1288 |  1370 |  1367 |  1341
readahead=off |  1683 |  1688 |  1395 |  1589
ratio         |  131% |  123% |  102% |  118%

s=3000        |   1   |   2   |   3   |  avg
---------------------------------------------
readahead=on  |   965 |   862 |   993 |   940
readahead=off |  1113 |  1098 |   935 |  1049
ratio         |  115% |  127% |   94% |  112%

Performance looks good except at scale factor 1000. When the readahead parameter is off, disk IO is kept to the necessary minimum, so it is faster than "readahead=on", which issues useless disk IO. For example, which is faster over many transactions: an 8KB random read or a 12KB random read from disk? It is self-evident that the former is faster.
At scale factor 1000, buffers become hot more slowly than with "readahead=on", so it appears to perform worse. But that is the essence of measuring performance, and you can confirm it in the attached benchmark graphs. We can use this parameter when the buffers are relatively hot. If you want to see other trial graphs, I will send them. And I will add MacOS support and create documentation for this patch this week. #The MacOS machine is at my house. Regards, -- Mitsumasa KONDO NTT Open Source Software Center *** a/configure --- b/configure *** *** 19937,19943 LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'` ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5 --- 19937,19943 ! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l do as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh` { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5 *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** *** 9119,9125 copy_relation_data(SMgrRelation src, SMgrRelation dst, /* If we got a cancel signal during the copy of the data, quit */ CHECK_FOR_INTERRUPTS(); ! smgrread(src, forkNum, blkno, buf); if (!PageIsVerified(page, blkno)) ereport(ERROR, --- 9119,9125 /* If we got a cancel signal during the copy of the data, quit */ CHECK_FOR_INTERRUPTS(); !
smgrread(src, forkNum, blkno, buf, (char *) BAS_BULKREAD); if (!PageIsVerified(page, blkno)) ereport(ERROR, *** a/src/backend/storage/buffer/bufmgr.c --- b/src/backend/storage/buffer/bufmgr.c *** *** 41,46 --- 41,47 #include "pg_trace.h" #include "pgstat.h" #include "postmaster/bgwriter.h" + #include "storage/buf.h" #include "storage/buf_internals.h" #include "storage/bufmgr.h" #include "storage/ipc.h" *** *** 451,457 ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); ! smgrread(smgr, forkNum, blockNum, (char *) bufBlock); if (track_io_timing) { --- 452,458 if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); ! smgrread(smgr, forkNum, block
Re: [HACKERS] Time-Delayed Standbys
(2013/12/09 20:29), Andres Freund wrote: On 2013-12-09 19:51:01 +0900, KONDO Mitsumasa wrote: Let me add a comment: we have to consider three situations. 1. PITR 2. replication standby 3. replication standby with restore_command I think this patch cannot delay in situation 1. Why? I have three reasons. 1. It is written in the document. Can we remove it? 2. The name of this feature is "Time-delayed *standbys*", not "Time-delayed *recovery*". Can we change it? 3. I think it is unnecessary in master PITR. If it could delay in master PITR, the server would become a master at an unexpected time instead of continuing recovery; that is meaningless. I'd like to ask what you expect from this feature and how you would use it. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Time-Delayed Standbys
(2013/12/09 19:35), Pavel Stehule wrote: 2013/12/9 KONDO Mitsumasa mailto:kondo.mitsum...@lab.ntt.co.jp>> Hi Fabrízio, I tested your v4 patch; here are my review comments. * Fix typo > 49 -# commited transactions from the master, specify a recovery time delay. > 49 +# committed transactions from the master, specify a recovery time delay. * Fix white space > 177 - if (secs <= 0 && microsecs <=0) > 177 + if (secs <= 0 && microsecs <=0 ) * Add functionality (I propose) We could allow a negative number for min_standby_apply_delay. I think this feature is for worldwide replication situations. For example, the master server is in Japan and the slave server is in San Francisco. Japan time is ahead of San Francisco time, so if we want a delay in this situation, min_standby_apply_delay may need a negative number. So I propose changing the time-delay conditional branch as follows. > - if (min_standby_apply_delay > 0) > + if (min_standby_apply_delay != 0) What do you think? It should also work correctly. what about using an interval instead of an absolute time? That is because local time is recorded in the WAL, and calculating a global time would have a big cost. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
(2013/12/09 19:36), KONDO Mitsumasa wrote: * Problem 1 I read the document you wrote. It says "PITR is not affected". However, when I run PITR with min_standby_apply_delay=300, the server cannot start. The log is as follows. [mitsu-ko@localhost postgresql]$ bin/pg_ctl -D data2 start server starting [mitsu-ko@localhost postgresql]$ LOG: database system was interrupted; last known up at 2013-12-08 18:57:00 JST LOG: creating missing WAL directory "pg_xlog/archive_status" cp: cannot stat `../arc/0002.history': LOG: starting archive recovery LOG: restored log file "00010041" from archive LOG: redo starts at 0/4128 LOG: consistent recovery state reached at 0/41F0 LOG: database system is ready to accept read only connections LOG: restored log file "00010042" from archive FATAL: cannot wait on a latch owned by another process LOG: startup process (PID 30501) exited with exit code 1 LOG: terminating any other active server processes We need a recovery flag to control the PITR situation. Adding my comment: we have to consider three situations. 1. PITR 2. replication standby 3. replication standby with restore_command I think this patch cannot delay in situation 1. So I think you should just add a StandbyModeRequested flag to the conditional branch. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
Hi Fabrízio, I tested your v4 patch; here are my review comments. * Fix typo > 49 -# commited transactions from the master, specify a recovery time delay. > 49 +# committed transactions from the master, specify a recovery time delay. * Fix white space > 177 - if (secs <= 0 && microsecs <=0) > 177 + if (secs <= 0 && microsecs <=0 ) * Add functionality (I propose) We could allow a negative number for min_standby_apply_delay. I think this feature is for worldwide replication situations. For example, the master server is in Japan and the slave server is in San Francisco. Japan time is ahead of San Francisco time, so if we want a delay in this situation, min_standby_apply_delay may need a negative number. So I propose changing the time-delay conditional branch as follows. > - if (min_standby_apply_delay > 0) > + if (min_standby_apply_delay != 0) What do you think? It should also work correctly. * Problem 1 I read the document you wrote. It says "PITR is not affected". However, when I run PITR with min_standby_apply_delay=300, the server cannot start. The log is as follows. [mitsu-ko@localhost postgresql]$ bin/pg_ctl -D data2 start server starting [mitsu-ko@localhost postgresql]$ LOG: database system was interrupted; last known up at 2013-12-08 18:57:00 JST LOG: creating missing WAL directory "pg_xlog/archive_status" cp: cannot stat `../arc/0002.history': LOG: starting archive recovery LOG: restored log file "00010041" from archive LOG: redo starts at 0/4128 LOG: consistent recovery state reached at 0/41F0 LOG: database system is ready to accept read only connections LOG: restored log file "00010042" from archive FATAL: cannot wait on a latch owned by another process LOG: startup process (PID 30501) exited with exit code 1 LOG: terminating any other active server processes We need a recovery flag to control the PITR situation. That's all for now. If you are busy, please fix it at your own pace. 
I'm busy too, so I don't mind waiting:-) Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why we are going to have to go DirectIO
(2013/12/05 23:42), Greg Stark wrote: On Thu, Dec 5, 2013 at 8:35 AM, KONDO Mitsumasa wrote: Yes. And using DirectIO efficiently is more difficult than buffered I/O. If we simply switched write() to direct I/O in PostgreSQL, it would produce the worst kind of ugly random I/O. Using DirectIO presumes you're using libaio or threads to implement prefetching and asynchronous I/O scheduling. I think in the long term there are only two ways to go here. Either a) we use DirectIO and implement an I/O scheduler in Postgres or b) We use mmap and use new system calls to give the kernel all the information Postgres has available to it to control the I/O scheduler. I partly agree with method (b). As others have said, I don't think the mmap API is meant for controlling I/O. I think posix_fadvise(), sync_file_range(), and fallocate() are an easier way to realize a better I/O scheduler in Postgres. These system calls cannot cause data corruption, and we can just use the existing implementation; they affect only performance. Here is my survey of posix_fadvise() and sync_file_range(). The rules are simple. #Most of this explanation comes from the Linux man pages:-) * Optimize readahead in the OS [ posix_fadvise() ] These options are mainly for read performance. - POSIX_FADV_SEQUENTIAL -> The OS readahead window is set to its maximum. - POSIX_FADV_RANDOM -> Disables OS readahead; the application can then judge file-cache access frequency and use the file cache efficiently itself. - POSIX_FADV_NORMAL -> The OS readahead window is tuned dynamically for each situation. If you cannot decide on a disk-access strategy, select this option; it should work well in most cases. * Control dirty and clean buffers in the OS [ posix_fadvise() and sync_file_range() ] These options control write and read performance through the OS file cache. - POSIX_FADV_DONTNEED -> Drops the pages from the file cache. If they are dirty, they are written to disk and then dropped; if clean, they are simply dropped from the OS file cache. 
- sync_file_range() -> If you want to write dirty buffers to disk while keeping them in the OS file cache, select this system call. It can also control the amount written per call. - POSIX_FADV_NOREUSE -> If you think the cached pages will not be needed again, set this option; they will be dropped soon. - POSIX_FADV_WILLNEED -> If you think the cached pages will be important, set this option; they will tend to remain in the OS file cache. That's all. The kernel cannot perfectly predict the I/O pattern of each piece of middleware, so it optimizes with general heuristic algorithms. I think that is the right approach. However, PostgreSQL can predict its I/O pattern in the planner, executor, and checkpointer, so we had better set the optimum posix_fadvise() flag or call sync_file_range() before/after the generic I/O system calls. I think they would give us good I/O control and scheduling without unreliable implementations. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why we are going to have to go DirectIO
(2013/12/04 16:39), Claudio Freire wrote: On Wed, Dec 4, 2013 at 4:28 AM, Tatsuo Ishii wrote: Can we avoid the Linux kernel problem by simply increasing our shared buffer size, say up to 80% of memory? That would make swapping more likely. Is that the case? If the system has not enough memory, the kernel buffer will be used for other purpose, and the kernel cache will not work very well anyway. In my understanding, the problem is, even if there's enough memory, the kernel's cache does not work as expected. Problem is, Postgres relies on a working kernel cache for checkpoints. Checkpoint logic would have to be heavily reworked to account for an impaired kernel cache. Really, there's no difference between fixing the I/O problems in the kernel(s) vs in postgres. The only difference is, in the kernel(s), everyone profits, and you've got a huge head start. Yes. And using DirectIO efficiently is more difficult than buffered I/O. If we simply switched write() to direct I/O in PostgreSQL, it would produce the worst kind of ugly random I/O. Communicating more with the kernel (through posix_fadvise, fallocate, aio, iovec, etc...) would probably be good, but it does expose more kernel issues. posix_fadvise, for instance, is a double-edged sword ATM. I do believe, however, that exposing those issues and prompting a fix is far preferable than silently working around them. Agreed. And I believe that well-controlled buffered I/O is faster and easier than perfectly controlled DirectIO. In fact, the Oracle database uses buffered I/O to access small datasets and DirectIO to access big datasets, in order to use the OS file cache more efficiently. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Why we are going to have to go DirectIO
(2013/12/04 11:28), Tatsuo Ishii wrote: Magnus Hagander writes: On Tue, Dec 3, 2013 at 11:44 PM, Josh Berkus wrote: Would certainly be nice. Realistically, getting good automated performace tests will require paying someone like Greg S., Mark or me for 6 solid months to develop them, since worthwhile open source performance test platforms currently don't exist. That money has never been available; maybe I should do a kickstarter. So in order to get *testing* we need to pay somebody. But to build a great database server, we can rely on volunteer efforts or sponsorship from companies who are interested in moving the project forward? And even more to the point, volunteers to reinvent the kernel I/O stack can be found on every street corner? And those volunteers won't need any test scaffolding to be sure that *their* version never has performance regressions? (Well, no, they won't, because no such thing will ever be built. But we do need better test scaffolding for real problems.) Can we avoid the Linux kernel problem by simply increasing our shared buffer size, say up to 80% of memory? That would make swapping more likely. I think we should use the newer Linux system calls such as posix_fadvise(), fallocate(), and sync_file_range() when we use Linux buffered I/O. However, PostgreSQL barely uses these system calls. In particular, I think the checkpoint algorithm is very ugly.. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
(2013/12/04 4:00), Andres Freund wrote: On 2013-12-03 13:46:28 -0500, Robert Haas wrote: On Tue, Dec 3, 2013 at 12:36 PM, Fabrízio de Royes Mello wrote: On Tue, Dec 3, 2013 at 2:33 PM, Christian Kruse wrote: Hi Fabrizio, looks good to me. I did some testing on 9.2.4, 9.2.5 and HEAD. It applies and compiles w/o errors or warnings. I set up a master and two hot standbys replicating from the master, one with 5 minutes delay and one without delay. After that I created a new database and generated some test data: CREATE TABLE test (val INTEGER); INSERT INTO test (val) (SELECT * FROM generate_series(0, 100)); The non-delayed standby nearly instantly had the data replicated, the delayed standby was replicated after exactly 5 minutes. I did not notice any problems, errors or warnings. Thanks for your review Christian... So, I proposed this patch previously and I still think it's a good idea, but it got voted down on the grounds that it didn't deal with clock drift. I view that as insufficient reason to reject the feature, but others disagreed. I really fail to see why clock drift should be this patch's responsibility. It's not like the world would go under^W data corruption would ensue if the clocks drift. Your standby would get delayed imprecisely. Big deal. From what I know of potential users of this feature, they would set it to at the very least 30min - that's WAY above the range for acceptable clock-drift on servers. Yes. I think the purpose of this patch is a long delay on the standby server, not a precisely timed short delay. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
(2013/11/30 5:34), Fabrízio de Royes Mello wrote: On Fri, Nov 29, 2013 at 5:49 AM, KONDO Mitsumasa mailto:kondo.mitsum...@lab.ntt.co.jp>> wrote: > * Problem1 > Your patch does not code recovery.conf.sample about recovery_time_delay. > Please add it. Fixed. OK. It looks fine now. > * Problem2 > When I set up a time-delayed standby and start the standby server, I cannot access the standby server with psql, because PG is in its initial recovery phase, which does not accept psql connections. I think a time-delayed standby should only delay the recovery position; it must not affect other functionality. > > I didn't test recovery on a master server with recovery_time_delay. If you have detailed test results for these cases, please send them to me. > Well, I could not reproduce the problem that you described. I ran the following test: 1) Clusters - build master - build slave and attach to the master using SR and config recovery_time_delay to 1min. 2) Stop the slave 3) Run some transactions on the master using pgbench to generate a lot of archives 4) Start the slave and connect to it using psql and in another session I can see all archive recovery log Hmm... I thought I had misread your email, but I reproduced it again. When I set a small recovery_time_delay (=3), it seemed to work correctly. However, when I set a long recovery_time_delay (=300), it did not work. My reproduction log is as follows. [mitsu-ko@localhost postgresql]$ bin/pgbench -T 30 -c 8 -j4 -p5432 starting vacuum...end. 
transaction type: TPC-B (sort of) scaling factor: 10 query mode: simple number of clients: 8 number of threads: 4 duration: 30 s number of transactions actually processed: 68704 latency average: 3.493 ms tps = 2289.196747 (including connections establishing) tps = 2290.175129 (excluding connections establishing) [mitsu-ko@localhost postgresql]$ vim slave/recovery.conf [mitsu-ko@localhost postgresql]$ bin/pg_ctl -D slave start server starting [mitsu-ko@localhost postgresql]$ LOG: database system was shut down in recovery at 2013-12-03 10:26:41 JST LOG: entering standby mode LOG: consistent recovery state reached at 0/5C4D8668 LOG: redo starts at 0/5C4000D8 [mitsu-ko@localhost postgresql]$ FATAL: the database system is starting up FATAL: the database system is starting up FATAL: the database system is starting up FATAL: the database system is starting up FATAL: the database system is starting up [mitsu-ko@localhost postgresql]$ bin/psql -p6543 psql: FATAL: the database system is starting up [mitsu-ko@localhost postgresql]$ bin/psql -p6543 psql: FATAL: the database system is starting up I attached my postgresql.conf and recovery.conf. With them the problem should reproduce. I think your patch needs recovery flags such as ArchiveRecoveryRequested and InArchiveRecovery, because time-delayed standby applies only to the replication situation, and it should not interfere with standby startup or archive recovery. Does that differ from what you intended? I think this patch has a lot of potential, but standby functionality is more important than this feature, and we may need to discuss what the best behavior for this patch is. Regards, -- Mitsumasa KONDO NTT Open Source Software Center conf.tar.gz Description: GNU Zip compressed data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
Hi Royes, I'm sorry for my late review... I see potential in your patch for PG's replication functionality, and it might be useful for many people. I checked your patch and have some comments for improvement. I haven't run detailed tests of unexpected situations yet, but I think Problem 2 below is an important functionality problem, so I ask you to solve it first. * Regression tests No problem. * Problem 1 Your patch does not add recovery_time_delay to recovery.conf.sample. Please add it. * Problem 2 When I set up a time-delayed standby and start the standby server, I cannot access the standby server with psql, because PG is in its initial recovery phase, which does not accept psql connections. I think a time-delayed standby should only delay the recovery position; it must not affect other functionality. I didn't test recovery on a master server with recovery_time_delay. If you have detailed test results for these cases, please send them to me. That is all for my first quick review of your patch. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logging WAL when updating hintbit
(2013/11/15 19:27), Sawada Masahiko wrote: On Thu, Nov 14, 2013 at 7:51 PM, Florian Weimer wrote: On 11/14/2013 07:02 AM, Sawada Masahiko wrote: I attached patch adds new wal_level 'all'. Shouldn't this be a separate setting? It's useful for storage which requires rewriting a partially written sector before it can be read again. Thank you for comment. Actually, I had thought to add separate parameter. I think he meant that if you can prove the amount of WAL is almost the same and there is no performance loss compared to before, you might not need a separate parameter in your patch. Did you measure the WAL volume with your patch? I'd like to know it. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of pg_stat_statement usage about buffer hit ratio
(2013/11/19 12:03), Peter Geoghegan wrote: On Mon, Nov 18, 2013 at 6:12 PM, KONDO Mitsumasa wrote: I read the discussion between Itagaki-san and Mr. Cerdic. He said that raw values are just simple. However, were his changes really simple? I cannot understand his aesthetic sense, or yours either:-( It's too complicated. Do you know how to tune PG from the local_* and temp_* information? At least, I think most users cannot tune from that information, and parts of it may not be useful at all. All of those costs are cumulative aggregates. If we didn't aggregate them, then the user couldn't possibly determine them on their own, to any approximation. That's the difference. If you think the local_* and temp_* aren't very useful, I'm inclined to agree, but it's too late to do anything about that now. I regret the past decision on Itagaki-san's patch, though improving it may no longer be possible. However, we can change it if we have a logical reason to do so. No. It's not only for geek tools and people with pre-packaged solutions at big companies, but also for common DBA tools. I don't think that the tool needs to be expensive. If selecting from the pg_stat_statements view every 1-3 seconds is too expensive for such a tool, we can have a discussion about being smarter, because there certainly are ways to optimize it. Now I understand why you say my patch is heavy! Your monitoring method is radically heavy. In general, we read the pg_stat_statements view every 1-5 minutes, because monitoring SQL must not be heavier than the ordinary workload SQL. When we measure real performance, we don't want the monitoring SQL's cost included. And I still cannot understand why you'd like to collect such drastically detailed per-statement performance. I only want to know the max, average, and stddev of each statement. They are enough, because we can improve the statements using that information. 
Regarding your min/max patch: I'm opposed to adding even more to the spinlock-protected counters struct, so that we can get an exact answer to a question where an approximate answer would very likely be good enough. And as Itagaki-san said 4 years ago, who is to say that what you've done here for buffers (or equally, what you've done in your min/max patch) is more interesting than the same thing but for another cost? The point of having what you've removed from the pg_stat_statements docs about calculating averages is that it is an example that can be generalized from. I certainly think there should be better tooling to make displaying costs over time easier, or characterizing the distribution, but unfortunately this isn't it. Something like pg_stat_statements is allowed to be approximate. That's considered an appropriate trade-off. Most obviously, today there can be hash table collisions, and some of the entries can therefore be plain wrong. Previously, I put the probability of 1 collision in the hash table at about 1% when pg_stat_statements.max is set to 10,000. So if your min/max patch was "implemented in userspace", and an outlier is lost in the noise with just one second of activity, I'm not terribly concerned about that. It's a trade-off, and if you don't think it's the correct one, well then I'm afraid that's just where you and I differ. As I've said many times, if you want to have a discussion about making aggressive snapshotting of the pg_stat_statements view more practical, I think that would be very useful. To summarize your comment: your patch is lower cost than my proposed patch, because my patch has a long-lock problem while yours does not. Is that right? If so, I can understand it in theory, but without real practice or a benchmark I'm not sure whether it actually holds. So we will need benchmark tests of both my patch and yours. I will try it. 
By the way, MySQL and Oracle, which are very popular databases, have these statistics. I think your argument might discourage people who want to migrate from those databases, and will only accelerate their popularity further. I think that there should be better tooling built on top of pg_stat_statements. I don't know what Oracle does, but I'm pretty sure that MySQL has nothing like pg_stat_statements. Please correct me if I'm mistaken. I attended db tech showcase 2013, held in Japan last week. An Oracle speaker introduced the performance schema and similar features in MySQL 5.6. Here are his slides; they are mostly in Japanese, but please look from page 31 onward, where it is written in SQL. http://www.slideshare.net/yoyamasaki/20131110-tuning-onmysql56 In MySQL 5.6, it has information such as sum_time, min_time, avg_time, max_time, and sum_lock_time. I think that is useful for understanding the statements we execute. As I said on the min/max thread, if we're no
Re: [HACKERS] Improvement of pg_stat_statement usage about buffer hit ratio
(2013/11/19 11:12), KONDO Mitsumasa wrote: (2013/11/19 3:56), Peter Geoghegan wrote: On Mon, Nov 18, 2013 at 10:49 AM, Fujii Masao wrote: The same idea was proposed before but not committed because Itagaki thought that pg_stat_statements view should report only raw values. Please read the following thread. I have the same feeling with him. Anyway we should listen to more opinions from other people, though. http://www.postgresql.org/message-id/20091222172719.8b86.52131...@oss.ntt.co.jp I read the discussion between Itagaki-san and Mr. Cerdic. He said that raw values are just simple. However, were his changes really simple? I cannot understand his aesthetic sense, or yours either:-( PG8.4 before: CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, OUT query text, OUT calls int8, OUT total_time float8, OUT rows int8 ) PG9.0 after: CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, OUT query text, OUT calls int8, OUT total_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, OUT shared_blks_written int8, OUT local_blks_hit int8, OUT local_blks_read int8, OUT local_blks_written int8, OUT temp_blks_read int8, OUT temp_blks_written int8 ) It's too complicated. Do you know how to tune PG from the local_* and temp_* information? At least, I think most users cannot tune from that information, and parts of it may not be useful at all. +1 from me. I think that a higher level tool needs to be built on pg_stat_statements to make things easy for those that want a slick, pre-packaged solution. No. It's not only for geek tools and people with pre-packaged solutions at big companies, but also for common DBA tools. By the way, MySQL and Oracle, which are very popular databases, have these statistics. 
I think your argument might discourage people who want to migrate from those databases, and will only accelerate their popularity further. As I said on the min/max thread, if we're not doing enough to help people who would like to build such a tool, we should discuss how we can do better. Could you tell me how to get min/max statistics at low cost? I'm not sure about the details of your patch in the CF, but it seems very high cost. To repeat: I think that if we want drastically detailed statistics, we have to create separate statistics tools, and your patch would be such a tool. However, pg_stat_statements should stay lightweight. Regards, Oh, I forgot to say it... I believe the essence of information technology is to present the most informative data in the least information at low cost. If that is wrong, please let me know the reason. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of pg_stat_statement usage about buffer hit ratio
(2013/11/19 3:56), Peter Geoghegan wrote: On Mon, Nov 18, 2013 at 10:49 AM, Fujii Masao wrote: The same idea was proposed before but not committed because Itagaki thought that pg_stat_statements view should report only raw values. Please read the following thread. I have the same feeling with him. Anyway we should listen to more opinions from other people, though. http://www.postgresql.org/message-id/20091222172719.8b86.52131...@oss.ntt.co.jp I read the discussion between Itagaki-san and Mr. Cerdic. He said that raw values are just simple. However, were his changes really simple? I cannot understand his aesthetic sense, or yours either:-( PG8.4 before: CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, OUT query text, OUT calls int8, OUT total_time float8, OUT rows int8 ) PG9.0 after: CREATE FUNCTION pg_stat_statements( OUT userid oid, OUT dbid oid, OUT query text, OUT calls int8, OUT total_time float8, OUT rows int8, OUT shared_blks_hit int8, OUT shared_blks_read int8, OUT shared_blks_written int8, OUT local_blks_hit int8, OUT local_blks_read int8, OUT local_blks_written int8, OUT temp_blks_read int8, OUT temp_blks_written int8 ) It's too complicated. Do you know how to tune PG from the local_* and temp_* information? At least, I think most users cannot tune from that information, and parts of it may not be useful at all. +1 from me. I think that a higher level tool needs to be built on pg_stat_statements to make things easy for those that want a slick, pre-packaged solution. No. It's not only for geek tools and people with pre-packaged solutions at big companies, but also for common DBA tools. By the way, MySQL and Oracle, which are very popular databases, have these statistics. I think your argument might discourage people who want to migrate from those databases, and will only accelerate their popularity further. 
As I said on the min/max thread, if we're not doing enough to help people who would like to build such a tool, we should discuss how we can do better. Could you tell me how to get min/max statistics at low cost? I'm not sure about the details of your patch in the CF, but it seems very high cost. To repeat: I think that if we want drastically detailed statistics, we have to create separate statistics tools, and your patch would be such a tool. However, pg_stat_statements should stay lightweight. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improvement of pg_stat_statement usage about buffer hit ratio
(2013/11/18 20:16), Haribabu kommi wrote: On 18 October 2013 13:35 KONDO Mitsumasa wrote: This patch conflicts with the pg_stat_statement_min_max_exectime patch which I submitted, and the pg_stat_statement_min_max_exectime patch also adds new columns, min_time and max_time. So I'd like to change it in this opportunity. This patch adds another column "shared_blks_hit_percent" to the pg_stat_statements view, which is very beneficial for the user to know what percentage of blocks are hit. All changes are fine and working as described. Marked as ready for committer. Thank you for your review! However, I'd like to add the average time of each statement, too. Attached is my latest patch, adding shared_blks_hit_percent and avg_time. This is the main added code: + total_time / calls::float AS avg_time, If this patch and the min/max and stddev patch are committed, we can see more detailed yet simple information in pg_stat_statements, with lightweight code. Regards, -- Mitsumasa KONDO NTT Open Source Software Center diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql new file mode 100644 index 000..f0a8e0f --- /dev/null +++ b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql @@ -0,0 +1,63 @@ +/* contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. 
\quit + +/* First we have to remove them from the extension */ +ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements; +ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(); + +/* Then we can drop them */ +DROP VIEW pg_stat_statements; +DROP FUNCTION pg_stat_statements(); + +/* Now redefine */ +CREATE FUNCTION pg_stat_statements( +OUT userid oid, +OUT dbid oid, +OUT query text, +OUT calls int8, +OUT total_time float8, +OUT rows int8, +OUT shared_blks_hit int8, +OUT shared_blks_read int8, +OUT shared_blks_dirtied int8, +OUT shared_blks_written int8, +OUT local_blks_hit int8, +OUT local_blks_read int8, +OUT local_blks_dirtied int8, +OUT local_blks_written int8, +OUT temp_blks_read int8, +OUT temp_blks_written int8, +OUT blk_read_time float8, +OUT blk_write_time float8 +) +RETURNS SETOF record +AS 'MODULE_PATHNAME' +LANGUAGE C; + +CREATE VIEW pg_stat_statements AS + SELECT userid, + dbid, + query, + calls, + total_time, + rows, + CASE WHEN shared_blks_hit + shared_blks_read > 0 + THEN 100.0 * (shared_blks_hit::float / (shared_blks_hit + shared_blks_read)) + ELSE 0 END AS shared_blks_hit_percent, + shared_blks_hit, + shared_blks_read, + shared_blks_dirtied, + shared_blks_written, + local_blks_hit, + local_blks_read, + local_blks_dirtied, + local_blks_written, + temp_blks_read, + temp_blks_written, + blk_read_time, + blk_write_time + FROM pg_stat_statements(); + +GRANT SELECT ON pg_stat_statements TO PUBLIC; diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql new file mode 100644 index 000..41dc16b --- /dev/null +++ b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql @@ -0,0 +1,65 @@ +/* contrib/pg_stat_statements/pg_stat_statements--1.2.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit + +-- Register functions. 
+CREATE FUNCTION pg_stat_statements_reset() +RETURNS void +AS 'MODULE_PATHNAME' +LANGUAGE C; + +CREATE FUNCTION pg_stat_statements( +OUT userid oid, +OUT dbid oid, +OUT query text, +OUT calls int8, +OUT total_time float8, +OUT rows int8, +OUT shared_blks_hit int8, +OUT shared_blks_read int8, +OUT shared_blks_dirtied int8, +OUT shared_blks_written int8, +OUT local_blks_hit int8, +OUT local_blks_read int8, +OUT local_blks_dirtied int8, +OUT local_blks_written int8, +OUT temp_blks_read int8, +OUT temp_blks_written int8, +OUT blk_read_time float8, +OUT blk_write_time float8 +) +RETURNS SETOF record +AS 'MODULE_PATHNAME' +LANGUAGE C; + +-- Register a view on the function for ease of use. +CREATE VIEW pg_stat_statements AS + SELECT userid, + dbid, + query, + calls, + total_time, + total_time / calls::float AS avg_time, + rows, + CASE WHEN shared_blks_hit + shared_blks_read > 0 + THEN 100.0 * (shared_blks_hit::float / (shared_blks_hit + shared_blks_read)) + ELSE 0 END AS shared_blks_hit_percent, + shared_blks_hit, + sh
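For reference, a usage sketch of the two columns the 1.2 view adds (assuming the patch above is installed and the extension has been updated with ALTER EXTENSION pg_stat_statements UPDATE; column rounding here is purely illustrative):

```sql
-- Sketch: top 5 statements by total time, using the avg_time and
-- shared_blks_hit_percent columns defined by the patched 1.2 view.
SELECT query,
       calls,
       round(avg_time::numeric, 3)                AS avg_ms,
       round(shared_blks_hit_percent::numeric, 2) AS hit_pct
  FROM pg_stat_statements
 ORDER BY total_time DESC
 LIMIT 5;
```

This replaces the hand-written `100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0)` expression users currently need.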
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/11/18 11:25), Claudio Freire wrote: On Sun, Nov 17, 2013 at 11:02 PM, KONDO Mitsumasa wrote: However, my patch is still in progress and needs more improvement. I am going to add a method of controlling readahead by GUC, so that users can freely select the readahead parameter in their transactions. Rather, I'd try to avoid fadvising consecutive or almost-consecutive blocks. Detecting that is hard at the block level, but maybe you can tie that detection into the planner, and specify a sequential strategy when the planner expects index-heap correlation? I think we had better develop these patches step by step, because it is difficult for readahead optimization to be complete from the very first patch. We need the framework first. Well, problem is, that without those smarts, I don't think this patch can be enabled by default. It will considerably hurt common use cases for postgres. Yes, I had thought the same as you: the default setting will be off (use normal readahead as before). The next version of my patch will do this. But I guess we'll have a better idea about that when we see how much of a performance impact it makes when you run those tests, so no need to guess in the dark. Yes, sure. Regards, -- Mitsumasa KONDO NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/11/15 13:48), Claudio Freire wrote: On Thu, Nov 14, 2013 at 11:13 PM, KONDO Mitsumasa: I use CentOS 6.4, whose kernel version is 2.6.32-358.23.2.el6.x86_64, in this test. That's close to the kernel version I was using, so you should see the same effect. OK. You proposed a readahead-maximizing patch; I think it benefits performance, and that part of your argument is really true. Your patch applies maximum readahead when a SQL statement selects an index range scan. Is that right? Ehm... sorta. I think that your patch assumes that pages are ordered by index data. No. It just knows which pages will be needed, and fadvises them. No guessing involved, except the guess that the scan will not be aborted. There's a heuristic to stop limited scans from attempting to fadvise, and that's that prefetch strategy is applied only from the Nth+ page walk. We may completely optimize kernel readahead in PostgreSQL in the future; however, it is very difficult, and it will take a long time to get there from the beginning. So I propose a GUC switch that users can set in their transactions (I will create this patch in this CF). If someone wants to turn readahead off to use the file cache more efficiently in his transactions, he can run "SET readahead = off". PostgreSQL is open source, and I think it will become clear which cases it is effective for once many people use it. It improves index-only scans the most, but I also attempted to handle heap prefetches. That's where the kernel started conspiring against me, because I used many naturally-clustered indexes, and THERE performance was adversely affected because of that kernel bug. I am also creating a gaussian-distributed pgbench now and will submit it to this CF. It will clarify, at least partially, which situations are effective. You may want to try your patch with more real workloads, and maybe you'll confirm what I found out last time I messed with posix_fadvise. 
If my experience is still relevant, those patterns will have suffered a severe performance penalty with this patch, because it will disable kernel read-ahead on sequential index access. It may still work for sequential heap scans, because the access strategy will tell the kernel to do read-ahead, but many other access methods will suffer. The decisive difference is that my patch uses the buffer hint control architecture, so it can control readahead more intelligently in some cases. Indeed, but it's not enough. See my above comment about naturally clustered indexes. The planner expects that, and plans accordingly. It will notice correlation between a PK and physical location, and will treat an index scan over PK to be almost sequential. With your patch, that assumption will be broken I believe. However, my patch is still in progress and needs more improvement. I am going to add a method of controlling readahead by GUC, so that users can freely select the readahead parameter in their transactions. Rather, I'd try to avoid fadvising consecutive or almost-consecutive blocks. Detecting that is hard at the block level, but maybe you can tie that detection into the planner, and specify a sequential strategy when the planner expects index-heap correlation? I think we had better develop these patches step by step, because it is difficult for readahead optimization to be complete from the very first patch. We need the framework first. Try OLAP-style queries. I have the DBT-3 (TPC-H) benchmark tools. If you don't like TPC-H, could you tell me a good OLAP benchmark tool? I don't really know. Skimming the specs, I'm not sure if those queries generate large index range queries. You could try, maybe with auto_explain? OK, I will. And I will also try simple large index range queries with the EXPLAIN command. 
Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2013/11/15 2:09), Fujii Masao wrote: Agreed. Could you tell me the reason you agree? I am sorry, but I suspect you don't understand this discussion well enough :-( Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2013/11/15 11:31), Peter Geoghegan wrote: On Thu, Nov 14, 2013 at 6:28 PM, KONDO Mitsumasa wrote: Just to confirm: does "this patch" mean my patch? I agree with you about not adding another lock implementation. It would add overhead. Yes, I referred to your patch. I don't want to go down this road, because aggregation and constructing a timeline feels like the job of another tool. I am concerned about local minima and maxima. Even with the ability to reset min/max independently, you can't do so for each entry individually. And this approach won't scale to a histogram or more sophisticated ways of characterizing distribution, particularly not multiplicatively for things other than execution time (blocks hit and so on). I think that pg_stat_statements is a lightweight monitoring tool, not a complete monitoring tool. This feature is very useful for everyone to gather statistics. If you'd like more detailed statistics, I suggest adding another monitoring tool, something like pg_stat_statements_full. It might be heavier, but it could collect more detailed information. Everyone would welcome such a feature. - that spinlock needs to be held for very little time indeed to preserve pg_stat_statements' current low overhead. I'd like to keep pg_stat_statements low overhead. My patch achieves that: I don't add new locks or complicated code. As I said above, lets figure out how to have your tool or a similar tool acquire snapshots inexpensively and frequently instead. We tried to solve this problem using our tool in the past. However, it is difficult without the log output method, i.e. the log_statement=all option. So we are trying to add a new feature to pg_stat_statements that would give DBAs useful statistics without that overhead. That is a discussion I'd be happy to have. IMO pg_stat_statements ought to be as cheap as possible, and do one thing well - aggregate fixed-unit costs cumulatively. I am also happy to have that discussion! 
I'd like to try your new patch and give you my comments. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/11/15 11:17), Peter Geoghegan wrote: On Thu, Nov 14, 2013 at 6:18 PM, KONDO Mitsumasa wrote: I will fix it. Could you tell me your Mac OS version and gcc version? I have only a MacBook Air with Mavericks (10.9). I have an idea that Mac OS X doesn't have posix_fadvise at all. Didn't you use the relevant macros so that the code at least builds on those platforms? Thank you for your nice advice, too. I will try to fix it with the proper macros. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2013/11/14 7:11), Peter Geoghegan wrote: On Wed, Oct 23, 2013 at 8:52 PM, Alvaro Herrera wrote: Hmm, now if we had portable atomic addition, so that we could spare the spinlock ... And adding a histogram or min/max for something like execution time isn't an approach that can be made to work for every existing cost tracked by pg_stat_statements. So, taking all that into consideration, I'm afraid this patch gets a -1 from me. Just to confirm: does "this patch" mean my patch? I agree with you about not adding another lock implementation. It would add overhead. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
(2013/11/15 2:03), Fujii Masao wrote: On Thu, Nov 14, 2013 at 9:09 PM, KONDO Mitsumasa wrote: Hi, I created a patch that improves disk reads and OS file caching. It can optimize the kernel readahead parameter using the buffer access strategy and posix_fadvise() in various disk-read situations. When I compiled the HEAD code with this patch on MacOS, I got the following error and warnings. gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -I../../../../src/include -c -o fd.o fd.c fd.c: In function 'BufferHintIOAdvise': fd.c:1182: error: 'POSIX_FADV_SEQUENTIAL' undeclared (first use in this function) fd.c:1182: error: (Each undeclared identifier is reported only once fd.c:1182: error: for each function it appears in.) fd.c:1185: error: 'POSIX_FADV_RANDOM' undeclared (first use in this function) make[4]: *** [fd.o] Error 1 make[3]: *** [file-recursive] Error 2 make[2]: *** [storage-recursive] Error 2 make[1]: *** [install-backend-recurse] Error 2 make: *** [install-src-recurse] Error 2 tablecmds.c:9120: warning: passing argument 5 of 'smgrread' makes pointer from integer without a cast bufmgr.c:455: warning: passing argument 5 of 'smgrread' from incompatible pointer type Thank you for your report! I will fix it. Could you tell me your Mac OS version and gcc version? I have only a MacBook Air with Mavericks (10.9). Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Optimize kernel readahead using buffer access strategy
Hi Claudio, (2013/11/14 22:53), Claudio Freire wrote: On Thu, Nov 14, 2013 at 9:09 AM, KONDO Mitsumasa wrote: I created a patch that improves disk reads and OS file caching. It can optimize the kernel readahead parameter using the buffer access strategy and posix_fadvise() in various disk-read situations. In a general OS, the readahead parameter is decided dynamically from the disk-read situation: if long disk reads happen, the readahead window grows. However, since it is based on an experience-driven heuristic algorithm, it causes wasted disk reads and throws out useful OS file caches in some cases, which is very bad for disk-read performance. It would be relevant to know which kernel did you use for those tests. I use CentOS 6.4, whose kernel version is 2.6.32-358.23.2.el6.x86_64, in this test. A while back, I tried to use posix_fadvise to prefetch index pages. I searched for your past work. Are you talking about this ML thread? Or is there a more recent discussion? I see that your patch is interesting, but it wasn't submitted to a CF and the discussion stopped. http://www.postgresql.org/message-id/CAGTBQpZzf70n0PYJ=VQLd+jb3wJGo=2txmy+skjd6g_vjc5...@mail.gmail.com I ended up finding out that interleaving posix_fadvise with I/O like that severely hinders (ie: completely disables) the kernel's read-ahead algorithm. Your patch applies maximum readahead when a SQL statement selects an index range scan. Is that right? I think that your patch assumes that pages are ordered by index data. This assumption is partially wrong. If your assumption were true, we would not need the CLUSTER command. In actuality, the CLUSTER command performs better than doing nothing. How exactly did you set up those benchmarks? pg_bench defaults? My detailed test settings are as follows. * Server info CPU: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (2U/12C) RAM: 6GB -> I reduced it intentionally via an OS parameter, because large-memory tests take a long time. HDD: SEAGATE Model: ST2000NM0001 @ 7200rpm * 1 RAID: none. 
* postgresql.conf(summarized) shared_buffers = 600MB (10% of RAM = 6GB) work_mem = 1MB maintenance_work_mem = 64MB wal_level = archive fsync = on archive_mode = on checkpoint_segments = 300 checkpoint_timeout = 15min checkpoint_completion_target = 0.7 * pgbench settings pgbench -j 4 -c 32 -T 600 pgbench pg_bench does not exercise heavy sequential access patterns, or long index scans. It performs many single-page index lookups per transaction and that's it. Yes, your argument is right. And it is also a fact that performance becomes better in these situations. You may want to try your patch with more real workloads, and maybe you'll confirm what I found out last time I messed with posix_fadvise. If my experience is still relevant, those patterns will have suffered a severe performance penalty with this patch, because it will disable kernel read-ahead on sequential index access. It may still work for sequential heap scans, because the access strategy will tell the kernel to do read-ahead, but many other access methods will suffer. The decisive difference is that my patch uses the buffer hint control architecture, so it can control readahead more intelligently in some cases. However, my patch is still in progress and needs more improvement. I am going to add a method of controlling readahead by GUC, so that users can freely select the readahead parameter in their transactions. Try OLAP-style queries. I have the DBT-3 (TPC-H) benchmark tools. If you don't like TPC-H, could you tell me a good OLAP benchmark tool? Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Oh! Sorry... I forgot to attach my latest patch. Regards, -- Mitsumasa KONDO NTT Open Source Software Center diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql new file mode 100644 index 000..929d623 --- /dev/null +++ b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql @@ -0,0 +1,51 @@ +/* contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql */ + +-- complain if script is sourced in psql, rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. \quit + +/* First we have to remove them from the extension */ +ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements; +ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements(); + +/* Then we can drop them */ +DROP VIEW pg_stat_statements; +DROP FUNCTION pg_stat_statements(); + +/* Now redefine */ +CREATE FUNCTION pg_stat_statements( +OUT userid oid, +OUT dbid oid, +OUT query text, +OUT calls int8, +OUT total_time float8, +OUT min_time float8, +OUT max_time float8, +OUT stddev_time float8, +OUT rows int8, +OUT shared_blks_hit int8, +OUT shared_blks_read int8, +OUT shared_blks_dirtied int8, +OUT shared_blks_written int8, +OUT local_blks_hit int8, +OUT local_blks_read int8, +OUT local_blks_dirtied int8, +OUT local_blks_written int8, +OUT temp_blks_read int8, +OUT temp_blks_written int8, +OUT blk_read_time float8, +OUT blk_write_time float8 +) +RETURNS SETOF record +AS 'MODULE_PATHNAME' +LANGUAGE C; + +CREATE VIEW pg_stat_statements AS + SELECT * FROM pg_stat_statements(); + +GRANT SELECT ON pg_stat_statements TO PUBLIC; + +/* New Function */ +CREATE FUNCTION pg_stat_statements_reset_time() +RETURNS void +AS 'MODULE_PATHNAME' +LANGUAGE C; diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql new file mode 100644 index 000..d590acc --- /dev/null +++ 
b/contrib/pg_stat_statements/pg_stat_statements--1.2.sql @@ -0,0 +1,52 @@ +/* contrib/pg_stat_statements/pg_stat_statements--1.2.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit + +-- Register functions. +CREATE FUNCTION pg_stat_statements_reset() +RETURNS void +AS 'MODULE_PATHNAME' +LANGUAGE C; + +CREATE FUNCTION pg_stat_statements_reset_time() +RETURNS void +AS 'MODULE_PATHNAME' +LANGUAGE C; + +CREATE FUNCTION pg_stat_statements( +OUT userid oid, +OUT dbid oid, +OUT query text, +OUT calls int8, +OUT total_time float8, +OUT min_time float8, +OUT max_time float8, +OUT stddev_time float8, +OUT rows int8, +OUT shared_blks_hit int8, +OUT shared_blks_read int8, +OUT shared_blks_dirtied int8, +OUT shared_blks_written int8, +OUT local_blks_hit int8, +OUT local_blks_read int8, +OUT local_blks_dirtied int8, +OUT local_blks_written int8, +OUT temp_blks_read int8, +OUT temp_blks_written int8, +OUT blk_read_time float8, +OUT blk_write_time float8 +) +RETURNS SETOF record +AS 'MODULE_PATHNAME' +LANGUAGE C; + +-- Register a view on the function for ease of use. +CREATE VIEW pg_stat_statements AS + SELECT * FROM pg_stat_statements(); + +GRANT SELECT ON pg_stat_statements TO PUBLIC; + +-- Don't want this to be available to non-superusers. 
+REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC; +REVOKE ALL ON FUNCTION pg_stat_statements_reset_time() FROM PUBLIC; diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile index e8aed61..5c63940 100644 --- a/contrib/pg_stat_statements/Makefile +++ b/contrib/pg_stat_statements/Makefile @@ -4,8 +4,10 @@ MODULE_big = pg_stat_statements OBJS = pg_stat_statements.o EXTENSION = pg_stat_statements -DATA = pg_stat_statements--1.1.sql pg_stat_statements--1.0--1.1.sql \ - pg_stat_statements--unpackaged--1.0.sql +DATA = pg_stat_statements--1.2.sql \ + pg_stat_statements--1.0--1.1.sql \ + pg_stat_statements--1.1--1.2.sql \ + pg_stat_statements--unpackaged--1.0.sql ifdef USE_PGXS PG_CONFIG = pg_config diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql index 5be281e..5662273 100644 --- a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql +++ b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql @@ -1,7 +1,7 @@ /* contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql */ -- complain if script is sourced in psql, rather than via ALTER EXTENSION -\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.1'" to load this file. \quit +\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. \quit /* First we have to remove them from the extension */ ALTER EXTENSION pg_stat_statements DROP VIEW pg_sta
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2013/10/21 20:17), KONDO Mitsumasa wrote: > (2013/10/18 22:21), Andrew Dunstan wrote: >> If we're going to extend pg_stat_statements, even more than min and max >> I'd like to see the standard deviation in execution time. > OK, I will! I am making some other patches, so please wait a bit longer! I added stddev_time and fixed some sources. The psql result of my latest patch is as follows.

userid              | 10
dbid                | 16384
query               | UPDATE pgbench_tellers SET tbalance = tbalance + ? WHERE tid = ?;
calls               | 74746
total_time          | 1094.291998
min_time            | 0.007
max_time            | 15.091
stddev_time         | 0.100439187720684
rows                | 74746
shared_blks_hit     | 302346
shared_blks_read    | 6
shared_blks_dirtied | 161
shared_blks_written | 0
local_blks_hit      | 0
local_blks_read     | 0
local_blks_dirtied  | 0
local_blks_written  | 0
temp_blks_read      | 0
temp_blks_written   | 0
blk_read_time       | 0
blk_write_time      | 0

I haven't thought much about the order of the columns in this view; if you have any ideas, please tell me. And thanks for the many comments and the discussion; I am going to refer to them not only for this patch but also for the development of pg_statsinfo and pg_stats_reporter :-) Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
Hi All, (2013/10/22 22:26), Stephen Frost wrote: * Dimitri Fontaine (dimi...@2ndquadrant.fr) wrote: In our case, what I keep experiencing with tuning queries is that we have like 99% of them running under acceptable threshold and 1% of them taking more and more time. This is usually described (at least where I come from) as 'rare events', which goes to Tom's point that averages, stddev, etc, are not ideal (though they are still better than nothing). good reason to suppose that query runtime is Gaussian? (I'd bet not; in particular, multimodal behavior seems very likely due to things like plan changes.) If not, how much does that affect the usefulness of a standard-deviation calculation? Oscillating plan changes may fit multimodal but I don't feel that's typical. My experience has been it's either an extremely rare plan difference or it's a shift from one plan to another over time. After all, avg, min, max, and stddev are only numerical values for a predictive model. They don't have the robustness and strictness of something like Write-Ahead Logging; they resemble a weather forecast. But they are still better than nothing. Human judgment is ultimately needed to infer a cause from the numerical values. By the way, we can estimate the probability of a value from the stddev, so even if there is an extreme value in min/max, we can easily judge whether it is normal or not. What I've been gathering from my quick chat this morning is that either you know how to characterize the distribution and then the min max and average are useful on their own, or you need to keep track of an histogram where all the bins are of the same size to be able to learn what the distribution actually is. A histogram would certainly be useful. We may also wish to look into outlier/rare event detection methods and increase the logging we do in those cases (if possible). Of course keeping enough stats seems to always begin with keeping the min, max and average, so we can just begin there. 
We would just be unable to answer interesting questions with just that. It would probably be good to do some research into techniques for outlier detection which minimizes CPU and storage cost. pg_stat_statements is often used on production database systems, so I don't like a high-CPU-usage implementation. Software that lowers performance merely by being installed is too unpleasant to accept. And if we need more detailed information for SQL tuning, it would be better to develop another useful performance tuning and monitoring contrib module that is not intended for use on production database systems. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/22 12:52), Fujii Masao wrote: On Tue, Oct 22, 2013 at 12:47 PM, Amit Kapila wrote: On Mon, Oct 21, 2013 at 4:40 PM, KONDO Mitsumasa wrote: (2013/10/19 14:58), Amit Kapila wrote: On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa wrote: In general, my thinking is that we should prefer compression to reduce IO (WAL volume), because reducing WAL volume has other benefits as well like sending it to subscriber nodes. I think it will help cases where due to less n/w bandwidth, the disk allocated for WAL becomes full due to high traffic on master and then users need some alternative methods to handle such situations. Are you talking about archived WAL files? One of the points I am talking about is sending data over the network to subscriber nodes for streaming replication, and another is WAL in pg_xlog. Both scenarios benefit if there is less WAL volume. We can easily reduce that volume by adding a compression command to the copy command in archive_command. Okay. I think many users would like to use a method which can reduce WAL volume, and the users who don't find it useful enough in their environments, due to a decrease in TPS or no significant reduction in WAL, have the option to disable it. I favor selecting a compression algorithm for higher performance. If we need to compress WAL files more, in spite of worse performance, we can change the archive copy command to a high-compression algorithm and document how to compress archived WAL files in archive_command. Is that wrong? No, it is not wrong, but there are scenarios as mentioned above where less WAL volume can be beneficial. In fact, many NoSQL systems use snappy for the purpose of higher performance. Okay, you can also check the results with the snappy algorithm, but don't just rely completely on snappy for this patch; you might want to think of another alternative for this patch. 
So, our consensus is to introduce the hooks for FPW compression so that users can freely select their own best compression algorithm? Yes, it will also be good for future improvement. But I think WAL compression for a disaster recovery system should be done in the walsender and walreceiver processes; that is the proper architecture for a DR system. A higher-compression-ratio, higher-CPU-usage algorithm in FPW might hurt performance on the master server. If we can set the compression algorithm in walsender and walreceiver, performance stays the same as before or better, and WAL-send performance will improve. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Add min and max execute statement time in pg_stat_statement
(2013/10/18 22:21), Andrew Dunstan wrote: > If we're going to extend pg_stat_statements, even more than min and max > I'd like to see the standard deviation in execution time. OK, I will! I am making some other patches, so please wait a bit longer! Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/19 14:58), Amit Kapila wrote: > On Tue, Oct 15, 2013 at 11:41 AM, KONDO Mitsumasa > wrote: > I think in general also snappy is mostly preferred for it's low CPU > usage not for compression, but overall my vote is also for snappy. I think low CPU usage is the most important factor in WAL compression. That is because WAL writes are sequential writes, so a small improvement in compression ratio cannot change PostgreSQL's performance, especially with a RAID card with a writeback cache. Furthermore, PG executes this in a single process, so a high-CPU-usage compression algorithm will cause worse performance. >> I found a compression algorithm test in HBase. I haven't read the details, but it >> indicates the snappy algorithm gets the best performance. >> http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of > > The dataset used for performance is quite different from the data > which we are talking about here (WAL). > "These are the scores for a data which consist of 700kB rows, each > containing a binary image data. They probably won’t apply to things > like numeric or text data." Yes, you are right. We need to test compression algorithms on WAL writes. >> I think it is necessary to make best efforts in community than I do the best >> choice with strict test. > > Sure, it is good to make effort to select the best algorithm, but if > you are combining this patch with inclusion of new compression > algorithm in PG, it can only make the patch to take much longer time. I think that if our direction is clearly decided, it will be easy to make the patch. The compression patch's direction hasn't become clear yet; it could become a troublesome patch, like the sync-rep patch. > In general, my thinking is that we should prefer compression to reduce > IO (WAL volume), because reducing WAL volume has other benefits as > well like sending it to subscriber nodes. 
I think it will help cases > where due to less n/w bandwidth, the disk allocated for WAL becomes > full due to high traffic on master and then users need some > alternative methods to handle such situations. Are you talking about archived WAL files? We can easily reduce that volume by adding a compression command to the copy command in archive_command. > I think many users would like to use a method which can reduce WAL > volume and the users which don't find it enough useful in their > environments due to decrease in TPS or not significant reduction in > WAL have the option to disable it. I favor selecting a compression algorithm for higher performance. If we need to compress WAL files more, in spite of worse performance, we can change the archive copy command to a high-compression algorithm and add documentation on how to compress archived WAL files in archive_command. Is that wrong? In fact, many NoSQL systems use snappy for the purpose of higher performance. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
Re: [HACKERS] Who is pgFoundery administrator?
(2013/10/02 18:57), Michael Paquier wrote: wrote: Who is the pgFoundry administrator or a board member now? I would like to send them e-mail. At the least, there is no information or support page on the pgFoundry homepage. Why don't you consider github as a potential solution? Because github hosts such a wide variety of software, it is hard to find good PostgreSQL tools there :-( In contrast, pgFoundry makes it easy to find good tools for PostgreSQL :-) When we search for good PG tools on Google, it can be difficult to find our tools, because we don't do search-engine keyword optimization. I hope pgFoundry will become more robust and popular. It is still needed. Regards, -- Mitsumasa KONDO NTT Open Source Software Center
[HACKERS] Improvement of pg_stat_statements usage: buffer hit ratio
Hi,

I am submitting a patch in CF3 that improves pg_stat_statements usability. In pg_stat_statements, the buffer hit ratio is a very important value, but it is difficult to calculate and requires complicated SQL. This patch makes both the usage and the documentation simpler:

> -bench=# SELECT query, calls, total_time, rows, 100.0 * shared_blks_hit /
> -               nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
> +bench=# SELECT query, calls, total_time, rows, shared_blks_hit_percent
>            FROM pg_stat_statements ORDER BY total_time DESC LIMIT 5;

It becomes very simple :-)

This patch conflicts with the pg_stat_statement_min_max_exectime patch that I also submitted, which adds the new columns min_time and max_time. So I would like to adjust that patch as well, taking this opportunity.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index e8aed61..5c63940 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -4,8 +4,10 @@ MODULE_big = pg_stat_statements
 OBJS = pg_stat_statements.o

 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.1.sql pg_stat_statements--1.0--1.1.sql \
-	pg_stat_statements--unpackaged--1.0.sql
+DATA = pg_stat_statements--1.2.sql \
+	pg_stat_statements--1.0--1.1.sql \
+	pg_stat_statements--1.1--1.2.sql \
+	pg_stat_statements--unpackaged--1.0.sql

 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
index 5be281e..5662273 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
@@ -1,7 +1,7 @@
 /* contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql */

 -- complain if script is sourced in psql, rather than via ALTER EXTENSION
-\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.1'" to load this file. \quit
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. \quit

 /* First we have to remove them from the extension */
 ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
new file mode 100644
index 000..f0a8e0f
--- /dev/null
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql
@@ -0,0 +1,63 @@
+/* contrib/pg_stat_statements/pg_stat_statements--1.1--1.2.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. \quit
+
+/* First we have to remove them from the extension */
+ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
+ALTER EXTENSION pg_stat_statements DROP FUNCTION pg_stat_statements();
+
+/* Then we can drop them */
+DROP VIEW pg_stat_statements;
+DROP FUNCTION pg_stat_statements();
+
+/* Now redefine */
+CREATE FUNCTION pg_stat_statements(
+    OUT userid oid,
+    OUT dbid oid,
+    OUT query text,
+    OUT calls int8,
+    OUT total_time float8,
+    OUT rows int8,
+    OUT shared_blks_hit int8,
+    OUT shared_blks_read int8,
+    OUT shared_blks_dirtied int8,
+    OUT shared_blks_written int8,
+    OUT local_blks_hit int8,
+    OUT local_blks_read int8,
+    OUT local_blks_dirtied int8,
+    OUT local_blks_written int8,
+    OUT temp_blks_read int8,
+    OUT temp_blks_written int8,
+    OUT blk_read_time float8,
+    OUT blk_write_time float8
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE VIEW pg_stat_statements AS
+  SELECT userid,
+         dbid,
+         query,
+         calls,
+         total_time,
+         rows,
+         CASE WHEN shared_blks_hit + shared_blks_read > 0
+              THEN 100.0 * (shared_blks_hit::float / (shared_blks_hit + shared_blks_read))
+              ELSE 0 END AS shared_blks_hit_percent,
+         shared_blks_hit,
+         shared_blks_read,
+         shared_blks_dirtied,
+         shared_blks_written,
+         local_blks_hit,
+         local_blks_read,
+         local_blks_dirtied,
+         local_blks_written,
+         temp_blks_read,
+         temp_blks_written,
+         blk_read_time,
+         blk_write_time
+  FROM pg_stat_statements();
+
+GRANT SELECT ON pg_stat_statements TO PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1.sql
deleted file mode 100644
index 42e4d68..000
--- a/contrib/pg_stat_statements/pg_stat_statements--1.1.sql
+++ /dev/null
@@ -1,43 +0,0 @@
-/* contrib/pg_stat_statements/pg_stat_statements--1.1.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit
-
--- Register functions.
-CREATE FUNCTION pg_stat_statement
[HACKERS] Add min and max statement execution time in pg_stat_statements
I am submitting a patch for the next CF that adds minimum and maximum statement execution times to pg_stat_statements. pg_stat_statements already reports execution time, but only as an average, which does not provide much detail. So I add min and max execution time columns to pg_stat_statements. Usage is almost the same as before; however, I also add a pg_stat_statements_reset_time() function so that min_time and max_time can be measured over a specific period. It resets (re-initializes) the previously recorded min and max execution times.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index e8aed61..5c63940 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -4,8 +4,10 @@ MODULE_big = pg_stat_statements
 OBJS = pg_stat_statements.o

 EXTENSION = pg_stat_statements
-DATA = pg_stat_statements--1.1.sql pg_stat_statements--1.0--1.1.sql \
-	pg_stat_statements--unpackaged--1.0.sql
+DATA = pg_stat_statements--1.2.sql \
+	pg_stat_statements--1.0--1.1.sql \
+	pg_stat_statements--1.1--1.2.sql \
+	pg_stat_statements--unpackaged--1.0.sql

 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
index 5be281e..5662273 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql
@@ -1,7 +1,7 @@
 /* contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql */

 -- complain if script is sourced in psql, rather than via ALTER EXTENSION
-\echo Use "ALTER EXTENSION pg_stat_statements UPDATE TO '1.1'" to load this file. \quit
+\echo Use "ALTER EXTENSION pg_stat_statements UPDATE" to load this file. \quit

 /* First we have to remove them from the extension */
 ALTER EXTENSION pg_stat_statements DROP VIEW pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--1.1.sql b/contrib/pg_stat_statements/pg_stat_statements--1.1.sql
deleted file mode 100644
index 42e4d68..000
--- a/contrib/pg_stat_statements/pg_stat_statements--1.1.sql
+++ /dev/null
@@ -1,43 +0,0 @@
-/* contrib/pg_stat_statements/pg_stat_statements--1.1.sql */
-
--- complain if script is sourced in psql, rather than via CREATE EXTENSION
-\echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit
-
--- Register functions.
-CREATE FUNCTION pg_stat_statements_reset()
-RETURNS void
-AS 'MODULE_PATHNAME'
-LANGUAGE C;
-
-CREATE FUNCTION pg_stat_statements(
-    OUT userid oid,
-    OUT dbid oid,
-    OUT query text,
-    OUT calls int8,
-    OUT total_time float8,
-    OUT rows int8,
-    OUT shared_blks_hit int8,
-    OUT shared_blks_read int8,
-    OUT shared_blks_dirtied int8,
-    OUT shared_blks_written int8,
-    OUT local_blks_hit int8,
-    OUT local_blks_read int8,
-    OUT local_blks_dirtied int8,
-    OUT local_blks_written int8,
-    OUT temp_blks_read int8,
-    OUT temp_blks_written int8,
-    OUT blk_read_time float8,
-    OUT blk_write_time float8
-)
-RETURNS SETOF record
-AS 'MODULE_PATHNAME'
-LANGUAGE C;
-
--- Register a view on the function for ease of use.
-CREATE VIEW pg_stat_statements AS
-  SELECT * FROM pg_stat_statements();
-
-GRANT SELECT ON pg_stat_statements TO PUBLIC;
-
--- Don't want this to be available to non-superusers.
-REVOKE ALL ON FUNCTION pg_stat_statements_reset() FROM PUBLIC;
diff --git a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql
index e84a3cb..0addba0 100644
--- a/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql
+++ b/contrib/pg_stat_statements/pg_stat_statements--unpackaged--1.0.sql
@@ -4,5 +4,6 @@
 \echo Use "CREATE EXTENSION pg_stat_statements" to load this file. \quit

 ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset();
+ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements_reset_time();
 ALTER EXTENSION pg_stat_statements ADD function pg_stat_statements();
 ALTER EXTENSION pg_stat_statements ADD view pg_stat_statements;
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index ea930af..8f9b641 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -76,6 +76,7 @@ static const uint32 PGSS_FILE_HEADER = 0x20120328;
 #define USAGE_DECREASE_FACTOR	(0.99)	/* decreased every entry_dealloc */
 #define STICKY_DECREASE_FACTOR	(0.50)	/* factor for sticky entries */
 #define USAGE_DEALLOC_PERCENT	5		/* free this % of entries at once */
+#define EXEC_TIME_INIT			(-1)	/* initial execution time */

 #define JUMBLE_SIZE				1024	/* query serialization buffer size */

@@ -102,6 +103,8 @@ typedef struct Counters
 {
 	int64		calls;			/* # of times executed */
 	double		total_time;		/* total execution time, in msec */
+	double		min
Re: [HACKERS] Compression of full-page-writes
(2013/10/15 22:01), k...@rice.edu wrote:
> Google's lz4 is also a very nice algorithm with 33% better compression
> performance than snappy and 2X the decompression performance in some
> benchmarks also with a bsd license:
> https://code.google.com/p/lz4/
If we judged on performance alone, we would select lz4. However, we should also weigh other important factors: software robustness, track record, bug-fix history, and so on. If unknown bugs turn up, can we fix them or improve the algorithm? That seems very difficult, because we would only be using it without understanding its internals. Therefore, I think we had better select software that is robust and has a larger user base.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/15 13:33), Amit Kapila wrote:
> Snappy is good mainly for un-compressible data, see the link below:
> http://www.postgresql.org/message-id/CAAZKuFZCOCHsswQM60ioDO_hk12tA7OG3YcJA8v=4yebmoa...@mail.gmail.com
That result was obtained on an ARM architecture, which is not a typical server CPU. Please see this detailed discussion:
http://www.reddit.com/r/programming/comments/1aim6s/lz4_extremely_fast_compression_algorithm/c8y0ew9

I also found a compression algorithm comparison done for HBase. I have not read it in detail, but it indicates that snappy gets the best performance:
http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of
In fact, most modern NoSQL stores use snappy, because it has good performance and a good license (BSD).

It is a bit difficult to prove that any one algorithm is best for all kinds of loads; I think making best efforts in the community matters more than my making the single best choice through strict testing.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Release note fix for timeline item
Sorry for my late reply...

(2013/10/08 23:26), Bruce Momjian wrote:
> First, I want to apologize for not completing the release notes earlier
> so that others could review them.  I started working on the release
> notes on Friday, but my unfamiliarity with the process and fear of
> making a mistake caused many delays.  I have improved the documentation
> on the process which will hopefully help next time.
There isn't anything in particular that I was dissatisfied about.

> You are right that there is alot of details skipped in the release
> note text.  I have developed the attached patch which I think does a
> better job.  Is it OK?
Yes, of course! Thank you for your sincere action!

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/13 0:14), Amit Kapila wrote:
> On Fri, Oct 11, 2013 at 10:36 PM, Andres Freund wrote:
>> But maybe pglz is just not a good fit for this, it really isn't a very
>> good algorithm in this day and age.
+1. To avoid a CPU bottleneck, we need a compression algorithm faster than pglz, which is a general-purpose algorithm. pglz does not perform well; it is something of a fossil among compression algorithms. So we should switch to a modern algorithm for a better future.

> Do you think that if WAL reduction or performance with other
> compression algorithm (for ex. snappy) is better, then chances of
> getting the new compression algorithm in postgresql will be more?
Recent papers on compression algorithms (including snappy) have indicated as much. I think that is enough basis for selecting an algorithm, and it could be good work for postgres as well.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/08 20:13), Haribabu kommi wrote:
> I chosen the sync_commit=off mode because it generates more tps, thus
> it increases the volume of WAL.
I had not thought of that. Sorry...

> I will test with sync_commit=on mode and provide the test results.
OK. Thanks!
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
Hi,

I tested the dbt-2 benchmark on a single instance and with synchronous replication. Unfortunately, my benchmark results show little difference...

* Test server
  Server: HP Proliant DL360 G7
  CPU: Xeon E5640 2.66GHz (1P/4C)
  Memory: 18GB (PC3-10600R-9)
  Disk: 146GB (15k rpm) * 4, RAID 1+0
  RAID controller: P410i/256MB

* Result

** Single instance **
           |  NOTPM  |  90%tile  | Average | S.Deviation
-----------+---------+-----------+---------+------------
no-patched | 3322.93 | 20.469071 |   5.882 |     10.478
patched    | 3315.42 | 19.086105 |   5.669 |      9.108

** Synchronous Replication **
           |  NOTPM  |  90%tile  | Average | S.Deviation
-----------+---------+-----------+---------+------------
no-patched | 3275.55 | 21.332866 |   6.072 |      9.882
patched    | 3318.82 | 18.141807 |   5.757 |      9.829

** Detail of result
http://pgstatsinfo.projects.pgfoundry.org/DBT-2_Fujii_patch/

I set full_page_writes = compress with Fujii's patch in DBT-2, but it does not seem to be effective at reducing the number of WAL files. I will run the DBT-2 benchmark once more, and also try plain pgbench on my test server.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/10/08 17:33), Haribabu kommi wrote:
> The checkpoint_timeout and checkpoint_segments are increased to make
> sure no checkpoint happens during the test run.
With your workload, checkpoints still occur easily even at checkpoint_segments = 256. I don't know the number of disks in your test server, but on my test server, which has 4 magnetic disks (15k rpm), postgres generates 50 - 100 WAL segments per minute. And I cannot understand your sync_commit = off setting: it tends to cause a CPU bottleneck and data loss, and it is not typical of database usage. Therefore, your test is not a fair comparison for Fujii's patch.

Going back to my DBT-2 benchmark: I have not gotten good performance (almost the same as unpatched), so I am now checking the hunks, my settings, and whether something is wrong in Fujii's patch. I am going to send test results tonight.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Who is pgFoundry administrator?
(2013/10/02 17:37), KONDO Mitsumasa wrote:
> I want to submit a new project to pgFoundry.
Our new project was approved yesterday! Thanks very much to the pgFoundry crew.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
[HACKERS] Who is pgFoundry administrator?
Hi,

I want to register a new project on pgFoundry: a WAL archive copy tool that uses direct I/O. I submitted it through the pgFoundry homepage 2 weeks ago, but it has not been approved, and I have received no response at all :-(

Who is the pgFoundry administrator or a board member now? I would like to send them e-mail. At the least, the pgFoundry homepage has no contact or support page.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
(2013/09/30 13:55), Amit Kapila wrote:
> On Mon, Sep 30, 2013 at 10:04 AM, Fujii Masao wrote:
>> Yep, please! It's really helpful!
OK! I will test with both a single instance and a synchronous replication configuration.

By the way, you posted a patch for a sync_file_range() WAL writing method 3 years ago. I think it is also good for performance. The reason: reading sync_file_range() and fdatasync() in the latest linux kernel code (3.9.11), fdatasync() writes out the dirty buffers of the whole file, whereas sync_file_range() writes out only the dirty buffers of the given range. In more detail, both end up in the same kernel function: fdatasync() is vfs_fsync_range(file, 0, LLONG_MAX, 1), while sync_file_range() is vfs_fsync_range(file, offset, amount, 1). It is obvious which is more efficient for WAL writing. You can confirm it in the linux kernel's git; I think it will deepen your conviction.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/sync.c?id=refs/tags/v3.11.2

> I think it will be useful if you can get the data for 1 and 2 threads
> (may be with pgbench itself) as well, because the WAL reduction is
> almost sure, but the only thing is that it should not dip tps in some
> of the scenarios.
That's right. I also want to see this patch on an MD environment, because MD is strong at sequential writes, which is exactly the WAL write pattern.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] Compression of full-page-writes
Hi Fujii-san,

(2013/09/30 12:49), Fujii Masao wrote:
> On second thought, the patch could compress WAL very much because I
> used pgbench. I will do the same measurement by using another
> benchmark.
If you like, I can test this patch with the DBT-2 benchmark at the end of this week. I will use the following test server:

* Test server
  Server: HP Proliant DL360 G7
  CPU: Xeon E5640 2.66GHz (1P/4C)
  Memory: 18GB (PC3-10600R-9)
  Disk: 146GB (15k rpm) * 4, RAID 1+0
  RAID controller: P410i/256MB

This is the PG-REX test server, as you know.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
(2013/09/27 5:29), Peter Eisentraut wrote:
> This patch no longer applies.
I will rework this patch for the next commitfest. If you have a good idea, please send it to me!

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Sorry for my delayed reply. I was on vacation last week and replied from gmail, but the post stalled on its way to pgsql-hackers :-(

(2013/09/21 7:54), Fabien COELHO wrote:
> However this pattern induces stronger cache effects which are maybe
> not too realistic, because neighboring keys in the middle are more
> likely to be chosen.
I think your opinion is right. However, this is in effect a pseudo-benchmark, so I think such a simple mechanism is also useful.

> Have you considered adding a "randomization" layer, that is once you
> have a key in [1 .. n] centered around n/2, then you perform a
> pseudo-random transformation into the same domain so that key values
> are scattered over the whole domain?
Yes, I have considered that for this patch as well. It can be realized by adding a linear mapping array created by a random generator. However, the current erand48 algorithm is not very accurate and is something of a fossil, so I do not know whether it would work well. To realize it, we may need a more accurate random generator algorithm such as Mersenne Twister.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] gaussian distribution pgbench
Sorry for my delayed reply. I was on vacation last week and replied from gmail, but the post stalled on its way to pgsql-hackers :-(

(2013/09/21 6:05), Kevin Grittner wrote:
> You had accidentally added to the CF In Progress.
Oh, I had completely mistaken the CF schedule :-) Maybe Horiguchi-san is in the same situation... However, thanks to your moving it, I have become the first submitter in the next CF. Thank you for moving it!
--
Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] 9.4 regression
(2013/09/05 0:04), Andres Freund wrote:
> I'd vote for adding zeroing *after* the fallocate() first.
+1, with the FALLOC_FL_KEEP_SIZE flag. At least in my in-development sorted checkpoint, fallocate with FALLOC_FL_KEEP_SIZE is faster than doing nothing. I applied it to relation files, though, so I don't know about WAL files. With FALLOC_FL_KEEP_SIZE, fallocate does not change the file size and does not zero-fill; it seems to only reserve contiguous physical disk space in the file system. Here is a draft patch.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dc47c47..85b582c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3409,6 +3409,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));

+#if defined(HAVE_FALLOCATE)
+	fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, XLogSegSize);
+#endif
+
 	/*
 	 * Zero-fill the file. We have to do this the hard way to ensure that all
 	 * the file space has really been allocated --- on platforms that allow
[HACKERS] Add pgbench option: CHECKPOINT before starting benchmark
Hi,

I have added a checkpoint option to pgbench. pgbench is a simple and useful benchmark for every user. However, benchmark results change greatly with circumstances: whether a checkpoint is executing, the number of dirty buffers in shared_buffers, and so on. Because of this, it is customary to run a checkpoint before starting a benchmark, but writing the wrapper script takes time, as in the following:

  psql -U postgres -d pgbench -p5432 -c "CHECKPOINT"
  pgbench -T 600 -c 12 -j4 -U postgres -d pgbench -p 5432

This script also has a problem: it executes "CHECKPOINT" -> "VACUUM" -> "start benchmark". If relations have many dirty pages, VACUUM generates dirty buffers in shared_buffers, which will cause a heavy checkpoint during the run. I think pgbench should be an easier and more accurate benchmark tool for everyone, so I run a checkpoint just before starting the benchmark. This patch's output looks like:

-------------------------------------------------------
[mitsu-ko@localhost pgbench]$ ./pgbench
starting vacuum...end.
starting checkpoint...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/10
tps = 312.851958 (including connections establishing)
tps = 364.524478 (excluding connections establishing)
-------------------------------------------------------

It executes "VACUUM" -> "CHECKPOINT" -> "start benchmark", which I think is the ideal order for a more accurate benchmark. My patch's option definition is:

[mitsu-ko@localhost pgbench]$ ./pgbench --help
~
 -N, --no-checkpoint      do not run CHECKPOINT after initialization
~

In the latest committed pgbench, -N is "--skip-some-updates  skip updates of pgbench_tellers and pgbench_branches". I cannot understand why -N was chosen for that option, so I moved it to -u and made -N the "do not run CHECKPOINT" option. What do you think?
--
Mitsumasa KONDO
NTT Open Source Software Center

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index ad8e272..523c278 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -347,6 +347,7 @@ usage(void)
 		   "  -i, --initialize         invokes initialization mode\n"
 		   "  -F, --fillfactor=NUM     set fill factor\n"
 		   "  -n, --no-vacuum          do not run VACUUM after initialization\n"
+		   "  -N, --no-checkpoint      do not run CHECKPOINT after initialization\n"
 		   "  -q, --quiet              quiet logging (one message each 5 seconds)\n"
 		   "  -s, --scale=NUM          scaling factor\n"
 		   "  --foreign-keys           create foreign key constraints between tables\n"
@@ -366,7 +367,8 @@ usage(void)
 		   "                           protocol for submitting queries "
 		   "(default: simple)\n"
 		   "  -n, --no-vacuum          do not run VACUUM before tests\n"
-		   "  -N, --skip-some-updates  skip updates of pgbench_tellers and pgbench_branches\n"
+		   "  -N, --no-checkpoint      do not run CHECKPOINT before tests\n"
+		   "  -u, --skip-some-updates  skip updates of pgbench_tellers and pgbench_branches\n"
 		   "  -P, --progress=NUM       show thread progress report every NUM seconds\n"
 		   "  -r, --report-latencies   report average latency per command\n"
 		   "  -R, --rate=SPEC          target rate in transactions per second\n"
@@ -1520,7 +1522,7 @@ disconnect_all(CState *state, int length)

 /* create tables and setup data */
 static void
-init(bool is_no_vacuum)
+init(bool is_no_vacuum, bool is_no_checkpoint)
 {
 /*
  * The scale factor at/beyond which 32bit integers are incapable of storing
  * 64bit values.
@@ -1775,6 +1777,12 @@ init(bool is_no_vacuum)
 		}
 	}

+	/* checkpoint */
+	if (!is_no_checkpoint)
+	{
+		fprintf(stderr, "checkpoint...\n");
+		executeStatement(con, "checkpoint");
+	}
 	fprintf(stderr, "done.\n");

 	PQfinish(con);
@@ -2213,6 +2221,7 @@ main(int argc, char **argv)
 		{"jobs", required_argument, NULL, 'j'},
 		{"log", no_argument, NULL, 'l'},
 		{"no-vacuum", no_argument, NULL, 'n'},
+		{"no-checkpoint", no_argument, NULL, 'N'},
 		{"port", required_argument, NULL, 'p'},
 		{"progress", required_argument, NULL, 'P'},
 		{"protocol", required_argument, NULL, 'M'},
@@ -2220,7 +2229,7 @@ main(int argc, char **argv)
 		{"report-latencies", no_argument, NULL, 'r'},
 		{"scale", required_argument, NULL, 's'},
 		{"select-only", no_argument, NULL, 'S'},
-		{"skip-some-updates", no_argument, NULL, 'N'},
+		{"skip-some-updates", no_argument, NULL, 'u'},
 		{"time", required_argument, NULL, 'T'},
 		{"transactions", required_argument, NULL, 't'},
 		{"username", required_argument, NULL, 'U'},
@@ -2240,6 +2249,7 @@ main(int argc, char **argv)
 	int			nclients = 1;	/* default number of simulated clients */
 	int			nthreads = 1;	/* default number of threads */
 	int			is_init_mode = 0;		/* initialize mode? */
+	int			is_no_checkpoint = 0;	/* no checkpoint at all before testing?