Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-08-01 Thread Fabien COELHO


Hello,

Version one is k' = 1 + (a * k + b) modulo n with a prime with 
respect to n, n being the number of keys. This is nearly possible, 
but for the modulo operator which is currently missing, and that I'm 
planning to submit for this very reason, but probably another time.


That's pretty crude,


Yep. It is very simple, it is much better than nothing, and for a database 
test is may be good enough.


although I don't object to a modulo operator.  It would be nice to be 
able to use a truly random permutation, which is not hard to generate 
but probably requires O(n) storage, likely a problem for large scale 
factors.


That is indeed the actual issue in my mind. I was thinking of permutations 
with a formula, which are not so easy to find and may end-up looking like 
(a*k+b)%n anyway. I had the same issue for generating random data for a 
schema (see http://www.coelho.net/datafiller.html).


Maybe somebody who knows more math than I do (like you, probably!) can 
come up with something more clever.


I can certainly suggest other formula, but that does not mean beautiful 
code, thus would probably be rejected. I'll see.


An alternative to this whole process may be to hash/modulo a non uniform 
random value.


  id = 1 + hash(some-random()) % n

But the hashing changes the distribution as it adds collisions, so I have 
to think about how to be able to control the distribution in that case, 
and what hash function to use.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-08-01 Thread Mitsumasa KONDO
Hi,

2014-08-01 16:26 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr


  Maybe somebody who knows more math than I do (like you, probably!) can
 come up with something more clever.


 I can certainly suggest other formula, but that does not mean beautiful
 code, thus would probably be rejected. I'll see.

 An alternative to this whole process may be to hash/modulo a non uniform
 random value.

   id = 1 + hash(some-random()) % n

 But the hashing changes the distribution as it adds collisions, so I have
 to think about how to be able to control the distribution in that case, and
 what hash function to use.

I think that we have to consider and select reproducible method, because
benchmark is always needed robust and reproducible result. And if we
realize this idea, we might need more accurate random generator that is
like Mersenne twister algorithm.  erand48 algorithm is slow and not
accurate very much.

By the way, I don't know relativeness of this topic and command line
option... Well whatever...

Regards,
--
Mitsumasa KONDO


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-31 Thread Robert Haas
On Wed, Jul 30, 2014 at 4:18 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 nor am I in favor of patch B.

 Yep. Would providing these as additional contrib files be more acceptable?
 Something like tpc-b-gauss.sql... Otherwise there is no example available
 to show the feature.

To be honest, it just feels like clutter to me.  If we added examples
for every feature that is as significant as this one is, we'd end up
with twice the installation footprint, and most of it would be stuff
nobody ever looked at.  I think the documentation is good enough that
people will be able to understand how to use this feature, which is
good enough for me.

One thing that might still be worth doing is including the standard
pgbench scripts in the pgbench documentation.  Then we could say
things like and you could also modify these.  Right now I tend to
end up cut-and-pasting from the source code, which is fine if you're a
hacker but not so user-friendly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-31 Thread Robert Haas
On Wed, Jul 30, 2014 at 9:00 PM, Mitsumasa KONDO
kondo.mitsum...@gmail.com wrote:
 Hmm... It doesn't have harm for pgbench source code. And, in general,
 checking script is useful for avoiding bug.

Not if nobody runs it, or if people run it but don't know what the
output should look like.  I think anyone who knows enough to find bugs
by running these scripts probably doesn't need the scripts.

 No, patch B is still needed. Please tell me the reason. I don't like
 deciding by someones feeling,
 and it needs logical reason. Our documentation is better than the past. I
 think it can easy to understand decile probability.
 This part of the discussion is needed to continue...

 Would providing these as additional contrib files be more acceptable?
 Something like tpc-b-gauss.sql... Otherwise there is no example available
 to show the feature.

 I agree the test script and including command line options. It's not harm,
 and it's useful.

As to all of this, I simply don't agree that the stuff has enough
value to justify including it.  Now, of course, that is subjective:
one person may think it has enough value, while another person may
think that it does not have enough value.  So it just comes down to a
question of opinion, and we make those judgements of opinion all the
time.  If we included everything that everyone who works on the code
wants included, we'd end up with a bloated mess of stuff that nobody
cares about; indeed, we have a significant amount of stuff in the
source code that IMHO looks like somebody's debugging leftovers that
should have been removed before commit.  I don't want to add more
unless there is clear and convincing evidence that a significant
number of people want it, and that is not the case here.

Now, if we get a few reports from people saying, hey, I was doing some
benchmarking with pgbench, and I found the new gaussian feature to be
really excellent, but it sucked that there was no command-line option
for it, we can go back and add one.  No problem!  But in the meantime,
we've added the core of the feature without cluttering up the list of
command-line options with things that may or may not prove to be
useful.

One of the concerns that I have about the proposal of simply slapping
a gaussian or exponential modifier onto \setrandom aid 1 :naccounts is
that, while it will allow you to make part of the relation hot and
another part of the relation cold, you really can't get any more
fine-grained than that.  If you use exponential, all the hot accounts
will be near the beginning of the relation, and if you use gaussian,
they'll all be in the middle.  I'm not sure exactly will happen after
some updating has happened; I'm guessing some of the keys will still
be in their original location and others will have been pushed to the
end of the relation following relation-extension.  But there's no way,
with those command line options, to for example have 5 hot spots
distributed uniformly through the relation; or even to have the end of
the relation rather than the beginning or the middle as the hot spot.
You can do those things with the newly-enhanced \setrand *and a custom
script* but not with just a command-line option.  So that makes me
think that people who find these new facilities useful might not get
all that much use out of the command-line option anyway; and we can't
have a command-line option for every behavior anyone ever wants.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-31 Thread Fabien COELHO


Hello Robert,

[...]

One of the concerns that I have about the proposal of simply slapping a 
gaussian or exponential modifier onto \setrandom aid 1 :naccounts is 
that, while it will allow you to make part of the relation hot and 
another part of the relation cold, you really can't get any more 
fine-grained than that. If you use exponential, all the hot accounts 
will be near the beginning of the relation, and if you use gaussian, 
they'll all be in the middle.


That is a very good remark. Although I thought of it, I do not have a very 
good solution yet:-)


From a testing perspective, if we assume that keys have no semantics, a 
reasonable assumption is that the distribution of access for actual 
realistic workloads is probably exponential (of gaussian, anyway hardly 
uniform), but without direct correlation between key values.


In order to simulate that, we would have to apply a fixed (pseudo-)random 
permutation to the exponential-drawn key values. This is a non trivial 
problem. The version zero of solving it is to do nothing... it is the 
current status;-) Version one is k' = 1 + (a * k + b) modulo n with a 
prime with respect to n, n being the number of keys. This is nearly 
possible, but for the modulo operator which is currently missing, and that 
I'm planning to submit for this very reason, but probably another time.


I'm not sure exactly will happen after some updating has happened; I'm 
guessing some of the keys will still be in their original location and 
others will have been pushed to the end of the relation following 
relation-extension.


This is a not too bad side. What matters most in the long term is not the 
key value correlation, but the actual storage correlation, i.e. whether 
two tuples required are in the same page or not. At the beginning of a 
simulation, with close key numbers being picked up with an exponential 
distribution, the correlation is more that what would be expected. 
However, once a significant amount of the table has been updated, this 
initial artificial correlation is going to fade, and the test should 
become more realistic.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-31 Thread Robert Haas
On Thu, Jul 31, 2014 at 10:01 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 One of the concerns that I have about the proposal of simply slapping a
 gaussian or exponential modifier onto \setrandom aid 1 :naccounts is that,
 while it will allow you to make part of the relation hot and another part of
 the relation cold, you really can't get any more fine-grained than that. If
 you use exponential, all the hot accounts will be near the beginning of the
 relation, and if you use gaussian, they'll all be in the middle.

 That is a very good remark. Although I thought of it, I do not have a very
 good solution yet:-)

 From a testing perspective, if we assume that keys have no semantics, a
 reasonable assumption is that the distribution of access for actual
 realistic workloads is probably exponential (of gaussian, anyway hardly
 uniform), but without direct correlation between key values.

 In order to simulate that, we would have to apply a fixed (pseudo-)random
 permutation to the exponential-drawn key values. This is a non trivial
 problem. The version zero of solving it is to do nothing... it is the
 current status;-) Version one is k' = 1 + (a * k + b) modulo n with a
 prime with respect to n, n being the number of keys. This is nearly
 possible, but for the modulo operator which is currently missing, and that
 I'm planning to submit for this very reason, but probably another time.

That's pretty crude, although I don't object to a modulo operator.  It
would be nice to be able to use a truly random permutation, which is
not hard to generate but probably requires O(n) storage, likely a
problem for large scale factors.  Maybe somebody who knows more math
than I do (like you, probably!) can come up with something more
clever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-30 Thread Robert Haas
On Tue, Jul 29, 2014 at 4:41 AM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 Attached B patch does turn incorrect setrandom syntax into errors instead
 of ignoring extra parameters.

 First A patch is repeated to help commitfest references.

 Oops, I applied the change on the wrong part:-(

 Here is the change on part A which checks setrandom syntax, and B for
 completeness.

I've committed the changes to pgbench.c and the documentation changes
with some further wordsmithing.  I don't think including the other
changes in patch A is a good idea, nor am I in favor of patch B.  But
thanks for your and Kondo-san's hard work on this; I think this will
be quite useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-30 Thread Fabien COELHO


Hello Robert,

I've committed the changes to pgbench.c and the documentation changes 
with some further wordsmithing.


Ok, thanks a lot for your reviews and your help with improving the 
documentation.



I don't think including the other changes in patch A is a good idea,


Fine. It was mostly for testing and checking purposes.


nor am I in favor of patch B.


Yep. Would providing these as additional contrib files be more acceptable? 
Something like tpc-b-gauss.sql... Otherwise there is no example 
available to show the feature.


Thanks again,

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-30 Thread Mitsumasa KONDO
Hi,

2014-07-31 5:18 GMT+09:00 Fabien COELHO coe...@cri.ensmp.fr:

  I've committed the changes to pgbench.c and the documentation changes
 with some further wordsmithing.


 Ok, thanks a lot for your reviews and your help with improving the
 documentation.

Yeah, thanks for all relative members.


  I don't think including the other changes in patch A is a good idea,


 Fine. It was mostly for testing and checking purposes.

Hmm... It doesn't have harm for pgbench source code. And, in general,
checking script is useful for avoiding bug.

 nor am I in favor of patch B.


 Yep.

No, patch B is still needed. Please tell me the reason. I don't like
deciding by someones feeling,
and it needs logical reason. Our documentation is better than the past. I
think it can easy to understand decile probability.
This part of the discussion is needed to continue...

Would providing these as additional contrib files be more acceptable?
 Something like tpc-b-gauss.sql... Otherwise there is no example available
 to show the feature.

I agree the test script and including command line options. It's not harm,
and it's useful.

Best regards,
--
Mitsumasa KONDO


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-29 Thread Fabien COELHO


Hello Robert,

I wish to agree, but my interpretation of the previous code is that 
they were ignored before, so ISTM that we are stuck with keeping the 
same unfortunate behavior.


I don't agree.  I'm not in a huge hurry to fix all the places where 
pgbench currently lacks error checks just because I don't have enough to 
do (hint: I do have enough to do), but when we're adding more 
complicated syntax in one particular place, bringing the error checks in 
that portion of the code up to scratch is an eminently sensible thing to 
do, and we should do it.


Ok. I'm in favor of that anyway. It is just that was afraid that changing 
behavior, however poor the said behavior, could be a blocker.



Also, please stop changing the title of this thread every other post.
It breaks threading for me (and anyone else using gmail), and that
makes the thread hard to follow.


Sorry. It does not break my mailer which relies on internal headers, but 
I'll try to be compatible with this gmail features in the future.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-29 Thread Fabien COELHO


Hello Robert,


3. Similarly, I suggest that the use of gaussian or uniform be an
error when argc  6 OR argc  6.  I also suggest that the
parenthesized distribution type be dropped from the error message in
all cases.


I wish to agree, but my interpretation of the previous code is that they
were ignored before, so ISTM that we are stuck with keeping the same
unfortunate behavior.


I don't agree.


Attached B patch does turn incorrect setrandom syntax into errors instead 
of ignoring extra parameters.


First A patch is repeated to help commitfest references.

--
Fabien.diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test  test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test  test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..e07206a 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread-random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold  0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread-random_state);
+	/*
+	 * inner expresion in (cut, 1] (if threshold  0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold  stdev = threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) = 2 = r = e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread-random_state);
+		double rand2 = 1.0 - pg_erand48(thread-random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test. To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev  -threshold || stdev = threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + threshold) / (threshold * 2.0);
+
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
 			char	   *var;
 			int64		min,
 		max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max  min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * Generate random number functions need to be able to subtract
+			 * max from min and add one to the result without overflowing.
+			 * Since we know max  min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the 

Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-29 Thread Fabien COELHO


Attached B patch does turn incorrect setrandom syntax into errors instead of 
ignoring extra parameters.


First A patch is repeated to help commitfest references.


Oops, I applied the change on the wrong part:-(

Here is the change on part A which checks setrandom syntax, and B for 
completeness.


--
Fabien.diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test  test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test  test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..16e44bd 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread-random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold  0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread-random_state);
+	/*
+	 * inner expresion in (cut, 1] (if threshold  0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold  stdev = threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) = 2 = r = e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case. For a 5.0 threshold value, the looping probability
+	 * is about e^{-5} * 2 / pi ~ 0.43%.
+	 */
+	do
+	{
+		/*
+		 * pg_erand48 generates [0,1), but for the basic version of the
+		 * Box-Muller transform the two uniformly distributed random numbers
+		 * are expected in (0, 1] (see http://en.wikipedia.org/wiki/Box_muller)
+		 */
+		double rand1 = 1.0 - pg_erand48(thread-random_state);
+		double rand2 = 1.0 - pg_erand48(thread-random_state);
+
+		/* Box-Muller basic form transform */
+		double var_sqrt = sqrt(-2.0 * log(rand1));
+		stdev = var_sqrt * sin(2.0 * M_PI * rand2);
+
+		/*
+		 * we may try with cos, but there may be a bias induced if the previous
+		 * value fails the test. To be on the safe side, let us try over.
+		 */
+	}
+	while (stdev  -threshold || stdev = threshold);
+
+	/* stdev is in [-threshold, threshold), normalization to [0,1) */
+	rand = (stdev + threshold) / (threshold * 2.0);
+
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
 /* call PQexec() and exit() on failure */
 static void
 executeStatement(PGconn *con, const char *sql)
@@ -1319,6 +1391,7 @@ top:
 			char	   *var;
 			int64		min,
 		max;
+			double		threshold = 0;
 			char		res[64];
 
 			if (*argv[2] == ':')
@@ -1364,11 +1437,11 @@ top:
 			}
 
 			/*
-			 * getrand() needs to be able to subtract max from min and add one
-			 * to the result without overflowing.  Since we know max  min, we
-			 * can detect overflow just by checking for a negative result. But
-			 * we must check both that the subtraction doesn't overflow, and
-			 * that adding one to the result doesn't overflow either.
+			 * Generate random number functions need to be able to subtract
+			 * max from min and add one to the result without overflowing.
+			 * Since we know max  min, we can detect overflow just by checking
+			 * for a negative result. But we must check both that the subtraction
+			 * doesn't overflow, and that adding one to the result doesn't overflow either.
 			 */
 			if (max - min  0 || (max - min) + 1  0)
 			{
@@ -1377,10 +1450,64 @@ top:
 return true;
 			}
 
+			if (argc == 4 || /* uniform without or with uniform keyword */
+(argc == 5  pg_strcasecmp(argv[4], uniform) == 0))
+			

Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-28 Thread Robert Haas
On Wed, Jul 23, 2014 at 12:39 PM, Fabien COELHO coe...@cri.ensmp.fr wrote:
 3. Similarly, I suggest that the use of gaussian or uniform be an
 error when argc  6 OR argc  6.  I also suggest that the
 parenthesized distribution type be dropped from the error message in
 all cases.

 I wish to agree, but my interpretation of the previous code is that they
 were ignored before, so ISTM that we are stuck with keeping the same
 unfortunate behavior.

I don't agree.  I'm not in a huge hurry to fix all the places where
pgbench currently lacks error checks just because I don't have enough
to do (hint: I do have enough to do), but when we're adding more
complicated syntax in one particular place, bringing the error checks
in that portion of the code up to scratch is an eminently sensible
thing to do, and we should do it.

Also, please stop changing the title of this thread every other post.
It breaks threading for me (and anyone else using gmail), and that
makes the thread hard to follow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-24 Thread Mitsumasa KONDO
Hi,

Thank you for your grate documentation and fix working!!!
It becomes very helpful for understanding our feature.

I add two feature in gauss_B_4.patch.

1) Add gaussianProbability() function
It is same as exponentialProbability(). And the feature is as same as
before.

2) Add result of max/min percent of the range
It is almost same as --exponential option's result. However, max percent of
the range is center of distribution
and min percent of the range is most side of distribution.
Here is the output example,

+ pgbench_account's aid selected with a truncated gaussian distribution

+ standard deviation threshold: 5.0

+ decile percents: 0.0% 0.1% 2.1% 13.6% 34.1% 34.1% 13.6% 2.1% 0.1% 0.0%

+ probability of max/min percent of the range: 4.0% 0.0%


And I add the explanation about this in the document.

I'm very appreciate for your works!!!


Best regards,

--

Mitsumasa KONDO


gauss_B_5.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench -- splits v4

2014-07-23 Thread Fabien COELHO


Hello Robert,


Some review comments:


Thanks a lot for your return.

Please find attached two new parts of the patch (A for setrandom 
extension, B for pgbench embedded test case extension).



1. I suggest that getExponentialrand and getGaussianrand be renamed to
getExponentialRand and getGaussianRand.


Done.

It was named like that because getrand was used for the uniform case.



2. I suggest that the code be changed so that the branch currently
labeled as /* uniform with extra argument */ become a hard error
instead of a warning.

3. Similarly, I suggest that the use of gaussian or uniform be an
error when argc  6 OR argc  6.  I also suggest that the
parenthesized distribution type be dropped from the error message in
all cases.


I wish to agree, but my interpretation of the previous code is that they 
were ignored before, so ISTM that we are stuck with keeping the same 
unfortunate behavior.



4. This question mark seems like it should be a period:

+  * value fails the test? To be on the safe side, let us try over.


Indeed.


5. With regards to the following paragraph:

 para
+  The default random distribution is uniform, that is all values in the
+  range are drawn with equal probability. The gaussian and exponential
+  options allow to change this default. The mandatory
+  replaceablethreshold/ double value controls the actual distribution
+  with gaussian or exponential.
+ /para

This paragraph needs a bit of copy-editing.  Here's an attempt: By
default, all values in the range are drawn with equal probability.
The literalgaussian/ and literalexponential/ options modify
this behavior; each requires a mandatory threshold which determines
the precise shape of the distribution.  The following paragraph
should be changed to begin with For a Gaussian distribution and the
one after For an exponential distribution.


Ok. I've kept uniform in the first sentence, because this is both
an option name and it is significant in term of probabilities.


6. Overall, I think the documentation here looks much better now, but
I suggest adding one or two example to the Gaussian section.  Like
this: for example, if threshold is 2.0, 68% of the values will fall in
the middle third of the interval; with a threshold of 3.0, 99.7% of
the values will fall in the middle third of the interval.  These
numbers are fabricated, and the middle third of the interval might not
be the best part to talk about, but you get the idea (I hope).


Done with threshold value 4.0 so I have a middle quarter and a middle 
half.


--
Fabien.diff --git a/contrib/pgbench/README b/contrib/pgbench/README
new file mode 100644
index 000..6881256
--- /dev/null
+++ b/contrib/pgbench/README
@@ -0,0 +1,5 @@
+# gaussian and exponential tests
+# with XXX as expo or gauss
+psql test  test-init.sql
+./pgbench -M prepared -f test-XXX-run.sql -t 100 -P 1 -n test
+psql test  test-XXX-check.sql
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 4aa8a50..e07206a 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -98,6 +98,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -471,6 +473,76 @@ getrand(TState *thread, int64 min, int64 max)
 	return min + (int64) ((max - min + 1) * pg_erand48(thread-random_state));
 }
 
+/*
+ * random number generator: exponential distribution from min to max inclusive.
+ * the threshold is so that the density of probability for the last cut-off max
+ * value is exp(-threshold).
+ */
+static int64
+getExponentialRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double cut, uniform, rand;
+	Assert(threshold  0.0);
+	cut = exp(-threshold);
+	/* erand in [0, 1), uniform in (0, 1] */
+	uniform = 1.0 - pg_erand48(thread-random_state);
+	/*
+	 * inner expresion in (cut, 1] (if threshold  0),
+	 * rand in [0, 1)
+	 */
+	Assert((1.0 - cut) != 0.0);
+	rand = - log(cut + (1.0 - cut) * uniform) / threshold;
+	/* return int64 random number within between min and max */
+	return min + (int64)((max - min + 1) * rand);
+}
+
+/* random number generator: gaussian distribution from min to max inclusive */
+static int64
+getGaussianRand(TState *thread, int64 min, int64 max, double threshold)
+{
+	double		stdev;
+	double		rand;
+
+	/*
+	 * Get user specified random number from this loop, with
+	 * -threshold  stdev = threshold
+	 *
+	 * This loop is executed until the number is in the expected range.
+	 *
+	 * As the minimum threshold is 2.0, the probability of looping is low:
+	 * sqrt(-2 ln(r)) = 2 = r = e^{-2} ~ 0.135, then when taking the average
+	 * sinus multiplier as 2/pi, we have a 8.6% looping probability in the
+	 * worst case.