Re: [HACKERS] [PATCH] Negative Transition Aggregate Functions (WIP)

2014-04-01 Thread Dean Rasheed
On 31 March 2014 01:58, Florian Pflug  wrote:
> Attached are updated patches that include the EXPLAIN changes mentioned
> above and updated docs.
>

These patches need re-basing --- they no longer apply to HEAD.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] gaussian distribution pgbench

2014-04-01 Thread Fabien COELHO


Please find attached an updated version "v13" for this patch.

I have (I hope) significantly improved the documentation, including a 
(possibly not so helpful) mathematical explanation of the actual meaning of 
the threshold value. If a native English speaker could check the 
documentation, that would be nice!


I have improved the implementation of the exponential distribution so as 
to avoid a loop, which makes it possible to lift the minimum threshold value 
constraint, and the exponential pgbench summary now displays decile and 
first/last percent drawing probabilities. However, the same simplification 
cannot be applied to the gaussian distribution part, which must rely on a 
loop and thus needs a minimal threshold for performance. I have also checked 
(see the 4 attached scripts) the actual distribution against the computed 
probabilities.
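
For readers who want a concrete picture of the loop-free draw, here is a
minimal C sketch (an illustration only, not the patch itself) of an
inverse-transform draw of an integer in [min, max], exponentially biased
towards min with decay parameter "threshold" (> 0); u is assumed to be a
uniform double in [0, 1), e.g. from pg_erand48():

#include <math.h>
#include <stdint.h>

/* Sketch only: loop-free truncated-exponential draw over [min, max]. */
static int64_t
exponential_draw(int64_t min, int64_t max, double threshold, double u)
{
    /* invert the CDF of an exponential distribution truncated to [0, 1) */
    double x = -log(1.0 - u * (1.0 - exp(-threshold))) / threshold;

    return min + (int64_t) (x * (double) (max - min + 1));
}

A gaussian draw clipped to +/- threshold standard deviations, by contrast,
typically has to reject and retry out-of-range values, which is why that part
keeps a loop and a minimum threshold.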



I disagree with the suggestion to remove the included gaussian & 
exponential test variants, because (1) it would mean removing the 
specific summaries as well, which are essential to get a feel for how the 
feature works; (2) the corresponding code in the source is rather 
straightforward; (3) the tests correspond to the schema and data created 
with -i, so it makes sense that they are stored in pgbench; (4) for this 
feature to be used, it is best that it is available directly and simply 
from pgbench, rather than having to be sought elsewhere.



If this is a commit blocker, then the embedded scripts will have to be 
removed, but I really think that they add significant value to pgbench 
and its "non uniform" features because they make them easy to test.



If Mitsumasa-san agrees with these proposed changes, I would suggest
applying this patch.

--
Fabien
diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index 7c1e59e..eb1ecb3 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef HAVE_SYS_SELECT_H
 #include 
 #endif
@@ -98,6 +99,8 @@ static int	pthread_join(pthread_t th, void **thread_return);
 #define LOG_STEP_SECONDS	5	/* seconds between log messages */
 #define DEFAULT_NXACTS	10		/* default nxacts */
 
+#define MIN_GAUSSIAN_THRESHOLD		2.0	/* minimum threshold for gauss */
+
 int			nxacts = 0;			/* number of transactions per client */
 int			duration = 0;		/* duration in seconds */
 
@@ -169,6 +172,14 @@ bool		is_connect;			/* establish connection for each transaction */
 bool		is_latencies;		/* report per-command latencies */
 int			main_pid;			/* main process id used in log filename */
 
+/* gaussian distribution tests: */
+double		stdev_threshold;   /* standard deviation threshold */
+bool		use_gaussian = false;
+
+/* exponential distribution tests: */
+double		exp_threshold;   /* threshold for exponential */
+bool		use_exponential = false;
+
 char	   *pghost = "";
 char	   *pgport = "";
 char	   *login = NULL;
@@ -330,6 +341,88 @@ static char *select_only = {
 	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
 };
 
+/* --exponential case */
+static char *exponential_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;\n"
+	"UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -N case */
+static char *exponential_simple_update = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers) " * :scale\n"
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"\\setrandom bid 1 :nbranches\n"
+	"\\setrandom tid 1 :ntellers\n"
+	"\\setrandom delta -5000 5000\n"
+	"BEGIN;\n"
+	"UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+	"INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);\n"
+	"END;\n"
+};
+
+/* --exponential with -S case */
+static char *exponential_select_only = {
+	"\\set naccounts " CppAsString2(naccounts) " * :scale\n"
+	"\\setrandom aid 1 :naccounts exponential :exp_threshold\n"
+	"SELECT abalance FROM pgbench_accounts WHERE aid = :aid;\n"
+};
+
+/* --gaussian case */
+static char *gaussian_tpc_b = {
+	"\\set nbranches " CppAsString2(nbranches) " * :scale\n"
+	"\\set ntellers " CppAsString2(ntellers)

Re: [HACKERS] GSoC 2014 proposal

2014-04-01 Thread Heikki Linnakangas

On 03/30/2014 11:50 PM, Иван Парфилов wrote:

The implementation of this algorithm would be for data type cube and based
on GiST.

The key concept of BIRCH algorithm is clustering feature. Given a set of N
d-dimensional data points, the clustering feature CF of the set is defined
as the triple CF = (N,LS,SS), where LS is the linear sum and SS is the
square sum of data points. Clustering features are organized in a CF tree,
which is a height balanced tree with two parameters: branching factor B and
threshold T.

Because the structure of the CF tree is similar to a B+-tree, we can use GiST
for the implementation [2].
GiST is a balanced tree structure like a B-tree, containing (key, pointer)
pairs. A GiST key is a member of a user-defined class, and
represents some property that is true of all data items reachable from the
pointer associated with the key. GiST makes it possible to create
custom data types with indexed access methods and an extensible set of
queries.
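
As an aside for readers unfamiliar with the structure, here is a minimal C
sketch of the clustering feature described above (dimension, names and layout
are illustrative only); the useful property is that CFs are additive, so
merging two clusters is just component-wise addition:

#include <math.h>

#define DIM 2                   /* dimensionality of the points */

/* Clustering feature CF = (N, LS, SS) for a set of d-dimensional points. */
typedef struct ClusteringFeature
{
    long    n;                  /* number of points absorbed */
    double  ls[DIM];            /* linear sum, per dimension */
    double  ss;                 /* square sum over all points */
} ClusteringFeature;

/* Merging two clusters just adds the components. */
static void
cf_merge(ClusteringFeature *dst, const ClusteringFeature *src)
{
    dst->n += src->n;
    for (int i = 0; i < DIM; i++)
        dst->ls[i] += src->ls[i];
    dst->ss += src->ss;
}

/* Radius of the cluster (assumes n > 0), compared against threshold T. */
static double
cf_radius(const ClusteringFeature *cf)
{
    double  centroid_sq = 0.0;

    for (int i = 0; i < DIM; i++)
    {
        double c = cf->ls[i] / cf->n;
        centroid_sq += c * c;
    }
    return sqrt(cf->ss / cf->n - centroid_sq);
}

The radius computed this way is what would be compared against the threshold
T when deciding whether a point can be absorbed into an existing leaf entry.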


The BIRCH algorithm as described in the paper describes building a tree 
in memory. If I understood correctly, you're suggesting to use a 
pre-built GiST index instead. Interesting idea!


There are a couple of significant differences between the CF tree 
described in the paper and GiST:


1. In GiST, a leaf item always represents one heap tuple. In the CF 
tree, a leaf item represents a cluster, which consists of one or more 
tuples. So the CF tree doesn't store an entry for every input tuple, 
which makes it possible to keep it in memory.


2. In the CF tree, "all entries in a leaf node must satisfy a threshold 
requirement, with respect to a threshold value T: the diameter (or 
radius) has to be less than T". GiST imposes no such restrictions. An 
item can legally be placed anywhere in the tree; placing it badly will 
just lead to degraded search performance, but it's still a legal GiST tree.


3. A GiST index, like any other index in PostgreSQL, holds entries also 
for deleted tuples, until the index is vacuumed. So you cannot just use 
information from a non-leaf node and use it in the result, as the 
information summarized at a non-leaf level includes noise from the dead 
tuples.


Can you elaborate how you are planning to use a GiST index to implement 
BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more 
strict about where in the tree an item can be stored, and lets the operator 
class specify exactly when a node is split, etc.



We need to implement a GiST-based CF tree to use in the BIRCH
algorithm.


*Example of usage(approximate):*

create table cube_test (v cube);

insert into cube_test values (cube(array[1.2, 0.4])), (cube(array[0.5, -0.2])),
                              (cube(array[0.6, 1.0])), (cube(array[1.0, 0.6]));

create index gist_cf on cube_test using gist(v);

--Prototype(approximate)

--birch(maxNodeEntries, distThreshold, distFunction)

SELECT birch(4.1, 0.2, 1) FROM cube_test;

 cluster | val1 | val2
---------+------+------
       1 |  1.2 |  0.4
       0 |  0.5 | -0.2
       1 |  0.6 |  1.0
       1 |  1.0 |  0.6

Accordingly, in this GSoC project BIRCH algorithm for data type cube would
be implemented.


From the example, it seems that birch(...) would be an aggregate 
function. Aggregates in PostgreSQL currently work by scanning all the 
input data. That would certainly be a pretty straightforward way to 
implement BIRCH too. Every input tuple would be passed to the 
so-called "transition function" (which you would write), which would 
construct a CF tree on-the-fly. At the end, the result would be 
constructed from the CF tree. With this approach, the CF tree would be 
kept in memory, and thrown away after the query.
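
To make that shape concrete, here is a minimal sketch of what the transition
function could look like at the C level; CFTree, cf_tree_create and
cf_tree_insert are hypothetical helpers, not code from any existing patch:

#include "postgres.h"
#include "fmgr.h"

/* Hypothetical in-memory CF tree built up by the transition function. */
typedef struct CFTree CFTree;
extern CFTree *cf_tree_create(MemoryContext cxt, double threshold);
extern void    cf_tree_insert(CFTree *tree, Datum point);

PG_FUNCTION_INFO_V1(birch_transfn);

Datum
birch_transfn(PG_FUNCTION_ARGS)
{
    MemoryContext aggcontext;
    CFTree       *tree;

    if (!AggCheckCallContext(fcinfo, &aggcontext))
        elog(ERROR, "birch_transfn called in non-aggregate context");

    /* first call: build an empty CF tree in the aggregate's memory context */
    if (PG_ARGISNULL(0))
        tree = cf_tree_create(aggcontext, 0.2);  /* threshold, as in the example */
    else
        tree = (CFTree *) PG_GETARG_POINTER(0);

    /* each input cube is absorbed into the closest leaf cluster */
    if (!PG_ARGISNULL(1))
        cf_tree_insert(tree, PG_GETARG_DATUM(1));

    PG_RETURN_POINTER(tree);
}

The final function would then walk the in-memory tree and build the result,
after which the whole structure is discarded, as described above.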


That would be straightforward, but wouldn't involve GiST at all. To use 
an index to implement an aggregate would require planner/executor 
changes. That would be interesting, but offhand I have no idea what that 
would look like. We'll need more details on that.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC 2014 proposal

2014-04-01 Thread Heikki Linnakangas

On 03/30/2014 11:50 PM, Иван Парфилов wrote:

* Quantifiable results*

  Adding support of BIRCH algorithm for data type cube


Aside from the details of *how* that would work, the other question is:

Do we want this in contrib/cube? There are currently no clustering 
functions, or any other statistical functions or similar, in 
contrib/cube. Just basic contains/contained/overlaps operators. And 
B-tree comparison operators which are pretty useless for cube.


Do we want to start adding such features to cube, in contrib? Or should 
that live outside the PostgreSQL source tree, in a separate extension, 
so that it could have its own release schedule, etc.? If BIRCH goes 
into contrib/cube, that's an invitation to add all kinds of functions to it.


We received another GSoC application to add another clustering algorithm 
to the MADlib project. MADlib is an extension to PostgreSQL with a lot 
of different statistical tools, so MADlib would be a natural home for 
BIRCH too. But if it requires backend changes (i.e. changes to GiST), 
then that needs to be discussed on pgsql-hackers, and it probably would 
be better to do a reference implementation in contrib/cube. MADlib could 
later copy it from there.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using arrays within structure in ECPG

2014-04-01 Thread Ashutosh Bapat
Hi Michael,
I tried to fix the offset problem. PFA the patch. It does solve the problem
of setting the wrong offset in the ECPGdo() call.

But then there is the problem of interpreting the result from the server as
an array within an array of structures. The problem is in ecpg_get_data():
this function cannot understand that the "field" is an array of integers
(or, for that matter, an array of anything) and store all the values in
contiguous memory at the given address.
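
To illustrate the case being discussed, a hypothetical host-variable layout
(names and sizes merely mirror the description quoted below; this is not the
actual test case):

/* Hypothetical ECPG host variables for the scenario discussed here. */
struct employee
{
    int     id;
    char    name[14];       /* string member: array size 14 */
    int     arr_col[3];     /* array member inside the struct */
};
struct employee emp[10];    /* array of structures fetched in one go */

/*
 * For id and name, the generated ECPGdo() call passes the number of rows
 * and an offset of sizeof(struct employee), so row n's member is found at
 * base + n * offset.  For arr_col it instead passes an array size of 3 and
 * an offset of sizeof(int), so from the second row onwards the computed
 * address no longer matches &emp[n].arr_col -- and even with the offset
 * fixed, ecpg_get_data() would still have to learn to store a whole integer
 * array contiguously at that address.
 */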



On Thu, Mar 27, 2014 at 11:05 PM, Michael Meskes wrote:

> On Mon, Mar 24, 2014 at 11:52:30AM +0530, Ashutosh Bapat wrote:
> > For all the members of struct employee, except arr_col, the size of array
> > is set to 14 and next member offset is set of sizeof (struct employee).
> But
> > for arr_col they are set to 3 and sizeof(int) resp. So, for the next row
> > onwards, the calculated offset of arr_col member would not coincide with
> > the real arr_col member's address.
> >
> > Am I missing something here?
>
> No, this looks like a bug to me. I haven't had time to look into the
> source code but the offset definitely is off.
>
> Michael
> --
> Michael Meskes
> Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
> Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
> Jabber: michael.meskes at gmail dot com
> VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/src/interfaces/ecpg/preproc/type.c b/src/interfaces/ecpg/preproc/type.c
index 2982cb6..53983fe 100644
--- a/src/interfaces/ecpg/preproc/type.c
+++ b/src/interfaces/ecpg/preproc/type.c
@@ -296,21 +296,22 @@ ECPGdump_a_type(FILE *o, const char *name, struct ECPGtype * type, const int bra
 	  type->u.element,
 	  (ind_type == NULL) ? NULL : ((ind_type->type == ECPGt_NO_INDICATOR) ? ind_type : ind_type->u.element),
 	  prefix, ind_prefix);
 	break;
 default:
 	if (!IS_SIMPLE_TYPE(type->u.element->type))
 		base_yyerror("internal error: unknown datatype, please report this to ");
 
 	ECPGdump_a_simple(o, name,
 	  type->u.element->type,
-	  type->u.element->size, type->size, NULL, prefix, type->u.element->counter);
+	  type->u.element->size, type->size, struct_sizeof ? struct_sizeof : NULL,
+	  prefix, type->u.element->counter);
 
 	if (ind_type != NULL)
 	{
 		if (ind_type->type == ECPGt_NO_INDICATOR)
 		{
 			char   *str_neg_one = mm_strdup("-1");
 			ECPGdump_a_simple(o, ind_name, ind_type->type, ind_type->size, str_neg_one, NULL, ind_prefix, 0);
 			free(str_neg_one);
 		}
 		else

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] datistemplate of pg_database does not behave as per description in documentation

2014-04-01 Thread Rajeev rastogi
On 27 March 2014 17:16, Tom Lane Wrote:

> I agree we need to make the docs match the code, but changing behavior
> that's been like that for ten or fifteen years isn't the answer.

Sounds good.

Please find the attached patch with only documentation change.

Thanks and Regards,
Kumar Rajeev Rastogi




datistemplate_issueV2.patch
Description: datistemplate_issueV2.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch for Todo Item : Provide fallback_application_name in contrib/pgbench, oid2name, and dblink

2014-04-01 Thread Adrian Vondendriesch
Hi,

Am Wed, 12 Feb 2014 13:47:41 -0500
schrieb Robert Haas :
> On Mon, Feb 10, 2014 at 12:14 PM, Jeff Janes 
> wrote:
> >> Presumably whatever behavior difference you are seeing is somehow
> >> related to the use of PQconnectdbParams() rather than
> >> PQsetdbLogin(), but the fine manual does not appear to document a
> >> different between those functions as regards password-prompting
> >> behavior or .pgpass usage.
> >
> > It looks like PQsetdbLogin() has either NULL or empty string passed
> > to it match 5432 in pgpass, while PQconnectdbParams() only has NULL
> > match 5432 and empty string does not.  pgbench uses empty string if
> > no -p is specified.
> >
> > This make pgbench behave the way I think is correct, but it hardly
> > seems like the right way to fix it.
> >
> > [ kludge ]
> 
> Well, it seems to me that the threshold issue here is whether or not
> we should try to change the behavior of libpq.  If not, then your
> kludge might be the best option.   But if so then it isn't necessary.
> However, I don't feel confident to pass judgement on the what the
> libpq semantics should be.
> 

I noticed that pgbench doesn't use all variables from my PGSERVICE
definition. Then I came across this thread.

> export PGSERVICE=test_db_head

> cat ~/.pg_service.conf
> [test_db_head]
> host=/tmp
> user=avo
> port=5496
> dbname=pgbench

> /usr/local/postgresql/head/bin/pgbench -s 1 -i
> Connection to database "" failed:
> could not connect to server: No such file or directory
>   Is the server running locally and accepting
>   connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

As you noticed before, pgbench initialises some of its connection
parameters with empty strings, which override the variables defined
in .pg_service.conf.

I patched the function conninfo_array_parse(), which is used by
PQconnectStartParams, to behave like PQsetdbLogin. The patch also
contains a documentation patch which clarifies "unspecified" parameters.

Now, PQconnectStartParams will handle empty strings in exactly the same
way as it handles NULL, and pgbench runs as expected:

> /usr/local/postgresql/head/bin/pgbench -s 1 -i
> NOTICE:  table "pgbench_history" does not exist, skipping
> NOTICE:  table "pgbench_tellers" does not exist, skipping
> NOTICE:  table "pgbench_accounts" does not exist, skipping
> NOTICE:  table "pgbench_branches" does not exist, skipping
> creating tables...
> 10 of 10 tuples (100%) done (elapsed 0.21 s, remaining 0.00
> s). vacuum...
> set primary keys...
> done.

Kind Regards

- Adrian
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index d53c41f..253615e 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -4321,7 +4321,7 @@ conninfo_array_parse(const char *const * keywords, const char *const * values,
 		const char *pname = keywords[i];
 		const char *pvalue = values[i];
 
-		if (pvalue != NULL)
+		if (pvalue != NULL && pvalue[0] != '\0')
 		{
 			/* Search for the param record */
 			for (option = options; option->keyword != NULL; option++)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inheritance of foreign key constraints.

2014-04-01 Thread Andrzej Mazurkiewicz
Good Afternoon.

Enclosed please find a continuation of the discussion of accidentally or 
maliciously breaking server consistency.

After reading, please comment if there are further objections to changing the 
dependency type for the trigger-to-constraint dependency from 
DEPENDENCY_INTERNAL to DEPENDENCY_AUTO.

That change is necessary to reduce the scope of the modifications needed for 
an implementation of the inheritance of foreign key constraints, particularly 
for the removal of objects.

Kind Regards
Andrzej Mazurkiewicz


On Saturday 22 of March 2014 11:13:56 you wrote:
> Andrzej Mazurkiewicz  writes:
> >> So in other words, somebody could (accidentally or maliciously) break the
> >> constraint by dropping one of its implementation triggers.  I doubt
> >> that's
> >> acceptable.

I have done some more digging in the subject.

All of the following tests are performed on my patched 9.3 postgres server, 
where the dependency type for the trigger-to-constraint dependency has been 
changed to DEPENDENCY_AUTO.

It seems that if the trigger is internal (tgisinternal = true), it is not 
visible to the DROP TRIGGER command. So it cannot be deleted using the DROP 
TRIGGER command, although the dependency type is DEPENDENCY_AUTO (ref. the 
last SELECT).

Please have a look at the following actions.

Kind regards
Andrzej Mazurkiewicz



They are performed by the lipa user. The lipa user is not a superuser.

postgres=# CREATE USER lipa;
CREATE ROLE
postgres=# CREATE DATABASE lipa OWNER lipa;
CREATE DATABASE


postgres93@tata:~$ psql -W lipa lipa
Password for user lipa: 
psql (9.3.3)
Type "help" for help.

lipa=> SELECT CURRENT_USER;
 current_user 
--------------
 lipa
(1 row)

lipa=> CREATE TABLE master (master_a int, CONSTRAINT pk_master PRIMARY KEY 
(master_a));
CREATE TABLE
lipa=> CREATE TABLE detail (master_a int, detail_a int, CONSTRAINT fk0_detail 
FOREIGN KEY (master_a) REFERENCES master(master_a));
CREATE TABLE
lipa=> SELECT oid, tgrelid, tgname FROM pg_trigger ;
  oid  | tgrelid |            tgname
-------+---------+------------------------------
 19322 |   19313 | RI_ConstraintTrigger_a_19322
 19323 |   19313 | RI_ConstraintTrigger_a_19323
 19324 |   19318 | RI_ConstraintTrigger_c_19324
 19325 |   19318 | RI_ConstraintTrigger_c_19325
(4 rows)

lipa=> DROP TRIGGER RI_ConstraintTrigger_c_19322 ON master;
ERROR:  trigger "ri_constrainttrigger_c_19322" for table "master" does not 
exist
lipa=> DROP TRIGGER RI_ConstraintTrigger_c_19322 ON detail;
ERROR:  trigger "ri_constrainttrigger_c_19322" for table "detail" does not 
exist

lipa=> SELECT oid, tgrelid, tgname, tgconstraint FROM pg_trigger ;
  oid  | tgrelid |            tgname            | tgconstraint
-------+---------+------------------------------+--------------
 19322 |   19313 | RI_ConstraintTrigger_a_19322 |        19321
 19323 |   19313 | RI_ConstraintTrigger_a_19323 |        19321
 19324 |   19318 | RI_ConstraintTrigger_c_19324 |        19321
 19325 |   19318 | RI_ConstraintTrigger_c_19325 |        19321
(4 rows)

lipa=> SELECT * FROM pg_depend WHERE refobjid = 19321;
 classid | objid | objsubid | refclassid | refobjid | refobjsubid | deptype
---------+-------+----------+------------+----------+-------------+---------
    2620 | 19322 |        0 |       2606 |    19321 |           0 | a
    2620 | 19323 |        0 |       2606 |    19321 |           0 | a
    2620 | 19324 |        0 |       2606 |    19321 |           0 | a
    2620 | 19325 |        0 |       2606 |    19321 |           0 | a
(4 rows)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Including replication slot data in base backups

2014-04-01 Thread Michael Paquier
Hi all,

As of now, pg_basebackup creates an empty repository for pg_replslot/
in a base backup, forcing the user to recreate slots on other nodes of
the cluster with pg_create_*_replication_slot, or copy pg_replslot
from another node. This is not really user-friendly especially after a
failover where a given slave may not have the replication slot
information of the master that it is replacing.

The simple patch attached adds a new option in pg_basebackup, called
--replication-slot, allowing to include replication slot information
in a base backup. This is done by extending the command BASE_BACKUP in
the replication protocol.

As it is too late for 9.4, I would like to add it to the first commit
fest of 9.5. Comments are welcome.

Regards,
-- 
Michael
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 36d16a5..4748d08 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1837,7 +1837,7 @@ The commands accepted in walsender mode are:
   
 
   
-BASE_BACKUP [LABEL 'label'] [PROGRESS] [FAST] [WAL] [NOWAIT] [MAX_RATE rate]
+BASE_BACKUP [LABEL 'label'] [PROGRESS] [FAST] [WAL] [NOWAIT] [REPLICATION_SLOT] [MAX_RATE rate]
 
  
   Instructs the server to start streaming a base backup.
@@ -1909,6 +1909,18 @@ The commands accepted in walsender mode are:

 

+REPLICATION_SLOT
+
+ 
+  By default, the backup does not include the replication slot
+  information, and pg_replslot is created as an
+  empty directory. Specifying REPLICATION_SLOT
+  causes replication slot information to be included in the backup.
+ 
+
+   
+
+   
 MAX_RATE rate
 
  
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 6ce0c8c..be77f7b 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -210,6 +210,17 @@ PostgreSQL documentation
  
 
  
+  --replication-slot
+  
+   
+Include replication slot information in the base backup. If this option
+is not specified, pg_replslot is created as an
+empty directory.
+   
+  
+ 
+
+ 
   -R
   --write-recovery-conf
   
@@ -254,7 +265,7 @@ PostgreSQL documentation
   --xlogdir=xlogdir
   

-Specifies the location for the transaction log directory. 
+Specifies the location for the transaction log directory.
 xlogdir must be an absolute path.
 The transaction log directory can only be specified when
 the backup is in plain mode.
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index f611f59..4a86117 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -45,6 +45,7 @@ typedef struct
 	bool		fastcheckpoint;
 	bool		nowait;
 	bool		includewal;
+	bool		replication_slot;
uint32  maxrate;
 } basebackup_options;
 
@@ -71,6 +72,9 @@ static bool backup_started_in_recovery = false;
 /* Relative path of temporary statistics directory */
 static char *statrelpath = NULL;
 
+/* Include replication slot data in base backup? */
+static bool include_replication_slots = false;
+
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -131,6 +135,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
datadirpathlen = strlen(DataDir);
 
backup_started_in_recovery = RecoveryInProgress();
+   include_replication_slots = opt->replication_slot;
 
startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, 
&starttli,
  &labelfile);
@@ -548,6 +553,7 @@ parse_basebackup_options(List *options, basebackup_options *opt)
 	bool		o_nowait = false;
 	bool		o_wal = false;
 	bool		o_maxrate = false;
+	bool		o_replication_slot = false;
 
MemSet(opt, 0, sizeof(*opt));
foreach(lopt, options)
@@ -618,6 +624,15 @@ parse_basebackup_options(List *options, basebackup_options *opt)
opt->maxrate = (uint32) maxrate;
o_maxrate = true;
}
+		else if (strcmp(defel->defname, "replication_slot") == 0)
+		{
+			if (o_replication_slot)
+				ereport(ERROR,
+						(errcode(ERRCODE_SYNTAX_ERROR),
+						 errmsg("duplicate option \"%s\"", defel->defname)));
+			opt->replication_slot = true;
+			o_replication_slot = true;
+		}
else
elog(ERROR, "option \"%s\" not recognized",
   

Re: [HACKERS] Patch to add support of "IF NOT EXISTS" to others "CREATE" statements

2014-04-01 Thread Tom Lane
Stephen Frost  writes:
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>> Except if I am missing something, the second query means that it is
>> going to replace the existing user test with a new one, with the
>> settings specified in the 2nd query, all being default values. As the
>> default for login is NOLOGIN, the user test should not be able to log
>> in the server.

> That's more-or-less the behavior we're trying to work out.  I've been
> meaning to go back and look at what we've been doing with the existing
> COR cases and just haven't gotten to it yet.  The pertinent question
> being if we assume the user intended for the values not specified to be
> reset to their defaults, or not.

Yes, it has to be that way.  The entire argument for COR hinges on the
assumption that if you execute the statement, and it succeeds, the
properties of the object are equivalent to what they'd be if there had
been no predecessor object.  Otherwise it's just the same as CINE,
which offers no guarantees worth mentioning about the object's
properties.

I'm willing to bend that to the extent of saying that COR leaves in place
subsidiary properties that you might add *with additional statements* ---
for example, foreign keys for a table, or privilege grants for a role.
But the properties of the role itself have to be predictable from the COR
statement, or it's useless.

> Where this is a bit more interesting is in the case of sequences, where
> resetting the sequence to zero may cause further inserts into an
> existing table to fail.

Yeah.  Sequences do have contained data, which makes COR harder to define
--- that's part of the reason why we have CINE not COR for tables, and
maybe we have to do the same for sequences.  The point being exactly
that if you use CINE, you're implicitly accepting that you don't know
the ensuing state fully.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inheritance of foreign key constraints.

2014-04-01 Thread Fabrízio de Royes Mello
On Tue, Apr 1, 2014 at 9:13 AM, Andrzej Mazurkiewicz <
andr...@mazurkiewicz.org> wrote:
>
> It seems that if the trigger is internal (tgisinternal = true) it is not
> visible to the DROP TRIGGER command. So it cannot be deleted using DROP
> TRIGGER command, although the dependency type is DEPENDENCY_AUTO (ref. the
> last SELECT).
>
> Please have a look at the following actions.
>
> They are performed by a lipa user. The lipa user is not a superuser;
>
> postgres=# CREATE USER lipa;
> CREATE ROLE
> postgres=# CREATE DATABASE lipa OWNER lipa;
> CREATE DATABASE
>
>
> postgres93@tata:~$ psql -W lipa lipa
> Password for user lipa:
> psql (9.3.3)
> Type "help" for help.
>
> lipa=> SELECT CURRENT_USER;
>  current_user
> --------------
>  lipa
> (1 row)
>
> lipa=> CREATE TABLE master (master_a int, CONSTRAINT pk_master PRIMARY KEY
> (master_a));
> CREATE TABLE
> lipa=> CREATE TABLE detail (master_a int, detail_a int, CONSTRAINT fk0_detail
> FOREIGN KEY (master_a) REFERENCES master(master_a));
> CREATE TABLE
> lipa=> SELECT oid, tgrelid, tgname FROM pg_trigger ;
>   oid  | tgrelid |            tgname
> -------+---------+------------------------------
>  19322 |   19313 | RI_ConstraintTrigger_a_19322
>  19323 |   19313 | RI_ConstraintTrigger_a_19323
>  19324 |   19318 | RI_ConstraintTrigger_c_19324
>  19325 |   19318 | RI_ConstraintTrigger_c_19325
> (4 rows)
>
> lipa=> DROP TRIGGER RI_ConstraintTrigger_c_19322 ON master;
> ERROR:  trigger "ri_constrainttrigger_c_19322" for table "master" does not
> exist
> lipa=> DROP TRIGGER RI_ConstraintTrigger_c_19322 ON detail;
> ERROR:  trigger "ri_constrainttrigger_c_19322" for table "detail" does not
> exist
>
> lipa=> SELECT oid, tgrelid, tgname, tgconstraint FROM pg_trigger ;
>   oid  | tgrelid |            tgname            | tgconstraint
> -------+---------+------------------------------+--------------
>  19322 |   19313 | RI_ConstraintTrigger_a_19322 |        19321
>  19323 |   19313 | RI_ConstraintTrigger_a_19323 |        19321
>  19324 |   19318 | RI_ConstraintTrigger_c_19324 |        19321
>  19325 |   19318 | RI_ConstraintTrigger_c_19325 |        19321
> (4 rows)
>

Try using a quoted identifier:

DROP TRIGGER "RI_ConstraintTrigger_c_19322" ON master;

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] WIP patch for Todo Item : Provide fallback_application_name in contrib/pgbench, oid2name, and dblink

2014-04-01 Thread Tom Lane
Adrian Vondendriesch  writes:
> I patched the function conninfo_array_parse() which is used by
> PQconnectStartParams to behave like PQsetdbLogin. The patch also
> contains a documentation patch which clarifies "unspecified" parameters.

I see no documentation update here.  I'm also fairly concerned about the
implication that no connection parameter, now or in future, can ever have
an empty string as the correct value.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-04-01 Thread Robert Haas
On Fri, Mar 28, 2014 at 4:47 AM, Albe Laurenz  wrote:
> Peter Geoghegan wrote:
>> With the addition of LATERAL subqueries, Tom fixed up the mechanism
>> for keeping track of which relations are visible for column references
>> while the FROM clause is being scanned. That allowed
>> errorMissingColumn() to give a more useful error to the one produced
>> by the prior coding of that mechanism, with an errhint sometimes
>> proffering: 'There is a column named "foo" in table "bar", but it
>> cannot be referenced from this part of the query'.
>>
>> I wondered how much further this could be taken. Attached patch
>> modifies contrib/fuzzystrmatch, moving its Levenshtein distance code
>> into core without actually moving the relevant SQL functions too. That
>> change allowed me to modify errorMissingColumn() to make more useful
>> suggestions as to what might have been intended under other
>> circumstances, like when someone fat-fingers a column name.
>
>> [local]/postgres=# select * from orders o join orderlines ol on o.orderid = 
>> ol.orderids limit 1;
>> ERROR:  42703: column ol.orderids does not exist
>> LINE 1: ...* from orders o join orderlines ol on o.orderid = ol.orderid...
>>  ^
>> HINT:  Perhaps you meant to reference the column "ol"."orderid".
>
> This sounds like a mild version of DWIM:
> http://www.jargondb.org/glossary/dwim
>
> Maybe it is just me, but I get uncomfortable when a program tries
> to second-guess what I really want.

It's not really DWIM, because the backend is still throwing an error.
It's just trying to help you sort out the error, along the way.
Still, I share some of your discomfort.  I see Peter's patch as an
example of a broader class of things that we could do - but I'm not
altogether sure that we want to do them.  There's a risk of adding not
only CPU cycles but also clutter.  If we do things that encourage
people to crank the log verbosity down, I think that's going to be bad
more often than it's good.  It strains credulity to think that this
patch alone would have that effect, but there might be quite a few
similar improvements that are possible.  So I think it would be good
to consider how far we want to go in this direction and where we think
we might want to stop.  That's not to say, let's not ever do this,
just, let's think carefully about where we want to end up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inheritance of foreign key constraints.

2014-04-01 Thread Tom Lane
Andrzej Mazurkiewicz  writes:
> After reading please comment if there are more objections for changing the 
> depedency type for trigger to constraint dependency from the 
> DEPENDENCY_INTERNAL to DEPENDENCY_AUTO.

I'm not sure which part of "no" you didn't understand, but we're not
doing that.

> lipa=> DROP TRIGGER RI_ConstraintTrigger_c_19322 ON master;
> ERROR:  trigger "ri_constrainttrigger_c_19322" for table "master" does not 
> exist

This has to do with case-folding and lack of double quotes, not anything
more subtle than that.  A correct test would've given results like this:

regression=# drop trigger "RI_ConstraintTrigger_a_43528" on master;
ERROR:  cannot drop trigger RI_ConstraintTrigger_a_43528 on table master 
because constraint fk0_detail on table detail requires it
HINT:  You can drop constraint fk0_detail on table detail instead.

which is the behavior we need.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-04-01 Thread Fabien COELHO


Hello pgdevs,

I noticed that my pg_stat_statements is cluttered with hundreds of entries 
like "DEALLOCATE dbdpg_p123456_7", each occurring only once.


It seems to me that it would be more helpful if these similar entries were 
aggregated together, that is, if the query "normalization" could ignore the 
name of the descriptor.
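
For illustration only, here is a minimal C sketch of the kind of
normalization meant here -- a hypothetical preprocessing step applied to the
statement text before it is fingerprinted, not how pg_stat_statements
currently treats utility statements:

#include <stdio.h>
#include <string.h>
#include <strings.h>

/*
 * Sketch: collapse "DEALLOCATE <anything>" to one canonical text so that
 * all such statements aggregate into a single pg_stat_statements entry.
 */
static void
normalize_query(const char *query, char *buf, size_t buflen)
{
    if (strncasecmp(query, "DEALLOCATE", 10) == 0)
        snprintf(buf, buflen, "DEALLOCATE ?");
    else
        snprintf(buf, buflen, "%s", query);
}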


Any thoughts about this?

--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Including replication slot data in base backups

2014-04-01 Thread Magnus Hagander
On Tue, Apr 1, 2014 at 2:24 PM, Michael Paquier
wrote:

> Hi all,
>
> As of now, pg_basebackup creates an empty repository for pg_replslot/
> in a base backup, forcing the user to recreate slots on other nodes of
> the cluster with pg_create_*_replication_slot, or copy pg_replslot
> from another node. This is not really user-friendly especially after a
> failover where a given slave may not have the replication slot
> information of the master that it is replacing.
>
> The simple patch attached adds a new option in pg_basebackup, called
> --replication-slot, allowing to include replication slot information
> in a base backup. This is done by extending the command BASE_BACKUP in
> the replication protocol.
>

--replication-slots would be a better name (plural), or probably
--include-replication-slots. (and that comment also goes for the
BASE_BACKUP syntax and variables)


But. If you want to solve the failover case, doesn't that mean you need to
include it in the *replication* stream and not just the base backup?
Otherwise, what you're sending over might well be out of date set of slots
once the failover happens? What if the set of replication slots change
between the time of the basebackup and the failover?

As it is too late for 9.4, I would like to add it to the first commit
> fest of 9.5. Comments are welcome.
>

It's not too late to fix omissions in 9.4 features though, if we consider
it that. Which I think this could probably be considered as - if we think
the solution is the right one.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-04-01 Thread Andrew Dunstan


On 04/01/2014 10:45 AM, Fabien COELHO wrote:


Hello pgdevs,

I noticed that my pg_stat_statements is cluttered with hundreds of 
entries like "DEALLOCATE dbdpg_p123456_7", each occurring only once.


It seems to me that it would be more helpful if these similar entries 
were aggregated together, that is, if the query "normalization" could 
ignore the name of the descriptor.


Any thoughts about this?



You might find this relevant: 



cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Including replication slot data in base backups

2014-04-01 Thread Andres Freund
On 2014-04-01 16:45:46 +0200, Magnus Hagander wrote:
> On Tue, Apr 1, 2014 at 2:24 PM, Michael Paquier
> wrote:
> 
> > Hi all,
> >
> > As of now, pg_basebackup creates an empty repository for pg_replslot/
> > in a base backup, forcing the user to recreate slots on other nodes of
> > the cluster with pg_create_*_replication_slot, or copy pg_replslot
> > from another node. This is not really user-friendly especially after a
> > failover where a given slave may not have the replication slot
> > information of the master that it is replacing.

What exactly is your use case for copying the slots?

> > The simple patch attached adds a new option in pg_basebackup, called
> > --replication-slot, allowing to include replication slot information
> > in a base backup. This is done by extending the command BASE_BACKUP in
> > the replication protocol.

> --replication-slots would be a better name (plural), or probably
> --include-replication-slots. (and that comment also goes for the
> BASE_BACKUP syntax and variables)

I vote for --include-replication-slots.

> But. If you want to solve the failover case, doesn't that mean you need to
> include it in the *replication* stream and not just the base backup?

They pretty fundamentally can't be in the replication stream - there's
no way to make that work with cascading setups and such.

> Otherwise, what you're sending over might well be out of date set of slots
> once the failover happens? What if the set of replication slots change
> between the time of the basebackup and the failover?

An out of date slot doesn't seem really harmful for the failover
case. All that will happen is that it will reserve too many
resources. That should be fine.

> It's not too late to fix omissions in 9.4 features though, if we consider
> it that. Which I think this could probably be considered as - if we think
> the solution is the right one.

Yea, I think adding this would be fine if we deem it necessary.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP patch for Todo Item : Provide fallback_application_name in contrib/pgbench, oid2name, and dblink

2014-04-01 Thread Adrian Vondendriesch
Am 01.04.2014 16:32, schrieb Tom Lane:
> Adrian Vondendriesch  writes:
>> I patched the function conninfo_array_parse() which is used by
>> PQconnectStartParams to behave like PQsetdbLogin. The patch also
>> contains a documentation patch which clarifies "unspecified" parameters.
> 
> I see no documentation update here.  I'm also fairly concerned about the
> implication that no connection parameter, now or in future, can ever have
> an empty string as the correct value.

If we want to preserve the possibility of accepting an empty string as a
correct value, then pgbench should initialise some variables with
NULL instead of an empty string.

Moreover, it should be documented that "unspecified" means NULL and not
an empty string, as in PQsetdbLogin.

However, attached you will find the whole patch, including documentation.

Kind Regards

- Adrian
diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index be0d602..0bac166 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -136,10 +136,11 @@ PGconn *PQconnectdbParams(const char * const *keywords,
   
 
   
-   If  any  parameter is unspecified, then the corresponding
-   environment variable (see )
-   is checked. If the  environment  variable is not set either,
-   then the indicated built-in defaults are used.
+   If  any  parameter is unspecified, then the corresponding environment
+   variable (see ) is checked. Parameters are
+   treated as unspecified if they are either NULL or contain an empty string
+   ("").  If the  environment  variable is not set either, then the
+   indicated built-in defaults are used.
   
 
   
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index d53c41f..253615e 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -4321,7 +4321,7 @@ conninfo_array_parse(const char *const * keywords, const char *const * values,
 		const char *pname = keywords[i];
 		const char *pvalue = values[i];
 
-		if (pvalue != NULL)
+		if (pvalue != NULL && pvalue[0] != '\0')
 		{
 			/* Search for the param record */
 			for (option = options; option->keyword != NULL; option++)


signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Including replication slot data in base backups

2014-04-01 Thread Robert Haas
On Tue, Apr 1, 2014 at 10:45 AM, Magnus Hagander  wrote:
> On Tue, Apr 1, 2014 at 2:24 PM, Michael Paquier 
> wrote:
>>
>> As of now, pg_basebackup creates an empty repository for pg_replslot/
>> in a base backup, forcing the user to recreate slots on other nodes of
>> the cluster with pg_create_*_replication_slot, or copy pg_replslot
>> from another node. This is not really user-friendly especially after a
>> failover where a given slave may not have the replication slot
>> information of the master that it is replacing.
>>
>> The simple patch attached adds a new option in pg_basebackup, called
>> --replication-slot, allowing to include replication slot information
>> in a base backup. This is done by extending the command BASE_BACKUP in
>> the replication protocol.
>
> --replication-slots would be a better name (plural), or probably
> --include-replication-slots. (and that comment also goes for the BASE_BACKUP
> syntax and variables)
>
> But. If you want to solve the failover case, doesn't that mean you need to
> include it in the *replication* stream and not just the base backup?
> Otherwise, what you're sending over might well be out of date set of slots
> once the failover happens? What if the set of replication slots change
> between the time of the basebackup and the failover?

As a general comment, I think that replication slots, while a great
feature, have more than the usual potential for self-inflicted injury.
 A replication slot prevents the global xmin from advancing (so your
tables will bloat) and WAL from being removed (so your pg_xlog
directory will fill up and take down the server).  The very last thing
you want to do is to keep around a replication slot that should have
been dropped, and I suspect a decent number of users are going to make
that mistake, just as they do with prepared transactions and backends
left idle in transaction.

So I view this proposal with a bit of skepticism for that reason.  If
you end up copying the replication slots when you didn't really want
to, or when you only wanted some of them, you will be sad.  In
particular, suppose you have a master and 2 standbys, each of which
has a replication slot.  The master fails; a standby is promoted.  If
the standby has the master's replication slots, that's wrong: perhaps
the OTHER standby's slot should stick around for the standby to
connect to, but the standby's OWN slot on the master shouldn't be kept
around.

It's also part of the idea here that a cascading standby should be
able to have its own slots for its downstream standbys.  It should
retain WAL locally for those standbys, but it should NOT retain WAL
for the master's other standbys.  This definitely doesn't work yet for
logical slots; I'm not sure about physical slots.  But it's part of
the plan, for sure.  Here again, copying the slots from the master is
the wrong thing.

Now, it would be great to have some more technology in this area.  It
would be pretty nifty if we could set things up so that the promotion
process could optionally assume and activate a configurable subset of
the master's slots at failover/switchover time - but the administrator
would need to also make sure those machines were going to reconnect to
the new master.  Or maybe we could find a way to automate that, too,
but either way I think we're going to need something a *lot* more
sophisticated than just copying all the slots.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-04-01 Thread Fabien COELHO


I noticed that my pg_stat_statements is cluttered with hundreds of entries 
like "DEALLOCATE dbdpg_p123456_7", each occurring only once.


It seems to me that it would be more helpful if these similar entries were 
aggregated together, that is, if the query "normalization" could ignore the 
name of the descriptor.


Any thoughts about this?


You might find this relevant: 



Indeed. Thanks for the pointer. I had guessed who the culprit was, and the 
new behavior mentioned in the blog entry may help when the new driver 
version hits my debian box.


In the meantime, ISTM that progress can be achieved on pg_stat_statements 
normalization as well.


--
Fabien.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql \d+ and oid display

2014-04-01 Thread Robert Haas
On Sun, Mar 30, 2014 at 10:04 AM, Bruce Momjian  wrote:
> On Sat, Mar 29, 2014 at 06:33:39PM -0400, Bruce Momjian wrote:
>> On Sat, Mar 29, 2014 at 06:16:19PM -0400, Tom Lane wrote:
>> > Bruce Momjian  writes:
>> > > Are you saying most people like "Has OIDs: yes", or the idea of just
>> > > displaying _a_ line if there are OIDs?  Based on default_with_oids,
>> > > perhaps we should display "With OIDs".
>> >
>> > I agree it is not unanimous.  I am curious how large the majority has to
>> > > be to change a psql display value.
>> >
>> > What I actually suggested was not *changing* the line when it's to be
>> > displayed, but suppressing it in the now-standard case where there's no
>> > OIDs.
>> >
>> > Personally I find the argument that backwards compatibility must be
>> > preserved to be pretty bogus; we have no hesitation in changing the
>> > output of \d anytime we add a new feature.  So I don't think there's
>> > a good compatibility reason why the line has to be spelled exactly
>> > "Has OIDs: yes" --- but there is a consistency reason, which is that
>> > everything else we print in this part of the \d output is of the form
>> > "label: info".
>>
>> Ah, now I understand it --- you can argue that the new "Replica
>> Identity" follows the same pattern, showing only for non-defaults (or at
>> least it will once I commit the pending patch to do that).
>
> OK, I have now applied the conditional display of "Replica Identity"
> patch (which is how it was originally coded anyway).  The attached patch
> matches Tom's suggestion of displaying the same OID text, just
> conditionally.
>
> Seeing psql \d+ will have a conditional display line in PG 9.4, making
> OIDs conditional seems to make sense.

Frankly, I think this is all completely wrong-headed.  \d+ should
display *everything*.  That's what the + means, isn't it?  Coming up
with complex rules for which things get shown and which things get
hidden just makes the output harder to understand, without any
compensating benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC project suggestion: PIVOT ?

2014-04-01 Thread Robert Haas
On Mon, Mar 31, 2014 at 1:14 AM, Craig Ringer  wrote:
> On 03/31/2014 12:49 PM, Michael Paquier wrote:
>> On Mon, Mar 31, 2014 at 1:36 PM, Fabrízio de Royes Mello
>>  wrote:
>>> It's a nice idea, but the deadline to students send a proposal was 21th
>>> April.
>> 21st of March. All the details are here:
>> http://www.postgresql.org/developer/summerofcode/
>
> Ah, thanks.
>
> There's always next year. Still curious about whether anyone's
> investigated it / tried it out.

I bet this is more complex than a typical GSoC student can handle.
Unless it's Alexander Korotkov.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Archive recovery won't be completed on some situation.

2014-04-01 Thread Robert Haas
On Tue, Apr 1, 2014 at 12:39 AM, Jeff Janes  wrote:
> On Monday, March 31, 2014, Robert Haas  wrote:
>>
>> On Fri, Mar 28, 2014 at 1:06 AM, Kyotaro HORIGUCHI
>>  wrote:
>> > Mmm. I don't think it is relevant to this problem. The problem
>> > specific here is 'The database was running until just now, but
>> > shutdown the master (by pacemaker), then restart, won't run
>> > anymore'. Deleting backup_label after immediate shutdown is the
>> > radical measure but existing system would be saved by the option.
>>
>> I don't find that very radical at all.  The backup_label file is
>> *supposed* to be removed on the master if it crashes during the
>> backup; and it should never be removed from the backup itself.  At
>> least that's how I understand it.  Unfortunately, people too often
>> remove the file from the backup and, judging by your report, leave it
>> there on the master.
>
> At first blush it seems pretty radical to me.  Just because the server was
> e-stopped doesn't mean the backup rsync/cp -r/scp etc. isn't still running,
> and it is not clear to me that yanking the backup label file out from under
> it wouldn't cause problems.  I mean, you already have problems if you are
> trying to restore from that backup, but the missing file might make those
> problems less obvious.
>
> Of course first blush is often wrong, but...

Well, I guess I was thinking mostly of the case where the whole
server's been restarted, in which case none of that stuff is still
running any more.  If there is only a database server crash, then I
agree it's murkier.  Still, you probably ought to kill off those
things if the database server crashes, and then restart the whole base
backup.  Otherwise I think you're going to be in trouble whether the
backup label file sticks around or not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql \d+ and oid display

2014-04-01 Thread Bruce Momjian
On Tue, Apr  1, 2014 at 11:30:54AM -0400, Robert Haas wrote:
> > OK, I have now applied the conditional display of "Replica Identity"
> > patch (which is how it was originally coded anyway).  The attached patch
> > matches Tom's suggestion of displaying the same OID text, just
> > conditionally.
> >
> > Seeing psql \d+ will have a conditional display line in PG 9.4, making
> > OIDs conditional seems to make sense.
> 
> Frankly, I think this is all completely wrong-headed.  \d+ should
> display *everything*.  That's what the + means, isn't it?  Coming up
> with complex rules for which things get shown and which things get
> hidden just makes the output harder to understand, without any
> compensating benefit.

Well, there are a lot of _other_ things we could display about the table
that we don't.  Are you suggesting we add those too?  What about
"Replica Identity"?  Should that always display?

The bottom line is we already have complex rules to display only what is
_reasonable_.  If you want everything, you have to look at the system
tables.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PQputCopyData dont signal error

2014-04-01 Thread Robert Haas
On Mon, Mar 31, 2014 at 4:21 PM, steve k  wrote:
> I'd love to see an actual working example where an executing C++ program was
> able to in fact determine that copy data containing bad data that was sent
> by the C++ program was rejected by the server and subsequently the C++
> program was able to raise/log/notify specifically which block of data failed
> and then log information about it.  However, all I ever got was
> PGRES_COMMAND_OK whether or not the data that was sent was rejected as
> containing bad data or not.  Effectively these were silent failures.

With all respect, you're doing someting wrong.  There's plenty of
working C code that does just this, including some I have written.
You made the comment upthread that you found it "amazing that an rdbms
engine as robust as PostGres seems to have this gaping hole in its
capabilities" - and you're right, that would be remarkable.  It would
mean, for example, that users wouldn't be able to know whether their
backups restored OK.  But it turns out that psql and pg_restore can
detect this kind of problem just fine, which means your code should be
able to do the same, if it's written correctly.  So the problem is not
that PostgreSQL doesn't have this capability; it's that you have a bug
in your code.  I can't tell you what the bug is, because I haven't
seen or tried to analyze your code, but I *can* tell you that when
things work for other people and not for you, that's a sign that
you've goofed, not a sign that the feature is missing.

Admittedly, the libpq interface is somewhat confusing, and I often
find it necessary to refer to existing examples of code when trying to
figure out how to do things correctly.  We've been maintaining
backward compatibility for a really long time, and have accumulated a
number of warts along the way, and I'm not sure how much like the
current design things would look if we started over from scratch.  So
if you want to say, hey, this interface is confusing, and it's too
hard to figure out how to use it, I'd have a hard time disagreeing
with that.  But if you want to say, COPY error detection is impossible
under all circumstances because my code for COPY error detection
doesn't work, well, no.  Because psql and other utilities do the same
task just fine using the exact same interfaces.
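
For reference, here is a minimal sketch of the pattern those utilities
follow (the table name, column layout and test rows are made up for the
example; connection setup is assumed to have succeeded already). The point
is where the error actually becomes visible:

#include <stdio.h>
#include <libpq-fe.h>

/* Returns 0 on success, -1 if the COPY was rejected by the server.
 * Assumes a hypothetical table: CREATE TABLE mytab (id int, val text). */
static int
copy_rows(PGconn *conn)
{
    PGresult   *res;
    int         ok = 0;

    res = PQexec(conn, "COPY mytab FROM STDIN");
    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        fprintf(stderr, "COPY failed to start: %s", PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }
    PQclear(res);

    /* the second line deliberately contains bad data for the int column */
    PQputCopyData(conn, "1\tgood\n", 7);
    PQputCopyData(conn, "oops\tbad\n", 9);

    if (PQputCopyEnd(conn, NULL) != 1)
        ok = -1;

    /*
     * The crucial part: the outcome of the COPY is only reported by the
     * PGresult(s) fetched *after* PQputCopyEnd.  Checking only the earlier
     * return codes is what produces apparently "silent" failures.
     */
    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
        {
            fprintf(stderr, "COPY failed: %s", PQresultErrorMessage(res));
            ok = -1;
        }
        PQclear(res);
    }
    return ok;
}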

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql \d+ and oid display

2014-04-01 Thread Robert Haas
On Tue, Apr 1, 2014 at 11:42 AM, Bruce Momjian  wrote:
> On Tue, Apr  1, 2014 at 11:30:54AM -0400, Robert Haas wrote:
>> > OK, I have now applied the conditional display of "Replica Identity"
>> > patch (which is how it was originally coded anyway).  The attached patch
>> > matches Tom's suggestion of displaying the same OID text, just
>> > conditionally.
>> >
>> > Seeing psql \d+ will have a conditional display line in PG 9.4, making
>> > OIDs conditional seems to make sense.
>>
>> Frankly, I think this is all completely wrong-headed.  \d+ should
>> display *everything*.  That's what the + means, isn't it?  Coming up
>> with complex rules for which things get shown and which things get
>> hidden just makes the output harder to understand, without any
>> compensating benefit.
>
> Well, there are a lot of _other_ things we could display about the table
> that we don't.  Are you suggesting we add those too?  What about
> "Replica Identity"?  Should that always display?

In \d+, I think it absolutely should.

> The bottom line is we already have complex rules to display only what is
> _reasonable_.  If you want everything, you have to look at the system
> tables.

I don't really agree with that.  I understand that there's some
information (like dependencies) that you can't get through psql
because we don't really have a principled idea for what an interface
to that would look like, but I don't think that's a good thing.  Every
time I have to write a query by hand to get some information instead
of being able to get it through a backslash command, that slows me
down considerably.  But I'm lucky in that I actually know enough to do
that, which most users don't.  Information that you can't get through
\d+ just isn't available to a large percentage of our user base
without huge effort.  We shouldn't be stingy about putting stuff in
there that people may need to see.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql \d+ and oid display

2014-04-01 Thread Tom Lane
Bruce Momjian  writes:
> On Tue, Apr  1, 2014 at 11:30:54AM -0400, Robert Haas wrote:
>> Frankly, I think this is all completely wrong-headed.  \d+ should
>> display *everything*.  That's what the + means, isn't it?  Coming up
>> with complex rules for which things get shown and which things get
>> hidden just makes the output harder to understand, without any
>> compensating benefit.

> Well, there are a lot of _other_ things we could display about the table
> that we don't.  Are you suggesting we add those too?  What about
> "Replica Identity"?  Should that always display?

> The bottom line is we already have complex rules to display only what is
> _reasonable_.  If you want everything, you have to look at the system
> tables.

Yeah.  All of the \d commands are compromises between verbosity and
displaying all needful information, so I don't think that Robert's
proposed approach is particularly helpful.  It would only lead to
requests for \d-plus-one-half mode once people realized that "everything"
is too much.  (I'd rather go in the direction of \d++ yielding extra
info, if we ever decide to have more than two verbosity levels.)

I do think there's some merit to the argument about "it's been like
this for years, why change it?".  But if you reject backwards
compatibility as an overriding factor here, the currently-proposed
patch seems the sanest design to me.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql \d+ and oid display

2014-04-01 Thread Tom Lane
Robert Haas  writes:
> On Tue, Apr 1, 2014 at 11:42 AM, Bruce Momjian  wrote:
>> The bottom line is we already have complex rules to display only what is
>> _reasonable_.  If you want everything, you have to look at the system
>> tables.

> I don't really agree with that.  I understand that there's some
> information (like dependencies) that you can't get through psql
> because we don't really have a principled idea for what an interface
> to that would look like, but I don't think that's a good thing.  Every
> time I have to write a query by hand to get some information instead
> of being able to get it through a backslash command, that slows me
> down considerably.  But I'm lucky in that I actually know enough to do
> that, which most users don't.  Information that you can't get through
> \d+ just isn't available to a large percentage of our user base
> without huge effort.  We shouldn't be stingy about putting stuff in
> there that people may need to see.

At least in this particular case, that's an uninteresting argument.
We aren't being stingy with information, because the proposed new display
approach provides *exactly the same information* as before.  (If you see
the "Has OIDs" line, it's got OIDs, otherwise it doesn't.)  What we are
being stingy about is display clutter, and I believe that's a good thing.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PQputCopyData dont signal error

2014-04-01 Thread steve k
Thanks Robert, 

I'm already there.  Obviously I'm the only one in the room that didn't get
the memo.  I've had some time to reflect on what might be done differently,
just not any time to try it.  If I get it to work I'll let everyone know. 
The code I was working with went away when the Network admins pushed
something that forced me to reboot and close all my temp file windows last
Friday.  Sorry for any troubles I've caused you all and I didn't mean to put
everyone on the defensive.  

It has occurred to me that I may have been examining the wrong results set. 
One of the things you mentioned is "I often find it necessary to refer to
existing examples of code when trying to figure out how to do things
correctly".  I couldn't agree more.  Haven't seen one yet, but found plenty
of discussion that tap danced around one or more of the components of the
copy, put, end paradigm.  Maybe I should have just asked for a sample code
snippet but didn't after a day or so of frustration and trying to piece
together other people's incomplete samples.  It seems that none of the
discussion threads I looked at (that doesn't mean there aren't any - before
everyone gets worked up) where people were having similar questions ever
offered a working solution.  So I don't know if those folks gave up or
figured it out on their own.   In the end it comes down to how much time do
you have to google, read through a thread, find out that discussion thread
really has nothing to do with your topic of interest, repeat, finally try
something different, repeat?  Again, my apologies for lighting a fire under
everyone.  

Regards, 
Steve K. 



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/PQputCopyData-dont-signal-error-tp4302340p5798202.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-04-01 Thread Tomas Vondra
On 31.3.2014 21:04, Robert Haas wrote:
> On Thu, Mar 27, 2014 at 10:00 PM, Tomas Vondra  wrote:
>> The patch also does one more thing - it changes how the arrays (in the
>> aggregate state) grow. Originally it worked like this
>>
>> /* initial size */
>> astate->alen = 64;
>>
>> /* when full, grow exponentially */
>> if (astate->nelems >= astate->alen)
>> astate->alen *= 2;
>>
>> so the array length would grow like this 64 -> 128 -> 256 -> 512 ...
>> (note we're talking about elements, not bytes, so with 32-bit
>> integers it's actually 256B -> 512B -> 1024B -> ...).
>>
>> While I do understand the point of this (minimizing palloc overhead), I
>> find this pretty dangerous, especially in case of array_agg(). I've
>> modified the growth like this:
>>
>> /* initial size */
>> astate->alen = 4;
>>
>> /* when full, grow exponentially */
>> if (astate->nelems >= astate->alen)
>> astate->alen += 4;
>>
>> I admit that might be a bit too aggressive, and maybe there's a better
>> way to do this - with better balance between safety and speed. I was
>> thinking about something like this:
>>
>>
>> /* initial size */
>> astate->alen = 4;
>>
>> /* when full, grow exponentially */
>> if (astate->nelems >= astate->alen)
>> if (astate->alen < 128)
>> astate->alen *= 2;
>> else
>> astate->alen += 128;
>>
>> i.e. initial size with exponential growth, but capped at 128B.
> 
> So I think this kind of thing is very sensible, but the last time I
> suggested something similar, I got told "no":
> 
> http://www.postgresql.org/message-id/caeylb_wlght7yjlare9ppert5rkd5zjbb15te+kpgejgqko...@mail.gmail.com
> 
> But I think you're right and the objections previously raised are 
> wrong. I suspect that the point at which we should stop doubling is 
> higher than 128 elements, because that's only 8kB, which really
> isn't that big - and the idea that the resizing overhead takes only 
> amortized constant time is surely appealing. But I still think that 
> doubling *forever* is a bad idea, here and there. The fact that
> we've written the code that way in lots of places doesn't make it the
> right algorithm.

I've been thinking about it a bit more and maybe the doubling is not
that bad an idea, after all. What I'd like to see is a solution that
"wastes" less than some known fraction of the allocated memory, and
apparently that's what doubling does ...

Let's assume we have many buffers (arrays in array_agg), allocated in
this manner. Let's assume the buffers are independent, i.e. the doubling
is not somehow "synchronized" for the buffers.

Now, at an arbitrary time the buffers should be ~75% full on average. There
will be buffers that were just doubled (50% full), buffers that will be
doubled soon (100% full) and buffers somewhere in between. But on
average the buffers should be 75% full. That means we're "wasting" 25% of
the memory on average, which seems quite acceptable to me. We could probably use a
different growth rate (say 1.5x, resulting in 12.5% memory being
"wasted"), but I don't see this as the main problem (and I won't fight
for this part of array_agg patch).
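
(A quick sanity check of that 75% figure: if a buffer's element count is
uniformly distributed between the just-doubled point N/2 and the point N at
which it doubles again, the expected fill fraction is

    E[\mathrm{fill}] = \frac{2}{N} \int_{N/2}^{N} \frac{x}{N}\,dx = \frac{3}{4}

i.e. 75% live data and 25% slack on average, under that uniformity
assumption.)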

The "current" array_agg however violates some of the assumptions
mentioned above, because it

(1) pre-allocates quite a large number of items (64) at the beginning,
resulting in ~98% of memory being "wasted" initially

(2) allocates one memory context per group, with 8kB initial size, so
you're actually wasting ~99.999% of the memory

(3) thanks to the dedicated memory contexts, the doubling is pretty
much pointless up until you cross the 8kB boundary

IMNSHO these are the issues we really should fix - by lowering the
initial element count (64->4) and using a single memory context.
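
To make that concrete, the first-call branch of accumArrayResult() might end
up looking something like this (just a sketch of the idea, not the actual
patch - the field names are the existing ArrayBuildState ones):

if (astate == NULL)
{
    /*
     * First time through: put the build state directly into the caller's
     * context instead of creating a dedicated 8kB child context, and start
     * with a small array.
     */
    astate = (ArrayBuildState *)
        MemoryContextAlloc(rcontext, sizeof(ArrayBuildState));
    astate->mcontext = rcontext;
    astate->alen = 4;           /* small initial allocation */
    astate->nelems = 0;
    astate->dvalues = (Datum *)
        MemoryContextAlloc(rcontext, astate->alen * sizeof(Datum));
    astate->dnulls = (bool *)
        MemoryContextAlloc(rcontext, astate->alen * sizeof(bool));
    /* element type information is filled in as before */
}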

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC proposal - "make an unlogged table logged"

2014-04-01 Thread Fabrízio de Royes Mello
On Fri, Mar 7, 2014 at 12:36 AM, Tom Lane  wrote:

> Fabrízio de Royes Mello  writes:
> > Do you think it is difficult to implement "ALTER TABLE ... SET UNLOGGED"
> too?
> > Thinking in a scope of one GSoC, of course.
>
> I think it's basically the same thing.  You might hope to optimize it;
> but you have to create (rather than remove) an init fork, and there's
> no way to do that in exact sync with the commit.  So for safety I think
> you have to copy the data into a new relfilenode.
>
>
Hi all,

In the GSoC proposal page [1] I received some suggestions to strech goals:

* "ALTER TABLE name SET UNLOGGED". This is essentially the reverse of the
core proposal, which is "ALTER TABLE name SET LOGGED". Yes, I think that
should definitely be included. It would be weird to have SET LOGGED but not
SET UNLOGGED.

* Allow unlogged indexes on logged tables.

* Implement "ALTER TABLE name SET LOGGED" without rewriting the whole
table, when wal_level = minimal.

* Allow unlogged materialized views.

Comments?


[1]
http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/fabriziomello/5629499534213120

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: 
>> http://frabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] GSoC proposal - "make an unlogged table logged"

2014-04-01 Thread Andres Freund
On 2014-04-01 13:37:57 -0300, Fabrízio de Royes Mello wrote:
> In the GSoC proposal page [1] I received some suggestions to strech goals:
> 
> * "ALTER TABLE name SET UNLOGGED". This is essentially the reverse of the
> core proposal, which is "ALTER TABLE name SET LOGGED". Yes, I think that
> should definitely be included. It would be weird to have SET LOGGED but not
> SET UNLOGGED.

Yes, that makes sense.

> * Allow unlogged indexes on logged tables.

I don't think it's realistic to build the infrastructure necessary for
that as part of gsoc. The reasons have been explained somewhere in this
thread.

> * Implement "ALTER TABLE name SET LOGGED" without rewriting the whole
> table, when wal_level = minimal.

Yea, maybe.

> * Allow unlogged materialized views.

I don't think that's realistic either.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using arrays within structure in ECPG

2014-04-01 Thread Michael Meskes
Hi Ashutosh,

> I tried to fix the offset problem. PFA the patch. It does solve the
> problem of setting wrong offset in ECPGdo() call.

Thanks, looks correct to me.

> But then there is problem of interpreting the result from server as an
> array within array of structure. The problem is there is in
> ecpg_get_data(). This function can not understand that the "field" is an
> array of integers (or for that matter array of anything) and store all
> the values in contiguous memory at the given address.

I guess I know where that comes from, without actually looking at the
code, though. Nested arrays are not supported by ecpg and the
precompiler spits out an error message, just check preproc/type.c.
However, in your example you have the struct essentially sandwiched
between the arrays, so the (too) simple check in that file doesn't
notice it, but the implementation is nevertheless lacking.

I'm sorry, but this sounds like a missing feature bug.
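
For readers following along, the problematic shape is roughly this kind of
declaration (a made-up example, not Ashutosh's original test case):

EXEC SQL BEGIN DECLARE SECTION;
struct rec
{
    int     id;
    int     vals[4];        /* array nested inside an array of structs */
} rows[10];
EXEC SQL END DECLARE SECTION;

EXEC SQL SELECT id, vals INTO :rows FROM tab;

The precompiler's nested-array check doesn't trigger because the struct sits
between the two array levels, but ecpg_get_data() still has no way to lay
the inner array out correctly in memory.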

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
Jabber: michael.meskes at gmail dot com
VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] get_fn_expr_variadic considered harmful

2014-04-01 Thread Tom Lane
In bug #9817 there's a complaint that the planner fails to consider
these expressions equivalent:
  foo('a'::text, 'b'::text)
  foo(variadic array['a'::text, 'b'::text])
when foo() is declared as taking variadic text[].

Such cases worked okay before 9.3, the reason being that the use of
VARIADIC disappeared after parsing, so that the two calls were in fact
identical.  However, once we added FuncExpr.funcvariadic, they're not
identical anymore.

I thought for a bit that we could fix this easily by having equal()
disregard funcvariadic; there is precedent for that, since it ignores
other fields that are just for display purposes and have no semantic
impact, such as CoercionForm and location.

Unfortunately, funcvariadic *does* have semantic impact on a few
functions that use get_fn_expr_variadic, such as format().  Since
the planner has no good way to know which ones those are, it cannot
safely ignore funcvariadic while matching expressions.
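
(To illustrate the semantic impact: a VARIADIC "any" function such as
format() does roughly this at run time - a simplified sketch of the pattern,
not the exact code:

    if (get_fn_expr_variadic(fcinfo->flinfo))
    {
        /* the last argument is an array holding the variadic elements */
    }
    else
    {
        /* each variadic element arrived as a separate argument */
    }

so two calls that differ only in funcvariadic really can behave differently
for such functions.)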

In short, commit 75b39e790 broke this rather badly, and I don't see
any easy way out.

We could possibly salvage something by redefining funcvariadic as only
being true if VARIADIC was used *and* the function is VARIADIC ANY,
so that it returns to not being different for semantically-equivalent
cases.  This would be a bit messy, since it would not un-break the
behavior for any already stored rules or indexes in 9.3 databases.
But I'm not sure there is any good way to make the problem magically
go away in 9.3 databases.

Thoughts?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-04-01 Thread Alvaro Herrera
How much of this problem can be attributed by the fact that repalloc has
to copy the data from the old array into the new one?  If it's large,
perhaps we could solve it by replicating the trick we use for
InvalidationChunk.  It'd be a bit messy, but the mess would be pretty
well contained, I think.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-04-01 Thread Tom Lane
Tomas Vondra  writes:
> I've been thinking about it a bit more and maybe the doubling is not
> that bad idea, after all.

It is not.  There's a reason why that's our standard behavior.

> The "current" array_agg however violates some of the assumptions
> mentioned above, because it
> (1) pre-allocates quite large number of items (64) at the beginning,
> resulting in ~98% of memory being "wasted" initially
> (2) allocates one memory context per group, with 8kB initial size, so
> you're actually wasting ~99.999% of the memory
> (3) thanks to the dedicated memory contexts, the doubling is pretty
> much pointless up until you cross the 8kB boundary

> IMNSHO these are the issues we really should fix - by lowering the
> initial element count (64->4) and using a single memory context.

The real issue here is that all those decisions are perfectly reasonable
if you expect that a large number of values will get aggregated --- and
even if you don't expect that, they're cheap insurance in simple cases.
It only gets to be a problem if you have a lot of concurrent executions
of array_agg, such as in a grouped-aggregate query.  You're essentially
arguing that in the grouped-aggregate case, it's better to optimize on
the assumption that only a very small number of values will get aggregated
(per hash table entry) --- which is possibly reasonable, but the argument
that it's okay to pessimize the behavior for other cases seems pretty
flimsy from here.

Actually, though, the patch as given outright breaks things for both the
grouped and ungrouped cases, because the aggregate no longer releases
memory when it's done.  That's going to result in memory bloat not
savings, in any situation where the aggregate is executed repeatedly.

I think a patch that stood a chance of getting committed would need to
detect whether the aggregate was being called in simple or grouped
contexts, and apply different behaviors in the two cases.  And you
can't just remove the sub-context without providing some substitute
cleanup mechanism.  Possibly you could keep the context but give it
some much-more-miserly allocation parameters in the grouped case.
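
For instance, accumArrayResult() currently creates its per-group context
with the default allocation parameters; in the grouped case it could
plausibly be created with the small ones instead (a sketch, not a worked-out
patch):

    arr_context = AllocSetContextCreate(rcontext,
                                        "accumArrayResult",
                                        ALLOCSET_SMALL_MINSIZE,
                                        ALLOCSET_SMALL_INITSIZE,
                                        ALLOCSET_SMALL_MAXSIZE);

which keeps the cleanup semantics intact while starting from a 1kB block
rather than 8kB.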

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql \d+ and oid display

2014-04-01 Thread Robert Haas
On Tue, Apr 1, 2014 at 12:09 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Apr 1, 2014 at 11:42 AM, Bruce Momjian  wrote:
>>> The bottom line is we already have complex rules to display only what is
>>> _reasonable_.  If you want everything, you have to look at the system
>>> tables.
>
>> I don't really agree with that.  I understand that there's some
>> information (like dependencies) that you can't get through psql
>> because we don't really have a principled idea for what an interface
>> to that would look like, but I don't think that's a good thing.  Every
>> time I have to write a query by hand to get some information instead
>> of being able to get it through a backslash command, that slows me
>> down considerably.  But I'm lucky in that I actually know enough to do
>> that, which most users don't.  Information that you can't get through
>> \d+ just isn't available to a large percentage of our user base
>> without huge effort.  We shouldn't be stingy about putting stuff in
>> there that people may need to see.
>
> At least in this particular case, that's an uninteresting argument.
> We aren't being stingy with information, because the proposed new display
> approach provides *exactly the same information* as before.  (If you see
> the "Has OIDs" line, it's got OIDs, otherwise it doesn't.)  What we are
> being stingy about is display clutter, and I believe that's a good thing.

Although I agree with the general principle, I'm skeptical in this
case.  There are a bunch of table-level options, and I don't think
it's very reasonable to expect that users are going to remember which
ones are going to be displayed under which conditions, especially if
we change it from release to release.  If somebody doesn't see the
"has OIDs" line, are they going to conclude that the table doesn't
have OIDs, or are they going to conclude that psql doesn't ever
display that information and they need to query pg_class manually?
I'm sure at least some people will guess wrong.

Now, admittedly, this is not the hill I want to die on.  The future of
PostgreSQL doesn't rest on whatever ends up happening here.  But I
think what's going on on this thread is a lot of tinkering with stuff
that's not really broken.  I'm not saying "don't ever change psql
output".  What I'm saying is that changing psql output that is
absolutely fine the way it is does not represent meaningful progress.
The "replica identity" and "has OIDs" lines are a negligible
percentage of what \d+ spits out - in a test I just did, 2 out of 37
lines on \d+ pg_class.  I can't accept that tinkering with that is
reducing clutter in any meaningful way; it's just change for the sake
of change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC proposal - "make an unlogged table logged"

2014-04-01 Thread Heikki Linnakangas

On 03/07/2014 05:36 AM, Tom Lane wrote:

Fabrízio de Royes Mello  writes:

Do you think it is difficult to implement "ALTER TABLE ... SET UNLOGGED" too?
Thinking in a scope of one GSoC, of course.


I think it's basically the same thing.  You might hope to optimize it;
but you have to create (rather than remove) an init fork, and there's
no way to do that in exact sync with the commit.


You just have to include that information with the commit WAL record, no?

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql \d+ and oid display

2014-04-01 Thread Andres Freund
On 2014-04-01 13:36:02 -0400, Robert Haas wrote:
> I can't accept that tinkering with that is
> reducing clutter in any meaningful way; it's just change for the sake
> of change.

+1

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] psql \d+ and oid display

2014-04-01 Thread Jeff Janes
On Tue, Apr 1, 2014 at 8:42 AM, Bruce Momjian  wrote:

> On Tue, Apr  1, 2014 at 11:30:54AM -0400, Robert Haas wrote:
> > > OK, I have now applied the conditional display of "Replica Identity"
> > > patch (which is how it was originally coded anyway).  The attached
> patch
> > > matches Tom's suggestion of displaying the same OID text, just
> > > conditionally.
> > >
> > > Seeing psql \d+ will have a conditional display line in PG 9.4, making
> > > OIDs conditional seems to make sense.
> >
> > Frankly, I think this is all completely wrong-headed.  \d+ should
> > display *everything*.  That's what the + means, isn't it?  Coming up
> > with complex rules for which things get shown and which things get
> > hidden just makes the output harder to understand, without any
> > compensating benefit.
>
> Well, there are a lot of _other_ things we could display about the table
> that we don't.  Are you suggesting we add those too?


I am suggesting it for at least some other things.  I'm rather aggrieved
that "\d+" without argument shows you the size and the description/comment
for every table, but "\d+ foo" does not show you the size and
description/comment of the specific table you just asked for.

Not so aggrieved that I wrote and submitted a patch, you might notice; but
I'll get to it eventually if no one beats me to it.

Cheers,

Jeff


Re: [HACKERS] Patch to add support of "IF NOT EXISTS" to others "CREATE" statements

2014-04-01 Thread Robert Haas
On Tue, Apr 1, 2014 at 10:03 AM, Tom Lane  wrote:
> I'm willing to bend that to the extent of saying that COR leaves in place
> subsidiary properties that you might add *with additional statements* ---
> for example, foreign keys for a table, or privilege grants for a role.
> But the properties of the role itself have to be predictable from the COR
> statement, or it's useless.

+1.

>> Where this is a bit more interesting is in the case of sequences, where
>> resetting the sequence to zero may cause further inserts into an
>> existing table to fail.
>
> Yeah.  Sequences do have contained data, which makes COR harder to define
> --- that's part of the reason why we have CINE not COR for tables, and
> maybe we have to do the same for sequences.  The point being exactly
> that if you use CINE, you're implicitly accepting that you don't know
> the ensuing state fully.

Yeah.  I think CINE is more sensible than COR for sequences, for
precisely the reason that they do have contained data (even if it's
basically only one value).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PQputCopyData dont signal error

2014-04-01 Thread Robert Haas
On Tue, Apr 1, 2014 at 12:14 PM, steve k  wrote:
> I'm already there.  Obviously I'm the only one in the room that didn't get
> the memo.  I've had some time to reflect on what might be done differently,
> just not any time to try it.  If I get it to work I'll let everyone know.
> The code I was working with went away when the Network admins pushed
> something that forced me to reboot and close all my temp file windows last
> Friday.  Sorry for any troubles I've caused you all and I didn't mean to put
> everyone on the defensive.

No problem.

> It has occurred to me that I may have been examining the wrong results set.

That definitely seems possible.  It is easier than it should be to mess
that up; there's really nothing in the API to warn you if you've made
that mistake.  And I've been there myself.

> One of the things you mentioned is "I often find it necessary to refer to
> existing examples of code when trying to figure out how to do things
> correctly".  I couldn't agree more.  Haven't seen one yet, but found plenty
> of discussion that tap danced around one or more of the components of the
> copy, put, end paradigm.  Maybe I should have just asked for a sample code
> snippet but didn't after a day or so of frustration and trying to piece
> together other people's incomplete samples.

FWIW, I've generally found that the best examples are what's in the
core distribution.  I'd go and look at a tool like psql or pg_restore
and find the code that handles this, and then copy it and cut it down
to what you need.  You could go around looking for other snippets on
the Internet that are more self-contained, but there's too much chance
that they're actually wrong.  The code that implements the existing
core tools is more likely to be good code - not that it can never have
any bugs, but it gets a lot of exercise in real-world deployments, so
if something really obvious like error detection is broken then we can
be pretty sure a user will complain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch to add support of "IF NOT EXISTS" to others "CREATE" statements

2014-04-01 Thread Fabrízio de Royes Mello
On Tue, Apr 1, 2014 at 2:46 PM, Robert Haas  wrote:

> On Tue, Apr 1, 2014 at 10:03 AM, Tom Lane  wrote:
> > I'm willing to bend that to the extent of saying that COR leaves in place
> > subsidiary properties that you might add *with additional statements* ---
> > for example, foreign keys for a table, or privilege grants for a role.
> > But the properties of the role itself have to be predictable from the COR
> > statement, or it's useless.
>
> +1.
>
> >> Where this is a bit more interesting is in the case of sequences, where
> >> resetting the sequence to zero may cause further inserts into an
> >> existing table to fail.
> >
> > Yeah.  Sequences do have contained data, which makes COR harder to define
> > --- that's part of the reason why we have CINE not COR for tables, and
> > maybe we have to do the same for sequences.  The point being exactly
> > that if you use CINE, you're implicitly accepting that you don't know
> > the ensuing state fully.
>
> Yeah.  I think CINE is more sensible than COR for sequences, for
> precisely the reason that they do have contained data (even if it's
> basically only one value).
>
>
Well then, I'll separate CINE for sequences from the previously rejected
patch... is this material for 9.5?

Regards,

-- 
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] GSoC proposal - "make an unlogged table logged"

2014-04-01 Thread Jim Nasby

On 3/4/14, 8:50 AM, Andres Freund wrote:

Can't that be solved by just creating the permanent relation in a new
relfilenode? That's equivalent to a rewrite, yes, but we need to do that
for anything but wal_level=minimal anyway.


Maybe I'm missing something, but doesn't this actually involve writing the data 
twice? Once into WAL and again into the relation itself?
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inheritance of foreign key constraints.

2014-04-01 Thread Robert Haas
On Tue, Apr 1, 2014 at 8:13 AM, Andrzej Mazurkiewicz
 wrote:
> That change is necessary to reduce the scope of modifications necessary for an
> implementation of the inheritance of foreign key constraints, particularly for
> the removal of objects.

Nobody here is going to accept that goal as a valid reason to set the
dependency type to the wrong value.  The value we assign for the
dependency type has important user-visible semantics which we are not
going to break for the purpose of making some feature simpler to
implement.  Of course, PostgreSQL is open source, so you can change
your own copy however you like.  But such modifications won't be
accepted here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] GSoC proposal - "make an unlogged table logged"

2014-04-01 Thread Andres Freund
On 2014-04-01 12:56:04 -0500, Jim Nasby wrote:
> On 3/4/14, 8:50 AM, Andres Freund wrote:
> >Can't that be solved by just creating the permanent relation in a new
> >relfilenode? That's equivalent to a rewrite, yes, but we need to do that
> >for anything but wal_level=minimal anyway.
> 
> Maybe I'm missing something, but doesn't this actually involve writing the 
> data twice? Once into WAL and again into the relation itself?

Yes. But as I said, that's unavoidable for anything but
wal_level=minimal. If somebody wants to put in the additional nontrivial
work to make it work faster with wal_level=minimal, they can do so. But
the other case is more general and needs to be done anyway.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_fn_expr_variadic considered harmful

2014-04-01 Thread Robert Haas
On Tue, Apr 1, 2014 at 12:52 PM, Tom Lane  wrote:
> In bug #9817 there's a complaint that the planner fails to consider
> these expressions equivalent:
>   foo('a'::text, 'b'::text)
>   foo(variadic array['a'::text, 'b'::text])
> when foo() is declared as taking variadic text[].

My first reaction to this was "who cares? after all, the user should
just write the expression the same way both times and then they won't
have this problem".  But after going and looking at the bug report I
see that the user wrote it the first way consistently, but pg_dump
blithely rewrote it to the second way.  I'm disinclined to view that
as a planner problem; it seems to me to be a pg_dump or ruleutils bug.
 If those two things don't have the same parse representation, then
pg_dump has no business treating them as equivalent - even if we were
to put enough smarts into the planner to paper over that
non-equivalence.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-04-01 Thread Peter Geoghegan
On Tue, Apr 1, 2014 at 7:25 AM, Robert Haas  wrote:
> There's a risk of adding not
> only CPU cycles but also clutter.  If we do things that encourage
> people to crank the log verbosity down, I think that's going to be bad
> more often than it's good.

While I share your concern here, I think that this is something that
is only likely to be seen in an interactive psql session, where it is
seen quite frequently. I am reasonably confident that it's highly
unusual to see ERRCODE_UNDEFINED_COLUMN in other settings. Not having
to do a mental context switch when writing an ad-hoc query has
considerable value. Even C compilers like Clang have this kind of
feedback.   This is a patch that was written out of personal
frustration with the experience of interacting with many different
databases. Things like the Python REPL don't do so much of this kind
of thing, but presumably that's because of Python's dynamic typing.
This is a HINT that can be given with fairly high confidence that
it'll be helpful - there just won't be that many things that the user
could have meant to choose from. I think it's even useful when the
suggested column is distant from the original suggestion (i.e.
errorMissingColumn() offers only what is clearly a "wild guess"),
because then the user knows that he or she has got it quite wrong.
Frequently, this will be because the wrong synonym for what should
have been written was used.

> It strains credulity to think that this
> patch alone would have that effect, but there might be quite a few
> similar improvements that are possible.  So I think it would be good
> to consider how far we want to go in this direction and where we think
> we might want to stop.  That's not to say, let's not ever do this,
> just, let's think carefully about where we want to end up.

Fair enough.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] four minor proposals for 9.5

2014-04-01 Thread Pavel Stehule
2014-03-27 17:56 GMT+01:00 Pavel Stehule :

> Hello
>
> After week, I can to evaluate a community reflection:
>
>
> 2014-03-19 16:34 GMT+01:00 Pavel Stehule :
>
> Hello
>>
>> I wrote a few patches, that we use in our production. These patches are
>> small, but I hope, so its can be interesting for upstream:
>>
>> 1. cancel time - we log a execution time cancelled statements
>>
>
> there is a interest
>
>
>>
>> 2. fatal verbose - this patch ensure a verbose log for fatal errors. It
>> simplify a investigation about reasons of error.
>>
>
> not too much
>
>
>>
>> 3. relation limit - possibility to set session limit for maximum size of
>> relations. Any relation cannot be extended over this limit in session, when
>> this value is higher than zero. Motivation - we use lot of queries like
>> CREATE TABLE AS SELECT .. , and some very big results decreased a disk free
>> space too much. It was high risk in our multi user environment. Motivation
>> is similar like temp_files_limit.
>>
>
> is not a interest
>
>
>>
>> 4. track statement lock - we are able to track a locking time for query
>> and print this data in slow query log and auto_explain log. It help to us
>> with lather slow query log analysis.
>>
>
> there is a interest
>
> So I'll prepare a some prototypes in next month for
>
> 1. log a execution time for cancelled queries,
> 2. track a query lock time
>
>
When I thought about this proposal, I realized that our implementation is
more or less a hack that works well in GoodData, but can be inconsistent
with native PostgreSQL. So in this proposal I plan something different from
what we use, which I hope is more consistent with upstream.

So I miss the execution time for cancelled queries. The same information can
be interesting for queries that were stopped for other reasons - temp file
limits can stop queries after 5 minutes, out of memory, etc.

It is not hard to implement printing the duration for cancelled queries, but
it is impossible to do it for every kind of exception. There is a way,
though: we can use the "log line prefix" space. Right now there is no
possibility to print the duration there, so we can introduce a new symbol %D
for duration. The same technique I would use for printing the lock time - it
can be printed via the symbol %L.

so a log entry would look like

timestamp, duration, locktime ERROR, query was cancelled 

Anybody can activate or deactivate these features by using %D or %L in
log_line_prefix
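
To illustrate (hypothetical - neither symbol exists today):

    log_line_prefix = '%t [%p] %D %L '

with which a cancelled query might be logged roughly as

    2014-04-01 12:00:00 CEST [12345] 305.123 ms 12.456 ms ERROR:  canceling statement due to user request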

Comments, notes?

Regards

Pavel





Regards
>
> Pavel
>
>
>>
>> Do you thinking so  these patches can be generally useful?
>>
>> Regards
>>
>> Pavel
>>
>
>


Re: [HACKERS] get_fn_expr_variadic considered harmful

2014-04-01 Thread Tom Lane
Robert Haas  writes:
> On Tue, Apr 1, 2014 at 12:52 PM, Tom Lane  wrote:
>> In bug #9817 there's a complaint that the planner fails to consider
>> these expressions equivalent:
>> foo('a'::text, 'b'::text)
>> foo(variadic array['a'::text, 'b'::text])
>> when foo() is declared as taking variadic text[].

> My first reaction to this was "who cares? after all, the user should
> just write the expression the same way both times and then they won't
> have this problem".  But after going and looking at the bug report I
> see that the user wrote it the first way consistently, but pg_dump
> blithely rewrote it to the second way.  I'm disinclined to view that
> as a planner problem; it seems to me to be a pg_dump or ruleutils bug.
>  If those two things don't have the same parse representation, then
> pg_dump has no business treating them as equivalent - even if we were
> to put enough smarts into the planner to paper over that
> non-equivalence.

The point is that they *were* equivalent before 9.3, and so ruleutils
was entirely within its rights to not worry about which way it dumped
the expression; indeed, it couldn't, because the information was not
there as to which way the call had been written originally.  I do not
think it's appropriate to blame ruleutils for taking advantage of this
equivalence, because more than likely user applications have too.

Or in other words, what I wrote above is a more general statement of the
problem than what was complained of in bug #9817 ... but if we just hack
ruleutils to dump the cases differently, we will fail to fix the more
general problem.  So we can still expect future bug reports about that,
because it worked as-expected for years before 9.3.

There's also the point that even if we changed ruleutils' behavior
now, this would not fix existing dump files that have considered the
two forms interchangeable ever since VARIADIC existed.  And we
generally try hard to not break existing dump files.  To be even
more to the point: what you propose is incapable of fixing the precise
problem stated in the bug report, because it's complaining about a
dump taken from 9.1, and there is *no* way to make 9.1 produce a
dump that only uses VARIADIC if the original call did.  It hasn't
got the information.  Even using a newer version of pg_dump wouldn't
help that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_fn_expr_variadic considered harmful

2014-04-01 Thread Andres Freund
On 2014-04-01 12:52:54 -0400, Tom Lane wrote:
> We could possibly salvage something by redefining funcvariadic as only
> being true if VARIADIC was used *and* the function is VARIADIC ANY,
> so that it returns to not being different for semantically-equivalent
> cases.  This would be a bit messy, since it would not un-break the
> behavior for any already stored rules or indexes in 9.3 databases.
> But I'm not sure there is any good way to make the problem magically
> go away in 9.3 databases.

It's pretty damn ugly, but if we're going for magic around those
edges, we could just force the new behaviour in readfuncs.c. IIUC all
the necessary data for it is there.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_fn_expr_variadic considered harmful

2014-04-01 Thread Tom Lane
Andres Freund  writes:
> On 2014-04-01 12:52:54 -0400, Tom Lane wrote:
>> We could possibly salvage something by redefining funcvariadic as only
>> being true if VARIADIC was used *and* the function is VARIADIC ANY,
>> so that it returns to not being different for semantically-equivalent
>> cases.  This would be a bit messy, since it would not un-break the
>> behavior for any already stored rules or indexes in 9.3 databases.
>> But I'm not sure there is any good way to make the problem magically
>> go away in 9.3 databases.

> It's pretty damn ugly, but if we're going for magic in around those
> edges, we could just force the new behaviour in readfuncs.c. IIUC all
> the neccessary data for it is there.

I don't want either readfuncs or equalfuncs going in for catalog lookups,
which is what they'd have to do to fix it at that level (the key point
being they'd have to find out whether the called function is declared as
VARIADIC ANY).  Too much risk of unpleasant side effects if we do that.
The parser, on the other hand, has ready access to the function's
parameter list when building a FuncExpr.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-04-01 Thread Tomas Vondra
On 1.4.2014 19:08, Tom Lane wrote:
> Tomas Vondra  writes:
>> I've been thinking about it a bit more and maybe the doubling is not
>> that bad idea, after all.
> 
> It is not.  There's a reason why that's our standard behavior.
> 
>> The "current" array_agg however violates some of the assumptions
>> mentioned above, because it
>> (1) pre-allocates quite large number of items (64) at the beginning,
>> resulting in ~98% of memory being "wasted" initially
>> (2) allocates one memory context per group, with 8kB initial size, so
>> you're actually wasting ~99.999% of the memory
>> (3) thanks to the dedicated memory contexts, the doubling is pretty
>> much pointless up until you cross the 8kB boundary
> 
>> IMNSHO these are the issues we really should fix - by lowering the
>> initial element count (64->4) and using a single memory context.
> 
> The real issue here is that all those decisions are perfectly
> reasonable if you expect that a large number of values will get
> aggregated --- and even if you don't expect that, they're cheap
> insurance in simple cases.

Yes, if you expect a large number of values it's perfectly valid. But
what if those assumptions are faulty? Is it OK to fail because of OOM
even for trivial queries breaking those assumptions?

I'd like to improve that and make this work without impacting the
queries that match the assumptions.

> It only gets to be a problem if you have a lot of concurrent
> executions of array_agg, such as in a grouped-aggregate query. You're
> essentially arguing that in the grouped-aggregate case, it's better
> to optimize on the assumption that only a very small number of values
> will get aggregated (per hash table entry) --- which is possibly
> reasonable, but the argument that it's okay to pessimize the behavior
> for other cases seems pretty flimsy from here.

I'm not saying it's okay to pessimize the behavior of other cases. I
admit decreasing the initial size from 64 to only 4 items may be too
aggressive - let's measure the difference and tweak the number
accordingly. Heck, even 64 items is way lower than the 8kB utilized by
each per-group memory context right now.


> Actually, though, the patch as given outright breaks things for both
> the grouped and ungrouped cases, because the aggregate no longer
> releases memory when it's done. That's going to result in memory
> bloat not savings, in any situation where the aggregate is executed
> repeatedly.

Really? Can you provide a query for which the current and patched code
behave differently?

Looking at array_agg_finalfn (which is the final function for
array_agg), I see it does this:

/*
 * Make the result.  We cannot release the ArrayBuildState because
 * sometimes aggregate final functions are re-executed.  Rather, it
 * is nodeAgg.c's responsibility to reset the aggcontext when it's
 * safe to do so.
 */
result = makeMdArrayResult(state, 1, dims, lbs,
   CurrentMemoryContext,
   false);

i.e. it sets release=false. So I fail to see how the current code
behaves differently from the patch. If it wasn't releasing the memory
before, it's not releasing it now either.

In both cases the memory gets released when the aggcontext gets released
in nodeAgg.c (as explained by the comment in the code).

However, after looking at the code now, I think it's actually wrong to
remove the MemoryContextDelete from makeMdArrayResult(). It does not
make any difference to array_agg (which sets release=false anyway),
but it makes a difference to functions calling makeArrayResult(), as that
uses release=true. That however is not called by aggregate functions,
but from regexp_split_to_array, xpath and subplans.

> I think a patch that stood a chance of getting committed would need to
> detect whether the aggregate was being called in simple or grouped
> contexts, and apply different behaviors in the two cases.  And you
> can't just remove the sub-context without providing some substitute
> cleanup mechanism.  Possibly you could keep the context but give it
> some much-more-miserly allocation parameters in the grouped case.

I don't think the patch removes any cleanup mechanism (see above), but
maybe I'm wrong.

Yes, tweaking the parameters depending on the aggregate - whether it's
simple or grouped, or maybe an estimate number of elements in a group -
seems like a good idea.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-04-01 Thread Tom Lane
Tomas Vondra  writes:
> On 1.4.2014 19:08, Tom Lane wrote:
>> Actually, though, the patch as given outright breaks things for both
>> the grouped and ungrouped cases, because the aggregate no longer
>> releases memory when it's done. That's going to result in memory
>> bloat not savings, in any situation where the aggregate is executed
>> repeatedly.

> Looking at array_agg_finalfn (which is the final function for
> array_agg), I see it does this:

> /*
>  * Make the result.  We cannot release the ArrayBuildState because
>  * sometimes aggregate final functions are re-executed.  Rather, it
>  * is nodeAgg.c's responsibility to reset the aggcontext when it's
>  * safe to do so.
>  */
> result = makeMdArrayResult(state, 1, dims, lbs,
>CurrentMemoryContext,
>false);

> i.e. it sets release=false. So I fail to see how the current code
> behaves differently from the patch?

You're conveniently ignoring the callers that set release=true.
Reverse engineering a query that exhibits memory bloat is left
as an exercise for the reader (but in a quick look, I'll bet
ARRAY_SUBLINK subplans are one locus for problems).

It's possible that it'd work to use a subcontext only if release=true;
I've not dug through the code enough to convince myself of that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: decreasing memory needlessly consumed by array_agg

2014-04-01 Thread Tomas Vondra
On 1.4.2014 20:56, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 1.4.2014 19:08, Tom Lane wrote:
> You're conveniently ignoring the callers that set release=true.
> Reverse engineering a query that exhibits memory bloat is left
> as an exercise for the reader (but in a quick look, I'll bet
> ARRAY_SUBLINK subplans are one locus for problems).

No, I'm not. I explicitly mentioned those cases (although you're right I
concentrated mostly on cases with release=false, because of array_agg).

> It's possible that it'd work to use a subcontext only if
> release=true; I've not dug through the code enough to convince myself
> of that.

Maybe, though 'release' is not available in makeArrayResult(), which is
where the memory context needs to be decided. So all the callers would
need to be modified to supply this parameter. But there are only ~15 places
where makeArrayResult is called.
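
One way to thread that knowledge through (a hypothetical signature, just to
show the shape of the change - the subcontext is created on the first
accumArrayResult() call, so that is where the flag would have to arrive):

    extern ArrayBuildState *accumArrayResult(ArrayBuildState *astate,
                                             Datum dvalue, bool disnull,
                                             Oid element_type,
                                             MemoryContext rcontext,
                                             bool subcontext);   /* new */

Callers that eventually release the build state would pass subcontext=true
and keep today's behaviour; array_agg would pass false.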

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] json/jsonb/hstore operator precedence

2014-04-01 Thread Jim Nasby

On 3/18/14, 12:13 PM, Greg Stark wrote:

Fwiw I'm finding myself repeatedly caught up by the operator
precedence rules when experimenting with jsonb:

stark=***# select  segment->'id' as id from flight_segments where
segment->>'marketing_airline_code' <>
segment->>'operating_airline_code' ;
ERROR:  42883: operator does not exist: text <> jsonb
LINE 2: ...segments where segment->>'marketing_airline_code' <> segment...
  ^
HINT:  No operator matches the given name and argument type(s). You
might need to add explicit type casts.
LOCATION:  op_error, parse_oper.c:722
Time: 0.407 ms
stark=***# select  segment->'id' as id from flight_segments where
(segment->>'marketing_airline_code') <>
(segment->>'operating_airline_code') ;
  id
-
  "45866185"
  "95575359"


I don't think this is related to the jsonb patch -- json and hstore
have the same behaviour so jsonb is obviously going to follow suit.
The only option right now would be to use a higher precedence operator
like % or ^ for all of these data types which I'm not for. I suspect
it's a pipe dream to think we might be able to override the '.' and
changing the precedence of -> and ->> would be fraught...

I think the best we can do is to highlight it in the docs.

Incidentally it's a good thing there wasn't an implicit cast
text->jsonb. In this case it would have resulted in just a confusing
error of jsonb->>boolean not existing.


Wow, that really sucks. :(

What are cases where things would break if we changed the precedence of -> and 
->>? ISTM that's what we really should do if there's some way to manage the 
backwards compatibility...
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] json/jsonb/hstore operator precedence

2014-04-01 Thread Andrew Dunstan


On 04/01/2014 03:40 PM, Jim Nasby wrote:

On 3/18/14, 12:13 PM, Greg Stark wrote:

Fwiw I'm finding myself repeatedly caught up by the operator
precedence rules when experimenting with jsonb:

stark=***# select  segment->'id' as id from flight_segments where
segment->>'marketing_airline_code' <>
segment->>'operating_airline_code' ;
ERROR:  42883: operator does not exist: text <> jsonb
LINE 2: ...segments where segment->>'marketing_airline_code' <> 
segment...

  ^
HINT:  No operator matches the given name and argument type(s). You
might need to add explicit type casts.
LOCATION:  op_error, parse_oper.c:722
Time: 0.407 ms
stark=***# select  segment->'id' as id from flight_segments where
(segment->>'marketing_airline_code') <>
(segment->>'operating_airline_code') ;
  id
-
  "45866185"
  "95575359"


I don't think this is related to the jsonb patch -- json and hstore
have the same behaviour so jsonb is obviously going to follow suit.
The only option right now would be to use a higher precedence operator
like % or ^ for all of these data types which I'm not for. I suspect
it's a pipe dream to think we might be able to override the '.' and
changing the precedence of -> and ->> would be fraught...

I think the best we can do is to highlight it in the docs.

Incidentally it's a good thing there wasn't an implicit cast
text->jsonb. In this case it would have resulted in just a confusing
error of jsonb->>boolean not existing.


Wow, that really sucks. :(

What are cases where things would break if we changed the precedence 
of -> and ->>? ISTM that's what we really should do if there's some 
way to manage the backwards compatibility...



There is no provision for setting the precedence of any operators. The 
precedence is set in the grammar, and these all have the same 
precedence. What you're suggesting would be a cure far worse than the 
disease, I strongly suspect. You just need to learn to live with this.


What really bugs me about the example is that <> has a different 
precedence from =, which seems more than odd. The example works just 
fine if you use = instead of <>. But I guess it's been that way for a 
very long time and there's not much to be done about it.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Ok to flip pg_constraint.condeferrable on 9.1?

2014-04-01 Thread Jerry Sievers
Hackers;  as per $subject...

We have an FK defined on a table large enough, and busy 24x7, that
redefining the same constraint would be a painful solution.

Ran into a case where defining the constraint as DEFERRABLE INITIALLY
IMMEDIATE, and running just one batch job with deferred firing, would
solve a concurrency problem that we discovered.

Grabbing a quick exclusive lock on the two related tables would not be a
problem if that might help avoid bad side effects.

Developers are already working on a chunking solution to avoid the
long-running transaction that gave rise to this, but I'd like to
consider this approach if it's not risky.

Comments?
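
For concreteness, the sort of catalog tweak being asked about would look
roughly like the sketch below (hypothetical constraint and table names;
whether this is safe is exactly the question, and the constraint's
pg_trigger rows would presumably need matching tgdeferrable updates for
deferred firing to actually work):

BEGIN;
-- quick exclusive locks on the two related tables, as mentioned above
LOCK TABLE parent_tbl, child_tbl IN ACCESS EXCLUSIVE MODE;

-- run as superuser; in practice also qualify by conrelid,
-- since constraint names are not unique across the database
UPDATE pg_constraint
   SET condeferrable = true
 WHERE conname = 'child_tbl_parent_id_fkey';

UPDATE pg_trigger
   SET tgdeferrable = true
 WHERE tgconstraint = (SELECT oid FROM pg_constraint
                       WHERE conname = 'child_tbl_parent_id_fkey');

COMMIT;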

-- 
Jerry Sievers
Postgres DBA/Development Consulting
e: postgres.consult...@comcast.net
p: 312.241.7800


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] get_fn_expr_variadic considered harmful

2014-04-01 Thread Robert Haas
On Tue, Apr 1, 2014 at 2:23 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Apr 1, 2014 at 12:52 PM, Tom Lane  wrote:
>>> In bug #9817 there's a complaint that the planner fails to consider
>>> these expressions equivalent:
>>> foo('a'::text, 'b'::text)
>>> foo(variadic array['a'::text, 'b'::text])
>>> when foo() is declared as taking variadic text[].
>
>> My first reaction to this was "who cares? after all, the user should
>> just write the expression the same way both times and then they won't
>> have this problem".  But after going and looking at the bug report I
>> see that the user wrote it the first way consistently, but pg_dump
>> blithely rewrote it to the second way.  I'm disinclined to view that
>> as a planner problem; it seems to me to be a pg_dump or ruleutils bug.
>>  If those two things don't have the same parse representation, then
>> pg_dump has no business treating them as equivalent - even if we were
>> to put enough smarts into the planner to paper over that
>> non-equivalence.
>
> The point is that they *were* equivalent before 9.3, and so ruleutils
> was entirely within its rights to not worry about which way it dumped
> the expression; indeed, it couldn't, because the information was not
> there as to which way the call had been written originally.  I do not
> think it's appropriate to blame ruleutils for taking advantage of this
> equivalence, because more than likely user applications have too.

Sure, it was reasonable at the time, certainly.  But they're not
equivalent any more.  Either they've got to become equivalent again,
or you can't dump one as the other.  I'm happy with either one, but
the first rule of correct dumping is that what you dump out must, on
reload, produce an equivalent database.

> Or in other words, what I wrote above is a more general statement of the
> problem than what was complained of in bug #9817 ... but if we just hack
> ruleutils to dump the cases differently, we will fail to fix the more
> general problem.  So we can still expect future bug reports about that,
> because it worked as-expected for years before 9.3.
>
> There's also the point that even if we changed ruleutils' behavior
> now, this would not fix existing dump files that have considered the
> two forms interchangeable ever since VARIADIC existed.  And we
> generally try hard to not break existing dump files.  To be even
> more to the point: what you propose is incapable of fixing the precise
> problem stated in the bug report, because it's complaining about a
> dump taken from 9.1, and there is *no* way to make 9.1 produce a
> dump that only uses VARIADIC if the original call did.  It hasn't
> got the information.  Even using a newer version of pg_dump wouldn't
> help that.

Well, that argues for the choice of trying to make them equivalent
again, I suppose, but it sounds like there are some nasty edge cases
that won't easily be filed down.  I think your idea of redefining
funcvariadic to be true only for VARIADIC ANY is probably a promising
approach to that solution, but as you say it leaves some problems
unsolved.
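
For reference, the pair of call forms at issue looks like this (a sketch
with a made-up function; the complaint is that before 9.3 these produced
the same parse representation):

CREATE FUNCTION foo(VARIADIC args text[]) RETURNS text
    LANGUAGE sql AS $$ SELECT array_to_string(args, ',') $$;

SELECT foo('a'::text, 'b'::text);                  -- written element-wise
SELECT foo(VARIADIC ARRAY['a'::text, 'b'::text]);  -- as pg_dump/ruleutils emits it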

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] json/jsonb/hstore operator precedence

2014-04-01 Thread Josh Berkus
On 04/01/2014 01:07 PM, Andrew Dunstan wrote:
> What really bugs me about the example is that <> has a different
> precedence from =, which seems more than odd. The example works just
> fine if you use = instead of <>. But I guess it's been that way for a
> very long time and there's not much to be done about it.

... and it would probably break umpty-zillion existing apps if we
changed precedence.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Doing better at HINTing an appropriate column within errorMissingColumn()

2014-04-01 Thread Jim Nasby

On 4/1/14, 1:04 PM, Peter Geoghegan wrote:

> It strains credulity to think that this
> patch alone would have that effect, but there might be quite a few
> similar improvements that are possible.  So I think it would be good
> to consider how far we want to go in this direction and where we think
> we might want to stop.  That's not to say, let's not ever do this,
> just, let's think carefully about where we want to end up.

Fair enough.


I agree with the concern, but also have to say that I can't count how many 
times I could have used this. A big +1, at least in this case.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] json/jsonb/hstore operator precedence

2014-04-01 Thread Jim Nasby

On 4/1/14, 3:07 PM, Andrew Dunstan wrote:

What are cases where things would break if we changed the precedence of -> and 
->>? ISTM that's what we really should do if there's some way to manage the 
backwards compatibility...



There is no provision for setting the precedence of any operators. The 
precedence is set in the grammar, and these all have the same precedence. What 
you're suggesting would be a cure far worse than the disease, I strongly suspect. 
You just need to learn to live with this.

What really bugs me about the example is that <> has a different precedence from =, 
which seems more than odd. The example works just fine if you use = instead of <>. 
But I guess it's been that way for a very long time and there's not much to be done about 
it.


I'm confused... first you say there's no precedence and then you're saying that 
there is? Which is it?

ISTM that most languages set the priority of de-referencing operators to be 
quite high, so I don't see how that would be a disaster?

Of course, changing the precedence of = and <> certainly would be a disaster; 
I'm not suggesting that.
--
Jim C. Nasby, Data Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] json/jsonb/hstore operator precedence

2014-04-01 Thread Andrew Dunstan


On 04/01/2014 05:42 PM, Jim Nasby wrote:

On 4/1/14, 3:07 PM, Andrew Dunstan wrote:
What are cases where things would break if we changed the precedence 
of -> and ->>? ISTM that's what we really should do if there's some 
way to manage the backwards compatibility...



There is no provision for setting the precedence of any operators. 
The precedence is set in the grammar, and these all have the same 
precedence. What you're suggesting would be a cure far worse than the 
disease, I strongly suspect. You just need to learn to live with this.


What really bugs me about the example is that <> has a different 
precedence from =, which seems more than odd. The example works just 
fine if you use = instead of <>. But I guess it's been that way for a 
very long time and there's not much to be done about it.


I'm confused... first you say there's no precedence and then you're 
saying that there is? Which is it?


No I didn't say there was no precedence. Please reread what I said. I 
said there was no provision for setting the precedence. There is 
precedence of course, but it's hardcoded.




ISTM that most languages set the priority of de-referencing operators 
to be quite high, so I don't see how that would be a disaster?


The way the grammar works ALL the composite operators have the same 
precedence. It has no notion that -> is a dereference operator. You're 
suggesting something without actually looking at the code. Look at 
gram.y and scan.l and you might understand.




Of course, changing the precedence of = and <> certainly would be a 
disaster; I'm not suggesting that.


There is arguably nothing wrong with the precedence of -> and ->>. The 
reason for the problem Greg reported is that <> probably has its 
precedence set too low. And no, we can't change it.


cheers

andrew





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using arrays within structure in ECPG

2014-04-01 Thread Merlin Moncure
On Tue, Apr 1, 2014 at 11:50 AM, Michael Meskes  wrote:
> Hi Ashutosh,
>
>> I tried to fix the offset problem. PFA the patch. It does solve the
>> problem of setting wrong offset in ECPGdo() call.
>
> Thanks, looks correct to me.
>
>> But then there is problem of interpreting the result from server as an
>> array within array of structure. The problem is there is in
>> ecpg_get_data(). This function can not understand that the "field" is an
>> array of integers (or for that matter array of anything) and store all
>> the values in contiguous memory at the given address.
>
> I guess I know where that comes from, without actually looking at the
> code, though. Nested arrays are not supported by ecpg and the
> precompiler spits out an error message, just check preproc/type.c.
> However, in your example you have the struct essentially sandwiched
> between the arrays, and the (too) simple check in that file doesn't
> notice, but the implementation is nevertheless lacking.
>
> I'm sorry, but this sounds like a missing feature bug.

Small aside:

I've often wondered if the right long term approach is to abstract
backend type code into a shared library that both the server and
(optionally) the client would link with.  That would make extending
support to exotic types in ecpg much easier.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Including replication slot data in base backups

2014-04-01 Thread Josh Berkus
On 04/01/2014 05:24 AM, Michael Paquier wrote:
> Hi all,
> 
> As of now, pg_basebackup creates an empty directory for pg_replslot/
> in a base backup, forcing the user to recreate slots on other nodes of
> the cluster with pg_create_*_replication_slot, or copy pg_replslot
> from another node. This is not really user-friendly, especially after a
> failover where a given slave may not have the replication slot
> information of the master that it is replacing.
> 
> The simple patch attached adds a new option in pg_basebackup, called
> --replication-slot, which allows replication slot information to be included
> in a base backup. This is done by extending the BASE_BACKUP command in
> the replication protocol.
> 
> As it is too late for 9.4, I would like to add it to the first commit
> fest of 9.5. Comments are welcome.

Assuming it works, this seems like something we need to fix for 9.4.  No?


-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] It seems no Windows buildfarm members are running find_typedefs

2014-04-01 Thread Tom Lane
The current typedefs list seems to be lacking any Windows-only typedefs.
Noticed while trying to pgindent postmaster.c.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Including replication slot data in base backups

2014-04-01 Thread Michael Paquier
On Tue, Apr 1, 2014 at 11:59 PM, Andres Freund  wrote:
> On 2014-04-01 16:45:46 +0200, Magnus Hagander wrote:
>> On Tue, Apr 1, 2014 at 2:24 PM, Michael Paquier
>> wrote:
>> > As of now, pg_basebackup creates an empty directory for pg_replslot/
>> > in a base backup, forcing the user to recreate slots on other nodes of
>> > the cluster with pg_create_*_replication_slot, or copy pg_replslot
>> > from another node. This is not really user-friendly, especially after a
>> > failover where a given slave may not have the replication slot
>> > information of the master that it is replacing.
>
> What exactly is your use case for copying the slots?
I had in mind users who want to keep base backups around for recovery
operations like PITR from a base backup plus archives. It does not apply
directly to a live standby, as that would mean the standby is defined to
retain WAL for other slaves connected to the master.
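
To make the current manual step concrete, this is roughly what has to be
done by hand today on a node restored from such a base backup (the slot
name and the choice of physical vs. logical slot are made up for
illustration):

-- recreate the slot that the base backup did not carry over:
SELECT pg_create_physical_replication_slot('standby_1_slot');
-- or, for a logical slot:
-- SELECT pg_create_logical_replication_slot('decoding_slot', 'test_decoding');
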
Regards
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] It seems no Windows buildfarm members are running find_typedefs

2014-04-01 Thread Andrew Dunstan


On 04/01/2014 08:53 PM, Tom Lane wrote:

The current typedefs list seems to be lacking any Windows-only typedefs.
Noticed while trying to pgindent postmaster.c.






Hmm. odd. will check.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #9518: temporary login failure - "missing pg_hba entry"

2014-04-01 Thread Tom Lane
I wrote:
>> IOW, it looks to me like intermittent failures in the reverse DNS lookup
>> could disable matching by hostname, and nothing would be said in the
>> postmaster log.  Why is there no complaint if check_hostname's call to
>> pg_getnameinfo_all (line 600 in HEAD) fails?

> After sleeping on it, I think probably the reason it is like that is a
> desire to not clutter the postmaster log if there are some legitimate
> clients without rDNS entries.  That is, suppose pg_hba.conf has

>   host foo.bar.com ...
>   host 192.168.168.1 ...

> and you've not bothered to create a reverse-DNS entry for 192.168.168.1.
> We will try (and fail) to look up the rDNS entry while considering the
> foo.bar.com line.  We certainly don't want a failure there to prevent us
> from reaching the 192.168.168.1 line, and we don't really want to clutter
> the postmaster log with a bleat about it, either.  Hence the lack of any
> error logging in the existing code.  (The later cross-check on whether
> the forward DNS matches does have an error report, which maybe isn't such
> a great thing either from this standpoint.)

> The problem of course is that if the rDNS failure prevents us from
> matching to *any* line, we exit with no error more helpful than 
> "missing pg_hba entry", which is not very desirable in this case.

> I guess we could do something like remember the fact that we tried and
> failed to do an rDNS lookup, and report it as DETAIL in the eventual
> "missing pg_hba entry" report.  Not quite sure if it's worth the trouble
> --- any thoughts?

> Another objection to the code as it stands is that if there are multiple
> pg_hba lines containing hostnames, we'll repeat the failing rDNS lookup
> at each one.  This is at best a huge waste of cycles (multiple network
> roundtrips, if the DNS server isn't local), and at worst inconsistent
> if things actually are intermittent and a later lookup attempt succeeds.
> I think we want to fix it to be sure that there's exactly one rDNS lookup
> attempt, occurring at the first line with a hostname.

Attached is a draft patch to deal with these issues.  While testing it,
I realized that the existing code is wrong in a couple more ways than
I'd thought.  In the first place, it does not specify the NI_NAMEREQD
flag to getnameinfo, which means that if getnameinfo can't resolve a
hostname for the client IP address, it'll just quietly return the numeric
address instead of complaining.  That is certainly not what we want here.
In the second place, postmaster.c happily forced the result of its own
getnameinfo call into port->remote_hostname, despite the lack of
NI_NAMEREQD in that call, not to mention the fact that it would do so even
if that call had failed outright (not that that's too likely without
NI_NAMEREQD, I guess).

Unfortunately, I don't think we want to add NI_NAMEREQD to postmaster.c's
call, since then we'd get nothing usable at all for port->remote_host for
a client without an rDNS record.  This means that we can't realistically
piggyback on that call to set port->remote_hostname.  I thought about
trying to detect whether the returned string was a numeric IP address or
not, but that doesn't look so easy, because IPv6 addresses can contain
hex characters.  So for instance a simple strspn character-set check
wouldn't be able to tell that "abc123.de" wasn't numeric.  So I just
stripped out that code in the attached patch.

I'm tempted to back-patch this, because AFAICS the misbehaviors mentioned
here are flat-out bugs, independently of whether the error logging is
improved or not.  Changing struct Port in back branches is a bit risky,
but we could put the added field at the end in the back branches.

Thoughts?

regards, tom lane

diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 31ade0b..eb4c4ca 100644
*** a/src/backend/libpq/auth.c
--- b/src/backend/libpq/auth.c
*** ClientAuthentication(Port *port)
*** 425,439 
     NI_NUMERICHOST);
  
  #define HOSTNAME_LOOKUP_DETAIL(port) \
! (port->remote_hostname  \
!  ? (port->remote_hostname_resolv == +1	\
! 	? errdetail_log("Client IP address resolved to \"%s\", forward lookup matches.", port->remote_hostname) \
! 	: (port->remote_hostname_resolv == 0\
! 	   ? errdetail_log("Client IP address resolved to \"%s\", forward lookup not checked.", port->remote_hostname) \
! 	   : (port->remote_hostname_resolv == -1			\
! 		  ? errdetail_log("Client IP address resolved to \"%s\", forward lookup does not match.", port->remote_hostname) \
! 		  : 0)))		\
!  : 0)
  
  if (am_walsender)
  {
--- 425,449 
     NI_NUMERICHOST);
  
  #define HOSTNAME_LOOKUP_DETAIL(port) \
! (port->remote_hostname ? \
!  (port->remote_hostname_resolv == +1 ? \
!   errdetail_log("Client IP address resolved to \"%s\", forward lookup matches.", \
! port->remote_hostname) : \
!   port->remote_hostn

Re: [HACKERS] using arrays within structure in ECPG

2014-04-01 Thread Ashutosh Bapat
So, you are saying that we should try to catch such errors and report
them at precompile time. That's better than silently corrupting the data.


On Tue, Apr 1, 2014 at 10:20 PM, Michael Meskes wrote:

> Hi Ashutosh,
>
> > I tried to fix the offset problem. PFA the patch. It does solve the
> > problem of setting wrong offset in ECPGdo() call.
>
> Thanks, looks correct to me.
>
> > But then there is problem of interpreting the result from server as an
> > array within array of structure. The problem is there is in
> > ecpg_get_data(). This function can not understand that the "field" is an
> > array of integers (or for that matter array of anything) and store all
> > the values in contiguous memory at the given address.
>
> I guess I know where that comes from, without actually looking at the
> code, though. Nested arrays are not supported by ecpg and the
> precompiler spits out an error message, just check preproc/type.c.
> However, in your example you have the struct essentially sandwiched
> between the arrays, and the (too) simple check in that file doesn't
> notice, but the implementation is nevertheless lacking.
>
> I'm sorry, but this sounds like a missing feature bug.
>
> Michael
> --
> Michael Meskes
> Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
> Michael at BorussiaFan dot De, Meskes at (Debian|Postgresql) dot Org
> Jabber: michael.meskes at gmail dot com
> VfL Borussia! Força Barça! Go SF 49ers! Use Debian GNU/Linux, PostgreSQL
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: [HACKERS] It seems no Windows buildfarm members are running find_typedefs

2014-04-01 Thread Andrew Dunstan


On 04/01/2014 09:22 PM, Andrew Dunstan wrote:


On 04/01/2014 08:53 PM, Tom Lane wrote:

The current typedefs list seems to be lacking any Windows-only typedefs.
Noticed while trying to pgindent postmaster.c.






Hmm. odd. will check.




It's apparently causing the buildfarm to crash, which is why I must have 
disabled it. I'll chase that down tomorrow.


cheers

andrew




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Fwd: [HACKERS] Proposal: variant of regclass

2014-04-01 Thread Amit Kapila
On Mon, Mar 31, 2014 at 7:08 PM, Yugo Nagata  wrote:
> Hi Amit Kapila,
>
> Thank you for your reviewing. I updated the patch to v5.

I have checked the latest version and found a few minor improvements that
are required:

1.
! if (!missing_ok)
! ereport(ERROR,
! (errcode(ERRCODE_UNDEFINED_OBJECT),
! errmsg("type \"%s\" does not exist",
! TypeNameToString(typeName)),
! parser_errposition(NULL, typeName->location)));

pfree(buf.data); seems to be missing in error cases.
Do you see any problem if we call it before calling LookupTypeName()
instead of calling it at multiple places?

2.
+raising an error. In addition, neither to_regproc nor
+to_regoper doesn't raise an error when more than one
+object are found.

No need to use word *doesn't* in above sentence.


3.
+  * If the type name is not found, return InvalidOid if missing_ok
+  * = true, otherwise raise an error.

I can understand the above comment, but I think it is better to improve it
by referring to other such instances. I like the explanation of missing_ok
in the function header of relation_openrv_extended(). Could you please check
other places and improve this comment.

4. How is the user going to distinguish between the object-not-found and
   more-than-one-object cases (see the sketch below)?
   Do you think such a distinction is not required for users of this API?
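
To make point 4 concrete, a sketch of the ambiguity (the return values are
assumed from the quoted documentation hunk, not verified against the patch):

SELECT to_regproc('no_such_function');      -- not found: no error, "invalid" result
SELECT to_regproc('ambiguous_func_name');   -- more than one match: same "invalid" result
-- A caller cannot tell these two cases apart from the return value alone.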

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers