Re: [HACKERS] GSoC 2014: Implementing clustering algorithms in MADlib

2014-04-07 Thread Hai Qian
Hi Maxence,

This is really really good. Hopefully we will have a fruitful summer, and
make MADlib a better product.

Thanks

Hai

--
*Pivotal http://www.gopivotal.com/*
A new platform for a new era


On Sat, Apr 5, 2014 at 7:14 AM, Maxence Ahlouche maxence.ahlou...@gmail.com
 wrote:

 Hi all,

 2014-04-03 9:04 GMT+02:00 Hai Qian hq...@gopivotal.com:

 Hi Maxence,

 We are very glad that you are enthusiastic about MADlib. Your proposal
 looks
 very nice. The algorithms that you proposed will be very useful, and can
 serve as a good start. After learning about the C++ abstraction layer and
 testing suite, you should be able to quickly grasp how to program for
 MADlib. We can fill in the details of the plan later. The conflict in your
 time schedule shouldn't be a problem.

 It would be nice if you could provide a little bit more introduction to
 these algorithms and their importance in the proposal.

 Overall it is very good. And thank you again

 Hai


 As you requested, I've described the clustering algorithms on my github
 [0]. I'm copying it at the end of this email, just in case.
 In the case of OPTICS, I still need some time to grasp the details of how
 it works, so I've only included a basic description of this one.

 Regards,
 Maxence

 [0] https://github.com/viodlen/gsoc_2014/blob/master/algorithm_details.rst


 Details about the different clustering algorithms
 ==================================================

 This file describes the k-means algorithm, which is currently the only
 clustering algorithm implemented in MADlib, and the k-medoids and
 OPTICS algorithms, which I plan to implement this summer, as a GSoC
 project. I'll also explain the DBSCAN algorithm, as OPTICS is only an
 extension of this one.

 The goal of all these algorithms is to determine clusters in a set of
 points. While solving this problem exactly is NP-hard, these
 algorithms can produce good approximate solutions efficiently, even
 though the result is usually not perfect.

 I've tried not to add too many details, and taken some shortcuts, so
 there may be some inaccuracies. I wanted to keep this readable (not
 sure that goal was achieved, though), but if anyone wants more
 details, I'll gladly add them.

 k-means
 -------

 The k-means algorithm is the one implemented in MADlib. It selects
 *k* points to use as initial centroids (cluster centers), either
 randomly from the dataset or as provided by the user. *k* must be
 defined by the user; the algorithm is unable to guess the number of
 clusters.

 All the points in the dataset are then assigned to the nearest
 centroid, thus making *k* clusters.

 The next step is to compute the new centroids as the mean of all the
 points in each cluster. Then all the points are reassigned to the new
 centroids, and the process starts all over again until there is no
 change in cluster assignment.

 This algorithm is usually pretty fast (even though it can be made to
 take exponential time to converge). Still, it is easy to get a
 wrong result because of local convergence, e.g. having one cluster
 split in two parts by the algorithm, or two clusters merged. It is
 also pretty sensitive to outliers (points that don't obviously belong
 to any cluster), and the final result depends greatly on the initial
 centroids.

 In general, this algorithm doesn't work well with non-Euclidean
 distances.

 Another thing to note is that k-means results in Voronoi-cell-shaped
 clusters, which is not always what we want.

 k-medoids
 ---------

 The k-medoids algorithm works the same way as k-means, save for a few
 exceptions.

 With k-medoids, the centroids are always points of the dataset (and
 are then called medoids). The new medoid of each cluster is the point
 that minimizes the sum of pairwise dissimilarities (in other words,
 the point whose average distance to the other points in the
 cluster is minimal). This makes the algorithm less sensitive to
 outliers than k-means.

 This algorithm is computationally more intensive than k-means, but
 still fast.

 As with k-means, it is possible to run the algorithm several times
 with different initial medoids, and keep the best result (i.e. the one
 that minimizes the sum of distances from points to their medoids).

 DBSCAN
 ------

 DBSCAN (*Density-Based Spatial Clustering of Applications with Noise*)
 is a clustering algorithm based on the density of clusters.

 This addresses several limitations of k-means and k-medoids: it does
 not assign a cluster to outliers, and it allows the detection of
 weird-shaped clusters. Moreover, it doesn't need to be told the number
 of clusters.

 A point is called dense if enough other points are near enough to
 it, where "enough other points" and "near enough" are defined by the
 parameters *min_points* and *epsilon* respectively.

 *min_points* therefore represents the minimum number of points
 required to make a cluster, and *epsilon* is the maximum distance a
 point can be from a cluster to be considered part of this cluster.

 For every unassigned point, count its *epsilon*-neighbours (not sure
 this term exists, but I'll use it for convenience). If there are too
 few, consider this point an outlier; else, create a new cluster and
 put the current point in it, along with its neighbours, the neighbours
 of its neighbours, and so on.

[HACKERS] GSoC 2014: Implementing clustering algorithms in MADlib

2014-04-05 Thread Maxence Ahlouche
Hi all,

2014-04-03 9:04 GMT+02:00 Hai Qian hq...@gopivotal.com:

 Hi Maxence,

 We are very glad that you are enthusiastic about MADlib. Your proposal looks
 very nice. The algorithms that you proposed will be very useful, and can
 serve as a good start. After learning about the C++ abstraction layer and
 testing suite, you should be able to quickly grasp how to program for
 MADlib. We can fill in the details of the plan later. The conflict in your
 time schedule shouldn't be a problem.

 It would be nice if you could provide a little bit more introduction to
 these algorithms and their importance in the proposal.

 Overall it is very good. And thank you again

 Hai


As you requested, I've described the clustering algorithms on my github
[0]. I'm copying it at the end of this email, just in case.
In the case of OPTICS, I still need some time to grasp the details of how
it works, so I've only included a basic description of this one.

Regards,
Maxence

[0] https://github.com/viodlen/gsoc_2014/blob/master/algorithm_details.rst


Details about the different clustering algorithms
==================================================

This file describes the k-means algorithm, which is currently the only
clustering algorithm implemented in MADlib, and the k-medoids and
OPTICS algorithms, which I plan to implement this summer, as a GSoC
project. I'll also explain the DBSCAN algorithm, as OPTICS is only an
extension of this one.

The goal of all these algorithms is to determine clusters in a set of
points. While solving this problem exactly is NP-hard, these
algorithms can produce good approximate solutions efficiently, even
though the result is usually not perfect.

I've tried not to add too many details, and taken some shortcuts, so
there may be some inaccuracies. I wanted to keep this readable (not
sure that goal was achieved, though), but if anyone wants more
details, I'll gladly add them.

k-means
-------

The k-means algorithm is the one implemented in MADlib. It selects
*k* points to use as initial centroids (cluster centers), either
randomly from the dataset or as provided by the user. *k* must be
defined by the user; the algorithm is unable to guess the number of
clusters.

All the points in the dataset are then assigned to the nearest
centroid, thus making *k* clusters.

The next step is to compute the new centroids as the mean of all the
points in each cluster. Then all the points are reassigned to the new
centroids, and the process starts all over again until there is no
change in cluster assignment.

This algorithm is usually pretty fast (even though it can be made to
take exponential time to converge). Still, it is easy to get a
wrong result because of local convergence, e.g. having one cluster
split in two parts by the algorithm, or two clusters merged. It is
also pretty sensitive to outliers (points that don't obviously belong
to any cluster), and the final result depends greatly on the initial
centroids.

In general, this algorithm doesn't work well with non-Euclidean
distances.

Another thing to note is that k-means results in Voronoi-cell-shaped
clusters, which is not always what we want.
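
For illustration, here is a minimal sketch of one k-means iteration in
SQL (the schema is hypothetical: 2-d points in a ``points`` table and
the current centroids in a ``centroids`` table)::

  -- Assignment step: attach each point to its nearest centroid.
  CREATE TABLE assignment AS
  SELECT p.id,
         (SELECT c.cid
          FROM centroids c
          ORDER BY (p.x - c.x)^2 + (p.y - c.y)^2
          LIMIT 1) AS cid
  FROM points p;

  -- Update step: each new centroid is the mean of its points.
  UPDATE centroids c
  SET x = s.mx, y = s.my
  FROM (SELECT a.cid, avg(p.x) AS mx, avg(p.y) AS my
        FROM assignment a JOIN points p ON p.id = a.id
        GROUP BY a.cid) s
  WHERE c.cid = s.cid;

Both steps are repeated until the assignment stops changing.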

k-medoids
---------

The k-medoids algorithm works the same way as k-means, save for a few
exceptions.

With k-medoids, the centroids are always points of the dataset (and
are then called medoids). The new medoid of each cluster is the point
that minimizes the sum of pairwise dissimilarities (in other words,
the point whose average distance to the other points in the
cluster is minimal). This makes the algorithm less sensitive to
outliers than k-means.

This algorithm is computationally more intensive than k-means, but
still fast.

As with k-means, it is possible to run the algorithm several times
with different initial medoids, and keep the best result (i.e. the one
that minimizes the sum of distances from points to their medoids).
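
The medoid update step can be sketched in SQL as well (same
hypothetical schema, with ``assignment`` holding the current
point-to-cluster mapping)::

  -- For each cluster, pick the member that minimizes the summed
  -- distance to all other members of the same cluster.
  SELECT DISTINCT ON (a.cid)
         a.cid, a.id AS medoid
  FROM assignment a
  JOIN points p ON p.id = a.id
  JOIN assignment b ON b.cid = a.cid
  JOIN points q ON q.id = b.id
  GROUP BY a.cid, a.id
  ORDER BY a.cid, sum(sqrt((p.x - q.x)^2 + (p.y - q.y)^2));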

DBSCAN
------

DBSCAN (*Density-Based Spatial Clustering of Applications with Noise*)
is a clustering algorithm based on the density of clusters.

This addresses several limitations of k-means and k-medoids: it does
not assign a cluster to outliers, and it allows the detection of
weird-shaped clusters. Moreover, it doesn't need to be told the number
of clusters.

A point is called dense if enough other points are near enough to
it, where "enough other points" and "near enough" are defined by the
parameters *min_points* and *epsilon* respectively.

*min_points* therefore represents the minimum number of points
required to make a cluster, and *epsilon* is the maximum distance a
point can be from a cluster to be considered part of this cluster.

For every unassigned point, count its *epsilon*-neighbours (not sure
this term exists, but I'll use it for convenience). If there are too
few, consider this point an outlier; else, create a new cluster and
put the current point in it, along with its neighbours, the neighbours
of its neighbours, and so on.
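
The density test alone is easy to express in SQL (hypothetical schema
again, with *epsilon* = 0.3 and *min_points* = 5); the cluster-growing
step would need a recursive query or procedural code on top of it::

  -- A point is dense (a "core point") if at least min_points
  -- neighbours lie within epsilon of it.
  SELECT p.id, count(q.id) >= 5 AS is_core
  FROM points p
  LEFT JOIN points q
    ON q.id <> p.id
   AND sqrt((p.x - q.x)^2 + (p.y - q.y)^2) <= 0.3
  GROUP BY p.id;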

The main problem with DBSCAN is that it doesn't work well for clusters
with very different densities.

Re: [HACKERS] GSoC 2014 proposal

2014-04-03 Thread Alexander Korotkov
On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov aekorot...@gmail.com wrote:

 On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 The BIRCH algorithm as described in the paper builds a tree
 in memory. If I understood correctly, you're suggesting using a pre-built
 GiST index instead. Interesting idea!

 There are a couple of significant differences between the CF tree
 described in the paper and GiST:

 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
 a leaf item represents a cluster, which consists of one or more tuples. So
 the CF tree doesn't store an entry for every input tuple, which makes it
 possible to keep it in memory.

 2. In the CF tree, all entries in a leaf node must satisfy a threshold
 requirement, with respect to a threshold value T: the diameter (or radius)
 has to be less than T. GiST imposes no such restrictions. An item can
 legally be placed anywhere in the tree; placing it badly will just lead to
 degraded search performance, but it's still a legal GiST tree.

 3. A GiST index, like any other index in PostgreSQL, holds entries also
 for deleted tuples, until the index is vacuumed. So you cannot just use
 information from a non-leaf node and use it in the result, as the
 information summarized at a non-leaf level includes noise from the dead
 tuples.

 Can you elaborate how you are planning to use a GiST index to implement
 BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
 strict in where in the tree an item can be stored, and lets the operator
 class specify exactly when a node is split, etc.


 Hmmm, it's likely I've imagined something quite outside of this paper, and
 even already suggested it to Ivan... :)
 I need a little time to rethink it.


Using GiST we can implement BIRCH-like clustering like so:
1) Build a CF tree as a GiST index, without the restriction of the T
threshold value.
2) Scan the CF tree with threshold T using some auxiliary operator. If the
consistent method sees a CF entry whose diameter is greater than T, it
returns true. Otherwise it returns false and puts this CF entry into an
output area (which could be either in-memory or a temporary table).
3) Process the other steps of the algorithm as usual.

This modification would have the following advantages:
1) The user can build the GiST index once and then try clustering with
different parameters. The initial GiST index build would be the slowest
operation, while the other steps are expected to be fast.
2) It reuses the GiST infrastructure and automatically gets the buffering
build.

The drawback is that building a GiST index is more expensive than building
an in-memory CF tree with a given threshold T (assuming T is well chosen).
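
To illustrate advantage 1, the intended workflow would look roughly like
this (birch_cluster_scan is only a placeholder name for the hypothetical
scan step; nothing like it exists yet):

-- Build the CF tree once, as a regular GiST index.
CREATE INDEX gist_cf ON cube_test USING gist (v);

-- Then cluster repeatedly with different thresholds T,
-- reusing the same index.
SELECT * FROM birch_cluster_scan('gist_cf', 0.2);  -- T = 0.2
SELECT * FROM birch_cluster_scan('gist_cf', 0.5);  -- retry with T = 0.5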

Does it make any sense?

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GSoC 2014 proposal

2014-04-03 Thread Heikki Linnakangas

On 04/03/2014 04:15 PM, Alexander Korotkov wrote:

On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov aekorot...@gmail.com wrote:


On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas 
hlinnakan...@vmware.com wrote:


The BIRCH algorithm as described in the paper builds a tree
in memory. If I understood correctly, you're suggesting using a pre-built
GiST index instead. Interesting idea!

There are a couple of significant differences between the CF tree
described in the paper and GiST:

1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
a leaf item represents a cluster, which consists of one or more tuples. So
the CF tree doesn't store an entry for every input tuple, which makes it
possible to keep it in memory.

2. In the CF tree, all entries in a leaf node must satisfy a threshold
requirement, with respect to a threshold value T: the diameter (or radius)
has to be less than T. GiST imposes no such restrictions. An item can
legally be placed anywhere in the tree; placing it badly will just lead to
degraded search performance, but it's still a legal GiST tree.

3. A GiST index, like any other index in PostgreSQL, holds entries also
for deleted tuples, until the index is vacuumed. So you cannot just use
information from a non-leaf node and use it in the result, as the
information summarized at a non-leaf level includes noise from the dead
tuples.

Can you elaborate how you are planning to use a GiST index to implement
BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
strict in where in the tree an item can be stored, and lets the operator
class specify exactly when a node is split, etc.



Hmmm, it's likely I've imagined something quite outside of this paper, and
even already suggested it to Ivan... :)
I need a little time to rethink it.



Using GiST we can implement BIRCH-like clustering like so:
1) Build a CF tree as a GiST index, without the restriction of the T
threshold value.
2) Scan the CF tree with threshold T using some auxiliary operator. If the
consistent method sees a CF entry whose diameter is greater than T, it
returns true. Otherwise it returns false and puts this CF entry into an
output area (which could be either in-memory or a temporary table).
3) Process the other steps of the algorithm as usual.


I still don't understand how that would work. You can't trust the 
non-leaf entries, because their CF value can contain deleted entries. So 
you have to scan every tuple anyway. Once you do that, what's the point 
of the index?


- Heikki




Re: [HACKERS] GSoC 2014 proposal

2014-04-03 Thread Alexander Korotkov
On Thu, Apr 3, 2014 at 11:21 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 04/03/2014 04:15 PM, Alexander Korotkov wrote:

 On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov aekorot...@gmail.com
 wrote:

  On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas 
 hlinnakan...@vmware.com wrote:

 The BIRCH algorithm as described in the paper builds a tree in
 memory. If I understood correctly, you're suggesting using a
 pre-built GiST index instead. Interesting idea!

 There are a couple of significant differences between the CF tree
 described in the paper and GiST:

 1. In GiST, a leaf item always represents one heap tuple. In the CF
 tree,
 a leaf item represents a cluster, which consists of one or more tuples.
 So
 the CF tree doesn't store an entry for every input tuple, which makes it
 possible to keep it in memory.

 2. In the CF tree, all entries in a leaf node must satisfy a threshold
 requirement, with respect to a threshold value T: the diameter (or
 radius)
 has to be less than T. GiST imposes no such restrictions. An item can
 legally be placed anywhere in the tree; placing it badly will just lead
 to
 degraded search performance, but it's still a legal GiST tree.

 3. A GiST index, like any other index in PostgreSQL, holds entries also
 for deleted tuples, until the index is vacuumed. So you cannot just use
 information from a non-leaf node and use it in the result, as the
 information summarized at a non-leaf level includes noise from the dead
 tuples.

 Can you elaborate how you are planning to use a GiST index to implement
 BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
 strict in where in the tree an item can be stored, and lets the operator
 class specify exactly when a node is split, etc.


 Hmmm, it's likely I've imagined something quite outside of this paper,
 and
 even already suggested it to Ivan... :)
 I need a little time to rethink it.


 Using GiST we can implement BIRCH-like clustering like so:
 1) Build a CF tree as a GiST index, without the restriction of the T
 threshold value.
 2) Scan the CF tree with threshold T using some auxiliary operator. If the
 consistent method sees a CF entry whose diameter is greater than T, it
 returns true. Otherwise it returns false and puts this CF entry into an
 output area (which could be either in-memory or a temporary table).
 3) Process the other steps of the algorithm as usual.


 I still don't understand how that would work. You can't trust the non-leaf
 entries, because their CF value can contain deleted entries. So you have to
 scan every tuple anyway. Once you do that, what's the point of the index?


Yeah, that is a limitation of this idea. It's not going to be an
auto-updatable CF tree. The user can build the index on a freshly vacuumed
table and play with clustering for some time. Updates on the table during
that time would either be accepted as clustering error or prohibited.
Another potential solution is to let this index hold some snapshot of the
data. But that doesn't seem possible to do in an extension now.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GSoC 2014 proposal

2014-04-02 Thread Alexander Korotkov
On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

 The BIRCH algorithm as described in the paper builds a tree in
 memory. If I understood correctly, you're suggesting using a pre-built
 GiST index instead. Interesting idea!

 There are a couple of significant differences between the CF tree described
 in the paper and GiST:

 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
 a leaf item represents a cluster, which consists of one or more tuples. So
 the CF tree doesn't store an entry for every input tuple, which makes it
 possible to keep it in memory.

 2. In the CF tree, all entries in a leaf node must satisfy a threshold
 requirement, with respect to a threshold value T: the diameter (or radius)
 has to be less than T. GiST imposes no such restrictions. An item can
 legally be placed anywhere in the tree; placing it badly will just lead to
 degraded search performance, but it's still a legal GiST tree.

 3. A GiST index, like any other index in PostgreSQL, holds entries also
 for deleted tuples, until the index is vacuumed. So you cannot just use
 information from a non-leaf node and use it in the result, as the
 information summarized at a non-leaf level includes noise from the dead
 tuples.

 Can you elaborate how you are planning to use a GiST index to implement
 BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
 strict in where in the tree an item can be stored, and lets the operator
 class specify exactly when a node is split, etc.


Hmmm, it's likely I've imagined something quite outside of this paper, and
even already suggested it to Ivan... :)
I need a little time to rethink it.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GSoC 2014 proposal

2014-04-01 Thread Heikki Linnakangas

On 03/30/2014 11:50 PM, Иван Парфилов wrote:

The implementation of this algorithm would be for the data type cube and
based on GiST.

The key concept of the BIRCH algorithm is the clustering feature. Given a
set of N d-dimensional data points, the clustering feature CF of the set is
defined as the triple CF = (N, LS, SS), where LS is the linear sum and SS
is the square sum of the data points. Clustering features are organized in
a CF tree, which is a height-balanced tree with two parameters: branching
factor B and threshold T.

Because the structure of the CF tree is similar to a B+-tree, we can use
GiST for the implementation [2].
GiST is a balanced tree structure like a B-tree, containing (key,
pointer) pairs. A GiST key is a member of a user-defined class, and
represents some property that is true of all data items reachable from the
pointer associated with the key. GiST provides the possibility to create
custom data types with indexed access methods and an extensible set of
queries.


The BIRCH algorithm as described in the paper builds a tree
in memory. If I understood correctly, you're suggesting using a
pre-built GiST index instead. Interesting idea!


There are a couple of significant differences between the CF tree
described in the paper and GiST:


1. In GiST, a leaf item always represents one heap tuple. In the CF 
tree, a leaf item represents a cluster, which consists of one or more 
tuples. So the CF tree doesn't store an entry for every input tuple, 
which makes it possible to keep it in memory.


2. In the CF tree, all entries in a leaf node must satisfy a threshold 
requirement, with respect to a threshold value T: the diameter (or 
radius) has to be less than T. GiST imposes no such restrictions. An 
item can legally be placed anywhere in the tree; placing it badly will 
just lead to degraded search performance, but it's still a legal GiST tree.


3. A GiST index, like any other index in PostgreSQL, holds entries also 
for deleted tuples, until the index is vacuumed. So you cannot just use 
information from a non-leaf node and use it in the result, as the 
information summarized at a non-leaf level includes noise from the dead 
tuples.


Can you elaborate how you are planning to use a GiST index to implement 
BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more 
strict in where in the tree an item can be stored, and lets the operator 
class specify exactly when a node is split, etc.



We need to implement these to create a GiST-based CF tree for use in the
BIRCH algorithm.


*Example of usage (approximate):*

create table cube_test (v cube);

insert into cube_test values (cube(array[1.2, 0.4])), (cube(array[0.5, -0.2])),
       (cube(array[0.6, 1.0])), (cube(array[1.0, 0.6]));

create index gist_cf on cube_test using gist(v);

--Prototype(approximate)

--birch(maxNodeEntries, distThreshold, distFunction)

SELECT birch(4.1, 0.2, 1) FROM cube_test;

 cluster | val1 | val2
---------+------+------
       1 |  1.2 |  0.4
       0 |  0.5 | -0.2
       1 |  0.6 |  1.0
       1 |  1.0 |  0.6

Accordingly, in this GSoC project the BIRCH algorithm for the data type
cube would be implemented.


From the example, it seems that birch(...) would be an aggregate 
function. Aggregates in PostgreSQL currently work by scanning all the 
input data. That would certainly be a pretty straightforward way to 
implement BIRCH too. Every input tuple would be passed to the 
so-called transition function (which you would write), which would 
construct a CF tree on-the-fly. At the end, the result would be 
constructed from the CF tree. With this approach, the CF tree would be 
kept in memory, and thrown away after the query.
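
In SQL terms, the wiring could look something like this (all names here
are placeholders; the two support functions would be C functions you'd
write):

-- Hypothetical aggregate: feed every row into a transition function
-- that maintains an in-memory CF tree, then derive the result from it.
CREATE AGGREGATE birch(cube, float8, float8, int) (
    SFUNC = birch_transfn,     -- inserts one input point into the CF tree
    STYPE = internal,          -- opaque pointer to the CF tree
    FINALFUNC = birch_finalfn  -- extracts the clustering from the tree
);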


That would be straightforward, but wouldn't involve GiST at all. To use 
an index to implement an aggregate would require planner/executor 
changes. That would be interesting, but offhand I have no idea what that 
would look like. We'll need more details on that.


- Heikki




Re: [HACKERS] GSoC 2014 proposal

2014-04-01 Thread Heikki Linnakangas

On 03/30/2014 11:50 PM, Иван Парфилов wrote:

*Quantifiable results*

  Adding support of the BIRCH algorithm for the data type cube


Aside from the details of *how* that would work, the other question is:

Do we want this in contrib/cube? There are currently no clustering 
functions, or any other statistical functions or similar, in 
contrib/cube. Just basic contains/contained/overlaps operators. And 
B-tree comparison operators which are pretty useless for cube.


Do we want to start adding such features to cube, in contrib? Or should 
that live outside the PostgreSQL source tree, in a separate extension, 
so that it could live on its own release schedule, etc.? If BIRCH goes 
into contrib/cube, that's an invitation to add all kinds of functions to it.


We received another GSoC application to add another clustering algorithm 
to the MADlib project. MADlib is an extension to PostgreSQL with a lot 
of different statistical tools, so MADlib would be a natural home for 
BIRCH too. But if it requires backend changes (ie. changes to GiST), 
then that needs to be discussed on pgsql-hackers, and it probably would 
be better to do a reference implementation in contrib/cube. MADlib could 
later copy it from there.


- Heikki




[HACKERS] GSoC 2014 proposal

2014-03-30 Thread Иван Парфилов
Hello, hackers! This is my GSoC proposal.

*Short description:*

Cluster analysis or clustering is the task of grouping a set of objects in
such a way that objects in the same group (called a cluster) are more
similar (in some sense or another) to each other than to those in other
groups (clusters). It is a main task of exploratory data mining, and a
common technique for statistical data analysis, used in many fields,
including machine learning, pattern recognition, image analysis,
information retrieval, and bioinformatics. The purpose of this project is
to add support of the BIRCH (balanced iterative reducing and clustering
using hierarchies) algorithm [1] for the data type cube.

*Benefits to the PostgreSQL Community*

Support of the BIRCH algorithm for the data type cube would be useful for
many PostgreSQL applications (for example, to solve the data clustering
problem for high-dimensional and large datasets).

*Quantifiable results*

 Adding support of the BIRCH algorithm for the data type cube

*Project Details*
BIRCH (balanced iterative reducing and clustering using hierarchies) is an
unsupervised data mining algorithm used to perform hierarchical clustering
over particularly large data-sets.

The BIRCH algorithm (Balanced Iterative Reducing and Clustering using
Hierarchies) of Zhang [1] was developed to handle massive datasets that are
too large to be contained in the main memory (RAM). To minimize I/O costs,
every datum is read once and only once. BIRCH transforms the data set into
compact, locally similar subclusters, each with summary statistics attached
(called clustering features). Then, instead of using the full data set,
these summary statistics can be used. This approach is most advantageous in
two situations: when the data cannot be loaded into memory due to its size;
and/or when some form of combinatorial optimization is required and the
size of the solution space makes finding global maximums/minimums difficult.

Key properties of the BIRCH algorithm:

a single scan of the dataset is enough;

I/O cost minimization: organize data in an in-memory, height-balanced tree;

each clustering decision is made without scanning all the points or
clusters.

The implementation of this algorithm would be for the data type cube and
based on GiST.

The key concept of the BIRCH algorithm is the clustering feature. Given a
set of N d-dimensional data points, the clustering feature CF of the set is
defined as the triple CF = (N, LS, SS), where LS is the linear sum and SS
is the square sum of the data points. Clustering features are organized in
a CF tree, which is a height-balanced tree with two parameters: branching
factor B and threshold T.
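
For intuition, the CF triple of a set of, say, 2-d points is just three
aggregates computed in a single pass over the data (the points table here
is hypothetical), and two CFs are merged by component-wise addition:

-- CF = (N, LS, SS) of a set of 2-d points.
SELECT count(*)              AS n,   -- N: number of points
       ARRAY[sum(x), sum(y)] AS ls,  -- LS: per-dimension linear sum
       sum(x*x + y*y)        AS ss   -- SS: square sum
FROM points;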

Because the structure of the CF tree is similar to a B+-tree, we can use
GiST for the implementation [2].
GiST is a balanced tree structure like a B-tree, containing (key,
pointer) pairs. A GiST key is a member of a user-defined class, and
represents some property that is true of all data items reachable from the
pointer associated with the key. GiST provides the possibility to create
custom data types with indexed access methods and an extensible set of
queries.

There are seven methods that an index operator class for GiST must provide,
and an eighth that is optional:

-consistent

-union

-compress

-decompress

-penalty

-picksplit

-equal

-distance (optional).

We need to implement these to create a GiST-based CF tree for use in the
BIRCH algorithm.
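
A sketch of how these methods get wired up (the support function names
are placeholders for C-level functions still to be written):

-- Hypothetical operator class registering the GiST support
-- functions for the CF key.
CREATE OPERATOR CLASS cf_ops
    FOR TYPE cube USING gist AS
    FUNCTION 1 cf_consistent (internal, cube, smallint, oid, internal),
    FUNCTION 2 cf_union (internal, internal),
    FUNCTION 3 cf_compress (internal),
    FUNCTION 4 cf_decompress (internal),
    FUNCTION 5 cf_penalty (internal, internal, internal),
    FUNCTION 6 cf_picksplit (internal, internal),
    FUNCTION 7 cf_same (cube, cube, internal);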


*Example of usage (approximate):*

create table cube_test (v cube);

insert into cube_test values (cube(array[1.2, 0.4])), (cube(array[0.5, -0.2])),
       (cube(array[0.6, 1.0])), (cube(array[1.0, 0.6]));

create index gist_cf on cube_test using gist(v);

--Prototype(approximate)

--birch(maxNodeEntries, distThreshold, distFunction)

SELECT birch(4.1, 0.2, 1) FROM cube_test;

 cluster | val1 | val2
---------+------+------
       1 |  1.2 |  0.4
       0 |  0.5 | -0.2
       1 |  0.6 |  1.0
       1 |  1.0 |  0.6

Accordingly, in this GSoC project the BIRCH algorithm for the data type
cube would be implemented.


*Inch-stones*

 1) Solve architecture questions with the help of the community.

 2) First, approximate implementation (implement the distance methods, the
GiST interface methods, and the BIRCH algorithm for the data type cube).

3) Approximate implementation evaluation.

4) Final refactoring, documentation, testing.


*Project Schedule*

 until May 19

 Solve architecture questions with the help of the community.

 20 May - 27 June

 First, approximate implementation.

 28 June - 11 August

 Approximate implementation evaluation. Fixing bugs and performance testing.

 August 11 - August 18:

 Final refactoring, write tests, improve documentation.

*Completeness Criteria*

 Support of the BIRCH algorithm for the data type cube is implemented and working.

*Links*

1) http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
2) http://www.postgresql.org/docs/9.1/static/gist-implementation.html

 

With best regards, Ivan Parfilov.


Re: [HACKERS] GSoC 2014 - mentors, students and admins

2014-03-20 Thread Thom Brown
Hi all,

There is 1 day left to get submissions in, so students should ensure
that they submit their proposals as soon as possible.  No submissions
will be accepted beyond the deadline of 19:00 UTC tomorrow (Friday
21st March).

Regards

Thom




[HACKERS] GSoC 2014

2014-03-12 Thread Ashutosh Dhundhara
Hello all,
I am Ashutosh Dhundhara from Thapar University, Patiala, India, presently 
pursuing a Bachelor's degree in Computer Science and Engineering.
This year I wish to work on PostgreSQL under the flagship of GSoC 2014, so 
please help me regarding this. I have a few questions:

1) Do I have to choose all ideas from the GSoC wiki page, or any one of them?
2) What is the deadline for fixing bugs, which will count towards the 
selection procedure?

Please guide me on how to proceed.
 
Regards, 
Ashutosh Dhundhara

Re: [HACKERS] GSoC 2014

2014-03-12 Thread Atri Sharma
On Wed, Mar 12, 2014 at 8:05 PM, Ashutosh Dhundhara 
ashutoshdhundh...@yahoo.com wrote:

 Hello all,
 I am Ashutosh Dhundhara from Thapar University, Patiala, India, presently
 pursuing a Bachelor's degree in Computer Science and Engineering.
 This year I wish to work on PostgreSQL under the flagship of GSoC 2014,
 so please help me regarding this. I have a few questions:

 1) Do I have to choose all ideas from the GSoC wiki page, or any one of
 them?
 2) What is the deadline for fixing bugs, which will count towards the
 selection procedure?

 Please guide me on how to proceed.



You can propose your own ideas as well. You can pick any number of ideas
from the GSoC 2014 wiki page and send proposals for them.

The deadline for proposals is next Friday, I believe.

Regards,

Atri


-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] GSoC 2014 - mentors, students and admins

2014-03-10 Thread Thom Brown
All students and mentors (and backup mentors) should now register to
this year's GSoC.  Students only have until Friday next week (up until
21st March 19:00 UTC) to apply.

Thanks

Thom




Re: [HACKERS] GSoC 2014 - mentors, students and admins

2014-03-02 Thread Tan Tran
Hi Greg, pgsql-advocacy, and pgsql-hackers,

I'm interested in doing my GSoC project on this idea. I'm new to indexing and 
WAL, which I haven't encountered in my classes, but it sounds interesting and 
valuable to PostgreSQL. So here's my draft proposal. Do you mind giving your 
opinion and corrections? With your help I'll add some technical detail to my 
plans.

Thanks,
Tan Tran

Introduction
In write-ahead logging (WAL), all modifications to a database are 
written to a write-ahead log before being flushed to disk at periodic 
checkpoints. This method saves I/O operations, enables a continuous backup, 
and, in the case of database failure, guarantees data integrity up until the 
last saved checkpoint. In PostgreSQL's implementation, transactions are written 
to XLog, which is divided into 16MB files (“segments”) that together comprise a 
complete history of transactions. Transactions are continually appended to the 
latest segment, while checkpointing continually archives segments up until the 
last checkpoint. Internally, a suite of XLog structures and functions 
interfaces with the various resource managers so they can log a sufficient 
amount of data to restore data (“redo”) in case of failure.
Another PostgreSQL feature is the creation of indexes on an invariant 
custom field; for example, on the LastName of a Person even though the primary 
key is ID. These custom indexes speed up row lookups. PostgreSQL currently 
supports four index types: B-tree, GiST, GIN, and hash. Indexes of the first 
three types are WAL-recoverable, but hash indexes are not.
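
For example (hypothetical table), a hash index is built, and for now must
be rebuilt after a crash, like this:

-- Hash indexes support only equality lookups.
CREATE TABLE person (id serial PRIMARY KEY, lastname text);
CREATE INDEX person_lastname_idx ON person USING hash (lastname);

-- Because hash index operations are not WAL-logged today,
-- the index may need a rebuild after a crash:
REINDEX INDEX person_lastname_idx;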

Proposal
As a GSoC student, I will implement WAL recovery of hash indexes using 
the other index types’ WAL code as a guide. Roughly, I will:
- Devise a way to store and retrieve hashing data within the XLog data 
structures. 
- In the existing skeleton for hash_redo(XLogRecPtr lsn, XLogRecord *record) in 
hash.c, branch to code for the various redo operations: creating an index, 
inserting into an index, deleting an index, and page operations (split, delete, 
update?).
- Code each branch by drawing on examples from btree_redo, gin_redo, and 
gist_redo, the existing XLog code of the other index types.

Benefits
Hash index searching is O(1), which is asymptotically faster than the O(log n) 
searching of a B-tree, and does not require custom indexing functions like GIN 
and GiST inherently do. Therefore it is desirable for rows that will only be 
retrieved through equality conditions. However, two things currently 
stand in the way of its popular use. From the PostgreSQL documentation,
“Hash index operations are not presently WAL-logged, so hash indexes 
might need to be rebuilt with REINDEX after a database crash if there were 
unwritten changes. Also, changes to hash indexes are not replicated over 
streaming or file-based replication after the initial base backup, so they give 
wrong answers to queries that subsequently use them. For these reasons, hash 
index use is presently discouraged.”
My project would solve the first problem, after which I would like to stay on 
and fix the second.

To be written: Quantifiable Results, Schedule, Completeness Criteria, Bio


On Feb 28, 2014, at 6:21 AM, Greg Stark st...@mit.edu wrote:

 On Tue, Jan 28, 2014 at 5:34 PM, Thom Brown t...@linux.com wrote:
 Who would be up for mentoring this year?  And are there any project
 ideas folk would like to suggest?
 
 I mentored in the past and felt I didn't do a very good job because I
 didn't really understand the project the student was working on.
 
 There's precisely one project that I feel I would be competent to
 mentor at this point. Making hash indexes WAL recoverable. This is
 something that's easy to define the scope of and easy to determine if
 the student is on track and easy to measure when finished. It's
 something where as far as I can tell all the mentor work will be
 purely technical advice.
 
 Also it's something the project really really needs and is perfectly
 sized for a GSOC project IMHO. Also it's a great project for a student
 who might be interested in working on Postgres in the future since it
 requires learning all our idiosyncratic build and source conventions
 but doesn't require huge or controversial architectural changes.
 
 I fear a number of items in the Wiki seem unrealistically large
 projects for GSOC IMNSHO.
 
 -- 
 greg
 
 



Re: [HACKERS] GSoC 2014 - mentors, students and admins

2014-02-28 Thread Greg Stark
On Tue, Jan 28, 2014 at 5:34 PM, Thom Brown t...@linux.com wrote:
 Who would be up for mentoring this year?  And are there any project
 ideas folk would like to suggest?

I mentored in the past and felt I didn't do a very good job because I
didn't really understand the project the student was working on.

There's precisely one project that I feel I would be competent to
mentor at this point. Making hash indexes WAL recoverable. This is
something that's easy to define the scope of and easy to determine if
the student is on track and easy to measure when finished. It's
something where as far as I can tell all the mentor work will be
purely technical advice.

Also it's something the project really really needs and is perfectly
sized for a GSOC project IMHO. Also it's a great project for a student
who might be interested in working on Postgres in the future since it
requires learning all our idiosyncratic build and source conventions
but doesn't require huge or controversial architectural changes.

I fear a number of items in the Wiki seem unrealistically large
projects for GSOC IMNSHO.

-- 
greg




Re: [HACKERS] GSoC 2014 - mentors, students and admins

2014-02-10 Thread Alexander Korotkov
Hi!

On Tue, Jan 28, 2014 at 9:34 PM, Thom Brown t...@linux.com wrote:

 And I'd be fine with being admin again this year, unless there's
 anyone else who would like to take up the mantle?


Thanks for your work. I would like to see you as admin this year again.

Who would be up for mentoring this year?  And are there any project
 ideas folk would like to suggest?


I would like to be mentor.

--
With best regards,
Alexander Korotkov.


Re: [HACKERS] GSoC 2014 - mentors, students and admins

2014-01-29 Thread Heikki Linnakangas

On 01/28/2014 07:34 PM, Thom Brown wrote:

And I'd be fine with being admin again this year, unless there's
anyone else who would like to take up the mantle?


Please do, thanks!


Who would be up for mentoring this year?


I can mentor.

- Heikki




[HACKERS] GSoC 2014 - mentors, students and admins

2014-01-28 Thread Thom Brown
Hi all,

Application to Google Summer of Code 2014 can be made as of next
Monday (3rd Feb), and then there will be a 12 day window in which to
submit an application.

I'd like to gauge interest from both mentors and students as to
whether we'll want to do this.

And I'd be fine with being admin again this year, unless there's
anyone else who would like to take up the mantle?

Who would be up for mentoring this year?  And are there any project
ideas folk would like to suggest?

Thanks

Thom

