Re: [HACKERS] GSoC 2014: Implementing clustering algorithms in MADlib
Hi Maxence,

This is really really good. Hopefully we will have a fruitful summer, and make MADlib a better product.

Thanks
Hai

-- *Pivotal http://www.gopivotal.com/* A new platform for a new era

On Sat, Apr 5, 2014 at 7:14 AM, Maxence Ahlouche maxence.ahlou...@gmail.com wrote:

Hi all,

2014-04-03 9:04 GMT+02:00 Hai Qian hq...@gopivotal.com:

Hi Maxence,

We are very glad that you are enthusiastic about MADlib. Your proposal looks very nice. The algorithms that you proposed will be very useful, and can serve as a good start. After learning about the C++ abstraction layer and testing suite, you should be able to quickly grasp how to program for MADlib. We can fill in the details of the plan later. The conflict in your time schedule shouldn't be a problem.

It would be nice if you could provide a little bit more introduction to these algorithms and their importance in the proposal. Overall it is very good.

And thank you again
Hai

As you requested, I've described the clustering algorithms on my github [0]. I'm copying it at the end of this email, just in case. In the case of OPTICS, I still need some time to grasp the details of how it works, so I've only included a basic description of that one.

Regards,
Maxence

[0] https://github.com/viodlen/gsoc_2014/blob/master/algorithm_details.rst

Details about the different clustering algorithms
=================================================

This file describes the k-means algorithm, which is currently the only clustering algorithm implemented in MADlib, and the k-medoids and OPTICS algorithms, which I plan to implement this summer as a GSoC project. I'll also explain the DBSCAN algorithm, as OPTICS is only an extension of it.

The goal of all these algorithms is to determine clusters in a set of points. While this problem is NP-hard, it can be solved efficiently thanks to these algorithms, even though the result is usually not perfect. I've tried not to add too many details, and taken some shortcuts, so there may be some inaccuracies.
I wanted to keep this readable (not sure that goal was achieved, though), but if anyone wants more detail, I'll gladly add it.

k-means
-------

The k-means algorithm is the one implemented in MADlib. It selects *k* points, either randomly from the dataset or set by the user, to use as initial centroids (centers of clusters). *k* must be defined by the user; the algorithm is unable to guess the number of clusters.

All the points in the dataset are then assigned to the nearest centroid, thus making *k* clusters. The next step is to compute the new centroids as the mean of all the points in a cluster, then reassign all the points to the new centroids, and start all over again until there is no change in cluster assignment.

This algorithm is usually pretty fast (even though it can be made to take exponential time to converge). Still, it is easy to get a wrong result because of local convergence, e.g. having one cluster split in two parts by the algorithm, or two clusters merged. It is also pretty sensitive to outliers (points that don't obviously belong to any cluster), and the final result depends greatly on the initial centroids. This algorithm doesn't work well with non-Euclidean distances, in general. Another thing to note is that k-means will result in Voronoi-cell-shaped clusters, which is not always what we want.

k-medoids
---------

The k-medoids algorithm works the same way as k-means, save for a few exceptions. With k-medoids, the centroids are always points of the dataset (and are then called medoids). The new medoids are the points in each cluster that minimize the sum of pairwise dissimilarities (in other terms, the point for which the average distance to the other points in the cluster is minimal). This makes the algorithm less sensitive to outliers than k-means. This algorithm is computationally more intensive than k-means, but still fast.
As with k-means, it is possible to run the algorithm several times with different initial centroids, and keep the best result (i.e. the one that minimizes the sum of distances from points to their centroids).

DBSCAN
------

DBSCAN (*Density-Based Spatial Clustering of Applications with Noise*) is a clustering algorithm based on the density of clusters. This addresses several limitations of k-means and k-medoids: it does not assign a cluster to outliers, and it allows the detection of weird-shaped clusters. Moreover, it doesn't need to be told the number of clusters.

A point is called dense if enough other points are close enough to it, where "enough other points" and "close enough" are defined by the parameters *min_points* and *epsilon*. *min_points* therefore represents the minimum number of points required to make a cluster, and *epsilon* is the maximum distance a point can be from a cluster to be considered part of this cluster. For every unassigned point, count
[HACKERS] GSoC 2014: Implementing clustering algorithms in MADlib
Hi all,

2014-04-03 9:04 GMT+02:00 Hai Qian hq...@gopivotal.com:

Hi Maxence,

We are very glad that you are enthusiastic about MADlib. Your proposal looks very nice. The algorithms that you proposed will be very useful, and can serve as a good start. After learning about the C++ abstraction layer and testing suite, you should be able to quickly grasp how to program for MADlib. We can fill in the details of the plan later. The conflict in your time schedule shouldn't be a problem.

It would be nice if you could provide a little bit more introduction to these algorithms and their importance in the proposal. Overall it is very good.

And thank you again
Hai

As you requested, I've described the clustering algorithms on my github [0]. I'm copying it at the end of this email, just in case. In the case of OPTICS, I still need some time to grasp the details of how it works, so I've only included a basic description of that one.

Regards,
Maxence

[0] https://github.com/viodlen/gsoc_2014/blob/master/algorithm_details.rst

Details about the different clustering algorithms
=================================================

This file describes the k-means algorithm, which is currently the only clustering algorithm implemented in MADlib, and the k-medoids and OPTICS algorithms, which I plan to implement this summer as a GSoC project. I'll also explain the DBSCAN algorithm, as OPTICS is only an extension of it.

The goal of all these algorithms is to determine clusters in a set of points. While this problem is NP-hard, it can be solved efficiently thanks to these algorithms, even though the result is usually not perfect. I've tried not to add too many details, and taken some shortcuts, so there may be some inaccuracies. I wanted to keep this readable (not sure that goal was achieved, though), but if anyone wants more detail, I'll gladly add it.

k-means
-------

The k-means algorithm is the one implemented in MADlib.
It selects *k* points, either randomly from the dataset or set by the user, to use as initial centroids (centers of clusters). *k* must be defined by the user; the algorithm is unable to guess the number of clusters.

All the points in the dataset are then assigned to the nearest centroid, thus making *k* clusters. The next step is to compute the new centroids as the mean of all the points in a cluster, then reassign all the points to the new centroids, and start all over again until there is no change in cluster assignment.

This algorithm is usually pretty fast (even though it can be made to take exponential time to converge). Still, it is easy to get a wrong result because of local convergence, e.g. having one cluster split in two parts by the algorithm, or two clusters merged. It is also pretty sensitive to outliers (points that don't obviously belong to any cluster), and the final result depends greatly on the initial centroids. This algorithm doesn't work well with non-Euclidean distances, in general. Another thing to note is that k-means will result in Voronoi-cell-shaped clusters, which is not always what we want.

k-medoids
---------

The k-medoids algorithm works the same way as k-means, save for a few exceptions. With k-medoids, the centroids are always points of the dataset (and are then called medoids). The new medoids are the points in each cluster that minimize the sum of pairwise dissimilarities (in other terms, the point for which the average distance to the other points in the cluster is minimal). This makes the algorithm less sensitive to outliers than k-means. This algorithm is computationally more intensive than k-means, but still fast.

As with k-means, it is possible to run the algorithm several times with different initial centroids, and keep the best result (i.e. the one that minimizes the sum of distances from points to their centroids).
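As a concrete (if simplified) illustration of the two algorithms described above, here is a toy Python sketch of the k-means loop plus the k-medoids update step. This is editorial illustration only: MADlib's real implementation runs in-database (SQL plus the C++ abstraction layer), and all names below are mine.

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two points (tuples)
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)          # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: every point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        # update step: each centroid becomes the mean of its cluster's points
        new = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:                      # nothing moved: converged
            break
        centroids = new
    return centroids, clusters

def medoid(cluster):
    # k-medoids update step: the cluster point with the minimal summed
    # distance to all the other points of the cluster
    return min(cluster, key=lambda m: sum(dist2(m, p) ** 0.5 for p in cluster))
```

Swapping `mean` for `medoid` in the update step (so centroids stay points of the dataset) turns the same loop into the k-medoids algorithm described above.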
DBSCAN
------

DBSCAN (*Density-Based Spatial Clustering of Applications with Noise*) is a clustering algorithm based on the density of clusters. This addresses several limitations of k-means and k-medoids: it does not assign a cluster to outliers, and it allows the detection of weird-shaped clusters. Moreover, it doesn't need to be told the number of clusters.

A point is called dense if enough other points are close enough to it, where "enough other points" and "close enough" are defined by the parameters *min_points* and *epsilon*. *min_points* therefore represents the minimum number of points required to make a cluster, and *epsilon* is the maximum distance a point can be from a cluster to be considered part of this cluster.

For every unassigned point, count its *epsilon*-neighbours (not sure this term exists, but I'll use it for convenience). If there are too few, consider this point an outlier; else, create a new cluster and put the current point in it, along with its neighbours, the neighbours of its neighbours, and so on.

The main problem with DBSCAN is that it doesn't work well for clusters with very different
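The expansion procedure just described can be sketched in a few lines of Python (a toy, single-machine illustration; the names and the quadratic neighbour search are mine, not MADlib's):

```python
def dbscan(points, eps, min_pts, dist):
    labels = {}                    # point -> cluster id, or None for noise
    cid = 0
    for p in points:
        if p in labels:
            continue
        nbrs = [q for q in points if dist(p, q) <= eps]
        if len(nbrs) < min_pts:
            labels[p] = None       # provisionally an outlier
            continue
        cid += 1                   # p is dense: start a new cluster
        labels[p] = cid
        seeds = [q for q in nbrs if q != p]
        while seeds:
            q = seeds.pop()
            if q in labels:
                if labels[q] is None:      # noise becomes a border point
                    labels[q] = cid
                continue
            labels[q] = cid
            q_nbrs = [r for r in points if dist(q, r) <= eps]
            if len(q_nbrs) >= min_pts:     # q is dense too: keep expanding
                seeds.extend(r for r in q_nbrs if r not in labels)
    return labels
```

Points whose label stays `None` are the outliers the text mentions; everything reachable through chains of dense points ends up in the same cluster.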
Re: [HACKERS] GSoC 2014 proposal
On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov aekorot...@gmail.com wrote:

On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

The BIRCH algorithm as described in the paper describes building a tree in memory. If I understood correctly, you're suggesting to use a pre-built GiST index instead. Interesting idea! There are a couple of significant differences between the CF tree described in the paper and GiST:

1. In GiST, a leaf item always represents one heap tuple. In the CF tree, a leaf item represents a cluster, which consists of one or more tuples. So the CF tree doesn't store an entry for every input tuple, which makes it possible to keep it in memory.

2. In the CF tree, all entries in a leaf node must satisfy a threshold requirement, with respect to a threshold value T: the diameter (or radius) has to be less than T. GiST imposes no such restrictions. An item can legally be placed anywhere in the tree; placing it badly will just lead to degraded search performance, but it's still a legal GiST tree.

3. A GiST index, like any other index in PostgreSQL, holds entries also for deleted tuples, until the index is vacuumed. So you cannot just take information from a non-leaf node and use it in the result, as the information summarized at a non-leaf level includes noise from the dead tuples.

Can you elaborate how you are planning to use a GiST index to implement BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more strict about where in the tree an item can be stored, and lets the operator class specify exactly when a node is split etc.

Hmmm, it's likely I've imagined something quite outside of this paper, and even already suggested it to Ivan... :) I need a little time to rethink it.

Using GiST we can implement BIRCH-like clustering like so:

1) Build a CF tree as a GiST index without the restriction of the T threshold value.

2) Scan the CF tree with threshold T using some auxiliary operator.
If the consistent method sees a CF entry whose diameter is greater than T, it returns true. Otherwise it returns false and puts this CF entry into the output area (which could be either in-memory or a temporary table).

3) Process the other steps of the algorithm as usual.

This modification would have the following advantages:

1) The user can build the GiST index once and then try clustering with different parameters. The initial GiST index build would be the slowest operation, while the other steps are expected to be fast.

2) We use the GiST infrastructure and automatically get buffering build.

The drawback is that building the GiST index is more expensive than building an in-memory CF tree with a given threshold T (assuming T is well chosen). Does it make any sense?

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] GSoC 2014 proposal
On 04/03/2014 04:15 PM, Alexander Korotkov wrote:

On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov aekorot...@gmail.com wrote:

On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

The BIRCH algorithm as described in the paper describes building a tree in memory. If I understood correctly, you're suggesting to use a pre-built GiST index instead. Interesting idea! There are a couple of significant differences between the CF tree described in the paper and GiST:

1. In GiST, a leaf item always represents one heap tuple. In the CF tree, a leaf item represents a cluster, which consists of one or more tuples. So the CF tree doesn't store an entry for every input tuple, which makes it possible to keep it in memory.

2. In the CF tree, all entries in a leaf node must satisfy a threshold requirement, with respect to a threshold value T: the diameter (or radius) has to be less than T. GiST imposes no such restrictions. An item can legally be placed anywhere in the tree; placing it badly will just lead to degraded search performance, but it's still a legal GiST tree.

3. A GiST index, like any other index in PostgreSQL, holds entries also for deleted tuples, until the index is vacuumed. So you cannot just take information from a non-leaf node and use it in the result, as the information summarized at a non-leaf level includes noise from the dead tuples.

Can you elaborate how you are planning to use a GiST index to implement BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more strict about where in the tree an item can be stored, and lets the operator class specify exactly when a node is split etc.

Hmmm, it's likely I've imagined something quite outside of this paper, and even already suggested it to Ivan... :) I need a little time to rethink it.

Using GiST we can implement BIRCH-like clustering like so:

1) Build a CF tree as a GiST index without the restriction of the T threshold value.
2) Scan the CF tree with threshold T using some auxiliary operator. If the consistent method sees a CF entry whose diameter is greater than T, it returns true. Otherwise it returns false and puts this CF entry into the output area (which could be either in-memory or a temporary table).

3) Process the other steps of the algorithm as usual.

I still don't understand how that would work. You can't trust the non-leaf entries, because their CF value can contain deleted entries. So you have to scan every tuple anyway. Once you do that, what's the point of the index?

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GSoC 2014 proposal
On Thu, Apr 3, 2014 at 11:21 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 04/03/2014 04:15 PM, Alexander Korotkov wrote:

On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov aekorot...@gmail.com wrote:

On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

The BIRCH algorithm as described in the paper describes building a tree in memory. If I understood correctly, you're suggesting to use a pre-built GiST index instead. Interesting idea! There are a couple of significant differences between the CF tree described in the paper and GiST:

1. In GiST, a leaf item always represents one heap tuple. In the CF tree, a leaf item represents a cluster, which consists of one or more tuples. So the CF tree doesn't store an entry for every input tuple, which makes it possible to keep it in memory.

2. In the CF tree, all entries in a leaf node must satisfy a threshold requirement, with respect to a threshold value T: the diameter (or radius) has to be less than T. GiST imposes no such restrictions. An item can legally be placed anywhere in the tree; placing it badly will just lead to degraded search performance, but it's still a legal GiST tree.

3. A GiST index, like any other index in PostgreSQL, holds entries also for deleted tuples, until the index is vacuumed. So you cannot just take information from a non-leaf node and use it in the result, as the information summarized at a non-leaf level includes noise from the dead tuples.

Can you elaborate how you are planning to use a GiST index to implement BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more strict about where in the tree an item can be stored, and lets the operator class specify exactly when a node is split etc.

Hmmm, it's likely I've imagined something quite outside of this paper, and even already suggested it to Ivan... :) I need a little time to rethink it.
Using GiST we can implement BIRCH-like clustering like so:

1) Build a CF tree as a GiST index without the restriction of the T threshold value.

2) Scan the CF tree with threshold T using some auxiliary operator. If the consistent method sees a CF entry whose diameter is greater than T, it returns true. Otherwise it returns false and puts this CF entry into the output area (which could be either in-memory or a temporary table).

3) Process the other steps of the algorithm as usual.

I still don't understand how that would work. You can't trust the non-leaf entries, because their CF value can contain deleted entries. So you have to scan every tuple anyway. Once you do that, what's the point of the index?

Yeah, that is a limitation of this idea. It's not going to be an auto-updatable CF. The user can build the index on a freshly vacuumed table and play with clustering for some time. Updates on the table during that time would either be allowed as clustering error or prohibited. Another potential solution is to let this index hold some snapshot of the data. But that doesn't seem possible to do in an extension now.

--
With best regards,
Alexander Korotkov.
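For illustration, steps 1-2 of this proposal can be modelled in miniature (a toy Python model under my own assumptions: each node carries a CF triple (N, LS, SS), the diameter is derived from it, and a flat recursion stands in for the GiST scan; in a real implementation this logic would live in a consistent function written in C):

```python
import math

# Each tree node is (cf, children), where cf = (N, LS, SS) with LS a
# vector and SS a scalar, and children is empty for a leaf entry.

def diameter(cf):
    # average pairwise distance of the summarized points, from CF alone:
    # D = sqrt((2*N*SS - 2*||LS||^2) / (N*(N-1)))
    n, ls, ss = cf
    if n < 2:
        return 0.0
    ls2 = sum(x * x for x in ls)
    return math.sqrt((2 * n * ss - 2 * ls2) / (n * (n - 1)))

def scan(node, t, out):
    cf, children = node
    if diameter(cf) > t and children:
        # entry too coarse: descend, like consistent() returning true
        for child in children:
            scan(child, t, out)
    else:
        # entry tight enough: emit it to the output area as one subcluster
        out.append(cf)
    return out
```

Scanning the same tree with a larger T emits coarser subclusters without rebuilding anything, which is the advantage claimed in point 1.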
Re: [HACKERS] GSoC 2014 proposal
On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

The BIRCH algorithm as described in the paper describes building a tree in memory. If I understood correctly, you're suggesting to use a pre-built GiST index instead. Interesting idea! There are a couple of significant differences between the CF tree described in the paper and GiST:

1. In GiST, a leaf item always represents one heap tuple. In the CF tree, a leaf item represents a cluster, which consists of one or more tuples. So the CF tree doesn't store an entry for every input tuple, which makes it possible to keep it in memory.

2. In the CF tree, all entries in a leaf node must satisfy a threshold requirement, with respect to a threshold value T: the diameter (or radius) has to be less than T. GiST imposes no such restrictions. An item can legally be placed anywhere in the tree; placing it badly will just lead to degraded search performance, but it's still a legal GiST tree.

3. A GiST index, like any other index in PostgreSQL, holds entries also for deleted tuples, until the index is vacuumed. So you cannot just take information from a non-leaf node and use it in the result, as the information summarized at a non-leaf level includes noise from the dead tuples.

Can you elaborate how you are planning to use a GiST index to implement BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more strict about where in the tree an item can be stored, and lets the operator class specify exactly when a node is split etc.

Hmmm, it's likely I've imagined something quite outside of this paper, and even already suggested it to Ivan... :) I need a little time to rethink it.

--
With best regards,
Alexander Korotkov.
Re: [HACKERS] GSoC 2014 proposal
On 03/30/2014 11:50 PM, Иван Парфилов wrote:

The implementation of this algorithm would be for data type cube and based on GiST. The key concept of the BIRCH algorithm is the clustering feature. Given a set of N d-dimensional data points, the clustering feature CF of the set is defined as the triple CF = (N, LS, SS), where LS is the linear sum and SS is the square sum of the data points. Clustering features are organized in a CF tree, which is a height-balanced tree with two parameters: branching factor B and threshold T.

Because the structure of the CF tree is similar to a B+-tree, we can use GiST for the implementation [2]. GiST is a balanced tree structure like a B-tree, containing (key, pointer) pairs. A GiST key is a member of a user-defined class, and represents some property that is true of all data items reachable from the pointer associated with the key. GiST makes it possible to create custom data types with indexed access methods and an extensible set of queries.

The BIRCH algorithm as described in the paper describes building a tree in memory. If I understood correctly, you're suggesting to use a pre-built GiST index instead. Interesting idea! There are a couple of significant differences between the CF tree described in the paper and GiST:

1. In GiST, a leaf item always represents one heap tuple. In the CF tree, a leaf item represents a cluster, which consists of one or more tuples. So the CF tree doesn't store an entry for every input tuple, which makes it possible to keep it in memory.

2. In the CF tree, all entries in a leaf node must satisfy a threshold requirement, with respect to a threshold value T: the diameter (or radius) has to be less than T. GiST imposes no such restrictions. An item can legally be placed anywhere in the tree; placing it badly will just lead to degraded search performance, but it's still a legal GiST tree.

3. A GiST index, like any other index in PostgreSQL, holds entries also for deleted tuples, until the index is vacuumed.
So you cannot just take information from a non-leaf node and use it in the result, as the information summarized at a non-leaf level includes noise from the dead tuples.

Can you elaborate how you are planning to use a GiST index to implement BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more strict about where in the tree an item can be stored, and lets the operator class specify exactly when a node is split etc.

We need to implement these to create a GiST-based CF tree to use in the BIRCH algorithm.

*Example of usage (approximate):*

create table cube_test (v cube);
insert into cube_test values
  (cube(array[1.2, 0.4])),
  (cube(array[0.5, -0.2])),
  (cube(array[0.6, 1.0])),
  (cube(array[1.0, 0.6]));
create index gist_cf on cube_test using gist(v);

-- Prototype (approximate)
-- birch(maxNodeEntries, distThreshold, distFunction)
SELECT birch(4.1, 0.2, 1) FROM cube_test;

 cluster | val1 | val2
---------+------+------
       1 |  1.2 |  0.4
       0 |  0.5 | -0.2
       1 |  0.6 |  1.0
       1 |  1.0 |  0.6

Accordingly, in this GSoC project the BIRCH algorithm for data type cube would be implemented.

From the example, it seems that birch(...) would be an aggregate function. Aggregates in PostgreSQL currently work by scanning all the input data. That would certainly be a pretty straightforward way to implement BIRCH too. Every input tuple would be passed to the so-called transition function (which you would write), which would construct a CF tree on the fly. At the end, the result would be constructed from the CF tree. With this approach, the CF tree would be kept in memory, and thrown away after the query. That would be straightforward, but wouldn't involve GiST at all.

To use an index to implement an aggregate would require planner/executor changes. That would be interesting, but offhand I have no idea what that would look like. We'll need more details on that.
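The aggregate approach sketched here can be illustrated with a toy model: a transition function folds one input value into an in-memory state, and a final function turns the state into the result, mirroring the sfunc/finalfunc pair of CREATE AGGREGATE. Everything below is hypothetical Python of my own (a flat list of (N, LS) entries stands in for the CF tree; the real functions would be written in C):

```python
def dist(a, b):
    # Euclidean distance between two points (tuples)
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def birch_transition(state, point, threshold):
    # Absorb one input tuple: join the nearest entry whose centroid is
    # within `threshold`, or open a new entry.  Each entry is (N, LS).
    for i, (n, ls) in enumerate(state):
        centroid = tuple(x / n for x in ls)
        if dist(centroid, point) <= threshold:
            state[i] = (n + 1, tuple(x + y for x, y in zip(ls, point)))
            return state
    state.append((1, tuple(point)))
    return state

def birch_final(state):
    # Report each subcluster as (size, centroid).
    return [(n, tuple(x / n for x in ls)) for n, ls in state]
```

As Heikki notes, the state lives only for the duration of the query: the transition function sees every tuple once, and nothing involves GiST.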
- Heikki
Re: [HACKERS] GSoC 2014 proposal
On 03/30/2014 11:50 PM, Иван Парфилов wrote:

*Quantifiable results*

Adding support of BIRCH algorithm for data type cube

Aside from the details of *how* that would work, the other question is: do we want this in contrib/cube? There are currently no clustering functions, or any other statistical functions or similar, in contrib/cube. Just basic contains/contained/overlaps operators, and B-tree comparison operators, which are pretty useless for cube. Do we want to start adding such features to cube, in contrib? Or should that live outside the PostgreSQL source tree, in a separate extension, so that it could live on its own release schedule, etc.? If BIRCH goes into contrib/cube, that's an invitation to add all kinds of functions to it.

We received another GSoC application to add another clustering algorithm to the MADlib project. MADlib is an extension to PostgreSQL with a lot of different statistical tools, so MADlib would be a natural home for BIRCH too. But if it requires backend changes (i.e. changes to GiST), then that needs to be discussed on pgsql-hackers, and it probably would be better to do a reference implementation in contrib/cube. MADlib could later copy it from there.

- Heikki
[HACKERS] GSoC 2014 proposal
Hello, hackers! This is my GSoC proposal.

*Short description:*

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. The purpose of this project is to add support of the BIRCH (balanced iterative reducing and clustering using hierarchies) algorithm [1] for the data type cube.

*Benefits to the PostgreSQL Community*

Support of the BIRCH algorithm for data type cube would be useful for many PostgreSQL applications (for example, to solve the data clustering problem for high-dimensional datasets and for large datasets).

*Quantifiable results*

Adding support of the BIRCH algorithm for data type cube.

*Project Details*

BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large datasets. The BIRCH algorithm of Zhang [1] was developed to handle massive datasets that are too large to be contained in main memory (RAM). To minimize I/O costs, every datum is read once and only once. BIRCH transforms the data set into compact, locally similar subclusters, each with summary statistics attached (called clustering features). Then, instead of using the full data set, these summary statistics can be used. This approach is most advantageous in two situations: when the data cannot be loaded into memory due to its size; and/or when some form of combinatorial optimization is required and the size of the solution space makes finding global maxima/minima difficult.
Key properties of the BIRCH algorithm:
- a single scan of the dataset is enough;
- I/O cost minimization: data is organized in an in-memory, height-balanced tree;
- each clustering decision is made without scanning all the points or clusters.

The implementation of this algorithm would be for data type cube and based on GiST. The key concept of the BIRCH algorithm is the clustering feature. Given a set of N d-dimensional data points, the clustering feature CF of the set is defined as the triple CF = (N, LS, SS), where LS is the linear sum and SS is the square sum of the data points. Clustering features are organized in a CF tree, which is a height-balanced tree with two parameters: branching factor B and threshold T.

Because the structure of the CF tree is similar to a B+-tree, we can use GiST for the implementation [2]. GiST is a balanced tree structure like a B-tree, containing (key, pointer) pairs. A GiST key is a member of a user-defined class, and represents some property that is true of all data items reachable from the pointer associated with the key. GiST makes it possible to create custom data types with indexed access methods and an extensible set of queries.

There are seven methods that an index operator class for GiST must provide, and an eighth that is optional:
- consistent
- union
- compress
- decompress
- penalty
- picksplit
- equal
- distance (optional)

We need to implement these to create a GiST-based CF tree to use in the BIRCH algorithm.

*Example of usage (approximate):*

create table cube_test (v cube);
insert into cube_test values
  (cube(array[1.2, 0.4])),
  (cube(array[0.5, -0.2])),
  (cube(array[0.6, 1.0])),
  (cube(array[1.0, 0.6]));
create index gist_cf on cube_test using gist(v);

-- Prototype (approximate)
-- birch(maxNodeEntries, distThreshold, distFunction)
SELECT birch(4.1, 0.2, 1) FROM cube_test;

 cluster | val1 | val2
---------+------+------
       1 |  1.2 |  0.4
       0 |  0.5 | -0.2
       1 |  0.6 |  1.0
       1 |  1.0 |  0.6

Accordingly, in this GSoC project the BIRCH algorithm for data type cube would be implemented.
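The clustering feature defined above is cheap to maintain because CFs are additive: the CF of the union of two subclusters is the component-wise sum of their CFs, and the centroid and radius fall out of (N, LS, SS) directly. A small Python sketch for illustration (names are mine; in the proposed design this arithmetic would sit behind the GiST union/penalty methods over cube keys):

```python
def cf(points):
    # CF = (N, LS, SS): count, linear sum (a vector), square sum (a scalar)
    n = len(points)
    ls = tuple(sum(xs) for xs in zip(*points))
    ss = sum(x * x for p in points for x in p)
    return n, ls, ss

def cf_merge(a, b):
    # additivity: CF(A ∪ B) = CF(A) + CF(B), component-wise
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            a[2] + b[2])

def cf_centroid(c):
    n, ls, _ = c
    return tuple(x / n for x in ls)

def cf_radius(c):
    # R = sqrt(SS/N - ||LS/N||^2): RMS distance of the points to the centroid
    n, ls, ss = c
    return (ss / n - sum((x / n) ** 2 for x in ls)) ** 0.5
```

Because merging never needs the original points, a node can summarize an arbitrarily large subtree in constant space, which is what lets the CF tree fit in memory.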
*Inch-stones*

1) Solve architecture questions with the help of the community.
2) First, approximate implementation (implement distance methods, implement GiST interface methods, implement the BIRCH algorithm for data type cube).
3) Approximate implementation evaluation.
4) Final refactoring, documentation, testing.

*Project Schedule*

until May 19: Solve architecture questions with the help of the community.
20 May - 27 June: First, approximate implementation.
28 June - 11 August: Approximate implementation evaluation. Fixing bugs and performance testing.
August 11 - August 18: Final refactoring, write tests, improve documentation.

*Completeness Criteria*

Support of the BIRCH algorithm for data type cube is implemented and working.

*Links*

1) http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
2) http://www.postgresql.org/docs/9.1/static/gist-implementation.html

With best regards,
Ivan Parfilov.
Re: [HACKERS] GSoC 2014 - mentors, students and admins
Hi all,

There is 1 day left to get submissions in, so students should ensure that they submit their proposals as soon as possible. No submissions will be accepted beyond the deadline of 19:00 UTC tomorrow (Friday 21st March).

Regards
Thom
[HACKERS] GSoC 2014
Hello all, I am Ashutosh Dhundhara from Thapat University, Patiala-India presently pursuing Bachelors degree in Computer Science and Engineering. This year I wish to work for PostgreSQL under the flagship of GSoC 2014. So please help regarding this. I have a few questions : 1) Do I have to choose all ideas from the GSoC wiki page or any one of them ? 2) What is the deadline for fixing bugs which will account for selection procedure ? Please guide me on how to proceed. Regards, Ashutosh Dhundhara
Re: [HACKERS] GSoC 2014
On Wed, Mar 12, 2014 at 8:05 PM, Ashutosh Dhundhara ashutoshdhundh...@yahoo.com wrote:

Hello all, I am Ashutosh Dhundhara from Thapat University, Patiala-India presently pursuing Bachelors degree in Computer Science and Engineering. This year I wish to work for PostgreSQL under the flagship of GSoC 2014. So please help regarding this. I have a few questions : 1) Do I have to choose all ideas from the GSoC wiki page or any one of them ? 2) What is the deadline for fixing bugs which will account for selection procedure ? Please guide me on how to proceed.

You can propose your own ideas as well. You can pick any number of ideas from the GSoC 2014 wiki page and send proposals for them. The deadline for proposals is next Friday, I believe.

Regards,
Atri

--
Regards,
Atri
*l'apprenant*
Re: [HACKERS] GSoC 2014 - mentors, students and admins
All students and mentors (and backup mentors) should now register for this year's GSoC. Students only have until Friday next week (up until 21st March 19:00 UTC) to apply. Thanks Thom
Re: [HACKERS] GSoC 2014 - mentors, students and admins
Hi Greg, pgsql-advocacy, and pgsql-hackers, I'm interested in doing my GSoC project on this idea. I'm new to indexing and WAL, which I haven't encountered in my classes, but it sounds interesting and valuable to PostgreSQL. So here's my draft proposal. Do you mind giving your opinion and corrections? With your help I'll add some technical detail to my plans. Thanks, Tan Tran

1. Introduction
In write-ahead logging (WAL), all modifications to a database are written to a write-ahead log before being flushed to disk at periodic checkpoints. This method saves I/O operations, enables continuous backup, and, in the case of database failure, guarantees that the database can be restored to a consistent state by replaying the log from the last saved checkpoint. In PostgreSQL's implementation, transactions are written to XLog, which is divided into 16MB files ("segments") that together comprise a complete history of transactions. Transactions are continually appended to the latest segment, while checkpointing continually archives segments up to the last checkpoint. Internally, a suite of XLog structures and functions interfaces with the various resource managers so they can log enough data to restore their state ("redo") in case of failure.

Another PostgreSQL feature is the creation of indexes on an invariant custom field; for example, on the LastName of a Person even though the primary key is ID. These custom indexes speed up row lookup. PostgreSQL currently supports four index types: B-tree, GiST, GIN, and hash. The first three are WAL-recoverable, but hash indexes are not.

2. Proposal
As a GSoC student, I will implement WAL recovery of hash indexes, using the other index types' WAL code as a guide. Roughly, I will:
- Devise a way to store and retrieve hashing data within the XLog data structures.
- In the existing skeleton for hash_redo(XLogRecPtr lsn, XLogRecord *record) in hash.c, branch to code for the various redo operations: creating an index, inserting into an index, deleting an index, and page operations (split, delete, update?).
- Code each branch by drawing on examples from btree_redo, gin_redo, and gist_redo, the existing XLog code of the other index types.

3. Benefits
Hash index lookup is expected O(1), which is asymptotically faster than the O(lg n) lookup of a B-tree, and does not require the custom operator support that GIN and GiST inherently do. Hash indexes are therefore desirable for rows that will only be retrieved by equality comparisons. However, two things currently stand in the way of their popular use. From the PostgreSQL documentation: "Hash index operations are not presently WAL-logged, so hash indexes might need to be rebuilt with REINDEX after a database crash if there were unwritten changes. Also, changes to hash indexes are not replicated over streaming or file-based replication after the initial base backup, so they give wrong answers to queries that subsequently use them. For these reasons, hash index use is presently discouraged." My project would solve the first problem, after which I would like to stay on and fix the second.

To be written: Quantifiable Results, Schedule, Completeness Criteria, Bio

On Feb 28, 2014, at 6:21 AM, Greg Stark st...@mit.edu wrote: On Tue, Jan 28, 2014 at 5:34 PM, Thom Brown t...@linux.com wrote: Who would be up for mentoring this year? And are there any project ideas folk would like to suggest? I mentored in the past and felt I didn't do a very good job because I didn't really understand the project the student was working on. There's precisely one project that I feel I would be competent to mentor at this point. Making hash indexes WAL recoverable. This is something that's easy to define the scope of and easy to determine if the student is on track and easy to measure when finished.
It's something where as far as I can tell all the mentor work will be purely technical advice. Also it's something the project really really needs and is perfectly sized for a GSOC project IMHO. Also it's a great project for a student who might be interested in working on Postgres in the future since it requires learning all our idiosyncratic build and source conventions but doesn't require huge or controversial architectural changes. I fear a number of items in the Wiki seem unrealistically large projects for GSOC IMNSHO. -- greg
Re: [HACKERS] GSoC 2014 - mentors, students and admins
On Tue, Jan 28, 2014 at 5:34 PM, Thom Brown t...@linux.com wrote: Who would be up for mentoring this year? And are there any project ideas folk would like to suggest? I mentored in the past and felt I didn't do a very good job because I didn't really understand the project the student was working on. There's precisely one project that I feel I would be competent to mentor at this point. Making hash indexes WAL recoverable. This is something that's easy to define the scope of and easy to determine if the student is on track and easy to measure when finished. It's something where as far as I can tell all the mentor work will be purely technical advice. Also it's something the project really really needs and is perfectly sized for a GSOC project IMHO. Also it's a great project for a student who might be interested in working on Postgres in the future since it requires learning all our idiosyncratic build and source conventions but doesn't require huge or controversial architectural changes. I fear a number of items in the Wiki seem unrealistically large projects for GSOC IMNSHO. -- greg
Re: [HACKERS] GSoC 2014 - mentors, students and admins
Hi! On Tue, Jan 28, 2014 at 9:34 PM, Thom Brown t...@linux.com wrote: And I'd be fine with being admin again this year, unless there's anyone else who would like to take up the mantle? Thanks for your work. I would like to see you as admin again this year. Who would be up for mentoring this year? And are there any project ideas folk would like to suggest? I would like to be a mentor. -- With best regards, Alexander Korotkov.
Re: [HACKERS] GSoC 2014 - mentors, students and admins
On 01/28/2014 07:34 PM, Thom Brown wrote: And I'd be fine with being admin again this year, unless there's anyone else who would like to take up the mantle? Please do, thanks! Who would be up for mentoring this year? I can mentor. - Heikki
[HACKERS] GSoC 2014 - mentors, students and admins
Hi all, Applications to Google Summer of Code 2014 can be made as of next Monday (3rd Feb), and then there will be a 12-day window in which to submit an application. I'd like to gauge interest from both mentors and students as to whether we'll want to do this. And I'd be fine with being admin again this year, unless there's anyone else who would like to take up the mantle? Who would be up for mentoring this year? And are there any project ideas folk would like to suggest? Thanks Thom