[HACKERS] [GSoC] kmedoids status report

2014-08-07 Thread Maxence Ahlouche
Hi!

Here is a report of what has been discussed yesterday on IRC.

The kmedoids module now seems to work correctly on basic datasets. I've
also implemented its variants with different seeding methods: random
initial medoids, and initial medoids distributed among the points (similar
to kmeans++ [0]).

The next steps are:

   - Making better tests (1-2d)
   - Writing the documentation (1d)
   - Adapting my code to GP and HAWQ -- btw, are default parameters now
   available in GP and HAWQ? (1-2d)
   - Refactoring kmedoids and kmeans, as there is code duplication between
   those two.
   For this step, I don't know if I'll have time to create a clustering
   module, and make kmeans and kmedoids submodules of it. If yes, then it's
   perfect; otherwise, I'll just rename the common functions in kmeans, and
   have kmedoids call them from there.

Hai also helped me setup (once more) the VM where GreenPlum and HAWQ are
installed, so that I can test my code on these DBMS.

As a reminder, I'm supposed to stop coding next Monday, and then the last
week is dedicated to documentation, tests, refactoring and polishing.

Regards,

Maxence

[0] https://en.wikipedia.org/wiki/K-means%2B%2B


Re: [HACKERS] [GSoC] Clustering in MADlib - status update

2014-06-22 Thread Maxence Ahlouche
Hi!

Here's my report for week 5.

Week 5 - 2014/06/22

This week has been full of debugging of the main SQL function. The previous
week, I had been able to come up with a working function to compute a
medoid for a given group of points, but since then I've struggled to
integrate it with the rest of the SQL. Some errors were trivial (for
example some parameters that I had written with underscores instead of
using camelCase - Hai spotted this one, I think i'd never have found it by
myself), others less so. But it's coming!

According to the timeline I had planned at the beginning on the project,
I'm definitely late. The module I'm still writing should have been finished
last week, and it's not even working yet. It seems I've been far too
optimist in this timeline. For the second step, as I'll have less time than
expected, I'm thinking to switch from OPTICS to DBSCAN, which at least I
have fully understood (OPTICS is quite complicated). Is everyone ok with
this?

Next week is the evaluation week. Hopefully I'll be allowed to continue
working on this project, even though I haven't provided much result until
now :p As for me, I don't have to complain: I've always been provided
patience and clear answers to my questions. Only the phone calls didn't
turn as good as they sounded, but this problem will be fixed at our next
meeting, as we'll now use IRC!


Re: [HACKERS] [GSoC] Clustering in MADlib - status update

2014-06-15 Thread Maxence Ahlouche
Hi! Here is my report for the last two weeks.Weeks 3 and 4 - 2014/06/15

During my third week, I haven't had time to work on GSoC a lot, because of
my exams and my relocation (that's why I didn't deem necessary to post a
report last Sunday). But last week has been much more productive, as I am
now working full time!

I have developped an aggregate that computes the sum of pairwise
dissimilarities in a cluster, for a given medoid. Thanks to Hai and Atri, I
have also developped the main SQL function that actually computes the
k-medoids. This function is still under debugging, so I have not committed
it yet.

According to my planning, I am not on time: I should have finished working
on k-medoids on Friday. When I made this timeline, I largely underestimated
the time needed to get started in this project, and overestimated the time
I thought I could spend on GSoC during my exams. But things will now go
much faster!

As for our weekly phone call, I have lots of difficulties understanding
what is said, partly because of me not being used to hearing english, but
mostly because of low quality sound. Last time, I hardly understood half of
what's been said; which is quite unfortunate, given that I'm supposed to
take advices during this phone call. So I'd like to suggest an alternative:
an IRC channel, for example. And for those who don't have an IRC client
ready: http://webchat.freenode.net/ . For example, the channel #gsoc-madlib
would surely be appropriate :) Also, I've had a change in my timetable,
which makes Tuesday inconvenient for this phone call. Is it possible to
change the day? I'm available at this hour on Monday, Wednesday and
Thursday. Of course, if this change annoys too much people, I'll deal with
Tuesday :)

Finally, for the coming week, I'll finish debugging k-medoids, write all
the secondary functions (e.g. random inital medoids), and write the doc.


Regards,

Maxence A.

-- 
Maxence Ahlouche
06 06 66 97 00


Re: [HACKERS] [GSoC] Clustering in MADlib - status update

2014-06-02 Thread Maxence Ahlouche
Hi!

2014-06-02 19:16 GMT+02:00 Hai Qian hq...@gopivotal.com:

 I like the second option for refactoring the code. I think it is doable.

 And where is your code on Github?


It's not on Github, but on my own Gitlab (a self-hosted open-source
alternative to github). You can find it here [0]. I'm using two repos: one
is a clone of madlib, the other contains my reports, my test script and
other stuff.

[0] http://git.viod.eu/public/

-- 
Maxence Ahlouche
06 06 66 97 00


Re: [HACKERS] [GSoC] Clustering in MADlib - status update

2014-06-01 Thread Maxence Ahlouche
Hi all!

I've pushed my report for this week on my repo [0]. Here is a copy!
Attached is the patch containing my work for this week.
Week 2 - 2014/01/01

This week, I have worked on the beginning of the kmedoids module.
Unfortunately, I was supposed to have something working for last Wednesday,
and it is still not ready, mostly because I've lost time this week by being
sick, and by packing all my stuff in preparation for relocation.

The good news now: this week is my last school (exam) week, and that means
full-time GSoC starting next Monday! Also, I've studied the kmeans module
quite thoroughly, and I can finally understand how it all goes on, at the
exception of one bit: the enormous SQL request used to update the
IterationController.

For kmedoids, I've abandoned the idea of making the loop by myself and have
decided instead to stick to copying kmeans as much as possible, as it seems
easier than doing it all by myself. The only part that remains to be
adapted is that big SQL query I haven't totally understood yet. I've asked
the help of Atri, but surely the help of an experienced MADlib hacker would
speed things up :) Atri and I would also like to deal with this through a
voip meeting, to ease communication. If anyone wants to join, you're
welcome!

As for the technology we'll use, I have a Mumble server running somewhere,
if that fits to everyone. Otherwise, suggest something!

I am available from Monday 2 at 3 p.m. (UTC) to Wednesday 4 at 10 a.m.
(exam weeks are quite light).

This week, I have also faced the first design decisions I have to make. For
kmedoids, the centroids are points of the dataset. So, if I wanted to
identify them precisely, I'd need to use their ids, but that would mean
having a prototype different than the kmeans one. So, for now, I've decided
to use the points coordinates only, hoping I will not run into trouble. If
I ever do, switching to ids should'nt be too hard. Also, if the user wants
to input initial medoids, he can input whatever points he wants, be they
part of the dataset or not. After the first iteration, the centroids will
anyway be points of the dataset (maybe I could just select the points
nearest to the coordinates they input as initial centroids).

Second, I'll need to refactor the code in kmeans and kmedoids, as these two
modules are very similar. There are several options for this:

   1. One big clustering module containing everything clustering-related
   (ugly but easy option);
   2. A clustering module and kmeans, kmedoids, optics, utils
   submodules (the best imo, but I'm not sure it's doable);
   3. A clustering_utils module at the same level as the others (less
   ugly than the first one, but easy too).

Any opinions?

Next week, I'll get a working kmedoids module, do some refactoring, and
then add the extra methods, similar to what's done in kmeans, for the
different seedings. Once that's done, I'll make it compatible with all
three ports (I'm currently producing Postgres-only code, as it's the
easiest for me to test), and write the tests and doc. The deadline for this
last step is in two weeks; I don't know yet if I'll be on time by then or
not. It will depend on how fast I can get kmedoids working, and how fast
I'll go once I'm full time GSoC.

Finally, don't hesitate to tell me if you think my decisions are wrong, I'm
glad to learn :)
[0] http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst

-- 
Maxence Ahlouche
06 06 66 97 00
diff --git a/src/config/Modules.yml b/src/config/Modules.yml
index bf48d82..8f3431f 100644
--- a/src/config/Modules.yml
+++ b/src/config/Modules.yml
@@ -20,6 +20,7 @@ modules:
   depends: ['svec_util']
 - name: kmeans
   depends: ['array_ops', 'svec_util', 'sample']
+- name: kmedoids
 - name: lda
   depends: ['array_ops']
 - name: linalg
diff --git a/src/ports/postgres/modules/kmedoids/__init__.py_in b/src/ports/postgres/modules/kmedoids/__init__.py_in
new file mode 100644
index 000..e69de29
diff --git a/src/ports/postgres/modules/kmedoids/kmedoids.py_in b/src/ports/postgres/modules/kmedoids/kmedoids.py_in
new file mode 100644
index 000..e6e6167
--- /dev/null
+++ b/src/ports/postgres/modules/kmedoids/kmedoids.py_in
@@ -0,0 +1,38 @@
+import plpy
+
+from utilities.validate_args import table_exists
+from utilities.validate_args import table_is_empty
+
+# --
+
+
+# TODO:refactor (directly copied from kmeans module)
+def kmedoids_validate_src(schema_madlib, rel_source, **kwargs):
+if rel_source is None or rel_source.strip().lower() in ('null', ''):
+plpy.error(kmeans error: Invalid data table name!)
+if not table_exists(rel_source):
+plpy.error(kmeans error: Data table does not exist!)
+if table_is_empty(rel_source):
+plpy.error(kmeans error: Data table is empty!)
+
+# --
+
+
+def compute_kmedoids(schema_madlib, rel_args

[HACKERS] [GSoC] Clustering in MADlib - status update

2014-05-25 Thread Maxence Ahlouche
Hi,

Here is my first report. You can also find it on my Gitlab [0].
Week 1 - 2014/05/25

For this first week, I have written a test script that generates some
simple datasets, and produces an image containing the output of the MADlib
clustering algorithms.

This script can be called like this:

./clustering_test.py new ds0 -n 8 # generates a dataset called ds0
with 8 clusters
./clustering_test.py query ds0 -o output.png # outputs the result of
the clustering algorithms applied to ds0 in output.png

See ./clustering_test.py -h for all the available options.

An example of output can be found here
[1].http://git.viod.eu/viod/gsoc_2014/blob/master/clustering_test/example_dataset.png

Of course, I will keep improving this test script, as it is still far from
perfect; but for now, it does approximately what I want.

For next week, I'll start working on the implementation of k-medoids in
MADlib. As a reminder, according to the timeline I suggested for the
project, this step must be done on May 30. Depending on the problems I will
face (mostly lack of knowledge of the codebase, I guess), this might not be
finished on time, but it should be done a few days later (by the end of
next week, hopefully).

Attached is the patch containing everything I have done this week, though
the git log might be more convenient to read.

Regards,

Maxence A.

[0] http://git.viod.eu/viod/gsoc_2014/blob/master/reports.rst
[1]
http://git.viod.eu/viod/gsoc_2014/blob/master/clustering_test/example_dataset.png

-- 
Maxence Ahlouche
06 06 66 97 00
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 000..97de20e
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+***/__pycache__/
+**.png
\ No newline at end of file
diff --git a/autogen_results.py b/autogen_results.py
deleted file mode 100755
index 033c309..000
--- a/autogen_results.py
+++ /dev/null
@@ -1,6 +0,0 @@
-#!/usr/bin/python
-
-import os
-
-while(True):
-os.system(./k-means_test.py --regen -o results/$(date | md5sum | cut -d ' ' -f 1).png)
diff --git a/clustering_test.py b/clustering_test.py
deleted file mode 100755
index 2afc0d1..000
--- a/clustering_test.py
+++ /dev/null
@@ -1,35 +0,0 @@
-#!/usr/bin/env python3
-
-import argparse
-import psycopg2 as pg
-import sys
-
-
-class DatabaseConnection():
-db_name = 'madlib'
-user = 'madlib'
-host = 'localhost'
-port = 5432
-table_name = 'tmp_points'
-field_name = 'coords'
-
-def __init__(self):
-self.conn = pg.connect(database=self.db_name, user=self.user, host=self.host, port=5432)
-self.cur = self.conn.cursor()
-self.cur.execute('DROP TABLE IF EXISTS %s CASCADE;' % self.table_name)
-self.cur.execute('CREATE TABLE %s (id SERIAL PRIMARY KEY, coords INT[]);' % self.table_name)
-self.conn.commit()
-
-def __del__(self):
-self.cur.close()
-self.conn.close()
-
-
-def main(args):
-parser = argparse.ArgumentParser(description='Visualize output of the clustering algorithms provided by MADlib, in PostgreSQL.')
-parser.add_argument('-n', metavar='number of clusters', type=int)
-
-dc = DatabaseConnection()
-
-if __name__ == '__main__':
-main(sys.argv[1:])
diff --git a/clustering_test/autogen_results.py b/clustering_test/autogen_results.py
new file mode 100755
index 000..033c309
--- /dev/null
+++ b/clustering_test/autogen_results.py
@@ -0,0 +1,6 @@
+#!/usr/bin/python
+
+import os
+
+while(True):
+os.system(./k-means_test.py --regen -o results/$(date | md5sum | cut -d ' ' -f 1).png)
diff --git a/clustering_test/clustering_test.py b/clustering_test/clustering_test.py
new file mode 100755
index 000..248b5cf
--- /dev/null
+++ b/clustering_test/clustering_test.py
@@ -0,0 +1,63 @@
+#!/usr/bin/env python3
+
+import argparse
+
+import database as db
+import dataset_generator as ds
+import visualiser as vs
+
+
+if __name__ == '__main__':
+parser = argparse.ArgumentParser(
+description='Visualize output of the clustering algorithms provided by '
+'MADlib, in PostgreSQL. You should start by adding a dataset. You need '
+'a PostgreSQL running.')
+subparsers = parser.add_subparsers(help='subparsers help', dest='action')
+
+parser_dataset = subparsers.add_parser('new', help='generate a dataset')
+parser_dataset.add_argument(
+'dataset_name',
+help='the name of the dataset to create',
+)
+parser_dataset.add_argument(
+'-n',
+'--nb_clusters',
+type=int,
+help='the number of clusters composing the new dataset. Defaults to a '
+'random value between 2 and 10.',
+)
+parser_dataset.add_argument(
+'-d',
+'--distribution',
+choices = ds.gen_cluster.keys(),
+help='the distribution of the points in the clusters. Defaults to '
+'uniform.',
+)
+
+parser_query = subparsers.add_parser('query', help='apply clustering algorithms on a dataset')
+parser_query.add_argument

[HACKERS] GSoC 2014: Implementing clustering algorithms in MADlib

2014-04-05 Thread Maxence Ahlouche
 densities.

OPTICS
--

OPTICS (*Ordering Points to Identify the Clustering Structure*)
addresses this issue.

The first step in OPTICS is to order the points so that two points
that are near each other will be in nearby positions, and store the
smallest reachability distance of each point (i.e. the distance at
which the point would be considered dense by DBSCAN).

The next step is to determine the clusters from the data obtained at
the previous step. This can be done visually: by plotting the ordered
points on *x* and their smallest reachability distance on *y*, the
clusters are the valleys formed by the curve.

I'll add more details to this section when I understand better how all
of this works.

Bibliography

http://en.wikipedia.org : where clustering algorithms are well
detailed

http://www.stat.cmu.edu/~ryantibs/datamining/lectures/04-clus1-marked.pdf
: introduction to k-means and k-medoids

http://www.vitavonni.de/blog/201211/2012110201-dbscan-and-optics-clustering.html
: short introduction to DBSCAN and OPTICS

http://www.cs.uiuc.edu/class/fa05/cs591han/papers/ankerst99.pdf :
original paper presenting OPTICS



-- 
Maxence Ahlouche
06 06 66 97 00


Re: [HACKERS] GSoC application: MADlib k-medoids clustering

2014-03-25 Thread Maxence Ahlouche
Hi,

2014-03-25 1:08 GMT+01:00 Robert Haas robertmh...@gmail.com:



 As far as I know, this mailing list isn't the right place to discuss
 anything related to MADlib.


As MADlib projects were mentionned on the GSoC thread (see here:
http://postgresql.1045698.n5.nabble.com/GSoC-2014-mentors-students-and-admins-td5789287.html),
I thought it was the  correct place to submit my proposal. But indeed, I
probably should have posted on the MADlib mailing list too.

By the way, has anyone any comment on this proposal?

Regards,
Maxence Ahlouche

-- 
Maxence Ahlouche
06 06 66 97 00


Re: [HACKERS] GSoC application: MADlib k-medoids clustering

2014-03-20 Thread Maxence Ahlouche
Hi,

My proposal is now available on Google melange website:
http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/viod/5668600916475904
There seems to be a formatting issue: half of the text is a link to the
page I mentionned during my registration on my website. I don't know how to
fix it though.

Regards,
Maxence

-- 
Maxence Ahlouche
06 06 66 97 00


[HACKERS] GSoC application: MADlib k-medoids clustering

2014-03-18 Thread Maxence Ahlouche
Hi all,

Some of you may remember me from last year: I had applied to implement the
k-medoids clustering method in MADlib, a Postgres and GreenPlum library.
As this project could not be selected last year, I'm trying again now!

You can find my application at this address:
https://github.com/viodlen/gsoc_2014
I've also pasted it at the end of this email.

As for me: I'm Maxence Ahlouche, a French student in computer science. I've
been studying IT for almost 4 years now, the first two of them being in a
technical degree. I'm currently in an engineering school, and am supposed
to graduate next year.


However, there's a problem with my application: the GSoC is supposed to
begin on may 19, but I have lectures and exams until at least June 4, and
the official end of my school year is on June 13. The period between June 4
and June 14 is intended for students to finish their school projects, if
they haven't already.
In the worst case, I won't have time to do anything significant for the
first 4 weeks.

Still, I think I'm able to handle this situation, and this worst case is
quite unlikely. Between May 19 and June 4, I'll have lectures and exams,
but it should still be a quiet time, and I'll have time to work on my GSoC
project the evenings and week-ends.  After June 4, I'll only have my school
projects to finish and present, and I think I'll be able to spend most of
my time on GSoC.
I've taken it into account in my proposal's schedule, that's why the first
part takes almost as long as the second, even though it looks easier to me.

Of course, if the community doesn't want to take this risk, I'll understand.

Regards,
Maxence Ahlouche

My proposal:

 GSoC 2014 proposal: implementing clustering algorithms in MADlib
 

 Synopsis
 

 This project aims to implement some clustering algorithms in MADlib,
 which is a data analytics and machine learning library for PostgreSQL,
 Greenplum and HAWQ.

 Benefits to the PostgreSQL community
 

 Currently, only the k-means clustering algorithm is implemented in
 MADlib (see the doc:
 http://doc.madlib.net/latest/group__grp__clustering.html ). The
 k-medoids algorithm, while being computationnally more intensive, is
 much less sensitive to outliers (points that don't belong obviously to
 one cluster or another). This is interesting on noisy datasets, that's
 why I'm planning to implement it during the first part of the GSoC.

 Still, these algorithms are based on distance computation, therefore
 they can only find convex clusters. That's why I'm proposing to
 implement the OPTICS (*ordering points to identify the clustering
 structure*, see http://en.wikipedia.org/wiki/OPTICS_algorithm ), which
 addresses this issue, as the second part of this GSoC project.

 The PostgreSQL community would benefit from these features, as it
 would make available clustering algorithms more powerful than simple
 k-means.

 Project details
 ---

 k-medoids
 

 The first goal of this project is to implement the k-medoids
 clustering algorithm. For this, I'll first spend some time studying
 the k-means algorithm, as both will probably be pretty similar. This
 will also allow me to get familiar with the codebase, the conventions,
 the data structures I'll need, etc.

 Then I'll implement, test and debug the algorithm. If relevant, I'll
 also provide a k-medoids++ version, which, similarly to the
 k-means++ function in MADlib, will chose the initial centroids
 depending on the dataset, instead of chosing them randomly. This
 allows to detect small clusters located far from the others (which are
 usually detected as part of an other bigger cluster using the standard
 algorithm).

 The final step would be to refactor the code from k-means and
 k-medoids to remove any code duplication introduced in this first
 part.

 OPTICS
 

 The second part of this project would be to implement the
 density-based clustering algorithm OPTICS, which would overcome the
 main problem of both the k-means and k-medoids algorithm: non-convex
 clusters. This algorithm has been preferred over DBSCAN
 (http://en.wikipedia.org/wiki/DBSCAN ) as it is able to detect
 clusters of different densities, and, consequently, overlapping
 clusters.

 I'll first take some time to understand full well the algorithm, and
 make a prototype in Python, to be sure I know how it works. Then I'll
 actually implement it, test it, and debug it in MADlib.

 If, after that, any time's left, I'll consider implementing some
 of the improvements of k-means and k-medoids that we can find in the
 litterature.

 Deliverables
 

 * the k-medoids algorithm in MADlib;
 * the OPTICS algorithm, also in MADlib;
 * optionnally, some improvements on k-means and/or k-medoids.

 Project Schedule
 

 #. Implementation of the k-medoids algorithm: from 19/05 to 30/05
 #. Tests, debug and doc of k-medoids: from 31/05 to 13

Re: [HACKERS] [pgsql-advocacy] GSoC 2014 - mentors, students and admins

2014-01-28 Thread Maxence AHLOUCHE
Hi all,

I'd very interesting in taking part in the GSoC with PostgreSQL, as a
student.
Being quite busy at the moment (exam time), I haven't had time to think of
a subject yet, even though I could see some topics I found interesting.

2014-01-28 Josh Berkus j...@agliodbs.com

 On 01/28/2014 09:46 AM, Atri Sharma wrote:
  I would like to bring up the addition to MADLIB algorithms again this
 year.
 We can only take MADLIB this year if we have confirmed mentors who are
 MADLIB committers before the end of the application period (Feb 15).  We
 can't have a repeat of last year.


Indeed, I've been unlucky last year when I tried to apply to a subject with
Madlib, mostly due to their lack of reactivity. Even though their subjects
interest me, I don't intend to try my luck again.

Atri Sharma said:

 Also, some work on the foreign table constraints could be helpful.


What kind of work do you have in mind?

Stephen Frost said:

 I'm interested in mentoring and, unlike previous years, I've been
 collecting a personal list of things that I'd like to see worked on for
 PG which could be GSoC projects and will provide such in the next few
 days to this list (unless there's a different list that people want such
 posted to..?).

IMO, the best place would be the wiki page for GSoC (
http://wiki.postgresql.org/wiki/GSoC_2014), it would avoid interested
students (including me :) ) having to look for possible subjects in lots of
different places.


Regards,
Maxence
-- 
Maxence Ahlouche
06 06 66 97 00