Sorry for the delay in responding. By now you may have already figured this out. If not:
1. Did you specify the -cl option on Dirichlet to emit the clusteredPoints directory? The default is not to do so.
2. Did you specify the -p option on ClusterDumper to use that directory?
3. Which model are you using on Dirichlet? The default GaussianCluster doesn't do well with wide (e.g. text-clustering) vectors due to numerical instabilities. See examples/bin/build-reuters.sh for the incantation to use DistanceMeasureCluster+CosineDistanceMeasure instead.
4. Never seen the NPE you describe. Can you include your command line and a stack dump?
5. With your small data set size, you should be using -xm sequential and not the default mapreduce execution mode. Easier to debug the NPE if it reoccurs too.
6. Check out DisplayDirichlet in examples which can visualize 2-d points and their clusters.
7. I'd be interested to see if your experiment produces any results you can share. This sounds like a very unusual clustering application.

Jeff

-----Original Message-----
From: praneet mhatre [mailto:[email protected]]
Sent: Tuesday, November 08, 2011 4:12 PM
To: [email protected]
Subject: Dirichlet Clustering Output

Hello All,

I am trying to use Clustering algorithms to recover Software Architecture by using static features of code (e.g. method invocations, field accesses, etc). To start with, I ran the TestClusterDumper (using testDirichlet2() function) on the sample example given. But I am not able to interpret/visualize the results correctly as I don't see any assignment of input vectors to clusters, just a model of attributes. Is there an additional step to be performed to generate the final assignment?

Here's my input and output for Number of Clusters=10 and Number of Iterations=10.

*Input:*

    private static final String[] DOCS = {
        "The quick red fox jumped over the lazy brown dogs.",
        "The quick brown fox jumped over the lazy red dogs.",
        "The quick red cat jumped over the lazy brown dogs.",
        "The quick brown cat jumped over the lazy red dogs.",
        "Mary had a little lamb whose fleece was white as snow.",
        "Mary had a little goat whose fleece was white as snow.",
        "Mary had a little lamb whose fleece was black as tar.",
        "Dick had a little goat whose fleece was white as snow.",
        "Moby Dick is a story of a whale and a man obsessed.",
        "Moby Bob is a story of a walrus and a man obsessed.",
        "Moby Dick is a story of a whale and a crazy man.",
        "The robber wore a black fleece jacket and a baseball cap.",
        "The robber wore a red fleece jacket and a baseball cap.",
        "The robber wore a white fleece jacket and a baseball cap.",
        "The English Springer Spaniel is the best of all dogs.",
        "Hitesh Crista Crista Joel Thomas Arthur Praneet Hitesh Crista.",
        "Hitesh Crista Thomas Yasser Arthur Arthur Praneet Hitesh Crista.",
        "Hitesh Crista Thomas Sara Maryam Arthur Praneet Hitesh Crista."};

*Output:*

Complete output: https://docs.google.com/document/d/1ApOj-XwNMei1JYwcAoj7Vgzg_6sLtJqT55SBhZxL_V0/edit?hl=en_US

First two clusters:

DC-0 total= 110 model= GC:0{n=11
  c=[arthur:0.777, baseball:0.683, black:0.508, brown:0.415, cap:0.683, cat:0.508, crista:1.038, dogs:0.382, fleece:0.988, goat:0.254, had:0.622, hitesh:0.966, jacket:0.683, joel:0.291, jumped:0.415, lamb:0.508, lazy:0.415, little:0.622, mary:0.683, maryam:0.291, over:0.415, praneet:0.683, quick:0.415, red:0.572, robber:0.683, sara:0.291, snow:0.455, tar:0.291, thomas:0.683, white:0.622, whose:0.622, wore:0.683, yasser:0.291]
  r=[arthur:1.295, baseball:1.115, black:1.077, brown:0.880, cap:1.115, cat:1.077, crista:1.707, dogs:0.809, fleece:0.902, goat:0.803, had:1.016, hitesh:1.577, jacket:1.115, joel:0.919, jumped:0.880, lamb:1.077, lazy:0.880, little:1.016, mary:1.115, maryam:0.919, over:0.880, praneet:1.115, quick:0.880, red:0.935, robber:1.115, sara:0.919, snow:0.966, tar:0.919, thomas:1.115, white:1.016, whose:1.016, wore:1.115, yasser:0.919]}
  Top Terms:
    crista => 1.0381627082824707
    fleece => 0.987780137495561
    hitesh => 0.9658091718500311
    arthur => 0.7772231968966398
    wore => 0.6829302094199441
    thomas => 0.6829302094199441
    robber => 0.6829302094199441
    praneet => 0.6829302094199441
    mary => 0.6829302094199441
    jacket => 0.6829302094199441

DC-1 total= 0 model= GC:1{n=0
  c=[all:1.064, arthur:1.312, baseball:-0.362, best:-1.437, black:1.155, bob:0.798, brown:0.708, cap:0.154, cat:-1.008, crazy:0.891, crista:-0.032, dick:1.358, dogs:0.254, english:-0.159, fleece:0.047, fox:-0.397, goat:0.353, had:-0.217, hitesh:-0.722, jacket:-0.794, joel:0.906, jumped:0.511, lamb:-0.742, lazy:-1.627, little:0.259, man:1.254, mary:1.073, maryam:-0.979, moby:1.377, obsessed:1.655, over:-2.704, praneet:2.064, quick:-1.444, red:0.212, robber:-0.880, sara:-0.788, snow:-2.024, spaniel:-2.043, springer:-0.129, story:-0.556, tar:0.036, thomas:-0.539, walrus:-0.663, whale:-0.449, white:-0.872, whose:-1.372, wore:1.300, yasser:-1.198]
  r=[all:1.739, arthur:0.544, baseball:0.344, best:0.583, black:2.614, bob:1.700, brown:0.289, cap:-0.749, cat:2.273, crazy:2.075, crista:0.912, dick:-2.777, dogs:1.587, english:1.792, fleece:1.370, fox:-1.535, goat:-0.910, had:3.608, hitesh:1.639, jacket:1.127, joel:0.604, jumped:1.631, lamb:0.786, lazy:2.790, little:2.492, man:0.151, mary:1.611, maryam:-0.466, moby:1.370, obsessed:1.017, over:0.066, praneet:0.194, quick:1.352, red:0.450, robber:1.414, sara:1.427, snow:1.350, spaniel:-0.446, springer:1.615, story:1.330, tar:0.477, thomas:0.619, walrus:1.990, whale:1.013, white:1.335, whose:0.218, wore:0.231, yasser:1.284]}
  Top Terms:
    praneet => 2.064261989527272
    obsessed => 1.6554940510057867
    moby => 1.3767884191330173
    dick => 1.3584694137334954
    arthur => 1.31230884601195
    wore => 1.3000443458409314
    man => 1.2543030335395073
    black => 1.155114056222531
    mary => 1.0725645217854314
    all => 1.0641885117052403

2) Also, on a related note, I get a NullPointerException for a relatively higher number of iterations. For instance, with the same set of data points, I encounter the exception when I use 10 clusters and 15 iterations. Any thoughts on that?

3) There is an example clearly visualizing the clusters in case of numerical 2-D data points. Is there a similar way to visualize text data clusters?

If it matters, scalability is currently not a big concern as I am only dealing with a few hundred input vectors and attributes at this point.

Thank you,
--
Praneet Mhatre
Graduate Student
Donald Bren School of ICS
University of California, Irvine
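
Editor's note on point 3 of the reply: the suggestion to switch to a cosine-distance model for text vectors can be illustrated with a small standalone sketch. This is not Mahout's CosineDistanceMeasure and uses no Mahout classes; the class name CosineDistanceSketch and its methods are hypothetical, and the tokenization (lower-casing, splitting on non-letters, raw term counts) is a simplifying assumption. It only shows how cosine distance separates two of the sample sentences that share most terms from one that shares none.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not part of Mahout.
public class CosineDistanceSketch {

    // Lower-case the text, split on runs of non-letters, and count each term.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a sparse term-count vector.
    static double norm(Map<String, Integer> v) {
        double sumSquares = 0.0;
        for (int count : v.values()) {
            sumSquares += (double) count * count;
        }
        return Math.sqrt(sumSquares);
    }

    // Cosine distance = 1 - (a . b) / (|a| * |b|); 0 means same direction.
    static double cosineDistance(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // convention chosen for this sketch: an empty vector is maximally distant
        }
        return 1.0 - dot / (normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> fox = termCounts("The quick red fox jumped over the lazy brown dogs.");
        Map<String, Integer> cat = termCounts("The quick red cat jumped over the lazy brown dogs.");
        Map<String, Integer> moby = termCounts("Moby Dick is a story of a whale and a man obsessed.");

        System.out.printf("fox vs cat:  %.4f%n", cosineDistance(fox, cat));
        System.out.printf("fox vs moby: %.4f%n", cosineDistance(fox, moby));
    }
}
```

With these inputs the fox and cat sentences differ in a single term, so their distance is small (1 - 11/12), while the fox and Moby sentences share no terms and sit at distance 1. The measure depends only on the direction of the vectors, not on their length or dimensionality, which is one reason it is a common choice for wide, sparse text vectors.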
