Are you using 0.6-snapshot for everything, i.e. clustering, post-processing,
and clusterdumper?
And are you keeping the 'sequential' parameter the same for
clustering and post-processing?
On 15-12-2011 20:01, ipshita chatterji wrote:
Hi,
I wrote my own code to read the member vectors from one of the
directories generated by the postprocessor. I still get a mismatch
between the member count reported by clusterdumper and the number of
points my code reads.
Please see my code snippet below. What am I missing here?
Clusterdumper displays:
MSV-115{n=3 c=[0:-0.030, 1:0.003, 2:-0.032, 3:-0.053, 4:0.001,.................]
which means there are 3 members belonging to this centroid, whereas
the code below reads 412 points.
<code>
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.math.VectorWritable;

// Calling code: read the post-processed directory for cluster 115.
Configuration conf = new Configuration();
//FileSystem fs = FileSystem.get(pointsDir.toUri(), conf);
FileSystem fs = pointsDir.getFileSystem(conf);
Path mypath = new Path("output1512/pp/115");
//System.out.println(" fs " + fs.getName());
try {
  process(mypath, fs, conf);
} catch (Exception e) {
  System.out.println("Exception :: " + e.getMessage());
  e.printStackTrace();
}

public void process(Path clusteredPoints, FileSystem fileSystem, Configuration conf) throws Exception {
  FileStatus[] partFiles = getAllClusteredPointPartFiles(clusteredPoints, fileSystem);
  for (FileStatus partFile : partFiles) {
    SequenceFile.Reader clusteredPointsReader =
        new SequenceFile.Reader(fileSystem, partFile.getPath(), conf);
    WritableComparable clusterIdAsKey =
        (WritableComparable) clusteredPointsReader.getKeyClass().newInstance();
    Writable vector = (Writable) clusteredPointsReader.getValueClass().newInstance();
    while (clusteredPointsReader.next(clusterIdAsKey, vector)) {
      // use clusterId and vector here to write to a local file
      //IntWritable clusterIdAsKey1 = new IntWritable();
      Text clusterIdAsKey1 = new Text();
      //WeightedVectorWritable point1 = new WeightedVectorWritable();
      VectorWritable point1 = new VectorWritable();
      findClusterAndAddVector(clusteredPointsReader, clusterIdAsKey1, point1);
    }
    clusteredPointsReader.close();
  }
}

private void findClusterAndAddVector(SequenceFile.Reader clusteredPointsReader,
    //IntWritable clusterIdAsKey1,
    Text clusterIdAsKey1,
    VectorWritable point1) throws IOException {
  while (clusteredPointsReader.next(clusterIdAsKey1, point1)) {
    //String clusterId = clusterIdAsKey1.toString().trim();
    //String point = point1.toString();
    //System.out.println("Adding point to cluster " + clusterId);
    org.apache.mahout.math.Vector vec = point1.get();
    System.out.println(vec.asFormatString());
  }
}

private FileStatus[] getAllClusteredPointPartFiles(Path clusteredPoints, FileSystem fileSystem)
    throws IOException {
  System.out.println(" clusteredPoints :: " + clusteredPoints.getName());
  System.out.println(" fileSystem :: " + fileSystem.getName());
  //Path[] partFilePaths = FileUtil.stat2Paths(
  //    fileSystem.globStatus(clusteredPoints, PathFilters.partFilter()));
  Path[] partFilePaths = FileUtil.stat2Paths(fileSystem.globStatus(clusteredPoints));
  int size = partFilePaths.length;
  System.out.println("Length :: " + size);
  FileStatus[] partFileStatuses =
      fileSystem.listStatus(partFilePaths, PathFilters.partFilter());
  return partFileStatuses;
}
</code>
On Thu, Dec 15, 2011 at 2:19 PM, Paritosh Ranjan<[email protected]> wrote:
If you want to put the data into the local file system, I think you will have
to read the data in the cluster directories (the output of the postprocessor)
one by one and write it to your local system.
I am not sure what ClusterDumper does; if it also does the same thing (reads
the cluster output and writes it to the local file system), then you can use it
on all the directories produced by the postprocessor.
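A rough sketch of what I mean, in case it helps. The directory path and the
value type are just assumptions here, so adjust them to whatever the
postprocessor actually wrote for you:

import java.io.BufferedWriter;
import java.io.FileWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.math.VectorWritable;

// Copy the vectors of one post-processed cluster directory into a local text
// file, one vector per line.
static void dumpClusterToLocalFile(Path clusterDir, String localFile, Configuration conf) throws Exception {
  FileSystem fs = clusterDir.getFileSystem(conf);
  BufferedWriter out = new BufferedWriter(new FileWriter(localFile));
  for (FileStatus part : fs.listStatus(clusterDir, PathFilters.partFilter())) {
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
    Writable key = (Writable) reader.getKeyClass().newInstance();
    Writable value = (Writable) reader.getValueClass().newInstance();
    while (reader.next(key, value)) {
      // assuming the values are VectorWritable; change the cast if they are not
      out.write(((VectorWritable) value).get().asFormatString());
      out.newLine();
    }
    reader.close();
  }
  out.close();
}

You would call it once per cluster directory, e.g.
dumpClusterToLocalFile(new Path("postProcessorOutput/0"), "cluster-0.txt", new Configuration());
where both paths are only placeholders.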
On 15-12-2011 14:07, ipshita chatterji wrote:
I have used ClusterOutputPostProcessorDriver. Now how do I read the
output generated by the postprocessor? Is there a tool for that too?
On Thu, Dec 15, 2011 at 10:37 AM, Paritosh Ranjan<[email protected]>
wrote:
There was a typo in my previous mail. Please read:
...which will post process your clustering output and group vectors
belonging to different clusters in their respective directories...
On 15-12-2011 10:34, Paritosh Ranjan wrote:
You don't need to write your own code for analyzing clustered points. You
can use ClusterOutputPostProcessorDriver, which will post-process your
clusters and group clusters belonging to different clusters in their
respective directories. You won't get any OOM here.
An example of using it is here:
https://cwiki.apache.org/MAHOUT/top-down-clustering.html.
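For reference, a minimal sketch of driving it from Java. I am assuming a
run(input, output, runSequential) entry point here, and the paths are only
placeholders, so please check the exact signature in the current trunk:

import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.topdown.postprocessor.ClusterOutputPostProcessorDriver;

// "output" is the clustering output directory, "output/pp" a fresh directory
// for the per-cluster, post-processed results.
Path clusteringOutput = new Path("output");
Path postProcessedOutput = new Path("output/pp");
// pass the same sequential/MapReduce choice you used for the clustering itself
ClusterOutputPostProcessorDriver.run(clusteringOutput, postProcessedOutput, false);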
And I would advise using the current 0.6-snapshot to do the clustering as
well as the post-processing. Using 0.5 for the clustering and 0.6-snapshot
code for the post-processing might create problems.
Paritosh
On 15-12-2011 08:37, ipshita chatterji wrote:
Actually, the clustering was done using version 0.5 of Mahout, but I am
using the clusterdumper code from the current version of Mahout in trunk
to analyze the clusters. To make it run I renamed the final cluster
directory by appending "-final".
I got the OOM error even after increasing the Mahout heap size, and hence
wrote my own code to analyze the clusters by reading the "clusteredPoints"
directory.
On Thu, Dec 15, 2011 at 2:58 AM, Gary Snider <[email protected]>
wrote:
OK. See if you can get the --pointsDir option working and post what you get.
Also, for seqFileDir, do you have a directory with the word 'final' in
it?
On Dec 14, 2011, at 12:37 PM, ipshita chatterji<[email protected]>
wrote:
For clusterdumper I had the following command line:
$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
--output clusteranalyze.txt
I have written a separate program to read the clusteredOutput directory, as
clusterdumper with "--pointsDir output/clusteredPoints" was giving an
OOM exception.
Thanks
On Wed, Dec 14, 2011 at 10:06 PM, Gary
Snider<[email protected]>
wrote:
What was on your command line? e.g. seqFileDir, pointsDir, etc
On Wed, Dec 14, 2011 at 10:54 AM, ipshita
chatterji<[email protected]>wrote:
Hi,
I am a newbie in Mahout and also have only elementary knowledge of
clustering. I managed to cluster my data using mean shift and then ran
clusterdumper; I get the following output:
MSV-21{n=1 c=[1:0...........]
So I assume that the cluster above has converged and that n=1 indicates
there is only one point associated with it.
Now I try to read the members of this cluster from the "clusteredPoints"
directory. I see from the output that the number of points belonging to
this cluster is 173.
Why is this mismatch happening? Am I missing something here?
Thanks,
Ipshita