Re: Query on clusterdumper output and clusteredPoints

ipshita chatterji Thu, 15 Dec 2011 06:31:50 -0800

Hi,
I wrote my own code to read the cluster members from one of the
directories generated by the postprocessor. I still get a mismatch
between the number of points reported by clusterdumper and the number
I get after reading the members.


Please see my code snippet below. What am I missing here?
Clusterdumper displays:

MSV-115{n=3 c=[0:-0.030, 1:0.003, 2:-0.032, 3:-0.053, 4:0.001,.................]

which I take to mean that 3 members belong to this centroid, whereas
the code below prints 412 points.
<code>
    Configuration conf = new Configuration();
    //FileSystem fs = FileSystem.get(pointsDir.toUri(), conf);
    FileSystem fs = pointsDir.getFileSystem(conf);
    Path mypath = new Path("output1512/pp/115");
    //System.out.println(" fs "+fs.getName());
    try{
          process(mypath,fs,conf);
       }
    catch(Exception e)
    {
          System.out.println("Exception :: "+e.getMessage());
          e.printStackTrace();
    }


public void process(Path clusteredPoints, FileSystem fileSystem, Configuration conf) throws Exception {
     FileStatus[] partFiles = getAllClusteredPointPartFiles(clusteredPoints, fileSystem);
     for (FileStatus partFile : partFiles) {
         SequenceFile.Reader clusteredPointsReader = new SequenceFile.Reader(fileSystem, partFile.getPath(), conf);
         WritableComparable clusterIdAsKey = (WritableComparable) clusteredPointsReader.getKeyClass().newInstance();
         Writable vector = (Writable) clusteredPointsReader.getValueClass().newInstance();
         while (clusteredPointsReader.next(clusterIdAsKey, vector)) {
             //use clusterId and vector here to write to a local file.
             //IntWritable clusterIdAsKey1 = new IntWritable();
             Text clusterIdAsKey1 = new Text();
             //WeightedVectorWritable point1 = new WeightedVectorWritable();
             VectorWritable point1 = new VectorWritable();

             findClusterAndAddVector(clusteredPointsReader, clusterIdAsKey1, point1);
         }
         clusteredPointsReader.close();
     }
   }

  private void findClusterAndAddVector(SequenceFile.Reader clusteredPointsReader,
                                       //IntWritable clusterIdAsKey1,
                                       Text clusterIdAsKey1,
                                       VectorWritable point1) throws IOException {
    while (clusteredPointsReader.next(clusterIdAsKey1, point1)) {
      //String clusterId = clusterIdAsKey1.toString().trim();
      //String point = point1.toString();
      //System.out.println("Adding point to cluster " + clusterId);
      org.apache.mahout.math.Vector vec= point1.get();
      System.out.println(vec.asFormatString());
    }
  }


private FileStatus[] getAllClusteredPointPartFiles(Path clusteredPoints, FileSystem fileSystem) throws IOException {
     System.out.println(" clusteredPoints :: "+clusteredPoints.getName());
     System.out.println(" fileSystem:: "+fileSystem.getName());

     //Path[] partFilePaths = FileUtil.stat2Paths(fileSystem.globStatus(clusteredPoints, PathFilters.partFilter()));
     Path[] partFilePaths = FileUtil.stat2Paths(fileSystem.globStatus(clusteredPoints));

     int size=partFilePaths.length;
     System.out.println("Length :: "+size);
     FileStatus[] partFileStatuses = fileSystem.listStatus(partFilePaths, PathFilters.partFilter());
     return partFileStatuses;
   }
</code>
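
A side note on the snippet above: process() calls next() in its while loop, and findClusterAndAddVector() then calls next() again in a loop of its own, so the first record of each part file appears to be consumed without ever being printed. That would make the printed count lower by one per part file, so on its own it does not explain the gap between 3 and 412. A minimal sketch of the effect follows. RecordReader is a hypothetical plain-Java stand-in for SequenceFile.Reader, not Hadoop's API, and the record names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class NestedReadSketch {

    // Hypothetical stand-in for a sequential reader (not Hadoop's API):
    // each call to next() consumes exactly one record.
    static final class RecordReader {
        private final Iterator<String> records;
        private String current;

        RecordReader(List<String> records) {
            this.records = records.iterator();
        }

        // Advances to the next record; returns false at end of input.
        boolean next() {
            if (!records.hasNext()) {
                return false;
            }
            current = records.next();
            return true;
        }

        String current() {
            return current;
        }
    }

    // Mirrors the posted structure: the outer loop reads a record,
    // then an inner loop reads again and collects only what it reads itself.
    static List<String> readNested(List<String> input) {
        RecordReader reader = new RecordReader(input);
        List<String> seen = new ArrayList<>();
        while (reader.next()) {        // this record is never collected
            while (reader.next()) {    // drains the remaining records
                seen.add(reader.current());
            }
        }
        return seen;
    }

    // Single loop: every record is collected exactly once.
    static List<String> readOnce(List<String> input) {
        RecordReader reader = new RecordReader(input);
        List<String> seen = new ArrayList<>();
        while (reader.next()) {
            seen.add(reader.current());
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("v0", "v1", "v2");
        System.out.println("nested: " + readNested(input)); // prints "nested: [v1, v2]"
        System.out.println("single: " + readOnce(input));   // prints "single: [v0, v1, v2]"
    }
}
```

With a single read loop, a counter incremented once per record gives the number of members actually stored in the directory, which can then be compared with the n value that clusterdumper prints.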



On Thu, Dec 15, 2011 at 2:19 PM, Paritosh Ranjan <[email protected]> wrote:
> If you want to put the data on the local file system, I think you will have to
> read the data in the cluster directories (output of postprocessor), one by
> one, and write it to your local system.
>
> I am not sure what ClusterDumper does. If it also does the same thing (reads
> the clusters output and writes it to the local file system), then you can use it
> on all the directories produced by postprocessor.
>
>
> On 15-12-2011 14:07, ipshita chatterji wrote:
>>
>> I have used ClusterOutputPostProcessorDriver. Now how do I read the
>> output generated by postprocessor? Is there a tool for that too?
>>
>> On Thu, Dec 15, 2011 at 10:37 AM, Paritosh Ranjan<[email protected]>
>>  wrote:
>>>
>>> Some typo in previous mail. Please read :
>>>
>>> ...which will post process your clustering output and group vectors
>>> belonging to different clusters in their respective directories...
>>>
>>>
>>> On 15-12-2011 10:34, Paritosh Ranjan wrote:
>>>>
>>>> You don't need to write your own code for analyzing clustered points.
>>>> You
>>>> can use ClusterOutputPostProcessorDriver which will post process your
>>>> clusters and group clusters belonging to different clusters in their
>>>> respective directories. You won't get any OOM here.
>>>>
>>>> Example of using it is here
>>>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html.
>>>>
>>>> And I would advise using the current 0.6-snapshot to do
>>>> clustering as well as post processing it.
>>>> Using 0.5 for clustering and 0.6-snapshot to write code to post
>>>> process
>>>> might create problems.
>>>>
>>>> Paritosh
>>>>
>>>> On 15-12-2011 08:37, ipshita chatterji wrote:
>>>>>
>>>>> Actually clustering was done using the 0.5 version of mahout but I am
>>>>> using the clusterdumper code from the current version of mahout present
>>>>> in "trunk" to analyze the clusters. To make it run I renamed the final
>>>>> cluster by appending "-final".
>>>>> I got the OOM error even after increasing the mahout heapsize and
>>>>> hence wrote code of my own to analyze the clusters by reading
>>>>> "-clusteredPoints".
>>>>>
>>>>> On Thu, Dec 15, 2011 at 2:58 AM, Gary Snider<[email protected]>
>>>>>  wrote:
>>>>>
>>>>>> Ok.  See if you can get the --pointsDir working and post what you get.
>>>>>>  Also for seqFileDir do you have a directory with the word 'final' in
>>>>>> it?
>>>>>>
>>>>>> On Dec 14, 2011, at 12:37 PM, ipshita chatterji<[email protected]>
>>>>>>  wrote:
>>>>>>
>>>>>>> For clusterdumper I had the following command line:
>>>>>>>
>>>>>>> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
>>>>>>> --output clusteranalyze.txt
>>>>>>>
>>>>>>> I have written a separate program to read the clusteredOutput directory, as
>>>>>>> clusterdumper with "--pointsDir output/clusteredPoints " was giving
>>>>>>> an OOM exception.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Wed, Dec 14, 2011 at 10:06 PM, Gary
>>>>>>> Snider<[email protected]>
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>> What was on your command line?  e.g. seqFileDir, pointsDir, etc
>>>>>>>>
>>>>>>>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita
>>>>>>>> chatterji<[email protected]>wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am a newbie in Mahout and also have only elementary knowledge of
>>>>>>>>> clustering. I managed to cluster my data using meanshift and then
>>>>>>>>> ran
>>>>>>>>> clusterdumper. I get the following output:
>>>>>>>>>
>>>>>>>>> MSV-21{n=1 c=[1:0...........]
>>>>>>>>>
>>>>>>>>> So I assume that the cluster above has converged and n=1 indicates
>>>>>>>>> that there is only one point associated with the cluster above.
>>>>>>>>>
>>>>>>>>> Now I try to read the members of this cluster from the
>>>>>>>>> "clusteredPoints"
>>>>>>>>> directory. I see from the output that the number of points belonging
>>>>>>>>> to this
>>>>>>>>> cluster is 173.
>>>>>>>>>
>>>>>>>>> Why is this mismatch happening? Am I missing something here?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ipshita
>>>>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
