Yes, I am using 0.6 for all of the steps mentioned below.
On Fri, Dec 16, 2011 at 10:47 AM, Paritosh Ranjan <[email protected]> wrote:
> Are you using 0.6-snapshot for everything, i.e. clustering, post
> processing, and clusterdumper?
>
> And are you keeping the "sequential" parameter the same for
> clustering/postprocessing?
>
>
> On 15-12-2011 20:01, ipshita chatterji wrote:
>>
>> Hi,
>> I wrote my own code to read the members from one of the
>> directories generated by the postprocessor. I still get a mismatch
>> between the member count that clusterdumper reports and the number
>> of points my code reads.
>>
>> Please see my code snippet below. What am I missing here?
>> Clusterdumper displays:
>>
>> MSV-115{n=3 c=[0:-0.030, 1:0.003, 2:-0.032, 3:-0.053, 4:0.001,.................]
>>
>> which means there are 3 members belonging to this centroid, whereas
>> the code below reads 412 points.
>>
>> <code>
>> Configuration conf = new Configuration();
>> // FileSystem fs = FileSystem.get(pointsDir.toUri(), conf);
>> FileSystem fs = pointsDir.getFileSystem(conf);
>> Path mypath = new Path("output1512/pp/115");
>> try {
>>     process(mypath, fs, conf);
>> } catch (Exception e) {
>>     System.out.println("Exception :: " + e.getMessage());
>>     e.printStackTrace();
>> }
>>
>> public void process(Path clusteredPoints, FileSystem fileSystem,
>>         Configuration conf) throws Exception {
>>     FileStatus[] partFiles =
>>         getAllClusteredPointPartFiles(clusteredPoints, fileSystem);
>>     for (FileStatus partFile : partFiles) {
>>         SequenceFile.Reader clusteredPointsReader =
>>             new SequenceFile.Reader(fileSystem, partFile.getPath(), conf);
>>         WritableComparable clusterIdAsKey = (WritableComparable)
>>             clusteredPointsReader.getKeyClass().newInstance();
>>         Writable vector = (Writable)
>>             clusteredPointsReader.getValueClass().newInstance();
>>         while (clusteredPointsReader.next(clusterIdAsKey, vector)) {
>>             // use clusterId and vector here to write to a local file.
>>             // IntWritable clusterIdAsKey1 = new IntWritable();
>>             Text clusterIdAsKey1 = new Text();
>>             // WeightedVectorWritable point1 = new WeightedVectorWritable();
>>             VectorWritable point1 = new VectorWritable();
>>
>>             findClusterAndAddVector(clusteredPointsReader,
>>                 clusterIdAsKey1, point1);
>>         }
>>         clusteredPointsReader.close();
>>     }
>> }
>>
>> private void findClusterAndAddVector(
>>         SequenceFile.Reader clusteredPointsReader,
>>         Text clusterIdAsKey1,  // was IntWritable
>>         VectorWritable point1) throws IOException {
>>     while (clusteredPointsReader.next(clusterIdAsKey1, point1)) {
>>         // String clusterId = clusterIdAsKey1.toString().trim();
>>         // System.out.println("Adding point to cluster " + clusterId);
>>         org.apache.mahout.math.Vector vec = point1.get();
>>         System.out.println(vec.asFormatString());
>>     }
>> }
>>
>> private FileStatus[] getAllClusteredPointPartFiles(Path clusteredPoints,
>>         FileSystem fileSystem) throws IOException {
>>     System.out.println(" clusteredPoints :: " + clusteredPoints.getName());
>>     System.out.println(" fileSystem :: " + fileSystem.getName());
>>
>>     // Path[] partFilePaths = FileUtil.stat2Paths(
>>     //     fileSystem.globStatus(clusteredPoints, PathFilters.partFilter()));
>>     Path[] partFilePaths =
>>         FileUtil.stat2Paths(fileSystem.globStatus(clusteredPoints));
>>
>>     int size = partFilePaths.length;
>>     System.out.println("Length :: " + size);
>>     FileStatus[] partFileStatuses =
>>         fileSystem.listStatus(partFilePaths, PathFilters.partFilter());
>>     return partFileStatuses;
>> }
>>
>>
>> On Thu, Dec 15, 2011 at 2:19 PM, Paritosh Ranjan <[email protected]> wrote:
>>>
>>> If you want to put the data into the local file system, I think you will
>>> have to read the data in the cluster directories (the output of the
>>> postprocessor) one by one and write it to your local system.
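A note on the reader loop in the snippet above: process() calls next() on clusteredPointsReader and then hands the same reader to findClusterAndAddVector(), whose own next() loop drains every remaining record in the part file, regardless of which key the outer call just read. Below is a minimal, self-contained sketch of the per-cluster counting that loop appears to intend, with a plain List standing in for the SequenceFile (an assumption for illustration; ClusterCount and countPerCluster are hypothetical names, not Mahout API):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterCount {

    // Count how many points carry each cluster id. With a real
    // SequenceFile.Reader this would be a single next(key, value) loop
    // that increments counts for key.toString() -- no nested second loop
    // that consumes the rest of the file.
    static Map<String, Integer> countPerCluster(List<String> clusterIds) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : clusterIds) {
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical key stream: three points in cluster 115, one in 21.
        List<String> keys = Arrays.asList("115", "115", "115", "21");
        System.out.println(countPerCluster(keys)); // prints {115=3, 21=1}
    }
}
```

The key point is structural: one loop, one reader, and the grouping done by the map, so each record is attributed to exactly the cluster id read alongside it.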
>>>
>>> I am not sure what ClusterDumper does; if it also does the same thing
>>> (reads the cluster output and writes it to the local file system), then
>>> you can use it on all the directories produced by the postprocessor.
>>>
>>>
>>> On 15-12-2011 14:07, ipshita chatterji wrote:
>>>>
>>>> I have used ClusterOutputPostProcessorDriver. Now how do I read the
>>>> output generated by the postprocessor? Is there a tool for that too?
>>>>
>>>> On Thu, Dec 15, 2011 at 10:37 AM, Paritosh Ranjan <[email protected]> wrote:
>>>>>
>>>>> A typo in the previous mail. Please read:
>>>>>
>>>>> ...which will post-process your clustering output and group vectors
>>>>> belonging to different clusters in their respective directories...
>>>>>
>>>>>
>>>>> On 15-12-2011 10:34, Paritosh Ranjan wrote:
>>>>>>
>>>>>> You don't need to write your own code for analyzing clustered points.
>>>>>> You can use ClusterOutputPostProcessorDriver, which will post-process
>>>>>> your clusters and group clusters belonging to different clusters in
>>>>>> their respective directories. You won't get any OOM here.
>>>>>>
>>>>>> An example of using it is here:
>>>>>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
>>>>>>
>>>>>> And I would advise using the current 0.6-snapshot to do the clustering
>>>>>> as well as the post-processing.
>>>>>> Using 0.5 for the clustering and 0.6-snapshot code for the
>>>>>> post-processing might create problems.
>>>>>>
>>>>>> Paritosh
>>>>>>
>>>>>> On 15-12-2011 08:37, ipshita chatterji wrote:
>>>>>>>
>>>>>>> Actually, the clustering was done using the 0.5 version of Mahout,
>>>>>>> but I am using the clusterdumper code from the current version of
>>>>>>> Mahout present in trunk to analyze the clusters. To make it run, I
>>>>>>> renamed the final cluster by appending "-final".
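To illustrate the suggestion above of reading the postprocessor's output directories one by one: a hedged shell sketch against a mocked-up layout (the pp-output name and part-file names are illustrative stand-ins, not Mahout's actual output paths):

```shell
# Mock of the per-cluster layout the postprocessor is described as producing:
# one subdirectory per cluster, each holding that cluster's points.
# Directory and file names here are illustrative, not Mahout's exact ones.
mkdir -p pp-output/115 pp-output/21
touch pp-output/115/part-r-00000 pp-output/21/part-r-00000

# Visit the cluster directories one by one, as suggested above
# (here we only list them; a real script would copy or parse each).
for d in pp-output/*/; do
  echo "cluster dir: $d"
done
# prints:
# cluster dir: pp-output/115/
# cluster dir: pp-output/21/
```

Each directory can then be fed to whatever per-cluster reader or dumper you use, which keeps memory bounded to one cluster at a time.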
>>>>>>> I got the OOM error even after increasing the Mahout heap size, and
>>>>>>> hence had written code of my own to analyze the clusters by reading
>>>>>>> "-clusteredPoints".
>>>>>>>
>>>>>>> On Thu, Dec 15, 2011 at 2:58 AM, Gary Snider <[email protected]> wrote:
>>>>>>>
>>>>>>>> Ok. See if you can get the --pointsDir working and post what you
>>>>>>>> get. Also, for seqFileDir, do you have a directory with the word
>>>>>>>> 'final' in it?
>>>>>>>>
>>>>>>>> On Dec 14, 2011, at 12:37 PM, ipshita chatterji
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> For clusterdumper I had the following command line:
>>>>>>>>>
>>>>>>>>> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
>>>>>>>>> --output clusteranalyze.txt
>>>>>>>>>
>>>>>>>>> I have written a separate program to read the clusteredOutput
>>>>>>>>> directory, as clusterdumper with "--pointsDir
>>>>>>>>> output/clusteredPoints" was giving an OOM exception.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Wed, Dec 14, 2011 at 10:06 PM, Gary Snider
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> What was on your command line? e.g. seqFileDir, pointsDir, etc.
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita chatterji
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am a newbie in Mahout and also have only elementary knowledge
>>>>>>>>>>> of clustering. I managed to cluster my data using mean shift and
>>>>>>>>>>> then ran clusterdumper; I get the following output:
>>>>>>>>>>>
>>>>>>>>>>> MSV-21{n=1 c=[1:0...........]
>>>>>>>>>>>
>>>>>>>>>>> So I assume that the cluster above has converged and that n=1
>>>>>>>>>>> indicates there is only one point associated with this cluster.
>>>>>>>>>>>
>>>>>>>>>>> Now I try to read the members of this cluster from the
>>>>>>>>>>> "clusteredPoints" directory.
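On the OOM side of this exchange: the bin/mahout launcher script honors a MAHOUT_HEAPSIZE environment variable (in MB) when sizing the JVM, so raising it is worth trying before abandoning --pointsDir. A sketch, with the 4096 value purely illustrative and the command echoed rather than executed (running it requires a Mahout install):

```shell
# Raise the heap the mahout launcher script will request (value in MB,
# illustrative; pick it based on available RAM and the size of clusteredPoints).
export MAHOUT_HEAPSIZE=4096

# The clusterdump invocation from the thread, with --pointsDir restored.
# Echoed only, so this sketch runs without a Mahout install.
echo "would run: \$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6 --pointsDir output/clusteredPoints --output clusteranalyze.txt"
```

If the dump still exhausts memory at a workable heap size, the per-directory post-processing route discussed earlier in the thread is the fallback.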
>>>>>>>>>>> I see from the output that the number of points belonging to
>>>>>>>>>>> this cluster is 173.
>>>>>>>>>>>
>>>>>>>>>>> Why is this mismatch happening? Am I missing something here?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ipshita
>>>>>>>>>>>
>>>>>>> -----
>>>>>>> No virus found in this message.
>>>>>>> Checked by AVG - www.avg.com
>>>>>>> Version: 10.0.1415 / Virus Database: 2102/4080 - Release Date: 12/14/11
