Yes, I am using 0.6 for all of the steps mentioned below.
On Fri, Dec 16, 2011 at 10:47 AM, Paritosh Ranjan <[email protected]> wrote:
> Are you using 0.6-snapshot for everything, i.e. clustering, post
> processing, and clusterdumper?
>
> And are you keeping the "sequential" parameter the same for
> clustering/postprocessing?
>
>
> On 15-12-2011 20:01, ipshita chatterji wrote:
>>
>> Hi,
>> I wrote my own code to read the members from one of the
>> directories generated by the postprocessor. I still get a mismatch
>> between the member count that clusterdumper reports and the number
>> of points my code reads.
>>
>> Please see my code snippet below. What am I missing here?
>> Clusterdumper displays:
>>
>> MSV-115{n=3 c=[0:-0.030, 1:0.003, 2:-0.032, 3:-0.053, 4:0.001,.................]
>>
>> which means there are 3 members belonging to this centroid, whereas
>> the code below reads 412 points.
>>
>> <code>
>> Configuration conf = new Configuration();
>> // FileSystem fs = FileSystem.get(pointsDir.toUri(), conf);
>> FileSystem fs = pointsDir.getFileSystem(conf);
>> Path mypath = new Path("output1512/pp/115");
>> try {
>>     process(mypath, fs, conf);
>> } catch (Exception e) {
>>     System.out.println("Exception :: " + e.getMessage());
>>     e.printStackTrace();
>> }
>>
>> public void process(Path clusteredPoints, FileSystem fileSystem,
>>         Configuration conf) throws Exception {
>>     FileStatus[] partFiles =
>>         getAllClusteredPointPartFiles(clusteredPoints, fileSystem);
>>     for (FileStatus partFile : partFiles) {
>>         SequenceFile.Reader clusteredPointsReader =
>>             new SequenceFile.Reader(fileSystem, partFile.getPath(), conf);
>>         WritableComparable clusterIdAsKey = (WritableComparable)
>>             clusteredPointsReader.getKeyClass().newInstance();
>>         Writable vector = (Writable)
>>             clusteredPointsReader.getValueClass().newInstance();
>>         while (clusteredPointsReader.next(clusterIdAsKey, vector)) {
>>             // use clusterId and vector here to write to a local file.
>>             // IntWritable clusterIdAsKey1 = new IntWritable();
>>             Text clusterIdAsKey1 = new Text();
>>             // WeightedVectorWritable point1 = new WeightedVectorWritable();
>>             VectorWritable point1 = new VectorWritable();
>>
>>             findClusterAndAddVector(clusteredPointsReader,
>>                 clusterIdAsKey1, point1);
>>         }
>>         clusteredPointsReader.close();
>>     }
>> }
>>
>> private void findClusterAndAddVector(
>>         SequenceFile.Reader clusteredPointsReader,
>>         Text clusterIdAsKey1,  // was IntWritable
>>         VectorWritable point1) throws IOException {
>>     while (clusteredPointsReader.next(clusterIdAsKey1, point1)) {
>>         // String clusterId = clusterIdAsKey1.toString().trim();
>>         // System.out.println("Adding point to cluster " + clusterId);
>>         org.apache.mahout.math.Vector vec = point1.get();
>>         System.out.println(vec.asFormatString());
>>     }
>> }
>>
>> private FileStatus[] getAllClusteredPointPartFiles(Path clusteredPoints,
>>         FileSystem fileSystem) throws IOException {
>>     System.out.println(" clusteredPoints :: " + clusteredPoints.getName());
>>     System.out.println(" fileSystem :: " + fileSystem.getName());
>>
>>     // Path[] partFilePaths = FileUtil.stat2Paths(
>>     //     fileSystem.globStatus(clusteredPoints, PathFilters.partFilter()));
>>     Path[] partFilePaths =
>>         FileUtil.stat2Paths(fileSystem.globStatus(clusteredPoints));
>>
>>     int size = partFilePaths.length;
>>     System.out.println("Length :: " + size);
>>     FileStatus[] partFileStatuses =
>>         fileSystem.listStatus(partFilePaths, PathFilters.partFilter());
>>     return partFileStatuses;
>> }
>>
>>
>> On Thu, Dec 15, 2011 at 2:19 PM, Paritosh Ranjan <[email protected]> wrote:
>>>
>>> If you want to put the data into the local file system, I think you will
>>> have to read the data in the cluster directories (the output of the
>>> postprocessor) one by one and write it to your local system.
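A note on the reader loop in the snippet above: process() calls next() on clusteredPointsReader and then hands the same reader to findClusterAndAddVector(), whose own next() loop drains every remaining record in the part file, regardless of which key the outer call just read. Below is a minimal, self-contained sketch of the per-cluster counting that loop appears to intend, with a plain List standing in for the SequenceFile (an assumption for illustration; ClusterCount and countPerCluster are hypothetical names, not Mahout API):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterCount {

    // Count how many points carry each cluster id. With a real
    // SequenceFile.Reader this would be a single next(key, value) loop
    // that increments counts for key.toString() -- no nested second loop
    // that consumes the rest of the file.
    static Map<String, Integer> countPerCluster(List<String> clusterIds) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : clusterIds) {
            counts.merge(id, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical key stream: three points in cluster 115, one in 21.
        List<String> keys = Arrays.asList("115", "115", "115", "21");
        System.out.println(countPerCluster(keys)); // prints {115=3, 21=1}
    }
}
```

The key point is structural: one loop, one reader, and the grouping done by the map, so each record is attributed to exactly the cluster id read alongside it.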
>>>
>>> I am not sure what ClusterDumper does; if it also does the same thing
>>> (reads the cluster output and writes it to the local file system), then
>>> you can use it on all the directories produced by the postprocessor.
>>>
>>>
>>> On 15-12-2011 14:07, ipshita chatterji wrote:
>>>>
>>>> I have used ClusterOutputPostProcessorDriver. Now how do I read the
>>>> output generated by the postprocessor? Is there a tool for that too?
>>>>
>>>> On Thu, Dec 15, 2011 at 10:37 AM, Paritosh Ranjan <[email protected]> wrote:
>>>>>
>>>>> A typo in the previous mail. Please read:
>>>>>
>>>>> ...which will post-process your clustering output and group vectors
>>>>> belonging to different clusters in their respective directories...
>>>>>
>>>>>
>>>>> On 15-12-2011 10:34, Paritosh Ranjan wrote:
>>>>>>
>>>>>> You don't need to write your own code for analyzing clustered points.
>>>>>> You can use ClusterOutputPostProcessorDriver, which will post-process
>>>>>> your clusters and group clusters belonging to different clusters in
>>>>>> their respective directories. You won't get any OOM here.
>>>>>>
>>>>>> An example of using it is here:
>>>>>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
>>>>>>
>>>>>> And I would advise using the current 0.6-snapshot to do the clustering
>>>>>> as well as the post-processing.
>>>>>> Using 0.5 for the clustering and 0.6-snapshot code for the
>>>>>> post-processing might create problems.
>>>>>>
>>>>>> Paritosh
>>>>>>
>>>>>> On 15-12-2011 08:37, ipshita chatterji wrote:
>>>>>>>
>>>>>>> Actually, the clustering was done using the 0.5 version of Mahout,
>>>>>>> but I am using the clusterdumper code from the current version of
>>>>>>> Mahout present in trunk to analyze the clusters. To make it run, I
>>>>>>> renamed the final cluster by appending "-final".
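To illustrate the suggestion above of reading the postprocessor's output directories one by one: a hedged shell sketch against a mocked-up layout (the pp-output name and part-file names are illustrative stand-ins, not Mahout's actual output paths):

```shell
# Mock of the per-cluster layout the postprocessor is described as producing:
# one subdirectory per cluster, each holding that cluster's points.
# Directory and file names here are illustrative, not Mahout's exact ones.
mkdir -p pp-output/115 pp-output/21
touch pp-output/115/part-r-00000 pp-output/21/part-r-00000

# Visit the cluster directories one by one, as suggested above
# (here we only list them; a real script would copy or parse each).
for d in pp-output/*/; do
  echo "cluster dir: $d"
done
# prints:
# cluster dir: pp-output/115/
# cluster dir: pp-output/21/
```

Each directory can then be fed to whatever per-cluster reader or dumper you use, which keeps memory bounded to one cluster at a time.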
>>>>>>> I got the OOM error even after increasing the Mahout heap size, and
>>>>>>> hence had written code of my own to analyze the clusters by reading
>>>>>>> "-clusteredPoints".
>>>>>>>
>>>>>>> On Thu, Dec 15, 2011 at 2:58 AM, Gary Snider <[email protected]> wrote:
>>>>>>>
>>>>>>>> Ok. See if you can get the --pointsDir working and post what you
>>>>>>>> get. Also, for seqFileDir, do you have a directory with the word
>>>>>>>> 'final' in it?
>>>>>>>>
>>>>>>>> On Dec 14, 2011, at 12:37 PM, ipshita chatterji
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> For clusterdumper I had the following command line:
>>>>>>>>>
>>>>>>>>> $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6
>>>>>>>>> --output clusteranalyze.txt
>>>>>>>>>
>>>>>>>>> I have written a separate program to read the clusteredOutput
>>>>>>>>> directory, as clusterdumper with "--pointsDir
>>>>>>>>> output/clusteredPoints" was giving an OOM exception.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Wed, Dec 14, 2011 at 10:06 PM, Gary Snider
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> What was on your command line? e.g. seqFileDir, pointsDir, etc.
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 14, 2011 at 10:54 AM, ipshita chatterji
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am a newbie in Mahout and also have only elementary knowledge
>>>>>>>>>>> of clustering. I managed to cluster my data using mean shift and
>>>>>>>>>>> then ran clusterdumper; I get the following output:
>>>>>>>>>>>
>>>>>>>>>>> MSV-21{n=1 c=[1:0...........]
>>>>>>>>>>>
>>>>>>>>>>> So I assume that the cluster above has converged and that n=1
>>>>>>>>>>> indicates there is only one point associated with this cluster.
>>>>>>>>>>>
>>>>>>>>>>> Now I try to read the members of this cluster from the
>>>>>>>>>>> "clusteredPoints" directory.
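On the OOM side of this exchange: the bin/mahout launcher script honors a MAHOUT_HEAPSIZE environment variable (in MB) when sizing the JVM, so raising it is worth trying before abandoning --pointsDir. A sketch, with the 4096 value purely illustrative and the command echoed rather than executed (running it requires a Mahout install):

```shell
# Raise the heap the mahout launcher script will request (value in MB,
# illustrative; pick it based on available RAM and the size of clusteredPoints).
export MAHOUT_HEAPSIZE=4096

# The clusterdump invocation from the thread, with --pointsDir restored.
# Echoed only, so this sketch runs without a Mahout install.
echo "would run: \$MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-6 --pointsDir output/clusteredPoints --output clusteranalyze.txt"
```

If the dump still exhausts memory at a workable heap size, the per-directory post-processing route discussed earlier in the thread is the fallback.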
>>>>>>>>>>> I see from the output that the number of points belonging to
>>>>>>>>>>> this cluster is 173.
>>>>>>>>>>>
>>>>>>>>>>> Why is this mismatch happening? Am I missing something here?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Ipshita
>>>>>>>>>>>
>>>>>>> -----
>>>>>>> No virus found in this message.
>>>>>>> Checked by AVG - www.avg.com
>>>>>>> Version: 10.0.1415 / Virus Database: 2102/4080 - Release Date: 12/14/11
