RE: Clustering Questions

Severance, Steve Mon, 16 Aug 2010 12:06:11 -0700

I updated to the current revision of trunk. It does not package successfully
because some of the tests fail.


Failed tests: 
  testStartParallelCounting(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)
  testStartGroupingItems(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)

Tests in error: 
  testLoglikelihood(org.apache.mahout.math.hadoop.similarity.vector.DistributedLoglikelihoodVectorSimilarityTest)
  testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
  testCompleteJob(org.apache.mahout.cf.taste.hadoop.item.RecommenderJobTest)
  testCompleteJobBoolean(org.apache.mahout.cf.taste.hadoop.item.RecommenderJobTest)
  testTanimoto(org.apache.mahout.math.hadoop.similarity.vector.DistributedTanimotoCoefficientVectorSimilarityTest)
  testStartParallelFPGrowth(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)
  testCanopyMapperManhattan(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyMapperEuclidean(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyReducerManhattan(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyReducerEuclidean(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyGenManhattanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyGenEuclideanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testClusterMapperManhattan(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testClusterMapperEuclidean(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testClusteringManhattanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testClusteringEuclideanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testUserDefinedDistanceMeasure(org.apache.mahout.clustering.canopy.TestCanopyCreation)
  testCanopyEuclideanMRJob(org.apache.mahout.clustering.meanshift.TestMeanShift)
  testCompleteJob(org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest)
  testMaxSimilaritiesPerItem(org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest)
  testRowWeightMapper(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
  testSimilarityReducer(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
  testSimilarityReducerSelfSimilarity(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
  testSmallSampleMatrix(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
  testLimitEntriesInSimilarityMatrix(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
  testEvaluate(org.apache.mahout.ga.watchmaker.MahoutEvaluatorTest)
  testMaxHeapFPGrowth(org.apache.mahout.fpm.pfpgrowth.FPGrowthTest)
  testFuzzyKMeansMRJob(org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering)
  testTranspose(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
  testMatrixTimesVector(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
  testMatrixTimesSquaredVector(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
  testMatrixTimesMatrix(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
  testSelfTestBayes(org.apache.mahout.classifier.bayes.BayesClassifierSelfTest)
  testSelfTestCBayes(org.apache.mahout.classifier.bayes.BayesClassifierSelfTest)
  testDistributedLanczosSolver(org.apache.mahout.math.hadoop.decomposer.TestDistributedLanczosSolver)

I can provide any extra information needed.

My other build of trunk, from August 4th, fails to run seq2sparse because the
Lucene standard analyzer cannot be found. Which .job file should contain it?

Thanks.

Steve

-----Original Message-----
From: Ted Dunning [mailto:[email protected]] 
Sent: Monday, August 16, 2010 11:20 AM
To: [email protected]
Subject: Re: Clustering Questions

On Mon, Aug 16, 2010 at 11:15 AM, Severance, Steve <[email protected]> wrote:

>
> 1.       It appears that the .job files contain the code that is actually
> needed to run. How do I build these? They don't seem to be built when 
> I build mahout with Maven.
>

Which version are you using?  I recommend trunk for pretty much everything.


> 2.       The Mahout 0.3 tag line numbers don't seem to match with the
> compiled jars. What revision number is 0.3 built from?
>

It should have been what was tagged.  But, even so, I recommend using trunk.


> 3.       It looks like the format of the cluster files changed between
> 0.3 and 0.4. Is this true?
>

Others can say for sure, but this is very likely.  0.4 is going to be a
major change.

> 4.       I was never able to get the Cluster dumping tool to work. I wrote my
> own to export the clusters to Hive for analysis. Are there any plans
> for better Hive integration?
>

This has been substantially improved.  Is there something that can be done to 
facilitate Hive integration without making Hive a dependency?
