Silent blow-up inside Mahout code while generating TFIDF vector data.
Using the command line tools (mahout seqdumper), I have found that although
the TF vectors are generated, the TFIDF vectors are NOT being generated:
Tail of TF vector seq dump:
Key: /reut2-021.sgm-99.txt: Value:
/reut2-021.sgm-99.txt:{291:0.08873565094161139,448:0.08873565094161139,990:0
.08873565094161139,1438:0.08873565094161139,1551:0.08873565094161139,1707:0.
08873565094161139,1873:0.08873565094161139,2413:0.08873565094161139,3144:0.0
8873565094161139,4196:0.08873565094161139,4818:0.08873565094161139,8603:0.17
747130188322277,9399:0.08873565094161139,11013:0.08873565094161139,12257:0.1
40642679219536,12626:0.08873565094161139,14594:0.08873565094161139,14667:0.0
8873565094161139,14803:0.17747130188322277,14870:0.08873565094161139,14968:0
.08873565094161139,15099:0.08873565094161139,16627:0.08873565094161139,16858
:0.140642679219536,17442:0.08873565094161139,17481:0.140642679219536,18327:0
.140642679219536,20009:0.140642679219536,20091:0.08873565094161139,20721:0.2
4911246643291818,21660:0.08873565094161139,22568:0.08873565094161139,22661:0
.08873565094161139,22868:0.08873565094161139,23235:0.08873565094161139,23497
:0.08873565094161139,23765:0.08873565094161139,27610:0.08873565094161139,276
21:0.140642679219536,28488:0.08873565094161139,30408:0.08873565094161139,345
85:0.08873565094161139,34793:0.08873565094161139,34942:0.08873565094161139,3
5075:0.08873565094161139,35170:0.08873565094161139,36979:0.08873565094161139
,37224:0.140642679219536,37660:0.08873565094161139,38766:0.08873565094161139
,40624:0.140642679219536,41334:0.08873565094161139,}
Count: 21578
Tail of TFIDF vector seq dump:
Input Path: part-r-00000
Key class: class org.apache.hadoop.io.Text Value Class: class
org.apache.mahout.math.VectorWritable
Count: 0
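For what it's worth, here is the small check I have been using to confirm those counts from Java rather than from the console. It is only a throwaway sketch of my own (the class name is mine, and it assumes the default local filesystem, the usual tf-vectors / tfidf-vectors folder names under newsClusters, and a single part-r-00000 file under each):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

/** Throwaway counter that mirrors the Count value reported by mahout seqdumper. */
public class CountVectorRecords
{
    public static void main(String args[]) throws Exception
    {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String[] inputs = {
            "newsClusters/tf-vectors/part-r-00000",    // term frequency vectors
            "newsClusters/tfidf-vectors/part-r-00000"  // tfidf vectors (empty for me)
        };
        for ( String input : inputs )
        {
            long count = 0;
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(input), conf);
            Text key = new Text();                       // key class per the seq dump above
            VectorWritable value = new VectorWritable(); // value class per the seq dump above
            while ( reader.next(key, value) )
            {
                count++;
            }
            reader.close();
            System.out.println(input + " -> " + count + " records");
        }
    }
}

Nothing fancy; it just walks each sequence file with a raw SequenceFile.Reader so the count does not depend on any Mahout utility classes.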
NOTE 1: I continued on past my step 3 while waiting for answers, hoping that
I could surmount the problem on my own. Specifically, I changed the
statements inside NewsKMeansClustering so that the steps pass true for named
vectors and true for run sequential. I also modified the distance measure to
the one that produced results for me when running the mahout XXXX command
line version of this exercise: CosineDistanceMeasure with t1 at .4 and t2 at
.8 (all in conjunction with StandardAnalyzer instead of MyAnalyzer).
NOTE 2: Remember that I updated the pom.xml from Mahout version 0.7 to 0.8.
Here is the console output from when I execute NewsKMeansClustering (source
listed in the block after the console dump):
tokenizing the documents
2013-12-27 15:42:13 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-12-27 15:42:13 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
creating the term frequency vectors from tokenized documents
2013-12-27 15:42:16 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-12-27 15:43:12 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-12-27 15:43:14 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-12-27 15:43:24 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
calculating document frequencies from tf vectors
2013-12-27 15:43:27 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
creating the tfidf vectors
2013-12-27 15:43:31 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-12-27 15:43:33 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Deriving canopy clusters from the tfidf vectors
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.mahout.clustering.classify.ClusterClassifier.<init>(ClusterClassifier.java:85)
    at org.apache.mahout.clustering.classify.ClusterClassificationDriver.classifyClusterSeq(ClusterClassificationDriver.java:146)
    at org.apache.mahout.clustering.classify.ClusterClassificationDriver.run(ClusterClassificationDriver.java:282)
    at org.apache.mahout.clustering.canopy.CanopyDriver.clusterData(CanopyDriver.java:374)
    at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:157)
    at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:168)
    at org.apache.mahout.clustering.canopy.CanopyDriver.run(CanopyDriver.java:195)
    at mia.clustering.ch09.NewsKMeansClustering.main(NewsKMeansClustering.java:81)
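I read that IndexOutOfBoundsException as ClusterClassifier being handed an empty list of clusters, which lines up with the empty tfidf-vectors output above: canopy has nothing to cluster, so the cluster classification step has no models to index into. To fail a little earlier and more loudly, I am considering dropping a guard like the following right before the CanopyDriver.run(...) call in the listing below. This is only my own sketch, not anything from the book; it reuses fs, conf, and vectorsFolder from the listing and needs org.apache.hadoop.fs.FileStatus, org.apache.hadoop.io.Text, and org.apache.mahout.math.VectorWritable as extra imports:

// Hypothetical guard (mine, not part of the book's listing): bail out before the
// canopy step if the tfidf phase silently produced no vectors.
long tfidfCount = 0;
for ( FileStatus part : fs.listStatus(vectorsFolder) )
{
    if ( !part.getPath().getName().startsWith("part-") )
    {
        continue;  // ignore _SUCCESS markers and the like
    }
    SequenceFile.Reader check = new SequenceFile.Reader(fs, part.getPath(), conf);
    Text k = new Text();
    VectorWritable v = new VectorWritable();
    while ( check.next(k, v) )
    {
        tfidfCount++;  // count every tfidf vector actually written
    }
    check.close();
}
System.out.println("tfidf vectors available for clustering: " + tfidfCount);
if ( tfidfCount == 0 )
{
    // presumably the state that later surfaces as the IndexOutOfBoundsException in ClusterClassifier
    throw new IllegalStateException("no tfidf vectors under " + vectorsFolder + "; aborting before canopy");
}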
Here is my slightly modified version of Alex Ott's 0.7 source for
NewsKMeansClustering:
/*
 * Source code for Listing 9.4
 */
package mia.clustering.ch09;

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.mahout.clustering.Cluster;
import org.apache.mahout.clustering.canopy.CanopyDriver;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.vectorizer.DictionaryVectorizer;
import org.apache.mahout.vectorizer.DocumentProcessor;
import org.apache.mahout.vectorizer.tfidf.TFIDFConverter;

public class NewsKMeansClustering
{
    public static void main(String args[]) throws Exception
    {
        //
        // Changes from Alex Ott's source:
        //
        // 1. changed booleans that indicate the use of named vectors from false to true
        // 2. changed sequential access booleans from false to true
        // 3. changed MyAnalyzer to StandardAnalyzer
        // 4. added System.out.println statements to provide console guidance on progress
        // 5. changed input dir to reuters-seqfiles to make use of output from the
        //    command line approach in the tour
        //
        int minSupport = 5;
        int minDf = 5;
        int maxDFPercent = 95;
        int maxNGramSize = 2;
        int minLLRValue = 50;
        int reduceTasks = 1;
        int chunkSize = 200;
        int norm = 2;
        boolean sequentialAccessOutput = true;

        // String inputDir = "inputDir";
        String inputDir = "reuters-seqfiles";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        String outputDir = "newsClusters";
        HadoopUtil.delete(conf, new Path(outputDir));
        Path tokenizedPath = new Path(outputDir, DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER);

        // MyAnalyzer analyzer = new MyAnalyzer();
        System.out.println("tokenizing the documents");
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT, StandardAnalyzer.STOP_WORDS_SET);
        DocumentProcessor.tokenizeDocuments(new Path(inputDir), analyzer.getClass().asSubclass(Analyzer.class),
            tokenizedPath, conf);

        System.out.println("creating the term frequency vectors from tokenized documents");
        DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, new Path(outputDir),
            DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf, minSupport, maxNGramSize, minLLRValue,
            2, true, reduceTasks, chunkSize, sequentialAccessOutput, true);

        System.out.println("calculating document frequencies from tf vectors");
        Pair<Long[], List<Path>> dfData = TFIDFConverter.calculateDF(
            new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER), new Path(outputDir),
            conf, chunkSize);

        System.out.println("creating the tfidf vectors");
        TFIDFConverter.processTfIdf(new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
            new Path(outputDir), conf, dfData, minDf, maxDFPercent, norm, true,
            sequentialAccessOutput, true, reduceTasks);

        Path vectorsFolder = new Path(outputDir, "tfidf-vectors");
        Path canopyCentroids = new Path(outputDir, "canopy-centroids");
        Path clusterOutput = new Path(outputDir, "clusters");

        System.out.println("Deriving canopy clusters from the tfidf vectors");
        // CanopyDriver.run(vectorsFolder, canopyCentroids, new EuclideanDistanceMeasure(), 250, 120, false, 0.0, false);
        CanopyDriver.run(vectorsFolder, canopyCentroids, new CosineDistanceMeasure(), .4, .8, true, 0.0, true);

        System.out.println("running cluster kmean");
        // KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
        //     new TanimotoDistanceMeasure(), 0.01, 20, true, 0.0, false);
        KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
            new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true);

        SequenceFile.Reader reader = new SequenceFile.Reader(fs,
            new Path(clusterOutput + Cluster.CLUSTERED_POINTS_DIR + "/part-00000"), conf);
        IntWritable key = new IntWritable();
        WeightedVectorWritable value = new WeightedVectorWritable();
        while ( reader.next(key, value) )
        {
            System.out.println(key.toString() + " belongs to cluster " + value.toString());
        }
        reader.close();
    }
}
I'm running out of ideas.
SCott
From: "Scott C. Cote" <[email protected]>
Date: Friday, December 27, 2013 1:56 PM
To: "[email protected]" <[email protected]>
Subject: Mahout In Action - NewsKMeansClustering sample not generating
clusters
Hello Mahout Trainers and Gurus:
I am plowing through the sample code from Mahout in Action and have been
trying to run the NewsKMeansClustering example using the Reuters dataset.
I found Alex Ott's blog post,
http://alexott.blogspot.co.uk/2012/07/getting-started-with-examples-from.html,
and downloaded the updated examples for Mahout 0.7. I took the exploded zip
and modified the pom.xml so that it referenced Mahout 0.8 instead of 0.7.
Of course, there are compile errors (expected), but the only seemingly
significant problems are in the helper class called MyAnalyzer.
NOTE: I am NOT complaining about the fact that the samples don't compile
properly against 0.8. If my effort to make them work results in sharable
code, then I have helped (or the person who helps me has).
I need help with potentially two different parts: revising MyAnalyzer
(steps 1 and 2) and/or sidestepping it (step 3).
Steps taken (3 in total):
Step 1. Performed the sgml2text conversion of the Reuters data and then
converted the text to sequence files.
Step 2. Attempted to run the Java NewsKMeansClustering with MyAnalyzer,
after attempting to modify MyAnalyzer to fit into the 0.8 Mahout world.
When I try to run the program, the sample blows up with this message:
> 2013-12-27 12:59:29.870 java[86219:1203] Unable to load realm info from SCDynamicStore
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/Users/scottccote/.m2/repository/org/slf4j/slf4j-jcl/1.7.5/slf4j-jcl-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/Users/scottccote/.m2/repository/org/slf4j/slf4j-log4j12/1.5.11/slf4j-log4j12-1.5.11.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.JCLLoggerFactory]
>
> 2013-12-27 12:59:30 NativeCodeLoader [WARN] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2013-12-27 12:59:30 JobClient [WARN] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2013-12-27 12:59:30 LocalJobRunner [WARN] job_local_0001
>
> java.lang.NullPointerException
>     at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.fill(CharacterUtils.java:209)
>     at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:135)
>     at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:49)
>     at org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper.map(SequenceFileTokenizerMapper.java:38)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:214)
>
> Exception in thread "main" java.lang.IllegalStateException: Job failed!
>     at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95)
>     at mia.clustering.ch09.NewsKMeansClustering.main(NewsKMeansClustering.java:53)
Here is the source code of my revised MyAnalyzer. I tried to stay as true as
possible to the form of the original MyAnalyzer, but I'm sure I
misunderstood something in this class when I ported it to the new Lucene
Analyzer API.
> public class MyAnalyzer extends Analyzer
> {
>     private final Pattern alphabets = Pattern.compile("[a-z]+");
>
>     /*
>      * (non-Javadoc)
>      * @see org.apache.lucene.analysis.Analyzer#createComponents(java.lang.String, java.io.Reader)
>      */
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, Reader reader)
>     {
>         final Tokenizer source = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
>         TokenStream result = new StandardFilter(Version.LUCENE_CURRENT, source);
>         result = new LowerCaseFilter(Version.LUCENE_CURRENT, result);
>         result = new StopFilter(Version.LUCENE_CURRENT, result, StandardAnalyzer.STOP_WORDS_SET);
>         CharTermAttribute termAtt = result.addAttribute(CharTermAttribute.class);
>         StringBuilder buf = new StringBuilder();
>
>         try
>         {
>             result.reset();
>             while ( result.incrementToken() )
>             {
>                 if ( termAtt.length() < 3 )
>                     continue;
>                 String word = new String(termAtt.buffer(), 0, termAtt.length());
>                 Matcher m = alphabets.matcher(word);
>                 if ( m.matches() )
>                 {
>                     buf.append(word).append(" ");
>                 }
>             }
>         }
>         catch ( IOException e )
>         {
>             e.printStackTrace();
>         }
>
>         TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_CURRENT, new StringReader(buf.toString()));
>         return new TokenStreamComponents(source, ts);
>     }
> }
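My best guess so far at what the port should look like is below: keep createComponents purely declarative and move the length/regex screening into a TokenFilter, so that nothing consumes the stream before the framework resets it. I have not been able to validate this against the Lucene version that Mahout 0.8 pulls in, and AlphaMin3Filter is just a name I made up, so please treat it as a sketch and correct me where I have it wrong:

> import java.io.IOException;
> import java.io.Reader;
> import java.util.regex.Pattern;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.Tokenizer;
> import org.apache.lucene.analysis.core.LowerCaseFilter;
> import org.apache.lucene.analysis.core.StopFilter;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.analysis.standard.StandardFilter;
> import org.apache.lucene.analysis.standard.StandardTokenizer;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.util.Version;
>
> public class MyAnalyzer extends Analyzer
> {
>     // Drops tokens shorter than 3 chars or containing anything but a-z,
>     // doing lazily what my version above did eagerly with a StringBuilder.
>     private static final class AlphaMin3Filter extends TokenFilter
>     {
>         private static final Pattern ALPHABETS = Pattern.compile("[a-z]+");
>         private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>
>         AlphaMin3Filter(TokenStream in)
>         {
>             super(in);
>         }
>
>         @Override
>         public boolean incrementToken() throws IOException
>         {
>             while ( input.incrementToken() )
>             {
>                 if ( termAtt.length() >= 3 && ALPHABETS.matcher(termAtt).matches() )
>                 {
>                     return true;  // keep this token
>                 }
>                 // otherwise skip it and look at the next one
>             }
>             return false;
>         }
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName, Reader reader)
>     {
>         final Tokenizer source = new StandardTokenizer(Version.LUCENE_CURRENT, reader);
>         TokenStream result = new StandardFilter(Version.LUCENE_CURRENT, source);
>         result = new LowerCaseFilter(Version.LUCENE_CURRENT, result);
>         result = new StopFilter(Version.LUCENE_CURRENT, result, StandardAnalyzer.STOP_WORDS_SET);
>         result = new AlphaMin3Filter(result);
>         return new TokenStreamComponents(source, result);
>     }
> }

If that shape is right, then my guess is that the eager consumption of the stream inside createComponents, plus returning a WhitespaceTokenizer over a throwaway StringReader, is what leads to the NullPointerException in CharTokenizer.incrementToken once the mapper reuses the analyzer, but I could be wrong about the exact mechanism.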
Step 3. Since I wasn't progressing with MyAnalyzer, I commented out the
MyAnalyzer reference inside NewsKMeansClustering and replaced it with:
> // MyAnalyzer analyzer = new MyAnalyzer();
> System.out.println("tokenizing the documents");
> StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT, StandardAnalyzer.STOP_WORDS_SET);
That gets me past the problem mentioned in step 2, all the way to
calculating the k-means clusters based on the canopy data.
Unfortunately, no clusters are generated by the canopy process. I confirmed
this by navigating to the folder:
> newsClusters/canopy-centroids/clusters-0-final
and issuing the command:
> mahout seqdumper -i part-r-00000
to see the result:
> Input Path: part-r-00000
> Key class: class org.apache.hadoop.io.Text Value Class: class
> org.apache.mahout.clustering.iterator.ClusterWritable
> Count: 0
So what do I need to do in order for the sample to generate clusters?
NOTE: I was able to generate clusters using the manual process (command
line methods).