I am trying to use the Dirichlet clusterer and am getting an error like this:

    org.apache.mahout.math.IndexException: Index 204 is outside allowable range of [0,99)
        at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:172)

I could use an extra pair of eyes on my process and would be glad for any
pointers in the right direction.
I am executing the clusterer with:

    bin/mahout dirichlet --input 2_vect.out --output 2_cluster --maxIter 100 --numClusters 5

I prepared the vector file from a directory of files using the following code,
modeled on the LastfmDataConverter and a SequenceFileWriteDemo (I have snipped
the obvious parts, such as the imports and the 'main' method):

    public static Map<String, List<Integer>> convertToItemFeatures(String inputFile,
            Map<String, List<Integer>> itemFeatures, Map<String, Integer> featureIdxM)
            throws IOException {
        BufferedReader br = Files.newReader(new File(inputFile), Charsets.UTF_8);
        try {
            String line;
            System.out.print("Reading " + inputFile + "\n");
            while ((line = br.readLine()) != null) {
                // get the featureIdx
                Integer featureIdx = featureIdxM.get(line);
                if (featureIdx == null) {
                    featureIdx = featureIdxM.size() + 1;
                    featureIdxM.put(line, featureIdx);
                }
                // add it to the corresponding feature idx map
                List<Integer> features = itemFeatures.get(inputFile);
                if (features == null) {
                    features = Lists.newArrayList();
                    itemFeatures.put(inputFile, features);
                }
                features.add(featureIdx);
            }
        } finally {
            Closeables.closeQuietly(br);
        }
        return itemFeatures;
    }

    /**
     * Converts each record in (item,features) map into Mahout vector format and
     * writes it into sequencefile for minhash clustering
     */
    public static boolean writeToSequenceFile(Map<String, List<Integer>> itemFeaturesMap,
            Path outputPath) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(outputPath.getParent());
        long totalRecords = itemFeaturesMap.size();
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, outputPath,
                Text.class, VectorWritable.class);
        try {
            String msg = "Now writing vectorized data in sequence file format: ";
            System.out.print(msg);
            Text itemWritable = new Text();
            VectorWritable featuresWritable = new VectorWritable();
            for (Map.Entry<String, List<Integer>> itemFeature : itemFeaturesMap.entrySet()) {
                int numfeatures = itemFeature.getValue().size();
                itemWritable.set(itemFeature.getKey());
                Vector featureVector = new SequentialAccessSparseVector(numfeatures);
                int i = 0;
                for (Integer feature : itemFeature.getValue()) {
                    featureVector.setQuick(i++, feature);
                }
                featuresWritable.set(featureVector);
                writer.append(itemWritable, featuresWritable);
            }
        } finally {
            Closeables.closeQuietly(writer);
        }
        return true;
    }

    public static Map<String, List<Integer>> listFilesForFolder(final File folder) {
        Map<String, Integer> featureIdxMap = Maps.newHashMap();
        Map<String, List<Integer>> itemFeaturesMap = Maps.newHashMap();
        File[] listOfFiles = folder.listFiles();
        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
                try {
                    String files = listOfFiles[i].getCanonicalPath();
                    System.out.println(files);
                    convertToItemFeatures(files, itemFeaturesMap, featureIdxMap);
                    System.out.print("Size of features == " + featureIdxMap.size() + "\n");
                    System.out.print("Size of itemFeatures == " + itemFeaturesMap.size() + "\n");
                } catch (Exception e) {
                    System.out.print("Ugh " + e.getMessage());
                }
            }
        }
        return itemFeaturesMap;
    }

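
To make the indexing step in convertToItemFeatures easier to check in isolation, here is a minimal sketch of it using only java.util collections. The class name FeatureIndexSketch, the method indexLines, and the sample lines are made up for illustration; the logic is meant to mirror the "featureIdxM.size() + 1" assignment above, so indices start at 1 and every distinct line across all files gets its own index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FeatureIndexSketch {

    // Gives each distinct line the next free index (starting at 1) and
    // returns the index of every line in order, repeats included.
    static List<Integer> indexLines(List<String> lines, Map<String, Integer> featureIdxM) {
        List<Integer> features = new ArrayList<>();
        for (String line : lines) {
            Integer featureIdx = featureIdxM.get(line);
            if (featureIdx == null) {
                featureIdx = featureIdxM.size() + 1;
                featureIdxM.put(line, featureIdx);
            }
            features.add(featureIdx);
        }
        return features;
    }

    public static void main(String[] args) {
        // One dictionary shared by two hypothetical input files.
        Map<String, Integer> dictionary = new HashMap<>();
        List<Integer> fileA = indexLines(Arrays.asList("foo", "bar", "foo"), dictionary);
        List<Integer> fileB = indexLines(Arrays.asList("bar", "baz"), dictionary);
        System.out.println(fileA);             // [1, 2, 1]
        System.out.println(fileB);             // [2, 3]
        System.out.println(dictionary.size()); // 3
    }
}
```

So the feature list for a file has one entry per line, and the largest index can be as large as the number of distinct lines seen so far in the whole directory.
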
Any ideas why my vector file cannot be read? Do I still need to run seq2sparse
on it?
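
One thing I noticed while writing this up, and this is only a guess at the cause: the loop in writeToSequenceFile sizes each vector by the number of features in that one file and stores the feature ids as values at positions 0..n-1, so every file ends up with a vector of a different length. The sketch below uses plain double arrays in place of Mahout vectors to show that layout next to an untested alternative in which every vector shares one length and the feature id is the index. CardinalitySketch, positional, and indexed are hypothetical names, and whether the second layout is what the Dirichlet clusterer expects is an assumption on my part.

```java
import java.util.Arrays;
import java.util.List;

public class CardinalitySketch {

    // Mirrors my current loop: the array is as long as the item's feature
    // list, and the feature ids are stored as VALUES at positions 0..n-1.
    static double[] positional(List<Integer> featureIds) {
        double[] v = new double[featureIds.size()];
        int i = 0;
        for (int id : featureIds) {
            v[i++] = id;
        }
        return v;
    }

    // Untested alternative: one shared length for every item, with the
    // 1-based feature id used as the INDEX and an occurrence count as the value.
    static double[] indexed(List<Integer> featureIds, int cardinality) {
        double[] v = new double[cardinality];
        for (int id : featureIds) {
            v[id - 1] += 1.0;
        }
        return v;
    }

    public static void main(String[] args) {
        List<Integer> itemA = Arrays.asList(1, 2, 1);
        List<Integer> itemB = Arrays.asList(2, 3, 4, 5, 3);

        // Current layout: lengths differ per item.
        System.out.println(positional(itemA).length); // 3
        System.out.println(positional(itemB).length); // 5

        // Alternative layout: both share the dictionary size as length.
        int dictionarySize = 5;
        System.out.println(Arrays.toString(indexed(itemA, dictionarySize))); // [2.0, 1.0, 0.0, 0.0, 0.0]
        System.out.println(Arrays.toString(indexed(itemB, dictionarySize))); // [0.0, 1.0, 2.0, 1.0, 1.0]
    }
}
```

If the clusterer takes its expected size from one vector and then reads a longer one, that would be consistent with "Index 204 is outside allowable range of [0,99)", but I have not confirmed this.
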