Hard to tell without seeing the full stack trace that raised the exception, but
consider: in order to create the ModelDistribution, a Vector prototype
is created with the size of the first data record read. That prototype then
determines the size of the corresponding Models created by the
distribution. If any of your input vectors is larger than this
prototype size, it could cause the index exception you are seeing.
I suggest you create your sparse vectors with Integer.MAX_VALUE cardinality to
work around this. They won't take up any more space, and the algorithm will be
more forgiving.
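A minimal sketch of that workaround, spliced into your writeToSequenceFile loop (this reuses your itemFeature variable and Mahout's SequentialAccessSparseVector; only the cardinality argument changes):

```java
// Sketch: give every sparse vector the maximum cardinality instead of sizing
// it to the current record. A sparse vector only stores the entries actually
// set, so the huge nominal size costs no extra memory, and the prototype
// derived from the first record can then accommodate every later record.
Vector featureVector = new SequentialAccessSparseVector(Integer.MAX_VALUE);
int i = 0;
for (Integer feature : itemFeature.getValue()) {
  featureVector.setQuick(i++, feature);
}
```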
On 10/3/12 1:04 PM, David Swift wrote:
I am attempting to use the dirichlet clusterer, and I am getting an error like:
org.apache.mahout.math.IndexException: Index 204 is outside allowable range of
[0,99)
at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:172)
I could use an extra pair of eyes on my process; I'd be very glad for any
pointers in the right direction.
I am executing the clusterer with:
bin/mahout dirichlet --input 2_vect.out --output 2_cluster --maxIter 100 --numClusters 5
I have prepared a vector file from a directory of files using the following
code, modeled on the LastfmDataConverter and a SequenceFileWriteDemo (I've
snipped the obvious stuff like the imports and the 'main' method):
public static Map<String, List<Integer>> convertToItemFeatures(String inputFile,
    Map<String, List<Integer>> itemFeatures, Map<String, Integer> featureIdxM)
    throws IOException {
  BufferedReader br = Files.newReader(new File(inputFile), Charsets.UTF_8);
  try {
    String line;
    System.out.print("Reading " + inputFile + "\n");
    while ((line = br.readLine()) != null) {
      // get the featureIdx
      Integer featureIdx = featureIdxM.get(line);
      if (featureIdx == null) {
        featureIdx = featureIdxM.size() + 1;
        featureIdxM.put(line, featureIdx);
      }
      // add it to the corresponding feature idx map
      List<Integer> features = itemFeatures.get(inputFile);
      if (features == null) {
        features = Lists.newArrayList();
        itemFeatures.put(inputFile, features);
      }
      features.add(featureIdx);
    }
  } finally {
    Closeables.closeQuietly(br);
  }
  return itemFeatures;
}
/**
 * Converts each record in the (item, features) map into Mahout vector format
 * and writes it into a sequence file for minhash clustering.
 */
public static boolean writeToSequenceFile(Map<String, List<Integer>> itemFeaturesMap,
    Path outputPath) throws IOException {
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  fs.mkdirs(outputPath.getParent());
  long totalRecords = itemFeaturesMap.size();
  SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, outputPath,
      Text.class, VectorWritable.class);
  try {
    String msg = "Now writing vectorized data in sequence file format: ";
    System.out.print(msg);
    Text itemWritable = new Text();
    VectorWritable featuresWritable = new VectorWritable();
    for (Map.Entry<String, List<Integer>> itemFeature : itemFeaturesMap.entrySet()) {
      int numfeatures = itemFeature.getValue().size();
      itemWritable.set(itemFeature.getKey());
      Vector featureVector = new SequentialAccessSparseVector(numfeatures);
      int i = 0;
      for (Integer feature : itemFeature.getValue()) {
        featureVector.setQuick(i++, feature);
      }
      featuresWritable.set(featureVector);
      writer.append(itemWritable, featuresWritable);
    }
  } finally {
    Closeables.closeQuietly(writer);
  }
  return true;
}
public static Map<String, List<Integer>> listFilesForFolder(final File folder) {
  Map<String, Integer> featureIdxMap = Maps.newHashMap();
  Map<String, List<Integer>> itemFeaturesMap = Maps.newHashMap();
  File[] listOfFiles = folder.listFiles();
  for (int i = 0; i < listOfFiles.length; i++) {
    if (listOfFiles[i].isFile()) {
      try {
        String files = listOfFiles[i].getCanonicalPath();
        System.out.println(files);
        convertToItemFeatures(files, itemFeaturesMap, featureIdxMap);
        System.out.print("Size of features == " + featureIdxMap.size() + "\n");
        System.out.print("Size of itemFeatures == " + itemFeaturesMap.size() + "\n");
      } catch (Exception e) {
        System.out.print("Ugh " + e.getMessage());
      }
    }
  }
  return itemFeaturesMap;
}
Any ideas why my vector file cannot be read? Do I still need to run seq2sparse
on it?