I am attempting to use the dirichlet clusterer, and I am getting an error like:
org.apache.mahout.math.IndexException: Index 204 is outside allowable range of 
[0,99)
        at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:172)

I could use an extra pair of eyes on my process; I'd be very glad for any pointers in the right direction.

I am executing the clusterer with:
 bin/mahout dirichlet --input 2_vect.out --output 2_cluster --maxIter 100 --numClusters 5


I have prepared a vector file from a directory of files using the following code, modeled on the LastfmDataConverter and a SequenceFileWriteDemo (I've snipped the obvious bits such as the imports and the main method):

  public static Map<String, List<Integer>> convertToItemFeatures(String inputFile,
      Map<String, List<Integer>> itemFeatures, Map<String, Integer> featureIdxM)
      throws IOException {
    BufferedReader br = Files.newReader(new File(inputFile), Charsets.UTF_8);
    try {
      String line;
      System.out.print("Reading " + inputFile + "\n");
      while ((line = br.readLine()) != null) {
        // get the featureIdx
        Integer featureIdx = featureIdxM.get(line);
        if (featureIdx == null) {
          featureIdx = featureIdxM.size() + 1;
          featureIdxM.put(line, featureIdx);
        }
        // add it to the corresponding feature idx map
        List<Integer> features = itemFeatures.get(inputFile);
        if (features == null) {
          features = Lists.newArrayList();
          itemFeatures.put(inputFile, features);
        }
        features.add(featureIdx);
      }
    } finally {
      Closeables.closeQuietly(br);
    }
    return itemFeatures;
  }
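To double-check my own indexing logic, here is a toy, standalone version of the loop above (the class name and sample lines are invented, not my real data). It shows that the feature ids are global across all input files, so a single file's feature list can contain ids far larger than that file's own feature count:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy re-run of the indexing logic above on two fake "files".
public class FeatureIndexSketch {

    // Mirrors the body of the while-loop in convertToItemFeatures.
    static void addLine(String file, String line,
                        Map<String, List<Integer>> itemFeatures,
                        Map<String, Integer> featureIdxM) {
        Integer featureIdx = featureIdxM.get(line);
        if (featureIdx == null) {
            featureIdx = featureIdxM.size() + 1; // ids start at 1, as above
            featureIdxM.put(line, featureIdx);
        }
        List<Integer> features = itemFeatures.get(file);
        if (features == null) {
            features = new ArrayList<>();
            itemFeatures.put(file, features);
        }
        features.add(featureIdx);
    }

    static Map<String, List<Integer>> run() {
        Map<String, Integer> featureIdxM = new HashMap<>();
        Map<String, List<Integer>> itemFeatures = new HashMap<>();
        addLine("a.txt", "red", itemFeatures, featureIdxM);
        addLine("a.txt", "blue", itemFeatures, featureIdxM);
        addLine("b.txt", "green", itemFeatures, featureIdxM);
        return itemFeatures;
    }

    public static void main(String[] args) {
        // "b.txt" holds a single feature, but that feature's id is 3:
        // ids are global, so they can exceed any one file's feature count.
        System.out.println(run().get("b.txt")); // prints [3]
    }
}
```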

  /**
   * Converts each record in (item,features) map into Mahout vector format and
   * writes it into sequencefile for minhash clustering
   */
  public static boolean writeToSequenceFile(Map<String, List<Integer>> itemFeaturesMap,
      Path outputPath) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(outputPath.getParent());
    long totalRecords = itemFeaturesMap.size();
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, outputPath,
        Text.class, VectorWritable.class);
    try {
      String msg = "Now writing vectorized data in sequence file format: ";
      System.out.print(msg);

      Text itemWritable = new Text();
      VectorWritable featuresWritable = new VectorWritable();

      for (Map.Entry<String, List<Integer>> itemFeature : itemFeaturesMap.entrySet()) {
        int numfeatures = itemFeature.getValue().size();
        itemWritable.set(itemFeature.getKey());
        Vector featureVector = new SequentialAccessSparseVector(numfeatures);
        int i = 0;
        for (Integer feature : itemFeature.getValue()) {
          featureVector.setQuick(i++, feature);
        }
        featuresWritable.set(featureVector);
        writer.append(itemWritable, featuresWritable);
      }
    } finally {
      Closeables.closeQuietly(writer);
    }
    return true;
  }
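In case it helps diagnose, here is the shape of what that inner loop produces, with a plain double[] standing in for the SequentialAccessSparseVector (the ids are invented): the vector's cardinality is the per-item feature count, and each feature id ends up stored as a value at a sequential position, not used as an index.

```java
import java.util.Arrays;
import java.util.List;

public class VectorEncodingSketch {

    // Same shape as the loop above: cardinality == features.size(),
    // and the value at slot i is the i-th feature id itself.
    static double[] encode(List<Integer> features) {
        double[] vector = new double[features.size()];
        int i = 0;
        for (Integer feature : features) {
            vector[i++] = feature;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Invented ids: a cardinality-3 vector carrying a feature id of 204.
        System.out.println(Arrays.toString(encode(Arrays.asList(17, 204, 3))));
        // prints [17.0, 204.0, 3.0]
    }
}
```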

  public static Map<String, List<Integer>> listFilesForFolder(final File folder) {
    Map<String, Integer> featureIdxMap = Maps.newHashMap();
    Map<String, List<Integer>> itemFeaturesMap = Maps.newHashMap();

    File[] listOfFiles = folder.listFiles();
    for (int i = 0; i < listOfFiles.length; i++) {
        if (listOfFiles[i].isFile()) {
           try {
               String files = listOfFiles[i].getCanonicalPath();
               System.out.println(files);
               convertToItemFeatures(files, itemFeaturesMap, featureIdxMap);
               System.out.print("Size of features == " + featureIdxMap.size() + "\n");
               System.out.print("Size of itemFeatures == " + itemFeaturesMap.size() + "\n");
           }
           catch (Exception e) {
               System.out.print("Ugh " + e.getMessage());
           }
        }
    }

    return itemFeaturesMap;
  }

Any ideas why my vector file cannot be read? Do I still need to run seq2sparse on it?
