Hard to tell without seeing the full stack trace that raised the exception, but
consider: in order to create the ModelDistribution, a Vector prototype
is created with the size of the first data record read. That prototype then
determines the size of the corresponding Models created by the
distribution. If any of your input vectors is larger than this
prototype size, it could cause the index exception you are seeing.
I suggest you create your sparse vectors with Integer.MAX_VALUE cardinality to
work around this. They won't take up any more space, and the algorithm will be
more forgiving.
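A minimal sketch of that workaround, spliced into your writeToSequenceFile loop (this reuses your itemFeature variable and Mahout's SequentialAccessSparseVector; only the cardinality argument changes):

```java
// Sketch: give every sparse vector the maximum cardinality instead of sizing
// it to the current record. A sparse vector only stores the entries actually
// set, so the huge nominal size costs no extra memory, and the prototype
// derived from the first record can then accommodate every later record.
Vector featureVector = new SequentialAccessSparseVector(Integer.MAX_VALUE);
int i = 0;
for (Integer feature : itemFeature.getValue()) {
  featureVector.setQuick(i++, feature);
}
```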
On 10/3/12 1:04 PM, David Swift wrote:
I am attempting to use the dirichlet clusterer, and I am getting an error like:
org.apache.mahout.math.IndexException: Index 204 is outside allowable range of
[0,99)
at org.apache.mahout.math.AbstractVector.get(AbstractVector.java:172)
I could use an extra pair of eyes on my process; I'd be very glad for any
pointers in the right direction.
I am executing the clusterer with:
bin/mahout dirichlet --input 2_vect.out --output 2_cluster --maxIter 100 --numClusters 5
I have prepared a vector file from a directory of files using the following
code, modeled on the LastfmDataConverter and a SequenceFileWriteDemo (I've
snipped the obvious stuff like the imports and the 'main' method):
public static Map<String, List<Integer>> convertToItemFeatures(String inputFile,
    Map<String, List<Integer>> itemFeatures, Map<String, Integer> featureIdxM)
    throws IOException {
  BufferedReader br = Files.newReader(new File(inputFile), Charsets.UTF_8);
  try {
    String line;
    System.out.print("Reading " + inputFile + "\n");
    while ((line = br.readLine()) != null) {
      // get the featureIdx
      Integer featureIdx = featureIdxM.get(line);
      if (featureIdx == null) {
        featureIdx = featureIdxM.size() + 1;
        featureIdxM.put(line, featureIdx);
      }
      // add it to the corresponding feature idx map
      List<Integer> features = itemFeatures.get(inputFile);
      if (features == null) {
        features = Lists.newArrayList();
        itemFeatures.put(inputFile, features);
      }
      features.add(featureIdx);
    }
  } finally {
    Closeables.closeQuietly(br);
  }
  return itemFeatures;
}
/**
 * Converts each record in the (item, features) map into Mahout vector format
 * and writes it into a sequence file for minhash clustering.
 */
public static boolean writeToSequenceFile(Map<String, List<Integer>> itemFeaturesMap,
    Path outputPath) throws IOException {
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  fs.mkdirs(outputPath.getParent());
  long totalRecords = itemFeaturesMap.size();
  SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, outputPath,
      Text.class, VectorWritable.class);
  try {
    String msg = "Now writing vectorized data in sequence file format: ";
    System.out.print(msg);
    Text itemWritable = new Text();
    VectorWritable featuresWritable = new VectorWritable();
    for (Map.Entry<String, List<Integer>> itemFeature : itemFeaturesMap.entrySet()) {
      int numfeatures = itemFeature.getValue().size();
      itemWritable.set(itemFeature.getKey());
      Vector featureVector = new SequentialAccessSparseVector(numfeatures);
      int i = 0;
      for (Integer feature : itemFeature.getValue()) {
        featureVector.setQuick(i++, feature);
      }
      featuresWritable.set(featureVector);
      writer.append(itemWritable, featuresWritable);
    }
  } finally {
    Closeables.closeQuietly(writer);
  }
  return true;
}
public static Map<String, List<Integer>> listFilesForFolder(final File folder) {
  Map<String, Integer> featureIdxMap = Maps.newHashMap();
  Map<String, List<Integer>> itemFeaturesMap = Maps.newHashMap();
  File[] listOfFiles = folder.listFiles();
  for (int i = 0; i < listOfFiles.length; i++) {
    if (listOfFiles[i].isFile()) {
      try {
        String files = listOfFiles[i].getCanonicalPath();
        System.out.println(files);
        convertToItemFeatures(files, itemFeaturesMap, featureIdxMap);
        System.out.print("Size of features == " + featureIdxMap.size() + "\n");
        System.out.print("Size of itemFeatures == " + itemFeaturesMap.size() + "\n");
      } catch (Exception e) {
        System.out.print("Ugh " + e.getMessage());
      }
    }
  }
  return itemFeaturesMap;
}
Any ideas why my vector file cannot be read? Do I still need to run seq2sparse
on it?