Is your input very small? If so, it is probably getting mostly pruned, since most of it looks like low-count data, and then there is almost no information left on which to compute similarity.
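Note also that every preference in your sample is 1. Pearson correlation over constant vectors evaluates to 0/0, so even rows that survive pruning carry no signal. A minimal sketch of that effect in plain Java (not Mahout's actual implementation, just the textbook formula):

public class ConstantPrefsDemo {

  static double pearson(double[] x, double[] y) {
    double meanX = 0.0, meanY = 0.0;
    for (int i = 0; i < x.length; i++) { meanX += x[i]; meanY += y[i]; }
    meanX /= x.length;
    meanY /= y.length;
    double num = 0.0, denX = 0.0, denY = 0.0;
    for (int i = 0; i < x.length; i++) {
      num  += (x[i] - meanX) * (y[i] - meanY);
      denX += (x[i] - meanX) * (x[i] - meanX);
      denY += (y[i] - meanY) * (y[i] - meanY);
    }
    // For all-1 vectors every centered term is 0, so this is 0/0 = NaN.
    return num / Math.sqrt(denX * denY);
  }

  public static void main(String[] args) {
    // Two items, each preferred (value 1) by the same three users,
    // mirroring the sample input in the quoted message below.
    double[] itemA = {1, 1, 1};
    double[] itemB = {1, 1, 1};
    System.out.println(pearson(itemA, itemB)); // prints NaN -- no usable similarity
  }
}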
On Tue, Jun 5, 2012 at 7:13 PM, Something Something <[email protected]> wrote:
> One thing I noticed is that in step 4 of this process
> (RowSimilarityJob-VectorNormMapper-Reducer):
>
> Mapper input: 6,925
> Mapper output: 3
>
> Reducer input: 3
> Reducer output: 0
>
> Most of the values going into the RowSimilarityJob are defaults. Here's
> what I see in the code:
>
> if (shouldRunNextPhase(parsedArgs, currentPhase)) {
>   int numberOfUsers = HadoopUtil.readInt(
>       new Path(prepPath, PreparePreferenceMatrixJob.NUM_USERS), getConf());
>
>   ToolRunner.run(getConf(), new RowSimilarityJob(), new String[] {
>       "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
>       "--output", similarityMatrixPath.toString(),
>       "--numberOfColumns", String.valueOf(numberOfUsers),
>       "--similarityClassname", similarityClassName,
>       "--maxSimilaritiesPerRow", String.valueOf(maxSimilarItemsPerItem),
>       "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
>       "--threshold", String.valueOf(threshold),
>       "--tempDir", getTempPath().toString() });
> }
>
> Any ideas?
>
> On Mon, Jun 4, 2012 at 7:36 PM, Something Something <[email protected]> wrote:
>
>> My job setup is really simple. It looks like this:
>>
>> public int run(String[] args) throws Exception {
>>   String datasetDate = args[0];
>>   String inputPath = args[1];
>>   String configFile = args[2];
>>   String outputLocation = args[3];
>>
>>   Configuration config = getConf();
>>   config.addResource(new Path(configFile));
>>   logger.error("config: " + config.toString());
>>
>>   File inputFile = new File(inputPath);
>>   File outputDir = new File(outputLocation);
>>   outputDir.delete();
>>   File tmpDir = new File("/tmp");
>>
>>   ItemSimilarityJob similarityJob = new ItemSimilarityJob();
>>
>>   Configuration conf = new Configuration();
>>   conf.set("mapred.input.dir", inputFile.getAbsolutePath());
>>   conf.set("mapred.output.dir", outputDir.getAbsolutePath());
>>   conf.setBoolean("mapred.output.compress", false);
>>
>>   similarityJob.setConf(conf);
>>
>>   similarityJob.run(new String[] { "--tempDir", tmpDir.getAbsolutePath(),
>>       "--similarityClassname", PearsonCorrelationSimilarity.class.getName() });
>>
>>   return 0;
>> }
>>
>> The input file is sorted by UserId, ItemId & Preference. Preference is
>> always '1'. A few lines from the file look like this:
>>
>> -1000000334008648908 1 1
>> -1000000334008648908 70 1
>> -1000000334008648908 2090 1
>> -1000000334008648908 12872 1
>> -1000000334008648908 32790 1
>> -1000000334008648908 32799 1
>> -1000000334008648908 32969 1
>> -1000000397028994738 1 1
>> -1000000397028994738 12872 1
>> -1000000397028994738 32790 1
>> -1000000397028994738 32796 1
>> -1000000397028994738 32939 1
>> -100000083781885705 1 1
>> -100000083781885705 12872 1
>> -100000083781885705 32790 1
>> -100000083781885705 32837 1
>> -100000083781885705 33723 1
>> -1000001014586220418 1 1
>> -1000001014586220418 12872 1
>> -1000001014586220418 32790 1
>> & so on...
>>
>> (UserId is created using MemoryIDMigrator.)
>>
>> The job internally runs the following 7 Hadoop jobs, which all run successfully:
>>
>> PreparePreferenceMatrixJob-ItemIDIndexMapper-Reducer
>> PreparePreferenceMatrixJob-ToItemPrefsMapper-Reducer
>> PreparePreferenceMatrixJob-ToItemVectorsMapper-Reducer
>> RowSimilarityJob-VectorNormMapper-Reducer
>> RowSimilarityJob-CooccurrencesMapper-Reducer
>> RowSimilarityJob-UnsymmetrifyMapper-Reducer
>> ItemSimilarityJob-MostSimilarItemPairsMapper-Reducer
>>
>> Problem is that the output file is empty! What am I missing? Please
>> help.
>> Thanks.
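As an aside, the driver in the quoted message above passes its paths through mapred.input.dir and mapred.output.dir on a fresh Configuration instead of through the job's own arguments. A minimal sketch of the same setup using ItemSimilarityJob's standard --input/--output options (assuming a Mahout version whose AbstractJob accepts them and whose --similarityClassname accepts the predefined SIMILARITY_PEARSON_CORRELATION shorthand; the paths are hypothetical placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;

public class ItemSimilarityDriver {
  public static void main(String[] args) throws Exception {
    // Hypothetical paths -- substitute your own locations.
    String input = "/path/to/prefs.csv";
    String output = "/path/to/similarities";
    String tempDir = "/tmp/itemsimilarity";

    // ToolRunner parses generic Hadoop options and hands the
    // Configuration to the job before run() is invoked.
    ToolRunner.run(new Configuration(), new ItemSimilarityJob(), new String[] {
        "--input", input,
        "--output", output,
        "--similarityClassname", "SIMILARITY_PEARSON_CORRELATION",
        "--tempDir", tempDir });
  }
}

Since every preference here is 1, a count-based measure such as SIMILARITY_LOGLIKELIHOOD (together with --booleanData, where the version supports it) is likely a better fit than Pearson.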
