One thing I noticed is that in step 4 of this process
(RowSimilarityJob-VectorNormMapper-Reducer) the counters look like this:
Mapper input: 6,925
Mapper output: 3
Reducer input: 3
Reducer output: 0
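The reducer emitting nothing makes me wonder about the preference values:
every preference in the input below is 1, and Pearson correlation is
undefined for a vector with zero variance, so rows of identical values may
get discarded entirely. A minimal sketch of the textbook formula (for
illustration only, not Mahout's implementation) shows the effect:

// Textbook Pearson correlation, illustration only (not Mahout's code).
// For two all-ones vectors every centered term is zero, so the result
// is 0.0 / 0.0 == Double.NaN, which a similarity job has to discard.
static double pearson(double[] x, double[] y) {
  int n = x.length;
  double meanX = 0.0, meanY = 0.0;
  for (int i = 0; i < n; i++) {
    meanX += x[i] / n;
    meanY += y[i] / n;
  }
  double num = 0.0, denX = 0.0, denY = 0.0;
  for (int i = 0; i < n; i++) {
    num  += (x[i] - meanX) * (y[i] - meanY);
    denX += (x[i] - meanX) * (x[i] - meanX);
    denY += (y[i] - meanY) * (y[i] - meanY);
  }
  return num / Math.sqrt(denX * denY);
}

// pearson(new double[] {1, 1, 1}, new double[] {1, 1, 1}) returns NaN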
Most of the values going into the RowSimilarityJob are defaults. Here's
what I see in the code:
if (shouldRunNextPhase(parsedArgs, currentPhase)) {
  int numberOfUsers = HadoopUtil.readInt(
      new Path(prepPath, PreparePreferenceMatrixJob.NUM_USERS), getConf());

  ToolRunner.run(getConf(), new RowSimilarityJob(), new String[] {
      "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
      "--output", similarityMatrixPath.toString(),
      "--numberOfColumns", String.valueOf(numberOfUsers),
      "--similarityClassname", similarityClassName,
      "--maxSimilaritiesPerRow", String.valueOf(maxSimilarItemsPerItem),
      "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
      "--threshold", String.valueOf(threshold),
      "--tempDir", getTempPath().toString() });
}
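
To isolate step 4, it might also be worth invoking RowSimilarityJob
directly against the rating matrix that step 3 produced, with the same
flags the snippet above passes. A rough sketch (the paths and numbers are
placeholders, not my actual values; I'm leaving --threshold unset so the
default applies):

// Hypothetical standalone driver for step 4 only; the flags match the
// snippet above, the paths and counts are placeholders.
ToolRunner.run(new Configuration(), new RowSimilarityJob(), new String[] {
    "--input", "/tmp/prefs/ratingMatrix",        // placeholder path
    "--output", "/tmp/prefs/similarityMatrix",   // placeholder path
    "--numberOfColumns", "12345",                // placeholder user count
    "--similarityClassname", PearsonCorrelationSimilarity.class.getName(),
    "--maxSimilaritiesPerRow", "100",
    "--excludeSelfSimilarity", "true",
    "--tempDir", "/tmp/prefs/tmp" });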
Any ideas?
On Mon, Jun 4, 2012 at 7:36 PM, Something Something <[email protected]> wrote:
> My job setup is really simple. It looks like this:
>
>   public int run(String[] args) throws Exception {
>     String datasetDate = args[0];
>     String inputPath = args[1];
>     String configFile = args[2];
>     String outputLocation = args[3];
>
>     Configuration config = getConf();
>     config.addResource(new Path(configFile));
>     logger.error("config: " + config.toString());
>
>     File inputFile = new File(inputPath);
>     File outputDir = new File(outputLocation);
>     outputDir.delete(); // note: File.delete() only removes an empty directory
>     File tmpDir = new File("/tmp");
>
>     ItemSimilarityJob similarityJob = new ItemSimilarityJob();
>
>     Configuration conf = new Configuration();
>     conf.set("mapred.input.dir", inputFile.getAbsolutePath());
>     conf.set("mapred.output.dir", outputDir.getAbsolutePath());
>     conf.setBoolean("mapred.output.compress", false);
>
>     similarityJob.setConf(conf);
>
>     similarityJob.run(new String[] {
>         "--tempDir", tmpDir.getAbsolutePath(),
>         "--similarityClassname", PearsonCorrelationSimilarity.class.getName() });
>
>     return 0;
>   }
>
>
> The input file is sorted by UserId, ItemId & Preference. Preference is
> always '1'. A few lines from the file look like this:
>
> -1000000334008648908 1 1
> -1000000334008648908 70 1
> -1000000334008648908 2090 1
> -1000000334008648908 12872 1
> -1000000334008648908 32790 1
> -1000000334008648908 32799 1
> -1000000334008648908 32969 1
> -1000000397028994738 1 1
> -1000000397028994738 12872 1
> -1000000397028994738 32790 1
> -1000000397028994738 32796 1
> -1000000397028994738 32939 1
> -100000083781885705 1 1
> -100000083781885705 12872 1
> -100000083781885705 32790 1
> -100000083781885705 32837 1
> -100000083781885705 33723 1
> -1000001014586220418 1 1
> -1000001014586220418 12872 1
> -1000001014586220418 32790 1
> & so on...
>
> (UserId is created using MemoryIDMigrator)
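>
> Roughly like this, in case it matters (a sketch from memory of the taste
> IDMigrator API, not my exact code):
>
> MemoryIDMigrator migrator = new MemoryIDMigrator();
> long userId = migrator.toLongID("some-user-key"); // 64-bit hash of the string
> migrator.storeMapping(userId, "some-user-key");   // keep the reverse mapping in memory
> String original = migrator.toStringID(userId);    // -> "some-user-key"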
>
>
> The job internally runs the following 7 Hadoop jobs, all of which complete successfully:
>
> PreparePreferenceMatrixJob-ItemIDIndexMapper-Reducer
> PreparePreferenceMatrixJob-ToItemPrefsMapper-Reducer
> PreparePreferenceMatrixJob-ToItemVectorsMapper-Reducer
> RowSimilarityJob-VectorNormMapper-Reducer
> RowSimilarityJob-CooccurrencesMapper-Reducer
> RowSimilarityJob-UnsymmetrifyMapper-Reducer
> ItemSimilarityJob-MostSimilarItemPairsMapper-Reducer
>
>
> The problem is that the output file is empty! What am I missing? Please
> help. Thanks.