The input size was about 6 million, so I was expecting to find some
similarities. Anyway, I have started a test with the real dataset that
contains 700 million lines. We shall see how that goes. One quick
question, though:
I am using MemoryIDMigrator to convert UserIds from String to Long as
follows:
static UpdatableIDMigrator migrator = new MemoryIDMigrator();
<some code omitted here...>
migrator.toLongID(strUserID);
Question: If I pass the same userId multiple times to this method, I am
guaranteed to get the same 'Long' number back, correct?
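
To make the question concrete, here is a minimal sketch of the behaviour I am assuming. It does not call Mahout at all. The class IdHashSketch and the method toLongId are hypothetical names of my own, and the MD5-then-first-8-bytes scheme is only my understanding of how the migrator derives the long, which I have not verified against the source. If that is how it works, the result depends only on the input string, so repeated calls with the same userId would return the same value:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in, NOT Mahout's class: it only illustrates a
// hash-based String -> long mapping that is deterministic per input.
final class IdHashSketch {

    // Assumption: hash the string with MD5 and pack the first 8 bytes
    // of the digest into a long (big-endian).
    static long toLongId(String id) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(id.getBytes(StandardCharsets.UTF_8));
            long hash = 0L;
            for (int i = 0; i < 8; i++) {
                hash = (hash << 8) | (digest[i] & 0xFFL);
            }
            return hash;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        long first = toLongId("user-42");
        long second = toLongId("user-42");
        // Same input string, same long.
        System.out.println(first == second); // prints true
    }
}
```

If the real implementation works this way, the String-to-long direction needs no stored state, while mapping a long back to its String would presumably only work for IDs that the same migrator instance has already seen.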
On Tue, Jun 5, 2012 at 10:59 PM, Sean Owen <[email protected]> wrote:
> Is your input very small? It is probably getting mostly pruned as a
> result, as most of it looks like low-count data. And then there is
> almost no info on which to compute similarity.
>
> On Tue, Jun 5, 2012 at 7:13 PM, Something Something
> <[email protected]> wrote:
> > One thing I noticed is that in step 4 of this process
> > (RowSimilarityJob-VectorNormMapper-Reducer)
> >
> > Mapper input: 6,925
> > Mapper output: 3
> >
> > Reducer input: 3
> > Reducer output: 0
> >
> > Most of the values going into the RowSimilarityJob are defaults. Here's
> > what I see in the code:
> >
> > if (shouldRunNextPhase(parsedArgs, currentPhase)) {
> >   int numberOfUsers = HadoopUtil.readInt(
> >       new Path(prepPath, PreparePreferenceMatrixJob.NUM_USERS), getConf());
> >
> >   ToolRunner.run(getConf(), new RowSimilarityJob(), new String[] {
> >       "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
> >       "--output", similarityMatrixPath.toString(),
> >       "--numberOfColumns", String.valueOf(numberOfUsers),
> >       "--similarityClassname", similarityClassName,
> >       "--maxSimilaritiesPerRow", String.valueOf(maxSimilarItemsPerItem),
> >       "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
> >       "--threshold", String.valueOf(threshold),
> >       "--tempDir", getTempPath().toString() });
> > }
> >
> >
> > Any ideas?
> >
> >
> > On Mon, Jun 4, 2012 at 7:36 PM, Something Something <
> > [email protected]> wrote:
> >
> >> My job setup is really simple. It looks like this:
> >>
> >> public int run(String[] args) throws Exception {
> >>   String datasetDate = args[0];
> >>   String inputPath = args[1];
> >>   String configFile = args[2];
> >>   String outputLocation = args[3];
> >>
> >>   Configuration config = getConf();
> >>   config.addResource(new Path(configFile));
> >>   logger.error("config: " + config.toString());
> >>
> >>   File inputFile = new File(inputPath);
> >>   File outputDir = new File(outputLocation);
> >>   outputDir.delete();
> >>   File tmpDir = new File("/tmp");
> >>
> >>   ItemSimilarityJob similarityJob = new ItemSimilarityJob();
> >>
> >>   Configuration conf = new Configuration();
> >>   conf.set("mapred.input.dir", inputFile.getAbsolutePath());
> >>   conf.set("mapred.output.dir", outputDir.getAbsolutePath());
> >>   conf.setBoolean("mapred.output.compress", false);
> >>
> >>   similarityJob.setConf(conf);
> >>
> >>   similarityJob.run(new String[] {
> >>       "--tempDir", tmpDir.getAbsolutePath(),
> >>       "--similarityClassname", PearsonCorrelationSimilarity.class.getName() });
> >>
> >>   return 0;
> >> }
> >>
> >>
> >> The input file is sorted by UserId, ItemId & Preference. Preference is
> >> always '1'. A few lines from the file look like this:
> >>
> >> -1000000334008648908 1 1
> >> -1000000334008648908 70 1
> >> -1000000334008648908 2090 1
> >> -1000000334008648908 12872 1
> >> -1000000334008648908 32790 1
> >> -1000000334008648908 32799 1
> >> -1000000334008648908 32969 1
> >> -1000000397028994738 1 1
> >> -1000000397028994738 12872 1
> >> -1000000397028994738 32790 1
> >> -1000000397028994738 32796 1
> >> -1000000397028994738 32939 1
> >> -100000083781885705 1 1
> >> -100000083781885705 12872 1
> >> -100000083781885705 32790 1
> >> -100000083781885705 32837 1
> >> -100000083781885705 33723 1
> >> -1000001014586220418 1 1
> >> -1000001014586220418 12872 1
> >> -1000001014586220418 32790 1
> >> & so on...
> >>
> >> (UserId is created using MemoryIDMigrator)
> >>
> >>
> >> The job internally runs the following 7 Hadoop jobs, which all run
> >> successfully:
> >>
> >> PreparePreferenceMatrixJob-ItemIDIndexMapper-Reducer
> >> PreparePreferenceMatrixJob-ToItemPrefsMapper-Reducer
> >> PreparePreferenceMatrixJob-ToItemVectorsMapper-Reducer
> >> RowSimilarityJob-VectorNormMapper-Reducer
> >> RowSimilarityJob-CooccurrencesMapper-Reducer
> >> RowSimilarityJob-UnsymmetrifyMapper-Reducer
> >> ItemSimilarityJob-MostSimilarItemPairsMapper-Reducer
> >>
> >>
> >> The problem is that the output file is empty! What am I missing? Please
> >> help. Thanks.
> >>
> >>
>