Re: ItemSimilarityJob creates no output

Something Something Wed, 06 Jun 2012 08:58:04 -0700

The input size was about 6 million, so I was expecting to find some
similarities.  Anyway, I have started a test with the real dataset, which
contains 700 million lines.  We shall see how that goes.  One quick
question, though:


I am using MemoryIDMigrator to convert UserIds from String to Long as
follows:

    static UpdatableIDMigrator migrator = new MemoryIDMigrator();
<some code omitted here...>
    migrator.toLongID(strUserID);

Question:  If I pass the same userId multiple times to this method, I am
guaranteed to get the same 'Long' number back, correct?
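
To make the question concrete, here is a minimal sketch of the behaviour I
am expecting.  It assumes the migrator derives the long purely from a hash of
the string (the first 8 bytes of an MD5 digest); that is my assumption about
how it works, not something I have confirmed in the Mahout source.  The class
StableIdSketch and its toLongID method are hypothetical stand-ins and do not
use Mahout at all:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for a hash-based String -> long ID mapping.
public class StableIdSketch {

    // Assumption: the long is built from the first 8 bytes of the MD5 digest.
    static long toLongID(String id) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(id.getBytes(StandardCharsets.UTF_8));
            long hash = 0L;
            for (int i = 0; i < 8; i++) {
                // Shift in one digest byte at a time, treating it as unsigned.
                hash = (hash << 8) | (md5[i] & 0xFFL);
            }
            return hash;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long first = toLongID("user-42");
        long second = toLongID("user-42");
        // The mapping depends only on the input string, so this prints true.
        System.out.println(first == second);
    }
}
```

If the real migrator works like this, the same userId would map to the same
long on every call, and across separate runs as well, since no state is kept
between calls.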


On Tue, Jun 5, 2012 at 10:59 PM, Sean Owen <[email protected]> wrote:

> Is your input very small? It is probably getting mostly pruned as a
> result, as most of it looks like low-count data. And then there is
> almost no info on which to compute similarity.
>
> On Tue, Jun 5, 2012 at 7:13 PM, Something Something
> <[email protected]> wrote:
> > One thing I noticed is that in step 4 of this process
> > (RowSimilarityJob-VectorNormMapper-Reducer)
> >
> > Mapper input:  6,925
> > Mapper output: 3
> >
> > Reducer input: 3
> > Reducer output: 0
> >
> > Most of the values going into the RowSimilarityJob are defaults.  Here's
> > what I see in the code:
> >
> >    if (shouldRunNextPhase(parsedArgs, currentPhase)) {
> >      int numberOfUsers = HadoopUtil.readInt(
> >          new Path(prepPath, PreparePreferenceMatrixJob.NUM_USERS), getConf());
> >
> >      ToolRunner.run(getConf(), new RowSimilarityJob(), new String[] {
> >          "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
> >          "--output", similarityMatrixPath.toString(),
> >          "--numberOfColumns", String.valueOf(numberOfUsers),
> >          "--similarityClassname", similarityClassName,
> >          "--maxSimilaritiesPerRow", String.valueOf(maxSimilarItemsPerItem),
> >          "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
> >          "--threshold", String.valueOf(threshold),
> >          "--tempDir", getTempPath().toString() });
> >    }
> >
> >
> > Any ideas?
> >
> >
> > On Mon, Jun 4, 2012 at 7:36 PM, Something Something <[email protected]> wrote:
> >
> >> My job setup is really simple.  It looks like this:
> >>
> >>     public int run(String[] args) throws Exception {
> >>         String datasetDate = args[0];
> >>         String inputPath = args[1];
> >>         String configFile = args[2];
> >>         String outputLocation = args[3];
> >>
> >>         Configuration config = getConf();
> >>         config.addResource(new Path(configFile));
> >>         logger.error("config: " + config.toString());
> >>
> >>         File inputFile = new File(inputPath);
> >>         File outputDir = new File(outputLocation);
> >>         outputDir.delete();
> >>         File tmpDir = new File("/tmp");
> >>
> >>         ItemSimilarityJob similarityJob = new ItemSimilarityJob();
> >>
> >>         Configuration conf = new Configuration();
> >>         conf.set("mapred.input.dir", inputFile.getAbsolutePath());
> >>         conf.set("mapred.output.dir", outputDir.getAbsolutePath());
> >>         conf.setBoolean("mapred.output.compress", false);
> >>
> >>         similarityJob.setConf(conf);
> >>
> >>         similarityJob.run(new String[]{
> >>                 "--tempDir", tmpDir.getAbsolutePath(),
> >>                 "--similarityClassname", PearsonCorrelationSimilarity.class.getName()});
> >>
> >>         return 0;
> >>     }
> >>
> >>
> >> The input file is sorted by UserId, ItemId & Preference.  Preference is
> >> always '1'.  A few lines from the file look like this:
> >>
> >> -1000000334008648908    1    1
> >> -1000000334008648908    70    1
> >> -1000000334008648908    2090    1
> >> -1000000334008648908    12872    1
> >> -1000000334008648908    32790    1
> >> -1000000334008648908    32799    1
> >> -1000000334008648908    32969    1
> >> -1000000397028994738    1    1
> >> -1000000397028994738    12872    1
> >> -1000000397028994738    32790    1
> >> -1000000397028994738    32796    1
> >> -1000000397028994738    32939    1
> >> -100000083781885705    1    1
> >> -100000083781885705    12872    1
> >> -100000083781885705    32790    1
> >> -100000083781885705    32837    1
> >> -100000083781885705    33723    1
> >> -1000001014586220418    1    1
> >> -1000001014586220418    12872    1
> >> -1000001014586220418    32790    1
> >> & so on...
> >>
> >> (UserId is created using MemoryIDMigrator)
> >>
> >>
> >> The job internally runs the following 7 Hadoop jobs, which all run successfully:
> >>
> >> PreparePreferenceMatrixJob-ItemIDIndexMapper-Reducer
> >> PreparePreferenceMatrixJob-ToItemPrefsMapper-Reducer
> >> PreparePreferenceMatrixJob-ToItemVectorsMapper-Reducer
> >> RowSimilarityJob-VectorNormMapper-Reducer
> >> RowSimilarityJob-CooccurrencesMapper-Reducer
> >> RowSimilarityJob-UnsymmetrifyMapper-Reducer
> >> ItemSimilarityJob-MostSimilarItemPairsMapper-Reducer
> >>
> >>
> >> Problem is that the output file is empty!  What am I missing?  Please
> >> help.  Thanks.
> >>
> >>
>
