Re: Question about Item Based Collaborative Filtering

Something Something Sun, 24 Jun 2012 21:50:12 -0700

I changed it to LoglikelihoodSimilarity, but in step 4
(RowSimilarityJob-VectorNormMapper-Reducer) I get the following error
related to Java heap space:


12/06/24 23:40:52 INFO mapred.JobClient: Task Id : attempt_201202041116_64039_m_000005_0, Status : FAILED
Error: Java heap space
attempt_201202041116_64039_m_000005_0: Exception in thread "Timer thread for monitoring jvm" java.lang.IllegalArgumentException: unresolved address
attempt_201202041116_64039_m_000005_0:  at java.net.DatagramPacket.setSocketAddress(DatagramPacket.java:295)
attempt_201202041116_64039_m_000005_0:  at java.net.DatagramPacket.<init>(DatagramPacket.java:123)
attempt_201202041116_64039_m_000005_0:  at java.net.DatagramPacket.<init>(DatagramPacket.java:158)
attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.ganglia.GangliaContext31.emitMetric(GangliaContext31.java:118)
attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(GangliaContext.java:127)
attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(AbstractMetricsContext.java:313)
attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(AbstractMetricsContext.java:299)
attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(AbstractMetricsContext.java:53)
attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.spi.AbstractMetricsContext$1.run(AbstractMetricsContext.java:258)
attempt_201202041116_64039_m_000005_0:  at java.util.TimerThread.mainLoop(Timer.java:512)
attempt_201202041116_64039_m_000005_0:  at java.util.TimerThread.run(Timer.java:462)



I have set the following properties:

    <property>
        <name>mapred.task.timeout</name>
        <value>1800000</value> <!-- 30 minutes -->
    </property>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx4g</value>
    </property>
    <property>
        <name>mapred.map.child.java.opts</name>
        <value>-Xmx4g</value>
    </property>
    <property>
        <name>mapred.reduce.child.java.opts</name>
        <value>-Xmx4g</value>
    </property>
    <property>
        <name>mapred.reduce.tasks</name>
        <value>50</value>
    </property>
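
One way to check whether a heap setting such as -Xmx4g actually reaches a JVM is to print the maximum heap that the JVM itself reports. The following is only a minimal sketch under assumptions: the class name HeapCheck and the method maxHeapMiB are made up for illustration and are not part of Hadoop or Mahout. It uses only the standard Runtime.maxMemory() call.

```java
// Hypothetical helper (not part of Hadoop or Mahout): reports the maximum
// heap size the current JVM is willing to use, in MiB.
final class HeapCheck {
    private HeapCheck() {}

    // Runtime.maxMemory() returns the upper bound of the heap in bytes
    // (Long.MAX_VALUE if the JVM reports no limit).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB): " + maxHeapMiB());
    }
}
```

Running it as `java -Xmx4g HeapCheck` should print a value close to 4096. Logging the same call from inside a map task would show whether the task JVM was started with the intended options.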


On Sun, Jun 24, 2012 at 1:19 AM, Sean Owen <[email protected]> wrote:

> Try LoglikelihoodSimilarity.
>
> Where do you run into memory issues? Did you change worker heap
> settings from the default?
>
> On Sat, Jun 23, 2012 at 10:24 PM, Something Something
> <[email protected]> wrote:
> > Thank you so much Sean.  It was great to get confirmation from you
> > regarding the choice of algorithm.
> >
> > As suggested, I used the following params:
> >
> >     similarityJob.run(new String[] {
> >         "--tempDir", tmpDir.getAbsolutePath(),
> >         "--similarityClassname", CooccurrenceCountSimilarity.class.getName(),
> >         "--booleanData", String.valueOf(Boolean.TRUE)});
> >
> > and got output!!!!   Hooray.
> >
> > Question:  Is CooccurrenceCountSimilarity best in this case?
> >
> >
> > Anyway, now I am going to try on our production cluster with billions of
> > lines.  Last time I tried, I ran into OutOfMemoryErrors.  Any
> > suggestions regarding memory settings?
> >
> > Thanks once again for your help.
> >
> >
> > On Fri, Jun 22, 2012 at 11:08 PM, Sean Owen <[email protected]> wrote:
> >
> >> Using 1 is just fine for the reasons you give. You would be surprised how
> >> OK it is to use this even for dislikes. In fact just omit the third field
> >> in your CSV.
> >>
> >> However you need to set the boolean data flag and choose a similarity
> >> metric that is defined over such data. Pearson / cosine is not for example
> >> since every value is 1. This is why there is no output.
> >> On Jun 23, 2012 1:33 AM, "Something Something" <[email protected]> wrote:
> >>
> >> > I tested my setup of ItemSimilarityJob using the MovieLens dataset & got
> >> > the expected results.  It looks like my setup is good.
> >> >
> >> > Here's what I have:
> >> >
> >> > I have data coming in the following format: UserId, GroupId, Frequency
> >> > (how many times the user chose the group), Max timestamp (the last time
> >> > the user chose the group).
> >> >
> >> > Based on this dataset we need to figure out which groups look alike. I
> >> > decided to use "item based collaborative filtering" but I have 3 concerns:
> >> >
> >> > 1)  We don't have any knowledge of "Dislikes"; we only know which groups
> >> > users "Like".
> >> > 2)  We don't really have ratings. In other words, users don't rate the
> >> > group. Either they choose OR they don't.
> >> > 3)  Frequency doesn't really imply interest level.
> >> >
> >> >
> >> > I decided to try 'ItemSimilarityJob' by using a CSV file in the following
> >> > format:
> >> >
> >> > UserId, GroupId, "1"
> >> >
> >> > In other words, the rating value is always 1.  There are NO rows with value
> >> > "0".  This is producing NO OUTPUT, but the job finishes successfully.
> >> >
> >> > Is this the right way to solve the problem?  Is there some other Algorithm
> >> > that I should be using?  Thanks for the help.
> >> >
> >>
>
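
For readers wondering what a log-likelihood based measure does with boolean (like-only) data, here is a rough sketch of the standard log-likelihood ratio statistic computed from co-occurrence counts. It is not Mahout's source code. The class name LlrSketch, its method names, and the mapping of the score into [0, 1) via 1 - 1/(1 + LLR) are assumptions made for illustration. For two groups A and B, k11 is the number of users who chose both, k12 those who chose A but not B, k21 those who chose B but not A, and k22 those who chose neither.

```java
// Illustrative sketch only; names are hypothetical and this is not Mahout code.
final class LlrSketch {
    private LlrSketch() {}

    // x * ln(x), using the convention 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts:
    // N * ln(N) - sum(c * ln(c)), where N is the sum of the counts.
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // Log-likelihood ratio for a 2x2 contingency table of co-occurrence counts.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Assumed squashing of the unbounded score into [0, 1).
    static double similarity(long k11, long k12, long k21, long k22) {
        return 1.0 - 1.0 / (1.0 + logLikelihoodRatio(k11, k12, k21, k22));
    }

    public static void main(String[] args) {
        // Groups chosen by exactly the same users: strong association.
        System.out.println(similarity(10, 0, 0, 10));
        // Choices that look statistically independent: score near zero.
        System.out.println(similarity(10, 10, 10, 10));
    }
}
```

Unlike Pearson correlation, this score depends only on how often two groups are chosen together versus apart, so it stays well defined when every recorded value is 1.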
