Please ignore the latest email.  When I increased the memory size to 8g,
all steps worked.  Now validating output.  Thanks a lot for all your help.
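
For reference, a sketch of what that change might look like in the job configuration. The message does not say which property was raised, so this assumes the 8g went into the same per-task heap properties quoted below:

```xml
<!-- Assumption: task JVM heap raised from -Xmx4g to -Xmx8g -->
<property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx8g</value>
</property>
<property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx8g</value>
</property>
```

Each task JVM can then grow to 8 GB, so the number of task slots per node times 8 GB has to fit in that node's physical memory.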

On Sun, Jun 24, 2012 at 9:49 PM, Something Something <[email protected]> wrote:

> Changed it to LoglikelihoodSimilarity, but in step 4
> (RowSimilarityJob-VectorNormMapper-Reducer) I get the following error
> related to Java heap space:
>
> 12/06/24 23:40:52 INFO mapred.JobClient: Task Id :
> attempt_201202041116_64039_m_000005_0, Status : FAILED
> Error: Java heap space
> attempt_201202041116_64039_m_000005_0: Exception in thread "Timer thread
> for monitoring jvm" java.lang.IllegalArgumentException: unresolved address
> attempt_201202041116_64039_m_000005_0:  at
> java.net.DatagramPacket.setSocketAddress(DatagramPacket.java:295)
> attempt_201202041116_64039_m_000005_0:  at
> java.net.DatagramPacket.<init>(DatagramPacket.java:123)
> attempt_201202041116_64039_m_000005_0:  at
> java.net.DatagramPacket.<init>(DatagramPacket.java:158)
> attempt_201202041116_64039_m_000005_0:  at
> org.apache.hadoop.metrics.ganglia.GangliaContext31.emitMetric(GangliaContext31.java:118)
> attempt_201202041116_64039_m_000005_0:  at
> org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(GangliaContext.java:127)
> attempt_201202041116_64039_m_000005_0:  at
> org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(AbstractMetricsContext.java:313)
> attempt_201202041116_64039_m_000005_0:  at
> org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(AbstractMetricsContext.java:299)
> attempt_201202041116_64039_m_000005_0:  at
> org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(AbstractMetricsContext.java:53)
> attempt_201202041116_64039_m_000005_0:  at
> org.apache.hadoop.metrics.spi.AbstractMetricsContext$1.run(AbstractMetricsContext.java:258)
> attempt_201202041116_64039_m_000005_0:  at
> java.util.TimerThread.mainLoop(Timer.java:512)
> attempt_201202041116_64039_m_000005_0:  at
> java.util.TimerThread.run(Timer.java:462)
>
>
>
> I have set the following properties:
>
>     <property>
>         <name>mapred.task.timeout</name>
>         <value>1800000</value> <!-- 30 minutes -->
>     </property>
>     <property>
>         <name>mapred.child.java.opts</name>
>         <value>-Xmx4g</value>
>     </property>
>     <property>
>         <name>mapred.map.child.java.opts</name>
>         <value>-Xmx4g</value>
>     </property>
>     <property>
>         <name>mapred.reduce.child.java.opts</name>
>         <value>-Xmx4g</value>
>     </property>
>     <property>
>         <name>mapred.reduce.tasks</name>
>         <value>50</value>
>     </property>
>
>
>
> On Sun, Jun 24, 2012 at 1:19 AM, Sean Owen <[email protected]> wrote:
>
>> Try LoglikelihoodSimilarity.
>>
>> Where do you run into memory issues? Did you change worker heap
>> settings from the default?
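
A minimal sketch of the log-likelihood ratio idea behind that suggestion, for boolean data. It is not Mahout's code: the class and method names are hypothetical, and the final squashing of the score into [0, 1) is an assumption about how such a score is typically normalised. Here k11 counts users who chose both groups, k12 and k21 count users who chose only one of them, and k22 counts users who chose neither.

```java
final class LlrSketch {

    // x * ln(x), with the usual convention that 0 * ln(0) = 0.
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalised Shannon entropy of a set of counts.
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // Log-likelihood ratio (G^2) of a 2x2 contingency table.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy));
    }

    // Assumption: squash the unbounded ratio into [0, 1).
    static double similarity(long k11, long k12, long k21, long k22) {
        return 1.0 - 1.0 / (1.0 + logLikelihoodRatio(k11, k12, k21, k22));
    }

    public static void main(String[] args) {
        // Groups chosen independently of each other: ratio close to 0.
        System.out.println(logLikelihoodRatio(10, 10, 10, 10));
        // Groups always chosen together: large ratio.
        System.out.println(logLikelihoodRatio(10, 0, 0, 10));
    }
}
```

Only counts are needed, which is why this kind of measure stays meaningful when every preference value is 1.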
>>
>> On Sat, Jun 23, 2012 at 10:24 PM, Something Something
>> <[email protected]> wrote:
>> > Thank you so much Sean.  It was great to get confirmation from you
>> > regarding the choice of algorithm.
>> >
>> > As suggested, I used the following params:
>> >
>> >     similarityJob.run(new String[] {
>> >         "--tempDir", tmpDir.getAbsolutePath(),
>> >         "--similarityClassname", CooccurrenceCountSimilarity.class.getName(),
>> >         "--booleanData", String.valueOf(Boolean.TRUE)
>> >     });
>> >
>> > and got output!!!!   Hooray.
>> >
>> > Question:  Is CooccurrenceCountSimilarity best in this case?
>> >
>> >
>> > Anyway, now I am going to try on our production cluster with billions of
>> > lines.  Last time I tried, I ran into OutOfMemoryErrors.  Any
>> > suggestions regarding memory settings?
>> >
>> > Thanks once again for your help.
>> >
>> >
>> > On Fri, Jun 22, 2012 at 11:08 PM, Sean Owen <[email protected]> wrote:
>> >
>> >> Using 1 is just fine for the reasons you give. You would be surprised how
>> >> OK it is to use this even for dislikes. In fact, just omit the third field
>> >> in your CSV.
>> >>
>> >> However, you need to set the boolean data flag and choose a similarity
>> >> metric that is defined over such data. Pearson / cosine, for example, is
>> >> not, since every value is 1. This is why there is no output.
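
To illustrate the point about Pearson: when every rating is 1, each vector has zero variance, so the correlation works out to 0/0. A minimal, self-contained sketch follows; the class and method names are hypothetical and this is not Mahout's implementation. The cosine here is computed over the entries both items share, which is an assumption about how the measure is applied.

```java
final class ConstantRatingsDemo {

    // Pearson correlation of two equal-length vectors.
    // Returns NaN when either vector has zero variance.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // With constant vectors this is 0.0 / 0.0, which is NaN in Java.
        return cov / Math.sqrt(varX * varY);
    }

    // Cosine of the angle between two equal-length vectors.
    static double cosine(double[] x, double[] y) {
        double dot = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / Math.sqrt(normX * normY);
    }

    public static void main(String[] args) {
        // Two groups, each "rated" 1 by the same four users.
        double[] groupA = {1, 1, 1, 1};
        double[] groupB = {1, 1, 1, 1};
        System.out.println("pearson = " + pearson(groupA, groupB));
        System.out.println("cosine  = " + cosine(groupA, groupB));
    }
}
```

Pearson is undefined for such pairs, and a cosine restricted to shared entries is the same for every pair, so neither can rank one group above another.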
>> >> On Jun 23, 2012 1:33 AM, "Something Something" <[email protected]>
>> >> wrote:
>> >>
>> >> > I tested my setup of ItemSimilarityJob using the MovieLens dataset and
>> >> > got the expected results.  It looks like my setup is good.
>> >> >
>> >> > Here's what I have:
>> >> >
>> >> > I have data coming in the following format: UserId, GroupId, Frequency
>> >> > (how many times the user chose the group), Max timestamp (the last time
>> >> > the user chose the group).
>> >> >
>> >> > Based on this dataset we need to figure out which groups look alike.  I
>> >> > decided to use "item based collaborative filtering" but I have 3 concerns:
>> >> >
>> >> > 1)  We don't have any knowledge of "Dislikes"; we only know which groups
>> >> > users "Like".
>> >> > 2)  We don't really have ratings. In other words, users don't rate the
>> >> > group. Either they choose it or they don't.
>> >> > 3)  Frequency doesn't really imply interest level.
>> >> >
>> >> >
>> >> > I decided to try 'ItemSimilarityJob' by using a CSV file in the following
>> >> > format:
>> >> >
>> >> > UserId, GroupId, "1"
>> >> >
>> >> > In other words, the rating value is always 1.  There are NO rows with
>> >> > value "0".  This is producing NO OUTPUT, but the job finishes successfully.
>> >> >
>> >> > Is this the right way to solve the problem?  Is there some other algorithm
>> >> > that I should be using?  Thanks for the help.
>> >> >
>> >>
>>
>
>
