Please ignore my last email. After I increased the memory size to 8g, all steps worked. I am now validating the output. Thanks a lot for all your help.

On Sun, Jun 24, 2012 at 9:49 PM, Something Something <[email protected]> wrote:

> Changed it to LoglikelihoodSimilarity, but in step 4
> (RowSimilarityJob-VectorNormMapper-Reducer) get the following error
> related to Java heap space:
>
> 12/06/24 23:40:52 INFO mapred.JobClient: Task Id : attempt_201202041116_64039_m_000005_0, Status : FAILED
> Error: Java heap space
> attempt_201202041116_64039_m_000005_0: Exception in thread "Timer thread for monitoring jvm" java.lang.IllegalArgumentException: unresolved address
> attempt_201202041116_64039_m_000005_0: at java.net.DatagramPacket.setSocketAddress(DatagramPacket.java:295)
> attempt_201202041116_64039_m_000005_0: at java.net.DatagramPacket.<init>(DatagramPacket.java:123)
> attempt_201202041116_64039_m_000005_0: at java.net.DatagramPacket.<init>(DatagramPacket.java:158)
> attempt_201202041116_64039_m_000005_0: at org.apache.hadoop.metrics.ganglia.GangliaContext31.emitMetric(GangliaContext31.java:118)
> attempt_201202041116_64039_m_000005_0: at org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(GangliaContext.java:127)
> attempt_201202041116_64039_m_000005_0: at org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(AbstractMetricsContext.java:313)
> attempt_201202041116_64039_m_000005_0: at org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(AbstractMetricsContext.java:299)
> attempt_201202041116_64039_m_000005_0: at org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(AbstractMetricsContext.java:53)
> attempt_201202041116_64039_m_000005_0: at org.apache.hadoop.metrics.spi.AbstractMetricsContext$1.run(AbstractMetricsContext.java:258)
> attempt_201202041116_64039_m_000005_0: at java.util.TimerThread.mainLoop(Timer.java:512)
> attempt_201202041116_64039_m_000005_0: at java.util.TimerThread.run(Timer.java:462)
>
> I have set the following properties:
>
> <property>
>   <name>mapred.task.timeout</name>
>   <value>1800000</value> <!-- 30 minutes -->
> </property>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx4g</value>
> </property>
> <property>
>   <name>mapred.map.child.java.opts</name>
>   <value>-Xmx4g</value>
> </property>
> <property>
>   <name>mapred.reduce.child.java.opts</name>
>   <value>-Xmx4g</value>
> </property>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>50</value>
> </property>
>
> On Sun, Jun 24, 2012 at 1:19 AM, Sean Owen <[email protected]> wrote:
>
>> Try LoglikelihoodSimilarity.
>>
>> Where do you run into memory issues? Did you change worker heap
>> settings from the default?
>>
>> On Sat, Jun 23, 2012 at 10:24 PM, Something Something
>> <[email protected]> wrote:
>> > Thank you so much Sean. It was great to get confirmation from you
>> > regarding the choice of algorithm.
>> >
>> > As suggested, I used the following params:
>> >
>> > similarityJob.run(new String[]{"--tempDir",
>> >     tmpDir.getAbsolutePath(), "--similarityClassname",
>> >     CooccurrenceCountSimilarity.class.getName(), "--booleanData",
>> >     String.valueOf(Boolean.TRUE)});
>> >
>> > and got output!!!! Horray.
>> >
>> > Question: Is CooccurenceCountSimilarity best in this case?
>> >
>> > Anyway, now I am going to try on our production cluster with Billions of
>> > lines. Last time I tried, I ran into OutOfMemoryExceptions. Any
>> > suggestions regarding memory settings?
>> >
>> > Thanks once again for your help.
>> >
>> > On Fri, Jun 22, 2012 at 11:08 PM, Sean Owen <[email protected]> wrote:
>> >
>> >> Using 1 is just fine for the reasons you give. You would be surprised how
>> >> OK it is to use this even for dislikes. In fact just omit the third field
>> >> in your CSV.
>> >>
>> >> However you need to set the boolean data flag and choose a similarity
>> >> metric that is defined over such data. Pearson / cosine is not for example
>> >> since every value is 1. This is why there is no output.
>> >>
>> >> On Jun 23, 2012 1:33 AM, "Something Something" <[email protected]>
>> >> wrote:
>> >>
>> >> > I tested my setup of ItemSimilarityJob using the MovieLens dataset & got
>> >> > the expected results. It looks like my setup is good.
>> >> >
>> >> > Here's what I have:
>> >> >
>> >> > I have data coming in the following format: UserId, GroupId, Frequency (how
>> >> > many times the user chose the group), Max timestamp (the last time the user
>> >> > chose the group).
>> >> >
>> >> > Based on this dataset we need to figure out which groups look alike. I
>> >> > decided to use "item based collaborative filtering" but I have 3 concerns:
>> >> >
>> >> > 1) We don't have any knowledge of "Dislikes"; we only know which groups
>> >> > users "Like".
>> >> > 2) We don't really have ratings. In other words, users don't rate the
>> >> > group. Either they choose OR they don't.
>> >> > 3) Frequency doesn't really imply interest level.
>> >> >
>> >> > I decided to try 'ItemSimilarityJob' by using a CSV file in the following
>> >> > format:
>> >> >
>> >> > UserId, GroupId, "1"
>> >> >
>> >> > In other words, the rating value is always 1. There are NO rows with value
>> >> > "0". This is producing NO OUTPUT, but the job finishes successfully.
>> >> >
>> >> > Is this the right way to solve the problem? Is there some other Algorithm
>> >> > that I should be using? Thanks for the help.
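
For readers following the thread: the suggestion to "just omit the third field in your CSV" could be sketched roughly as below. This is only an illustration under assumptions, not code from the thread. The class name `BooleanPrefs`, its methods, and the sample rows are hypothetical, and it assumes the input rows are plain comma-separated text in the order UserId, GroupId, Frequency, Max timestamp as described in the original question. It produces `UserId,GroupId` rows; the job itself would still need the boolean data flag and a similarity measure defined over such data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Mahout): reduces rows of the form
// "UserId, GroupId, Frequency, MaxTimestamp" to boolean preferences "UserId,GroupId".
public class BooleanPrefs {

    // Keep only the first two comma-separated fields, trimmed of whitespace.
    public static String toBooleanPref(String line) {
        String[] fields = line.split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("expected at least UserId,GroupId: " + line);
        }
        return fields[0].trim() + "," + fields[1].trim();
    }

    // Convert every non-blank input row; blank rows are skipped.
    public static List<String> convert(List<String> lines) {
        List<String> out = new ArrayList<String>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            out.add(toBooleanPref(line));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<String> sample = Arrays.asList(
                "42, 7, 3, 1340000000",
                "42, 9, 1, 1340000500");
        for (String row : convert(sample)) {
            System.out.println(row);
        }
    }
}
```

On a cluster-sized input this step would more likely be done as part of whatever job already produces the CSV; the sketch only shows the shape of the output rows.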
