How many items do you have? RowSimilarityJob loads a dense vector with one entry per item into RAM to avoid a costly join. Maybe this vector has become too big for the task heap.
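
As a rough sanity check, here is a minimal sketch for estimating the size of such a vector. It assumes the vector is backed by a plain array of doubles (8 bytes per entry) and ignores JVM object overhead; the class name and the item count are hypothetical, so plug in your own number.

```java
public class DenseVectorEstimate {

    // Assumption: one Java double (8 bytes) per item; object headers and
    // any other per-task memory use are ignored.
    static long estimateBytes(long numItems) {
        return numItems * Double.BYTES;
    }

    public static void main(String[] args) {
        long numItems = 100_000_000L; // hypothetical item count, replace with yours
        long bytes = estimateBytes(numItems);
        System.out.printf("%d items need roughly %d MB for one dense vector%n",
                numItems, bytes / (1024L * 1024L));
    }
}
```

If the estimate comes close to the -Xmx value of the map tasks, the vector alone could explain the heap error.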

On 25.06.2012 06:49, Something Something wrote:
> Changed it to LoglikelihoodSimilarity, but in step 4
> (RowSimilarityJob-VectorNormMapper-Reducer) get the following error
> related to Java heap space:
> 12/06/24 23:40:52 INFO mapred.JobClient: Task Id : attempt_201202041116_64039_m_000005_0, Status : FAILED
> Error: Java heap space
> attempt_201202041116_64039_m_000005_0: Exception in thread "Timer thread for monitoring jvm" java.lang.IllegalArgumentException: unresolved address
> attempt_201202041116_64039_m_000005_0:     at java.net.DatagramPacket.setSocketAddress(DatagramPacket.java:295)
> attempt_201202041116_64039_m_000005_0:     at java.net.DatagramPacket.<init>(DatagramPacket.java:123)
> attempt_201202041116_64039_m_000005_0:     at java.net.DatagramPacket.<init>(DatagramPacket.java:158)
> attempt_201202041116_64039_m_000005_0:     at org.apache.hadoop.metrics.ganglia.GangliaContext31.emitMetric(GangliaContext31.java:118)
> attempt_201202041116_64039_m_000005_0:     at org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(GangliaContext.java:127)
> attempt_201202041116_64039_m_000005_0:     at org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(AbstractMetricsContext.java:313)
> attempt_201202041116_64039_m_000005_0:     at org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(AbstractMetricsContext.java:299)
> attempt_201202041116_64039_m_000005_0:     at org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(AbstractMetricsContext.java:53)
> attempt_201202041116_64039_m_000005_0:     at org.apache.hadoop.metrics.spi.AbstractMetricsContext$1.run(AbstractMetricsContext.java:258)
> attempt_201202041116_64039_m_000005_0:     at java.util.TimerThread.mainLoop(Timer.java:512)
> attempt_201202041116_64039_m_000005_0:     at java.util.TimerThread.run(Timer.java:462)
>
>
> I have set the following properties:
>
> <property>
>   <name>mapred.task.timeout</name>
>   <value>1800000</value> <!-- 30 minutes -->
> </property>
> <property>
>   <name>mapred.child.java.opts</name>
>   <value>-Xmx4g</value>
> </property>
> <property>
>   <name>mapred.map.child.java.opts</name>
>   <value>-Xmx4g</value>
> </property>
> <property>
>   <name>mapred.reduce.child.java.opts</name>
>   <value>-Xmx4g</value>
> </property>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>50</value>
> </property>
>
>
> On Sun, Jun 24, 2012 at 1:19 AM, Sean Owen <[email protected]> wrote:
>
>> Try LoglikelihoodSimilarity.
>>
>> Where do you run into memory issues? Did you change worker heap
>> settings from the default?
>>
>> On Sat, Jun 23, 2012 at 10:24 PM, Something Something
>> <[email protected]> wrote:
>>> Thank you so much Sean. It was great to get confirmation from you
>>> regarding the choice of algorithm.
>>>
>>> As suggested, I used the following params:
>>>
>>> similarityJob.run(new String[]{"--tempDir",
>>>     tmpDir.getAbsolutePath(), "--similarityClassname",
>>>     CooccurrenceCountSimilarity.class.getName(),"--booleanData",
>>>     String.valueOf(Boolean.TRUE)});
>>>
>>> and got output!!!! Horray.
>>>
>>> Question: Is CooccurenceCountSimilarity best in this case?
>>>
>>>
>>> Anyway, now I am going to try on our production cluster with Billions of
>>> lines. Last time I tried, I ran into OutOfMemoryExceptions. Any
>>> suggestions regarding memory settings?
>>>
>>> Thanks once again for your help.
>>>
>>>
>>> On Fri, Jun 22, 2012 at 11:08 PM, Sean Owen <[email protected]> wrote:
>>>
>>>> Using 1 is just fine for the reasons you give. You would be surprised how
>>>> OK it is to use this even for dislikes. In fact just omit the third field
>>>> in your CSV.
>>>>
>>>> However you need to set the boolean data flag and choose a similarity
>>>> metric that is defined over such data. Pearson / cosine is not for example
>>>> since every value is 1. This is why there is no output.
>>>> On Jun 23, 2012 1:33 AM, "Something Something" <[email protected]>
>>>> wrote:
>>>>
>>>>> I tested my setup of ItemSimilarityJob using the MovieLens dataset & got
>>>>> the expected results. It looks like my setup is good.
>>>>>
>>>>> Here's what I have:
>>>>>
>>>>> I have data coming in the following format: UserId, GroupId, Frequency (how
>>>>> many times the user chose the group), Max timestamp (the last time the user
>>>>> chose the group).
>>>>>
>>>>> Based on this dataset we need to figure out which groups look alike. I
>>>>> decided to use "item based collaborative filtering" but I have 3 concerns:
>>>>>
>>>>> 1) We don't have any knowledge of "Dislikes"; we only know which groups
>>>>> users "Like".
>>>>> 2) We don't really have ratings. In other words, users don't rate the
>>>>> group. Either they choose OR they don't.
>>>>> 3) Frequency doesn't really imply interest level.
>>>>>
>>>>>
>>>>> I decided to try 'ItemSimilarityJob' by using a CSV file in the following
>>>>> format:
>>>>>
>>>>> UserId, GroupId, "1"
>>>>>
>>>>> In other words, the rating value is always 1. There are NO rows with value
>>>>> "0". This is producing NO OUTPUT, but the job finishes successfully.
>>>>>
>>>>> Is this the right way to solve the problem? Is there some other Algorithm
>>>>> that I should be using? Thanks for the help.
