Re: Question about Item Based Collaborative Filtering

Sebastian Schelter Sun, 24 Jun 2012 23:21:48 -0700

How many items do you have? RowSimilarityJob loads a dense vector with
one entry per item into RAM to avoid a costly join. Maybe this vector is
becoming too big.
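
For a rough sense of scale, here is a back-of-the-envelope sketch of how large such a vector could get. It assumes one 8-byte double per item and ignores array headers and everything else the task keeps on the heap. The class name, the item counts, and the "half of the heap" threshold are all made up for illustration and are not taken from Mahout; only the 4 GB figure comes from the -Xmx4g setting quoted below.

```java
/** Back-of-the-envelope heap estimate for one dense vector of doubles. */
final class DenseVectorEstimate {

    // Assumption: each entry is a primitive double (8 bytes); array header ignored.
    static final long BYTES_PER_ENTRY = Double.BYTES;

    /** Approximate bytes needed for a dense vector with one entry per item. */
    static long estimateBytes(long numItems) {
        return Math.multiplyExact(numItems, BYTES_PER_ENTRY);
    }

    /** True if the vector fits into the given fraction of the task heap. */
    static boolean fitsInHeap(long numItems, long heapBytes, double usableFraction) {
        return estimateBytes(numItems) <= (long) (heapBytes * usableFraction);
    }

    public static void main(String[] args) {
        long heapBytes = 4L * 1024 * 1024 * 1024; // -Xmx4g from the quoted config
        long[] itemCounts = {1_000_000L, 100_000_000L, 1_000_000_000L}; // hypothetical
        for (long n : itemCounts) {
            System.out.printf("%,d items -> about %,d MB, fits in half the heap: %b%n",
                    n, estimateBytes(n) / (1024 * 1024), fitsInHeap(n, heapBytes, 0.5));
        }
    }
}
```

If the vector is sized by the largest item index rather than by the number of distinct items, sparse or very large item IDs would inflate the estimate accordingly.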


On 25.06.2012 06:49, Something Something wrote:
> Changed it to LoglikelihoodSimilarity, but in step 4
> (RowSimilarityJob-VectorNormMapper-Reducer) I get the following error
> related to Java heap space:
> 12/06/24 23:40:52 INFO mapred.JobClient: Task Id : attempt_201202041116_64039_m_000005_0, Status : FAILED
> Error: Java heap space
> attempt_201202041116_64039_m_000005_0: Exception in thread "Timer thread for monitoring jvm" java.lang.IllegalArgumentException: unresolved address
> attempt_201202041116_64039_m_000005_0:  at java.net.DatagramPacket.setSocketAddress(DatagramPacket.java:295)
> attempt_201202041116_64039_m_000005_0:  at java.net.DatagramPacket.<init>(DatagramPacket.java:123)
> attempt_201202041116_64039_m_000005_0:  at java.net.DatagramPacket.<init>(DatagramPacket.java:158)
> attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.ganglia.GangliaContext31.emitMetric(GangliaContext31.java:118)
> attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(GangliaContext.java:127)
> attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(AbstractMetricsContext.java:313)
> attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(AbstractMetricsContext.java:299)
> attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(AbstractMetricsContext.java:53)
> attempt_201202041116_64039_m_000005_0:  at org.apache.hadoop.metrics.spi.AbstractMetricsContext$1.run(AbstractMetricsContext.java:258)
> attempt_201202041116_64039_m_000005_0:  at java.util.TimerThread.mainLoop(Timer.java:512)
> attempt_201202041116_64039_m_000005_0:  at java.util.TimerThread.run(Timer.java:462)
> 
> 
> 
> I have set the following properties:
> 
>     <property>
>         <name>mapred.task.timeout</name>
>         <value>1800000</value> <!-- 30 minutes -->
>     </property>
>     <property>
>         <name>mapred.child.java.opts</name>
>         <value>-Xmx4g</value>
>     </property>
>     <property>
>         <name>mapred.map.child.java.opts</name>
>         <value>-Xmx4g</value>
>     </property>
>     <property>
>         <name>mapred.reduce.child.java.opts</name>
>         <value>-Xmx4g</value>
>     </property>
>     <property>
>         <name>mapred.reduce.tasks</name>
>         <value>50</value>
>     </property>
> 
> 
> On Sun, Jun 24, 2012 at 1:19 AM, Sean Owen <[email protected]> wrote:
> 
>> Try LoglikelihoodSimilarity.
>>
>> Where do you run into memory issues? Did you change worker heap
>> settings from the default?
>>
>> On Sat, Jun 23, 2012 at 10:24 PM, Something Something
>> <[email protected]> wrote:
>>> Thank you so much Sean.  It was great to get confirmation from you
>>> regarding the choice of algorithm.
>>>
>>> As suggested, I used the following params:
>>>
>>>     similarityJob.run(new String[] {
>>>         "--tempDir", tmpDir.getAbsolutePath(),
>>>         "--similarityClassname", CooccurrenceCountSimilarity.class.getName(),
>>>         "--booleanData", String.valueOf(Boolean.TRUE)
>>>     });
>>>
>>> and got output!!!!  Hooray.
>>>
>>> Question:  Is CooccurrenceCountSimilarity best in this case?
>>>
>>>
>>> Anyway, now I am going to try on our production cluster with billions of
>>> lines.  Last time I tried, I ran into OutOfMemoryErrors.  Any
>>> suggestions regarding memory settings?
>>>
>>> Thanks once again for your help.
>>>
>>>
>>> On Fri, Jun 22, 2012 at 11:08 PM, Sean Owen <[email protected]> wrote:
>>>
>>>> Using 1 is just fine for the reasons you give. You would be surprised
>>>> how OK it is to use this even for dislikes. In fact, just omit the
>>>> third field in your CSV.
>>>>
>>>> However, you need to set the boolean data flag and choose a similarity
>>>> metric that is defined over such data. Pearson / cosine is not, for
>>>> example, since every value is 1. This is why there is no output.
>>>> On Jun 23, 2012 1:33 AM, "Something Something" <[email protected]> wrote:
>>>>
>>>>> I tested my setup of ItemSimilarityJob using the MovieLens dataset &
>>>>> got the expected results.  It looks like my setup is good.
>>>>>
>>>>> Here's what I have:
>>>>>
>>>>> I have data coming in the following format: UserId, GroupId, Frequency
>>>>> (how many times the user chose the group), Max timestamp (the last
>>>>> time the user chose the group).
>>>>>
>>>>> Based on this dataset we need to figure out which groups look alike. I
>>>>> decided to use "item based collaborative filtering" but I have 3
>>>>> concerns:
>>>>>
>>>>> 1)  We don't have any knowledge of "Dislikes"; we only know which
>>>>> groups users "Like".
>>>>> 2)  We don't really have ratings. In other words, users don't rate the
>>>>> group. Either they choose it OR they don't.
>>>>> 3)  Frequency doesn't really imply interest level.
>>>>>
>>>>>
>>>>> I decided to try 'ItemSimilarityJob' by using a CSV file in the
>>>>> following format:
>>>>>
>>>>> UserId, GroupId, "1"
>>>>>
>>>>> In other words, the rating value is always 1.  There are NO rows with
>>>>> value "0".  This is producing NO OUTPUT, but the job finishes
>>>>> successfully.
>>>>>
>>>>> Is this the right way to solve the problem?  Is there some other
>>>>> algorithm that I should be using?  Thanks for the help.
>>>>>
>>>>
>>
>
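
Sean's remark in the quoted thread, that Pearson is not defined over data where every value is 1, can be checked with a few lines of plain Java. This is only an illustrative sketch with hypothetical class and method names, not Mahout's implementation: it computes textbook Pearson correlation and cosine similarity over the co-rated entries of two items.

```java
/** Textbook similarity measures applied to all-ones "ratings". */
final class ConstantRatings {

    /** Pearson correlation; yields NaN when either input has zero variance. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // With constant inputs this is 0.0 / 0.0, which is NaN in Java.
        return cov / Math.sqrt(varX * varY);
    }

    /** Cosine similarity over the same co-rated entries. */
    static double cosine(double[] x, double[] y) {
        double dot = 0, normX = 0, normY = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i];
            normX += x[i] * x[i];
            normY += y[i] * y[i];
        }
        return dot / Math.sqrt(normX * normY);
    }

    public static void main(String[] args) {
        // Two items, each "rated" 1 by the same four hypothetical users.
        double[] itemA = {1, 1, 1, 1};
        double[] itemB = {1, 1, 1, 1};
        System.out.println("pearson = " + pearson(itemA, itemB));
        System.out.println("cosine  = " + cosine(itemA, itemB));
    }
}
```

Pearson comes out as NaN because both inputs have zero variance, and cosine over co-rated entries is the same for every pair of items, so neither can rank one pair above another. Measures defined over boolean data, such as co-occurrence counts or log-likelihood, avoid this.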
