The data structure is a HashBiMap from Guava. Yes, they could be replaced with
joins but there is some extra complexity: the code would have to replace each
HashBiMap with some RDD-backed collection. But if there is memory available,
perhaps something else is causing the error. Let’s think this through.

Do you know the physical memory required for your user and item ID HashBiMaps?
Each HashBiMap is Int <-> String. How many users and items do you have in your
complete dataset? You say the error occurs when the HashBiMap is being written
to disk? It is never explicitly written; do you mean serialized, as in
broadcast to another executor? But you only have one. Can you attach logs and
send them to my email address?
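
For a rough back-of-envelope estimate (my per-entry number is an assumption,
not a measurement): a HashBiMap keeps two hash tables, so each Int <-> String
entry costs on the order of 100-200 bytes once you count the boxed Integer,
the String, and the table entries. A toy sketch of the kind of dictionary
involved and a crude size estimate:

    import com.google.common.collect.HashBiMap

    // Toy stand-in for the user ID dictionary: external String ID <-> internal Int.
    val userDict: HashBiMap[String, Integer] = HashBiMap.create()
    val userIDs = Seq("u-0001", "u-0002", "u-0003")   // hypothetical external IDs
    userIDs.zipWithIndex.foreach { case (id, i) => userDict.put(id, i) }

    // Crude estimate: ~150 bytes per entry is an assumption, not a measurement.
    val bytesPerEntry = 150L
    val estimatedBytes = userDict.size() * bytesPerEntry
    println(s"~${estimatedBytes / (1024 * 1024)} MB for ${userDict.size()} entries")

With tens of millions of users plus items that can easily run to several GB
per JVM, which would explain OOMs even when the cluster as a whole has free
memory.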

One problem we have in Mahout is access to large public datasets. There is no
large public ecommerce dataset I know of with multiple actions. We use the
epinions dataset because it has two actions and is non-trivial, but it isn’t
extra large either. Not sure of the size, I’ll look into it. It requires on the
order of 6g of executor memory. Only one copy of the broadcast HashBiMaps is
created on each node machine, and all local tasks use it read-only.
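
In plain Spark terms the broadcast pattern looks roughly like this (a sketch,
not the actual spark-itemsimilarity code; the names are mine):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import com.google.common.collect.HashBiMap

    // Assumed: dict is the external String -> internal Int dictionary built on the
    // driver, and recsByInternalID holds results keyed by the internal Int user ID.
    def attachExternalIDs(sc: SparkContext,
                          dict: HashBiMap[String, Integer],
                          recsByInternalID: RDD[(Int, Seq[String])]): RDD[(String, Seq[String])] = {
      val bcastDict = sc.broadcast(dict)   // one copy per executor JVM
      recsByInternalID.map { case (internalID, recs) =>
        (bcastDict.value.inverse().get(internalID), recs)   // read-only lookup in each task
      }
    }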

As to using only one executor, I’ve seen that too. It seems to be related to
how data splits are created in Spark. You may have enough memory that no other
executor is needed. Odd, because in some cases you might want the extra
executors for CPU-bound problems, so there is probably some config to force
more executors. I doubt very much that you are CPU bound though, so it may be
OK here.
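
If you do want to force it, the usual knobs in a generic Spark-on-YARN app are
spark.executor.instances, spark.executor.memory, and spark.executor.cores. How
they get passed through the spark-itemsimilarity driver may differ, and the
values below are only placeholders:

    import org.apache.spark.SparkConf

    // Illustrative values only; tune for your cluster.
    val conf = new SparkConf()
      .set("spark.executor.instances", "4")   // ask YARN for more than one executor
      .set("spark.executor.memory", "8g")
      .set("spark.executor.cores", "4")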

If we really do have HashBiMaps that are too large, then there are ways to
remove them.

The special problem in spark-itemsimilarity is getting one collection of unique
user IDs that spans all cross-cooccurrence indicators. A matrix multiply is
performed for each cross-cooccurrence indicator, so the row space of _all_
matrices must be the same. This means that as new data for the secondary
actions is read in, the dimensionality of the previously read matrices must be
updated, and the user ID collection must be updated along with it.
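
Conceptually, in Mahout Scala DSL-style notation (a sketch, not the actual
code): if A is the primary-action matrix and B a secondary-action matrix, both
users x items, the cross-cooccurrence counts come from A' B (before LLR
downsampling), so A and B must be keyed by the same user rows:

    // Sketch only. drmA = users x itemsA (primary action), drmB = users x itemsB
    // (secondary action). The product is itemsA x itemsB and is only meaningful if
    // row i of drmA and row i of drmB refer to the same user, hence the shared
    // user ID dictionary across all input actions.
    val crossIndicator = drmA.t %*% drmB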

There are at least two ways to solve the user and item ID mapping that don’t
require a HashBiMap. 1) Do it the way legacy Hadoop Mahout did: ignore the
issue and use only internal Mahout IDs, which means the developer must perform
the mapping before and after the job. This would be relatively easy to do in
spark-itemsimilarity; in fact it is noted as a “todo” in the code, for
optimization purposes. 2) Restructure the input pipeline to read in all data
before the Mahout Spark DRMs are created. This would allow easier use of joins
and rdd.distinct for managing very large ID sets. I think the input would have
to use external IDs initially, then join the distinct IDs with Mahout IDs to
create a DRM. Another join would be required before output to get external IDs
again. A partial solution might come from recent work to allow DRMs with
non-Int IDs. I’ll ask about that, but it would only solve the user ID problem,
not the item IDs. That may be enough for you.

#1 just puts the problem on the user of Mahout, and this has been a constant
issue with previous versions, so unless someone is already doing the
translation of IDs it’s not very satisfying.
#2 would mean a fair bit longer runtime, since joins are much slower than hash
lookups. But it may be an option, since there is probably no better one given
the constraints. Optimizing to a hash when memory is not a problem, and falling
back to joins when memory is a constraint, may be the best solution.
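
To make #2 concrete, here is a rough sketch with plain RDD operations, assuming
the input has already been read as (userID, itemID) pairs; the names and
structure are mine, not spark-itemsimilarity’s:

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // Assumed: interactions = RDD[(externalUserID, itemID)], and resultsByUserInt is
    // whatever the job produces keyed by the internal Int user ID (hypothetical).
    def remapUserIDs(interactions: RDD[(String, String)],
                     resultsByUserInt: RDD[(Int, String)]) = {
      // Assign dense internal Int IDs to the distinct external user IDs.
      val userToInternal = interactions.keys.distinct().zipWithIndex().mapValues(_.toInt)

      // Translate interactions to internal user IDs with a join instead of a hash map.
      val byInternalUser = interactions.join(userToInternal)
        .map { case (_, (itemID, userInt)) => (userInt, itemID) }

      // After the math, another join on the swapped mapping recovers external IDs.
      val internalToUser = userToInternal.map(_.swap)
      val withExternalUsers = resultsByUserInt.join(internalToUser)
      (byInternalUser, withExternalUsers)
    }

The same dance would be needed for item IDs, and every one of those steps is a
shuffle, which is where the extra runtime in #2 comes from.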

On Jan 4, 2015, at 3:12 AM, AlShater, Hani <halsha...@souq.com> wrote:

Hi Pat,

Thanks again, spark-1.1.0 works without compilation problems and the errors
have gone. But still, there is an out of memory problem. The error occurs when
Spark is trying to write a broadcast variable to disk. I tried to give each
executor 25g of memory but the same error occurs again. Also, I noticed that
when memory is increased, Spark uses only one executor instead of multiple.
And surprisingly, the out of memory error occurs although there is free memory
available to YARN.

Do you have examples of dataset size (number of items, users, actions) and the
cluster memory used to fit it?

If I understand you correctly, there is a large broadcast variable for mapping
IDs. Is it a kind of map-side join to map recommendation results to IDs? Can
it be avoided using Spark joins?

best regards

Hani Al-Shater | Data Science Manager - Souq.com <http://souq.com/>
Mob: +962 790471101 | Phone: +962 65821236 | Skype:
hani.alsha...@outlook.com | halsha...@souq.com <lgha...@souq.com> |
www.souq.com
Nouh Al Romi Street, Building number 8, Amman, Jordan


On Tue, Dec 23, 2014 at 7:42 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> First of all you need to index that indicator matrix with a search engine.
> Then the query will be your user’s history. The search engine weights with
> TF-IDF and the query is based on cosine similarity of doc to query terms.
> So the weights won’t be the ones you have below, they will be TF-IDF
> weights. This is as expected.
> 
> In a real-world setting you will have a great deal more data than below
> and the downsampling, which uses the LLR weights, will take only the
> highest weighted items and toss the lower weighted ones so the difference
> in weight will not really matter. The reason for downsampling is that the
> lower weighted items add very little value to the results. Leaving them all
> in will cause the algorithm to approach O(n^2) runtime.
> 
> In short the answer to the question of how to interpret the data below is:
> you don’t have enough data for real-world recs.  Intuitions in the
> microscopic do not always scale up to real-world data.
> 
> 
> On Dec 23, 2014, at 9:18 AM, hlqv <hlqvu...@gmail.com> wrote:
> 
> Thank you for your explanation
> 
> There is a situation where I'm not clear. I have this item similarity
> result:
> 
> iphone    nexus:1 ipad:10
> surface   nexus:10 ipad:1 galaxy:1
> 
> Omitting LLR weights, then: if a user A has the purchase history 'nexus',
> which one should the recommendation engine prefer, 'iphone' or 'surface'?
> If a user B has the purchase history 'ipad', 'galaxy', then I think the
> recommendation engine should recommend 'iphone' instead of 'surface' (if
> TF-IDF weighting is applied, the recommendation engine will return
> 'surface').
> 
> I really don't know whether my understanding here has some mistake
> 
> On 23 December 2014 at 23:14, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
>> Why do you say it will lead to less accuracy?
>> 
>> The weights are LLR weights and they are used to filter and downsample the
>> indicator matrix. Once the downsampling is done they are not needed. When
>> you index the indicators in a search engine they will get TF-IDF weights
>> and this is a good effect. It will downweight very popular items which hold
>> little value as an indicator of user’s taste.
>> 
>> On Dec 23, 2014, at 1:17 AM, hlqv <hlqvu...@gmail.com> wrote:
>> 
>> Hi Pat Ferrel
>> Using the option --omitStrength to output indexable data leads to less
>> accuracy while querying, because the similarity values between items are
>> omitted. Can these values be kept in order to improve accuracy in a search
>> engine?
>> 
>> On 23 December 2014 at 02:17, Pat Ferrel <p...@occamsmachete.com> wrote:
>> 
>>> Also Ted has an ebook you can download:
>>> mapr.com/practical-machine-learning
>>> 
>>> On Dec 22, 2014, at 10:52 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>> 
>>> Hi Hani,
>>> 
>>> I recently read about Souq.com. A very promising project.
>>> 
>>> If you are looking at spark-itemsimilarity for ecommerce-type
>>> recommendations you may be interested in some slide decks and blog posts
>>> I’ve done on the subject.
>>> Check out:
>>> 
>>> http://occamsmachete.com/ml/2014/10/07/creating-a-unified-recommender-with-mahout-and-a-search-engine/
>>> http://occamsmachete.com/ml/2014/08/11/mahout-on-spark-whats-new-in-recommenders/
>>> http://occamsmachete.com/ml/2014/09/09/mahout-on-spark-whats-new-in-recommenders-part-2/
>>> 
>>> Also I put up a demo site that uses some of these techniques:
>>> https://guide.finderbots.com
>>> 
>>> Good luck,
>>> Pat
>>> 
>>> On Dec 21, 2014, at 11:44 PM, AlShater, Hani <halsha...@souq.com>
> wrote:
>>> 
>>> Hi All,
>>> 
>>> I am trying to use spark-itemsimilarity on a 160M user interaction
>>> dataset. The job launches and runs successfully for small data (1M
>>> actions). However, when trying the larger dataset, some Spark stages
>>> continuously fail with an out of memory exception.
>>>
>>> I tried to change spark.storage.memoryFraction from the Spark default
>>> configuration, but I face the same issue again. How should I configure
>>> Spark when using spark-itemsimilarity, or how can I overcome this out of
>>> memory issue?
>>> 
>>> Can you please advise?
>>> 
>>> Thanks,
>>> Hani.
>>> 
>>> Hani Al-Shater | Data Science Manager - Souq.com <http://souq.com/>
>>> Mob: +962 790471101 | Phone: +962 65821236 | Skype:
>>> hani.alsha...@outlook.com | halsha...@souq.com <lgha...@souq.com> |
>>> www.souq.com
>>> Nouh Al Romi Street, Building number 8, Amman, Jordan
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
