Re: spark-itemsimilarity out of memory problem

2015-01-04 Thread AlShater, Hani
Hi Pat, Thanks again. spark-1.1.0 builds without compilation problems and the earlier errors are gone. But there is still an out of memory problem. The error occurs when Spark tries to write a broadcast variable to disk. I tried to give each executor 25g of memory but the same error occurs again. Also, I

Re: spark-itemsimilarity out of memory problem

2015-01-04 Thread Pat Ferrel
The data structure is a HashBiMap from Guava. Yes, they could be replaced with joins, but there is some extra complexity. The code would have to replace each HashBiMap with some RDD-backed collection. But if there is memory available, perhaps something else is causing the error. Let’s think this
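
For anyone wondering what the join-based alternative might look like, here is a rough Scala sketch (my own illustration, not the actual Mahout code): the external-to-Mahout ID dictionaries become RDDs that are joined against the interaction data instead of being broadcast to every executor as Guava HashBiMaps.

    // Hypothetical sketch: ID translation via joins instead of broadcast BiMaps.
    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x)
    import org.apache.spark.rdd.RDD

    object IdTranslation {
      // interactions: (externalUserId, externalItemId)
      def toMahoutIds(interactions: RDD[(String, String)]): RDD[(Long, Long)] = {
        // Build (externalId -> ordinal) dictionaries as RDDs, not in-memory maps.
        val userDict: RDD[(String, Long)] = interactions.keys.distinct().zipWithIndex()
        val itemDict: RDD[(String, Long)] = interactions.values.distinct().zipWithIndex()

        interactions
          .join(userDict)                                   // (user, (item, userIdx))
          .map { case (_, (item, userIdx)) => (item, userIdx) }
          .join(itemDict)                                   // (item, (userIdx, itemIdx))
          .map { case (_, (userIdx, itemIdx)) => (userIdx, itemIdx) }
      }
    }

The trade-off mentioned above is real: the joins add shuffles and code complexity, while the BiMaps only become a problem when the ID dictionaries no longer fit in executor memory.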

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread hlqv
Hi Pat Ferrel, Using the option --omitStrength outputs indexable data, but this leads to less accuracy while querying because the similarity values between items are omitted. Is there a way to keep these values in order to improve accuracy in a search engine? On 23 December 2014 at 02:17, Pat Ferrel p...@occamsmachete.com

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread AlShater, Hani
@Pat, Thanks for your answers. It seems that I had cloned the snapshot before the feature of configuring Spark was added. It works now in local mode. Unfortunately, after trying the new snapshot and Spark, submitting to the cluster in yarn-client mode raises the following error: Exception in

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread AlShater, Hani
@Pat, I am aware of your blog and of Ted's practical machine learning books and webinars. I have learned a lot from you guys ;) @Ted, It is a small 3-node cluster for a POC. The Spark executor is given 2g and YARN is configured accordingly. I am trying to avoid Spark memory caching. @Simon, I am using

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Ted Dunning
On Tue, Dec 23, 2014 at 7:39 AM, AlShater, Hani halsha...@souq.com wrote: @Ted, It is a small 3-node cluster for a POC. The Spark executor is given 2g and YARN is configured accordingly. I am trying to avoid Spark memory caching. Have you tried the map-reduce version?

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Pat Ferrel
Why do you say it will lead to less accuracy? The weights are LLR weights and they are used to filter and downsample the indicator matrix. Once the downsampling is done they are not needed. When you index the indicators in a search engine they will get TF-IDF weights and this is a good effect.
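
For readers new to the thread, this is roughly what the LLR weight computes, sketched in Scala along the lines of Mahout's LogLikelihood class (an illustration, not the production code): it scores how much more strongly two items co-occur than chance would predict, and once the indicator matrix has been downsampled with it, the raw value is no longer needed.

    // Dunning's log-likelihood ratio for a 2x2 co-occurrence contingency table.
    // k11: both items, k12: only item A, k21: only item B, k22: neither.
    object Llr {
      private def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x)

      // Unnormalized Shannon entropy of a vector of counts.
      private def entropy(counts: Long*): Double =
        xLogX(counts.sum) - counts.map(xLogX).sum

      def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
        val rowEntropy    = entropy(k11 + k12, k21 + k22)
        val columnEntropy = entropy(k11 + k21, k12 + k22)
        val matrixEntropy = entropy(k11, k12, k21, k22)
        math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy))
      }
    }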

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Pat Ferrel
Both errors happen when the Spark Context is created using Yarn. I have no experience with Yarn and so would try it in standalone clustered mode first. Then if all is well check this page to make sure the Spark cluster is configured correctly for Yarn

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread hlqv
Thank you for your explanation. There is a situation that I'm not clear about. I have this item-similarity result:

iphone   nexus:1 ipad:10
surface  nexus:10 ipad:1 galaxy:1

If the LLR weights are omitted, then for a user A with the purchase history 'nexus', which one should the recommendation engine prefer -

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Pat Ferrel
There is a large-ish data structure in the Spark version of this algorithm. Each slave has a copy of several BiMaps that handle translation of your IDs into and out of Mahout IDs. One of these is created for user IDs, and one for each item ID set. For a single action that would be 2 BiMaps.
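
A rough illustration of the kind of dictionary described above (not the Mahout source): a Guava HashBiMap holds external ID -> Mahout ordinal and its inverse in one structure, and each executor carries a full copy.

    import com.google.common.collect.HashBiMap

    object IdDictionary {
      // Forward map: external ID -> Mahout ordinal; inverse() gives ordinal -> ID.
      val userIds: HashBiMap[String, Integer] = HashBiMap.create[String, Integer]()

      def ordinalFor(externalId: String): Integer = {
        val existing = userIds.get(externalId)
        if (existing != null) existing
        else {
          val next = Integer.valueOf(userIds.size)
          userIds.put(externalId, next)
          next
        }
      }

      def externalIdFor(ordinal: Integer): String = userIds.inverse().get(ordinal)
    }

With tens of millions of distinct string IDs, the copies of these maps can dominate each executor's heap, which is the memory pressure being discussed in this thread.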

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Ted Dunning
On Tue, Dec 23, 2014 at 9:16 AM, Pat Ferrel p...@occamsmachete.com wrote: To use the hadoop mapreduce version (Ted’s suggestion) you’ll lose the cross-cooccurrence indicators and you’ll have to translate your IDs into Mahout IDs. This means mapping user and item IDs from your values into

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Pat Ferrel
First of all you need to index that indicator matrix with a search engine. Then the query will be your user’s history. The search engine weights with TF-IDF and the query is based on cosine similarity of doc to query terms. So the weights won’t be the ones you have below, they will be TF-IDF
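
A toy Scala sketch of the retrieval step described here (an illustration with made-up structures, not tied to any particular search engine's API): each item becomes a "document" whose terms are its indicator items, the query is the user's history, and ranking uses cosine similarity over TF-IDF weights rather than the stored LLR strengths.

    object TfIdfQuery {
      // docs: itemId -> indicator items ("terms") taken from the indicator matrix
      def rank(docs: Map[String, Seq[String]], userHistory: Seq[String]): Seq[(String, Double)] = {
        val numDocs = docs.size.toDouble
        val docFreq: Map[String, Int] =
          docs.values.flatMap(_.distinct).groupBy(identity).mapValues(_.size).toMap

        def idf(term: String): Double =
          math.log(numDocs / (1.0 + docFreq.getOrElse(term, 0)))

        // TF-IDF weighted sparse vector for a bag of terms.
        def vector(terms: Seq[String]): Map[String, Double] =
          terms.groupBy(identity).map { case (t, ts) => t -> ts.size * idf(t) }

        def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
          val dot  = a.keySet.intersect(b.keySet).toSeq.map(k => a(k) * b(k)).sum
          val norm = math.sqrt(a.values.map(x => x * x).sum) *
                     math.sqrt(b.values.map(x => x * x).sum)
          if (norm == 0.0) 0.0 else dot / norm
        }

        val query = vector(userHistory)
        docs.toSeq
          .map { case (item, terms) => item -> cosine(query, vector(terms)) }
          .sortBy(-_._2)
      }
    }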

Re: spark-itemsimilarity out of memory problem

2014-12-22 Thread Pat Ferrel
The job has an option -sem to set the spark.executor.memory config. Also you can change runtime job config with -D:key=value to access any of the Spark config values. On Dec 21, 2014, at 11:44 PM, AlShater, Hani halsha...@souq.com wrote: Hi All, I am trying to use spark-itemsimilarity on 160M
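
As a concrete (hypothetical) example of the above, an invocation might look like the following; the input/output paths and memory sizes are placeholders, and flag spellings are worth checking against mahout spark-itemsimilarity --help:

    # Hypothetical invocation; paths and sizes are placeholders.
    mahout spark-itemsimilarity \
      --input hdfs:///data/interactions.csv \
      --output hdfs:///data/indicators \
      --master yarn-client \
      -sem 6g \
      -D:spark.yarn.executor.memoryOverhead=1024

Here -sem sets spark.executor.memory for the job, and the -D:key=value form passes any other Spark config key through unchanged.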

Re: spark-itemsimilarity out of memory problem

2014-12-22 Thread Ted Dunning
Can you say what kind of cluster you have? How many machines? How much memory? How much memory is given to Spark? On Sun, Dec 21, 2014 at 11:44 PM, AlShater, Hani halsha...@souq.com wrote: Hi All, I am trying to use spark-itemsimilarity on 160M user interactions dataset. The job launches

Re: spark-itemsimilarity out of memory problem

2014-12-22 Thread Pat Ferrel
Hi Hani, I recently read about Souq.com. A very promising project. If you are looking at spark-itemsimilarity for ecommerce-type recommendations, you may be interested in some slide decks and blog posts I’ve done on the subject. Check out:

Re: spark-itemsimilarity out of memory problem

2014-12-22 Thread Pat Ferrel
Also, Ted has an ebook you can download: mapr.com/practical-machine-learning On Dec 22, 2014, at 10:52 AM, Pat Ferrel p...@occamsmachete.com wrote: Hi Hani, I recently read about Souq.com. A very promising project. If you are looking at the spark-itemsimilarity for ecommerce type

spark-itemsimilarity out of memory problem

2014-12-21 Thread AlShater, Hani
Hi All, I am trying to use spark-itemsimilarity on a 160M user interactions dataset. The job launches and runs successfully for small data (1M actions). However, when trying the larger dataset, some Spark stages continuously fail with an out of memory exception. I tried to change the