Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread hlqv
Hi Pat Ferrel, Using the option --omitStrength outputs indexable data, but doesn't this lead to less accuracy while querying, since the similarity values between items are omitted? Can these values be kept in order to improve accuracy in a search engine? On 23 December 2014 at 02:17, Pat Ferrel p...@occamsmachete.com …
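For context, a minimal invocation using this flag might look like the sketch below; the input/output paths and the --master value are illustrative, only --omitStrength is taken from the thread:

    # --omitStrength drops the LLR strengths, so each row is a plain list of
    # similar item IDs, ready for indexing in a search engine.
    mahout spark-itemsimilarity \
      --input /data/purchases.csv \
      --output /data/indicators \
      --master local[4] \
      --omitStrength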

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread AlShater, Hani
@Pat, Thanks for your answers. It seems that I cloned the snapshot before the Spark configuration feature was added. It now works in local mode. Unfortunately, after trying the new snapshot and Spark, submitting to the cluster in yarn-client mode raises the following error: Exception in …

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread AlShater, Hani
@Pat, I am aware of your blog and of Ted's practical machine learning books and webinars. I have learned a lot from you guys ;) @Ted, it is a small 3-node cluster for a POC. The Spark executor is given 2g and YARN is configured accordingly. I am trying to avoid Spark memory caching. @Simon, I am using …

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Ted Dunning
On Tue, Dec 23, 2014 at 7:39 AM, AlShater, Hani halsha...@souq.com wrote: @Ted, it is a small 3-node cluster for a POC. The Spark executor is given 2g and YARN is configured accordingly. I am trying to avoid Spark memory caching. Have you tried the map-reduce version?
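The map-reduce version Ted refers to is presumably the classic Hadoop ItemSimilarityJob. A sketch of an invocation, with hypothetical paths and the assumption that the input already uses integer Mahout IDs (userID,itemID[,pref] per line):

    mahout itemsimilarity \
      --input /data/prefs.csv \
      --output /data/item-item \
      --similarityClassname SIMILARITY_LOGLIKELIHOOD \
      --maxSimilaritiesPerItem 100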

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Pat Ferrel
Why do you say it will lead to less accuracy? The weights are LLR weights and they are used to filter and downsample the indicator matrix. Once the downsampling is done they are not needed. When you index the indicators in a search engine they will get TF-IDF weights and this is a good effect.
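For readers unfamiliar with the weighting being discussed: the LLR score used for downsampling is, in the usual formulation, the G-test over the 2x2 cooccurrence contingency table (this summary is background, not part of the original message):

    G^2 = 2 \sum_{i,j} k_{ij} \ln( k_{ij} N / (k_{i\cdot} k_{\cdot j}) )

where k_{ij} counts users partitioned by whether they interacted with each of the two items, N is the total count, and k_{i\cdot}, k_{\cdot j} are the row and column sums. Pairs with a high G^2 are kept in the indicator matrix; the score itself can then be discarded.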

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Pat Ferrel
Both errors happen when the Spark context is created using YARN. I have no experience with YARN, so I would try it in standalone clustered mode first. Then, if all is well, check this page to make sure the Spark cluster is configured correctly for YARN: …
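A sketch of this suggestion, assuming the driver accepts a --master option and Mahout's -D:key=value pass-through for arbitrary Spark properties (the host name and memory size are hypothetical):

    # Point the driver at a standalone Spark master instead of YARN,
    # and give each executor more headroom.
    mahout spark-itemsimilarity \
      --input /data/purchases.csv \
      --output /data/indicators \
      --master spark://master-host:7077 \
      -D:spark.executor.memory=4g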

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread hlqv
Thank you for your explanation. There is a situation I'm not clear on. Suppose I have this item-similarity result:

    iphone   nexus:1 ipad:10
    surface  nexus:10 ipad:1 galaxy:1

If we omit the LLR weights, then if a user A has the purchase history 'nexus', which one should the recommendation engine prefer - …

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Pat Ferrel
There is a large-ish data structure in the Spark version of this algorithm. Each slave has a copy of several BiMaps that handle translation of your IDs into and out of Mahout IDs. One of these is created for user IDs, and one for each item ID set. For a single action that would be 2 BiMaps.
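As a rough back-of-the-envelope estimate (an illustration, not a figure from the thread), the per-slave footprint of these BiMaps scales with the number of distinct IDs:

    mem per slave ≈ (|U| + |I|) × (avg ID length in bytes + per-entry overhead)

where |U| and |I| are the counts of distinct user and item IDs, and per-entry JVM overhead is often several tens of bytes. Tens of millions of long string IDs can therefore exhaust a 2g executor by themselves.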

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Ted Dunning
On Tue, Dec 23, 2014 at 9:16 AM, Pat Ferrel p...@occamsmachete.com wrote: To use the Hadoop MapReduce version (Ted’s suggestion) you’ll lose the cross-cooccurrence indicators and you’ll have to translate your IDs into Mahout IDs. This means mapping user and item IDs from your values into …
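A minimal sketch of that ID translation, assuming a simple userID,itemID CSV layout (file names and columns are hypothetical):

    # Assign contiguous integer IDs and keep the dictionaries so that
    # results can be translated back to the original IDs afterwards.
    awk -F',' '
      {
        if (!($1 in users)) users[$1] = nu++
        if (!($2 in items)) items[$2] = ni++
        print users[$1] "," items[$2]
      }
      END {
        for (u in users) print u "," users[u] > "user-dict.csv"
        for (i in items) print i "," items[i] > "item-dict.csv"
      }
    ' purchases.csv > purchases-mahout.csv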

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Pat Ferrel
First of all, you need to index the indicator matrix with a search engine. Then the query will be your user’s history. The search engine weights terms with TF-IDF, and the query is scored by cosine similarity of document to query terms. So the weights won’t be the ones you have below; they will be TF-IDF …
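A concrete sketch of this index-then-query flow, using Elasticsearch purely as an example (the index name, field name, and 1.x-style URLs are illustrative assumptions, not from the thread):

    # One document per item; the field holds that item's indicator list.
    curl -XPUT 'http://localhost:9200/indicators/item/iphone' \
      -d '{"purchase": "nexus ipad"}'
    curl -XPUT 'http://localhost:9200/indicators/item/surface' \
      -d '{"purchase": "nexus ipad galaxy"}'

    # Query with the user's history; TF-IDF scoring now supplies the
    # weights that --omitStrength dropped.
    curl -XGET 'http://localhost:9200/indicators/_search' \
      -d '{"query": {"match": {"purchase": "nexus"}}}'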

Re: 20 news groups example

2014-12-23 Thread 万代豊
Hi Jakub, To label the training data for Bayesian classification in Mahout, you simply place your training text files into folders with the desired labels as folder names. For example, in the case of the 20 newsgroups data, you would place your text into folders as follows: [hadoop@localhost …
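A sketch of the layout being described, with hypothetical source paths (the category names are from the standard 20 newsgroups set):

    # One folder per label; the folder name becomes the class label.
    mkdir -p 20news-train/alt.atheism 20news-train/comp.graphics 20news-train/sci.space
    cp atheism-posts/*.txt  20news-train/alt.atheism/
    cp graphics-posts/*.txt 20news-train/comp.graphics/
    cp space-posts/*.txt    20news-train/sci.space/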