Hi Pat Ferrel
I use the option --omitStrength to output indexable data, but this leads to less
accuracy when querying, because the similarity values between items are omitted.
Is there a way to keep these values in order to improve accuracy in a search engine?
On 23 December 2014 at 02:17, Pat Ferrel p...@occamsmachete.com
@Pat, thanks for your answers. It seems that I had cloned the snapshot
before the feature of configuring Spark was added. It works now in
local mode. Unfortunately, after trying the new snapshot and Spark,
submitting to the cluster in yarn-client mode raises the following error:
Exception in
@Pat, I am aware of your blog and of Ted's practical machine learning books
and webinars. I have learned a lot from you guys ;)
@Ted, it is a small 3-node cluster for a POC. The Spark executor is given 2g and
YARN is configured accordingly. I am trying to avoid Spark memory caching.
@Simon, I am using
On Tue, Dec 23, 2014 at 7:39 AM, AlShater, Hani halsha...@souq.com wrote:
@Ted, it is a small 3-node cluster for a POC. The Spark executor is given 2g and
YARN is configured accordingly. I am trying to avoid Spark memory caching.
Have you tried the map-reduce version?
Why do you say it will lead to less accuracy?
The weights are LLR weights, and they are used to filter and downsample the
indicator matrix. Once the downsampling is done, they are no longer needed. When you
index the indicators in a search engine they will get TF-IDF weights, and this
is a good effect.
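Pat's point can be made concrete. Below is a toy sketch of the log-likelihood ratio test (after Dunning, in the style of Mahout's `LogLikelihood` class) applied to a 2x2 cooccurrence table; the function names are illustrative, not Mahout's actual API:

```python
import math

def xlogx(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 cooccurrence table:
    k11 = both events, k12/k21 = only one event, k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

print(llr(10, 0, 0, 10))  # large: the two events always co-occur
print(llr(5, 5, 5, 5))    # ~0: the two events are independent
```

The scores are only compared against a cutoff to decide which indicators to keep per item; once an indicator survives the cut, the score itself can be discarded, which is why omitting the strengths costs nothing after downsampling.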
Both errors happen when the Spark context is created using YARN. I have no
experience with YARN, so I would try it in standalone clustered mode first.
Then, if all is well, check this page to make sure the Spark cluster is
configured correctly for YARN.
Thank you for your explanation
There is a situation that I'm not clear about. I have this item similarity
result:
iphone   nexus:1 ipad:10
surface  nexus:10 ipad:1 galaxy:1
If the LLR weights are omitted, then for a user A whose purchase history is
'nexus', which one should the recommendation engine prefer?
There is a large-ish data structure in the Spark version of this algorithm.
Each slave has a copy of several BiMaps that handle translation of your IDs
into and out of Mahout IDs. One of these is created for user IDs, and one for
each item ID set. For a single action that would be 2 BiMaps.
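The BiMap Pat mentions is just a bidirectional dictionary: one lookup from an external ID to a dense Mahout integer ID, and the inverse lookup back. A minimal Python analogue (the class name and API here are illustrative, not Mahout's actual Guava-backed class):

```python
class BiMap:
    """Bidirectional map assigning dense integer IDs to external IDs."""
    def __init__(self):
        self._to_int = {}   # external ID -> dense int ID
        self._to_ext = []   # dense int ID -> external ID

    def get_or_add(self, ext_id):
        """Return the dense ID for ext_id, assigning the next one if new."""
        if ext_id not in self._to_int:
            self._to_int[ext_id] = len(self._to_ext)
            self._to_ext.append(ext_id)
        return self._to_int[ext_id]

    def external(self, int_id):
        """Translate a dense ID back to the original external ID."""
        return self._to_ext[int_id]

users = BiMap()
print(users.get_or_add("u-9f3a"))  # 0: first user seen
print(users.get_or_add("u-77bc"))  # 1
print(users.external(0))           # u-9f3a
```

For a single action the job needs one such map for user IDs and one for item IDs, and every executor holds a copy, which is the memory footprint being pointed at here.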
On Tue, Dec 23, 2014 at 9:16 AM, Pat Ferrel p...@occamsmachete.com wrote:
To use the Hadoop MapReduce version (Ted's suggestion) you'll lose the
cross-cooccurrence indicators and you'll have to translate your IDs into
Mahout IDs. This means mapping user and item IDs from your values into
First of all you need to index that indicator matrix with a search engine. Then
the query will be your user's history. The search engine weights with TF-IDF,
and the query is based on cosine similarity of doc to query terms. So the
weights won't be the ones you have below; they will be TF-IDF weights.
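To make the earlier toy example concrete, here is a rough sketch of TF-IDF/cosine scoring over the two indicator docs with strengths omitted (binary term frequencies). The idf smoothing here is made up for illustration; a real engine uses its own Lucene formula, so the absolute numbers will differ:

```python
import math

# Indicator docs from the toy example, with LLR strengths omitted.
docs = {
    "iphone":  ["nexus", "ipad"],
    "surface": ["nexus", "ipad", "galaxy"],
}

def idf(term):
    df = sum(term in terms for terms in docs.values())
    return math.log(1.0 + len(docs) / df)  # smoothed idf (illustrative)

def score(query_terms):
    """Rank docs by dot product with each doc's unit TF-IDF vector.
    (The query-norm factor is constant across docs, so it is dropped.)"""
    results = {}
    for name, terms in docs.items():
        weights = {t: idf(t) for t in terms}          # binary tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        dot = sum(weights.get(t, 0.0) for t in query_terms)
        results[name] = dot / norm
    return results

print(score(["nexus"]))  # iphone outranks surface: same match, shorter doc
```

Under this sketch the query "nexus" matches both items, and length normalization favors the item with the shorter indicator list; with the LLR strengths kept, surface's much stronger nexus indicator would pull it up instead.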
Hi Jakub
To label the training data for Bayesian classification in Mahout, all you
do is place your text training files into folders with the desired labels
as folder names.
For example, in the case of the 20 Newsgroups dataset, you can place your
text into folders as follows:
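A minimal sketch of creating that layout; the label names follow the 20 Newsgroups convention of one folder per class, and the paths and file names are illustrative:

```python
import os
import tempfile

# Illustrative layout: <training-root>/<label>/<document files>
root = tempfile.mkdtemp(prefix="bayes-train-")
labels = ["comp.graphics", "rec.autos", "sci.space"]

for label in labels:
    os.makedirs(os.path.join(root, label), exist_ok=True)

# Each text file dropped into a label folder becomes a training
# example for that label.
with open(os.path.join(root, "sci.space", "doc1.txt"), "w") as f:
    f.write("the shuttle launch was delayed by weather")

print(sorted(os.listdir(root)))  # the three label folders
```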
[hadoop@localhost