A Dropbox link now:

https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0

And here is the script I use to test different sizes/partitions (for
example, 10 parts of 10k lines each):

#!/bin/sh
set -e -u

# Clean up local and HDFS output from previous runs (-f: ignore missing paths,
# so a fresh run doesn't abort under set -e)
mkdir -p ratings-split
rm -rf ratings-split/part*
hdfs dfs -rm -r -f ratings-split spark-itemsimilarity temp

# Take the first 100k lines of the ratings and split them into 10k-line parts
cat ~/input_ratings/part* 2>/dev/null | head -n100k | \
  split -l10k -d - "ratings-split/part-"

# Upload the parts and run the job, logging both stdout and stderr
hdfs dfs -mkdir -p ratings-split
hdfs dfs -copyFromLocal ratings-split/part-* ratings-split/
time mahout spark-itemsimilarity --input ratings-split/ --output spark-itemsimilarity \
  --maxSimilaritiesPerItem 10 --master yarn-client 2>&1 | tee spark-itemsimilarity.out
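As a sanity check on the split sizing (note the `-n100k`/`-l10k` multiplier
suffixes above are GNU coreutils extensions), the same 10-parts-of-10k layout
can be reproduced stand-alone with plain line counts; `sample.txt` and
`sample-split/` below are made-up names for illustration:

```shell
# Generate 100000 sample lines, then split into numbered 10000-line parts
seq 1 100000 > sample.txt
mkdir -p sample-split
split -l 10000 -d sample.txt sample-split/part-

ls sample-split/part-* | wc -l    # 10 parts: part-00 .. part-09
wc -l < sample-split/part-00      # 10000 lines in each part
```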

Thanks!

On Thu, 29 Sep 2016 19:46:03 +0200 Arnau Sanchez <pyar...@gmail.com> wrote:

> Hi Sebastian,
> 
> That's weird, it works here. Anyway, a Dropbox link:
> 
> https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
> 
> Thanks!
> 
> On Thu, 29 Sep 2016 18:50:23 +0200 Sebastian <s...@apache.org> wrote:
> 
> > Hi Arnau,
> > 
> > The links to your logfiles unfortunately don't work for me. Are you sure 
> > you set up Spark correctly? That can be a bit tricky in YARN settings; 
> > sometimes one machine just idles around...
> > 
> > Best,
> > Sebastian
> > 
> > On 25.09.2016 18:01, Pat Ferrel wrote:  
> > > AWS EMR is usually not very well suited for Spark. Spark gets most of 
> > > its speed from in-memory calculations, so to see speed gains you have to 
> > > have enough memory. Partitioning also helps in many cases. If you read 
> > > in data from a single file, that partitioning will usually follow the 
> > > calculation through all intermediate steps; with a single input file the 
> > > partition count may be 1, and therefore the job will only use one 
> > > machine. The most recent Mahout snapshot (and therefore the next 
> > > release) allows you to pass in the partitioning for each event pair 
> > > (this is only in the library use, not the CLI). To get this effect in 
> > > the current release, try splitting the input into multiple files.
> > >
> > > I’m probably the one that reported the 10x speedup; I used input from 
> > > Kafka DStreams, which causes very small default partition sizes. Other 
> > > comparisons for other calculations give a similar speedup result. There 
> > > is little question that Spark is much faster—when used the way it is 
> > > meant to be used.
> > >
> > > I use Mahout as a library all the time in the Universal Recommender 
> > > implemented in Apache PredictionIO. As a library we get greater control 
> > > than with the CLI. The CLI is really only a proof of concept, not meant 
> > > for production.
> > >
> > > BTW, the code behind spark-itemsimilarity has a significant algorithmic 
> > > benefit that is probably more important than the speed increase: 
> > > Correlated Cross-Occurrence, which allows the use of many indicators of 
> > > user taste, not just the primary/conversion event, which is all that any 
> > > other CF-style recommender I know of can use.
> > >
> > >
> > > On Sep 22, 2016, at 1:49 AM, Arnau Sanchez <pyar...@gmail.com> wrote:
> > >
> > > I've been using the Mahout itemsimilarity job for a while, with good 
> > > results. I read that the new spark-itemsimilarity job is typically 
> > > faster, by a factor of 10, so I wanted to give it a try. I must be doing 
> > > something wrong because, on the same EMR infrastructure and the same 
> > > data, the Spark job is slower than the old one (16 min vs 6 min). I 
> > > took a small sample dataset (766k rating pairs) to compare numbers; 
> > > this is the result:
> > >
> > > Input ratings: http://download.zaudera.com/public/ratings
> > >
> > > Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)
> > >
> > > Old itemsimilarity:
> > >
> > > $ mahout itemsimilarity --input ratings --output itemsimilarity 
> > > --booleanData TRUE --maxSimilaritiesPerItem 10 --similarityClassname 
> > > SIMILARITY_COOCCURRENCE
> > > [5m54s]
> > >
> > > (logs: http://download.zaudera.com/public/itemsimilarity.out)
> > >
> > > New spark-itemsimilarity:
> > >
> > > $ mahout spark-itemsimilarity --input ratings --output 
> > > spark-itemsimilarity --maxSimilaritiesPerItem 10 --master yarn-client
> > > [15m51s]
> > >
> > > (logs: http://download.zaudera.com/public/spark-itemsimilarity.out)
> > >
> > > Any ideas? Thanks!
> > >    
