Re: spark-itemsimilarity slower than itemsimilarity
Except for reading the input it now takes ~5 minutes to train.

On Sep 30, 2016, at 5:12 PM, Pat Ferrel wrote:
> [...]
Re: spark-itemsimilarity slower than itemsimilarity
Yeah, I bet Sebastian is right. I see no reason not to try running with --master local[4] or some number of cores on localhost. This will avoid all serialization. With times that low and small data there is no benefit to separate machines.

We are using this with ~1TB of data. Using Mahout as a lib we also set the partitioning to 4x the number of cores in the cluster with --conf spark.parallelism=384; on 3 very large machines it completes in 42 minutes. We also have 11 events, so 11 different input matrices. I’m testing a way to train on only what is used in the model creation, which is a way to take pairRDDs as input and filter out every interaction that is not from a user in the primary matrix. With 1TB this reduces the data to a very manageable number and reduces the size of the BiMaps dramatically (they require memory). It may not have as pronounced an effect on different data but will make the computation more efficient. With the Universal Recommender in the PredictionIO template we still use all user input in the queries, so you lose no precision and get real-time user interactions in your queries.

If you want a more modern recommender than the old Mahout MapReduce I’d strongly suggest you consider it, or at least use it as a model for building a recommender with Mahout. actionml.com/docs/ur

On Sep 30, 2016, at 10:39 AM, Sebastian wrote:
> [...]
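Concretely, Pat's suggestion amounts to swapping the yarn-client master for a local one. A minimal sketch, reusing only the flags already shown in this thread (local[N] is standard Spark master syntax; 4 stands in for the machine's core count, and the command assumes Mahout's bin is on PATH as elsewhere in the thread):

```shell
# Run the whole job in a single local JVM with 4 cores: no shuffle
# serialization across machines, no YARN scheduling overhead.
mahout spark-itemsimilarity \
  --input ratings \
  --output spark-itemsimilarity \
  --maxSimilaritiesPerItem 10 \
  --master 'local[4]'
```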
Re: spark-itemsimilarity slower than itemsimilarity
Hi Arnau,

I don't think that you can expect any speedups in your setup; your input data is way too small and I think you run only two concurrent tasks. Maybe you should try a larger sample of your data and more machines. At the moment, it seems to me that the overheads of running in a distributed setting (task scheduling, serialization...) totally dominate the computation.

Best,
Sebastian

On 30.09.2016 11:11, Arnau Sanchez wrote:
> [...]
Re: spark-itemsimilarity slower than itemsimilarity
Hi!

Here you go: "ratings-clean" contains only pairs of (user, product) for those products with 4 or more user interactions (770k -> 465k):

https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0

The results:

1 part of 465k: 3m41.361s
5 parts of 100k: 4m20.785s
24 parts of 20k: 10m44.375s
47 parts of 10k: 17m39.385s

On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian wrote:
> [...]
Re: spark-itemsimilarity slower than itemsimilarity
Hi Arnau,

I had a look at your ratings file and it's kind of strange. It's pretty tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out of these, only 50k have more than 3 interactions.

So I think the first thing that you should do is throw out all the items with so few interactions. Item similarity computations are pretty sensitive to the number of unique items; maybe that's why you don't see much difference in the run times.

-s

On 29.09.2016 22:17, Arnau Sanchez wrote:
> --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10
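Sebastian's pruning step (which produces something like the "ratings-clean" file posted earlier in the thread) can be sketched as a two-pass awk filter. This is a sketch under assumptions: it presumes one tab-separated user/item pair per line with the item in the second column, which may differ from the actual file's layout.

```shell
# Pass 1 (NR==FNR) counts interactions per item; pass 2 keeps only
# the lines whose item was seen at least 4 times.
awk -F'\t' 'NR==FNR { n[$2]++; next } n[$2] >= 4' ratings ratings > ratings-clean
```

Adjust -F and the column index to match the real delimiter and field order.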
Re: spark-itemsimilarity slower than itemsimilarity
A Dropbox link now:

https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0

And here is the script I use to test different sizes/partitions (example: 10 parts of 10k):

#!/bin/sh
set -e -u
mkdir -p ratings-split
rm -rf ratings-split/part*
hdfs dfs -rm -r ratings-split spark-itemsimilarity temp
cat ~/input_ratings/part* 2>/dev/null | head -n100k | split -l10k -d - "ratings-split/part-"
hdfs dfs -mkdir -p ratings-split
hdfs dfs -copyFromLocal ratings-split
time mahout spark-itemsimilarity --input ratings-split/ --output spark-itemsimilarity \
    --maxSimilaritiesPerItem 10 --master yarn-client |& tee spark-itemsimilarity.out

Thanks!

On Thu, 29 Sep 2016 19:46:03 +0200 Arnau Sanchez wrote:
> [...]
Re: spark-itemsimilarity slower than itemsimilarity
Hi Arnau,

The links to your logfiles don't work for me unfortunately. Are you sure you set up Spark correctly? That can be a bit tricky in YARN settings; sometimes one machine idles around...

Best,
Sebastian

On 25.09.2016 18:01, Pat Ferrel wrote:
> [...]
Re: spark-itemsimilarity slower than itemsimilarity
The scaling issues with EMR+Spark may explain the weird performance I am seeing with Mahout's spark-itemsimilarity. I compared the running times with different partitions: the more partitions I feed the job, the more parallel processes it creates in the nodes and the more RAM it uses (some 100GB in total), which looks promising... but the larger the elapsed time. With an input of 100k ratings:

1 part of 100k -> 1m50.080s
5 parts of 20k -> 2m03.376s
10 parts of 10k -> 2m48.722s
20 parts of 5k -> 4m22.107s

This also happens for bigger inputs. With >= 1M ratings I am just not able to finish the task, it's so slow.

> If you use the temp Spark strategy

We run an EMR cluster 24/7. With Spark we expected faster data processing so we could return more up-to-date recommendations.

Thanks for your hints Pat, much appreciated.

On Wed, 28 Sep 2016 15:23:27 -0700 Pat Ferrel wrote:
> [...]
Re: spark-itemsimilarity slower than itemsimilarity
The problem with EMR is that the Spark driver often needs to be as big as the executors, and that is not handled by EMR. EMR worked fine for Hadoop MapReduce because the driver usually did not have to be scaled vertically. I suppose you could say EMR would work, but it does not solve the whole scaling problem.

What we use at my consulting group is custom orchestration code written in Terraform that creates a driver of the same size as the Spark executors. We then install Spark using Docker and Docker Swarm. Since the driver machine needs to have your application on it, this will always require some custom installation. For storage HDFS is used and it needs to be permanent, whereas Spark may only be needed periodically. We save money by creating Spark executors and driver, using them to store to a permanent HDFS, then destroying the Spark driver and executors. This way model creation with spark-itemsimilarity, or some app that uses Mahout as a library, will only be paid for at some small duty cycle.

If you use the temp Spark strategy and create a model every week and need 4 r3.8xlarge to do it in 1 hour, you only pay 1/168th of what you would for a permanent cluster. This brings the cost to a quite reasonable range. You are very unlikely to need machines that large anyway, but you could afford it if you only pay for the time they are actually used.

On Sep 26, 2016, at 12:30 AM, Arnau Sanchez wrote:
> [...]
Re: spark-itemsimilarity slower than itemsimilarity
On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel wrote:

> AWS EMR is usually not very well suited for Spark.

What infrastructure would you recommend? Some EC2 instances provide lots of memory (though maybe not with the most competitive price: r3.8xlarge, 244GB RAM).

My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15GB), 2 CORE r3.xlarge (30.5GB), 2 TASK c4.xlarge (7.5GB).

> If the data is from a single file the partition may be 1 and therefore it will only use one machine.

Indeed, I experienced that also for MR itemsimilarity; it yielded different times -and results- for different partitions. I'll do more tests on that.

> The CLI is really only a proof of concept, not really meant for production.

Noted.

> BTW there is a significant algorithm benefit of the code behind spark-itemsimilarity that is probably more important than the speed increase and that is Correlated Cross-Occurrence

Great! I have yet to compare improvements in the recommendations themselves; I'll keep this in mind.

Thanks for your help.
Re: spark-itemsimilarity slower than itemsimilarity
AWS EMR is usually not very well suited for Spark. Spark gets most of its speed from in-memory calculations, so to see speed gains you have to have enough memory. Partitioning will also help in many cases. If you read in data from a single file, that partitioning will usually follow the calculation throughout all intermediate steps. If the data is from a single file the partition may be 1, and therefore it will only use one machine. The most recent Mahout snapshot (therefore the next release) allows you to pass in the partitioning for each event pair (this is only in the library use, not the CLI). To get this effect in the current release, try splitting the input into multiple files.

I'm probably the one that reported the 10x speedup, and I used input from Kafka DStreams, which causes very small default partition sizes. Other comparisons for other calculations give a similar speedup result. There is little question about Spark being much faster, when used the way it is meant to be.

I use Mahout as a library all the time in the Universal Recommender implemented in Apache PredictionIO. As a library we get greater control than the CLI. The CLI is really only a proof of concept, not really meant for production.

BTW there is a significant algorithm benefit of the code behind spark-itemsimilarity that is probably more important than the speed increase, and that is Correlated Cross-Occurrence, which allows the use of many indicators of user taste, not just the primary/conversion event, which is all any other CF-style recommender that I know of can use.

On Sep 22, 2016, at 1:49 AM, Arnau Sanchez wrote:
> [...]
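Pat's "split the input into multiple files" workaround can be sketched with coreutils split. A hypothetical sketch: the file names and the choice of 8 parts are illustrative, not from the thread.

```shell
# Split one ratings file into ~8 equal parts; Spark will then read the
# directory as 8 input splits (and so 8 partitions) instead of 1.
mkdir -p ratings-parts
LINES=$(wc -l < ratings)
split -l "$(( (LINES + 7) / 8 ))" -d ratings ratings-parts/part-
```

Point --input at the directory (ratings-parts/) rather than the single file.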
spark-itemsimilarity slower than itemsimilarity
I've been using the Mahout itemsimilarity job for a while, with good results. I read that the new spark-itemsimilarity job is typically faster, by a factor of 10, so I wanted to give it a try. I must be doing something wrong because, with the same EMR infrastructure, the Spark job is slower than the old one (16 min vs 6 min) working on the same data. I took a small sample dataset (766k rating pairs) to compare numbers; this is the result:

Input ratings: http://download.zaudera.com/public/ratings

Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)

Old itemsimilarity:

$ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
[5m54s]

(logs: http://download.zaudera.com/public/itemsimilarity.out)

New spark-itemsimilarity:

$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10 --master yarn-client
[15m51s]

(logs: http://download.zaudera.com/public/spark-itemsimilarity.out)

Any ideas? Thanks!