increasing concurrency of saveAsNewAPIHadoopFile?

2014-06-19 Thread Sandeep Parikh
I'm trying to write a JavaPairRDD to a downstream database using
saveAsNewAPIHadoopFile with a custom OutputFormat and the process is pretty
slow.

Is there a way to boost the concurrency of the save process? For example,
something like splitting the RDD into multiple smaller RDDs and using Java
threads to write the data out? That seems foreign to the way Spark works, so
I'm not sure if there's a better way.
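
One common way to raise the save concurrency, rather than splitting the RDD by
hand and managing threads, is to repartition it so that more tasks run the
OutputFormat in parallel. A minimal self-contained sketch under assumptions
not taken from this thread: TextOutputFormat stands in for the custom
OutputFormat, the path and partition count are illustrative, and repartition()
assumes Spark 1.0 (on 0.9.x, coalesce(n, true) plays the same role):

import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RepartitionedSave {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[4]", "repartitioned-save");

    // Tiny stand-in for the real dataset being written downstream.
    JavaPairRDD<Text, Text> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<Text, Text>(new Text("k1"), new Text("v1")),
        new Tuple2<Text, Text>(new Text("k2"), new Text("v2"))));

    // More partitions means more save tasks writing concurrently; each
    // partition gets its own OutputFormat record writer.
    JavaPairRDD<Text, Text> spread = pairs.repartition(64);

    // TextOutputFormat is a placeholder for the custom OutputFormat.
    spread.saveAsNewAPIHadoopFile("/tmp/repartitioned-save",
        Text.class, Text.class, TextOutputFormat.class);

    sc.stop();
  }
}

Each partition then becomes one concurrent write, so the effective parallelism
is bounded by the partition count and the number of executor cores.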


getting started with mllib.recommendation.ALS

2014-06-10 Thread Sandeep Parikh
Question on the input and output for ALS.train() and
MatrixFactorizationModel.predict().

My input is a list of Ratings(user_id, product_id, rating) and my ratings are
on a scale of 1-5 (inclusive). When I compute predictions over the
superset of all (user_id, product_id) pairs, the ratings produced are on a
different scale.

The question is this: do I need to normalize the data coming out of
predict() to my own scale or does the input need to be different?

Thanks!


Re: getting started with mllib.recommendation.ALS

2014-06-10 Thread Sandeep Parikh
Thanks Sean. I realized that I was supplying train() with a very low rank
so I will retry with something higher and then play with lambda as-needed.
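
For reference, a minimal sketch of that kind of retry, assuming the standard
ALS.train(ratings, rank, iterations, lambda) overload; the rank, iteration
count, lambda, and the tiny dataset below are illustrative, not taken from
this thread:

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class RetrainALS {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "als-retrain");

    // Stand-in for the real explicit 1-5 ratings.
    JavaRDD<Rating> ratings = sc.parallelize(Arrays.asList(
        new Rating(1, 10, 5.0), new Rating(1, 20, 1.0), new Rating(2, 10, 4.0)));

    int rank = 50;        // higher rank than before (illustrative)
    int iterations = 10;
    double lambda = 0.01; // tune up or down as needed

    MatrixFactorizationModel model =
        ALS.train(ratings.rdd(), rank, iterations, lambda);

    sc.stop();
  }
}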


On Tue, Jun 10, 2014 at 4:58 PM, Sean Owen so...@cloudera.com wrote:

 For trainImplicit(), the output is an approximation of a matrix of 0s
 and 1s, so the values are generally (not always) in [0,1]

 But for train(), you should be predicting the original input matrix
 as-is, as I understand. You should get output in about the same range
 as the input but again not necessarily 1-5. If it's really different,
 you could be underfitting. Try less lambda, more features?

 On Tue, Jun 10, 2014 at 4:59 PM, Sandeep Parikh sand...@clusterbeep.org
 wrote:
  Question on the input and output for ALS.train() and
  MatrixFactorizationModel.predict().
 
  My input is a list of Ratings(user_id, product_id, rating) and my ratings
  are on a scale of 1-5 (inclusive). When I compute predictions over the
  superset of all (user_id, product_id) pairs, the ratings produced are on a
  different scale.
 
  The question is this: do I need to normalize the data coming out of
  predict() to my own scale or does the input need to be different?
 
  Thanks!
 



Re: Java RDD structure for Matrix predict?

2014-05-28 Thread Sandeep Parikh
Wisely, is mapToPair in Spark 0.9.1 or 1.0? I'm running the former and
didn't see that method available.

I think the issue is that predict() is expecting an RDD containing a tuple
of ints and not Integers. So if I use JavaPairRDD<Object,Object> with my
original code snippet, things seem to at least compile for now.
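
For reference, a sketch of that workaround: typing the pairs as Object so that
usersProducts.rdd() lines up with predict(RDD[(Int, Int)]), whose (Int, Int)
erases to Tuple2<Object, Object> on the Java side. It assumes mapToPair from
Spark 1.0 (on 0.9.1 the plain map with a PairFunction plays the same role),
and the training data and ALS parameters are only illustrative:

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import scala.Tuple2;

public class PredictWithObjectPairs {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext("local[2]", "predict-sketch");

    JavaRDD<Rating> testRatings = sc.parallelize(Arrays.asList(
        new Rating(1, 10, 5.0), new Rating(1, 20, 1.0), new Rating(2, 10, 4.0)));

    MatrixFactorizationModel model = ALS.train(testRatings.rdd(), 10, 10, 0.01);

    // Object-typed pairs: boxed Integers are Objects, so rdd() produces an
    // RDD<Tuple2<Object, Object>> that matches predict's erased signature.
    JavaPairRDD<Object, Object> usersProducts = testRatings.mapToPair(
        new PairFunction<Rating, Object, Object>() {
          public Tuple2<Object, Object> call(Rating r) {
            return new Tuple2<Object, Object>(r.user(), r.product());
          }
        });

    org.apache.spark.rdd.RDD<Rating> predictions = model.predict(usersProducts.rdd());
    System.out.println("predicted " + predictions.count() + " ratings");

    sc.stop();
  }
}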


On Tue, May 27, 2014 at 6:40 PM, giive chen thegi...@gmail.com wrote:

 Hi Sandeep

 I think you should use  testRatings.mapToPair instead of
 testRatings.map.

 So the code should be


 JavaPairRDD<Integer, Integer> usersProducts = training.mapToPair(
     new PairFunction<Rating, Integer, Integer>() {
         public Tuple2<Integer, Integer> call(Rating r) throws Exception {
             return new Tuple2<Integer, Integer>(r.user(), r.product());
         }
     }
 );

 It works on my side.


 Wisely Chen


 On Wed, May 28, 2014 at 6:27 AM, Sandeep Parikh sand...@clusterbeep.org
 wrote:

 I've got a trained MatrixFactorizationModel via ALS.train(...) and now
 I'm trying to use it to predict some ratings like so:

 JavaRDD<Rating> predictions = model.predict(usersProducts.rdd())

 Where usersProducts is built from an existing Ratings dataset like so:

 JavaPairRDD<Integer, Integer> usersProducts = testRatings.map(
   new PairFunction<Rating, Integer, Integer>() {
     public Tuple2<Integer, Integer> call(Rating r) throws Exception {
       return new Tuple2<Integer, Integer>(r.user(), r.product());
     }
   }
 );

 The problem is that model.predict(...) doesn't like usersProducts,
 claiming that the method doesn't accept an RDD of type Tuple2; however, the
 docs show the method signature as follows:

 def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating]

 Am I missing something? The JavaRDD is just a list of Tuple2 elements,
 which would match the method signature, but the compiler is complaining.

 Thanks!





Java RDD structure for Matrix predict?

2014-05-27 Thread Sandeep Parikh
I've got a trained MatrixFactorizationModel via ALS.train(...) and now I'm
trying to use it to predict some ratings like so:

JavaRDD<Rating> predictions = model.predict(usersProducts.rdd())

Where usersProducts is built from an existing Ratings dataset like so:

JavaPairRDD<Integer, Integer> usersProducts = testRatings.map(
  new PairFunction<Rating, Integer, Integer>() {
    public Tuple2<Integer, Integer> call(Rating r) throws Exception {
      return new Tuple2<Integer, Integer>(r.user(), r.product());
    }
  }
);

The problem is that model.predict(...) doesn't like usersProducts, claiming
that the method doesn't accept an RDD of type Tuple2; however, the docs show
the method signature as follows:

def predict(usersProducts: RDD[(Int, Int)]): RDD[Rating]

Am I missing something? The JavaRDD is just a list of Tuple2 elements,
which would match the method signature, but the compiler is complaining.

Thanks!