Re: [EXTERNAL] - Re: Spark ML / ALS question

2020-12-03 Thread Steve Pruitt
Thanks, I confused myself.  I was looking at the
org.apache.spark.ml.recommendation.ALS Javadoc.  Not sure why train() shows up
there.  I didn't notice the Developer API tag, so "fit" it is!

-S

From: Sean Owen 
Sent: Wednesday, December 2, 2020 3:51 PM
To: Steve Pruitt 
Cc: user@spark.apache.org 
Subject: [EXTERNAL] - Re: Spark ML / ALS question

There is only a fit() method in spark.ml's ALS:
http://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/recommendation/ALS.html

The older spark.mllib interface has a train() method. You'd generally use the
spark.ml version.
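
(A minimal PySpark sketch of the difference; the toy ratings below are invented
purely to show the two call shapes:)

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS                           # spark.ml: DataFrame API
from pyspark.mllib.recommendation import ALS as MllibALS, Rating    # older RDD API

spark = SparkSession.builder.getOrCreate()
ratings = spark.createDataFrame(
    [(0, 1, 4.0), (0, 2, 1.0), (1, 1, 3.0)], ["user", "item", "rating"])

# spark.ml ALS is an Estimator: you configure it, then call fit()
ml_model = ALS(userCol="user", itemCol="item", ratingCol="rating").fit(ratings)

# spark.mllib ALS is trained through the static train() method on an RDD of Ratings
rdd = ratings.rdd.map(lambda r: Rating(r["user"], r["item"], r["rating"]))
mllib_model = MllibALS.train(rdd, rank=2, iterations=5)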

On Wed, Dec 2, 2020 at 2:13 PM Steve Pruitt  
wrote:
I am having a little difficulty finding information on the ALS train(…) method
in spark.ml.  It's unclear when to use it.  In the Javadoc, the parameters are
undocumented.

What is the difference between train(...) and fit(...)?  When would you use one
or the other?


-S



Spark ML / ALS question

2020-12-02 Thread Steve Pruitt
I am having a little difficulty finding information on the ALS train(…) method
in spark.ml.  It's unclear when to use it.  In the Javadoc, the parameters are
undocumented.

What is the difference between train(...) and fit(...)?  When would you use one
or the other?


-S



RE: [EXTERNAL] - Re: Problem with the ML ALS algorithm

2019-06-26 Thread Steve Pruitt
I should have mentioned this is a synthetic dataset I created using some
likelihood distributions over the rating values.  I am only experimenting /
learning.  In practice, though, the number of items is likely to be at least in
the 10's if not 100's.  Are even those item counts too low?

Thanks.

-S

From: Nick Pentreath 
Sent: Wednesday, June 26, 2019 9:09 AM
To: user@spark.apache.org
Subject: Re: [EXTERNAL] - Re: Problem with the ML ALS algorithm

If the number of items is indeed 4, then another issue is that the rank of the
factors defaults to 10. Setting the "rank" parameter < 4 will help.

However, if you only have 4 items, then I would propose that using ALS (or any
recommendation model, in fact) is not really necessary. There is not enough
information, or sparsity, for collaborative filtering to be useful: you could
simply recommend all items a user has not rated, and the result would be
essentially the same.
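
(A hedged PySpark sketch of that suggestion, with parameter names taken from the
Java snippet quoted below; rank=3 is just one illustrative value below the item
count of 4:)

from pyspark.ml.recommendation import ALS

als = ALS(
    maxIter=5,
    rank=3,          # below the number of items (4), per the advice above
    regParam=0.1,    # the stronger regularization suggested earlier in the thread
    userCol="customer",
    itemCol="item",
    implicitPrefs=True,
    ratingCol="rating")
model = als.fit(training)   # 'training' is the same DataFrame as in the quoted code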


On Wed, Jun 26, 2019 at 3:03 PM Steve Pruitt <bpru...@opentext.com> wrote:
Number of users is 1055
Number of items is 4
Rating values are either 120, 20, or 0


From: Nick Pentreath <nick.pentre...@gmail.com>
Sent: Wednesday, June 26, 2019 6:03 AM
To: user@spark.apache.org
Subject: [EXTERNAL] - Re: Problem with the ML ALS algorithm

This means that the matrix that ALS is trying to factor is not positive 
definite. Try increasing regParam (try 0.1, 1.0 for example).

What does the data look like? e.g. number of users, number of items, number of 
ratings, etc?
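
(A hedged sketch of the regParam experiment in PySpark, assuming the 'training'
DataFrame and column names from the quoted Java code below; it simply retries
the fit with the suggested values until one succeeds:)

from pyspark.ml.recommendation import ALS

for reg in (0.01, 0.1, 1.0):
    als = ALS(maxIter=5, regParam=reg, userCol="customer",
              itemCol="item", implicitPrefs=True, ratingCol="rating")
    try:
        model = als.fit(training)
        print("fit succeeded with regParam =", reg)
        break
    except Exception as e:   # e.g. the SingularMatrixException quoted below
        print("fit failed with regParam =", reg, ":", e)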

On Wed, Jun 26, 2019 at 12:06 AM Steve Pruitt <bpru...@opentext.com> wrote:
I get an inexplicable exception when trying to build an ALSModel with
implicitPrefs set to true.  I can’t find any help online.

Thanks in advance.

My code is:

ALS als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("customer")
.setItemCol("item")
.setImplicitPrefs(true)
.setRatingCol("rating");
ALSModel model = als.fit(training);

The exception is:
org.apache.spark.ml.optim.SingularMatrixException: LAPACK.dppsv returned 6 
because A is not positive definite. Is A derived from a singular matrix (e.g. 
collinear column values)?
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.checkReturnValue(CholeskyDecomposition.scala:65)
 ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:41)
 ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
at 
org.apache.spark.ml.recommendation.ALS$CholeskySolver.solve(ALS.scala:747) 
~[spark-mllib_2.11-2.3.1.jar:2.3.1]


RE: [EXTERNAL] - Re: Problem with the ML ALS algorithm

2019-06-26 Thread Steve Pruitt
Number of users is 1055
Number of items is 4
Rating values are either 120, 20, or 0


From: Nick Pentreath 
Sent: Wednesday, June 26, 2019 6:03 AM
To: user@spark.apache.org
Subject: [EXTERNAL] - Re: Problem with the ML ALS algorithm

This means that the matrix that ALS is trying to factor is not positive 
definite. Try increasing regParam (try 0.1, 1.0 for example).

What does the data look like? e.g. number of users, number of items, number of 
ratings, etc?

On Wed, Jun 26, 2019 at 12:06 AM Steve Pruitt <bpru...@opentext.com> wrote:
I get an inexplicable exception when trying to build an ALSModel with
implicitPrefs set to true.  I can’t find any help online.

Thanks in advance.

My code is:

ALS als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("customer")
.setItemCol("item")
.setImplicitPrefs(true)
.setRatingCol("rating");
ALSModel model = als.fit(training);

The exception is:
org.apache.spark.ml.optim.SingularMatrixException: LAPACK.dppsv returned 6 
because A is not positive definite. Is A derived from a singular matrix (e.g. 
collinear column values)?
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.checkReturnValue(CholeskyDecomposition.scala:65)
 ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:41)
 ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
at 
org.apache.spark.ml.recommendation.ALS$CholeskySolver.solve(ALS.scala:747) 
~[spark-mllib_2.11-2.3.1.jar:2.3.1]


Problem with the ML ALS algorithm

2019-06-25 Thread Steve Pruitt
I get an inexplicable exception when trying to build an ALSModel with
implicitPrefs set to true.  I can’t find any help online.

Thanks in advance.

My code is:

ALS als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("customer")
.setItemCol("item")
.setImplicitPrefs(true)
.setRatingCol("rating");
ALSModel model = als.fit(training);

The exception is:
org.apache.spark.ml.optim.SingularMatrixException: LAPACK.dppsv returned 6 
because A is not positive definite. Is A derived from a singular matrix (e.g. 
collinear column values)?
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.checkReturnValue(CholeskyDecomposition.scala:65)
 ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:41)
 ~[spark-mllib_2.11-2.3.1.jar:2.3.1]
at 
org.apache.spark.ml.recommendation.ALS$CholeskySolver.solve(ALS.scala:747) 
~[spark-mllib_2.11-2.3.1.jar:2.3.1]


[Spark ML] [Pyspark] [Scenario Beginner] [Level Beginner]

2019-04-02 Thread Steve Pruitt
I am still struggling with getting fit() to work on my dataset.
The Spark ML exception at issue is:

LAPACK.dppsv returned 6 because A is not positive definite. Is A derived from a 
singular matrix (e.g. collinear column values)?

Comparing my standardized Weight values with the tutorial's, I see I have some
negative values; the tutorial's values are all positive.  The exception message
above mentions "not positive definite", so this is probably my issue.

The calculation for standardizing my Weight values, (Weight - Weight_Mean) /
Weight_StdDev, produces negative values whenever a Weight (which can be between
1 and 72000) falls below the mean; for example, (12 - 597.72) / 346.48 ≈ -1.69.
I have a suggestion to try using MinMaxScaler, but it operates on a Vector and
I have a single value.  Not sure I see how to make this work.

My stats knowledge is very rusty.  Is there a way to get only positive values
when standardizing something like my Weight values above?

Thanks.

-S
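
(For illustration, a hedged sketch of how the MinMaxScaler suggestion can work
on a single column: wrap the scalar in a length-1 vector with VectorAssembler
first.  Column names follow the dataset quoted below; on Spark 2.3 a small UDF
is needed to unpack the result:)

from pyspark.ml.feature import MinMaxScaler, VectorAssembler
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# MinMaxScaler expects a Vector column, so wrap the scalar Weight first
assembler = VectorAssembler(inputCols=["Weight"], outputCol="weight_vec")
assembled = assembler.transform(df)

# Rescale to [0, 1], which keeps every value non-negative
scaler = MinMaxScaler(inputCol="weight_vec", outputCol="weight_scaled_vec")
scaled = scaler.fit(assembled).transform(assembled)

# Pull the scalar back out of the length-1 vector for use as an ALS rating column
first = udf(lambda v: float(v[0]), DoubleType())
scaled = scaled.withColumn("weight_scaled", first("weight_scaled_vec"))

(Since Weight is bounded by the video length anyway, simply dividing it by the
maximum would achieve much the same thing.)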



From: Steve Pruitt 
Sent: Monday, April 01, 2019 12:39 PM
To: user 
Subject: [EXTERNAL] - [Spark ML] [Pyspark] [Scenario Beginner] [Level Beginner]

After following a tutorial on recommender systems using Pyspark / Spark ML, I
decided to jump in with my own dataset.  I am specifically trying to predict
video suggestions based on an implicit feature: the time a video was watched.
I wrote a generator to produce my dataset.  I have a total of five videos, each
1200 seconds in length.  I randomly selected which videos a user watched and a
random watch time between 0 and 1200.  I generated 10k records.  Weight is the
time-watched feature.  It looks like this:

UserId,VideoId,Weight
0,1,645
0,2,870
0,3,1075
0,4,486
0,5,900
1,1,353
1,2,988
1,3,152
1,4,953
1,5,641
2,3,12
2,4,444
2,5,87
3,2,658
3,4,270
3,5,530
4,2,722
4,3,255
:

After reading the dataset, I convert all columns to Integer in place.
Describing Weight produces:

   summary            Weight
0  count              30136
1  mean               597.717945314574
2  stddev             346.475684454489
3  min                0
4  max                1200

Next, I standardized the weight column by:

df = (dataset
      .select(mean('Weight').alias('mean_weight'),
              stddev('Weight').alias('stddev_weight'))
      .crossJoin(dataset)
      .withColumn('weight_scaled',
                  (col('Weight') - col('mean_weight')) / col('stddev_weight')))

df.toPandas().head() shows:

    mean_weight  stddev_weight  UserId  VideoId  Weight  weight_scaled
0    597.717945     346.475684       0        1     645       0.136466
1    597.717945     346.475684       0        2     870       0.785862
2    597.717945     346.475684       0        3    1075       1.377534
3    597.717945     346.475684       0        4     486      -0.322441
4    597.717945     346.475684       0        5     900       0.872448
:
10   597.717945     346.475684       2        3      12      -1.690502
11   597.717945     346.475684       2        4     444      -0.443662
12   597.717945     346.475684       2        5      87      -1.474037
:

After splitting df 80 / 20 into training / testing sets, I defined the ALS
algorithm with:

als = ALS(maxIter=10, regParam=0.1, userCol='UserId', itemCol='VideoId',
          implicitPrefs=True, ratingCol='weight_scaled', coldStartStrategy='drop')

and then

model = als.fit(trainingData)

Calling fit() is where I get the following error, which I don't understand.

Py4JJavaError Traceback (most recent call last)
<ipython-input-...> in <module>()
> 1 model = als.fit(trainingData)

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\pyspark\ml\base.py in fit(self, 
dataset, params)
130 return self.copy(params)._fit(dataset)
131 else:
--> 132 return self._fit(dataset)
133 else:
134 raise ValueError("Params must be either a param map or a 
list/tuple of param maps, "

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\pyspark\ml\wrapper.py in 
_fit(self, dataset)
286
287 def _fit(self, dataset):
--> 288 java_model = self._fit_java(dataset)
289 model = self._create_model(java_model)
290 return self._copyValues(model)

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\pyspark\ml\wrapper.py in 
_fit_java(self, dataset)
283 """
284 self._transfer_params_to_java()
--> 285 return self._java_obj.fit(dataset._jdf)
286
287 def _fit(self, dataset):

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\lib\py4j-0.10.6-src.zip\py4j\java_gateway.py
 in __call__(self, *args)
   1158 answer = self.gateway_client.send_command(command)
   1159 return_value = get_return_value(
-> 1160 answer, self.gateway_client, self.target_id, self.name)
   1161
   1162 for temp_arg in temp_args:

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\utils.py in 
deco(*a, **kw)
61 def deco(*a, **kw):
 62 try:
---> 63   

[Spark ML] [Pyspark] [Scenario Beginner] [Level Beginner]

2019-04-01 Thread Steve Pruitt
After following a tutorial on recommender systems using Pyspark / Spark ML, I
decided to jump in with my own dataset.  I am specifically trying to predict
video suggestions based on an implicit feature: the time a video was watched.
I wrote a generator to produce my dataset.  I have a total of five videos, each
1200 seconds in length.  I randomly selected which videos a user watched and a
random watch time between 0 and 1200.  I generated 10k records.  Weight is the
time-watched feature.  It looks like this:

UserId,VideoId,Weight
0,1,645
0,2,870
0,3,1075
0,4,486
0,5,900
1,1,353
1,2,988
1,3,152
1,4,953
1,5,641
2,3,12
2,4,444
2,5,87
3,2,658
3,4,270
3,5,530
4,2,722
4,3,255
:

After reading the dataset, I convert all columns to Integer in place.
Describing Weight produces:

   summary            Weight
0  count              30136
1  mean               597.717945314574
2  stddev             346.475684454489
3  min                0
4  max                1200

Next, I standardized the weight column by:

df = (dataset
      .select(mean('Weight').alias('mean_weight'),
              stddev('Weight').alias('stddev_weight'))
      .crossJoin(dataset)
      .withColumn('weight_scaled',
                  (col('Weight') - col('mean_weight')) / col('stddev_weight')))

df.toPandas().head() shows:

    mean_weight  stddev_weight  UserId  VideoId  Weight  weight_scaled
0    597.717945     346.475684       0        1     645       0.136466
1    597.717945     346.475684       0        2     870       0.785862
2    597.717945     346.475684       0        3    1075       1.377534
3    597.717945     346.475684       0        4     486      -0.322441
4    597.717945     346.475684       0        5     900       0.872448
:
10   597.717945     346.475684       2        3      12      -1.690502
11   597.717945     346.475684       2        4     444      -0.443662
12   597.717945     346.475684       2        5      87      -1.474037
:

After splitting df 80 / 20 into training / testing sets, I defined the ALS
algorithm with:

als = ALS(maxIter=10, regParam=0.1, userCol='UserId', itemCol='VideoId',
          implicitPrefs=True, ratingCol='weight_scaled', coldStartStrategy='drop')

and then

model = als.fit(trainingData)

Calling fit() is where I get the following error, which I don't understand.

Py4JJavaError Traceback (most recent call last)
<ipython-input-...> in <module>()
> 1 model = als.fit(trainingData)

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\pyspark\ml\base.py in fit(self, 
dataset, params)
130 return self.copy(params)._fit(dataset)
131 else:
--> 132 return self._fit(dataset)
133 else:
134 raise ValueError("Params must be either a param map or a 
list/tuple of param maps, "

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\pyspark\ml\wrapper.py in 
_fit(self, dataset)
286
287 def _fit(self, dataset):
--> 288 java_model = self._fit_java(dataset)
289 model = self._create_model(java_model)
290 return self._copyValues(model)

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\pyspark\ml\wrapper.py in 
_fit_java(self, dataset)
283 """
284 self._transfer_params_to_java()
--> 285 return self._java_obj.fit(dataset._jdf)
286
287 def _fit(self, dataset):

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\lib\py4j-0.10.6-src.zip\py4j\java_gateway.py
 in __call__(self, *args)
   1158 answer = self.gateway_client.send_command(command)
   1159 return_value = get_return_value(
-> 1160 answer, self.gateway_client, self.target_id, self.name)
   1161
   1162 for temp_arg in temp_args:

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\pyspark\sql\utils.py in 
deco(*a, **kw)
61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

C:\Executables\spark-2.3.0-bin-hadoop2.7\python\lib\py4j-0.10.6-src.zip\py4j\protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
318 raise Py4JJavaError(
319 "An error occurred while calling {0}{1}{2}.\n".
--> 320 format(target_id, ".", name), value)
321 else:
322 raise Py4JError(

Py4JJavaError: An error occurred while calling o211.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
stage 61.0 failed 1 times, most recent failure: Lost task 5.0 in stage 61.0 
(TID 179, localhost, executor driver): 
org.apache.spark.ml.optim.SingularMatrixException: LAPACK.dppsv returned 6 
because A is not positive definite. Is A derived from a singular matrix (e.g. 
collinear column values)?
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.checkReturnValue(CholeskyDecomposition.scala:65)
at 

RE: [EXTERNAL] - Re: testing frameworks

2018-05-22 Thread Steve Pruitt
Something more along the lines of integration, I believe: run one or more Spark
jobs and verify the output results, if that makes sense.

I am very new to the world of Spark.  We want to include pipeline testing from
the get-go.  I will check out spark-testing-base.


Thanks.
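
(A minimal, hypothetical sketch of that kind of pipeline test using plain
pyspark and unittest; a local-mode SparkSession keeps it runnable in a CI tool.
top_videos is an invented stand-in for a real job:)

import unittest
from pyspark.sql import SparkSession

def top_videos(df):
    # The job under test: total watch time per video, highest first
    return (df.groupBy("VideoId").sum("Weight")
              .withColumnRenamed("sum(Weight)", "total_weight")
              .orderBy("total_weight", ascending=False))

class TopVideosTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = (SparkSession.builder
                     .master("local[2]").appName("pipeline-test").getOrCreate())

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_orders_by_total_weight(self):
        df = self.spark.createDataFrame(
            [(0, 1, 645), (0, 2, 870), (1, 2, 988)],
            ["UserId", "VideoId", "Weight"])
        result = top_videos(df).collect()
        self.assertEqual(result[0]["VideoId"], 2)           # 870 + 988 = 1858
        self.assertEqual(result[0]["total_weight"], 1858)   # beats video 1's 645

if __name__ == "__main__":
    unittest.main()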

From: Holden Karau [mailto:hol...@pigscanfly.ca]
Sent: Monday, May 21, 2018 11:32 AM
To: Steve Pruitt <bpru...@opentext.com>
Cc: user@spark.apache.org
Subject: [EXTERNAL] - Re: testing frameworks

So I’m biased as the author of spark-testing-base but I think it’s pretty ok. 
Are you looking for unit or integration or something else?

On Mon, May 21, 2018 at 5:24 AM Steve Pruitt <bpru...@opentext.com> wrote:
Hi,

Can anyone recommend testing frameworks suitable for Spark jobs?  Something
that can be integrated into a CI tool would be great.

Thanks.

--
Twitter: https://twitter.com/holdenkarau


testing frameworks

2018-05-21 Thread Steve Pruitt
Hi,

Can anyone recommend testing frameworks suitable for Spark jobs?  Something
that can be integrated into a CI tool would be great.

Thanks.