Re: Re: how to call recommend method from ml.recommendation.ALS

2017-03-15 Thread Yuhao Yang
This is something that was just added to ML and will probably be released
with 2.2. For now you can copy the code from the master branch:
https://github.com/apache/spark/blob/70f9d7f71c63d2b1fdfed75cb7a59285c272a62b/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L352
and give it a try.
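
As a rough sketch of what the usage looks like with the master branch code (the
column names and variable names below are only placeholders; the recommendForAll*
methods are the ones added in the linked file):

  import org.apache.spark.ml.recommendation.ALS

  // ratings: DataFrame with columns "userId", "itemId", "rating"
  val als = new ALS()
    .setUserCol("userId")
    .setItemCol("itemId")
    .setRatingCol("rating")
  val model = als.fit(ratings)

  // top-10 recommendations per user, and per item
  val userRecs = model.recommendForAllUsers(10)
  val itemRecs = model.recommendForAllItems(10)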

Yuhao

2017-03-15 21:39 GMT-07:00 lk_spark :

> thanks for your reply. What I exactly want to know is:
> in package mllib.recommendation, MatrixFactorizationModel has methods
> like recommendProducts, but I didn't find an equivalent in package ml.recommendation.
> How can I do the same thing with ml as I do with mllib?
> 2017-03-16
> --
> lk_spark
> --
>
> *From:* 任弘迪
> *Sent:* 2017-03-16 10:46
> *Subject:* Re: how to call recommend method from ml.recommendation.ALS
> *To:* "lk_spark"
> *Cc:* "user.spark"
>
> If the number of user-item pairs to predict isn't too large, say millions,
> you could transform the target DataFrame and save the result to a Hive
> table, then build a cache based on that table for online services.
>
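> For the first case, a rough sketch (the model, DataFrame and table names below
> are only placeholders):
>
>   // candidatePairs: DataFrame of (user, item) pairs to score, using the same
>   // user/item column names the ALS model was trained with
>   val scored = alsModel.transform(candidatePairs)   // adds a "prediction" column
>   scored.write.mode("overwrite").saveAsTable("reco_user_item_scores")
>   // the online service then reads / caches this table
>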
> If that's not the case (such as billions of user-item pairs to predict), you
> have to start a service with the model loaded, send the user to the service,
> first match several hundred candidate items from all available items (which could
> itself be another service or cache), then transform this user and all candidate
> items using the model to get predictions, and return the items ordered by prediction.
>
> On Thu, Mar 16, 2017 at 9:32 AM, lk_spark  wrote:
>
>> hi, all:
>>    Under Spark 2.0, I would like to know how I can perform the recommend
>> action after training a ml.recommendation.ALSModel.
>>
>>    I tried to save the model and load it with MatrixFactorizationModel,
>> but got an error.
>>
>> 2017-03-16
>> --
>> lk_spark
>>
>
>


Re: [MLlib] kmeans random initialization, same seed every time

2017-03-15 Thread Yuhao Yang
Hi Julian,

Thanks for reporting this. This is a valid issue and I created
https://issues.apache.org/jira/browse/SPARK-19957 to track it.

Right now the seed defaults to this.getClass.getName.hashCode.toLong, which
indeed stays the same across multiple fits. Feel free to leave your comments
or send a PR for the fix.

For your problem, you may add .setSeed(new Random().nextLong()) before
fit() as a workaround.
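
For example, a minimal sketch with the DataFrame-based API (the k value, column
name and dataset variable are just placeholders):

  import scala.util.Random
  import org.apache.spark.ml.clustering.KMeans

  val kmeans = new KMeans()
    .setK(10)
    .setInitMode("random")
    .setFeaturesCol("features")
    .setSeed(new Random().nextLong())  // workaround: a fresh seed for every fit
  val model = kmeans.fit(dataset)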

Thanks,
Yuhao

2017-03-14 5:46 GMT-07:00 Julian Keppel :

> I'm sorry, I missed some important information. I use Spark version 2.0.2
> with Scala 2.11.8.
>
> 2017-03-14 13:44 GMT+01:00 Julian Keppel :
>
>> Hi everybody,
>>
>> I am running some experiments with the Spark k-means implementation in the new
>> DataFrame API. I am comparing the clustering results of different runs with
>> different parameters. I noticed that for the random initialization mode, the
>> seed value is the same every time. How is it calculated? In my
>> understanding, the seed should be random if it is not provided by the user.
>>
>> Thank you for your help.
>>
>> Julian
>>
>
>


Re: how to construct parameter for model.transform() from datafile

2017-03-14 Thread Yuhao Yang
Hi Jinhong,


Based on the error message, your second collection of vectors has a
dimension of 804202, while the dimension of your training vectors
was 144109. So please make sure your test dataset is of the same dimension
as the training data.

From the test dataset you posted, the vector dimension is much larger
than 144109 (804202?).

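If you want a clearer error (and a single place to keep the dimension), one
option is to pass the training dimension into the parsing function explicitly --
a minimal sketch based on the lineToVector function quoted below (144109 is the
training dimension from your output):

  import org.apache.spark.ml.linalg.{Vector, Vectors}

  def lineToVector(line: String, numFeatures: Int = 144109): Vector = {
    // parse "index:value" pairs separated by spaces
    val entries = line.split(" ").map { s =>
      val Array(index, value) = s.split(":")
      (index.toInt, value.toDouble)
    }
    val outOfRange = entries.filter(_._1 >= numFeatures)
    require(outOfRange.isEmpty,
      s"indices ${outOfRange.map(_._1).mkString(",")} exceed the training dimension $numFeatures")
    Vectors.sparse(numFeatures, entries)
  }

In general, though, the test features should be produced by the same feature
extraction / indexing step as the training features, so that an index always
refers to the same feature in both datasets.
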
Regards,
Yuhao


2017-03-13 4:59 GMT-07:00 jinhong lu :

> Can anyone help?
> >
> > On 2017-03-13, at 19:38, jinhong lu wrote:
> >
> > After training the model, I got a result that looks like this:
> >
> >
> >   scala> predictionResult.show()
> >   +-----+--------------------+--------------------+--------------------+----------+
> >   |label|            features|       rawPrediction|         probability|prediction|
> >   +-----+--------------------+--------------------+--------------------+----------+
> >   |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
> >   |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
> >   |  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|
> >
> > And then I transform() the data with this code:
> >
> >   import org.apache.spark.ml.linalg.Vectors
> >   import org.apache.spark.ml.linalg.Vector
> >   import scala.collection.mutable
> >
> >  def lineToVector(line:String ):Vector={
> >   val seq = new mutable.Queue[(Int,Double)]
> >   val content = line.split(" ");
> >   for( s <- content){
> > val index = s.split(":")(0).toInt
> > val value = s.split(":")(1).toDouble
> >  seq += ((index,value))
> >   }
> >   return Vectors.sparse(144109, seq)
> > }
> >
> >    val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text](
> >        "/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
> >      .map(line => line._2)
> >      .map(line => (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1))))
> >      .toDF("udid", "features")
> >val predictionResult = model.transform(df)
> >predictionResult.show()
> >
> >
> > But I got an error that looks like this:
> >
> > Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
> >  at scala.Predef$.require(Predef.scala:224)
> >  at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
> >  at lineToVector(:55)
> >  at $anonfun$4.apply(:50)
> >  at $anonfun$4.apply(:50)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
> >  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> >  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> >  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> >  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> >  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> >  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> >
> > So I changed
> >
> >   return Vectors.sparse(144109, seq)
> >
> > to
> >
> >   return Vectors.sparse(804202, seq)
> >
> > Another error occurred:
> >
> >   Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
> > at scala.Predef$.require(Predef.scala:224)
> > at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
> > at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
> > at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)
> >
> > What should I do?
> >> On 2017-03-13, at 16:31, jinhong lu wrote:
> >>
> >> Hi, all:
> >>
> >> I have the following training data:
> >>
> >>  0 31607:17
> >>  0 111905:36
> >>  0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
> >>  0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
> >>  0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
> >>  0 19109:7 29705:4 123305:32
> >>  0 15309:1 43005:1 108509:1
> >>  1 604:1 6401:1 6503:1 15207:4 31607:40
> >>  0 1807:19
> >>  0 301:14 501:1 1502:14 2507:12 123305:4
> >>  0 607:14 19109:460 123305:448
> >>  0 5406:14 7209:4 10509:3 19109:6 

Re: FPGrowth Model is taking too long to generate frequent item sets

2017-03-14 Thread Yuhao Yang
Hi Raju,

Have you tried setNumPartitions with a larger number?
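
A minimal sketch with the RDD-based API (the input path and parameter values are
only placeholders, chosen to show where numPartitions goes):

  import org.apache.spark.mllib.fpm.FPGrowth
  import org.apache.spark.rdd.RDD

  // placeholder input: one space-separated transaction per line
  val transactions: RDD[Array[String]] = sc.textFile("transactions.txt").map(_.split(" "))

  val fpg = new FPGrowth()
    .setMinSupport(0.01)     // tune for your data; very low values blow up the search space
    .setNumPartitions(200)   // spreads the conditional FP-tree work over more tasks
  val model = fpg.run(transactions)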

2017-03-07 0:30 GMT-08:00 Eli Super :

> Hi
>
> It's a whole area of knowledge; you will need to spend several hours reading
> about it online.
>
> What is your programming language?
>
> Try searching online for "machine learning binning %my_programming_language%"
> and
> "machine learning feature engineering %my_programming_language%"
>
> On Tue, Mar 7, 2017 at 3:39 AM, Raju Bairishetti  wrote:
>
>> @Eli, thanks for the suggestion. If you do not mind, could you please
>> elaborate on those approaches?
>>
>> On Mon, Mar 6, 2017 at 7:29 PM, Eli Super  wrote:
>>
>>> Hi
>>>
>>> Try to implement binning and/or feature engineering (smart feature
>>> selection for example)
>>>
>>> Good luck
>>>
>>> On Mon, Mar 6, 2017 at 6:56 AM, Raju Bairishetti 
>>> wrote:
>>>
 Hi,
   I am new to Spark MLlib. I am using the FPGrowth model for finding
 related items.

 The number of transactions is 63K and the total number of items across all
 transactions is 200K.

 I am running the FPGrowth model to generate frequent itemsets. It is
 taking a huge amount of time.* I am setting the min-support value such
 that each item only has to appear in roughly (number of items)/(number of
 transactions) of the transactions.*

 It takes a very long time if I allow an item to appear only once in the
 database.

 If I give a higher value for min-support, then the output is very small.

 Could anyone please guide me on how to reduce the execution time for
 generating frequent itemsets?

 --
 Thanks,
 Raju Bairishetti,
 www.lazada.com

>>>
>>>
>>
>>
>> --
>>
>> --
>> Thanks,
>> Raju Bairishetti,
>> www.lazada.com
>>
>
>


Sharing my DataFrame (DataSet) cheat sheet.

2017-03-04 Thread Yuhao Yang
Sharing some snippets I accumulated during developing with Apache Spark
DataFrame (DataSet). Hope it can help you in some way.

https://github.com/hhbyyh/DataFrameCheatSheet.

Regards,
Yuhao Yang


Re: scikit-learn and mllib difference in predictions python

2016-12-25 Thread Yuhao Yang
Hi ioanna,

I'd like to help look into it. Is there a way to access your training data?

2016-12-20 17:21 GMT-08:00 ioanna :

> I have an issue with an SVM model trained for binary classification using
> Spark 2.0.0.
> I have followed the same logic using scikit-learn and MLlib, using the exact
> same dataset.
> For scikit-learn I have the following code:
>
> svc_model = SVC()
> svc_model.fit(X_train, y_train)
>
> print "supposed to be 1"
> print svc_model.predict([15, 15, 0, 15, 15, 4, 12, 8, 0, 7])
> print svc_model.predict([15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0, 0.0, 12.0, 15.0])
> print svc_model.predict([15.0, 15.0, 7.0, 0.0, 7.0, 0.0, 15.0, 15.0, 15.0, 15.0])
> print svc_model.predict([7.0, 0.0, 15.0, 15.0, 15.0, 15.0, 7.0, 7.0, 15.0, 15.0])
>
> print "supposed to be 0"
> print svc_model.predict([18.0, 15.0, 7.0, 7.0, 15.0, 0.0, 15.0, 15.0, 15.0, 15.0])
> print svc_model.predict([11.0, 13.0, 7.0, 10.0, 7.0, 13.0, 7.0, 19.0, 7.0, 7.0])
> print svc_model.predict([15.0, 15.0, 18.0, 7.0, 15.0, 15.0, 15.0, 18.0, 7.0, 15.0])
> print svc_model.predict([15.0, 15.0, 8.0, 0.0, 0.0, 8.0, 15.0, 15.0, 15.0, 7.0])
>
>
> and it returns:
>
> supposed to be 1
> [0]
> [1]
> [1]
> [1]
> supposed to be 0
> [0]
> [0]
> [0]
> [0]
>
> For Spark I am doing:
>
> model_svm = SVMWithSGD.train(trainingData, iterations=100)
>
> model_svm.clearThreshold()
>
> print "supposed to be 1"
> print
> model_svm.predict(Vectors.dense(15.0,15.0,0.0,15.0,15.0,
> 4.0,12.0,8.0,0.0,7.0))
> print
> model_svm.predict(Vectors.dense(15.0,15.0,15.0,7.0,7.0,
> 15.0,15.0,0.0,12.0,15.0))
> print
> model_svm.predict(Vectors.dense(15.0,15.0,7.0,0.0,7.0,0.
> 0,15.0,15.0,15.0,15.0))
> print
> model_svm.predict(Vectors.dense(7.0,0.0,15.0,15.0,15.0,
> 15.0,7.0,7.0,15.0,15.0))
>
> print "supposed to be 0"
> print model_svm.predict(Vectors.dense(18.0, 15.0, 7.0, 7.0, 15.0, 0.0,
> 15.0, 15.0, 15.0, 15.0))
> print
> model_svm.predict(Vectors.dense(11.0,13.0,7.0,10.0,7.0,
> 13.0,7.0,19.0,7.0,7.0))
> print model_svm.predict(Vectors.dense(15.0, 15.0, 18.0, 7.0, 15.0,
> 15.0,
> 15.0, 18.0, 7.0, 15.0))
> print model_svm.predict(Vectors.dense(15.0, 15.0, 8.0, 0.0, 0.0, 8.0,
> 15.0, 15.0, 15.0, 7.0))
>
> which returns:
>
> supposed to be 1
> 12.8250120159
> 16.0786937313
> 14.2139435305
> 16.5115589658
> supposed to be 0
> 17.1311777004
> 14.075461697
> 20.8883372052
> 12.9132580999
>
> When I set the threshold, I get either all zeros or all ones.
>
> Does anyone know how to approach this problem?
>
> As I said I have checked multiple times that my dataset and feature
> extraction logic are exactly the same in both cases.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/scikit-learn-and-mllib-difference-in-predictions-python-tp28240.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: spark linear regression error training dataset is empty

2016-12-25 Thread Yuhao Yang
Hi Xiaomeng,

Have you tried to confirm the DataFrame contents, e.g. with
assembleddata.show(), before fitting?

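For example (a quick sanity check; assembleddata is just the variable name from
your snippet):

  assembleddata.printSchema()              // label should be numeric, features a Vector column
  println(assembleddata.count())           // should be > 0 after any filtering/joins
  assembleddata.show(5, truncate = false)
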
Regards,
Yuhao

2016-12-21 10:05 GMT-08:00 Xiaomeng Wan :

> Hi,
>
> I am running linear regression on a dataframe and get the following error:
>
> Exception in thread "main" java.lang.AssertionError: assertion failed: Training dataset is empty.
>
> at scala.Predef$.assert(Predef.scala:170)
> at org.apache.spark.ml.optim.WeightedLeastSquares$Aggregator.validate(WeightedLeastSquares.scala:247)
> at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:82)
> at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
> at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:70)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>
> here is the data and code:
>
> {"label":79.3,"features":{"type":1,"values":[6412.
> 14350001,888.0,1407.0,1.5844594594594594,10.614,12.07,
> 0.12062966031483012,0.9991237664152219,6.065,0.49751449875724935]}}
>
> {"label":72.3,"features":{"type":1,"values":[6306.
> 04450001,1084.0,1451.0,1.338560885608856,7.018,12.04,0.
> 41710963455149497,0.9992054343916128,6.05,0.4975083056478405]}}
>
> {"label":76.7,"features":{"type":1,"values":[6142.
> 9203,1494.0,1437.0,0.9618473895582329,7.939,12.06,
> 0.34170812603648426,0.9992216101762574,6.06,0.49751243781094534]}}
>
> val lr = new LinearRegression().setMaxIter(300).setFeaturesCol("features")
>
> val lrModel = lr.fit(assembleddata)
>
> Any clue or inputs are appreciated.
>
>
> Regards,
>
> Shawn
>
>
>


Re: Multilabel classification with Spark MLlib

2016-11-29 Thread Yuhao Yang
If problem transformation is not an option (
https://en.wikipedia.org/wiki/Multi-label_classification#Problem_transformation_methods),
I would try to develop a customized algorithm based
on MultilayerPerceptronClassifier, in which you probably need to
rewrite LabelConverter.
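
If problem transformation does work for your data, the simplest variant (binary
relevance: one independent binary classifier per label) can be sketched roughly
like this -- the column names and the choice of LogisticRegression are only
illustrative:

  import org.apache.spark.ml.classification.LogisticRegression

  // df: DataFrame with a "features" Vector column and one 0/1 column per label
  val labelCols = Seq("label_a", "label_b", "label_c")
  val models = labelCols.map { labelCol =>
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol(labelCol)
    labelCol -> lr.fit(df)
  }.toMap
  // at prediction time, run each model's transform() on the new data and
  // collect the per-label "prediction" columns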

2016-11-29 9:02 GMT-08:00 Md. Rezaul Karim 
:

> Hello All,
>
> Is there anyone who has developed multilabel classification applications
> with Spark?
>
> I found an example class in the Spark distribution (i.e.,
> *JavaMultiLabelClassificationMetricsExample.java*) which is not a
> classifier but an evaluator for multilabel classification. Moreover, the
> example is not well documented (i.e., I did not understand which field is a
> label and which one is a feature).
>
> More specifically, I was looking for some example implemented in
> Java/Scala/Python so that I can develop my own multi-label classification
> applications.
>
>
>
> Any kind of help would be highly appreciated.
>
>
>
>
>
>
> Regards,
> _
> *Md. Rezaul Karim,* BSc, MSc
> PhD Researcher, INSIGHT Centre for Data Analytics
> National University of Ireland, Galway
> IDA Business Park, Dangan, Galway, Ireland
> Web: http://www.reza-analytics.eu/index.html
> 
>
>


Re: OutOfMemoryError - When saving Word2Vec

2016-06-13 Thread Yuhao Yang
Hi Sharad,

what's your vocabulary size and vector length for Word2Vec?

Regards,
Yuhao

2016-06-13 20:04 GMT+08:00 sharad82 :

> Is this the right forum to post Spark-related issues? I have tried this
> forum along with Stack Overflow but am not seeing any response.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-When-saving-Word2Vec-tp27142p27151.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Ignore features in Random Forest

2016-06-02 Thread Yuhao Yang
Hi Neha,

This looks like a feature engineering task. I think VectorSlicer can help
with your case. Please refer to
http://spark.apache.org/docs/latest/ml-features.html#vectorslicer .
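
A minimal sketch (the column names and indices are placeholders -- setIndices
takes the positions you want to keep, i.e. everything except the id positions):

  import org.apache.spark.ml.feature.VectorSlicer

  val slicer = new VectorSlicer()
    .setInputCol("features")            // assembled vector that still contains the id columns
    .setOutputCol("selectedFeatures")   // same vector with the id positions removed
    .setIndices(Array(2, 3, 4, 5))      // keep only these positions
  val sliced = slicer.transform(dataset)
  // then train the random forest with setFeaturesCol("selectedFeatures")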

Regards,
Yuhao

2016-06-01 21:18 GMT+08:00 Neha Mehta :

> Hi,
>
> I am performing regression using Random Forest. In my input vector, I want
> the algorithm to ignore certain columns/features while training the
> model and also during prediction. These are basically ID columns. I
> checked the documentation and could not find any information on this.
>
> Request help with the same.
>
> Thanks & Regards,
> Neha
>