[CFP] DataWorks Summit, San Jose, 2018

2018-02-07 Thread Yanbo Liang
Hi All,

DataWorks Summit, San Jose, 2018 is a good place to share your experience of 
advanced analytics, data science, machine learning and deep learning.
We have an Artificial Intelligence and Data Science track covering technologies 
such as:
Apache Spark, Scikit-learn, TensorFlow, Keras, Apache MXNet, PyTorch/Torch, 
XGBoost, Apache Livy, Apache Zeppelin, Jupyter, etc.
Please consider submitting an abstract at 
https://dataworkssummit.com/san-jose-2018/ 



Thanks
Yanbo

[CFP] DataWorks Summit Europe 2018 - Call for abstracts

2017-12-09 Thread Yanbo Liang
The DataWorks Summit Europe is in Berlin, Germany this year, on April 16-19, 
2018. This is a great place to talk about work you are doing in Apache Spark or 
how you are using Spark for SQL/streaming processing, machine learning and data 
science. Information on submitting an abstract is at 
https://dataworkssummit.com/berlin-2018/ 

Tracks:
Data Warehousing and Operational Data Stores
Artificial Intelligence and Data Science
Big Compute and Storage
Cloud and Operations
Governance and Security
Cyber Security
IoT and Streaming
Enterprise Adoption

Deadline: December 15th, 2017

Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Yanbo Liang
You are right, native Spark MLlib CrossValidation can't run *different
algorithms* in parallel.
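
If you do need to train different algorithms at the same time, one workaround is
to submit each fit from its own driver thread, since Spark can schedule jobs from
multiple threads concurrently. A minimal Scala sketch of that idea (the toy
DataFrame and the spark-shell SparkSession named `spark` are assumptions):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.ml.Estimator
import org.apache.spark.ml.classification.{NaiveBayes, RandomForestClassifier}
import org.apache.spark.ml.linalg.Vectors

// Toy training data; "spark" is the SparkSession (e.g. in spark-shell).
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.1)),
  (1.0, Vectors.dense(2.0, 1.0)),
  (0.0, Vectors.dense(0.1, 1.3)),
  (1.0, Vectors.dense(1.9, 0.8))
)).toDF("label", "features")

// Each fit() launches its own Spark jobs; wrapping the calls in Futures lets the
// driver submit the jobs of both algorithms concurrently.
val estimators: Seq[Estimator[_]] = Seq(new NaiveBayes(), new RandomForestClassifier())
val futures = estimators.map(est => Future(est.fit(training)))
val models = futures.map(f => Await.result(f, Duration.Inf))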

Thanks
Yanbo

On Tue, Sep 5, 2017 at 10:56 PM, Timsina, Prem <prem.tims...@mssm.edu>
wrote:

> Hi Yanboo,
>
> Thank You, I very much appreciate your help.
>
> For the current use case, the data can fit into a single node. So,
> spark-sklearn seems to be good choice.
>
>
>
> *I have  on question regarding this *
>
> *“If no, Spark MLlib provide CrossValidation which can run multiple
> machine learning algorithms parallel on distributed dataset and do
> parameter search.
> FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation”*
>
> If I understand correctly, it can run parameter search for
> cross-validation in parallel.
>
> However,  currently  Spark does not support  running multiple algorithms
> (like Naïve Bayes,  Random Forest, etc.) in parallel. Am I correct?
>
> If not, could you please point me to some resources where they have run
> multiple algorithms in parallel.
>
>
>
> Thank You very much. It is great help, I will try spark-sklearn.
>
> Prem
>
> *From: *Yanbo Liang <yblia...@gmail.com>
> *Date: *Tuesday, September 5, 2017 at 10:40 AM
> *To: *Patrick McCarthy <pmccar...@dstillery.com>
> *Cc: *"Timsina, Prem" <prem.tims...@mssm.edu>, "user@spark.apache.org" <
> user@spark.apache.org>
> *Subject: *Re: Apache Spark: Parallelization of Multiple Machine Learning
> ALgorithm
>
>
>
> Hi Prem,
>
>
>
> How large is your dataset? Can it be fitted in a single node?
>
> If no, Spark MLlib provide CrossValidation which can run multiple machine
> learning algorithms parallel on distributed dataset and do parameter
> search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
>
> If yes, you can also try spark-sklearn, which can distribute multiple
> model training(single node training with sklearn) across a distributed
> cluster and do parameter search. FYI: https://github.com/databricks/spark-sklearn
>
>
>
> Thanks
>
> Yanbo
>
>
>
> On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy <pmccar...@dstillery.com>
> wrote:
>
> You might benefit from watching this JIRA issue -
> https://issues.apache.org/jira/browse/SPARK-19071
>
>
>
> On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem <prem.tims...@mssm.edu>
> wrote:
>
> Is there a way to parallelize multiple ML algorithms in Spark. My use case
> is something like this:
>
> A) Run multiple machine learning algorithm (Naive Bayes, ANN, Random
> Forest, etc.) in parallel.
>
> 1) Validate each algorithm using 10-fold cross-validation
>
> B) Feed the output of step A) in second layer machine learning algorithm.
>
> My question is:
>
> Can we run multiple machine learning algorithm in step A in parallel?
>
> Can we do cross-validation in parallel? Like, run 10 iterations of Naive
> Bayes training in parallel?
>
>
>
> I was not able to find any way to run the different algorithm in parallel.
> And it seems cross-validation also can not be done in parallel.
>
> I appreciate any suggestion to parallelize this use case.
>
>
>
> Prem
>
>
>
>
>


Re: sparkR 3rd library

2017-09-05 Thread Yanbo Liang
I guess you didn't install the R package `genalg` on all worker nodes. It is
not a built-in base R package, so you need to install it on all worker nodes
manually or run `install.packages` inside your SparkR UDF.
Regarding how to download third-party packages and install them inside a
SparkR UDF, please refer to this test case:
https://github.com/apache/spark/blob/master/R/pkg/tests/fulltests/test_context.R#L171

Thanks
Yanbo

On Tue, Sep 5, 2017 at 6:42 AM, Felix Cheung 
wrote:

> Can you include the code you call spark.lapply?
>
>
> --
> *From:* patcharee 
> *Sent:* Sunday, September 3, 2017 11:46:40 PM
> *To:* user@spark.apache.org
> *Subject:* sparkR 3rd library
>
> Hi,
>
> I am using spark.lapply to execute an existing R script in standalone
> mode. This script calls a function 'rbga' from a 3rd library 'genalg'.
> This rbga function works fine in sparkR env when I call it directly, but
> when I apply this to spark.lapply I get the error
>
> could not find function "rbga"
>  at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
>  at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala
>
> Any ideas/suggestions?
>
> BR, Patcharee
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Apache Spark: Parallelization of Multiple Machine Learning ALgorithm

2017-09-05 Thread Yanbo Liang
Hi Prem,

How large is your dataset? Can it fit in a single node?
If no, Spark MLlib provides CrossValidator, which can run cross-validation
and parameter search in parallel over a distributed dataset. FYI:
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
If yes, you can also try spark-sklearn, which can distribute multiple model
trainings (single-node training with sklearn) across a cluster and do
parameter search. FYI: https://github.com/databricks/spark-sklearn
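
For the distributed path, a minimal Scala sketch of CrossValidator doing a
parameter search for one algorithm (the toy data is an assumption; see the
ml-tuning guide above). Note that it parallelizes over the data and the
parameter grid of a single estimator, not over different algorithms:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Toy data; "spark" is the SparkSession.
val training = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.1)), (1.0, Vectors.dense(2.0, 1.0)),
  (0.0, Vectors.dense(0.1, 1.3)), (1.0, Vectors.dense(1.9, 0.8)),
  (0.0, Vectors.dense(0.2, 1.0)), (1.0, Vectors.dense(2.2, 0.9)),
  (0.0, Vectors.dense(0.3, 1.2)), (1.0, Vectors.dense(2.1, 1.1))
)).toDF("label", "features")

val lr = new LogisticRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)

val cvModel = cv.fit(training)  // folds and parameter combinations are evaluated on the cluster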

Thanks
Yanbo

On Tue, Sep 5, 2017 at 9:56 PM, Patrick McCarthy 
wrote:

> You might benefit from watching this JIRA issue -
> https://issues.apache.org/jira/browse/SPARK-19071
>
> On Sun, Sep 3, 2017 at 5:50 PM, Timsina, Prem 
> wrote:
>
>> Is there a way to parallelize multiple ML algorithms in Spark. My use
>> case is something like this:
>>
>> A) Run multiple machine learning algorithm (Naive Bayes, ANN, Random
>> Forest, etc.) in parallel.
>>
>> 1) Validate each algorithm using 10-fold cross-validation
>>
>> B) Feed the output of step A) in second layer machine learning algorithm.
>>
>> My question is:
>>
>> Can we run multiple machine learning algorithm in step A in parallel?
>>
>> Can we do cross-validation in parallel? Like, run 10 iterations of Naive
>> Bayes training in parallel?
>>
>>
>>
>> I was not able to find any way to run the different algorithm in
>> parallel. And it seems cross-validation also can not be done in parallel.
>>
>> I appreciate any suggestion to parallelize this use case.
>>
>>
>>
>> Prem
>>
>
>


Re: Training A ML Model on a Huge Dataframe

2017-08-24 Thread Yanbo Liang
Hi Sea,

Could you let us know which ML algorithm you use? What are the number of
instances and the dimension of your dataset?
AFAIK, Spark MLlib can train a model with several million features if you
configure it correctly.

Thanks
Yanbo

On Thu, Aug 24, 2017 at 7:07 AM, Suzen, Mehmet  wrote:

> SGD is supported. I see I assumed you were using Scala. Looks like you can
> do streaming regression, not sure of pyspark API though:
>
> https://spark.apache.org/docs/latest/mllib-linear-methods.
> html#streaming-linear-regression
>
> On 23 August 2017 at 18:22, Sea aj  wrote:
>
>> Thanks for the reply.
>>
>> As far as I understood mini batch is not yet supported in ML libarary. As
>> for MLLib minibatch, I could not find any pyspark api.
>>
>>
>>
>>
>> On Wed, Aug 23, 2017 at 2:59 PM, Suzen, Mehmet  wrote:
>>
>>> It depends on what model you would like to train but models requiring
>>> optimisation could use SGD with mini batches. See:
>>> https://spark.apache.org/docs/latest/mllib-optimization.html
>>> #stochastic-gradient-descent-sgd
>>>
>>> On 23 August 2017 at 14:27, Sea aj  wrote:
>>>
 Hi,

 I am trying to feed a huge dataframe to a ml algorithm in Spark but it
 crashes due to the shortage of memory.

 Is there a way to train the model on a subset of the data in multiple
 steps?

 Thanks




>>>
>>>
>>>
>>> --
>>>
>>> Mehmet Süzen, MSc, PhD
>>> 
>>>
>>>
>>
>>
>
>
> --
>
> Mehmet Süzen, MSc, PhD
> 
>
>


Re: [BlockMatrix] multiply is an action or a transformation ?

2017-08-20 Thread Yanbo Liang
BlockMatrix.multiply will return another BlockMatrix. Inside this function
there are lots of RDD operations, but most of them are transformations. If you
don't trigger an action to materialize the blocks (an RDD of
((Int, Int), Matrix)) of the result BlockMatrix, the job will not run.
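
A small Scala sketch to illustrate (the toy single-block matrices are
assumptions): the block-level multiplications only run once an action touches
the blocks of the result.

import org.apache.spark.mllib.linalg.Matrices
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Two tiny 2x2 matrices, each stored as a single block; "sc" is the SparkContext.
val blocksA = sc.parallelize(Seq(((0, 0), Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0)))))
val blocksB = sc.parallelize(Seq(((0, 0), Matrices.dense(2, 2, Array(5.0, 6.0, 7.0, 8.0)))))
val m1 = new BlockMatrix(blocksA, 2, 2)
val m2 = new BlockMatrix(blocksB, 2, 2)

val product = m1.multiply(m2) // the per-block multiplications have not been executed yet
product.blocks.count()        // an action on the result's blocks RDD triggers the computation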

Thanks
Yanbo

On Sun, Aug 13, 2017 at 10:30 PM, Jose Francisco Saray Villamizar <
jsa...@gmail.com> wrote:

> Hi Everyone,
>
> Sorry if the question can be simple, or confusing, but I have not see
> anywhere in documentation
> the anwser:
>
> Is multiply method in BlockMatrix a transformation or an action.
> I mean, in order that the multiplication is effectively done it is enough
> with calling :
>
> m1.multiply(m2),
>
> Or do I have to make something like m1.multiply(m2).count().
>
> Thanks.
>
> --
> --
> Buen dia, alegria !!
> José Francisco Saray Villamizar
> cel +33 6 13710693 <+33%206%2013%2071%2006%2093>
> Lyon, France
>
>


Re: Huber regression in PySpark?

2017-08-20 Thread Yanbo Liang
Hi Jeff,

Actually I have had an implementation of robust regression with huber loss for
a long time (https://github.com/apache/spark/pull/14326). It is a fairly
straightforward port of scikit-learn's HuberRegressor.
The PR implemented huber regression as a separate Estimator, but we found it
can be merged into LinearRegression.
I will update this PR ASAP, and I'm looking forward to your reviews and
comments.
After the Scala implementation is merged, it's very easy to add the
corresponding PySpark API, and then you can use it to train a huber regression
model in a distributed environment.
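
For reference, once this is merged (it shipped in Spark 2.3 as a loss option on
LinearRegression), usage should look roughly like the sketch below; the toy
data is an assumption:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression

// Toy data with one outlier; "spark" is the SparkSession.
val df = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(1.0)),
  (2.0, Vectors.dense(2.0)),
  (3.0, Vectors.dense(3.0)),
  (100.0, Vectors.dense(4.0))   // outlier that the huber loss should down-weight
)).toDF("label", "features")

val huber = new LinearRegression()
  .setLoss("huber")     // available since Spark 2.3
  .setEpsilon(1.35)     // robustness parameter, as in scikit-learn's HuberRegressor
  .setRegParam(0.1)

val model = huber.fit(df)
println(s"coefficients=${model.coefficients} intercept=${model.intercept} scale=${model.scale}")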

Thanks
Yanbo

On Sun, Aug 20, 2017 at 3:19 PM, Jeff Gates  wrote:

> Hi guys,
>
> Is there huber regression in PySpark? We are using sklearn HuberRegressor (
> http://scikit-learn.org/stable/modules/generated/sklearn.
> linear_model.HuberRegressor.html) to train our model, but with some
> bottleneck in single node.
> If no, is there any obstacle to implement it in PySpark?
>
> Jeff
>


Re: Collecting matrix's entries raises an error only when run inside a test

2017-07-06 Thread Yanbo Liang
Hi Simone,

Would you mind sharing a minimized code example to reproduce this issue?

Yanbo

On Wed, Jul 5, 2017 at 10:52 PM, Simone Robutti 
wrote:

> Hello, I have this problem and  Google is not helping. Instead, it looks
> like an unreported bug and there are no hints to possible workarounds.
>
> the error is the following:
>
> Traceback (most recent call last):
>   File 
> "/home/simone/motionlogic/trip-labeler/test/trip_labeler_test/model_test.py",
> line 43, in test_make_trip_matrix
> entries = trip_matrix.entries.map(lambda entry: (entry.i, entry.j,
> entry.value)).collect()
>   File "/opt/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/rdd.py",
> line 770, in collect
> with SCCallSiteSync(self.context) as css:
>   File 
> "/opt/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/traceback_utils.py",
> line 72, in __enter__
> self._context._jsc.setCallSite(self._call_site)
> AttributeError: 'NoneType' object has no attribute 'setCallSite'
>
> and it is raised when I try to collect a 
> pyspark.mllib.linalg.distributed.CoordinateMatrix
> entries with .collect() and it happens only when this run in a test suite
> with more than one class, so it's probably related to the creation and
> destruction of SparkContexts but I cannot understand how.
>
> Spark version is 1.6.2
>
> I saw multiple references to this error for other classses in the pyspark
> ml library but none of them contained hints toward the solution.
>
> I'm running tests through nosetests when it breaks. Running a single
> TestCase in Intellij works fine.
>
> Is there a known solution? Is it a known problem?
>
> Thank you,
>
> Simone
>


Re: PySpark 2.1.1 Can't Save Model - Permission Denied

2017-06-28 Thread Yanbo Liang
It looks like your Spark job was running as user root, but your file
system operation was running as user jomernik. Since Spark calls the
corresponding file system (such as HDFS or S3) to commit the job (renaming
temporary files to persistent ones), you need correct authorization on both
the Spark side and the file system side. Could you write a Spark DataFrame to
this file system and check whether it works well?

Thanks
Yanbo

On Tue, Jun 27, 2017 at 8:47 PM, John Omernik  wrote:

> Hello all, I am running PySpark 2.1.1 as a user, jomernik. I am working
> through some documentation here:
>
> https://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests
>
> And was working on the Random Forest Classification, and found it to be
> working!  That said, when I try to save the model to my hdfs (MaprFS in my
> case)  I got a weird error:
>
> I tried to save here:
>
> model.save(sc, "maprfs:///user/jomernik/tmp/myRandomForestClassificationMo
> del")
>
> /user/jomernik is my user directory and I have full access to the
> directory.
>
>
>
> All the directories down to
>
> /user/jomernik/tmp/myRandomForestClassificationModel/metadata/_temporary/0
> are owned by my with full permissions, but when I get to this directory,
> here is the ls
>
> $ ls -ls
>
> total 1
>
> 1 drwxr-xr-x 2 root root 1 Jun 27 07:38 task_20170627123834_0019_m_00
>
> 0 drwxr-xr-x 2 root root 0 Jun 27 07:38 _temporary
>
> Am I doing something wrong here? Why is the temp stuff owned by root? Is
> there a bug in saving things due to this ownership?
>
> John
>
>
>
>
>
>
> Exception:
> Py4JJavaError: An error occurred while calling o338.save.
> : org.apache.hadoop.security.AccessControlException: User jomernik(user
> id 101) does has been denied access to rename  /user/jomernik/tmp/
> myRandomForestClassificationModel/metadata/_temporary/0/
> task_20170627123834_0019_m_00/part-0 to /user/jomernik/tmp/
> myRandomForestClassificationModel/metadata/part-0
> at com.mapr.fs.MapRFileSystem.rename(MapRFileSystem.java:1112)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(
> FileOutputCommitter.java:461)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(
> FileOutputCommitter.java:475)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.
> commitJobInternal(FileOutputCommitter.java:392)
> at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(
> FileOutputCommitter.java:364)
> at org.apache.hadoop.mapred.FileOutputCommitter.commitJob(
> FileOutputCommitter.java:136)
> at org.apache.spark.SparkHadoopWriter.commitJob(
> SparkHadoopWriter.scala:111)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1227)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1168)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:151)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(
> PairRDDFunctions.scala:1168)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1071)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1037)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1037)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:151)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(
> PairRDDFunctions.scala:1037)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:963)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopFile$1.apply(PairRDDFunctions.scala:963)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopFile$1.apply(PairRDDFunctions.scala:963)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:151)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
> at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(
> PairRDDFunctions.scala:962)
> at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.
> apply$mcV$sp(RDD.scala:1489)
> at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.
> apply(RDD.scala:1468)
> at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.
> apply(RDD.scala:1468)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:151)
> at 

Re: Help in Parsing 'Categorical' type of data

2017-06-23 Thread Yanbo Liang
Please consider using other classification models such as logistic
regression or GBT. Naive Bayes usually treats features as counts, which is
not well suited to features generated by a one-hot encoder.
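
For example, a minimal Scala pipeline along those lines with the Spark 2.x
feature transformers (the toy data is an assumption); logistic regression
simply treats the one-hot columns as ordinary numeric features:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Toy data; "spark" is the SparkSession.
val df = spark.createDataFrame(Seq(
  ("m", 30.0, 1.0),
  ("f", 25.0, 0.0),
  ("m", 40.0, 1.0),
  ("f", 35.0, 0.0)
)).toDF("gender", "age", "label")

val indexer = new StringIndexer().setInputCol("gender").setOutputCol("genderIndex")
val encoder = new OneHotEncoder().setInputCol("genderIndex").setOutputCol("genderVec")
val assembler = new VectorAssembler().setInputCols(Array("genderVec", "age")).setOutputCol("features")
val lr = new LogisticRegression()

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
val model = pipeline.fit(df)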

Thanks
Yanbo

On Wed, May 31, 2017 at 3:58 PM, Amlan Jyoti  wrote:

> Hi,
>
> I am trying to run Naive Bayes Model using Spark ML libraries, in Java.
> The sample snippet of dataset is given below:
>
> *Raw Data* -
>
>
> But, as the input data needs to in numeric, so I am using
> *one-hot-encoder* on the Gender field[m->0,1][f->1,0]; and the finally
> the 'features' vector is inputted to Model, and I could get the Output.
>
> *Transformed Data* -
>
>
> But the model *results are not correct *as the 'Gender' field[Originally,
> Categorical] is now considered as a continuous field after one-hot encoding
> transformations.
>
> *Expectation* is that - for 'continuous data', mean and variance ; and
> for 'categorical data', the number of occurrences of different categories,
> is to be calculated. [In, my case, mean and variances are calculated even
> for the Gender Field].
>
> So, is there any way by which I can indicate to the model that a
> particular data field is 'categorical' by nature?
>
> Thanks
>
> Best Regards
> Amlan Jyoti
>
>
>
>


Re: RowMatrix: tallSkinnyQR

2017-06-23 Thread Yanbo Liang
Since this function is used to compute the QR decomposition of a RowMatrix
with a tall-and-skinny shape (an n x k matrix with n >> k), the output R is a
small k x k upper-triangular matrix, which easily fits on the driver as a
local Matrix, while Q keeps the large row dimension and stays distributed.

On Fri, Jun 9, 2017 at 10:33 PM, Arun  wrote:

> hi
>
> *def  tallSkinnyQR(computeQ: Boolean = false): QRDecomposition[RowMatrix,
> Matrix]*
>
> *In output of this method Q is distributed matrix*
> *and R is local Matrix*
>
> *Whats  the reason R is Local Matrix?*
>
>
> -Arun
>


Re: spark higher order functions

2017-06-23 Thread Yanbo Liang
See reply here:

http://apache-spark-developers-list.1001551.n3.nabble.com/Will-higher-order-functions-in-spark-SQL-be-pushed-upstream-td21703.html

On Tue, Jun 20, 2017 at 10:02 PM, AssafMendelson 
wrote:

> Hi,
>
> I have seen that databricks have higher order functions (
> https://docs.databricks.com/_static/notebooks/higher-order-functions.html,
> https://databricks.com/blog/2017/05/24/working-with-
> nested-data-using-higher-order-functions-in-sql-on-databricks.html) which
> basically allows to do generic operations on arrays (and I imagine on maps
> too).
>
> I was wondering if there is an equivalent on vanilla spark.
>
>
>
>
>
> Thanks,
>
>   Assaf.
>
>
>
> --
> View this message in context: spark higher order functions
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>


Re: gfortran runtime library for Spark

2017-06-23 Thread Yanbo Liang
The gfortran runtime library is still needed for Spark 2.1 to get the best
performance.
If it's not present on your nodes, you will see a warning message and a
pure JVM implementation will be used instead, but you will not get the best
performance.

Thanks
Yanbo

On Wed, Jun 21, 2017 at 5:30 PM, Saroj C  wrote:

> Dear All,
>  Can you please let me know, if gfortran runtime library is still
> required for Spark 2.1, for better performance. Note, I am using Java APIs
> for Spark ?
>
> Thanks & Regards
> Saroj
>
>
>


Re: BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-12 Thread Yanbo Liang
Yeah, for binary data, you can also use MulticlassClassificationEvaluator
to evaluate other metrics which BinaryClassificationEvaluator doesn't
cover, such as accuracy, f1, weightedPrecision and weightedRecall.
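
A small Scala sketch (the stand-in predictions DataFrame is an assumption; in
practice it would be the output of model.transform on a test set):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Stand-in for model.transform(test): just (prediction, label) pairs; "spark" is the SparkSession.
val predictions = spark.createDataFrame(Seq(
  (0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (1.0, 1.0)
)).toDF("prediction", "label")

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

println("accuracy = " + evaluator.setMetricName("accuracy").evaluate(predictions))
println("f1       = " + evaluator.setMetricName("f1").evaluate(predictions))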

Thanks
Yanbo

On Thu, May 11, 2017 at 10:31 PM, Lan Jiang  wrote:

> I realized that in the Spark ML, BinaryClassifcationMetrics only supports
> AreaUnderPR and AreaUnderROC. Why is that? I
>
> What if I need other metrics such as F-score, accuracy? I tried to use
> MulticlassClassificationEvaluator to evaluate other metrics such as
> Accuracy for a binary classification problem and it seems working. But I am
> not sure if there is any issue using MulticlassClassificationEvaluator
> for a binary classification. According to the Spark ML documentation "The
> Evaluator can be a RegressionEvaluator for regression problems, *a
> BinaryClassificationEvaluator for binary data, or a
> MulticlassClassificationEvaluator for multiclass problems*. "
>
> https://spark.apache.org/docs/2.1.0/ml-tuning.html
>
> Can someone shed some lights on the issue?
>
> Lan
>


[CFP] DataWorks Summit/Hadoop Summit Sydney - Call for abstracts

2017-05-03 Thread Yanbo Liang
The Australia/Pacific version of DataWorks Summit is in Sydney this year,
September 20-21. This is a great place to talk about work you are doing in
Apache Spark or how you are using Spark. Information on submitting an
abstract is at
https://dataworkssummit.com/sydney-2017/abstracts/submit-abstract/



Tracks:

Apache Hadoop

Apache Spark and Data Science

Cloud and Applications

Data Processing and Warehousing

Enterprise Adoption

IoT and Streaming

Operations, Governance and Security



Deadline: Friday, May 26th, 2017.


Re: Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-05-01 Thread Yanbo Liang
Hi Tim,

The Spark ML API doesn't support setting an initial model for GMM currently. I
hope we can get this feature in Spark 2.3.
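
Until then, the RDD-based (spark.mllib) API does support it; a minimal Scala
sketch with toy data (the first run just stands in for a previously computed
GMM or one seeded from k-means centroids):

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vectors

// Toy data forming two well-separated clusters; "sc" is the SparkContext.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.1), Vectors.dense(0.2, -0.1), Vectors.dense(-0.1, 0.2), Vectors.dense(0.1, 0.0),
  Vectors.dense(9.8, 10.1), Vectors.dense(10.2, 9.9), Vectors.dense(10.1, 10.2), Vectors.dense(9.9, 9.8)
))

// A previously trained model (or one built from k-means centroids) to warm-start from.
val initialModel = new GaussianMixture().setK(2).setSeed(1L).run(data)

val refined = new GaussianMixture()
  .setK(2)
  .setInitialModel(initialModel)  // only available in the RDD-based API today
  .setMaxIterations(50)
  .run(data)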

Thanks
Yanbo

On Fri, Apr 28, 2017 at 1:46 AM, Tim Smith  wrote:

> Hi,
>
> I am trying to figure out the API to initialize a gaussian mixture model
> using either centroids created by K-means or previously calculated GMM
> model (I am aware that you can "save" a model and "load" in later but I am
> not interested in saving a model to a filesystem).
>
> The Spark MLlib API lets you do this using SetInitialModel
> https://spark.apache.org/docs/2.1.0/api/scala/index.html#
> org.apache.spark.mllib.clustering.GaussianMixture
>
> However, I cannot figure out how to do this using Spark ML API. Can anyone
> please point me in the right direction? I've tried reading the Spark ML
> code and was wondering if the "set" call lets you do that?
>
> --
> Thanks,
>
> Tim
>


Re: How to create SparkSession using SparkConf?

2017-04-28 Thread Yanbo Liang
StreamingContext is the older DStream-based API; if you want to process
streaming data, you can use SparkSession directly with Structured Streaming.
FYI:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
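
A minimal Scala sketch of the SparkSession-based (Structured Streaming) path,
assuming a socket source on localhost:9999 purely for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("streaming-word-count").getOrCreate()
import spark.implicits._

// Read a stream of lines from a socket (run `nc -lk 9999` locally to feed it).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Word count expressed directly on the streaming Dataset.
val counts = lines.as[String].flatMap(_.split(" ")).groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
query.awaitTermination()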

Thanks
Yanbo

On Fri, Apr 28, 2017 at 12:12 AM, kant kodali <kanth...@gmail.com> wrote:

> Actually one more question along the same line. This is about .getOrCreate()
> ?
>
> JavaStreamingContext doesn't seem to have a way to accept SparkSession
> object so does that mean a streaming context is not required? If so, how do
> I pass a lambda to .getOrCreate using SparkSession? The lambda that we
> normally pass when we call StreamingContext.getOrCreate.
>
>
>
>
>
>
>
>
> On Thu, Apr 27, 2017 at 8:47 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> Ahhh Thanks much! I miss my sparkConf.setJars function instead of this
>> hacky comma separated jar names.
>>
>> On Thu, Apr 27, 2017 at 8:01 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>>
>>> Could you try the following way?
>>>
>>> val spark = 
>>> SparkSession.builder.appName("my-application").config("spark.jars", "a.jar, 
>>> b.jar").getOrCreate()
>>>
>>>
>>> Thanks
>>>
>>> Yanbo
>>>
>>>
>>> On Thu, Apr 27, 2017 at 9:21 AM, kant kodali <kanth...@gmail.com> wrote:
>>>
>>>> I am using Spark 2.1 BTW.
>>>>
>>>> On Wed, Apr 26, 2017 at 3:22 PM, kant kodali <kanth...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I am wondering how to create SparkSession using SparkConf object?
>>>>> Although I can see that most of the key value pairs we set in SparkConf we
>>>>> can also set in SparkSession or  SparkSession.Builder however I don't see
>>>>> sparkConf.setJars which is required right? Because we want the driver jar
>>>>> to be distributed across the cluster whether we run it in client mode or
>>>>> cluster mode. so I am wondering how is this possible?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: How to create SparkSession using SparkConf?

2017-04-27 Thread Yanbo Liang
Could you try the following way?

val spark = SparkSession.builder.appName("my-application").config("spark.jars",
"a.jar, b.jar").getOrCreate()


Thanks

Yanbo


On Thu, Apr 27, 2017 at 9:21 AM, kant kodali  wrote:

> I am using Spark 2.1 BTW.
>
> On Wed, Apr 26, 2017 at 3:22 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I am wondering how to create SparkSession using SparkConf object?
>> Although I can see that most of the key value pairs we set in SparkConf we
>> can also set in SparkSession or  SparkSession.Builder however I don't see
>> sparkConf.setJars which is required right? Because we want the driver jar
>> to be distributed across the cluster whether we run it in client mode or
>> cluster mode. so I am wondering how is this possible?
>>
>> Thanks!
>>
>>
>


Re: Synonym handling replacement issue with UDF in Apache Spark

2017-04-27 Thread Yanbo Liang
What about JOINing your table with a mapping table?
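
A minimal Scala sketch of that idea with toy data (column names are
assumptions): broadcast-join against a small synonym-to-canonical mapping table
and keep the original value when there is no match. For sentence fields you
would first split/explode into tokens, join each token, and re-assemble.

import org.apache.spark.sql.functions.{broadcast, coalesce, col}

// Toy data; "spark" is the SparkSession.
val records = spark.createDataFrame(Seq(
  (1, "Allen"), (2, "Armstrong"), (3, "DeWALT"), (4, "SomeOtherBrand")
)).toDF("id", "ManufacturerSource")

val mapping = spark.createDataFrame(Seq(
  ("Allen", "Apex Tool Group"),
  ("Armstrong", "Apex Tool Group"),
  ("DeWALT", "StanleyBlack")
)).toDF("synonym", "canonical")

// Left join on the synonym; fall back to the original name when it is unmapped.
val replaced = records
  .join(broadcast(mapping), records("ManufacturerSource") === mapping("synonym"), "left_outer")
  .withColumn("Manufacturer", coalesce(col("canonical"), col("ManufacturerSource")))
  .drop("synonym", "canonical")

replaced.show()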

On Thu, Apr 27, 2017 at 9:58 PM, Nishanth 
wrote:

> I am facing a major issue on replacement of Synonyms in my DataSet.
>
> I am trying to replace the synonym of the Brand names to its equivalent
> names.
>
> I have tried 2 methods to solve this issue.
>
> Method 1 (regexp_replace)
>
> Here i am using the regexp_replace method.
>
> Hashtable<String, String> manufacturerNames = new Hashtable<String, String>();
>   Enumeration<String> names;
>   String str;
>   double bal;
>
>   manufacturerNames.put("Allen","Apex Tool Group");
>   manufacturerNames.put("Armstrong","Apex Tool Group");
>   manufacturerNames.put("Campbell","Apex Tool Group");
>   manufacturerNames.put("Lubriplate","Apex Tool Group");
>   manufacturerNames.put("Delta","Apex Tool Group");
>   manufacturerNames.put("Gearwrench","Apex Tool Group");
>   manufacturerNames.put("H.K. Porter","Apex Tool Group");
>   /*100 MORE*/
>   manufacturerNames.put("Stanco","Stanco Mfg");
>   manufacturerNames.put("Stanco","Stanco Mfg");
>   manufacturerNames.put("Standard Safety","Standard Safety
> Equipment Company");
>   manufacturerNames.put("Standard Safety","Standard Safety
> Equipment Company");
>
>
>
>   // Show all balances in hash table.
>   names = manufacturerNames.keys();
>   Dataset<Row> dataFileContent =
> sqlContext.load("com.databricks.spark.csv",
> options);
>
>
>   while(names.hasMoreElements()) {
>  str = (String) names.nextElement();
>  dataFileContent=dataFileContent.withColumn("ManufacturerSource",
> regexp_replace(col("ManufacturerSource"),str,manufacturerNames.get(str).
> toString()));
>   }
>   dataFileContent.show();
>
> I got to know that the amount of data is too huge for regexp_replace so
> got a solution to use UDF
> http://stackoverflow.com/questions/43413513/issue-in-
> regex-replace-in-apache-spark-java
>
>
> Method 2 (UDF)
>
> List<Row> data2 = Arrays.asList(
> RowFactory.create("Allen", "Apex Tool Group"),
> RowFactory.create("Armstrong","Apex Tool Group"),
> RowFactory.create("DeWALT","StanleyBlack")
> );
>
> StructType schema2 = new StructType(new StructField[] {
> new StructField("label2", DataTypes.StringType, false,
> Metadata.empty()),
> new StructField("sentence2", DataTypes.StringType, false,
> Metadata.empty())
> });
> Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2,
> schema2);
>
> UDF2<String, String, Boolean> contains = new UDF2<String, String, Boolean>() {
> private static final long serialVersionUID = -5239951370238629896L;
>
> @Override
> public Boolean call(String t1, String t2) throws Exception {
> return t1.contains(t2);
> }
> };
> spark.udf().register("contains", contains, DataTypes.BooleanType);
>
> UDF3<String, String, String, String> replaceWithTerm = new
> UDF3<String, String, String, String>() {
> private static final long serialVersionUID = -2882956931420910207L;
>
> @Override
> public String call(String t1, String t2, String t3) throws
> Exception {
> return t1.replaceAll(t2, t3);
> }
> };
> spark.udf().register("replaceWithTerm", replaceWithTerm,
> DataTypes.StringType);
>
> Dataset<Row> joined = sentenceDataFrame.join(sentenceDataFrame2,
> callUDF("contains", sentenceDataFrame.col("sentence"),
> sentenceDataFrame2.col("label2")))
> .withColumn("sentence_replaced",
> callUDF("replaceWithTerm", sentenceDataFrame.col("sentence"),
> sentenceDataFrame2.col("label2"), sentenceDataFrame2.col("sentence2")))
> .select(col("sentence_replaced"));
>
> joined.show(false);
> }
>
>
> Input
>
> Allen Armstrong nishanth hemanth Allen
> shivu Armstrong nishanth
> shree shivu DeWALT
>
> Replacement of words
> The word in LHS has to replace with the words in RHS given in the input
> sentence
> Allen => Apex Tool Group
> Armstrong => Apex Tool Group
> DeWALT => StanleyBlack
>
>Output
>
>   +-+--+-+
> ---++
>   |label|sentence_replaced
>   |
>   +-+--+-+
> ---++
>   |0|Apex Tool Group Armstrong nishanth hemanth Apex Tool Group
> |
>   |0|Allen Apex Tool Group nishanth hemanth Allen
> |
>   |1|shivu Apex Tool Group nishanth
> |
>   |2|shree shivu StanleyBlack
> |
>   +-+--+-+
> ---++
>
>   Expected Output
>   +-+--+-+
> ---++
>   |label| sentence_replaced
>|
>   

Re: how to create List in pyspark

2017-04-27 Thread Yanbo Liang
You can try a UDF, like the following code snippet:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
df = spark.read.text("./README.md")
split_func = udf(lambda text: text.split(" "), ArrayType(StringType()))
df.withColumn("split_value", split_func("value")).show()

Thanks
Yanbo

On Tue, Apr 25, 2017 at 12:27 AM, Selvam Raman  wrote:

> documentDF = spark.createDataFrame([
>
> ("Hi I heard about Spark".split(" "), ),
>
> ("I wish Java could use case classes".split(" "), ),
>
> ("Logistic regression models are neat".split(" "), )
>
> ], ["text"])
>
>
> How can i achieve the same df while i am reading from source?
>
> doc = spark.read.text("/Users/rs/Desktop/nohup.out")
>
> how can i create array type with "sentences" column from
> doc(dataframe)
>
>
> The below one creates more than one column.
>
> rdd.map(lambda rdd: rdd[0]).map(lambda row:row.split(" "))
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yanbo Liang
Do you want to get a sparse model in which most of the coefficients are zero?
If yes, using L1 regularization leads to sparsity. However, the size of the
LogisticRegressionModel coefficients vector is still equal to the number of
features; you can extract the non-zero elements manually. Actually, it will be
stored as a sparse vector (or matrix in the multinomial case) if it's sparse
enough.
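
A minimal Scala sketch with toy data (an assumption): elasticNetParam = 1.0
gives a pure L1 penalty, and the surviving non-zero coefficients can then be
pulled out manually.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

// Toy data; "spark" is the SparkSession.
val df = spark.createDataFrame(Seq(
  (0.0, Vectors.dense(0.0, 1.1, 0.1, 5.0)),
  (1.0, Vectors.dense(2.0, 1.0, -1.0, 5.1)),
  (0.0, Vectors.dense(0.1, 1.3, 1.0, 4.9)),
  (1.0, Vectors.dense(1.9, 0.8, -0.5, 5.0))
)).toDF("label", "features")

// elasticNetParam = 1.0 -> pure L1, which drives some coefficients to exactly zero.
val model = new LogisticRegression().setRegParam(0.3).setElasticNetParam(1.0).fit(df)

println(model.coefficients)  // the vector length still equals the number of features (4 here)
val nonZero = model.coefficients.toArray.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }
println(nonZero.mkString(", "))  // manually extracted (featureIndex, coefficient) pairs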

Thanks
Yanbo

On Sun, Mar 19, 2017 at 5:02 AM, Dhanesh Padmanabhan  wrote:

> It shouldn't be difficult to convert the coefficients to a sparse vector.
> Not sure if that is what you are looking for
>
> -Dhanesh
>
> On Sun, Mar 19, 2017 at 5:02 PM jinhong lu  wrote:
>
> Thanks Dhanesh,  and how about the features question?
>
> 在 2017年3月19日,19:08,Dhanesh Padmanabhan  写道:
>
> Dhanesh
>
>
> Thanks,
> lujinhong
>
> --
> Dhanesh
> +91-9741125245
>


Re: How does preprocessing fit into Spark MLlib pipeline

2017-03-17 Thread Yanbo Liang
Hi Adrian,

Did you try SQLTransformer? Your preprocessing steps are SQL operations and
can be handled by SQLTransformer in MLlib pipeline scope.
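
For instance, a grouped count like your step 2 can be phrased as a
SQLTransformer stage; a minimal Scala sketch with toy data (column names are
assumptions), where __THIS__ stands for whatever DataFrame flows into the
stage:

import org.apache.spark.ml.feature.SQLTransformer

// Toy event log; "spark" is the SparkSession.
val events = spark.createDataFrame(Seq(
  ("u1", "click"), ("u1", "view"), ("u1", "click"), ("u2", "view")
)).toDF("user_id", "event_type")

// An aggregating transformer that can be placed inside a Pipeline.
val counter = new SQLTransformer().setStatement(
  "SELECT user_id, event_type, COUNT(*) AS cnt FROM __THIS__ GROUP BY user_id, event_type")

counter.transform(events).show()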

Thanks
Yanbo

On Thu, Mar 9, 2017 at 11:02 AM, aATv  wrote:

> I want to start using PySpark Mllib pipelines, but I don't understand
> how/where preprocessing fits into the pipeline.
>
> My preprocessing steps are generally in the following form:
>1) Load log files(from s3) and parse into a spark Dataframe with columns
> user_id, event_type, timestamp, etc
>2) Group by a column, then pivot and count another column
>   - e.g. df.groupby("user_id").pivot("event_type").count()
>   - We can think of the columns that this creates besides user_id as
> features, where the number of each event type is a different feature
>3) Join the data from step 1 with other metadata, usually stored in
> Cassandra. Then perform a transformation similar to one from step 2), where
> the column that is pivoted and counted is a column that came from the data
> stored in Cassandra.
>
> After this preprocessing, I would use transformers to create other features
> and feed it into a model, lets say Logistic Regression for example.
>
> I would like to make at lease step 2 a custom transformer and add that to a
> pipeline, but it doesn't fit the transformer abstraction. This is because
> it
> takes a single input column and outputs multiple columns.  It also has a
> different number of input rows than output rows due to the group by
> operation.
>
> Given that, how do I fit this into a Mllib pipeline, and it if doesn't fit
> as part of a pipeline, what is the best way to include it in my code so
> that
> it can easily be reused both for training and testing, as well as in
> production.
>
> I'm using pyspark 2.1 and here is an example of 2)
>
>
>
>
> Note: My question is in some way related to this question, but I don't
> think
> it is answered here:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-can-t-a-
> Transformer-have-multiple-output-columns-td18689.html
>  Transformer-have-multiple-output-columns-td18689.html>
>
> Thanks
> Adrian
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/How-does-preprocessing-fit-into-
> Spark-MLlib-pipeline-tp28473.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: ML PIC

2016-12-21 Thread Yanbo Liang
You can track https://issues.apache.org/jira/browse/SPARK-15784 for the
progress.

On Wed, Dec 21, 2016 at 7:08 AM, Nick Pentreath 
wrote:

> It is part of the general feature parity roadmap. I can't recall offhand
> any blocker reasons it's just resources
> On Wed, 21 Dec 2016 at 17:05, Robert Hamilton <
> robert_b_hamil...@icloud.com> wrote:
>
>> Hi all.  Is it on the roadmap to have an 
>> Spark.ml.clustering.PowerIterationClustering?
>> Are there technical reasons that there is currently only an .mllib version?
>>
>>
>> Sent from my iPhone
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: Usage of mllib api in ml

2016-11-20 Thread Yanbo Liang
You can refer to this example
(http://spark.apache.org/docs/latest/ml-tuning.html#example-model-selection-via-cross-validation),
which uses BinaryClassificationEvaluator, and it should be very
straightforward to switch to MulticlassClassificationEvaluator.

Thanks
Yanbo

On Sat, Nov 19, 2016 at 9:03 AM, janardhan shetty 
wrote:

> Hi,
>
> I am trying to use the evaluation metrics offered by mllib
> multiclassmetrics in ml dataframe setting.
> Is there any examples how to use it?
>


Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-19 Thread Yanbo Liang
Hi Russell,

Do you want to use RowMatrix.columnSimilarities to calculate cosine
similarities?
If so, you can use the following steps:

val dataset: DataFrame
// Convert the type of features column from ml.linalg.Vector type to
mllib.linalg.Vector
val oldDataset: DataFrame = MLUtils.convertVectorColumnsFromML(dataset,
"features")
// Convert fromt DataFrame to RDD
val oldRDD: RDD[mllib.linalg.Vector] =
oldDataset.select(col("features")).rdd.map { row =>
row.getAs[mllib.linalg.Vector](0) }
// Generate RowMatrix
val mat: RowMatrix = new RowMatrix(oldRDD)
mat.columnSimilarities()

Please feel free to let me know whether it can satisfy your requirements.


Thanks
Yanbo

On Wed, Nov 16, 2016 at 9:26 AM, Russell Jurney 
wrote:

> Asher, can you cast like that? Does that casting work? That is my
> confusion: I don't know what a DataFrame Vector turns into in terms of an
> RDD type.
>
> I'll try this, thanks.
>
> On Tue, Nov 15, 2016 at 11:25 AM, Asher Krim  wrote:
>
>> What language are you using? For Java, you might convert the dataframe to
>> an rdd using something like this:
>>
>> df
>> .toJavaRDD()
>> .map(row -> (SparseVector)row.getAs(row.fieldIndex("columnName")));
>>
>> On Tue, Nov 15, 2016 at 1:06 PM, Russell Jurney > > wrote:
>>
>>> I have two dataframes with common feature vectors and I need to get the
>>> cosine similarity of one against the other. It looks like this is possible
>>> in the RDD based API, mllib, but not in ml.
>>>
>>> So, how do I convert my sparse dataframe vectors into something spark
>>> mllib can use? I've searched, but haven't found anything.
>>>
>>> Thanks!
>>> --
>>> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
>>>
>>
>>
>>
>> --
>> Asher Krim
>> Senior Software Engineer
>>
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com relato.io
>


Re: VectorUDT and ml.Vector

2016-11-19 Thread Yanbo Liang
The reason behind this error can be inferred from the error log:
*MLUtils.convertMatrixColumnsFromML* was used to convert ml.linalg.Matrix
to mllib.linalg.Matrix,
but it looks like the column type is ml.linalg.Vector in your case.
Could you check the type of column "features" in your dataframe (Vector or
Matrix)? I think it's ml.linalg.Vector, so you should use
*MLUtils.convertVectorColumnsFromML*.

Thanks
Yanbo


On Mon, Nov 7, 2016 at 5:25 AM, Ganesh  wrote:

> I am trying to run a SVD on a dataframe and I have used ml TF-IDF which
> has created a dataframe.
> Now for Singular Value Decomposition I am trying to use RowMatrix which
> takes in RDD with mllib.Vector so I have to convert this Dataframe with
> what I assumed was ml.Vector
>
> However the conversion
>
> *val convertedTermDocMatrix =
> MLUtils.convertMatrixColumnsFromML(termDocMatrix,"features")*
>
> fails with
>
> java.lang.IllegalArgumentException: requirement failed: Column features
> must be new Matrix type to be converted to old type but got
> org.apache.spark.ml.linalg.VectorUDT
>
>
> So the question is: How do I perform SVD on a DataFrame? I assume all the
> functionalities of mllib has not be ported to ml.
>
>
> I tried to convert my entire project to use RDD but computeSVD on
> RowMatrix is throwing up out of Memory errors and anyway I would like to
> stick with DataFrame.
>
> Our text corpus is around 55 Gb of text data.
>
>
>
> Ganesh
>


Re: why is method predict protected in PredictionModel

2016-11-19 Thread Yanbo Liang
This function is currently only used internally; we will expose it as public
to support making predictions on single instances.
See discussion at https://issues.apache.org/jira/browse/SPARK-10413.

Thanks
Yanbo

On Thu, Nov 17, 2016 at 1:24 AM, wobu  wrote:

> Hi,
>
> we were using Spark 1.3.1 for a long time and now we want to upgrade to 2.0
> release.
> So we used till today the mllib package and the RDD API.
>
> Now im trying to refactor our mllib NaiveBayesClassifier to the new "ml"
> api.
>
> *The Question:*
> why is the method "predict" in the class "PredictionModel" of package "ml"
> protected?
>
> I am trying to find a way to load a saved prediction model and predict
> values without having to use the pipline or transformer api.
>
>
> Best regards
>
> wobu
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/why-is-method-predict-protected-in-
> PredictionModel-tp28095.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark R guidelines for non-spark functions and coxph (Cox Regression for Time-Dependent Covariates)

2016-11-16 Thread Yanbo Liang
Hi Pietro,

Actually we have implemented the counterpart of R's survreg() in Spark: the
accelerated failure time (AFT) model. You can refer to AFTSurvivalRegression
if you use Scala/Java/Python. For SparkR users, you can try spark.survreg().
The algorithm is completely distributed and returns the same solution as
native R survreg().
Since the Cox regression model is not easy to train in a distributed way, we
chose to implement survreg() rather than coxph() for Spark. However, the
coefficients of the AFT survival regression model are related to those of the
Cox regression model. You can check whether spark.survreg() satisfies your
requirements.
BTW, I'm the author of Spark's AFTSurvivalRegression. If you have any more
questions, please feel free to let me know.

http://spark.apache.org/docs/latest/ml-classification-regression.html#survival-regression
http://spark.apache.org/docs/latest/api/R/index.html
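
A minimal Scala sketch of the AFT model, adapted from the survival-regression
docs (toy data; censor = 1.0 means the event occurred, 0.0 means the
observation is censored):

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.AFTSurvivalRegression

// Toy data; "spark" is the SparkSession.
val training = spark.createDataFrame(Seq(
  (1.218, 1.0, Vectors.dense(1.560, -0.605)),
  (2.949, 0.0, Vectors.dense(0.346, 2.158)),
  (3.627, 0.0, Vectors.dense(1.380, 0.231)),
  (0.273, 1.0, Vectors.dense(0.520, 1.151)),
  (4.199, 0.0, Vectors.dense(0.795, -0.226))
)).toDF("label", "censor", "features")

val aft = new AFTSurvivalRegression()
  .setQuantileProbabilities(Array(0.3, 0.6))
  .setQuantilesCol("quantiles")

val model = aft.fit(training)
println(s"coefficients=${model.coefficients} intercept=${model.intercept} scale=${model.scale}")
model.transform(training).show(false)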

Thanks
Yanbo

On Tue, Nov 15, 2016 at 9:46 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I think the answer to this depends on what granularity you want to run
> the algorithm on. If its on the entire Spark DataFrame and if you
> except the data frame to be very large then it isn't easy to use the
> existing R function. However if you want to run the algorithm on
> smaller subsets of the data you can look at the support for UDFs we
> have in SparkR at
> http://spark.apache.org/docs/latest/sparkr.html#applying-
> user-defined-function
>
> Thanks
> Shivaram
>
> On Tue, Nov 15, 2016 at 3:56 AM, pietrop  wrote:
> > Hi all,
> > I'm writing here after some intensive usage on pyspark and SparkSQL.
> > I would like to use a well known function in the R world: coxph() from
> the
> > survival package.
> > From what I understood, I can't parallelize a function like coxph()
> because
> > it isn't provided with the SparkR package. In other words, I should
> > implement a SparkR compatible algorithm instead of using coxph().
> > I have no chance to make coxph() parallelizable, right?
> > More generally, I think this is true for any non-spark function which
> only
> > accept data.frame format as the data input.
> >
> > Do you plan to implement the coxph() counterpart in Spark? The most
> useful
> > version of this model is the Cox Regression Model for Time-Dependent
> > Covariates, which is missing from ANY ML framework as far as I know.
> >
> > Thank you
> >  Pietro
> >
> >
> >
> > --
> > View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-R-guidelines-for-non-spark-
> functions-and-coxph-Cox-Regression-for-Time-Dependent-
> Covariates-tp28077.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: HashingTF for TF.IDF computation

2016-10-23 Thread Yanbo Liang
HashingTF was not designed to handle your case; you can try CountVectorizer,
which keeps the original terms as a vocabulary for retrieval.
CountVectorizer computes a global term-to-index map, which can be
expensive for a large corpus and carries a risk of OOM. IDF can accept
feature vectors generated by either HashingTF or CountVectorizer.
FYI http://spark.apache.org/docs/latest/ml-features.html#tf-idf
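
A minimal Scala sketch with the DataFrame-based API and toy documents (an
assumption): the fitted CountVectorizerModel keeps the vocabulary, so every
index in a TF or TF-IDF vector maps back to its term.

import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// Toy documents, already tokenized; "spark" is the SparkSession.
val docs = spark.createDataFrame(Seq(
  (0, Seq("mars", "jupiter")),
  (1, Seq("venus", "mars"))
)).toDF("id", "terms")

val cvModel = new CountVectorizer().setInputCol("terms").setOutputCol("tf").fit(docs)
val tf = cvModel.transform(docs)

val idfModel = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(tf)
val tfidf = idfModel.transform(tf)

// cvModel.vocabulary(i) is the term behind index i of every tf / tfidf vector.
println(cvModel.vocabulary.mkString(", "))
tfidf.select("terms", "tfidf").show(false)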

Thanks
Yanbo

On Thu, Oct 20, 2016 at 10:00 AM, Ciumac Sergiu 
wrote:

> Hello everyone,
>
> I'm having a usage issue with HashingTF class from Spark MLLIB.
>
> I'm computing TF.IDF on a set of terms/documents which later I'm using to
> identify most important ones in each of the input document.
>
> Below is a short code snippet which outlines the example (2 documents with
> 2 words each, executed on Spark 2.0).
>
> val documentsToEvaluate = sc.parallelize(Array(Seq("Mars", 
> "Jupiter"),Seq("Venus", "Mars")))
> val hashingTF = new HashingTF()
> val tf = hashingTF.transform(documentsToEvaluate)
> tf.cache()
> val idf = new IDF().fit(tf)
> val tfidf: RDD[Vector] = idf.transform(tf)
> documentsToEvaluate.zip(tfidf).saveAsTextFile("/tmp/tfidf")
>
> The computation yields to the following result:
>
> (List(Mars, Jupiter),(1048576,[593437,962819],[0.4054651081081644,0.0]))
> (List(Venus, Mars),(1048576,[798918,962819],[0.4054651081081644,0.0]))
>
> My concern is that I can't get a mapping of TF.IDF weights an initial
> terms used (i.e. Mars : 0.0, Jupiter : 0.4, Venus : 0.4. You may notice
> that the weight and terms indices do not correspond after zipping 2
> sequences). I can only identify the hash (i.e. 593437 : 0.4) mappings.
>
> I know HashingTF uses the hashing trick to compute TF. My question is it
> possible to retrieve terms / weights mapping, or HashingTF was not designed
> to handle this use-case. If latter, what other implementation of TF.IDF you
> may recommend.
>
> I may continue the computation with the (*hash:weight*) tuple, though
> getting initial (*term:weight)* would result in a lot easier debugging
> steps later down the pipeline.
>
> Any response will be greatly appreciated!
>
> Regards, Sergiu Ciumac
>


Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-17 Thread Yanbo Liang
Please increase the value of "maxMemoryInMB" on your
RandomForestClassifier or RandomForestRegressor.
It's a warning which will not affect the result, but it may make your
training slower.
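
For example, a one-line sketch (the rest of the estimator setup is assumed):

import org.apache.spark.ml.classification.RandomForestClassifier

// Raise the per-iteration memory budget so more nodes can be split at once (the default is 256 MB).
val rf = new RandomForestClassifier()
  .setNumTrees(100)
  .setMaxMemoryInMB(1024)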

Thanks
Yanbo

On Mon, Oct 17, 2016 at 8:21 PM, 张建鑫(市场部) 
wrote:

> Hi Xi Shen
>
> The warning message wasn’t  removed after I had upgraded my java to V8,
> but  anyway I appreciate your kind help.
>
> Since it’s just a WARN, I suppose I can bear with it and nothing bad would
> really happen. Am I right?
>
>
> 6/10/18 11:12:42 WARN RandomForest: Tree learning is using approximately
> 268437864 bytes per iteration, which exceeds requested limit
> maxMemoryUsage=268435456. This allows splitting 80088 nodes in this
> iteration.
> 16/10/18 11:13:07 WARN RandomForest: Tree learning is using approximately
> 268436304 bytes per iteration, which exceeds requested limit
> maxMemoryUsage=268435456. This allows splitting 80132 nodes in this
> iteration.
> 16/10/18 11:13:32 WARN RandomForest: Tree learning is using approximately
> 268437816 bytes per iteration, which exceeds requested limit
> maxMemoryUsage=268435456. This allows splitting 80082 nodes in this
> iteration.
>
>
>
> From: zhangjianxin 
> Date: Monday, October 17, 2016 at 8:16 PM
> To: Xi Shen 
> Cc: "user@spark.apache.org" 
> Subject: Re: Did anybody come across this random-forest issue with spark 2.0.1.
>
> Hi Xi Shen
>
> Not yet. For the moment my JDK for Spark is still v7. Thanks for the
> reminder, I will try it out by upgrading Java.
>
> From: Xi Shen 
> Date: Monday, October 17, 2016 at 8:00 PM
> To: zhangjianxin , "user@spark.apache.org" <
> user@spark.apache.org>
> Subject: Re: Did anybody come across this random-forest issue with spark 2.0.1.
>
> Did you also upgrade to Java from v7 to v8?
>
> On Mon, Oct 17, 2016 at 7:19 PM 张建鑫(市场部) 
> wrote:
>
>>
>> Did anybody encounter this problem before and why it happens , how to
>> solve it?  The same training data and same source code work in 1.6.1,
>> however become lousy in 2.0.1
>>
>> --
>
>
> Thanks,
> David S.
>


Re: Logistic Regression Standardization in ML

2016-10-10 Thread Yanbo Liang
AFAIK, we can guarantee that with or without standardization the models always
converge to the same solution when there is no regularization. You can refer
to the test cases at:

https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L551


https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala#L588

Thanks
Yanbo

On Mon, Oct 10, 2016 at 7:27 AM, Sean Owen  wrote:

> (BTW I think it means "when no standardization is applied", which is how
> you interpreted it, yes.) I think it just means that if feature i is
> divided by s_i, then its coefficients in the resulting model will end up
> larger by a factor of s_i. They have to be divided by s_i to put them back
> on the same scale as the unnormalized inputs. I don't think that in general
> it will result in exactly the same model, because part of the point of
> standardizing is to improve convergence. You could propose a rewording of
> the two occurrences of this paragraph if you like.
>
> On Mon, Oct 10, 2016 at 3:15 PM Cesar  wrote:
>
>>
>> I have a question regarding how the default standardization in the ML
>> version of the Logistic Regression (Spark 1.6) works.
>>
>> Specifically about the next comments in the Spark Code:
>>
>> /**
>> * Whether to standardize the training features before fitting the model.
>> * The coefficients of models will be always returned on the original
>> scale,
>> * so it will be transparent for users. *Note that with/without
>> standardization,*
>> ** the models should be always converged to the same solution when no
>> regularization*
>> ** is applied.* In R's GLMNET package, the default behavior is true as
>> well.
>> * Default is true.
>> *
>> * @group setParam
>> */
>>
>>
>> Specifically I am having issues with understanding why the solution
>> should converge to the same weight values with/without standardization ?
>>
>>
>>
>> Thanks !
>> --
>> Cesar Flores
>>
>


Re: SVD output within Spark

2016-08-31 Thread Yanbo Liang
The signs of the singular vectors are essentially arbitrary, so both the
Spark result and the Matlab result are right.

Thanks

On Thu, Jul 21, 2016 at 3:50 PM, Martin Somers  wrote:

>
> just looking at a comparision between Matlab and Spark for svd with an
> input matrix N
>
>
> this is matlab code - yes very small matrix
>
> N =
>
>  2.5903   -0.0416    0.6023
> -0.1236    2.5596    0.7629
>  0.0148   -0.0693    0.2490
>
>
>
> U =
>
> -0.3706   -0.9284    0.0273
> -0.9287    0.3708    0.0014
> -0.0114   -0.0248   -0.9996
>
> 
> Spark code
>
> // Breeze to spark
> val N1D = N.reshape(1, 9).toArray
>
>
> // Note I had to transpose array to get correct values with incorrect signs
> val V2D = N1D.grouped(3).toArray.transpose
>
>
> // Then convert the array into a RDD
> // val NVecdis = Vectors.dense(N1D.map(x => x.toDouble))
> // val V2D = N1D.grouped(3).toArray
>
>
> val rowlocal = V2D.map{x => Vectors.dense(x)}
> val rows = sc.parallelize(rowlocal)
> val mat = new RowMatrix(rows)
> val svd = mat.computeSVD(mat.numCols().toInt, computeU=true)
>
> 
>
> Spark Output - notice the change in sign on the 2nd and 3rd column
> -0.3158590633523746   0.9220516369164243   -0.22372713505049768
> -0.8822050381939436   -0.3721920780944116  -0.28842213436035985
> -0.34920956843045253  0.10627246051309004  0.9309988407367168
>
>
>
> And finally some julia code
> N  = [2.59031  -0.0416335  0.602295;
> -0.123584  2.55964  0.762906;
> 0.0148463  -0.0693119  0.249017]
>
> svd(N, thin=true)   --- same as matlab
> -0.315859  -0.922052   0.223727
> -0.882205   0.372192   0.288422
> -0.34921   -0.106272  -0.930999
>
> Most likely its an issue with my implementation rather than being a bug
> with svd within the spark environment
> My spark instance is running locally with a docker container
> Any suggestions
> tks
>
>


Re: Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-18 Thread Yanbo Liang
It looks like you mixed the use of ALS from the spark.ml and spark.mllib
packages. You can train the model with either one, but you should use the
corresponding save/load functions.
You cannot train and save the model with one package's ALS and then load it
with the other package's API; that will throw exceptions.

I saw you use a Pipeline to train ALS under the spark.ml package. Then you
should use PipelineModel.load to read the model back and get the corresponding
stage in the pipeline as the ALSModel.

We strongly recommend you use the spark.ml package, which is the primary
API of MLlib. The spark.mllib package is in maintenance mode, so do all
your work under the same API.

Thanks
Yanbo

2016-08-17 1:30 GMT-07:00 :

> Hello guys:
>  I have a problem in loading recommend model. I have 2 models, one is
> good(able to get recommend result) and another is not working. I checked
> these 2 models, both are  MatrixFactorizationModel object. But in the
> metadata, one is a PipelineModel and another is a MatrixFactorizationModel.
> Is below exception caused by this?
>
> here are my stack trace:
> Exception in thread "main" org.json4s.package$MappingException: Did not
> find value which can be converted into java.lang.String
> at org.json4s.reflect.package$.fail(package.scala:96)
> at org.json4s.Extraction$.convert(Extraction.scala:554)
> at org.json4s.Extraction$.extract(Extraction.scala:331)
> at org.json4s.Extraction$.extract(Extraction.scala:42)
> at org.json4s.ExtractableJsonAstNode.extract(
> ExtractableJsonAstNode.scala:21)
> at org.apache.spark.mllib.util.Loader$.loadMetadata(
> modelSaveLoad.scala:131)
> at org.apache.spark.mllib.recommendation.
> MatrixFactorizationModel$.load(MatrixFactorizationModel.scala:330)
> at org.brave.spark.ml.RecommandForMultiUsers$.main(
> RecommandForMultiUsers.scala:55)
> at org.brave.spark.ml.RecommandForMultiUsers.main(
> RecommandForMultiUsers.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$
> deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(
> SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(
> SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.
> scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> The attached files are my codes, FYI.
> RecommandForMultiUsers.scala:55 is :
> val model = MatrixFactorizationModel.load(sc, modelpath)
>
>
> 
>
> ThanksBest regards!
> San.Luo
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-18 Thread Yanbo Liang
If you want to tie them to other data, I think the best way is to use a
DataFrame join, on the condition that they share an identity column.
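For example (a sketch; the key columns below are taken from your earlier mail,
and any unique key would do):

// Assuming the prediction results have been put in a DataFrame that still
// carries the identity columns (here meter_number + date_read), the remaining
// dimensions can be attached with a plain join:
val result = predictionsDF.join(otherDimensionsDF, Seq("meter_number", "date_read"))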

Thanks
Yanbo

2016-08-16 20:39 GMT-07:00 ayan guha <guha.a...@gmail.com>:

> Hi
>
> Thank you for your reply. Yes, I can get prediction and original features
> together. My question is how to tie them back to other parts of the data,
> which was not in LP.
>
> For example, I have a bunch of other dimensions which are not part of
> features or label.
>
> Sorry if this is a stupid question.
>
> On Wed, Aug 17, 2016 at 12:57 PM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> MLlib will keep the original dataset during transformation, it just
>> append new columns to existing DataFrame. That is you can get both
>> prediction value and original features from the output DataFrame of
>> model.transform.
>>
>> Thanks
>> Yanbo
>>
>> 2016-08-16 17:48 GMT-07:00 ayan guha <guha.a...@gmail.com>:
>>
>>> Hi
>>>
>>> I have a dataset as follows:
>>>
>>> DF:
>>> amount:float
>>> date_read:date
>>> meter_number:string
>>>
>>> I am trying to predict future amount based on past 3 weeks consumption
>>> (and a heaps of weather data related to date).
>>>
>>> My Labelpoint looks like
>>>
>>> label (populated from DF.amount)
>>> features (populated from a bunch of other stuff)
>>>
>>> Model.predict output:
>>> label
>>> prediction
>>>
>>> Now, I am trying to put together this prediction value back to meter
>>> number and date_read from original DF?
>>>
>>> One way to assume order of records in DF and Model.predict will be
>>> exactly same and zip two RDDs. But any other (possibly better) solution?
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: VectorUDT with spark.ml.linalg.Vector

2016-08-18 Thread Yanbo Liang
@Michal

Yes, we have a public VectorUDT in the spark.mllib package in 1.6, and this class
still exists in 2.0.
From 2.0 we also provide a new VectorUDT in the spark.ml package, which is kept
private temporarily (it will be public in the near future).
Since the spark.mllib package is in maintenance mode from 2.0 onwards, we
strongly recommend users to use the DataFrame-based spark.ml API.

Thanks
Yanbo

2016-08-17 11:46 GMT-07:00 Michał Zieliński <zielinski.mich...@gmail.com>:

> I'm using Spark 1.6.2 for Vector-based UDAF and this works:
>
> def inputSchema: StructType = new StructType().add("input", new
> VectorUDT())
>
> Maybe it was made private in 2.0
>
> On 17 August 2016 at 05:31, Alexey Svyatkovskiy <alex...@princeton.edu>
> wrote:
>
>> Hi Yanbo,
>>
>> Thanks for your reply. I will keep an eye on that pull request.
>> For now, I decided to just put my code inside org.apache.spark.ml to be
>> able to access private classes.
>>
>> Thanks,
>> Alexey
>>
>> On Tue, Aug 16, 2016 at 11:13 PM, Yanbo Liang <yblia...@gmail.com> wrote:
>>
>>> It seams that VectorUDT is private and can not be accessed out of Spark
>>> currently. It should be public but we need to do some refactor before make
>>> it public. You can refer the discussion at https://github.com/apache/s
>>> park/pull/12259 .
>>>
>>> Thanks
>>> Yanbo
>>>
>>> 2016-08-16 9:48 GMT-07:00 alexeys <alex...@princeton.edu>:
>>>
>>>> I am writing an UDAF to be applied to a data frame column of type Vector
>>>> (spark.ml.linalg.Vector). I rely on spark/ml/linalg so that I do not
>>>> have to
>>>> go back and forth between dataframe and RDD.
>>>>
>>>> Inside the UDAF, I have to specify a data type for the input, buffer,
>>>> and
>>>> output (as usual). VectorUDT is what I would use with
>>>> spark.mllib.linalg.Vector:
>>>> https://github.com/apache/spark/blob/master/mllib/src/main/s
>>>> cala/org/apache/spark/mllib/linalg/Vectors.scala
>>>>
>>>> However, when I try to import it from spark.ml instead: import
>>>> org.apache.spark.ml.linalg.VectorUDT
>>>> I get a runtime error (no errors during the build):
>>>>
>>>> class VectorUDT in package linalg cannot be accessed in package
>>>> org.apache.spark.ml.linalg
>>>>
>>>> Is it expected/can you suggest a workaround?
>>>>
>>>> I am using Spark 2.0.0
>>>>
>>>> Thanks,
>>>> Alexey
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context: http://apache-spark-user-list.
>>>> 1001560.n3.nabble.com/VectorUDT-with-spark-ml-linalg-Vector-
>>>> tp27542.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>>
>>>
>>
>


Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Yanbo Liang
It seems that VectorUDT is private and cannot be accessed outside of Spark
currently. It should be public, but we need to do some refactoring before
making it public. You can refer to the discussion at
https://github.com/apache/spark/pull/12259 .

Thanks
Yanbo

2016-08-16 9:48 GMT-07:00 alexeys :

> I am writing an UDAF to be applied to a data frame column of type Vector
> (spark.ml.linalg.Vector). I rely on spark/ml/linalg so that I do not have
> to
> go back and forth between dataframe and RDD.
>
> Inside the UDAF, I have to specify a data type for the input, buffer, and
> output (as usual). VectorUDT is what I would use with
> spark.mllib.linalg.Vector:
> https://github.com/apache/spark/blob/master/mllib/src/
> main/scala/org/apache/spark/mllib/linalg/Vectors.scala
>
> However, when I try to import it from spark.ml instead: import
> org.apache.spark.ml.linalg.VectorUDT
> I get a runtime error (no errors during the build):
>
> class VectorUDT in package linalg cannot be accessed in package
> org.apache.spark.ml.linalg
>
> Is it expected/can you suggest a workaround?
>
> I am using Spark 2.0.0
>
> Thanks,
> Alexey
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/VectorUDT-with-spark-ml-linalg-Vector-tp27542.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread Yanbo Liang
MLlib keeps the original dataset during transformation; it just appends
new columns to the existing DataFrame. That is, you can get both the prediction
value and the original features from the output DataFrame of model.transform.
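For instance (a sketch; the column names are assumptions based on your schema):

val output = model.transform(testDF)
// The original columns such as meter_number and date_read are still present,
// alongside the appended prediction column
output.select("meter_number", "date_read", "prediction").show()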

Thanks
Yanbo

2016-08-16 17:48 GMT-07:00 ayan guha :

> Hi
>
> I have a dataset as follows:
>
> DF:
> amount:float
> date_read:date
> meter_number:string
>
> I am trying to predict future amount based on past 3 weeks consumption
> (and a heaps of weather data related to date).
>
> My Labelpoint looks like
>
> label (populated from DF.amount)
> features (populated from a bunch of other stuff)
>
> Model.predict output:
> label
> prediction
>
> Now, I am trying to put together this prediction value back to meter
> number and date_read from original DF?
>
> One way to assume order of records in DF and Model.predict will be exactly
> same and zip two RDDs. But any other (possibly better) solution?
>
> --
> Best Regards,
> Ayan Guha
>


Re: Spark's Logistic Regression runs unstable on Yarn cluster

2016-08-16 Thread Yanbo Liang
Could you check the log to see how many iterations your LoR runs? Does
your program output the same model across different attempts?

Thanks
Yanbo

2016-08-12 3:08 GMT-07:00 olivierjeunen :

> I'm using pyspark ML's logistic regression implementation to do some
> classification on an AWS EMR Yarn cluster.
>
> The cluster consists of 10 m3.xlarge nodes and is set up as follows:
> spark.driver.memory 10g, spark.driver.cores  3 , spark.executor.memory 10g,
> spark.executor-cores 4.
>
> I enabled yarn's dynamic allocation abilities.
>
> The problem is that my results are way unstable. Sometimes my application
> finishes using 13 executors total, sometimes all of them seem to die and
> the
> application ends up using anywhere between 100 and 200...
>
> Any insight on what could cause this stochastic behaviour would be greatly
> appreciated.
>
> The code used to run the logistic regression:
>
> data = spark.read.parquet(storage_path).repartition(80)
> lr = LogisticRegression()
> lr.setMaxIter(50)
> lr.setRegParam(0.063)
> evaluator = BinaryClassificationEvaluator()
> lrModel = lr.fit(data.filter(data.test == 0))
> predictions = lrModel.transform(data.filter(data.test == 1))
> auROC = evaluator.evaluate(predictions)
> print "auROC on test set: ", auROC
> Data is a dataframe of roughly 2.8GB
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-s-Logistic-Regression-runs-
> unstable-on-Yarn-cluster-tp27520.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Linear regression, weights constraint

2016-08-16 Thread Yanbo Liang
Spark MLlib does not currently support box constraints on model
coefficients.

Thanks
Yanbo

2016-08-15 3:53 GMT-07:00 letaiv :

> Hi all,
>
> Is there any approach to add constrain for weights in linear regression?
> What I need is least squares regression with non-negative constraints on
> the
> coefficients/weights.
>
> Thanks in advance.
>
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Linear-regression-weights-constraint-tp27535.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: using matrix as column datatype in SparkSQL Dataframe

2016-08-10 Thread Yanbo Liang
A good way is to implement your own data source to load data in your matrix
format. You can refer to the LibSVM data source (
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/ml/source/libsvm),
which produces one column of vector type, which is very similar to the matrix case.

Thanks
Yanbo

2016-08-08 11:06 GMT-07:00 Vadla, Karthik :

> Hello all,
>
>
>
> I'm trying to load set of medical images(dicom) into spark SQL dataframe.
> Here each image is loaded into matrix column of dataframe. I see spark
> recently added MatrixUDT to support this kind of cases, but i don't find a
> sample for using matrix as column in dataframe.
>
> https://github.com/apache/spark/blob/master/mllib/src/
> main/scala/org/apache/spark/ml/linalg/MatrixUDT.scala
>
> Can anyone help me with this.
>
> Really appreciate your help.
>
> Thanks
>
> Karthik Vadla
>
>
>


Re: Random forest binary classification H20 difference Spark

2016-08-10 Thread Yanbo Liang
Hi Samir,

Did you use VectorAssembler to assemble some columns into the feature
column? If there are NULLs in your dataset, VectorAssembler will throw this
exception. You can use DataFrame.na.drop() or DataFrame.na.replace() to
drop or substitute the NULL values.

Thanks
Yanbo

2016-08-07 19:51 GMT-07:00 Javier Rey :

> Hi everybody.
>
> I have executed RF on H2O I didn't troubles with nulls values, by in
> contrast in Spark using dataframes and ML library I obtain this error,l I
> know my dataframe contains nulls, but I understand that Random Forest
> supports null values:
>
> "Values to assemble cannot be null"
>
> Any advice, that framework can handle this issue?.
>
> Regards,
>
> Samir
>


Re: Logistic regression formula string

2016-08-10 Thread Yanbo Liang
I think you can output the schema of the DataFrame which will be fed into the
estimator, such as LogisticRegression. The attribute names stored for the feature
column are the encoded feature names corresponding to the coefficients of the model.
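A minimal sketch of reading those names back out of the column metadata (assuming
the assembled vector column is called "features" and an already fitted spark.ml
LogisticRegressionModel named lrModel; both names are placeholders):

import org.apache.spark.ml.attribute.AttributeGroup

// The ML attributes stored in the column metadata hold one name per vector slot
val group = AttributeGroup.fromStructField(trainingDF.schema("features"))
val featureNames = group.attributes.get.map(_.name.get)

// Pair each encoded feature name with its fitted coefficient
featureNames.zip(lrModel.coefficients.toArray).foreach { case (name, coef) =>
  println(s"$name: $coef")
}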

Thanks
Yanbo

2016-08-08 15:53 GMT-07:00 Cesar :

>
> I have a data frame with four columns, label , feature_1, feature_2,
> feature_3. Is there a simple way in the ML library to give me the weights
> based in feature names? I can only get the weights, which make this simple
> task complicated when one of my features is categorical.
>
> I am looking for something similar to what R output does (where it clearly
> indicates which weight corresponds to each feature name, including
> categorical ones).
>
>
>
> Thanks a lot !
> --
> Cesar Flores
>


Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
Hi Hao,

HashingTF directly applies a hash function (MurmurHash3) to the features to
determine their column index. It does not take the term frequency or the length
of the document into account. It does similar work to sklearn's FeatureHasher.
The result is increased speed and reduced memory usage, but it does not remember
what the input features looked like and cannot convert the output back to the
original features. Actually we misnamed this transformer; it only does the work
of feature hashing rather than computing a hashed term frequency.

CountVectorizer will select the top vocabSize words, ordered by term
frequency across the corpus, to build the vocabulary of the features. So it
will consume more memory than HashingTF. However, its output can be converted
back to the original features.

Neither of the transformers considers the length of each document. If
you want to compute term frequency divided by the length of the document,
you should write your own function based on the transformers provided by MLlib.
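As a sketch of such a function (Spark 2.0 spark.ml types assumed; the docs
DataFrame with a "text" column and all column names are made up), you can divide
the raw counts produced by HashingTF by the number of tokens in each document:

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, size, udf}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawTF")

val rawTF = hashingTF.transform(tokenizer.transform(docs))

// Scale every raw count by the document length to get a relative frequency
val relativeTF = udf { (v: Vector, docLen: Int) =>
  val sv = v.toSparse
  Vectors.sparse(sv.size, sv.indices, sv.values.map(_ / docLen.toDouble))
}
val tf = rawTF.withColumn("tf", relativeTF(col("rawTF"), size(col("words"))))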

Thanks
Yanbo

2016-08-01 15:29 GMT-07:00 Hao Ren :

> When computing term frequency, we can use either HashTF or CountVectorizer
> feature extractors.
> However, both of them just use the number of times that a term appears in
> a document.
> It is not a true frequency. Acutally, it should be divided by the length
> of the document.
>
> Is this a wanted feature ?
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>


Re: K-means Evaluation metrics

2016-07-24 Thread Yanbo Liang
Spark MLlib's KMeansModel provides a "computeCost" method which returns the
sum of squared distances of points to their nearest center, i.e. the k-means
cost on the given dataset.
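For example (a sketch using the RDD-based API; data stands for an RDD[Vector]
and the k / iteration values are placeholders):

import org.apache.spark.mllib.clustering.KMeans

val model = KMeans.train(data, 5, 20)   // k = 5, 20 iterations
// Sum of squared distances of points to their nearest center (WSSSE)
val cost = model.computeCost(data)
println(s"Within Set Sum of Squared Errors = $cost")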

Thanks
Yanbo

2016-07-24 17:30 GMT-07:00 janardhan shetty :

> Hi,
>
> I was trying to evaluate k-means clustering prediction since the exact
> cluster numbers were provided before hand for each data point.
> Just tried the Error = Predicted cluster number - Given number as brute
> force method.
>
> What are the evaluation metrics available in Spark for K-means clustering
> validation to improve?
>


Re: Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread Yanbo Liang
You can refer to this JIRA (https://issues.apache.org/jira/browse/SPARK-14501),
which tracks porting spark.mllib.fpm to spark.ml.

Thanks
Yanbo

2016-07-24 11:18 GMT-07:00 janardhan shetty :

> Is there any implementation of FPGrowth and Association rules in Spark
> Dataframes ?
> We have in RDD but any pointers to Dataframes ?
>


Re: Locality sensitive hashing

2016-07-24 Thread Yanbo Liang
Hi Janardhan,

Please refer to the JIRA (https://issues.apache.org/jira/browse/SPARK-5992)
for the discussion about LSH.

Regards
Yanbo

2016-07-24 7:13 GMT-07:00 Karl Higley :

> Hi Janardhan,
>
> I collected some LSH papers while working on an RDD-based implementation.
> Links at the end of the README here:
> https://github.com/karlhigley/spark-neighbors
>
> Keep me posted on what you come up with!
>
> Best,
> Karl
>
> On Sun, Jul 24, 2016 at 9:54 AM janardhan shetty 
> wrote:
>
>> I was looking through to implement locality sensitive hashing in
>> dataframes.
>> Any pointers for reference?
>>
>


Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Sorry for the wrong link; what you should refer to is jpmml-sparkml (
https://github.com/jpmml/jpmml-sparkml).

Thanks
Yanbo

2016-07-24 4:46 GMT-07:00 Yanbo Liang <yblia...@gmail.com>:

> Spark does not support exporting ML models to PMML currently. You can try
> the third party jpmml-spark (https://github.com/jpmml/jpmml-spark)
> package which supports a part of ML models.
>
> Thanks
> Yanbo
>
> 2016-07-20 11:14 GMT-07:00 Ajinkya Kale <kaleajin...@gmail.com>:
>
>> Just found Google dataproc has a preview of spark 2.0. Tried it and
>> save/load works! Thanks Shuai.
>> Followup question - is there a way to export the pyspark.ml models to
>> PMML ? If not, what is the best way to integrate the model for inference in
>> a production service ?
>>
>> On Tue, Jul 19, 2016 at 8:22 PM Ajinkya Kale <kaleajin...@gmail.com>
>> wrote:
>>
>>> I am using google cloud dataproc which comes with spark 1.6.1. So
>>> upgrade is not really an option.
>>> No way / hack to save the models in spark 1.6.1 ?
>>>
>>> On Tue, Jul 19, 2016 at 8:13 PM Shuai Lin <linshuai2...@gmail.com>
>>> wrote:
>>>
>>>> It's added in not-released-yet 2.0.0 version.
>>>>
>>>> https://issues.apache.org/jira/browse/SPARK-13036
>>>> https://github.com/apache/spark/commit/83302c3b
>>>>
>>>> so i guess you need to wait for 2.0 release (or use the current rc4).
>>>>
>>>> On Wed, Jul 20, 2016 at 6:54 AM, Ajinkya Kale <kaleajin...@gmail.com>
>>>> wrote:
>>>>
>>>>> Is there a way to save a pyspark.ml.feature.PCA model ? I know mllib
>>>>> has that but mllib does not have PCA afaik. How do people do model
>>>>> persistence for inference using the pyspark ml models ? Did not find any
>>>>> documentation on model persistency for ml.
>>>>>
>>>>> --ajinkya
>>>>>
>>>>
>>>>
>


Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Spark does not currently support exporting ML models to PMML. You can try
the third-party jpmml-spark (https://github.com/jpmml/jpmml-spark) package,
which supports a subset of ML models.

Thanks
Yanbo

2016-07-20 11:14 GMT-07:00 Ajinkya Kale :

> Just found Google dataproc has a preview of spark 2.0. Tried it and
> save/load works! Thanks Shuai.
> Followup question - is there a way to export the pyspark.ml models to
> PMML ? If not, what is the best way to integrate the model for inference in
> a production service ?
>
> On Tue, Jul 19, 2016 at 8:22 PM Ajinkya Kale 
> wrote:
>
>> I am using google cloud dataproc which comes with spark 1.6.1. So upgrade
>> is not really an option.
>> No way / hack to save the models in spark 1.6.1 ?
>>
>> On Tue, Jul 19, 2016 at 8:13 PM Shuai Lin  wrote:
>>
>>> It's added in not-released-yet 2.0.0 version.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-13036
>>> https://github.com/apache/spark/commit/83302c3b
>>>
>>> so i guess you need to wait for 2.0 release (or use the current rc4).
>>>
>>> On Wed, Jul 20, 2016 at 6:54 AM, Ajinkya Kale 
>>> wrote:
>>>
 Is there a way to save a pyspark.ml.feature.PCA model ? I know mllib
 has that but mllib does not have PCA afaik. How do people do model
 persistence for inference using the pyspark ml models ? Did not find any
 documentation on model persistency for ml.

 --ajinkya

>>>
>>>


Re: Distributed Matrices - spark mllib

2016-07-24 Thread Yanbo Liang
Hi Gourav,

I cannot reproduce your problem. The following code snippet works well on
my local machine; you can try to verify it in your environment. Otherwise, could
you provide more information so that others can reproduce your problem?

from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
l = [(1, 1, 10), (2, 2, 20), (3, 3, 30)]
df = sqlContext.createDataFrame(l, ['row', 'column', 'value'])
rdd = df.select('row', 'column', 'value').rdd.map(lambda row: MatrixEntry(*row))
mat = CoordinateMatrix(rdd)
mat.entries.collect()

Thanks
Yanbo



2016-07-22 13:14 GMT-07:00 Gourav Sengupta :

> Hi,
>
> I had a sparse matrix and I wanted to add the value of a particular row
> which is identified by a particular number.
>
> from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry
> mat =
> CoordinateMatrix(all_scores_df.select('ID_1','ID_2','value').rdd.map(lambda
> row: MatrixEntry(*row)))
>
>
> This gives me the number or rows and columns. But I am not able to extract
> the values and it always reports back the error:
>
> AttributeError: 'NoneType' object has no attribute 'setCallSite'
>
>
> Thanks and Regards,
>
> Gourav Sengupta
>
>


Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-17 Thread Yanbo Liang
Hi Tobi,

Thanks for clarifying the question. It's very straightforward to convert
the filtered RDD to a DataFrame; you can refer to the following code snippet:

from pyspark.sql import Row

rdd2 = filteredRDD.map(lambda v: Row(features=v))

df = rdd2.toDF()


Thanks
Yanbo

2016-07-16 14:51 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:

> Hi Yanbo,
>
> Appreciate the response. I might not have phrased this correctly, but I
> really wanted to know how to convert the pipeline rdd into a data frame. I
> have seen the example you posted. However I need to transform all my data,
> just not 1 line. So I did sucessfully use map to use the chisq selector to
> filter the chosen features of my data. I just want to convert it to a df so
> I can apply a logistic regression model from spark.ml.
>
> Trust me I would use the dataframes api if I could, but the chisq
> functionality is not available to me in the python spark 1.4 api.
>
> Regards,
> Tobi
>
> On Jul 16, 2016 4:53 AM, "Yanbo Liang" <yblia...@gmail.com> wrote:
>
>> Hi Tobi,
>>
>> The MLlib RDD-based API does support to apply transformation on both
>> Vector and RDD, but you did not use the appropriate way to do.
>> Suppose you have a RDD with LabeledPoint in each line, you can refer the
>> following code snippets to train a ChiSqSelectorModel model and do
>> transformation:
>>
>> from pyspark.mllib.regression import LabeledPoint
>>
>> from pyspark.mllib.feature import ChiSqSelector
>>
>> data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})), 
>> LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})), LabeledPoint(1.0, 
>> [0.0, 9.0, 8.0]), LabeledPoint(2.0, [8.0, 9.0, 5.0])]
>>
>> rdd = sc.parallelize(data)
>>
>> model = ChiSqSelector(1).fit(rdd)
>>
>> filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
>>
>> filteredRDD.collect()
>>
>> However, we strongly recommend you to migrate to DataFrame-based API
>> since the RDD-based API is switched to maintain mode.
>>
>> Thanks
>> Yanbo
>>
>> 2016-07-14 13:23 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:
>>
>>> Hi everyone,
>>>
>>> I am trying to filter my features based on the spark.mllib
>>> ChiSqSelector.
>>>
>>> filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
>>> model.transform(lp.features)))
>>>
>>> However when I do the following I get the error below. Is there any
>>> other way to filter my data to avoid this error?
>>>
>>> filteredDataDF=filteredData.toDF()
>>>
>>> Exception: It appears that you are attempting to reference SparkContext 
>>> from a broadcast variable, action, or transforamtion. SparkContext can only 
>>> be used on the driver, not in code that it run on workers. For more 
>>> information, see SPARK-5063.
>>>
>>>
>>> I would directly use the spark.ml ChiSqSelector and work with dataframes, 
>>> but I am on spark 1.4 and using pyspark. So spark.ml's ChiSqSelector is not 
>>> available to me. filteredData is of type piplelineRDD, if that helps. It is 
>>> not a regular RDD. I think that may part of why calling toDF() is not 
>>> working.
>>>
>>>
>>> Thanks,
>>>
>>> Tobi
>>>
>>>
>>


Re: Feature importance IN random forest

2016-07-16 Thread Yanbo Liang
Spark 1.5 only supports getting feature importances for
RandomForestClassificationModel and RandomForestRegressionModel from Scala;
PySpark does not support this feature until 2.0.0.

It's very straightforward, with a few lines of code:

rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)

model = rf.fit(td)

model.featureImportances

Then you can get the feature importances, which are returned as a Vector.

Thanks
Yanbo

2016-07-12 10:30 GMT-07:00 pseudo oduesp :

> Hi,
>  i use pyspark 1.5.0
> can i  ask you how i can get feature imprtance for a randomforest
> algorithme in pyspark and please give me example
> thanks for advance.
>


Re: bisecting kmeans model tree

2016-07-16 Thread Yanbo Liang
Currently we do not expose APIs to get the Bisecting KMeans tree structure;
they are private within the ml.clustering package scope.
But I think we should make a plan to expose these APIs, like what we did for
Decision Tree.

Thanks
Yanbo

2016-07-12 11:45 GMT-07:00 roni :

> Hi Spark,Mlib experts,
> Anyone who can shine light on this?
> Thanks
> _R
>
> On Thu, Apr 21, 2016 at 12:46 PM, roni  wrote:
>
>> Hi ,
>>  I want to get the bisecting kmeans tree structure to show a dendogram
>>  on the heatmap I am generating based on the hierarchical clustering of
>> data.
>>  How do I get that using mlib .
>> Thanks
>> -Roni
>>
>
>


Re: Dense Vectors outputs in feature engineering

2016-07-16 Thread Yanbo Liang
Since you use two steps (StringIndexer and OneHotEncoder) to encode
categories into a Vector, I guess you want to decode the resulting vector back
into the original categories.
Suppose you have a DataFrame with only one column named "name" and three
categories: "b", "a", "c" (ranked by frequency). You can refer to the
following code snippet to encode and decode:

import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = spark.createDataFrame(Seq("a", "b", "c", "b", "a", "b").map(Tuple1.apply)).toDF("name")

val si = new StringIndexer().setInputCol("name").setOutputCol("indexedName")

val siModel = si.fit(df)

val df2 = siModel.transform(df)

val encoder = new OneHotEncoder()

  .setDropLast(false)

  .setInputCol("indexedName")

  .setOutputCol("encodedName")

val df3 = encoder.transform(df2)

df3.show()

// Decode to get the original categories.

val group = AttributeGroup.fromStructField(df3.schema("encodedName"))

val categories = group.attributes.get.map(_.name.get)

println(categories.mkString(","))

// Output: b,a,c


Thanks
Yanbo

2016-07-14 6:46 GMT-07:00 rachmaninovquartet :

> or would it be common practice to just retain the original categories in
> another df?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331p27337.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-16 Thread Yanbo Liang
Hi Tobi,

The MLlib RDD-based API does support applying the transformation to both a Vector
and an RDD, but you did not use it in the appropriate way.
Suppose you have an RDD with a LabeledPoint in each line; you can refer to the
following code snippet to train a ChiSqSelectorModel and do the
transformation:

from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.feature import ChiSqSelector

data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
        LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
        LabeledPoint(1.0, [0.0, 9.0, 8.0]),
        LabeledPoint(2.0, [8.0, 9.0, 5.0])]

rdd = sc.parallelize(data)

model = ChiSqSelector(1).fit(rdd)

filteredRDD = model.transform(rdd.map(lambda lp: lp.features))

filteredRDD.collect()

However, we strongly recommend you migrate to the DataFrame-based API, since
the RDD-based API has been switched to maintenance mode.

Thanks
Yanbo

2016-07-14 13:23 GMT-07:00 Tobi Bosede :

> Hi everyone,
>
> I am trying to filter my features based on the spark.mllib ChiSqSelector.
>
> filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
> model.transform(lp.features)))
>
> However when I do the following I get the error below. Is there any other
> way to filter my data to avoid this error?
>
> filteredDataDF=filteredData.toDF()
>
> Exception: It appears that you are attempting to reference SparkContext from 
> a broadcast variable, action, or transforamtion. SparkContext can only be 
> used on the driver, not in code that it run on workers. For more information, 
> see SPARK-5063.
>
>
> I would directly use the spark.ml ChiSqSelector and work with dataframes, but 
> I am on spark 1.4 and using pyspark. So spark.ml's ChiSqSelector is not 
> available to me. filteredData is of type piplelineRDD, if that helps. It is 
> not a regular RDD. I think that may part of why calling toDF() is not working.
>
>
> Thanks,
>
> Tobi
>
>


Re: QuantileDiscretizer not working properly with big dataframes

2016-07-16 Thread Yanbo Liang
Could you tell us the Spark version you used?
We have fixed this bug in Spark 1.6.2 and Spark 2.0; please upgrade to one of
these versions and retry.
If the issue still exists, please let us know.

Thanks
Yanbo

2016-07-12 11:03 GMT-07:00 Pasquinell Urbani <
pasquinell.urb...@exalitica.com>:

> In the forum mentioned above the flowing solution is suggested
>
> Problem is in line 113 and 114 of QuantileDiscretizer.scala and can be
> fixed by changing line 113 like so:
> before: val requiredSamples = math.max(numBins * numBins, 1)
> after: val requiredSamples = math.max(numBins * numBins, 1.0)
>
> Is there another way?
>
>
> 2016-07-11 18:28 GMT-04:00 Pasquinell Urbani <
> pasquinell.urb...@exalitica.com>:
>
>> Hi all,
>>
>> We have a dataframe with 2.5 millions of records and 13 features. We want
>> to perform a logistic regression with this data but first we neet to divide
>> each columns in discrete values using QuantileDiscretizer. This will
>> improve the performance of the model by avoiding outliers.
>>
>> For small dataframes QuantileDiscretizer works perfect (see the ml
>> example:
>> https://spark.apache.org/docs/1.6.0/ml-features.html#quantilediscretizer),
>> but for large data frames it tends to split the column in only the values 0
>> and 1 (despite the custom number of buckets is settled in to 5). Here is my
>> code:
>>
>> val discretizer = new QuantileDiscretizer()
>>   .setInputCol("C4")
>>   .setOutputCol("C4_Q")
>>   .setNumBuckets(5)
>>
>> val result = discretizer.fit(df3).transform(df3)
>> result.show()
>>
>> I found the same problem presented here:
>> https://issues.apache.org/jira/browse/SPARK-13444 . But there is no
>> solution yet.
>>
>> Do I am configuring the function in a bad way? Should I pre-process the
>> data (like z-scores)? Can somebody help me dealing with this?
>>
>> Regards
>>
>
>


Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Yanbo Liang
IsotonicRegression can handle a feature column of vector type. It will
extract a certain index (controlled by the param "featureIndex") of this
feature vector and feed it into model training. It runs the pool
adjacent violators algorithm on each partition, so it is distributed and
the data does not need to fit into the memory of a single machine.
The following code snippet works well on my machine:

// Spark 2.0 imports (in 1.6 the vectors live in org.apache.spark.mllib.linalg)
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.IsotonicRegression
import org.apache.spark.sql.Row

val labels = Seq(1, 2, 3, 1, 6, 17, 16, 17, 18)
val dataset = spark.createDataFrame(
  labels.zipWithIndex.map { case (label, i) =>
(label, Vectors.dense(Array(i.toDouble, i.toDouble + 1.0)), 1.0)
  }
).toDF("label", "features", "weight")

val ir = new IsotonicRegression().setIsotonic(true)

val model = ir.fit(dataset)

val predictions = model
  .transform(dataset)
  .select("prediction").rdd.map { case Row(pred) =>
  pred
}.collect()

assert(predictions === Array(1, 2, 2, 2, 6, 16.5, 16.5, 17, 18))


Thanks
Yanbo

2016-07-11 6:14 GMT-07:00 Fridtjof Sander :

> Hi Swaroop,
>
> from my understanding, Isotonic Regression is currently limited to data
> with 1 feature plus weight and label. Also the entire data is required to
> fit into memory of a single machine.
> I did some work on the latter issue but discontinued the project, because
> I felt no one really needed it. I'd be happy to resume my work on Spark's
> IR implementation, but I fear there won't be a quick for your issue.
>
> Fridtjof
>
>
> Am 08.07.2016 um 22:38 schrieb dsp:
>
>> Hi I am trying to perform Isotonic Regression on a data set with 9
>> features
>> and a label.
>> When I run the algorithm similar to the way mentioned on MLlib page, I get
>> the error saying
>>
>> /*error:* overloaded method value run with alternatives:
>> (input: org.apache.spark.api.java.JavaRDD[(java.lang.Double,
>> java.lang.Double,
>>
>> java.lang.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
>> 
>>(input: org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
>> scala.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
>>   cannot be applied to (org.apache.spark.rdd.RDD[(scala.Double,
>> scala.Double,
>> scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
>> scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
>> scala.Double)])
>>   val model = new
>> IsotonicRegression().setIsotonic(true).run(training)/
>>
>> For the may given in the sample code, it looks like it can be done only
>> for
>> dataset with a single feature because run() method can accept only three
>> parameters leaving which already has a label and a default value leaving
>> place for only one variable.
>> So, How can this be done for multiple variables ?
>>
>> Regards,
>> Swaroop
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Isotonic-Regression-run-method-overloaded-Error-tp27313.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Isotonic Regression, run method overloaded Error

2016-07-10 Thread Yanbo Liang
Hi Swaroop,

Would you mind sharing your code so that others can help you figure out
what caused this error?
I can run the isotonic regression examples without any problem.

Thanks
Yanbo

2016-07-08 13:38 GMT-07:00 dsp :

> Hi I am trying to perform Isotonic Regression on a data set with 9 features
> and a label.
> When I run the algorithm similar to the way mentioned on MLlib page, I get
> the error saying
>
> /*error:* overloaded method value run with alternatives:
> (input: org.apache.spark.api.java.JavaRDD[(java.lang.Double,
> java.lang.Double,
>
> java.lang.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
> 
>   (input: org.apache.spark.rdd.RDD[(scala.Double, scala.Double,
> scala.Double)])org.apache.spark.mllib.regression.IsotonicRegressionModel
>  cannot be applied to (org.apache.spark.rdd.RDD[(scala.Double,
> scala.Double,
> scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
> scala.Double, scala.Double, scala.Double, scala.Double, scala.Double,
> scala.Double)])
>  val model = new
> IsotonicRegression().setIsotonic(true).run(training)/
>
> For the may given in the sample code, it looks like it can be done only for
> dataset with a single feature because run() method can accept only three
> parameters leaving which already has a label and a default value leaving
> place for only one variable.
> So, How can this be done for multiple variables ?
>
> Regards,
> Swaroop
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Isotonic-Regression-run-method-overloaded-Error-tp27313.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: mllib based on dataset or dataframe

2016-07-10 Thread Yanbo Liang
DataFrame is a special case of Dataset (it is an alias for Dataset[Row]), so
they mean essentially the same thing.
Actually, the ML pipeline API will accept Dataset[_] instead of DataFrame in
Spark 2.0.
More accurately, we can say that MLlib will focus on the Dataset-based API for
further development.
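For reference, this is roughly how the alias and the new API look in Spark 2.0
(a simplified sketch, not the full source):

// In the org.apache.spark.sql package object
type DataFrame = Dataset[Row]

// and the 2.0 Estimator API is declared against the more general Dataset[_]:
// def fit(dataset: Dataset[_]): M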

Thanks
Yanbo

2016-07-10 20:35 GMT-07:00 jinhong lu :

> Hi,
> Since the DataSet will be the major API in spark2.0,  why mllib will
> DataFrame-based, and 'future development will focus on the DataFrame-based
> API.’
>
>Any plan will change mllib form DataFrame-based to DataSet-based?
>
>
> =
> Thanks,
> lujinhong
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-04 Thread Yanbo Liang
Would you mind filing a JIRA to track this issue? I will take a look when
I have time.

2016-07-04 14:09 GMT-07:00 mshiryae :

> Hi,
>
> I am trying to train model by MultilayerPerceptronClassifier.
>
> It works on sample data from
> data/mllib/sample_multiclass_classification_data.txt with 4 features, 3
> classes and layers [4, 4, 3].
> But when I try to use other input files with other features and classes
> (from here for example:
> https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html)
> then I get errors.
>
> Example:
> Input file aloi (128 features, 1000 classes, layers [128, 128, 1000]):
>
>
> with block size = 1:
> ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation.
> Decreasing step size to Infinity
> ERROR LBFGS: Failure! Resetting history:
> breeze.optimize.FirstOrderException: Line search failed
> ERROR LBFGS: Failure again! Giving up and returning. Maybe the objective is
> just poorly behaved?
>
>
> with default block size = 128:
>  java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at
>
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:629)
>   at
>
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3$$anonfun$apply$4.apply(Layer.scala:628)
>at scala.collection.immutable.List.foreach(List.scala:381)
>at
>
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:628)
>at
>
> org.apache.spark.ml.ann.DataStacker$$anonfun$3$$anonfun$apply$3.apply(Layer.scala:624)
>
>
>
> Even if I modify sample_multiclass_classification_data.txt file (rename all
> 4-th features to 5-th) and run with layers [5, 5, 3] then I also get the
> same errors as for file above.
>
>
> So to resume:
> I can't run training with default block size and with more than 4 features.
> If I set  block size to 1 then some actions are happened but I get errors
> from LBFGS.
> It is reproducible with Spark 1.5.2 and from master branch on github (from
> 4-th July).
>
> Did somebody already met with such behavior?
> Is there bug in MultilayerPerceptronClassifier or I use it incorrectly?
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-MLlib-MultilayerPerceptronClassifier-error-tp27279.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Graphframe Error

2016-07-04 Thread Yanbo Liang
Hi Arun,

The command

bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6

will automatically fetch the required graphframes jar from the Maven
repository; it is not affected by the location where the jar file was
placed. Your example works well on my laptop.

Or you can try

bin/pyspark --py-files ***/graphframes.jar --jars ***/graphframes.jar

to launch PySpark with graphframes enabled. You should set the "--py-files" and
"--jars" options to the path where you saved graphframes.jar.

Thanks
Yanbo


2016-07-03 15:48 GMT-07:00 Arun Patel :

> I started my pyspark shell with command  (I am using spark 1.6).
>
> bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
>
> I have copied
> http://dl.bintray.com/spark-packages/maven/graphframes/graphframes/0.1.0-spark1.6/graphframes-0.1.0-spark1.6.jar
> to the lib directory of Spark as well.
>
> I was getting below error
>
> >>> from graphframes import *
> Traceback (most recent call last):
>   File "", line 1, in 
> zipimport.ZipImportError: can't find module 'graphframes'
> >>>
>
> So, as per suggestions from similar questions, I have extracted the
> graphframes python directory and copied to the local directory where I am
> running pyspark.
>
> >>> from graphframes import *
>
> But, not able to create the GraphFrame
>
> >>> g = GraphFrame(v, e)
> Traceback (most recent call last):
>   File "", line 1, in 
> NameError: name 'GraphFrame' is not defined
>
> Also, I am getting below error.
> >>> from graphframes.examples import Graphs
> Traceback (most recent call last):
>   File "", line 1, in 
> ImportError: Bad magic number in graphframes/examples.pyc
>
> Any help will be highly appreciated.
>
> - Arun
>


Re: Several questions about how pyspark.ml works

2016-07-02 Thread Yanbo Liang
Hi Nick,

Please see my inline reply.

Thanks
Yanbo

2016-06-12 3:08 GMT-07:00 XapaJIaMnu :

> Hey,
>
> I have some additional Spark ML algorithms implemented in scala that I
> would
> like to make available in pyspark. For a reference I am looking at the
> available logistic regression implementation here:
>
>
> https://spark.apache.org/docs/1.6.0/api/python/_modules/pyspark/ml/classification.html
>
> I have couple of questions:
> 1) The constructor for the *class LogisticRegression* as far as I
> understand
> just accepts the arguments and then just constructs the underlying Scala
> object via /py4j/ and parses its arguments. This is done via the line
> *self._java_obj = self._new_java_obj(
> "org.apache.spark.ml.classification.LogisticRegression", self.uid)*
> Is this correct?
> What does the line *super(LogisticRegression, self).__init__()* do?
>

*super(LogisticRegression, self).__init__()* is used to initialize the
*Params* object on the Python side, since we store all params on the Python side
and transfer them to the Scala side when calling *fit*.


>
> Does this mean that any python datastructures used with it will be
> converted
> to java structures once the object is instantiated?
>
> 2) The corresponding model *class LogisticRegressionModel(JavaModel):*
> again
> just instantiates the Java object and nothing else? Is just enough for me
> to
> forward the arguments and instantiate the scala objects?
> Does this mean that when the pipeline is created, even if the pipeline is
> python it expects objects which are underlying scala code instantiated by
> /py4j/. Can one use pure python elements inside the pipeline (dealing with
> RDDs)? What would be the performance implication?
>

*class LogisticRegressionModel(JavaModel)* is only a wrapper of the peer
Scala model object.


>
> Cheers,
>
> Nick
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Several-questions-about-how-pyspark-ml-works-tp27141.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Trainning a spark ml linear regresion model fail after migrating from 1.5.2 to 1.6.1

2016-07-02 Thread Yanbo Liang
Yes, WeightedLeastSquares cannot solve some ill-conditioned problems
currently; community members have put some effort into resolving it
(SPARK-13777). As a workaround, you can set the solver to "l-bfgs",
which will train the LinearRegressionModel with the L-BFGS optimization method.
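Concretely, a minimal sketch of forcing the L-BFGS path (spark.ml
LinearRegression assumed; trainingData and the other params are placeholders
for your own code):

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setSolver("l-bfgs")   // skip the WeightedLeastSquares (normal equation) path
val model = lr.fit(trainingData)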

2016-06-09 7:37 GMT-07:00 chaz2505 :

> I ran into this problem too - it's because WeightedLeastSquares (added in
> 1.6.0 SPARK-10668) is being used on an ill-conditioned problem
> (SPARK-11918). I guess because of the one hot encoding. To get around it
> you
> need to ensure WeightedLeastSquares isn't used. Set parameters to make the
> following false:
>
> $(solver) == "auto" && $(elasticNetParam) == 0.0 &&
>   numFeatures <= WeightedLeastSquares.MAX_NUM_FEATURES) || $(solver) ==
> "normal"
>
> Hope this helps
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Trainning-a-spark-ml-linear-regresion-model-fail-after-migrating-from-1-5-2-to-1-6-1-tp27111p27128.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Get both feature importance and ROC curve from a random forest classifier

2016-07-02 Thread Yanbo Liang
Hi Mathieu,

Using the new ml package to train a RandomForestClassificationModel, you
can get the feature importances. Then you can convert the prediction result to
an RDD and feed it into BinaryClassificationMetrics for the ROC curve. You can
refer to the following code snippet:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector  // in Spark 2.0 use org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

val rf = new RandomForestClassifier()
val model = rf.fit(trainingData)

val predictions = model.transform(testData)

val scoreAndLabels =
  predictions.select(model.getRawPredictionCol, model.getLabelCol).rdd.map {
case Row(rawPrediction: Vector, label: Double) => (rawPrediction(1),
label)
case Row(rawPrediction: Double, label: Double) => (rawPrediction, label)
  }
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
metrics.roc()


Thanks
Yanbo

2016-06-15 7:13 GMT-07:00 matd :

> Hi ml folks !
>
> I'm using a Random Forest for a binary classification.
> I'm interested in getting both the ROC *curve* and the feature importance
> from the trained model.
>
> If I'm not missing something obvious, the ROC curve is only available in
> the
> old mllib world, via BinaryClassificationMetrics. In the new ml package,
> only the areaUnderROC and areaUnderPR are available through
> BinaryClassificationEvaluator.
>
> The feature importance is only available in ml package, through
> RandomForestClassificationModel.
>
> Any idea to get both ?
>
> Mathieu
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Get-both-feature-importance-and-ROC-curve-from-a-random-forest-classifier-tp27175.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Ideas to put a Spark ML model in production

2016-07-02 Thread Yanbo Liang
Let's suppose you have trained a LogisticRegressionModel and saved it at
"/tmp/lr-model". You can copy that directory to the production environment and
use it to make predictions on users' new data. You can refer to the following
code snippet:

val model = LogisticRegressionModel.load("/tmp/lr-model")
val data = newDataset
val prediction = model.transform(data)

However, usually we save/load a PipelineModel, which includes the necessary
feature transformers together with the trained model, rather than the single
model; the operations are similar.
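For example (a sketch; the paths, the pipelineModel value, and newRequestsDF
are placeholders for your own pipeline and serving data):

import org.apache.spark.ml.PipelineModel

// After training, persist the whole fitted pipeline
pipelineModel.write.overwrite().save("/tmp/lr-pipeline")

// In the serving code, load it back and score incoming data
val served = PipelineModel.load("/tmp/lr-pipeline")
val scored = served.transform(newRequestsDF)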

Thanks
Yanbo

2016-06-23 10:54 GMT-07:00 Saurabh Sardeshpande :

> Hi all,
>
> How do you reliably deploy a spark model in production? Let's say I've
> done a lot of analysis and come up with a model that performs great. I have
> this "model file" and I'm not sure what to do with it. I want to build some
> kind of service around it that takes some inputs, converts them into a
> feature, runs the equivalent of 'transform', i.e. predict the output and
> return the output.
>
> At the Spark Summit I heard a lot of talk about how this will be easy to
> do in Spark 2.0, but I'm looking for an solution sooner. PMML support is
> limited and the model I have can't be exported in that format.
>
> I would appreciate any ideas around this, especially from personal
> experiences.
>
> Regards,
> Saurabh
>


Re: Custom Optimizer

2016-07-02 Thread Yanbo Liang
Spark MLlib does not support plugging in a custom optimizer, since the optimizer
interface is private.

Thanks
Yanbo

2016-06-23 16:56 GMT-07:00 Stephen Boesch :

> My team has a custom optimization routine that we would have wanted to
> plug in as a replacement for the default  LBFGS /  OWLQN for use by some of
> the ml/mllib algorithms.
>
> However it seems the choice of optimizer is hard-coded in every algorithm
> except LDA: and even in that one it is only a choice between the internally
> defined Online or batch version.
>
> Any suggestions on how we might be able to incorporate our own optimizer?
> Or do we need to roll all of our algorithms from top to bottom - basically
> side stepping ml/mllib?
>
> thanks
> stephen
>


Re: Spark ML - Java implementation of custom Transformer

2016-07-02 Thread Yanbo Liang
Hi Mehdi,

Could you share your code so that we can help you figure out the
problem?
Actually, JavaTestParams works well, but there is a compatibility issue
with JavaDeveloperApiExample.
We have removed JavaDeveloperApiExample temporarily in Spark 2.0 in order
not to confuse users. Since the solution for the compatibility issue has been
figured out, we will add it back in 2.1.

Thanks
Yanbo

2016-06-27 11:57 GMT-07:00 Mehdi Meziane :

> Hi all,
>
> We have some problems while implementing custom Transformers in JAVA
> (SPARK 1.6.1).
> We do override the method copy, but it crashes with an AbstractMethodError.
>
> If we extends the UnaryTransformer, and do not override the copy method,
> it works without any error.
>
> We tried to write the copy like in these examples :
>
> https://github.com/apache/spark/blob/branch-2.0/mllib/src/test/java/org/apache/spark/ml/param/JavaTestParams.java
>
> https://github.com/eBay/Spark/blob/branch-1.6/examples/src/main/java/org/apache/spark/examples/ml/JavaDeveloperApiExample.java
>
> None of it worked.
>
> The copy is defined in the Params class as :
>
>   /**
>* Creates a copy of this instance with the same UID and some extra
> params.
>* Subclasses should implement this method and set the return type
> properly.
>*
>* @see [[defaultCopy()]]
>*/
>   def copy(extra: ParamMap): Params
>
> Any idea?
> Thanks,
>
> Mehdi
>


Re: ML regression - spark context dies without error

2016-06-05 Thread Yanbo Liang
Could you tell me which regression algorithm you used, the parameters you set,
and the detailed exception information? It would be better to paste your code
and the exception here if applicable, so that other members can help you
diagnose the problem.

Thanks
Yanbo

2016-05-12 2:03 GMT-07:00 AlexModestov :

> Hello,
> I have the same problem... Sometimes I get the error: "Py4JError: Answer
> from Java side is empty"
> Sometimes my code works fine but sometimes not...
> Did you find why it might come? What was the reason?
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/ML-regression-spark-context-dies-without-error-tp22633p26938.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Yes, you are right.

2016-05-30 2:34 GMT-07:00 Abhishek Anand <abhis.anan...@gmail.com>:

>
> Thanks Yanbo.
>
> So, you mean that if I have a variable which is of type double but I want
> to treat it like String in my model I just have to cast those columns into
> string and simply run the glm model. String columns will be directly
> one-hot encoded by the glm provided by sparkR ?
>
> Just wanted to clarify as in R we need to apply as.factor for categorical
> variables.
>
> val dfNew = df.withColumn("C0",df.col("C0").cast("String"))
>
>
> Abhi !!
>
> On Mon, May 30, 2016 at 2:58 PM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> Hi Abhi,
>>
>> In SparkR glm, category features (columns of type string) will be one-hot
>> encoded automatically.
>> So pre-processing like `as.factor` is not necessary, you can directly
>> feed your data to the model training.
>>
>> Thanks
>> Yanbo
>>
>> 2016-05-30 2:06 GMT-07:00 Abhishek Anand <abhis.anan...@gmail.com>:
>>
>>> Hi ,
>>>
>>> I want to run glm variant of sparkR for my data that is there in a csv
>>> file.
>>>
>>> I see that the glm function in sparkR takes a spark dataframe as input.
>>>
>>> Now, when I read a file from csv and create a spark dataframe, how could
>>> I take care of the factor variables/columns in my data ?
>>>
>>> Do I need to convert it to a R dataframe, convert to factor using
>>> as.factor and create spark dataframe and run glm over it ?
>>>
>>> But, running as.factor over big dataset is not possible.
>>>
>>> Please suggest what is the best way to acheive this ?
>>>
>>> What pre-processing should be done, and what is the best way to achieve
>>> it  ?
>>>
>>>
>>> Thanks,
>>> Abhi
>>>
>>
>>
>


Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Hi Abhi,

In SparkR glm, categorical features (columns of type string) will be one-hot
encoded automatically.
So pre-processing like `as.factor` is not necessary; you can feed
your data directly to the model training.

Thanks
Yanbo

2016-05-30 2:06 GMT-07:00 Abhishek Anand :

> Hi ,
>
> I want to run glm variant of sparkR for my data that is there in a csv
> file.
>
> I see that the glm function in sparkR takes a spark dataframe as input.
>
> Now, when I read a file from csv and create a spark dataframe, how could I
> take care of the factor variables/columns in my data ?
>
> Do I need to convert it to a R dataframe, convert to factor using
> as.factor and create spark dataframe and run glm over it ?
>
> But, running as.factor over big dataset is not possible.
>
> Please suggest what is the best way to acheive this ?
>
> What pre-processing should be done, and what is the best way to achieve it
>  ?
>
>
> Thanks,
> Abhi
>


Re: Possible bug involving Vectors with a single element

2016-05-27 Thread Yanbo Liang
Spark MLlib Vector only supports data of double type, so it's reasonable to
throw an exception when you create a Vector with an element of unicode type.

2016-05-24 7:27 GMT-07:00 flyinggip :

> Hi there,
>
> I notice that there might be a bug in pyspark.mllib.linalg.Vectors when
> dealing with a vector with a single element.
>
> Firstly, the 'dense' method says it can also take numpy.array. However the
> code uses 'if len(elements) == 1' and when a numpy.array has only one
> element its length is undefined and currently if calling dense() on a numpy
> array with one element the program crashes. Probably instead of using len()
> in the above if, size should be used.
>
> Secondly, after I managed to create a dense-Vectors object with only one
> element from unicode, it seems that its behaviour is unpredictable. For
> example,
>
> Vectors.dense(unicode("0.1"))
>
> will report an error.
>
> dense_vec = Vectors.dense(unicode("0.1"))
>
> will NOT report any error until you run
>
> dense_vec
>
> to check its value. And the following will be able to create a successful
> DataFrame:
>
> mylist = [(0, Vectors.dense(unicode("0.1")))]
> myrdd = sc.parallelize(mylist)
> mydf = sqlContext.createDataFrame(myrdd, ["X", "Y"])
>
> However if the above unicode value is read from a text file (e.g., a csv
> file with 2 columns) then the DataFrame column corresponding to "Y" will be
> EMPTY:
>
> raw_data = sc.textFile(filename)
> split_data = raw_data.map(lambda line: line.split(','))
> parsed_data = split_data.map(lambda line: (int(line[0]),
> Vectors.dense(line[1])))
> mydf = sqlContext.createDataFrame(parsed_data, ["X", "Y"])
>
> It would be great if someone could share some ideas. Thanks a lot.
>
> f.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Possible-bug-involving-Vectors-with-a-single-element-tp27013.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Reg:Reading a csv file with String label into labelepoint

2016-03-16 Thread Yanbo Liang
Actually it's unnecessary to convert the CSV rows to LabeledPoint, because we
use DataFrame as the standard data format when training a model with Spark ML.
What you should do is assemble the double attributes into a Vector column named
"features" and index the string label into a numeric column. Then you can train
the ML model by specifying the featureCol and labelCol.
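
As a rough Scala sketch (the DataFrame df, the double columns "f1"/"f2" and the
string label column "category" are illustrative names), StringIndexer plus
VectorAssembler is usually enough:

import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// turn the string label into a numeric "label" column
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

// pack the double attributes into a single "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val training = assembler.transform(indexer.fit(df).transform(df))
// any ML classifier can now be trained with setLabelCol("label") and setFeaturesCol("features")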

Thanks
Yanbo

2016-03-16 13:41 GMT+08:00 Dharmin Siddesh J :

> Hi
>
> I am trying to read a csv with few double attributes and String Label .
> How can i convert it to labelpoint RDD so that i can run it with spark
> mllib classification algorithms.
>
> I have tried
> The LabelPoint Constructor (is available only for Regression ) but it
> accepts only double format label. Is there any other way to point out the
> string label and convert it into RDD
>
> Regards
> Siddesh
>
>
>


Re: SparkML Using Pipeline API locally on driver

2016-02-28 Thread Yanbo Liang
Hi Jean,

DataFrame is tied to a SQLContext, which is tied to a SparkContext, so I think
it's impossible to run `model.transform` without touching Spark.
I think what you need is for the model to support prediction on a single
instance; then you could make predictions without Spark. You can track the
progress of https://issues.apache.org/jira/browse/SPARK-10413.

Thanks
Yanbo

2016-02-27 8:52 GMT+08:00 Eugene Morozov :

> Hi everyone.
>
> I have a requirement to run prediction for random forest model locally on
> a web-service without touching spark at all in some specific cases. I've
> achieved that with previous mllib API (java 8 syntax):
>
> public List<Tuple2<Double, Double>> predictLocally(RandomForestModel model,
>List<LabeledPoint> data) {
> return data.stream()
> .map(point -> new Tuple2<>(model.predict(point.features()), point.label()))
> .collect(Collectors.toList());
> }
>
> So I have instance of trained model and can use it any way I want.
> The question is whether it's possible to run this on the driver itself
> with the following:
> DataFrame predictions = model.transform(test);
> because AFAIU test has to be a DataFrame, which means it's going to be run
> on the cluster.
>
> The use case to run it on driver is very small amount of data for
> prediction - much faster to handle it this way, than using spark cluster.
> Thank you.
> --
> Be well!
> Jean Morozov
>


Re: Saving and Loading Dataframes

2016-02-28 Thread Yanbo Liang
Hi Raj,

If you choose JSON as the storage format, Spark SQL will store the VectorUDT
column as an Array of Double.
So when you load it back into memory, it cannot be recognized as a Vector.
One workaround is to store the DataFrame in Parquet format; it will then be
loaded and recognized as expected.

df.write.format("parquet").mode("overwrite").save(output)
val data = sqlContext.read.format("parquet").load(output)


Thanks
Yanbo

2016-02-27 2:01 GMT+08:00 Raj Kumar <raj.ku...@hooklogic.com>:

> Thanks for the response Yanbo. Here is the source (it uses the
> sample_libsvm_data.txt file used in the
> mlliv examples).
>
> -Raj
> — IOTest.scala -
>
> import org.apache.spark.{SparkConf,SparkContext}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.DataFrame
>
> object IOTest {
>   val InputFile = "/tmp/sample_libsvm_data.txt"
>   val OutputDir ="/tmp/out"
>
>   val sconf = new SparkConf().setAppName("test").setMaster("local[*]")
>   val sqlc  = new SQLContext( new SparkContext( sconf ))
>   val df = sqlc.read.format("libsvm").load( InputFile  )
>   df.show; df.printSchema
>
>   df.write.format("json").mode("overwrite").save( OutputDir )
>   val data = sqlc.read.format("json").load( OutputDir )
>   data.show; data.printSchema
>
>   def main( args: Array[String]):Unit = {}
> }
>
>
> ---
>
> On Feb 26, 2016, at 12:47 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>
> Hi Raj,
>
> Could you share your code which can help others to diagnose this issue?
> Which version did you use?
> I can not reproduce this problem in my environment.
>
> Thanks
> Yanbo
>
> 2016-02-26 10:49 GMT+08:00 raj.kumar <raj.ku...@hooklogic.com>:
>
>> Hi,
>>
>> I am using mllib. I use the ml vectorization tools to create the
>> vectorized
>> input dataframe for
>> the ml/mllib machine-learning models with schema:
>>
>> root
>>  |-- label: double (nullable = true)
>>  |-- features: vector (nullable = true)
>>
>> To avoid repeated vectorization, I am trying to save and load this
>> dataframe
>> using
>>df.write.format("json").mode("overwrite").save( url )
>> val data = Spark.sqlc.read.format("json").load( url )
>>
>> However when I load the dataframe, the newly loaded dataframe has the
>> following schema:
>> root
>>  |-- features: struct (nullable = true)
>>  ||-- indices: array (nullable = true)
>>  |||-- element: long (containsNull = true)
>>  ||-- size: long (nullable = true)
>>  ||-- type: long (nullable = true)
>>  ||-- values: array (nullable = true)
>>  |||-- element: double (containsNull = true)
>>  |-- label: double (nullable = true)
>>
>> which the machine-learning models do not recognize.
>>
>> Is there a way I can save and load this dataframe without the schema
>> changing.
>> I assume it has to do with the fact that Vector is not a basic type.
>>
>> thanks
>> -Raj
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Saving-and-Loading-Dataframes-tp26339.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>> <http://nabble.com>.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Survival Curves using AFT implementation in Spark

2016-02-26 Thread Yanbo Liang
Hi Stuti,

AFTSurvivalRegression does not support computing the predicted survival
functions/curves currently.
I don't know whether the quantile predictions can help you; you can refer to
the quantile prediction example in the AFTSurvivalRegression documentation.
Maybe we can add this feature later.
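
For reference, a small Scala sketch of those quantile predictions (the
probability values and column names are just examples):

import org.apache.spark.ml.regression.AFTSurvivalRegression

val aft = new AFTSurvivalRegression()
  .setQuantileProbabilities(Array(0.3, 0.6))  // survival-time quantiles to predict
  .setQuantilesCol("quantiles")               // adds a "quantiles" column to the output

val model = aft.fit(training)                 // "training" holds label, censor and features
model.transform(test_final).select("features", "prediction", "quantiles").show(false)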

Thanks
Yanbo

2016-02-26 14:35 GMT+08:00 Stuti Awasthi :

> Hi All,
>
> I wanted to apply Survival Analysis using Spark AFT algorithm
> implementation. Now I perform the same in R using coxph model and passing
> the model in Survfit() function to generate survival curves
>
> Then I can visualize the survival curve on validation data to understand
> how good my model fits.
>
>
>
> R: Code
>
> fit <- coxph(Surv(futime, fustat) ~ age, data = ovarian)
>
> plot(survfit(fit,newdata=data.frame(age=60)))
>
>
>
> I wanted to achieve something similar with Spark. Hence I created the AFT
> model using Spark and passed my Test dataframe for prediction. The result
> of prediction is single prediction value for single input data which is as
> expected. But now how can I use this model to generate the Survival curves
> for visualization.
>
>
>
> Eg: Spark Code model.transform(test_final).show()
>
>
>
> standardized_features|   prediction|
>
> +-+-+
>
> | [0.0,0.0,0.743853...|48.33071792204102|
>
> +-+-+
>
>
>
> Can any suggest how to use the developed model for plotting Survival
> Curves for “test_final” data which is a dataframe feature[vector].
>
>
>
> Thanks
>
> Stuti Awasthi
>
>
>
>
>
>


Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-25 Thread Yanbo Liang
Actually, Spark SQL `groupBy` with `count` can compute the frequency in each bin.
You can also try DataFrameStatFunctions.freqItems() to get the frequent items
for columns.
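
As a rough Scala sketch combining the two ideas (the column name "value", the
DataFrame df and the split points are illustrative):

import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, 0.0, 10.0, 20.0, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setInputCol("value")
  .setOutputCol("bin")
  .setSplits(splits)

// the count per bin is the histogram frequency
bucketizer.transform(df).groupBy("bin").count().orderBy("bin").show()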

Thanks
Yanbo

2016-02-24 1:21 GMT+08:00 Burak Yavuz :

> You could use the Bucketizer transformer in Spark ML.
>
> Best,
> Burak
>
> On Tue, Feb 23, 2016 at 9:13 AM, Arunkumar Pillai  > wrote:
>
>> Hi
>> Is there any predefined method to calculate histogram bins and frequency
>> in spark. Currently I take range and find bins then count frequency using
>> SQL query.
>>
>> Is there any better way
>>
>
>


Re: Saving and Loading Dataframes

2016-02-25 Thread Yanbo Liang
Hi Raj,

Could you share your code which can help others to diagnose this issue?
Which version did you use?
I can not reproduce this problem in my environment.

Thanks
Yanbo

2016-02-26 10:49 GMT+08:00 raj.kumar :

> Hi,
>
> I am using mllib. I use the ml vectorization tools to create the vectorized
> input dataframe for
> the ml/mllib machine-learning models with schema:
>
> root
>  |-- label: double (nullable = true)
>  |-- features: vector (nullable = true)
>
> To avoid repeated vectorization, I am trying to save and load this
> dataframe
> using
>df.write.format("json").mode("overwrite").save( url )
> val data = Spark.sqlc.read.format("json").load( url )
>
> However when I load the dataframe, the newly loaded dataframe has the
> following schema:
> root
>  |-- features: struct (nullable = true)
>  ||-- indices: array (nullable = true)
>  |||-- element: long (containsNull = true)
>  ||-- size: long (nullable = true)
>  ||-- type: long (nullable = true)
>  ||-- values: array (nullable = true)
>  |||-- element: double (containsNull = true)
>  |-- label: double (nullable = true)
>
> which the machine-learning models do not recognize.
>
> Is there a way I can save and load this dataframe without the schema
> changing.
> I assume it has to do with the fact that Vector is not a basic type.
>
> thanks
> -Raj
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Saving-and-Loading-Dataframes-tp26339.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-16 Thread Yanbo Liang
Hi Stuti,

The features should be standardized before training the model. Currently
AFTSurvivalRegression does not support standardization. Here is a workaround
for this issue, and I will send a PR to fix it soon.

val ovarian = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("..")
  .toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")

val assembler = new VectorAssembler()
  .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
  .setOutputCol("features")

val ovarian2 = assembler.transform(ovarian)
  .select(col("censor").cast(DoubleType),
col("label").cast(DoubleType), col("features"))

val standardScaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("standardized_features")
val ssModel = standardScaler.fit(ovarian2)
val ovarian3 = ssModel.transform(ovarian2)

val aft = new
AFTSurvivalRegression().setFeaturesCol("standardized_features")

val model = aft.fit(ovarian3)

val newCoefficients =
model.coefficients.toArray.zip(ssModel.std.toArray).map { x =>
  x._1 / x._2
}
println(newCoefficients.toSeq.mkString(","))
println(model.intercept)
println(model.scale)

Yanbo

2016-02-15 16:07 GMT+08:00 Yanbo Liang <yblia...@gmail.com>:

> Hi Stuti,
>
> This is a bug of AFTSurvivalRegression, we did not handle "lossSum ==
> infinity" properly.
> I have open https://issues.apache.org/jira/browse/SPARK-13322 to track
> this issue and will send a PR.
> Thanks for reporting this issue.
>
> Yanbo
>
> 2016-02-12 15:03 GMT+08:00 Stuti Awasthi <stutiawas...@hcl.com>:
>
>> Hi All,
>>
>> Im wanted to try Survival Analysis on Spark 1.6. I am successfully able
>> to run the AFT example provided. Now I tried to train the model with
>> Ovarian data which is standard data comes with Survival library in R.
>>
>> Default Column Name :  *Futime,fustat,age,resid_ds,rx,ecog_ps*
>>
>>
>>
>> Here are the steps I have done :
>>
>> · Loaded the data from csv to dataframe labeled as
>>
>> *val* ovarian_data = sqlContext.read
>>
>>   .format("com.databricks.spark.csv")
>>
>>   .option("header", "true") // Use first line of all files as header
>>
>>   .option("inferSchema", "true") // Automatically infer data types
>>
>>   .load("Ovarian.csv").toDF("label", "censor", "age", "resid_ds",
>> "rx", "ecog_ps")
>>
>> · Utilize the VectorAssembler() to create features from "age",
>> "resid_ds", "rx", "ecog_ps" like
>>
>> *val* assembler = *new* VectorAssembler()
>>
>> .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
>>
>> .setOutputCol("features")
>>
>>
>>
>> · Then I create a new dataframe with only 3 colums as :
>>
>> *val* training = finalDf.select("label", "censor", "features")
>>
>>
>>
>> · Finally Im passing it to AFT
>>
>> *val* model = aft.fit(training)
>>
>>
>>
>> Im getting the error as :
>>
>> java.lang.AssertionError: *assertion failed: AFTAggregator loss sum is
>> infinity. Error for unknown reason.*
>>
>>at scala.Predef$.assert(*Predef.scala:179*)
>>
>>at org.apache.spark.ml.regression.AFTAggregator.add(
>> *AFTSurvivalRegression.scala:480*)
>>
>>at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(
>> *AFTSurvivalRegression.scala:522*)
>>
>>at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(
>> *AFTSurvivalRegression.scala:521*)
>>
>>at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(
>> *TraversableOnce.scala:144*)
>>
>>at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(
>> *TraversableOnce.scala:144*)
>>
>>at scala.collection.Iterator$class.foreach(*Iterator.scala:727*)
>>
>>
>>
>> I have tried to print the schema :
>>
>> ()root
>>
>> |-- label: double (nullable = true)
>>
>> |-- censor: d

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-15 Thread Yanbo Liang
Hi Stuti,

This is a bug in AFTSurvivalRegression: we did not handle "lossSum ==
infinity" properly.
I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track this
issue and will send a PR.
Thanks for reporting this issue.

Yanbo

2016-02-12 15:03 GMT+08:00 Stuti Awasthi :

> Hi All,
>
> Im wanted to try Survival Analysis on Spark 1.6. I am successfully able to
> run the AFT example provided. Now I tried to train the model with Ovarian
> data which is standard data comes with Survival library in R.
>
> Default Column Name :  *Futime,fustat,age,resid_ds,rx,ecog_ps*
>
>
>
> Here are the steps I have done :
>
> · Loaded the data from csv to dataframe labeled as
>
> *val* ovarian_data = sqlContext.read
>
>   .format("com.databricks.spark.csv")
>
>   .option("header", "true") // Use first line of all files as header
>
>   .option("inferSchema", "true") // Automatically infer data types
>
>   .load("Ovarian.csv").toDF("label", "censor", "age", "resid_ds", "rx",
> "ecog_ps")
>
> · Utilize the VectorAssembler() to create features from "age",
> "resid_ds", "rx", "ecog_ps" like
>
> *val* assembler = *new* VectorAssembler()
>
> .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
>
> .setOutputCol("features")
>
>
>
> · Then I create a new dataframe with only 3 colums as :
>
> *val* training = finalDf.select("label", "censor", "features")
>
>
>
> · Finally Im passing it to AFT
>
> *val* model = aft.fit(training)
>
>
>
> Im getting the error as :
>
> java.lang.AssertionError: *assertion failed: AFTAggregator loss sum is
> infinity. Error for unknown reason.*
>
>at scala.Predef$.assert(*Predef.scala:179*)
>
>at org.apache.spark.ml.regression.AFTAggregator.add(
> *AFTSurvivalRegression.scala:480*)
>
>at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(
> *AFTSurvivalRegression.scala:522*)
>
>at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(
> *AFTSurvivalRegression.scala:521*)
>
>at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(
> *TraversableOnce.scala:144*)
>
>at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(
> *TraversableOnce.scala:144*)
>
>at scala.collection.Iterator$class.foreach(*Iterator.scala:727*)
>
>
>
> I have tried to print the schema :
>
> ()root
>
> |-- label: double (nullable = true)
>
> |-- censor: double (nullable = true)
>
> |-- features: vector (nullable = true)
>
>
>
> Sample data training looks like
>
> [59.0,1.0,[72.3315,2.0,1.0,1.0]]
>
> [115.0,1.0,[74.4932,2.0,1.0,1.0]]
>
> [156.0,1.0,[66.4658,2.0,1.0,2.0]]
>
> [421.0,0.0,[53.3644,2.0,2.0,1.0]]
>
> [431.0,1.0,[50.3397,2.0,1.0,1.0]]
>
>
>
> Im not able to understand about the error, as if I use same data and
> create the denseVector as given in Sample example of AFT, then code works
> completely fine. But I would like to read the data from CSV file and then
> proceed.
>
>
>
> Please suggest
>
>
>
> Thanks 
>
> Stuti Awasthi
>
>
>
>
>
>


Re: [MLLib] Is the order of the coefficients in a LogisticRegresionModel kept ?

2016-02-02 Thread Yanbo Liang
For your case, it's true.
But it is not always correct for a pipeline model: some transformers in the
pipeline, such as OneHotEncoder, will change the features.

2016-02-03 1:21 GMT+08:00 jmvllt :

> Hi everyone,
>
> This may sound like a stupid question but I need to be sure of this :
>
> Given a dataframe composed by « n » features  : f1, f2, …, fn
>
> For each row of my dataframe, I create a labeled point :
> val row_i = LabeledPoint(label, Vectors.dense(v1_i,v2_i,…, vn_i) )
> where v1_i,v2_i,…, vn_i are respectively the values of the features f1, f2,
> …, fn of the i th row.
>
> Then, I fit a pipeline composed by a standardScaler and a
> logisticRegression
> model.
> When I get back my LogisticRegressionModel and StandardScalerModel from the
> pipeline, I’m calling the getters :
> LogisticRegressionModel.coefficients, StandardScalerModel.mean and
> StandardScalerModel.std
>
> This gives me 3 vectors of length « n »
>
> My question is the following :
> Am I assured that the element of index « j » of each vectors correspond to
> the feature « j »  ? Is the "*order*" of the feature kept ?
> e.g : Is StandardScalerModel.mean(j) the mean of the feature « j » of my
> data frame ?
>
> Thanks for your time.
> Regards,
> J.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Is-the-order-of-the-coefficients-in-a-LogisticRegresionModel-kept-tp26137.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Extracting p values in Logistic regression using mllib scala

2016-01-24 Thread Yanbo Liang
Hi Chandan,

MLlib only supports getting p-values and t-values from the Linear Regression
model; other models such as the Logistic Regression model are not supported
currently. This feature is under development and will be released in the next
version (Spark 2.0).
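
For the linear case, a minimal Scala sketch (assuming Spark 1.6+ and the
"normal" solver, which is what exposes these statistics on the training
summary; "training" is a DataFrame with label/features columns):

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression().setSolver("normal")  // summary statistics need the normal solver
val model = lr.fit(training)

println(model.summary.pValues.mkString(","))  // p-values for the coefficients (and intercept, if fitted)
println(model.summary.tValues.mkString(","))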

Thanks
Yanbo

2016-01-18 16:45 GMT+08:00 Chandan Verma :

> Hi,
>
>
>
> Can anyone help me to extract p-values from a logistic regression model
> using mllib and scala.
>
>
>
> Thanks
>
> Chandan Verma
>
>


Re: has any one implemented TF_IDF using ML transformers?

2016-01-24 Thread Yanbo Liang
Hi Andy,
I will take a look at your code after you share it.
Thanks!
Yanbo

2016-01-23 0:18 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:

> Hi Yanbo
>
> I recently code up the trivial example from
> http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
>  I
> do not get the same results. I’ll put my code up on github over the weekend
> if anyone is interested
>
> Andy
>
> From: Yanbo Liang <yblia...@gmail.com>
> Date: Tuesday, January 19, 2016 at 1:11 AM
>
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: has any one implemented TF_IDF using ML transformers?
>
> Hi Andy,
>
> The equation to calculate IDF is:
> idf = log((m + 1) / (d(t) + 1))
> you can refer here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L150
>
> The equation to calculate TFIDF is:
> TFIDF=TF * IDF
> you can refer:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L226
>
>
> Thanks
> Yanbo
>
> 2016-01-19 7:05 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:
>
>> Hi Yanbo
>>
>> I am using 1.6.0. I am having a hard of time trying to figure out what
>> the exact equation is. I do not know Scala.
>>
>> I took a look a the source code URL  you provide. I do not know Scala
>>
>> override def transform(dataset: DataFrame): DataFrame = {
>> transformSchema(dataset.schema, logging = true)
>> val idf = udf { vec: Vector => idfModel.transform(vec) }
>> dataset.withColumn($(outputCol), idf(col($(inputCol
>> }
>>
>>
>> You mentioned the doc is out of date.
>> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>>
>> Based on my understanding of the subject matter the equations in the java
>> doc are correct. I could not find anything like the equations in the source
>> code?
>>
>> IDF(t,D)=log|D|+1DF(t,D)+1,
>>
>> TFIDF(t,d,D)=TF(t,d)・IDF(t,D).
>>
>>
>> I found the spark unit test org.apache.spark.mllib.feature.JavaTfIdfSuite
>> the results do not match equation. (In general the unit test asserts seem
>> incomplete).
>>
>>
>>  I have created several small test example to try and figure out how to
>> use NaiveBase, HashingTF, and IDF. The values of TFIDF,  theta,
>> probabilities , … The result produced by spark not match the published
>> results at
>> http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
>>
>>
>> Kind regards
>>
>> Andy
>>
>> private DataFrame createTrainingData() {
>>
>> // make sure we only use dictionarySize words
>>
>> JavaRDD rdd = javaSparkContext.parallelize(Arrays.asList(
>>
>> // 0 is Chinese
>>
>> // 1 in notChinese
>>
>> RowFactory.create(0, 0.0, Arrays.asList("Chinese",
>> "Beijing", "Chinese")),
>>
>> RowFactory.create(1, 0.0, Arrays.asList("Chinese",
>> "Chinese", "Shanghai")),
>>
>> RowFactory.create(2, 0.0, Arrays.asList("Chinese",
>> "Macao")),
>>
>> RowFactory.create(3, 1.0, Arrays.asList("Tokyo", "Japan",
>> "Chinese";
>>
>>
>>
>> return createData(rdd);
>>
>> }
>>
>>
>> private DataFrame createData(JavaRDD rdd) {
>>
>> StructField id = null;
>>
>> id = new StructField("id", DataTypes.IntegerType, false,
>> Metadata.empty());
>>
>>
>> StructField label = null;
>>
>> label = new StructField("label", DataTypes.DoubleType, false,
>> Metadata.empty());
>>
>>
>>
>> StructField words = null;
>>
>> words = new StructField("words",
>> DataTypes.createArrayType(DataTypes.StringType), false,
>> Metadata.empty());
>>
>>
>> StructType schema = new StructType(new StructField[] { id, label,
>> words });
>>
>> DataFrame ret = sqlContext.createDataFrame(rdd, schema);
>>
>>
>>
>> return ret;
>>
>> }
>>
>>
>>private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
>>
>> HashingTF hashingTF = new Has

Re: can we create dummy variables from categorical variables, using sparkR

2016-01-24 Thread Yanbo Liang
Hi Devesh,

RFormula will encode categorical variables (columns of string type) as dummy
variables automatically. You do not need to do the dummy transform explicitly
if you want to train a machine learning model using SparkR, although SparkR
only supports a limited set of ML algorithms (GLM) currently.

Thanks
Yanbo

2016-01-20 1:15 GMT+08:00 Vinayak Agrawal :

> Yes, you can use Rformula library. Please see
>
> https://databricks.com/blog/2015/10/05/generalized-linear-models-in-sparkr-and-r-formula-support-in-mllib.html
>
> On Tue, Jan 19, 2016 at 10:34 AM, Devesh Raj Singh  > wrote:
>
>> Hi,
>>
>> Can we create dummy variables for categorical variables in sparkR like we
>> do using "dummies" package in R
>>
>> --
>> Warm regards,
>> Devesh.
>>
>
>
>
> --
> Vinayak Agrawal
> Big Data Analytics
> IBM
>
> "To Strive, To Seek, To Find and Not to Yield!"
> ~Lord Alfred Tennyson
>


Re: how to save Matrix type result to hdfs file using java

2016-01-24 Thread Yanbo Liang
A Matrix can be saved as a column of type MatrixUDT.
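
A rough Scala sketch (the Java API is analogous; the output path is
illustrative), writing the matrix column out as Parquet on HDFS:

import org.apache.spark.mllib.linalg.{Matrices, Matrix}

val m: Matrix = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))
// a one-column DataFrame; the column is stored with MatrixUDT
val df = sqlContext.createDataFrame(Seq(Tuple1(m))).toDF("matrix")
df.write.parquet("hdfs:///tmp/matrix-output")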


Re: has any one implemented TF_IDF using ML transformers?

2016-01-19 Thread Yanbo Liang
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
> |id |label|words                       |tf                       |idf                                                     |
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
> |0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                      |
> |1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                      |
> |2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                      |
> |3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
> +---+-++-+---+
>
>
> Here is the spark test case
>
>
>  @Test
>
>   public void tfIdf() {
>
> // The tests are to check Java compatibility.
>
> HashingTF tf = new HashingTF();
>
> @SuppressWarnings("unchecked")
>
> JavaRDD<List> documents = sc.parallelize(Arrays.asList(
>
>   Arrays.asList("this is a sentence".split(" ")),
>
>   Arrays.asList("this is another sentence".split(" ")),
>
>   Arrays.asList("this is still a sentence".split(" "))), 2);
>
> JavaRDD termFreqs = tf.transform(documents);
>
> termFreqs.collect();
>
> IDF idf = new IDF();
>
> JavaRDD tfIdfs = idf.fit(termFreqs).transform(termFreqs);
>
> List localTfIdfs = tfIdfs.collect();
>
> int indexOfThis = tf.indexOf("this");
>
> System.err.println("AEDWIP: indexOfThis: " + indexOfThis);
>
>
>
> int indexOfSentence = tf.indexOf("sentence");
>
> System.err.println("AEDWIP: indexOfSentence: " + indexOfSentence);
>
>
> int indexOfAnother = tf.indexOf("another");
>
> System.err.println("AEDWIP: indexOfAnother: " + indexOfAnother);
>
>
> for (Vector v: localTfIdfs) {
>
> System.err.println("AEDWIP: V.toString() " + v.toString());
>
>   Assert.assertEquals(0.0, v.apply(indexOfThis), 1e-15);
>
> }
>
>   }
>
>
> $ mvn test -DwildcardSuites=none
> -Dtest=org.apache.spark.mllib.feature.JavaTfIdfSuite
>
> AEDWIP: indexOfThis: 413342
>
> AEDWIP: indexOfSentence: 251491
>
> AEDWIP: indexOfAnother: 263939
>
> AEDWIP: V.toString()
> (1048576,[97,3370,251491,413342],[0.28768207245178085,0.0,0.0,0.0])
>
> AEDWIP: V.toString()
> (1048576,[3370,251491,263939,413342],[0.0,0.0,0.6931471805599453,0.0])
>
> AEDWIP: V.toString()
> (1048576,[97,3370,251491,413342,713128],[0.28768207245178085,0.0,0.0,0.0,0.6931471805599453])
>
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.908 sec
> - in org.apache.spark.mllib.feature.JavaTfIdfSuite
>
> From: Yanbo Liang <yblia...@gmail.com>
> Date: Sunday, January 17, 2016 at 12:34 AM
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: has any one implemented TF_IDF using ML transformers?
>
> Hi Andy,
>
> Actually, the output of ML IDF model is the TF-IDF vector of each instance
> rather than IDF vector.
> So it's unnecessary to do member wise multiplication to calculate TF-IDF
> value. You can refer the code at here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala#L121
> I found the document of IDF is not very clear, we need to update it.
>
> Thanks
> Yanbo
>
> 2016-01-16 6:10 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:
>
>> I wonder if I am missing something? TF-IDF is very popular. Spark ML has
>> a lot of transformers how ever it TF_IDF is not supported directly.
>>
>> Spark provide a HashingTF and IDF transformer. The java doc
>> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>>
>> Mentions you can implement TFIDF as follows
>>
>> TFIDF(t,d,D)=TF(t,d)・IDF(t,D).
>>
>> The problem I am running into is both HashingTF and IDF return a sparse
>> vector.
>>
>> *Ideally the spark code  to implement TFIDF would be one line.*
>>
>>
>> * DataFrame ret = tmp.withColumn("features", 
>> tmp.col("tf").multiply(tmp.col("idf")));*
>>
>&g

Re: has any one implemented TF_IDF using ML transformers?

2016-01-17 Thread Yanbo Liang
Hi Andy,

Actually, the output of the ML IDF model is the TF-IDF vector of each instance
rather than the IDF vector.
So it's unnecessary to do member-wise multiplication to calculate the TF-IDF
value. You can refer to the code here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala#L121
I found the documentation of IDF is not very clear; we need to update it.
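
In other words, a sketch like the following (Scala, column names follow your
example) already gives the final TF-IDF in the IDF output column, with no
extra multiplication needed:

import org.apache.spark.ml.feature.{HashingTF, IDF}

val tf = new HashingTF().setInputCol("words").setOutputCol("tf").setNumFeatures(7)
val tfDF = tf.transform(wordsDF)   // "wordsDF" has an array-of-string "words" column

val idfModel = new IDF().setInputCol("tf").setOutputCol("features").fit(tfDF)
idfModel.transform(tfDF).select("words", "features").show(false)  // "features" is already tf * idf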

Thanks
Yanbo

2016-01-16 6:10 GMT+08:00 Andy Davidson :

> I wonder if I am missing something? TF-IDF is very popular. Spark ML has a
> lot of transformers how ever it TF_IDF is not supported directly.
>
> Spark provide a HashingTF and IDF transformer. The java doc
> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>
> Mentions you can implement TFIDF as follows
>
> TFIDF(t,d,D)=TF(t,d)・IDF(t,D).
>
> The problem I am running into is both HashingTF and IDF return a sparse
> vector.
>
> *Ideally the spark code  to implement TFIDF would be one line.*
>
>
> * DataFrame ret = tmp.withColumn("features", 
> tmp.col("tf").multiply(tmp.col("idf")));*
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' requires numeric type, not vector;
>
> I could implement my own UDF to do member wise multiplication how ever
> given how common TF-IDF is I wonder if this code already exists some where
>
> I found  org.apache.spark.util.Vector.Multiplier. There is no
> documentation how ever give the argument is double, my guess is it just
> does scalar multiplication.
>
> I guess I could do something like
>
> Double[] v = mySparkVector.toArray();
>  Then use JBlas to do member wise multiplication
>
> I assume sparceVectors are not distributed so there  would not be any
> additional communication cost
>
>
> If this code is truly missing. I would be happy to write it and donate it
>
> Andy
>
>
> From: Andrew Davidson 
> Date: Wednesday, January 13, 2016 at 2:52 PM
> To: "user @spark" 
> Subject: trouble calculating TF-IDF data type mismatch: '(tf * idf)'
> requires numeric type, not vector;
>
> Bellow is a little snippet of my Java Test Code. Any idea how I implement
> member wise vector multiplication?
>
> Kind regards
>
> Andy
>
> transformed df printSchema()
>
> root
>
>  |-- id: integer (nullable = false)
>
>  |-- label: double (nullable = false)
>
>  |-- words: array (nullable = false)
>
>  ||-- element: string (containsNull = true)
>
>  |-- tf: vector (nullable = true)
>
>  |-- idf: vector (nullable = true)
>
>
>
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
> |id |label|words                       |tf                       |idf                                                     |
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
> |0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                      |
> |1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                      |
> |2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                      |
> |3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
>
> @Test
>
> public void test() {
>
> DataFrame rawTrainingDF = createTrainingData();
>
> DataFrame trainingDF = runPipleLineTF_IDF(rawTrainingDF);
>
> . . .
>
> }
>
>private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
>
> HashingTF hashingTF = new HashingTF()
>
> .setInputCol("words")
>
> .setOutputCol("tf")
>
> .setNumFeatures(dictionarySize);
>
>
>
> DataFrame termFrequenceDF = hashingTF.transform(rawDF);
>
>
>
> termFrequenceDF.cache(); // idf needs to make 2 passes over data
> set
>
> IDFModel idf = new IDF()
>
> //.setMinDocFreq(1) // our vocabulary has 6 words
> we hash into 7
>
> .setInputCol(hashingTF.getOutputCol())
>
> .setOutputCol("idf")
>
> .fit(termFrequenceDF);
>
>
> DataFrame tmp = idf.transform(termFrequenceDF);
>
>
>
> DataFrame ret = tmp.withColumn("features", tmp.col("tf").multiply(
> tmp.col("idf")));
>
> logger.warn("\ntransformed df printSchema()");
>
> ret.printSchema();
>
> ret.show(false);
>
>
>
> return ret;
>
> }
>
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' 

Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-17 Thread Yanbo Liang
Hi Robin,

#1 This feature is available from Spark 1.5.0.
#2 You should use the new ML package rather than the old MLlib package to train
the Random Forest model and get featureImportances, because it is only exposed
in the ML package. You can refer to the documentation:
https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
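
A minimal Scala sketch with the ML package ("training" is a DataFrame with
label/features columns):

import org.apache.spark.ml.regression.RandomForestRegressor

val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)

val model = rf.fit(training)
// a Vector with one importance value per feature, normalized to sum to 1
println(model.featureImportances)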

Thanks
Yanbo

2016-01-16 0:16 GMT+08:00 Robin East :

> re 1.
> The pull requests reference the JIRA ticket in this case
> https://issues.apache.org/jira/browse/SPARK-5133. The JIRA says it was
> released in 1.5.
>
>
>
> ---
> Robin East
> *Spark GraphX in Action* Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
>
>
>
>
>
> On 15 Jan 2016, at 16:06, Scott Imig  wrote:
>
> Hello,
>
> I have a couple of quick questions about this pull request, which adds
> feature importance calculations to the random forests in MLLib.
>
> https://github.com/apache/spark/pull/7838
>
> 1. Can someone help me determine the Spark version where this is first
> available?  (1.5.0?  1.5.1?)
>
> 2. Following the templates in this  documentation to construct a
> RandomForestModel, should I be able to retrieve model.featureImportances?
> Or is there a different pattern for random forests in more recent spark
> versions?
>
> https://spark.apache.org/docs/1.2.0/mllib-ensembles.html
>
> Thanks for the help!
> Imig
> --
> S. Imig | Senior Data Scientist Engineer | *rich**relevance *|m:
> 425.999.5725
>
> I support Bip 101 and BitcoinXT .
>
>
>


Re: AIC in Linear Regression in ml pipeline

2016-01-15 Thread Yanbo Liang
Hi Arunkumar,

Outputting the AIC value for Linear Regression is not supported currently. This
feature is under development and will be released in Spark 2.0.

Thanks
Yanbo

2016-01-15 17:20 GMT+08:00 Arunkumar Pillai :

> Hi
>
> Is it possible to get AIC value in Linear Regression using ml pipeline ?
> Is so please help me
>
> --
> Thanks and Regards
> Arun
>


Re: ml.classification.NaiveBayesModel how to reshape theta

2016-01-13 Thread Yanbo Liang
Yep, the number of rows of the Matrix theta is the number of classes and the
number of columns is the number of features.
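
For example, a short Scala sketch (the Java accessors are the same); Matrix(i, j)
indexing hides the column-major storage:

val numClasses  = nbModel.numClasses    // number of rows of theta
val numFeatures = nbModel.numFeatures   // number of columns of theta

for (c <- 0 until numClasses; f <- 0 until numFeatures) {
  // theta(c, f) is the log of the conditional probability of feature f given class c
  println(s"class=$c feature=$f logProb=${nbModel.theta(c, f)}")
}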

2016-01-13 10:47 GMT+08:00 Andy Davidson :

> I am trying to debug my trained model by exploring theta
> Theta is a Matrix. The java Doc for Matrix says that it is column major
> formate
>
> I have trained a NaiveBayesModel. Is the number of classes == to the
> number of rows?
>
> int numRows = nbModel.numClasses();
>
> int numColumns = nbModel.numFeatures();
>
>
> Kind regards
>
>
> Andy
>


Re: Deploying model built in SparkR

2016-01-11 Thread Yanbo Liang
Hi Chandan,

Could you tell us what you mean by deploying the model? Using the model to make
predictions from R?

Thanks
Yanbo

2016-01-11 20:40 GMT+08:00 Chandan Verma :

> Hi All,
>
> Does any one over here has deployed a model produced in SparkR or atleast
> help me with the steps for deployment.
>
> Regards,
> Chandan Verma
>
> Sent from my Sony Xperia™ smartphone
>


Re: broadcast params to workers at the very beginning

2016-01-11 Thread Yanbo Liang
Hi,

The parameters should be broadcast again after you update them on the driver
side; then you can get the updated version on the worker side.
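
A minimal Scala sketch of that pattern (the Params case class and its field are
hypothetical):

// update the params object on the driver from the command-line arguments
val params = defaultParams.copy(learningRate = args(0).toDouble)

// broadcast AFTER the update; workers read the broadcast value, not the driver-side object
val bcParams = sc.broadcast(params)
val scaled = rdd.map { x =>
  x * bcParams.value.learningRate
}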

Thanks
Yanbo

2016-01-09 23:12 GMT+08:00 octavian.ganea :

> Hi,
>
> In my app, I have a Params scala object that keeps all the specific
> (hyper)parameters of my program. This object is read in each worker. I
> would
> like to be able to pass specific values of the Params' fields in the
> command
> line. One way would be to simply update all the fields of the Params object
> using the values in the command line arguments. However, this will only
> update the Params local copy at the master node, while the worker nodes
> will
> still use the default Params version that is broadcasted by default at the
> very beginning of the program.
>
> Does anyone have an idea of how can I change at runtime the parameters of a
> specific object for each of its copies located at each worker ?
>
> Thanks,
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/broadcast-params-to-workers-at-the-very-beginning-tp25927.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: StandardScaler in spark.ml.feature requires vector input?

2016-01-11 Thread Yanbo Liang
Hi Kristina,

The input column of StandardScaler must be of vector type, because it's usually
used for feature scaling before model training, and the feature column is of
vector type in most cases.
If you only want to standardize a numeric column, you can wrap it into a vector
and feed it into StandardScaler.
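
A small Scala sketch of the wrapping step for your "c1" column, using
VectorAssembler:

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// wrap the numeric column into a one-element vector column
val assembler = new VectorAssembler().setInputCols(Array("c1")).setOutputCol("c1_vec")
val vecDF = assembler.transform(ccdf)

val scaled = new StandardScaler()
  .setInputCol("c1_vec")
  .setOutputCol("c1_norm")
  .setWithMean(true)
  .setWithStd(true)
  .fit(vecDF)
  .transform(vecDF)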

Thanks
Yanbo

2016-01-10 8:10 GMT+08:00 Kristina Rogale Plazonic :

> Hi,
>
> The code below gives me an unexpected result. I expected that
> StandardScaler (in ml, not mllib) will take a specified column of an input
> dataframe and subtract the mean of the column and divide the difference by
> the standard deviation of the dataframe column.
>
> However, Spark gives me the error that the input column must be of type
> vector. This shouldn't be the case, as the StandardScaler should transform
> a numeric column (not vector column) to numeric column, right?  (The
> offending line in Spark source code
> ).
> Am I missing something?
>
> Reproducing the error (python's sklearn example
> ):
>
> val ccdf = sqlContext.createDataFrame( Seq(
>   ( 1.0, -1.0,  2.0),
>   ( 2.0,  0.0,  0.0),
>   ( 0.0,  1.0, -1.0)
>   )).toDF("c1", "c2", "c3")
>
> val newccdf = new StandardScaler()
>   .setInputCol("c1")
>   .setOutputCol("c1_norm")
>   .setWithMean(true)
>   .setWithStd(true)
>   .fit(ccdf)
>   .transform(ccdf)
>
> The error output: (spark-shell, Spark 1.5.2)
>
> java.lang.IllegalArgumentException: requirement failed: Input column c1
> must be a vector column
> (.)
>
> Thanks!
> Kristina
>


Re: Date Time Regression as Feature

2016-01-07 Thread Yanbo Liang
First, extract year, month, day, and time from the datetime.
Then decide which variables can be treated as categorical features, such as
year/month/day, and encode them into one-hot (boolean) form using OneHotEncoder.
Finally, use VectorAssembler to assemble the encoded output vectors and the
other raw inputs into a features column which can be fed into the model trainer.

OneHotEncoder and VectorAssembler are feature transformers provided by
Spark ML; you can refer to
https://spark.apache.org/docs/latest/ml-features.html
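
A rough Scala sketch of that flow (the DataFrame df, the timestamp column "ts"
and the chosen date parts are illustrative):

import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}
import org.apache.spark.sql.functions._

// extract calendar parts from a timestamp column "ts"
val parts = df
  .withColumn("month", month(col("ts")).cast("double"))
  .withColumn("day",   dayofmonth(col("ts")).cast("double"))
  .withColumn("hour",  hour(col("ts")).cast("double"))

// one-hot encode the parts treated as categories
val monthEnc = new OneHotEncoder().setInputCol("month").setOutputCol("monthVec")
val hourEnc  = new OneHotEncoder().setInputCol("hour").setOutputCol("hourVec")
val encoded  = hourEnc.transform(monthEnc.transform(parts))

// assemble encoded vectors and remaining raw inputs into the final features column
val assembler = new VectorAssembler()
  .setInputCols(Array("monthVec", "day", "hourVec"))
  .setOutputCol("features")
val training = assembler.transform(encoded)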

Thanks
Yanbo

2016-01-08 7:52 GMT+08:00 Annabel Melongo :

> Or he can also transform the whole date into a string
>
>
> On Thursday, January 7, 2016 2:25 PM, Sujit Pal 
> wrote:
>
>
> Hi Jorge,
>
> Maybe extract things like dd, mm, day of week, time of day from the
> datetime string and use them as features?
>
> -sujit
>
>
> On Thu, Jan 7, 2016 at 11:09 AM, Jorge Machado <
> jorge.w.mach...@hotmail.com> wrote:
>
> Hello all,
>
> I'm new to machine learning. I'm trying to predict some electric usage
> with a decision  Free
> The data is :
> 2015-12-10-10:00, 1200
> 2015-12-11-10:00, 1150
>
> My question is : What is the best way to turn date and time into feature
> on my Vector ?
>
> Something like this :  Vector (1200, [2015,12,10,10,10] )?
> I could not fine any example with value prediction where features had
> dates in it.
>
> Thanks
>
> Jorge Machado
>
> Jorge Machado
> jo...@jmachado.me
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


Re: sparkR ORC support.

2016-01-06 Thread Yanbo Liang
You should ensure your sqlContext is a HiveContext.

sc <- sparkR.init()

sqlContext <- sparkRHive.init(sc)


2016-01-06 20:35 GMT+08:00 Sandeep Khurana :

> Felix
>
> I tried the option suggested by you.  It gave below error.  I am going to
> try the option suggested by Prem .
>
> Error in writeJobj(con, object) : invalid jobj 1
> 8
> stop("invalid jobj ", value$id)
> 7
> writeJobj(con, object)
> 6
> writeObject(con, a)
> 5
> writeArgs(rc, args)
> 4
> invokeJava(isStatic = TRUE, className, methodName, ...)
> 3
> callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
> source, options)
> 2
> read.df(sqlContext, filepath, "orc") at
> spark_api.R#108
>
> On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung 
> wrote:
>
>> Firstly I don't have ORC data to verify but this should work:
>>
>> df <- loadDF(sqlContext, "data/path", "orc")
>>
>> Secondly, could you check if sparkR.stop() was called? sparkRHive.init()
>> should be called after sparkR.init() - please check if there is any error
>> message there.
>>
>> _
>> From: Prem Sure 
>> Sent: Tuesday, January 5, 2016 8:12 AM
>> Subject: Re: sparkR ORC support.
>> To: Sandeep Khurana 
>> Cc: spark users , Deepak Sharma <
>> deepakmc...@gmail.com>
>>
>>
>>
>> Yes Sandeep, also copy hive-site.xml too to spark conf directory.
>>
>>
>> On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
>> wrote:
>>
>>> Also, do I need to setup hive in spark as per the link
>>> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
>>> ?
>>>
>>> We might need to copy hdfs-site.xml file to spark conf directory ?
>>>
>>> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
>>> wrote:
>>>
 Deepak

 Tried this. Getting this error now

 rror in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :   
 unused argument ("")


 On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
 wrote:

> Hi Sandeep
> can you try this ?
>
> results <- sql(hivecontext, "FROM test SELECT id","")
>
> Thanks
> Deepak
>
>
> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
> wrote:
>
>> Thanks Deepak.
>>
>> I tried this as well. I created a hivecontext   with  "hivecontext
>> <<- sparkRHive.init(sc) "  .
>>
>> When I tried to read hive table from this ,
>>
>> results <- sql(hivecontext, "FROM test SELECT id")
>>
>> I get below error,
>>
>> Error in callJMethod(sqlContext, "sql", sqlQuery) :   Invalid jobj 2. If 
>> SparkR was restarted, Spark operations need to be re-executed.
>>
>>
>> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>>
>>
>>
>> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
>> wrote:
>>
>>> Hi Sandeep
>>> I am not sure if ORC can be read directly in R.
>>> But there can be a workaround .First create hive table on top of ORC
>>> files and then access hive table in R.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana <
>>> sand...@infoworks.io> wrote:
>>>
 Hello

 I need to read an ORC files in hdfs in R using spark. I am not able
 to find a package to do that.

 Can anyone help with documentation or example for this purpose?

 --
 Architect
 Infoworks.io 
 http://Infoworks.io

>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>>
>> --
>> Architect
>> Infoworks.io 
>> http://Infoworks.io
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



 --
 Architect
 Infoworks.io 
 http://Infoworks.io

>>>
>>>
>>>
>>> --
>>> Architect
>>> Infoworks.io 
>>> http://Infoworks.io
>>>
>>
>>
>>
>>
>
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>


Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar,

You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or
approxCountDistinct for an approximate result.
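
For a distinct count of every column in the DataFrame at once, a short Scala
sketch:

import org.apache.spark.sql.functions.{col, countDistinct}

// one countDistinct aggregate per column of datasetDF
val distinctCounts = datasetDF.select(
  datasetDF.columns.map(c => countDistinct(col(c)).alias(c)): _*
)
distinctCounts.show()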

2016-01-05 17:11 GMT+08:00 Arunkumar Pillai :

> Hi
>
> Is there any   functions to find distinct count of all the variables in
> dataframe.
>
> val sc = new SparkContext(conf) // spark context
> val options = Map("header" -> "true", "delimiter" -> delimiter, "inferSchema" 
> -> "true")
> val sqlContext = new org.apache.spark.sql.SQLContext(sc) // sql context
> val datasetDF = 
> sqlContext.read.format("com.databricks.spark.csv").options(options).load(inputFile)
>
>
> we are able to get the schema, variable data type. is there any method to get 
> the distinct count ?
>
>
>
> --
> Thanks and Regards
> Arun
>


Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
Hi Alexander,

That's cool! Thanks for the clarification.

Yanbo

2016-01-05 5:06 GMT+08:00 Ulanov, Alexander <alexander.ula...@hpe.com>:

> Hi Yanbo,
>
>
>
> As long as two models fit into memory of a single machine, there should be
> no problems, so even 16GB machines can handle large models. (master should
> have more memory because it runs LBFGS) In my experiments, I’ve trained the
> models 12M and 32M parameters without issues.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Yanbo Liang [mailto:yblia...@gmail.com]
> *Sent:* Sunday, December 27, 2015 2:23 AM
> *To:* Joseph Bradley
> *Cc:* Eugene Morozov; user; d...@spark.apache.org
> *Subject:* Re: SparkML algos limitations question.
>
>
>
> Hi Eugene,
>
>
>
> AFAIK, the current implementation of MultilayerPerceptronClassifier have
> some scalability problems if the model is very huge (such as >10M),
> although I think the limitation can cover many use cases already.
>
>
>
> Yanbo
>
>
>
> 2015-12-16 6:00 GMT+08:00 Joseph Bradley <jos...@databricks.com>:
>
> Hi Eugene,
>
>
>
> The maxDepth parameter exists because the implementation uses Integer node
> IDs which correspond to positions in the binary tree.  This simplified the
> implementation.  I'd like to eventually modify it to avoid depending on
> tree node IDs, but that is not yet on the roadmap.
>
>
>
> There is not an analogous limit for the GLMs you listed, but I'm not very
> familiar with the perceptron implementation.
>
>
>
> Joseph
>
>
>
> On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
> Hello!
>
>
>
> I'm currently working on POC and try to use Random Forest (classification
> and regression). I also have to check SVM and Multiclass perceptron (other
> algos are less important at the moment). So far I've discovered that Random
> Forest has a limitation of maxDepth for trees and just out of curiosity I
> wonder why such a limitation has been introduced?
>
>
>
> An actual question is that I'm going to use Spark ML in production next
> year and would like to know if there are other limitations like maxDepth in
> RF for other algorithms: Logistic Regression, Perceptron, SVM, etc.
>
>
>
> Thanks in advance for your time.
>
> --
> Be well!
> Jean Morozov
>
>
>
>
>


Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Yanbo Liang
Hi Tomasz,

The limitation will not change, and you will find that all the models hold a
reference to the SparkContext in the new Spark ML package. It keeps the Python
API simple to implement.

But it does not mean you can only call this function on local data; you can
operate this function on an RDD like the following code snippet:

gmmModel.predictSoft(rdd)

then you will get a new RDD which contains the soft prediction results. All
the models in the ML package follow this rule.

Yanbo

2016-01-04 22:16 GMT+08:00 Tomasz Fruboes <tomasz.frub...@ncbj.gov.pl>:

> Hi Yanbo,
>
>  thanks for info. Is it likely to change in (near :) ) future? Ability to
> call this function only on local data (ie not in rdd) seems to be rather
> serious limitation.
>
>  cheers,
>   Tomasz
>
> On 02.01.2016 09:45, Yanbo Liang wrote:
>
>> Hi Tomasz,
>>
>> The GMM is bind with the peer Java GMM object, so it need reference to
>> SparkContext.
>> Some of MLlib(not ML) models are simple object such as KMeansModel,
>> LinearRegressionModel etc., but others will refer SparkContext. The
>> later ones and corresponding member functions should not called in map().
>>
>> Cheers
>> Yanbo
>>
>>
>>
>> 2016-01-01 4:12 GMT+08:00 Tomasz Fruboes <tomasz.frub...@ncbj.gov.pl
>> <mailto:tomasz.frub...@ncbj.gov.pl>>:
>>
>> Dear All,
>>
>>   I'm trying to implement a procedure that iteratively updates a rdd
>> using results from GaussianMixtureModel.predictSoft. In order to
>> avoid problems with local variable (the obtained GMM) beeing
>> overwritten in each pass of the loop I'm doing the following:
>>
>> ###
>> for i in xrange(10):
>>  gmm = GaussianMixture.train(rdd, 2)
>>
>>  def getSafePredictor(unsafeGMM):
>>  return lambda x: \
>>  (unsafeGMM.predictSoft(x.features),
>> unsafeGMM.gaussians.mu <http://unsafeGMM.gaussians.mu>)
>>
>>
>>  safePredictor = getSafePredictor(gmm)
>>  predictionsRDD = (labelledpointrddselectedfeatsNansPatched
>>.map(safePredictor)
>>  )
>>  print predictionsRDD.take(1)
>>  (... - rest of code - update rdd with results from
>> predictionsRdd)
>> ###
>>
>> Unfortunately this ends with:
>>
>> ###
>> Exception: It appears that you are attempting to reference
>> SparkContext from a broadcast variable, action, or transformation.
>> SparkContext can only be used on the driver, not in code that it run
>> on workers. For more information, see SPARK-5063.
>> ###
>>
>> Any idea why I'm getting this behaviour? My expectation would be,
>> that GMM should be a "simple" object without SparkContext in it.
>> I'm using spark 1.5.2
>>
>>   Thanks,
>> Tomasz
>>
>>
>> ps As a workaround I'm doing currently
>>
>> 
>>  def getSafeGMM(unsafeGMM):
>>  return lambda x: unsafeGMM.predictSoft(x)
>>
>>  safeGMM = getSafeGMM(gmm)
>>  predictionsRDD = \
>>  safeGMM(labelledpointrddselectedfeatsNansPatched.map(rdd))
>> 
>>   which works fine. If it's possible I would like to avoid this
>> approach, since it would require to perform another closure on
>> gmm.gaussians later in my code
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> <mailto:user-unsubscr...@spark.apache.org>
>> For additional commands, e-mail: user-h...@spark.apache.org
>> <mailto:user-h...@spark.apache.org>
>>
>>
>>
>


Re: GLM I'm ml pipeline

2016-01-03 Thread Yanbo Liang
AFAIK, Spark MLlib will improve GLM support and cover most GLM functions in the
next release (Spark 2.0).

2016-01-03 23:02 GMT+08:00 :

> keyStoneML could be an alternative.
>
> Ardo.
>
> On 03 Jan 2016, at 15:50, Arunkumar Pillai 
> wrote:
>
> Is there any road map for glm in pipeline?
>
>


Re: Problem embedding GaussianMixtureModel in a closure

2016-01-02 Thread Yanbo Liang
Hi Tomasz,

The GMM is bound to its peer Java GMM object, so it needs a reference to the
SparkContext.
Some MLlib (not ML) models are simple objects, such as KMeansModel,
LinearRegressionModel, etc., but others refer to the SparkContext. The latter
ones and their corresponding member functions should not be called in map().

Cheers
Yanbo



2016-01-01 4:12 GMT+08:00 Tomasz Fruboes :

> Dear All,
>
>  I'm trying to implement a procedure that iteratively updates a rdd using
> results from GaussianMixtureModel.predictSoft. In order to avoid problems
> with local variable (the obtained GMM) beeing overwritten in each pass of
> the loop I'm doing the following:
>
> ###
> for i in xrange(10):
> gmm = GaussianMixture.train(rdd, 2)
>
> def getSafePredictor(unsafeGMM):
> return lambda x: \
> (unsafeGMM.predictSoft(x.features), unsafeGMM.gaussians.mu)
>
> safePredictor = getSafePredictor(gmm)
> predictionsRDD = (labelledpointrddselectedfeatsNansPatched
>   .map(safePredictor)
> )
> print predictionsRDD.take(1)
> (... - rest of code - update rdd with results from predictionsRdd)
> ###
>
> Unfortunately this ends with:
>
> ###
> Exception: It appears that you are attempting to reference SparkContext
> from a broadcast variable, action, or transformation. SparkContext can only
> be used on the driver, not in code that it run on workers. For more
> information, see SPARK-5063.
> ###
>
> Any idea why I'm getting this behaviour? My expectation would be, that GMM
> should be a "simple" object without SparkContext in it.  I'm using spark
> 1.5.2
>
>  Thanks,
>Tomasz
>
>
> ps As a workaround I'm doing currently
>
> 
> def getSafeGMM(unsafeGMM):
> return lambda x: unsafeGMM.predictSoft(x)
>
> safeGMM = getSafeGMM(gmm)
> predictionsRDD = \
> safeGMM(labelledpointrddselectedfeatsNansPatched.map(rdd))
> 
>  which works fine. If it's possible I would like to avoid this approach,
> since it would require to perform another closure on gmm.gaussians later in
> my code
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: frequent itemsets

2016-01-02 Thread Yanbo Liang
Hi Roberto,

Could you share your code snippet so that others can help diagnose your
problem?



2016-01-02 7:51 GMT+08:00 Roberto Pagliari :

> When using the frequent itemsets APIs, I’m running into stackOverflow
> exception whenever there are too many combinations to deal with and/or too
> many transactions and/or too many items.
>
>
> Does anyone know how many transactions/items these APIs can deal with?
>
>
> Thank you ,
>
>


  1   2   >