[jira] [Created] (SPARK-17785) Find a more robust way to detect the existence of the initialModel

2016-10-05 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-17785:
-

 Summary: Find a more robust way to detect the existence of the 
initialModel
 Key: SPARK-17785
 URL: https://issues.apache.org/jira/browse/SPARK-17785
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Xusen Yin
Priority: Minor


Currently, we use initialModelFlag to check whether an estimator has an initial 
model. Figure out a more robust way to detect the existence of the initialModel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17784) Add fromCenters method for KMeans

2016-10-05 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-17784:
-

 Summary: Add fromCenters method for KMeans
 Key: SPARK-17784
 URL: https://issues.apache.org/jira/browse/SPARK-17784
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Xusen Yin
Priority: Minor


Add a new factory method fromCenters(centers: Array[Vector]) for KMeans.






[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public

2016-08-23 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434030#comment-15434030
 ] 

Xusen Yin commented on SPARK-16581:
---

Sure, no problem.

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it will be good to expose some of the R -> JVM functions 
> we have. 
> As a part of this we could also rename, reformat the functions to make them 
> more user friendly.






[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers

2016-08-19 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428557#comment-15428557
 ] 

Xusen Yin commented on SPARK-14381:
---

I believe we can resolve this.

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public

2016-08-17 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425792#comment-15425792
 ] 

Xusen Yin commented on SPARK-16581:
---

I'll find related JIRAs and link them if possible.

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it will be good to expose some of the R -> JVM functions 
> we have. 
> As a part of this we could also rename, reformat the functions to make them 
> more user friendly.






[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public

2016-08-17 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425775#comment-15425775
 ] 

Xusen Yin commented on SPARK-16581:
---

[~shivaram] [~sunrui] Are you still working on this? I can help work on it if 
it's available.

> Making JVM backend calling functions public
> ---
>
> Key: SPARK-16581
> URL: https://issues.apache.org/jira/browse/SPARK-16581
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> As described in the design doc in SPARK-15799, to help packages that need to 
> call into the JVM, it will be good to expose some of the R -> JVM functions 
> we have. 
> As a part of this we could also rename, reformat the functions to make them 
> more user friendly.






[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-08-02 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405069#comment-15405069
 ] 

Xusen Yin commented on SPARK-16857:
---

I agree that the cluster assignments could be arbitrary. Under that condition, we 
shouldn't use MulticlassClassificationEvaluator to evaluate the result.

> CrossValidator and KMeans throws IllegalArgumentException
> -
>
> Key: SPARK-16857
> URL: https://issues.apache.org/jira/browse/SPARK-16857
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
> Hadoop 2.4
>Reporter: Ryan Claussen
>
> I am attempting to use CrossValidator to train a KMeans model. When I attempt 
> to fit the data, Spark throws an IllegalArgumentException as below, since the 
> KMeans algorithm outputs an Integer into the prediction column instead of a 
> Double. Before I go too far: is using CrossValidator with KMeans supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction 
> must be of type DoubleType but was actually IntegerType.
>  at scala.Predef$.require(Predef.scala:233)
>  at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>  at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross validator.  As the stack trace 
> above indicates, it is failing at the fit step:
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new 
> IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
> val pipeline = new Pipeline().setStages(Array(labelIndexer, 
> featureIndexer, mpc, labelConverter))
> val evaluator = new 
> MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 
> 200, 500)).build()
> val cv = new 
> CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}
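The exception above boils down to a schema check: KMeans writes Integer cluster ids into the prediction column, while MulticlassClassificationEvaluator requires DoubleType. One practical workaround (my assumption, not stated in the issue) is to cast the prediction column to double between the pipeline and the evaluator, e.g. df.withColumn("prediction", col("prediction").cast("double")). A minimal Spark-free sketch of the check and the cast, with all names (SimpleType, checkPredictionColumn, castPrediction) as illustrative stand-ins for the real SchemaUtils.checkColumnType machinery:

```scala
// Hedged, Spark-free sketch: why the evaluator rejects KMeans output and how
// a cast fixes it. None of these names are Spark APIs.
sealed trait SimpleType
case object IntType extends SimpleType  // KMeans writes integer cluster ids
case object DblType extends SimpleType  // the evaluator requires doubles

// Mirrors the require(...) in SchemaUtils.checkColumnType that throws above.
def checkPredictionColumn(actual: SimpleType): Unit =
  require(actual == DblType,
    s"Column prediction must be of type $DblType but was actually $actual.")

// Stand-in for df.withColumn("prediction", col("prediction").cast("double"))
def castPrediction(actual: SimpleType): SimpleType = DblType
```

In a real pipeline the cast would sit between the KMeans stage and the evaluator, so the evaluator's schema check sees a double column.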






[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException

2016-08-02 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405050#comment-15405050
 ] 

Xusen Yin commented on SPARK-16857:
---

Using CrossValidator with KMeans should be supported. As a kind of external 
evaluation for KMeans, I think using MulticlassClassificationEvaluator with 
KMeans should also be supported. Why not send a PR, since it would be a quick 
fix?

CC [~yanboliang]

> CrossValidator and KMeans throws IllegalArgumentException
> -
>
> Key: SPARK-16857
> URL: https://issues.apache.org/jira/browse/SPARK-16857
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
> Environment: spark-jobserver docker image.  Spark 1.6.1 on ubuntu, 
> Hadoop 2.4
>Reporter: Ryan Claussen
>
> I am attempting to use CrossValidator to train a KMeans model. When I attempt 
> to fit the data, Spark throws an IllegalArgumentException as below, since the 
> KMeans algorithm outputs an Integer into the prediction column instead of a 
> Double. Before I go too far: is using CrossValidator with KMeans supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction 
> must be of type DoubleType but was actually IntegerType.
>  at scala.Predef$.require(Predef.scala:233)
>  at 
> org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>  at 
> org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>  at 
> org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>  at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>  at 
> com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>  at 
> spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>  at 
> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross validator.  As the stack trace 
> above indicates, it is failing at the fit step:
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new 
> IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
> val pipeline = new Pipeline().setStages(Array(labelIndexer, 
> featureIndexer, mpc, labelConverter))
> val evaluator = new 
> MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, 
> 200, 500)).build()
> val cv = new 
> CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}






[jira] [Comment Edited] (SPARK-3728) RandomForest: Learn models too large to store in memory

2016-07-17 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15381583#comment-15381583
 ] 

Xusen Yin edited comment on SPARK-3728 at 7/17/16 11:46 PM:


Not now. I thought the BFS style could achieve the best parallelism, while 
DFS may hurt parallelism. And IMHO, BFS-style training is not the root cause 
of out-of-memory errors during the RandomForest training phase. Do you have 
any suggestions on this?


was (Author: yinxusen):
Not now. Because I thought the BFS style could reach the best parallelism, 
while the DFS may harm the parallel ability. And IMHO the BFS style training is 
not the root cause of out-of-memory during the training phase of RandomForest. 
Do you have any suggestions on this?

> RandomForest: Learn models too large to store in memory
> ---
>
> Key: SPARK-3728
> URL: https://issues.apache.org/jira/browse/SPARK-3728
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Proposal: Write trees to disk as they are learned.
> RandomForest currently uses a FIFO queue, which means training all trees at 
> once via breadth-first search.  Using a FILO queue would encourage the code 
> to finish one tree before moving on to new ones.  This would allow the code 
> to write trees to disk as they are learned.
> Note: It would also be possible to write nodes to disk as they are learned 
> using a FIFO queue, once the example--node mapping is cached [JIRA].  The 
> [Sequoia Forest package]() does this.  However, it could be useful to learn 
> trees progressively, so that future functionality such as early stopping 
> (training fewer trees than expected) could be supported.
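The FIFO-vs-FILO trade-off in the proposal above can be sketched without Spark. In this toy model (all names illustrative; each node has a single child, unlike a real tree), FIFO visits trees breadth-first across the whole forest, while FILO finishes one tree before starting the next, which is the property that would allow writing each tree to disk as it completes:

```scala
// Spark-free sketch of the queue-discipline idea in SPARK-3728. Each work
// item is one node of one tree, simplified to a single child per node.
import scala.collection.mutable

case class NodeJob(tree: Int, depth: Int)

def processOrder(fifo: Boolean, numTrees: Int, maxDepth: Int): Seq[Int] = {
  // Seed the queue with the root node of every tree.
  val queue = mutable.ListBuffer.tabulate(numTrees)(t => NodeJob(t, depth = 0))
  val order = mutable.ArrayBuffer[Int]()
  while (queue.nonEmpty) {
    // FIFO pops from the front; FILO pops from the back.
    val job = if (fifo) queue.remove(0) else queue.remove(queue.size - 1)
    order += job.tree                  // record which tree we worked on
    if (job.depth < maxDepth)          // enqueue this node's child
      queue.append(NodeJob(job.tree, job.depth + 1))
  }
  order.toSeq
}
```

For example, with two trees of depth 2, FIFO yields the interleaved order 0,1,0,1,0,1 while FILO yields 1,1,1,0,0,0, finishing one tree entirely before touching the other.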






[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory

2016-07-17 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15381583#comment-15381583
 ] 

Xusen Yin commented on SPARK-3728:
--

Not now. I thought the BFS style could achieve the best parallelism, while 
DFS may harm parallelism. And IMHO, BFS-style training is not the root cause 
of out-of-memory errors during the RandomForest training phase. Do you have 
any suggestions on this?

> RandomForest: Learn models too large to store in memory
> ---
>
> Key: SPARK-3728
> URL: https://issues.apache.org/jira/browse/SPARK-3728
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Proposal: Write trees to disk as they are learned.
> RandomForest currently uses a FIFO queue, which means training all trees at 
> once via breadth-first search.  Using a FILO queue would encourage the code 
> to finish one tree before moving on to new ones.  This would allow the code 
> to write trees to disk as they are learned.
> Note: It would also be possible to write nodes to disk as they are learned 
> using a FIFO queue, once the example--node mapping is cached [JIRA].  The 
> [Sequoia Forest package]() does this.  However, it could be useful to learn 
> trees progressively, so that future functionality such as early stopping 
> (training fewer trees than expected) could be supported.






[jira] [Created] (SPARK-16558) examples/mllib/LDAExample should use MLVector instead of MLlib Vector

2016-07-14 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-16558:
-

 Summary: examples/mllib/LDAExample should use MLVector instead of 
MLlib Vector
 Key: SPARK-16558
 URL: https://issues.apache.org/jira/browse/SPARK-16558
 Project: Spark
  Issue Type: Bug
  Components: Examples, MLlib
Reporter: Xusen Yin
Priority: Minor


mllib.LDAExample uses the ML pipeline together with the MLlib LDA algorithm. The 
former transforms the original data into the ML Vector format, while the latter 
expects the MLlib Vector format.






[jira] [Commented] (SPARK-16447) LDA wrapper in SparkR

2016-07-08 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368149#comment-15368149
 ] 

Xusen Yin commented on SPARK-16447:
---

[~mengxr] I'd like to work on this.

> LDA wrapper in SparkR
> -
>
> Key: SPARK-16447
> URL: https://issues.apache.org/jira/browse/SPARK-16447
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>
> Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.






[jira] [Updated] (SPARK-16372) Retag RDD to tallSkinnyQR of RowMatrix

2016-07-04 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-16372:
--
Summary: Retag RDD to tallSkinnyQR of RowMatrix  (was: RowMatrix 
constructor should use retag for Java compatibility)

> Retag RDD to tallSkinnyQR of RowMatrix
> --
>
> Key: SPARK-16372
> URL: https://issues.apache.org/jira/browse/SPARK-16372
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Xusen Yin
>Priority: Minor
>
> The following Java code fails because of type erasure:
> {code}
> JavaRDD rows = jsc.parallelize(...);
> RowMatrix mat = new RowMatrix(rows.rdd());
> QRDecomposition result = mat.tallSkinnyQR(true);
> {code}
> We should use retag to restore the type to prevent the following exception:
> {code}
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
> [Lorg.apache.spark.mllib.linalg.Vector;
> {code}






[jira] [Created] (SPARK-16372) RowMatrix constructor should use retag for Java compatibility

2016-07-04 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-16372:
-

 Summary: RowMatrix constructor should use retag for Java 
compatibility
 Key: SPARK-16372
 URL: https://issues.apache.org/jira/browse/SPARK-16372
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Xusen Yin
Priority: Minor


The following Java code fails because of type erasure:

{code}
JavaRDD rows = jsc.parallelize(...);
RowMatrix mat = new RowMatrix(rows.rdd());
QRDecomposition result = mat.tallSkinnyQR(true);
{code}

We should use retag to restore the type to prevent the following exception:

{code}
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
[Lorg.apache.spark.mllib.linalg.Vector;
{code}
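The ClassCastException above comes from generic erasure meeting JVM array reification: without the right ClassTag, the RDD materializes an Object[], which can never be cast to Vector[] even when every element is a Vector; retag restores the ClassTag so a properly typed array is allocated. A minimal Spark-free illustration, using String as a stand-in for mllib.linalg.Vector:

```scala
// Spark-free illustration of the SPARK-16372 failure mode: a runtime Object[]
// (Scala Array[AnyRef]) cannot be cast to a typed array, even when every
// element has the correct type. String stands in for mllib.linalg.Vector.
def castFails(): Boolean = {
  val boxed: Array[AnyRef] = Array("a", "b")  // like Object[] from Java interop
  try {
    val typed: Array[String] = boxed.asInstanceOf[Array[String]]
    typed.isEmpty  // never reached: the array cast itself throws
  } catch {
    case _: ClassCastException => true  // same failure as in the JIRA
  }
}
```

Note the elements themselves are fine; it is the array object whose runtime class is wrong, which is exactly what retag fixes on the Spark side.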






[jira] [Commented] (SPARK-16372) RowMatrix constructor should use retag for Java compatibility

2016-07-04 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361822#comment-15361822
 ] 

Xusen Yin commented on SPARK-16372:
---

SPARK-11497 fixed this for PySpark.

> RowMatrix constructor should use retag for Java compatibility
> -
>
> Key: SPARK-16372
> URL: https://issues.apache.org/jira/browse/SPARK-16372
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Xusen Yin
>Priority: Minor
>
> The following Java code fails because of type erasure:
> {code}
> JavaRDD rows = jsc.parallelize(...);
> RowMatrix mat = new RowMatrix(rows.rdd());
> QRDecomposition result = mat.tallSkinnyQR(true);
> {code}
> We should use retag to restore the type to prevent the following exception:
> {code}
> java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to 
> [Lorg.apache.spark.mllib.linalg.Vector;
> {code}






[jira] [Created] (SPARK-16369) tallSkinnyQR of RowMatrix should be aware of empty partitions

2016-07-04 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-16369:
-

 Summary: tallSkinnyQR of RowMatrix should be aware of empty partitions
 Key: SPARK-16369
 URL: https://issues.apache.org/jira/browse/SPARK-16369
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Xusen Yin
Priority: Minor


tallSkinnyQR of RowMatrix should be aware of empty partitions, which can cause 
an exception from the Breeze QR decomposition.

See the [archived dev 
mail|https://mail-archives.apache.org/mod_mbox/spark-dev/201510.mbox/%3ccaf7adnrycvpl3qx-vzjhq4oymiuuhoscut_tkom63cm18ik...@mail.gmail.com%3E]
 for more details.






[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-27 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351942#comment-15351942
 ] 

Xusen Yin commented on SPARK-16144:
---

I'd like to work on this.

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> After we grouped generic methods by the algorithm, it would be nice to add a 
> separate Rd for each ML generic methods, in particular, write.ml, read.ml, 
> summary, and predict and link the implementations with seealso.






[jira] [Commented] (SPARK-15574) Python meta-algorithms in Scala

2016-06-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332475#comment-15332475
 ] 

Xusen Yin commented on SPARK-15574:
---

I just finished a prototype of PythonTransformer in Scala as the transformer 
wrapper for pure Python transformers. It works well if I run it alone from the 
Scala side. But if I chain the PythonTransformer with other 
transformers/estimators in a Pipeline, it fails because transformSchema is 
missing on the Python side. AFAIK, we need to add transformSchema to Python ML 
for pure Python PipelineStages. [~josephkb] [~mengxr]

> Python meta-algorithms in Scala
> ---
>
> Key: SPARK-15574
> URL: https://issues.apache.org/jira/browse/SPARK-15574
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This is an experimental idea for implementing Python ML meta-algorithms 
> (CrossValidator, TrainValidationSplit, Pipeline, OneVsRest, etc.) in Scala.  
> This would require a Scala wrapper for algorithms implemented in Python, 
> somewhat analogous to Python UDFs.
> The benefit of this change would be that we could avoid currently awkward 
> conversions between Scala/Python meta-algorithms required for persistence.  
> It would let us have full support for Python persistence and would generally 
> simplify the implementation within MLlib.






[jira] [Commented] (SPARK-11106) Should ML Models contains single models or Pipelines?

2016-06-07 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319685#comment-15319685
 ] 

Xusen Yin commented on SPARK-11106:
---

RFormula is easy to use, but it may not always do the right thing. For example, 
RFormula indexes categorical features with OneHotEncoder, but in some scenarios 
(like RandomForest), a StringIndexer is better.

> Should ML Models contains single models or Pipelines?
> -
>
> Key: SPARK-11106
> URL: https://issues.apache.org/jira/browse/SPARK-11106
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> This JIRA is for discussing whether an ML Estimators should do feature 
> processing.
> h2. Issue
> Currently, almost all ML Estimators require strict input types.  E.g., 
> DecisionTreeClassifier requires that the label column be Double type and have 
> metadata indicating the number of classes.
> This requires users to know how to preprocess data.
> h2. Ideal workflow
> A user should be able to pass any reasonable data to a Transformer or 
> Estimator and have it "do the right thing."
> E.g.:
> * If DecisionTreeClassifier is given a String column for labels, it should 
> know to index the Strings.
> * See [SPARK-10513] for a similar issue with OneHotEncoder.
> h2. Possible solutions
> There are a few solutions I have thought of.  Please comment with feedback or 
> alternative ideas!
> h3. Leave as is
> Pro: The current setup is good in that it forces the user to be very aware of 
> what they are doing.  Feature transformations will not happen silently.
> Con: The user has to write boilerplate code for transformations.  The API is 
> not what some users would expect; e.g., coming from R, a user might expect 
> some automatic transformations.
> h3. All Transformers can contain PipelineModels
> We could allow all Transformers and Models to contain arbitrary 
> PipelineModels.  E.g., if a DecisionTreeClassifier were given a String label 
> column, it might return a Model which contains a simple fitted PipelineModel 
> containing StringIndexer + DecisionTreeClassificationModel.
> The API could present this to the user, or it could be hidden from the user.  
> Ideally, it would be hidden from the beginner user, but accessible for 
> experts.
> The main problem is that we might have to break APIs.  E.g., OneHotEncoder 
> may need to do indexing if given a String input column.  This means it should 
> no longer be a Transformer; it should be an Estimator.
> h3. All Estimators should use RFormula
> The best option I have thought of is to make RFormula be the primary method 
> for automatic feature transformation.  We could start adding an RFormula 
> Param to all Estimators, and it could handle most of these feature 
> transformation issues.
> We could maintain old APIs:
> * If a user sets the input column names, then those can be used in the 
> traditional (no automatic transformation) way.
> * If a user sets the RFormula Param, then it can be used instead.  (This 
> should probably take precedence over the old API.)






[jira] [Commented] (SPARK-15574) Python meta-algorithms in Scala

2016-06-06 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15317503#comment-15317503
 ] 

Xusen Yin commented on SPARK-15574:
---

[~josephkb] Can I work on this one? 

> Python meta-algorithms in Scala
> ---
>
> Key: SPARK-15574
> URL: https://issues.apache.org/jira/browse/SPARK-15574
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This is an experimental idea for implementing Python ML meta-algorithms 
> (CrossValidator, TrainValidationSplit, Pipeline, OneVsRest, etc.) in Scala.  
> This would require a Scala wrapper for algorithms implemented in Python, 
> somewhat analogous to Python UDFs.
> The benefit of this change would be that we could avoid currently awkward 
> conversions between Scala/Python meta-algorithms required for persistence.  
> It would let us have full support for Python persistence and would generally 
> simplify the implementation within MLlib.






[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers

2016-06-06 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15317459#comment-15317459
 ] 

Xusen Yin commented on SPARK-14381:
---

Comparing mllib.feature with ml.feature, only two APIs are missing from ml:

1. HashingTF should have setAlgorithm. However, this omission is intentional, 
per SPARK-14899: https://issues.apache.org/jira/browse/SPARK-14899

2. Word2vec should have maxSentenceLength. I created a new JIRA: 
https://issues.apache.org/jira/browse/SPARK-15793

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Created] (SPARK-15793) Word2vec in ML package should have maxSentenceLength method

2016-06-06 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-15793:
-

 Summary: Word2vec in ML package should have maxSentenceLength 
method
 Key: SPARK-15793
 URL: https://issues.apache.org/jira/browse/SPARK-15793
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Xusen Yin
Priority: Minor


Word2vec in ML package should have maxSentenceLength method for feature parity.






[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers

2016-06-03 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315060#comment-15315060
 ] 

Xusen Yin commented on SPARK-14381:
---

I can work on this one.

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.






[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory

2016-06-03 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314795#comment-15314795
 ] 

Xusen Yin commented on SPARK-3728:
--

Hi [~josephkb], as I [surveyed on 
H2O|https://issues.apache.org/jira/browse/SPARK-13868?focusedCommentId=15313400=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15313400],
 it trains models in a tree-by-tree style. Can I work on this one?

> RandomForest: Learn models too large to store in memory
> ---
>
> Key: SPARK-3728
> URL: https://issues.apache.org/jira/browse/SPARK-3728
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Proposal: Write trees to disk as they are learned.
> RandomForest currently uses a FIFO queue, which means training all trees at 
> once via breadth-first search.  Using a FILO queue would encourage the code 
> to finish one tree before moving on to new ones.  This would allow the code 
> to write trees to disk as they are learned.
> Note: It would also be possible to write nodes to disk as they are learned 
> using a FIFO queue, once the example--node mapping is cached [JIRA].  The 
> [Sequoia Forest package]() does this.  However, it could be useful to learn 
> trees progressively, so that future functionality such as early stopping 
> (training fewer trees than expected) could be supported.
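The FIFO-vs-FILO trade-off described above can be simulated with a toy queue of (tree, level) work items. This is a hypothetical simplification for illustration, not Spark's actual RandomForest code:

```python
from collections import deque

def training_order(num_trees, depth, fifo=True):
    """Simulate the order in which (tree, level) work items are processed.

    FIFO (breadth-first) interleaves all trees level by level; FILO/LIFO
    (depth-first) drives one tree to full depth before resuming the others,
    so a finished tree could be written to disk and its memory freed.
    """
    queue = deque((tree, 0) for tree in range(num_trees))
    order = []
    while queue:
        tree, level = queue.popleft() if fifo else queue.pop()
        order.append((tree, level))
        if level + 1 < depth:
            queue.append((tree, level + 1))
    return order

# FIFO keeps every tree partially built until the very end;
# FILO completes trees one at a time.
bfs = training_order(num_trees=2, depth=2, fifo=True)
dfs = training_order(num_trees=2, depth=2, fifo=False)
```

With 2 trees of depth 2, the FIFO order is [(0, 0), (1, 0), (0, 1), (1, 1)], while the FILO order is [(1, 0), (1, 1), (0, 0), (0, 1)] — the last tree finishes completely before the first one resumes.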






[jira] [Comment Edited] (SPARK-13868) Random forest accuracy exploration

2016-06-02 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313400#comment-15313400
 ] 

Xusen Yin edited comment on SPARK-13868 at 6/3/16 12:40 AM:


[~josephkb] [~tanwanirahul] Here is what I found:

1. Dataset preprocessing
In this dataset, all columns except DepTime and Distance are categorical 
features. The easiest way to transform the data into LabeledPoint style is 
RFormula. However, RFormula is not suitable here because it produces a dataset 
with a different shape than the original one: RFormula uses a one-hot encoder, 
so it expands the original dataset into thousands of columns.

This brings two drawbacks:
a. The volume of the dataset grows, which may hurt performance.
b. The one-hot encoder splits one column into as many new columns as its 
cardinality, while Random Forest cannot take groups of features into 
consideration, which may hurt accuracy.

RFormula also treats DepTime and Distance as categorical features, so it adds 
more unnecessary columns and reduces accuracy a step further, because DepTime 
and Distance are the two most important features for this task.

In contrast, H2O uses the original dataset without further preprocessing.
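The one-hot column blow-up described above is easy to quantify with a small counter; the cardinalities below are hypothetical placeholders, not the actual airline-dataset values:

```python
def one_hot_width(cardinalities, numeric_cols=0, drop_last=True):
    """Number of output columns after one-hot encoding the categorical
    features; RFormula-style encoding drops one level per column."""
    per_col = [(k - 1) if drop_last else k for k in cardinalities]
    return numeric_cols + sum(per_col)

# Hypothetical schema: month (12), day of month (31), weekday (7), two
# carrier/airport-style columns (300 levels each), plus 2 numeric columns
# (DepTime, Distance).
width = one_hot_width([12, 31, 7, 300, 300], numeric_cols=2)
# 7 input columns become 647 output columns.
```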

2. Spark RandomForest can also get a good result
In my experiment, Spark RF with 10 trees, 20 maxDepth, and 1m training data 
gets AUC 0.744321364. In the same setting, H2O gets AUC 0.695598. For detailed 
result, see 
https://docs.google.com/document/d/1l7SGFtUkZeM4WEXFlpc08pfBfnu6d25KQFToFHC6CTo/edit?usp=sharing
Note that those "NA"s mean Spark got OOM on my laptop.

3. OOM of Spark Random Forest
In a single-machine environment, Spark RF is slower than H2O. Worse, OOM 
frequently occurs on Spark with more bins, more trees, and larger maxDepth. 
The reason is that Spark allocates new Double arrays quite often inside each 
partition.

Say in one partition of our dataset, Spark creates numNodes Double arrays, 
each of length numFeatures * numBins * statsSize. If we use a single machine 
with 16 partitions, we may allocate O(numPartitions * numNodes * numFeatures 
* numBins * statsSize) Doubles in total. And I can see from my experiment that 
the parameter maxMemoryInMB is barely useful. It would be better to use 
multiple servers and spread out those tasks.
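That bound turns directly into a back-of-envelope calculator; every concrete number below is a hypothetical setting chosen only to show the order of magnitude:

```python
def histogram_bytes(num_partitions, num_nodes, num_features, num_bins,
                    stats_size, bytes_per_double=8):
    """Rough total size of the per-partition aggregation buffers, following
    the O(numPartitions * numNodes * numFeatures * numBins * statsSize)
    bound described above."""
    doubles = (num_partitions * num_nodes * num_features
               * num_bins * stats_size)
    return doubles * bytes_per_double

# Hypothetical: 16 partitions, 1024 nodes split in one pass,
# 8 features, 32 bins, 3 statistics per bin.
size = histogram_bytes(16, 1024, 8, 32, 3)
# 96 MiB of Double arrays for a single splitting pass; deeper trees and
# more bins grow this multiplicatively.
```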

Spark trains random forest in a BFS mode, i.e. the 1st layer of all trees, then 
the 2nd layer of all trees, while H2O does tree-by-tree, and inside each tree, 
it trains layer-by-layer. H2O also uses smaller arrays to collect histograms 
than Spark. It uses Java Fork/Join to split tasks, and inside each task, it 
generates Double arrays with size numNodes * numFeatures * numBins, then merges 
them inside a shared DHistogram in each process. (I am not quite sure about the 
details, since the DRF code in H2O is more complicated than Spark's and lacks 
detailed comments.)

Besides, H2O also has a MemoryManager to allocate arrays and avoid OOM as long 
as possible. However, H2O also crashed with OOM once on my laptop when I was 
training 500 trees with maxDepth 20 on the 10m dataset.
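The merge pattern described for H2O — each task builds a small local histogram, then folds it into a shared one — can be sketched generically with threads and a lock. This illustrates the pattern only; it is not H2O's DHistogram or its Fork/Join code:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def build_local_hist(values, num_bins, lo, hi):
    """Each task bins its own slice of the data into a local histogram."""
    hist = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        b = min(int((v - lo) / width), num_bins - 1)
        hist[b] += 1
    return hist

def parallel_histogram(data, num_bins=4, tasks=4, lo=0.0, hi=1.0):
    shared = [0] * num_bins
    lock = threading.Lock()
    chunk = (len(data) + tasks - 1) // tasks
    def work(i):
        local = build_local_hist(data[i * chunk:(i + 1) * chunk],
                                 num_bins, lo, hi)
        with lock:  # fold the small local array into the shared histogram
            for b, count in enumerate(local):
                shared[b] += count
    with ThreadPoolExecutor(max_workers=tasks) as ex:
        list(ex.map(work, range(tasks)))
    return shared
```

The key point is that each task only ever allocates a small local array, and contention is limited to the short merge step.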




[jira] [Commented] (SPARK-13868) Random forest accuracy exploration

2016-06-02 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313400#comment-15313400
 ] 

Xusen Yin commented on SPARK-13868:
---

[~josephkb] [~tanwanirahul] Here is what I found:

1. Dataset preprocessing
In this dataset, all columns except DepTime and Distance are categorical 
features. The easiest way to transform the data into LabeledPoint style is 
RFormula. However, RFormula is not suitable here because it produces a dataset 
with a different shape than the original one: RFormula uses a one-hot encoder, 
so it expands the original dataset into thousands of columns.

This brings two drawbacks:
a. The volume of the dataset grows, which may hurt performance.
b. The one-hot encoder splits one column into as many new columns as its 
cardinality, while Random Forest cannot take groups of features into 
consideration, which may hurt accuracy.

RFormula also treats DepTime and Distance as categorical features, so it adds 
more unnecessary columns and reduces accuracy a step further, because DepTime 
and Distance are the two most important features for this task.

In contrast, H2O uses the original dataset without further preprocessing.

2. Spark RandomForest can also get a good result
In my experiment, Spark RF with 10 trees, 20 maxDepth, and 1m training data 
gets AUC 0.744321364. In the same setting, H2O gets AUC 0.695598. For detailed 
result, see 
https://docs.google.com/document/d/1l7SGFtUkZeM4WEXFlpc08pfBfnu6d25KQFToFHC6CTo/edit?usp=sharing
Note that those "NA"s mean Spark got OOM on my laptop.

3. OOM of Spark Random Forest
In a single-machine environment, Spark RF is slower than H2O. Worse, OOM 
frequently occurs on Spark with more bins, more trees, and larger maxDepth. 
The reason is that Spark allocates new Double arrays quite often inside each 
partition.

Say in one partition of our dataset, Spark creates numNodes Double arrays, 
each of length numFeatures * numBins * statsSize. If we use a single machine 
with 16 partitions, we may allocate O(numPartitions * numNodes * numFeatures 
* numBins * statsSize) Doubles in total. And I can see from my experiment that 
the parameter maxMemoryInMB is barely useful. It would be better to use 
multiple servers and spread out those tasks.

Spark trains random forest in a BFS mode, i.e. the 1st layer of all trees, then 
the 2nd layer of all trees, while H2O does tree-by-tree, and inside each tree, 
it trains layer-by-layer. H2O also uses smaller arrays to collect histograms 
than Spark. It uses Java Fork/Join to split tasks, and inside each task, it 
generates Double arrays with size numNodes * numFeatures * numBins, then merges 
them inside a shared DHistogram in each process. (I am not quite sure about the 
details, since the DRF code in H2O is more complicated than Spark's and lacks 
comments.)

Besides, H2O also has a MemoryManager to allocate arrays and avoid OOM as long 
as possible. However, H2O also crashed with OOM once on my laptop when I was 
training 500 trees with maxDepth 20 on the 10m dataset.


> Random forest accuracy exploration
> --
>
> Key: SPARK-13868
> URL: https://issues.apache.org/jira/browse/SPARK-13868
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is a JIRA for exploring accuracy improvements for Random Forests.
> h2. Background
> Initial exploration was based on reports of poor accuracy from 
> [http://datascience.la/benchmarking-random-forest-implementations/]
> Essentially, Spark 1.2 showed poor performance relative to other libraries 
> for training set sizes of 1M and 10M.
> h3.  Initial improvements
> The biggest issue was that the metric being used was AUC and Spark 1.2 was 
> using hard predictions, not class probabilities.  This was fixed in 
> [SPARK-9528], and that brought Spark up to performance parity with 
> scikit-learn, Vowpal Wabbit, and R for the training set size of 1M.
> h3.  Remaining issues
> For training set size 10M, Spark does not yet match the AUC of the other 2 
> libraries benchmarked (H2O and xgboost).
> Note that, on 1M instances, these 2 libraries also show better results than 
> scikit-learn, VW, and R.  I'm not too familiar with the H2O implementation 
> and how it differs, but xgboost is a very different algorithm, so it's not 
> surprising it has different behavior.
> h2. My explorations
> I've run Spark on the test set of 10M instances.  (Note that the benchmark 
> linked above used somewhat different settings for the different algorithms, 
> but those settings are actually not that important for this problem.  This 
> included gini vs. entropy impurity and limits on splitting nodes.)
> I've tried adjusting:
> * maxDepth: Past depth 20, going deeper does not seem to matter
> * 

[jira] [Updated] (SPARK-14973) The CrossValidator and TrainValidationSplit miss the seed when saving and loading

2016-05-02 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14973:
--
Description: The CrossValidator and TrainValidationSplit miss the seed when 
saving and loading. Need to fix both Spark side code and test suite.  (was: The 
CrossValidator and TrainValidationSplit miss the seed when saving and loading. 
Need to fix both Spark side code and test suite, plus PySpark side code and 
test suite.)

> The CrossValidator and TrainValidationSplit miss the seed when saving and 
> loading
> -
>
> Key: SPARK-14973
> URL: https://issues.apache.org/jira/browse/SPARK-14973
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Xusen Yin
>
> The CrossValidator and TrainValidationSplit miss the seed when saving and 
> loading. Need to fix both Spark side code and test suite.
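The bug class here is generic: persistence code that writes an explicit whitelist of params silently drops any param missing from the list. A minimal sketch with hypothetical keys, not Spark's actual persistence format:

```python
import json

def save_params(params, keys_to_save):
    """Persist only a whitelist of params; anything not listed (here, the
    seed) silently disappears on the save/load round trip."""
    return json.dumps({k: params[k] for k in keys_to_save})

params = {"numFolds": 3, "seed": 42}
restored = json.loads(save_params(params, ["numFolds"]))  # "seed" forgotten
# restored == {"numFolds": 3}: the loaded validator would fall back to a
# fresh default seed instead of the one that was set.
```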






[jira] [Resolved] (SPARK-14302) Python examples code merge and clean up

2016-05-01 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin resolved SPARK-14302.
---
Resolution: Won't Fix

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the existing 
> example on and off blocks.






[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-05-01 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266093#comment-15266093
 ] 

Xusen Yin commented on SPARK-14302:
---

I'll close it; if there's anything else, I'll let you know. Thanks!

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the existing 
> example on and off blocks.






[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-05-01 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266006#comment-15266006
 ] 

Xusen Yin commented on SPARK-14302:
---

[~kanjilal] Thanks for working on this. However, I checked the duplicated 
examples again and found that we should not delete all of them, as depicted 
below:

* python/ml
** None

* Unsure duplications, double check
** dataframe_example.py  --> serves as an example of DataFrame usage.
** kmeans_example.py  --> serves as an application.
** simple_params_example.py  --> serves as an example of params usage.
** simple_text_classification_pipeline.py  --> serves as an application.

* python/mllib
** gaussian_mixture_model.py  --> serves as an application.
** kmeans.py  --> ditto
** logistic_regression.py  --> ditto

* Unsure duplications, double check
** correlations.py  --> ditto
** random_rdd_generation.py  --> ditto
** sampled_rdds.py  --> ditto
** word2vec.py  --> ditto

So I think we can close this JIRA as won't fix. What do you think about it?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the existing 
> example on and off blocks.






[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-04-28 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262828#comment-15262828
 ] 

Xusen Yin commented on SPARK-14302:
---

Thanks! And sorry for the late response; I forgot about it.

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the existing 
> example on and off blocks.






[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-04-28 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262810#comment-15262810
 ] 

Xusen Yin commented on SPARK-14302:
---

We should leave them unmerged, e.g. ml.bisecting_k_means_example and 
mllib.bisecting_k_means_example. Even though they are similar, each serves a 
different purpose, i.e. is used in different documentation files.

This JIRA aims to merge duplicated code inside examples/python/ml and 
examples/python/mllib, but not between the two.

For example, python/mllib/gaussian_mixture_model.py is a duplicate of 
python/mllib/gaussian_mixture_example.py. The latter has $example on$ and 
$example off$ blocks in it, which means it serves as part of the documentation, 
so we would delete the former and keep the latter.

However, according to 
https://github.com/apache/spark/pull/12092#issuecomment-204276885, we should 
leave example code with command-line parameters untouched, so we should keep 
python/mllib/gaussian_mixture_model.py after all.

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the existing 
> example on and off blocks.






[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-04-28 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262794#comment-15262794
 ] 

Xusen Yin commented on SPARK-14302:
---

Hi Saikat, any updates?

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the existing 
> example on and off blocks.






[jira] [Commented] (SPARK-14973) The CrossValidator and TrainValidationSplit miss the seed when saving and loading

2016-04-28 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261608#comment-15261608
 ] 

Xusen Yin commented on SPARK-14973:
---

Will fix it with SPARK-14706

> The CrossValidator and TrainValidationSplit miss the seed when saving and 
> loading
> -
>
> Key: SPARK-14973
> URL: https://issues.apache.org/jira/browse/SPARK-14973
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Xusen Yin
>
> The CrossValidator and TrainValidationSplit miss the seed when saving and 
> loading. Need to fix both Spark side code and test suite, plus PySpark side 
> code and test suite.






[jira] [Created] (SPARK-14973) The CrossValidator and TrainValidationSplit miss the seed when saving and loading

2016-04-28 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-14973:
-

 Summary: The CrossValidator and TrainValidationSplit miss the seed 
when saving and loading
 Key: SPARK-14973
 URL: https://issues.apache.org/jira/browse/SPARK-14973
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Reporter: Xusen Yin


The CrossValidator and TrainValidationSplit miss the seed when saving and 
loading. Need to fix both Spark side code and test suite, plus PySpark side 
code and test suite.






[jira] [Created] (SPARK-14931) Mismatched default values between pipelines in Spark and PySpark

2016-04-26 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-14931:
-

 Summary: Mismatched default values between pipelines in Spark and 
PySpark
 Key: SPARK-14931
 URL: https://issues.apache.org/jira/browse/SPARK-14931
 Project: Spark
  Issue Type: Bug
Reporter: Xusen Yin


Mismatched default values between pipelines in Spark and PySpark lead to 
different pipelines in PySpark after saving and loading.

Find a generic way to check JavaParams, then fix them.






[jira] [Created] (SPARK-14924) OneVsRest with classifier in estimatorParamMaps of tuning fail to persistence

2016-04-26 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-14924:
-

 Summary: OneVsRest with classifier in estimatorParamMaps of tuning 
fail to persistence
 Key: SPARK-14924
 URL: https://issues.apache.org/jira/browse/SPARK-14924
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Reporter: Xusen Yin


{code}
ovr = OneVsRest()
epms = [{ovr.classifier: }, {ovr.classifier: xxx}]
cv = CrossValidator(estimator=ovr, estimatorParamMaps=epms, ...)
cv.load()
{code}

fails because the classifier cannot be serialized to JSON.
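The failure mode is the standard one for dumping arbitrary objects to JSON; a minimal sketch with a hypothetical Classifier stand-in, not Spark's actual persistence code:

```python
import json

class Classifier:
    """Stand-in for an estimator instance (e.g. a LogisticRegression);
    purely hypothetical, used only to trigger the serialization error."""
    def __init__(self, reg_param):
        self.reg_param = reg_param

# A param map that holds a live estimator object rather than a plain value.
param_map = {"classifier": Classifier(reg_param=0.1), "maxIter": 10}

try:
    json.dumps(param_map)  # only JSON-native types (str, num, list, dict) work
    error = None
except TypeError as exc:
    error = str(exc)  # the estimator instance is not JSON serializable
```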






[jira] [Commented] (SPARK-11337) Make example code in user guide testable

2016-04-25 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256751#comment-15256751
 ] 

Xusen Yin commented on SPARK-11337:
---

[~mengxr] We can close this now.

> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.
> *self-check list for contributors in this JIRA*
> * Be sure to match Scala/Java/Python code style guide. If unsure of a code 
> style, please refer to other merged example code under examples/.
> * Remove useless imports
> * It's better to have a side-effect operation at the end of each example 
> code, usually it's a {code}print(...){code}
> * Make sure the code example is runnable without error.
> * After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
> serve{code} to check the webpage at http://127.0.0.1:4000 and see whether the 
> generated html looks good.
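The extraction step such an include_example tag would perform can be sketched in a few lines. The $example on$ / $example off$ markers come from the proposal above; the function itself is an illustrative assumption, not the actual Jekyll plugin:

```python
def extract_example(source):
    """Keep only the lines between $example on$ and $example off$ markers,
    so the user guide shows the example body without the test scaffolding."""
    keep, out = False, []
    for line in source.splitlines():
        if "$example off$" in line:
            keep = False
        elif "$example on$" in line:
            keep = True
        elif keep:
            out.append(line)
    return "\n".join(out)

scala_source = """import org.apache.spark.ml.clustering.KMeans
// $example on$
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// $example off$
spark.stop()"""
snippet = extract_example(scala_source)
```

Only the marked region would be rendered under `{% highlight %}` in the markdown; the imports and shutdown code stay in the compiled example file.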






[jira] [Closed] (SPARK-11399) Include_example should support labels to cut out different parts in one example code

2016-04-25 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin closed SPARK-11399.
-
Resolution: Won't Fix

> Include_example should support labels to cut out different parts in one 
> example code
> 
>
> Key: SPARK-11399
> URL: https://issues.apache.org/jira/browse/SPARK-11399
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>
> There are many small examples that do not need to create a single example 
> file. Take the MLlib datatype page – mllib-data-types.md – as an example, 
> code examples like creating vectors and matrices are trivial work. We can 
> merge them into one single vector/matrix creation example. Then we use labels 
> to distinguish each other, such as {% include_example .scala 
> vector_creation %}.
> The "label way" is also useful in the dialog-style code example: 
> http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression.






[jira] [Commented] (SPARK-14706) Python ML persistence integration test

2016-04-21 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252435#comment-15252435
 ] 

Xusen Yin commented on SPARK-14706:
---

Sure. I'll take care of it.

There are more issues with CrossValidator, TrainValidationSplit, and OneVsRest, 
such as the missing implementation of _transfer_param_map_from/to_java(), which 
means they cannot be wrapped in another meta-estimator. I'm fixing them together.

> Python ML persistence integration test
> --
>
> Key: SPARK-14706
> URL: https://issues.apache.org/jira/browse/SPARK-14706
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Goal: extend integration test in {{ml/tests.py}}.
> In the {{PersistenceTest}} suite, there is a method {{_compare_pipelines}}.  
> This issue includes:
> * Extending {{_compare_pipelines}} to handle CrossValidator, 
> TrainValidationSplit, and OneVsRest
> * Adding an integration test in PersistenceTest which includes nested 
> meta-algorithms.  E.g.: {{Pipeline[ CrossValidator[ TrainValidationSplit[ 
> OneVsRest[ LogisticRegression ] ] ] ]}}.
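
The recursive comparison described above can be sketched without PySpark. The classes below are simplified stand-ins for the real estimators (the actual _compare_pipelines in ml/tests.py works on fitted Spark objects); the sketch only shows the recursion pattern over nested meta-algorithms.

```python
class Estimator:
    """Leaf estimator stand-in; identified by its params."""
    def __init__(self, **params):
        self.params = params

class Pipeline:
    """Pipeline-like stand-in: wraps a list of stages."""
    def __init__(self, stages):
        self.stages = stages

class CrossValidator:
    """CrossValidator/TrainValidationSplit/OneVsRest-like stand-in:
    wraps a single inner estimator."""
    def __init__(self, estimator):
        self.estimator = estimator

def compare_nested(a, b):
    """Recursively assert that two (possibly nested) estimators match."""
    assert type(a) is type(b)
    if hasattr(a, "stages"):        # Pipeline-like: compare stage by stage
        assert len(a.stages) == len(b.stages)
        for sa, sb in zip(a.stages, b.stages):
            compare_nested(sa, sb)
    elif hasattr(a, "estimator"):   # meta-estimator: recurse into the wrapped one
        compare_nested(a.estimator, b.estimator)
    else:                           # leaf: compare params directly
        assert a.params == b.params
```

An integration test would then build two copies of, e.g., Pipeline[CrossValidator[...]], persist and reload one, and call the comparison on the pair.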






[jira] [Commented] (SPARK-14706) Python ML persistence integration test

2016-04-18 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246950#comment-15246950
 ] 

Xusen Yin commented on SPARK-14706:
---

I am starting to write it.

> Python ML persistence integration test
> --
>
> Key: SPARK-14706
> URL: https://issues.apache.org/jira/browse/SPARK-14706
> Project: Spark
>  Issue Type: Test
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> Goal: extend integration test in {{ml/tests.py}}.
> In the {{PersistenceTest}} suite, there is a method {{_compare_pipelines}}.  
> This issue includes:
> * Extending {{_compare_pipelines}} to handle CrossValidator, 
> TrainValidationSplit, and OneVsRest
> * Adding an integration test in PersistenceTest which includes nested 
> meta-algorithms.  E.g.: {{Pipeline[ CrossValidator[ TrainValidationSplit[ 
> OneVsRest[ LogisticRegression ] ] ] ]}}.






[jira] [Updated] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer

2016-04-14 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14440:
--
Description: 
Since the 
PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader 
are just extended from JavaMLWriter and JavaMLReader without other 
modifications of attributes and methods, there is no need to keep them, just 
like what we did in the save/load of ml/tuning.py.


  was:
Since the 
PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader 
are just extends from JavaMLWriter and JavaMLReader without other modifications 
of attributes and methods, there is no need to keep them, just like what we did 
in the save/load of ml/tuning.py.



> Remove PySpark ml.pipeline's specific Reader and Writer
> ---
>
> Key: SPARK-14440
> URL: https://issues.apache.org/jira/browse/SPARK-14440
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Priority: Trivial
>
> Since the 
> PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader 
> are just extended from JavaMLWriter and JavaMLReader without other 
> modifications of attributes and methods, there is no need to keep them, just 
> like what we did in the save/load of ml/tuning.py.






[jira] [Updated] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer

2016-04-14 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14440:
--
Description: 
Since the 
PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader 
are just extends from JavaMLWriter and JavaMLReader without other modifications 
of attributes and methods, there is no need to keep them, just like what we did 
in the save/load of ml/tuning.py.


  was:
Since the 
PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader 
are just extends from JavaMLWriter and JavaMLReader without other modifications 
of attributes and methods, there is no need to keep them, just like what we did 
in 

Remove

* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader

and modify comments.



> Remove PySpark ml.pipeline's specific Reader and Writer
> ---
>
> Key: SPARK-14440
> URL: https://issues.apache.org/jira/browse/SPARK-14440
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Priority: Trivial
>
> Since the 
> PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader 
> are just extends from JavaMLWriter and JavaMLReader without other 
> modifications of attributes and methods, there is no need to keep them, just 
> like what we did in the save/load of ml/tuning.py.






[jira] [Updated] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer

2016-04-14 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14440:
--
Description: 
Since the 
PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader 
are just extends from JavaMLWriter and JavaMLReader without other modifications 
of attributes and methods, there is no need to keep them, just like what we did 
in 

Remove

* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader

and modify comments.


  was:
Remove

* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader

and modify comments.




> Remove PySpark ml.pipeline's specific Reader and Writer
> ---
>
> Key: SPARK-14440
> URL: https://issues.apache.org/jira/browse/SPARK-14440
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Priority: Trivial
>
> Since the 
> PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader 
> are just extends from JavaMLWriter and JavaMLReader without other 
> modifications of attributes and methods, there is no need to keep them, just 
> like what we did in 
> Remove
> * PipelineMLWriter
> * PipelineMLReader
> * PipelineModelMLWriter
> * PipelineModelMLReader
> and modify comments.






[jira] [Updated] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer

2016-04-14 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14440:
--
Description: 
Remove

* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader

and modify comments.



  was:
Remove

* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader

and modify comments.


> Remove PySpark ml.pipeline's specific Reader and Writer
> ---
>
> Key: SPARK-14440
> URL: https://issues.apache.org/jira/browse/SPARK-14440
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Priority: Trivial
>
> Remove
> * PipelineMLWriter
> * PipelineMLReader
> * PipelineModelMLWriter
> * PipelineModelMLReader
> and modify comments.






[jira] [Commented] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer

2016-04-14 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242147#comment-15242147
 ] 

Xusen Yin commented on SPARK-14440:
---

Sorry for the late response, I'll update it soon.

> Remove PySpark ml.pipeline's specific Reader and Writer
> ---
>
> Key: SPARK-14440
> URL: https://issues.apache.org/jira/browse/SPARK-14440
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Xusen Yin
>Priority: Trivial
>
> Remove
> * PipelineMLWriter
> * PipelineMLReader
> * PipelineModelMLWriter
> * PipelineModelMLReader
> and modify comments.






[jira] [Commented] (SPARK-14306) PySpark ml.classification OneVsRest support export/import

2016-04-13 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239678#comment-15239678
 ] 

Xusen Yin commented on SPARK-14306:
---

Yes, but blocked by this https://github.com/apache/spark/pull/12124

> PySpark ml.classification OneVsRest support export/import
> -
>
> Key: SPARK-14306
> URL: https://issues.apache.org/jira/browse/SPARK-14306
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>







[jira] [Created] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer

2016-04-06 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-14440:
-

 Summary: Remove PySpark ml.pipeline's specific Reader and Writer
 Key: SPARK-14440
 URL: https://issues.apache.org/jira/browse/SPARK-14440
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Xusen Yin
Priority: Trivial


Remove

* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader

and modify comments.






[jira] [Commented] (SPARK-14301) Java examples code merge and clean up

2016-04-06 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15228693#comment-15228693
 ] 

Xusen Yin commented on SPARK-14301:
---

Thanks, we'll make sure that. :)

> Java examples code merge and clean up
> -
>
> Key: SPARK-14301
> URL: https://issues.apache.org/jira/browse/SPARK-14301
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in java/examples/mllib and java/examples/ml:
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * Unsure code duplications of java/ml, double check
> ** JavaDeveloperApiExample.java
> ** JavaSimpleParamsExample.java
> ** JavaSimpleTextClassificationPipeline.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * Unsure code duplications of java/mllib, double check
> ** JavaALS.java
> ** JavaFPGrowthExample.java
> When merging and cleaning that code, be sure not to disturb the previous 
> example on and off blocks.






[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up

2016-04-01 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14299:
--
Description: 
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample
** DeveloperApiExample.scala --> I delete it for now because it's only about 
how to create your own classifier, etc., which can be learned easily from other 
examples and ml codes.
** SimpleParamsExample.scala --> merge with 
LogisticRegressionSummaryExample.scala
** SimpleTextClassificationPipeline.scala --> 
ModelSelectionViaCrossValidationExample
** DataFrameExample.scala --> merge with LogisticRegressionSummaryExample.scala

* Intend to reserve with command-line support:
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample

When merging and cleaning that code, be sure not to disturb the previous example 
on and off blocks.

I'll take this one as an example. 

  was:
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample
** DeveloperApiExample.scala --> I delete it for now because it's only about 
how to create your own classifier, etc., which can be learned easily from other 
examples and ml codes.
** SimpleParamsExample.scala --> merge with 
LogisticRegressionSummaryExample.scala
** SimpleTextClassificationPipeline.scala --> 
ModelSelectionViaCrossValidationExample
** DataFrameExample.scala --> merge with LogisticRegressionSummaryExample.scala

When merging and cleaning that code, be sure not to disturb the previous example 
on and off blocks.

I'll take this one as an example. 


> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> ** DeveloperApiExample.scala --> I delete it for now because it's only about 
> how to create your own classifier, etc., which can be learned easily from 
> other examples and ml codes.
> ** SimpleParamsExample.scala --> merge with 
> LogisticRegressionSummaryExample.scala
> ** SimpleTextClassificationPipeline.scala --> 
> ModelSelectionViaCrossValidationExample
> ** DataFrameExample.scala --> merge with 
> LogisticRegressionSummaryExample.scala
> * Intend to reserve with command-line support:
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> When merging and cleaning that code, be sure not to disturb the previous 
> example on and off blocks.
> I'll take this one as an example. 






[jira] [Commented] (SPARK-14306) PySpark ml.classification OneVsRest support export/import

2016-03-31 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220519#comment-15220519
 ] 

Xusen Yin commented on SPARK-14306:
---

Starting work on it now.

> PySpark ml.classification OneVsRest support export/import
> -
>
> Key: SPARK-14306
> URL: https://issues.apache.org/jira/browse/SPARK-14306
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>







[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220337#comment-15220337
 ] 

Xusen Yin commented on SPARK-14302:
---

This JIRA focuses only on Python examples, i.e., 
spark/examples/src/main/python/ml and spark/examples/src/main/python/mllib

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the previous 
> example on and off blocks.






[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220324#comment-15220324
 ] 

Xusen Yin commented on SPARK-14302:
---

And this JIRA is to delete or merge some example code, not to compare code in 
python/examples/mllib and python/examples/ml. See 
https://github.com/apache/spark/pull/12092 as an example.

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the previous 
> example on and off blocks.






[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220321#comment-15220321
 ] 

Xusen Yin commented on SPARK-14302:
---

Java code is in this JIRA: https://issues.apache.org/jira/browse/SPARK-14301

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the previous 
> example on and off blocks.






[jira] [Commented] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220304#comment-15220304
 ] 

Xusen Yin commented on SPARK-14302:
---

Sure, thanks

> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning that code, be sure not to disturb the previous 
> example on and off blocks.






[jira] [Commented] (SPARK-14301) Java examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220242#comment-15220242
 ] 

Xusen Yin commented on SPARK-14301:
---

Go ahead. Thanks!

> Java examples code merge and clean up
> -
>
> Key: SPARK-14301
> URL: https://issues.apache.org/jira/browse/SPARK-14301
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in java/examples/mllib and java/examples/ml:
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * Unsure code duplications of java/ml, double check
> ** JavaDeveloperApiExample.java
> ** JavaSimpleParamsExample.java
> ** JavaSimpleTextClassificationPipeline.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * Unsure code duplications of java/mllib, double check
> ** JavaALS.java
> ** JavaFPGrowthExample.java
> When merging and cleaning that code, be sure not to disturb the previous 
> example on and off blocks.






[jira] [Closed] (SPARK-13462) Vector serialization error in example code of ModelSelectionViaTrainValidationSplitExample and JavaModelSelectionViaTrainValidationSplitExample

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin closed SPARK-13462.
-
Resolution: Won't Fix

> Vector serialization error in example code of 
> ModelSelectionViaTrainValidationSplitExample and 
> JavaModelSelectionViaTrainValidationSplitExample
> ---
>
> Key: SPARK-13462
> URL: https://issues.apache.org/jira/browse/SPARK-13462
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Reporter: Xusen Yin
>Priority: Minor
>
> ModelSelectionViaTrainValidationSplitExample and 
> JavaModelSelectionViaTrainValidationSplitExample fail to run. If finally it's 
> a bug of TrainValidationSplit or LinearRegression, let's move the JIRA out of 
> SPARK-11337.






[jira] [Commented] (SPARK-13462) Vector serialization error in example code of ModelSelectionViaTrainValidationSplitExample and JavaModelSelectionViaTrainValidationSplitExample

2016-03-31 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220239#comment-15220239
 ] 

Xusen Yin commented on SPARK-13462:
---

Well, this is a false alarm. They run fine against the current GitHub master. 
I'll close it.

> Vector serialization error in example code of 
> ModelSelectionViaTrainValidationSplitExample and 
> JavaModelSelectionViaTrainValidationSplitExample
> ---
>
> Key: SPARK-13462
> URL: https://issues.apache.org/jira/browse/SPARK-13462
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Examples
>Reporter: Xusen Yin
>Priority: Minor
>
> ModelSelectionViaTrainValidationSplitExample and 
> JavaModelSelectionViaTrainValidationSplitExample fail to run. If finally it's 
> a bug of TrainValidationSplit or LinearRegression, let's move the JIRA out of 
> SPARK-11337.






[jira] [Commented] (SPARK-14300) Scala MLlib examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220182#comment-15220182
 ] 

Xusen Yin commented on SPARK-14300:
---

Thanks! Be sure to check every code example.

> Scala MLlib examples code merge and clean up
> 
>
> Key: SPARK-14300
> URL: https://issues.apache.org/jira/browse/SPARK-14300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/mllib:
> * scala/mllib
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * Unsure code duplications (need double check)
> ** AbstractParams.scala
> ** BinaryClassification.scala
> ** Correlations.scala
> ** CosineSimilarity.scala
> ** DenseGaussianMixture.scala
> ** FPGrowthExample.scala
> ** MovieLensALS.scala
> ** MultivariateSummarizer.scala
> ** RandomRDDGeneration.scala
> ** SampledRDDs.scala
> When merging and cleaning that code, be sure not to disturb the previous 
> example on and off blocks.






[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14299:
--
Description: 
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample
** DeveloperApiExample.scala --> I delete it for now because it's only about 
how to create your own classifier, etc., which can be learned easily from other 
examples and ml codes.
** SimpleParamsExample.scala --> merge with 
LogisticRegressionSummaryExample.scala
** SimpleTextClassificationPipeline.scala --> 
ModelSelectionViaCrossValidationExample
** DataFrameExample.scala --> merge with LogisticRegressionSummaryExample.scala

When merging and cleaning that code, be sure not to disturb the previous example 
on and off blocks.

I'll take this one as an example. 

  was:
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample
** DeveloperApiExample.scala --> I delete it for now because it's only about 
how to create your own classifier, etc., which can be learned easily from other 
examples and ml codes.
** SimpleParamsExample.scala --> merge with 
LogisticRegressionSummaryExample.scala
** SimpleTextClassificationPipeline.scala --> 
ModelSelectionViaCrossValidationExample

* Unsure code duplications (need double check)
** DataFrameExample.scala

When merging and cleaning that code, be sure not to disturb the previous example 
on and off blocks.

I'll take this one as an example. 


> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> ** DeveloperApiExample.scala --> I delete it for now because it's only about
> how to create your own classifier, etc., which can be learned easily from
> other examples and the ml code.
> ** SimpleParamsExample.scala --> merge with 
> LogisticRegressionSummaryExample.scala
> ** SimpleTextClassificationPipeline.scala --> 
> ModelSelectionViaCrossValidationExample
> ** DataFrameExample.scala --> merge with 
> LogisticRegressionSummaryExample.scala
> When merging and cleaning this code, be sure not to disturb the previous
> example on and off blocks.
> I'll take this one as an example. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14299:
--
Description: 
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample
** DeveloperApiExample.scala --> I delete it for now because it's only about
how to create your own classifier, etc., which can be learned easily from other
examples and the ml code.
** SimpleParamsExample.scala --> merge with 
LogisticRegressionSummaryExample.scala
** SimpleTextClassificationPipeline.scala --> 
ModelSelectionViaCrossValidationExample

* Unsure code duplications (need double check)
** DataFrameExample.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.

I'll take this one as an example. 

  was:
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample
** DeveloperApiExample.scala --> I delete it for now because it's only about
how to create your own classifier, etc., which can be learned easily from other
examples and the ml code.
** SimpleParamsExample.scala --> merge with 
LogisticRegressionSummaryExample.scala

* Unsure code duplications (need double check)
** DataFrameExample.scala
** SimpleTextClassificationPipeline.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.

I'll take this one as an example. 


> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> ** DeveloperApiExample.scala --> I delete it for now because it's only about
> how to create your own classifier, etc., which can be learned easily from
> other examples and the ml code.
> ** SimpleParamsExample.scala --> merge with 
> LogisticRegressionSummaryExample.scala
> ** SimpleTextClassificationPipeline.scala --> 
> ModelSelectionViaCrossValidationExample
> * Unsure code duplications (need double check)
> ** DataFrameExample.scala
> When merging and cleaning this code, be sure not to disturb the previous
> example on and off blocks.
> I'll take this one as an example. 






[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14299:
--
Description: 
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample
** DeveloperApiExample.scala --> I delete it for now because it's only about
how to create your own classifier, etc., which can be learned easily from other
examples and the ml code.
** SimpleParamsExample.scala --> merge with 
LogisticRegressionSummaryExample.scala

* Unsure code duplications (need double check)
** DataFrameExample.scala
** SimpleTextClassificationPipeline.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.

I'll take this one as an example. 

  was:
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample
** DeveloperApiExample.scala --> I delete it for now because it's only about
how to create your own classifier, etc., which can be learned easily from other
examples and the ml code.

* Unsure code duplications (need double check)
** DataFrameExample.scala
** SimpleParamsExample.scala
** SimpleTextClassificationPipeline.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.

I'll take this one as an example. 


> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> ** DeveloperApiExample.scala --> I delete it for now because it's only about
> how to create your own classifier, etc., which can be learned easily from
> other examples and the ml code.
> ** SimpleParamsExample.scala --> merge with 
> LogisticRegressionSummaryExample.scala
> * Unsure code duplications (need double check)
> ** DataFrameExample.scala
> ** SimpleTextClassificationPipeline.scala
> When merging and cleaning this code, be sure not to disturb the previous
> example on and off blocks.
> I'll take this one as an example. 






[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14299:
--
Description: 
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample
** DeveloperApiExample.scala --> I delete it for now because it's only about
how to create your own classifier, etc., which can be learned easily from other
examples and the ml code.

* Unsure code duplications (need double check)
** DataFrameExample.scala
** SimpleParamsExample.scala
** SimpleTextClassificationPipeline.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.

I'll take this one as an example. 

  was:
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample

* Unsure code duplications (need double check)
** DataFrameExample.scala
** DeveloperApiExample.scala
** SimpleParamsExample.scala
** SimpleTextClassificationPipeline.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.

I'll take this one as an example. 


> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> ** DeveloperApiExample.scala --> I delete it for now because it's only about
> how to create your own classifier, etc., which can be learned easily from
> other examples and the ml code.
> * Unsure code duplications (need double check)
> ** DataFrameExample.scala
> ** SimpleParamsExample.scala
> ** SimpleTextClassificationPipeline.scala
> When merging and cleaning this code, be sure not to disturb the previous
> example on and off blocks.
> I'll take this one as an example. 






[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14299:
--
Description: 
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample

* Unsure code duplications (need double check)
** DataFrameExample.scala
** DeveloperApiExample.scala
** SimpleParamsExample.scala
** SimpleTextClassificationPipeline.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.

I'll take this one as an example. 

  was:
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample

* Unsure code duplications (need double check)
** DataFrameExample.scala
** DeveloperApiExample.scala
** SimpleParamsExample.scala
** SimpleTextClassificationPipeline.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks. I'll take this one as an example.


> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> * Unsure code duplications (need double check)
> ** DataFrameExample.scala
> ** DeveloperApiExample.scala
> ** SimpleParamsExample.scala
> ** SimpleTextClassificationPipeline.scala
> When merging and cleaning this code, be sure not to disturb the previous
> example on and off blocks.
> I'll take this one as an example. 






[jira] [Commented] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-31 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220083#comment-15220083
 ] 

Xusen Yin commented on SPARK-14041:
---

I've split them into 4 JIRAs.

> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> To find out all examples of ml/mllib that don't contain "example on": 
> {code}grep -L "example on" /path/to/ml-or-mllib/examples{code}
> Duplicates need to be deleted:
> * scala/ml
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
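The quoted {code}grep -L "example on"{code} check can also be sketched in Scala. The helper below is hypothetical (not part of Spark); it walks a directory and reports source files whose content never mentions the "example on" marker, mirroring grep's -L (files-without-match) behavior:

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object MissingExampleMarker {
  // Equivalent in spirit to `grep -L "example on"`: list files under `root`
  // with the given suffix whose content lacks the "example on" marker.
  def find(root: Path, suffix: String = ".scala"): Seq[Path] = {
    val stream = Files.walk(root)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p) && p.toString.endsWith(suffix))
        .filterNot(p => new String(Files.readAllBytes(p), "UTF-8").contains("example on"))
        .toList // materialize before the stream is closed
    } finally stream.close()
  }
}
```

Running this over the example directories is one way to regenerate the "files still missing example on/off blocks" lists above as files get merged.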






[jira] [Updated] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14302:
--
Description: 
Duplicated code that I found in python/examples/mllib and python/examples/ml:

* python/ml
** None

* Unsure duplications, double check
** dataframe_example.py
** kmeans_example.py
** simple_params_example.py
** simple_text_classification_pipeline.py

* python/mllib
** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

* Unsure duplications, double check
** correlations.py
** random_rdd_generation.py
** sampled_rdds.py
** word2vec.py

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.

  was:
Duplicated code that I found in python/examples/mllib and python/examples/ml:

* python/ml
** None

* python/mllib
** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.


> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * Unsure duplications, double check
> ** dataframe_example.py
> ** kmeans_example.py
> ** simple_params_example.py
> ** simple_text_classification_pipeline.py
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> * Unsure duplications, double check
> ** correlations.py
> ** random_rdd_generation.py
> ** sampled_rdds.py
> ** word2vec.py
> When merging and cleaning this code, be sure not to disturb the previous
> example on and off blocks.






[jira] [Updated] (SPARK-14301) Java examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14301:
--
Description: 
Duplicated code that I found in java/examples/mllib and java/examples/ml:

* java/ml
** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* Unsure code duplications of java/ml, double check
** JavaDeveloperApiExample.java
** JavaSimpleParamsExample.java
** JavaSimpleTextClassificationPipeline.java

* java/mllib
** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* Unsure code duplications of java/mllib, double check
** JavaALS.java
** JavaFPGrowthExample.java

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.

  was:
Duplicated code that I found in java/examples/mllib and java/examples/ml:

* java/ml
** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib
** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.


> Java examples code merge and clean up
> -
>
> Key: SPARK-14301
> URL: https://issues.apache.org/jira/browse/SPARK-14301
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in java/examples/mllib and java/examples/ml:
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * Unsure code duplications of java/ml, double check
> ** JavaDeveloperApiExample.java
> ** JavaSimpleParamsExample.java
> ** JavaSimpleTextClassificationPipeline.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * Unsure code duplications of java/mllib, double check
> ** JavaALS.java
> ** JavaFPGrowthExample.java
> When merging and cleaning this code, be sure not to disturb the previous
> example on and off blocks.






[jira] [Updated] (SPARK-14300) Scala MLlib examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14300:
--
Description: 
Duplicated code that I found in scala/examples/mllib:

* scala/mllib
** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* Unsure code duplications (need double check)
** AbstractParams.scala
** BinaryClassification.scala
** Correlations.scala
** CosineSimilarity.scala
** DenseGaussianMixture.scala
** FPGrowthExample.scala
** MovieLensALS.scala
** MultivariateSummarizer.scala
** RandomRDDGeneration.scala
** SampledRDDs.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.

  was:
Duplicated code that I found in scala/examples/mllib:

* scala/mllib
** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks.


> Scala MLlib examples code merge and clean up
> 
>
> Key: SPARK-14300
> URL: https://issues.apache.org/jira/browse/SPARK-14300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/mllib:
> * scala/mllib
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * Unsure code duplications (need double check)
> ** AbstractParams.scala
> ** BinaryClassification.scala
> ** Correlations.scala
> ** CosineSimilarity.scala
> ** DenseGaussianMixture.scala
> ** FPGrowthExample.scala
> ** MovieLensALS.scala
> ** MultivariateSummarizer.scala
> ** RandomRDDGeneration.scala
> ** SampledRDDs.scala
> When merging and cleaning this code, be sure not to disturb the previous
> example on and off blocks.






[jira] [Created] (SPARK-14299) Scala examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-14299:
-

 Summary: Scala examples code merge and clean up
 Key: SPARK-14299
 URL: https://issues.apache.org/jira/browse/SPARK-14299
 Project: Spark
  Issue Type: Sub-task
  Components: Examples
Reporter: Xusen Yin
Priority: Minor


Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks. I'll take this one as an example.






[jira] [Created] (SPARK-14300) Scala MLlib examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-14300:
-

 Summary: Scala MLlib examples code merge and clean up
 Key: SPARK-14300
 URL: https://issues.apache.org/jira/browse/SPARK-14300
 Project: Spark
  Issue Type: Sub-task
  Components: Examples
Reporter: Xusen Yin
Priority: Minor


Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks. I'll take this one as an example.






[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14299:
--
Description: 
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample

* Unsure code duplications
** DataFrameExample.scala
** DeveloperApiExample.scala
** SimpleParamsExample.scala
** SimpleTextClassificationPipeline.scala

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks. I'll take this one as an example.

  was:
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample

When merging and cleaning this code, be sure not to disturb the previous
example on and off blocks. I'll take this one as an example.


> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> * Unsure code duplications
> ** DataFrameExample.scala
> ** DeveloperApiExample.scala
> ** SimpleParamsExample.scala
> ** SimpleTextClassificationPipeline.scala
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks. I'll take this one as an example. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14299:
--
Description: 
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample

* Unsure code duplications (need double check)
** DataFrameExample.scala
** DeveloperApiExample.scala
** SimpleParamsExample.scala
** SimpleTextClassificationPipeline.scala

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks. I'll take this one as an example. 

  was:
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample

* Unsure code duplications
** DataFrameExample.scala
** DeveloperApiExample.scala
** SimpleParamsExample.scala
** SimpleTextClassificationPipeline.scala

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks. I'll take this one as an example. 


> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> * Unsure code duplications (need double check)
> ** DataFrameExample.scala
> ** DeveloperApiExample.scala
> ** SimpleParamsExample.scala
> ** SimpleTextClassificationPipeline.scala
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks. I'll take this one as an example. 






[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14041:
--
Description: 
To find out all examples of ml/mllib that don't contain "example on": 
{code}grep -L "example on" /path/to/ml-or-mllib/examples{code}
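As a concrete sketch of that check, the snippet below recreates a tiny examples tree and runs the same grep over it. The /tmp path, the file names, and the added -r flag are illustrative assumptions, not part of the original command (which shows only -L):

```shell
# Sketch: build a two-file examples tree, then list files that lack the
# "example on" marker. Paths and names here are made up for illustration.
rm -rf /tmp/spark-examples-demo
mkdir -p /tmp/spark-examples-demo

# A file wrapped in the $example on$ / $example off$ markers.
cat > /tmp/spark-examples-demo/WithMarker.scala <<'EOF'
// $example on$
val data = Seq(1, 2, 3)
// $example off$
EOF

# A legacy file with no marker at all.
printf 'val legacy = 42\n' > /tmp/spark-examples-demo/Legacy.scala

# -L prints only the names of files with NO matching line; -r recurses
# into the directory (assumed here, since the target is a directory).
grep -rL "example on" /tmp/spark-examples-demo
# prints: /tmp/spark-examples-demo/Legacy.scala
```

Files that do contain the marker are omitted, so the output is exactly the legacy files that still need cleanup.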

Duplicates need to be deleted:

* scala/ml
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib
** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* java/ml
** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib
** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml
** None

* python/mllib
** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

  was:
To find out all examples of ml/mllib that don't contain "example on": 
{code}grep -L "example on" /path/to/ml-or-mllib/examples{code}

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py


> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> To find out all examples of ml/mllib that don't contain "example on": 
> {code}grep -L "example on" /path/to/ml-or-mllib/examples{code}
> Duplicates need to be deleted:
> * scala/ml
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py






[jira] [Updated] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14302:
--
Description: 
Duplicated code that I found in python/examples/mllib and python/examples/ml:

* python/ml
** None

* python/mllib
** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks.

  was:
Duplicated code that I found in java/examples/mllib and java/examples/ml:

* java/ml
** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib
** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks.


> Python examples code merge and clean up
> ---
>
> Key: SPARK-14302
> URL: https://issues.apache.org/jira/browse/SPARK-14302
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in python/examples/mllib and python/examples/ml:
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.






[jira] [Updated] (SPARK-14301) Java examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14301:
--
Description: 
Duplicated code that I found in java/examples/mllib and java/examples/ml:

* java/ml
** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib
** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks.

  was:
Duplicated code that I found in java/examples/mllib and java/examples/ml:

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks.


> Java examples code merge and clean up
> -
>
> Key: SPARK-14301
> URL: https://issues.apache.org/jira/browse/SPARK-14301
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in java/examples/mllib and java/examples/ml:
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.






[jira] [Updated] (SPARK-14301) Java examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14301:
--
Description: 
Duplicated code that I found in java/examples/mllib and java/examples/ml:

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks.

  was:
Duplicated code that I found in scala/examples/mllib:

* scala/mllib
** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks.


> Java examples code merge and clean up
> -
>
> Key: SPARK-14301
> URL: https://issues.apache.org/jira/browse/SPARK-14301
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in java/examples/mllib and java/examples/ml:
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.






[jira] [Created] (SPARK-14302) Python examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-14302:
-

 Summary: Python examples code merge and clean up
 Key: SPARK-14302
 URL: https://issues.apache.org/jira/browse/SPARK-14302
 Project: Spark
  Issue Type: Sub-task
  Components: Examples
Reporter: Xusen Yin
Priority: Minor


Duplicated code that I found in java/examples/mllib and java/examples/ml:

* java/ml
** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib
** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks.






[jira] [Updated] (SPARK-14300) Scala MLlib examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14300:
--
Description: 
Duplicated code that I found in scala/examples/mllib:

* scala/mllib
** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks.

  was:
Duplicated code that I found in scala/examples/ml:

* scala/ml
** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
DecisionTreeClassificationExample
** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
GradientBoostedTreeRegressorExample
** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, 
LogisticRegressionSummaryExample
** RandomForestExample.scala --> RandomForestRegressorExample, 
RandomForestClassifierExample
** TrainValidationSplitExample.scala --> 
ModelSelectionViaTrainValidationSplitExample

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks. I'll take this one as an example. 


> Scala MLlib examples code merge and clean up
> 
>
> Key: SPARK-14300
> URL: https://issues.apache.org/jira/browse/SPARK-14300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/mllib:
> * scala/mllib
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.






[jira] [Created] (SPARK-14301) Java examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-14301:
-

 Summary: Java examples code merge and clean up
 Key: SPARK-14301
 URL: https://issues.apache.org/jira/browse/SPARK-14301
 Project: Spark
  Issue Type: Sub-task
  Components: Examples
Reporter: Xusen Yin
Priority: Minor


Duplicated code that I found in scala/examples/mllib:

* scala/mllib
** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

When merging and cleaning that code, be sure not to disturb the existing example 
on/off blocks.






[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up

2016-03-31 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14299:
--
Summary: Scala ML examples code merge and clean up  (was: Scala examples 
code merge and clean up)

> Scala ML examples code merge and clean up
> -
>
> Key: SPARK-14299
> URL: https://issues.apache.org/jira/browse/SPARK-14299
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/ml:
> * scala/ml
> ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample
> ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, 
> DecisionTreeClassificationExample
> ** GBTExample.scala --> GradientBoostedTreeClassifierExample, 
> GradientBoostedTreeRegressorExample
> ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample
> ** LogisticRegressionExample.scala --> 
> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample
> ** RandomForestExample.scala --> RandomForestRegressorExample, 
> RandomForestClassifierExample
> ** TrainValidationSplitExample.scala --> 
> ModelSelectionViaTrainValidationSplitExample
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks. I'll take this one as an example. 






[jira] [Created] (SPARK-14181) TrainValidationSplit should have HasSeed

2016-03-27 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-14181:
-

 Summary: TrainValidationSplit should have HasSeed
 Key: SPARK-14181
 URL: https://issues.apache.org/jira/browse/SPARK-14181
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Xusen Yin
Priority: Minor


TrainValidationSplit should have HasSeed, just like its Python companion.






[jira] [Commented] (SPARK-13786) Pyspark ml.tuning support export/import

2016-03-27 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213357#comment-15213357
 ] 

Xusen Yin commented on SPARK-13786:
---

I have finished the CrossValidator, but need to wait until SPARK-11893 is 
merged first.

> Pyspark ml.tuning support export/import
> ---
>
> Key: SPARK-13786
> URL: https://issues.apache.org/jira/browse/SPARK-13786
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This should follow whatever implementation is chosen for Pipeline (since 
> these are all meta-algorithms).
> Note this will also require persistence for Evaluators.  Hopefully that can 
> leverage the Java implementations; there is not a real need to make Python 
> Evaluators be MLWritable, as far as I can tell.






[jira] [Commented] (SPARK-13786) Pyspark ml.tuning support export/import

2016-03-25 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212364#comment-15212364
 ] 

Xusen Yin commented on SPARK-13786:
---

I'll work on it.

> Pyspark ml.tuning support export/import
> ---
>
> Key: SPARK-13786
> URL: https://issues.apache.org/jira/browse/SPARK-13786
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> This should follow whatever implementation is chosen for Pipeline (since 
> these are all meta-algorithms).
> Note this will also require persistence for Evaluators.  Hopefully that can 
> leverage the Java implementations; there is not a real need to make Python 
> Evaluators be MLWritable, as far as I can tell.






[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-25 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14041:
--
Description: 
To find out all examples of ml/mllib that don't contain "example on": 
{code}grep -L "example on" /path/to/ml-or-mllib/examples{code}

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

  was:
Please go through the current example code and list possible duplicates.

To find out all examples of ml/mllib that don't contain "example on": 
{code}grep -L "example on" /path/to/ml-or-mllib/examples{code}

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py


> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> To find out all examples of ml/mllib that don't contain "example on": 
> {code}grep -L "example on" /path/to/ml-or-mllib/examples{code}
> Duplicates need to be deleted:
> * scala/ml
>   
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> 
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py






[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-23 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14041:
--
Description: 
Please go through the current example code and list possible duplicates.

To find out all examples of ml/mllib that don't contain "example on": 
{code}grep -L "example on" /path/to/ml-or-mllib/examples{code}

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

  was:
Please go through the current example code and list possible duplicates.

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py


> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Please go through the current example code and list possible duplicates.
> To find out all examples of ml/mllib that don't contain "example on": 
> {code}grep -L "example on" /path/to/ml-or-mllib/examples{code}
> Duplicates need to be deleted:
> * scala/ml
>   
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> 
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py






[jira] [Commented] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-22 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207417#comment-15207417
 ] 

Xusen Yin commented on SPARK-14041:
---

[~mengxr] Maybe there's no need to divide them into several JIRAs, since what 
we need to do is delete them.

> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Please go through the current example code and list possible duplicates.
> Duplicates need to be deleted:
> * scala/ml
>   
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> 
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-22 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14041:
--
Description: 
Please go through the current example code and list possible duplicates.

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

* java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

  was:
Please go through the current example code and list possible duplicates.

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

*java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py


> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Please go through the current example code and list possible duplicates.
> Duplicates need to be deleted:
> * scala/ml
>   
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> 
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks

2016-03-22 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-14041:
--
Description: 
Please go through the current example code and list possible duplicates.

Duplicates need to be deleted:

* scala/ml
  
** CrossValidatorExample.scala
** DecisionTreeExample.scala
** GBTExample.scala
** LinearRegressionExample.scala
** LogisticRegressionExample.scala
** RandomForestExample.scala
** TrainValidationSplitExample.scala

* scala/mllib

** DecisionTreeRunner.scala 
** DenseGaussianMixture.scala
** DenseKMeans.scala
** GradientBoostedTreesRunner.scala
** LDAExample.scala
** LinearRegression.scala
** SparseNaiveBayes.scala
** StreamingLinearRegression.scala
** StreamingLogisticRegression.scala
** TallSkinnyPCA.scala
** TallSkinnySVD.scala

*java/ml

** JavaCrossValidatorExample.java
** JavaDocument.java
** JavaLabeledDocument.java
** JavaTrainValidationSplitExample.java

* java/mllib

** JavaKMeans.java
** JavaLDAExample.java
** JavaLR.java

* python/ml

** None

* python/mllib

** gaussian_mixture_model.py
** kmeans.py
** logistic_regression.py

  was:Please go through the current example code and list possible duplicates.


> Locate possible duplicates and group them into subtasks
> ---
>
> Key: SPARK-14041
> URL: https://issues.apache.org/jira/browse/SPARK-14041
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>
> Please go through the current example code and list possible duplicates.
> Duplicates need to be deleted:
> * scala/ml
>   
> ** CrossValidatorExample.scala
> ** DecisionTreeExample.scala
> ** GBTExample.scala
> ** LinearRegressionExample.scala
> ** LogisticRegressionExample.scala
> ** RandomForestExample.scala
> ** TrainValidationSplitExample.scala
> * scala/mllib
> 
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> *java/ml
> ** JavaCrossValidatorExample.java
> ** JavaDocument.java
> ** JavaLabeledDocument.java
> ** JavaTrainValidationSplitExample.java
> * java/mllib
> ** JavaKMeans.java
> ** JavaLDAExample.java
> ** JavaLR.java
> * python/ml
> ** None
> * python/mllib
> ** gaussian_mixture_model.py
> ** kmeans.py
> ** logistic_regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13461) Duplicated example code merge and cleanup

2016-03-20 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203569#comment-15203569
 ] 

Xusen Yin commented on SPARK-13461:
---

I deleted it. It's from another JIRA.

> Duplicated example code merge and cleanup
> -
>
> Key: SPARK-13461
> URL: https://issues.apache.org/jira/browse/SPARK-13461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Merge duplicated code after we finish the example code substitution.
> Duplications include:
> * JavaTrainValidationSplitExample 
> * TrainValidationSplitExample
> * Others can be added here ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13461) Duplicated example code merge and cleanup

2016-03-20 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-13461:
--
Description: 
Merge duplicated code after we finish the example code substitution.

Duplications include:

* JavaTrainValidationSplitExample 

* TrainValidationSplitExample

* Others can be added here ...

  was:
Merge duplicated code after we finish the example code substitution.

Duplications include:

* JavaTrainValidationSplitExample 

* TrainValidationSplitExample

* The random data generation example in mllib-statistics.md needs the "-" removed from each line.

* Others can be added here ...


> Duplicated example code merge and cleanup
> -
>
> Key: SPARK-13461
> URL: https://issues.apache.org/jira/browse/SPARK-13461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Merge duplicated code after we finish the example code substitution.
> Duplications include:
> * JavaTrainValidationSplitExample 
> * TrainValidationSplitExample
> * Others can be added here ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13461) Duplicated example code merge and cleanup

2016-03-19 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203076#comment-15203076
 ] 

Xusen Yin commented on SPARK-13461:
---

Yes, we'll delete it.

> Duplicated example code merge and cleanup
> -
>
> Key: SPARK-13461
> URL: https://issues.apache.org/jira/browse/SPARK-13461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Merge duplicated code after we finish the example code substitution.
> Duplications include:
> * JavaTrainValidationSplitExample 
> * TrainValidationSplitExample
> * The random data generation example in mllib-statistics.md needs the "-" 
> removed from each line.
> * Others can be added here ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13993) PySpark ml.feature.RFormula/RFormulaModel support export/import

2016-03-19 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-13993:
-

 Summary: PySpark ml.feature.RFormula/RFormulaModel support 
export/import
 Key: SPARK-13993
 URL: https://issues.apache.org/jira/browse/SPARK-13993
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Xusen Yin
Priority: Minor


Add save/load for RFormula and its model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13951) PySpark ml.pipeline support export/import - nested Piplines

2016-03-18 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198642#comment-15198642
 ] 

Xusen Yin commented on SPARK-13951:
---

I'll start working on it now.

> PySpark ml.pipeline support export/import - nested Piplines
> ---
>
> Key: SPARK-13951
> URL: https://issues.apache.org/jira/browse/SPARK-13951
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196765#comment-15196765
 ] 

Xusen Yin commented on SPARK-13641:
---

[~muralidh] I'm going to close this JIRA, since I found that this behavior is 
intentional: the one-hot encoder indexes discrete features this way.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which causes the output attributes to differ from the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  appends suffixes to the column names.
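The naming behavior described in the issue can be imitated in a small pure-Python sketch (the helper below is hypothetical, not Spark's actual implementation): dummy encoding with a dropped reference level turns a column like V1 into a suffixed name such as V1_n.

```python
def dummy_names(column, categories):
    """Mimic dummy-variable naming: one output column per category,
    with the last category dropped as the reference level."""
    return [f"{column}_{c}" for c in categories[:-1]]

# A binary column "V1" with levels n/y keeps only the "n" dummy.
print(dummy_names("V1", ["n", "y"]))  # ['V1_n']
```

Which reference level gets dropped (and hence which suffix survives) depends on the encoder, which is why the surviving name differs from column to column.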



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196765#comment-15196765
 ] 

Xusen Yin edited comment on SPARK-13641 at 3/16/16 5:00 AM:


I'm going to close this JIRA, since I found that this behavior is intentional: 
the one-hot encoder indexes discrete features this way.


was (Author: yinxusen):
[~muralidh] I'm going to close this JIRA, since I found that this behavior is 
intentional: the one-hot encoder indexes discrete features this way.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which causes the output attributes to differ from the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  appends suffixes to the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator

2016-03-14 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193991#comment-15193991
 ] 

Xusen Yin commented on SPARK-11136:
---

I agree. I'll add it in the new commit. Thanks!

> Warm-start support for ML estimator
> ---
>
> Key: SPARK-11136
> URL: https://issues.apache.org/jira/browse/SPARK-11136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> The current implementation of Estimator does not support warm-start fitting, 
> i.e. estimator.fit(data, params, partialModel). But first we need to add 
> warm-start for all ML estimators. This is an umbrella JIRA to add support for 
> the warm-start estimator. 
> Treat model as a special parameter, passing it through ParamMap. e.g. val 
> partialModel: Param[Option[M]] = new Param(...). If the model exists, we use 
> it to warm-start; otherwise we start the training process from the beginning.
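The warm-start idea above — treating an optional partial model as just one more parameter — can be sketched in pure Python (a toy estimator with hypothetical names, not the Spark API):

```python
def fit(data, params=None, partial_model=None):
    """Toy estimator whose 'model' is a single mean estimate.
    If partial_model is given, training resumes from it (warm start);
    otherwise it starts from scratch."""
    params = params or {}
    estimate = 0.0 if partial_model is None else partial_model
    for _ in range(params.get("maxIter", 10)):
        # A damped step toward the sample mean stands in for real training.
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate

cold = fit([4.0, 4.0, 4.0])                     # starts from 0.0
warm = fit([4.0, 4.0, 4.0], partial_model=4.0)  # already at the optimum
```

A warm start from a good model leaves nothing to learn, while a cold start still has to close the full gap.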



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13868) Random forest accuracy exploration

2016-03-14 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193899#comment-15193899
 ] 

Xusen Yin commented on SPARK-13868:
---

I'd love to explore this.

> Random forest accuracy exploration
> --
>
> Key: SPARK-13868
> URL: https://issues.apache.org/jira/browse/SPARK-13868
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This is a JIRA for exploring accuracy improvements for Random Forests.
> h2. Background
> Initial exploration was based on reports of poor accuracy from 
> [http://datascience.la/benchmarking-random-forest-implementations/]
> Essentially, Spark 1.2 showed poor performance relative to other libraries 
> for training set sizes of 1M and 10M.
> h3.  Initial improvements
> The biggest issue was that the metric being used was AUC and Spark 1.2 was 
> using hard predictions, not class probabilities.  This was fixed in 
> [SPARK-9528], and that brought Spark up to performance parity with 
> scikit-learn, Vowpal Wabbit, and R for the training set size of 1M.
> h3.  Remaining issues
> For training set size 10M, Spark does not yet match the AUC of the other 2 
> libraries benchmarked (H2O and xgboost).
> Note that, on 1M instances, these 2 libraries also show better results than 
> scikit-learn, VW, and R.  I'm not too familiar with the H2O implementation 
> and how it differs, but xgboost is a very different algorithm, so it's not 
> surprising it has different behavior.
> h2. My explorations
> I've run Spark on the test set of 10M instances.  (Note that the benchmark 
> linked above used somewhat different settings for the different algorithms, 
> but those settings are actually not that important for this problem.  This 
> included gini vs. entropy impurity and limits on splitting nodes.)
> I've tried adjusting:
> * maxDepth: Past depth 20, going deeper does not seem to matter
> * maxBins: I've gone up to 500, but this too does not seem to matter.  
> However, this is a hard thing to verify since slight differences in 
> discretization could become significant in a large tree.
> h2. Current questions
> * H2O: It would be good to understand how this implementation differs from 
> standard RF implementations (in R, VW, scikit-learn, and Spark).
> * xgboost: There's a JIRA for it: [SPARK-8547].  It would be great to see the 
> Spark package linked from that JIRA tested vs. MLlib on the benchmark data 
> (or other data).  From what I've heard/read, xgboost is sometimes better, 
> sometimes worse in accuracy (but of course faster with more localized 
> training).
> * Based on the above explorations, are there changes we should make to Spark 
> RFs?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator

2016-03-14 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193886#comment-15193886
 ] 

Xusen Yin commented on SPARK-11136:
---

This is a good point. Actually, in our current setup, the new KMeans only uses 
the model itself (i.e. the array of cluster centers) without its parameters. 
E.g.

{code}
if (isSet(initialModel)) {
  require($(initialModel).parentModel.clusterCenters.length == $(k), 
"mismatched cluster count")
  require(rdd.first().size == $(initialModel).clusterCenters.head.size, 
"mismatched dimension")
  algo.setInitialModel($(initialModel).parentModel)
}
{code}

But I think you're right. We should also extend the parameters in some 
scenarios. IMHO, the parameter overriding order should be (initialModel 
parameter < default parameter < user-set parameter). What do you think about it?
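The proposed precedence (initialModel parameter < default parameter < user-set parameter) amounts to a left-to-right map merge where later entries win; a pure-Python sketch with hypothetical names, not the actual ParamMap API:

```python
def resolve_params(from_initial_model, defaults, user_set):
    """Merge parameter maps with increasing precedence:
    initial-model params < estimator defaults < user-set params."""
    # Dict unpacking: keys from later maps override earlier ones.
    return {**from_initial_model, **defaults, **user_set}

merged = resolve_params(
    {"k": 2, "maxIter": 10},  # carried over from the initial model
    {"k": 3},                 # estimator defaults
    {"maxIter": 50},          # explicitly set by the user
)
print(merged)  # {'k': 3, 'maxIter': 50}
```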

> Warm-start support for ML estimator
> ---
>
> Key: SPARK-11136
> URL: https://issues.apache.org/jira/browse/SPARK-11136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>Priority: Minor
>
> The current implementation of Estimator does not support warm-start fitting, 
> i.e. estimator.fit(data, params, partialModel). But first we need to add 
> warm-start for all ML estimators. This is an umbrella JIRA to add support for 
> the warm-start estimator. 
> Treat model as a special parameter, passing it through ParamMap. e.g. val 
> partialModel: Param[Option[M]] = new Param(...). If the model exists, we use 
> it to warm-start; otherwise we start the training process from the beginning.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13461) Duplicated example code merge and cleanup

2016-03-07 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-13461:
--
Description: 
Merge duplicated code after we finish the example code substitution.

Duplications include:

* JavaTrainValidationSplitExample 

* TrainValidationSplitExample

* The random data generation example in mllib-statistics.md needs the "-" removed from each line.

* Others can be added here ...

  was:
Merge duplicated code after we finish the example code substitution.

Duplications include:

* JavaTrainValidationSplitExample 

* TrainValidationSplitExample

* Others can be added here ...


> Duplicated example code merge and cleanup
> -
>
> Key: SPARK-13461
> URL: https://issues.apache.org/jira/browse/SPARK-13461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Merge duplicated code after we finish the example code substitution.
> Duplications include:
> * JavaTrainValidationSplitExample 
> * TrainValidationSplitExample
> * The random data generation example in mllib-statistics.md needs the "-" 
> removed from each line.
> * Others can be added here ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-05 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181872#comment-15181872
 ] 

Xusen Yin commented on SPARK-13641:
---

You can check out the code from https://github.com/apache/spark/pull/11486.

Run ./bin/sparkR with this [test 
example|https://github.com/yinxusen/spark/blob/SPARK-13449/R/pkg/inst/tests/testthat/test_mllib.R#L145].
 With summary(model) you can see that the column names are not the original ones.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which causes the output attributes to differ from the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  appends suffixes to the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


