[jira] [Created] (SPARK-17785) Find a more robust way to detect the existence of the initialModel
Xusen Yin created SPARK-17785: - Summary: Find a more robust way to detect the existence of the initialModel Key: SPARK-17785 URL: https://issues.apache.org/jira/browse/SPARK-17785 Project: Spark Issue Type: Improvement Components: ML Reporter: Xusen Yin Priority: Minor Currently, we use initialModelFlag to check whether an estimator has an initial model. Figure out a more robust way to detect the existence of the initialModel. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
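The fragility of a separate flag can be sketched in a few lines of Python (an illustrative sketch only, not Spark's actual API; all names here are hypothetical): a boolean kept alongside the model can silently drift out of sync, while deriving presence from the field itself cannot.

```python
from typing import Optional

class Estimator:
    """Hypothetical sketch: contrast a manually maintained flag with a
    presence check derived from the initial-model field itself."""

    def __init__(self, initial_model: Optional[object] = None):
        self.initial_model = initial_model
        # Fragile approach: a separate flag that must be kept in sync by hand.
        self.initial_model_flag = initial_model is not None

    @property
    def has_initial_model(self) -> bool:
        # Robust approach: single source of truth, cannot drift.
        return self.initial_model is not None

est = Estimator()
est.initial_model = "model"    # caller sets the model but forgets the flag
print(est.initial_model_flag)  # False -- the flag has drifted
print(est.has_initial_model)   # True  -- derived check stays correct
```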
[jira] [Created] (SPARK-17784) Add fromCenters method for KMeans
Xusen Yin created SPARK-17784: - Summary: Add fromCenters method for KMeans Key: SPARK-17784 URL: https://issues.apache.org/jira/browse/SPARK-17784 Project: Spark Issue Type: Improvement Components: ML Reporter: Xusen Yin Priority: Minor Add a new factory method fromCenters(centers: Array[Vector]) for KMeans.
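A factory of this shape could look like the following Python sketch (hypothetical, not the actual Spark ML API): k is derived from the supplied centers, and the centers are validated up front.

```python
class KMeans:
    """Minimal stand-in for a KMeans estimator, to illustrate the proposed
    fromCenters factory; names and behavior are assumptions."""

    def __init__(self, k: int, initial_centers=None):
        self.k = k
        self.initial_centers = initial_centers

    @classmethod
    def from_centers(cls, centers):
        # Derive k from the supplied centers and validate their shapes.
        if not centers:
            raise ValueError("centers must be non-empty")
        dim = len(centers[0])
        if any(len(c) != dim for c in centers):
            raise ValueError("all centers must have the same dimension")
        return cls(k=len(centers), initial_centers=centers)

km = KMeans.from_centers([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
print(km.k)  # 3
```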
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434030#comment-15434030 ] Xusen Yin commented on SPARK-16581: --- Sure, no problem. > Making JVM backend calling functions public > --- > > Key: SPARK-16581 > URL: https://issues.apache.org/jira/browse/SPARK-16581 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > As described in the design doc in SPARK-15799, to help packages that need to > call into the JVM, it will be good to expose some of the R -> JVM functions > we have. > As a part of this we could also rename, reformat the functions to make them > more user friendly.
[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428557#comment-15428557 ] Xusen Yin commented on SPARK-14381: --- I believe we can resolve this. > Review spark.ml parity for feature transformers > --- > > Key: SPARK-14381 > URL: https://issues.apache.org/jira/browse/SPARK-14381 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself.
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425792#comment-15425792 ] Xusen Yin commented on SPARK-16581: --- I'll find related JIRAs and link them if possible.
[jira] [Commented] (SPARK-16581) Making JVM backend calling functions public
[ https://issues.apache.org/jira/browse/SPARK-16581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15425775#comment-15425775 ] Xusen Yin commented on SPARK-16581: --- [~shivaram] [~sunrui] Are you still working on this? I can help work on it if it's available.
[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405069#comment-15405069 ] Xusen Yin commented on SPARK-16857: --- I agree that the cluster assignments could be arbitrary. Under this condition, we shouldn't use MulticlassClassificationEvaluator to evaluate the result. > CrossValidator and KMeans throws IllegalArgumentException > - > > Key: SPARK-16857 > URL: https://issues.apache.org/jira/browse/SPARK-16857 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 > Environment: spark-jobserver docker image. Spark 1.6.1 on ubuntu, > Hadoop 2.4 >Reporter: Ryan Claussen > > I am attempting to use CrossValidation to train KMeans model. When I attempt > to fit the data spark throws an IllegalArgumentException as below since the > KMeans algorithm outputs an Integer into the prediction column instead of a > Double. Before I go too far: is using CrossValidation with Kmeans > supported? > Here's the exception: > {quote} > java.lang.IllegalArgumentException: requirement failed: Column prediction > must be of type DoubleType but was actually IntegerType.
> at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109) > at > org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62) > at > com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39) > at > spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) > at > scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > Here is the code I'm using to set up my cross validator. As the stack trace > above indicates it is failing at the fit step when > {quote} > ... 
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures") > val labelConverter = new > IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels) > val pipeline = new Pipeline().setStages(Array(labelIndexer, > featureIndexer, mpc, labelConverter)) > val evaluator = new > MulticlassClassificationEvaluator().setLabelCol("approvedIndex").setPredictionCol("prediction") > val paramGrid = new ParamGridBuilder().addGrid(mpc.maxIter, Array(100, > 200, 500)).build() > val cv = new > CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3) > val cvModel = cv.fit(trainingData) > {quote}
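The point about arbitrary cluster assignments can be made concrete with a small, Spark-free Python sketch: the same partition of the data under a different cluster numbering scores 0% on naive accuracy, while a permutation-invariant comparison correctly reports perfect agreement.

```python
from itertools import permutations

def accuracy(labels, preds):
    # Naive label-by-label agreement, the view a classification
    # evaluator takes of clustering output.
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def best_permuted_accuracy(labels, preds, k):
    # Permutation-invariant agreement: try every relabeling of the
    # k cluster ids and keep the best match.
    return max(
        accuracy(labels, [perm[p] for p in preds])
        for perm in permutations(range(k))
    )

labels = [0, 0, 1, 1]
preds  = [1, 1, 0, 0]  # identical partition, cluster ids swapped
print(accuracy(labels, preds))                    # 0.0
print(best_permuted_accuracy(labels, preds, 2))   # 1.0
```

This is why a classification metric applied directly to KMeans output can be misleading even when the clustering is perfect.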
[jira] [Commented] (SPARK-16857) CrossValidator and KMeans throws IllegalArgumentException
[ https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405050#comment-15405050 ] Xusen Yin commented on SPARK-16857: --- Using CrossValidator with KMeans should be supported. As a kind of external evaluation for KMeans, I think using MulticlassClassificationEvaluator with KMeans should also be supported. Why not send a PR, since it would be a quick fix? CC [~yanboliang]
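One plausible quick fix (an assumption; the comment above does not spell it out) is to cast the integer prediction column to double before it reaches the schema-strict evaluator. A toy Python model of the schema check:

```python
def evaluate(predictions):
    """Toy stand-in for a schema-strict evaluator: requires float values,
    mirroring the DoubleType check that rejects KMeans' integer output."""
    if not all(isinstance(p, float) for p in predictions):
        raise ValueError("Column prediction must be of type DoubleType "
                         "but was actually IntegerType.")
    return sum(predictions) / len(predictions)

kmeans_output = [0, 1, 1, 0]  # integer cluster ids, as KMeans produces
try:
    evaluate(kmeans_output)
except ValueError as e:
    print("rejected:", e)

# The fix: cast the prediction column to double before evaluating.
fixed = evaluate([float(p) for p in kmeans_output])
print(fixed)  # 0.5
```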
[jira] [Comment Edited] (SPARK-3728) RandomForest: Learn models too large to store in memory
[ https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15381583#comment-15381583 ] Xusen Yin edited comment on SPARK-3728 at 7/17/16 11:46 PM: Not now. Because I thought the BFS style could reach the best parallelism, while the DFS may hurt parallelism. And IMHO the BFS-style training is not the root cause of the out-of-memory during the training phase of RandomForest. Do you have any suggestions on this? > RandomForest: Learn models too large to store in memory > --- > > Key: SPARK-3728 > URL: https://issues.apache.org/jira/browse/SPARK-3728 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > Proposal: Write trees to disk as they are learned. > RandomForest currently uses a FIFO queue, which means training all trees at > once via breadth-first search. Using a FILO queue would encourage the code > to finish one tree before moving on to new ones. This would allow the code > to write trees to disk as they are learned. > Note: It would also be possible to write nodes to disk as they are learned > using a FIFO queue, once the example--node mapping is cached [JIRA]. The > [Sequoia Forest package]() does this. However, it could be useful to learn > trees progressively, so that future functionality such as early stopping > (training fewer trees than expected) could be supported.
[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory
[ https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15381583#comment-15381583 ] Xusen Yin commented on SPARK-3728: -- Not now. Because I thought the BFS style could reach the best parallelism, while the DFS may harm the parallel ability. And IMHO the BFS style training is not the root cause of out-of-memory during the training phase of RandomForest. Do you have any suggestions on this?
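The FIFO-vs-FILO distinction in the proposal can be simulated in a few lines of Python (a toy model: each queue entry stands for one whole layer of one tree, not individual nodes). FIFO interleaves all trees layer by layer, so no tree finishes until the end; FILO finishes one tree before starting the next, which is what would let finished trees be written to disk early.

```python
from collections import deque

def train_order(num_trees, depth, lifo):
    """Return the order in which (tree, layer) work items are processed."""
    queue = deque((t, 0) for t in range(num_trees))
    order = []
    while queue:
        tree, layer = queue.pop() if lifo else queue.popleft()
        order.append((tree, layer))
        if layer + 1 < depth:
            queue.append((tree, layer + 1))  # next layer of the same tree
    return order

print(train_order(2, 3, lifo=False))
# FIFO: layers interleave across trees (BFS over the whole forest)
print(train_order(2, 3, lifo=True))
# FILO: all layers of one tree complete before the other tree starts
```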
[jira] [Created] (SPARK-16558) examples/mllib/LDAExample should use MLVector instead of MLlib Vector
Xusen Yin created SPARK-16558: - Summary: examples/mllib/LDAExample should use MLVector instead of MLlib Vector Key: SPARK-16558 URL: https://issues.apache.org/jira/browse/SPARK-16558 Project: Spark Issue Type: Bug Components: Examples, MLlib Reporter: Xusen Yin Priority: Minor mllib.LDAExample uses the ML pipeline and the MLlib LDA algorithm. The former transforms the original data into the ML Vector format, while the latter uses the MLlib Vector format.
[jira] [Commented] (SPARK-16447) LDA wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368149#comment-15368149 ] Xusen Yin commented on SPARK-16447: --- [~mengxr] I'd like to work on this. > LDA wrapper in SparkR > - > > Key: SPARK-16447 > URL: https://issues.apache.org/jira/browse/SPARK-16447 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng > > Follow instructions in SPARK-16442 and implement LDA wrapper in SparkR.
[jira] [Updated] (SPARK-16372) Retag RDD to tallSkinnyQR of RowMatrix
[ https://issues.apache.org/jira/browse/SPARK-16372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-16372: -- Summary: Retag RDD to tallSkinnyQR of RowMatrix (was: RowMatrix constructor should use retag for Java compatibility) > Retag RDD to tallSkinnyQR of RowMatrix > -- > > Key: SPARK-16372 > URL: https://issues.apache.org/jira/browse/SPARK-16372 > Project: Spark > Issue Type: Bug > Components: MLlib >Reporter: Xusen Yin >Priority: Minor > > The following Java code fails because of type erasure: > {code} > JavaRDD<Vector> rows = jsc.parallelize(...); > RowMatrix mat = new RowMatrix(rows.rdd()); > QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true); > {code} > We should use retag to restore the type to prevent the following exception: > {code} > java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to > [Lorg.apache.spark.mllib.linalg.Vector; > {code}
[jira] [Created] (SPARK-16372) RowMatrix constructor should use retag for Java compatibility
Xusen Yin created SPARK-16372: - Summary: RowMatrix constructor should use retag for Java compatibility Key: SPARK-16372 URL: https://issues.apache.org/jira/browse/SPARK-16372 Project: Spark Issue Type: Bug Components: MLlib Reporter: Xusen Yin Priority: Minor The following Java code fails because of type erasure: {code} JavaRDD<Vector> rows = jsc.parallelize(...); RowMatrix mat = new RowMatrix(rows.rdd()); QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true); {code} We should use retag to restore the type to prevent the following exception: {code} java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector; {code}
[jira] [Commented] (SPARK-16372) RowMatrix constructor should use retag for Java compatibility
[ https://issues.apache.org/jira/browse/SPARK-16372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361822#comment-15361822 ] Xusen Yin commented on SPARK-16372: --- SPARK-11497 fixed this for PySpark.
[jira] [Created] (SPARK-16369) tallSkinnyQR of RowMatrix should be aware of empty partitions
Xusen Yin created SPARK-16369: - Summary: tallSkinnyQR of RowMatrix should be aware of empty partitions Key: SPARK-16369 URL: https://issues.apache.org/jira/browse/SPARK-16369 Project: Spark Issue Type: Bug Components: MLlib Reporter: Xusen Yin Priority: Minor tallSkinnyQR of RowMatrix should be aware of empty partitions, which can cause an exception from the Breeze qr decomposition. See the [archived dev mail|https://mail-archives.apache.org/mod_mbox/spark-dev/201510.mbox/%3ccaf7adnrycvpl3qx-vzjhq4oymiuuhoscut_tkom63cm18ik...@mail.gmail.com%3E] for more details.
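The failure mode and the guard can be sketched without Spark (a toy Python model: the per-partition QR step is replaced by a Gram-matrix computation that, like the Breeze qr call, cannot handle zero rows). Filtering out empty partitions before the local step avoids the crash while leaving the aggregate unchanged.

```python
def local_gram(rows):
    """Toy per-partition step standing in for a local factorization:
    fails on an empty partition, like the reported Breeze qr call."""
    if not rows:
        raise ValueError("empty partition")
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)]
            for i in range(n)]

partitions = [[[1.0, 0.0]], [], [[0.0, 2.0]]]  # middle partition is empty

# Naive mapping over every partition would crash on the empty one;
# filtering first sidesteps it without changing the result.
grams = [local_gram(p) for p in partitions if p]
total = [[sum(g[i][j] for g in grams) for j in range(2)] for i in range(2)]
print(total)  # [[1.0, 0.0], [0.0, 4.0]]
```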
[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
[ https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351942#comment-15351942 ] Xusen Yin commented on SPARK-16144: --- I'd like to work on this. > Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict > - > > Key: SPARK-16144 > URL: https://issues.apache.org/jira/browse/SPARK-16144 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > After we grouped generic methods by the algorithm, it would be nice to add a > separate Rd for each ML generic method, in particular, write.ml, read.ml, > summary, and predict, and link the implementations with seealso.
[jira] [Commented] (SPARK-15574) Python meta-algorithms in Scala
[ https://issues.apache.org/jira/browse/SPARK-15574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332475#comment-15332475 ] Xusen Yin commented on SPARK-15574: --- I just finished the prototype of PythonTransformer in Scala as the transformer wrapper of pure-Python transformers. It works well if I run it alone from the Scala side. But if I chain the PythonTransformer with other transformers/estimators in a Pipeline, it fails for lack of transformSchema on the Python side. AFAIK, we need to add transformSchema in Python ML for pure-Python PipelineStages. [~josephkb] [~mengxr] > Python meta-algorithms in Scala > --- > > Key: SPARK-15574 > URL: https://issues.apache.org/jira/browse/SPARK-15574 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Joseph K. Bradley > > This is an experimental idea for implementing Python ML meta-algorithms > (CrossValidator, TrainValidationSplit, Pipeline, OneVsRest, etc.) in Scala. > This would require a Scala wrapper for algorithms implemented in Python, > somewhat analogous to Python UDFs. > The benefit of this change would be that we could avoid currently awkward > conversions between Scala/Python meta-algorithms required for persistence. > It would let us have full support for Python persistence and would generally > simplify the implementation within MLlib.
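Why a missing transformSchema breaks chaining can be shown with a minimal Python sketch (illustrative only; the real Spark classes differ). A pipeline validates itself by threading the schema through every stage before any data is transformed, so one stage without the method fails the whole chain.

```python
class Stage:
    def transform_schema(self, schema):
        raise NotImplementedError

class Tokenizer(Stage):
    def transform_schema(self, schema):
        # Declares its output column without touching any data.
        return schema | {"tokens"}

class PurePythonStage(Stage):
    pass  # forgot to implement transform_schema: the failure described above

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, schema):
        # Chain transformSchema through every stage up front.
        for stage in self.stages:
            schema = stage.transform_schema(schema)
        return schema

print(Pipeline([Tokenizer()]).fit({"text"}))  # works: schema propagates
try:
    Pipeline([Tokenizer(), PurePythonStage()]).fit({"text"})
except NotImplementedError:
    print("chaining fails: a stage lacks transform_schema")
```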
[jira] [Commented] (SPARK-11106) Should ML Models contain single models or Pipelines?
[ https://issues.apache.org/jira/browse/SPARK-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15319685#comment-15319685 ] Xusen Yin commented on SPARK-11106: --- RFormula is easy to use, but it may not always do the right thing. For example, RFormula indexes categorical features with OneHotEncoder, but in some scenarios (like RandomForest) a StringIndexer is better. > Should ML Models contain single models or Pipelines? > - > > Key: SPARK-11106 > URL: https://issues.apache.org/jira/browse/SPARK-11106 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > This JIRA is for discussing whether ML Estimators should do feature > processing. > h2. Issue > Currently, almost all ML Estimators require strict input types. E.g., > DecisionTreeClassifier requires that the label column be Double type and have > metadata indicating the number of classes. > This requires users to know how to preprocess data. > h2. Ideal workflow > A user should be able to pass any reasonable data to a Transformer or > Estimator and have it "do the right thing." > E.g.: > * If DecisionTreeClassifier is given a String column for labels, it should > know to index the Strings. > * See [SPARK-10513] for a similar issue with OneHotEncoder. > h2. Possible solutions > There are a few solutions I have thought of. Please comment with feedback or > alternative ideas! > h3. Leave as is > Pro: The current setup is good in that it forces the user to be very aware of > what they are doing. Feature transformations will not happen silently. > Con: The user has to write boilerplate code for transformations. The API is > not what some users would expect; e.g., coming from R, a user might expect > some automatic transformations. > h3. All Transformers can contain PipelineModels > We could allow all Transformers and Models to contain arbitrary > PipelineModels. 
E.g., if a DecisionTreeClassifier were given a String label > column, it might return a Model which contains a simple fitted PipelineModel > containing StringIndexer + DecisionTreeClassificationModel. > The API could present this to the user, or it could be hidden from the user. > Ideally, it would be hidden from the beginner user, but accessible for > experts. > The main problem is that we might have to break APIs. E.g., OneHotEncoder > may need to do indexing if given a String input column. This means it should > no longer be a Transformer; it should be an Estimator. > h3. All Estimators should use RFormula > The best option I have thought of is to make RFormula be the primary method > for automatic feature transformation. We could start adding an RFormula > Param to all Estimators, and it could handle most of these feature > transformation issues. > We could maintain old APIs: > * If a user sets the input column names, then those can be used in the > traditional (no automatic transformation) way. > * If a user sets the RFormula Param, then it can be used instead. (This > should probably take precedence over the old API.)
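The StringIndexer-vs-OneHotEncoder trade-off from the comment above can be illustrated with a small Python sketch (simplified stand-ins, not the Spark implementations): integer indexing keeps a categorical feature as one column a tree can split on as a unit, while one-hot encoding scatters it into as many columns as there are categories.

```python
def string_index(column):
    # StringIndexer-style: one output column of category indices.
    mapping = {v: i for i, v in enumerate(dict.fromkeys(column))}
    return [mapping[v] for v in column]

def one_hot(column):
    # OneHotEncoder-style: one output column per distinct category.
    cats = list(dict.fromkeys(column))
    return [[1 if v == c else 0 for c in cats] for v in column]

col = ["red", "green", "blue", "red"]
print(string_index(col))  # [0, 1, 2, 0] -- still a single column
print(one_hot(col))       # width equals the number of categories
```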
[jira] [Commented] (SPARK-15574) Python meta-algorithms in Scala
[ https://issues.apache.org/jira/browse/SPARK-15574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15317503#comment-15317503 ] Xusen Yin commented on SPARK-15574: --- [~josephkb] Can I work on this one?
[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15317459#comment-15317459 ] Xusen Yin commented on SPARK-14381: --- Comparing mllib.feature with ml.feature, only two APIs are missing from ml: 1. HashingTF should have setAlgorithm. However, this is intentional according to JIRA: https://issues.apache.org/jira/browse/SPARK-14899 2. Word2vec should have maxSentenceLength. I created a new JIRA: https://issues.apache.org/jira/browse/SPARK-15793
[jira] [Created] (SPARK-15793) Word2vec in ML package should have maxSentenceLength method
Xusen Yin created SPARK-15793: - Summary: Word2vec in ML package should have maxSentenceLength method Key: SPARK-15793 URL: https://issues.apache.org/jira/browse/SPARK-15793 Project: Spark Issue Type: Improvement Components: ML Reporter: Xusen Yin Priority: Minor Word2vec in ML package should have maxSentenceLength method for feature parity.
[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15315060#comment-15315060 ] Xusen Yin commented on SPARK-14381: --- I can work on this one.
[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory
[ https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15314795#comment-15314795 ] Xusen Yin commented on SPARK-3728: -- Hi [~josephkb], as I [surveyed on H2O|https://issues.apache.org/jira/browse/SPARK-13868?focusedCommentId=15313400=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15313400], it trains models in a tree-by-tree style. Can I work on this one?
[jira] [Commented] (SPARK-13868) Random forest accuracy exploration
[ https://issues.apache.org/jira/browse/SPARK-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15313400#comment-15313400 ] Xusen Yin commented on SPARK-13868: --- [~josephkb] [~tanwanirahul] Here is what I found: 1. Dataset preprocessing In this dataset, all columns except DepTime and Distance are categorical features. The easiest way to transform the data into LabeledPoint style is RFormula. However, RFormula is not suitable here because it changes the shape of the dataset compared with the original one. RFormula uses a one-hot encoder, so it expands the original dataset into thousands of columns. This brings two drawbacks: a. The volume of the dataset is expanded, which may hurt performance. b. The one-hot encoder splits one column into as many new columns as its cardinality, while Random Forest cannot take groups of features into consideration, which may hurt accuracy. RFormula also treats DepTime and Distance as categorical features, which brings more unnecessary new columns and reduces accuracy a step further, because DepTime and Distance are the two most important features for this task. By contrast, H2O uses the original dataset without further preprocessing. 2. Spark RandomForest can also get a good result In my experiment, Spark RF with 10 trees, 20 maxDepth, and 1m training data gets AUC 0.744321364. In the same setting, H2O gets AUC 0.695598. For detailed results, see https://docs.google.com/document/d/1l7SGFtUkZeM4WEXFlpc08pfBfnu6d25KQFToFHC6CTo/edit?usp=sharing Note that those "NA"s mean Spark got OOM on my laptop. 3. OOM of Spark Random Forest In a single-machine environment, Spark RF is slower than H2O. What's worse, OOM frequently occurs on Spark with larger bins, larger trees, and larger maxDepth. The reason is that Spark creates new Double arrays quite often inside each partition. Say in one partition of our dataset, Spark creates numNodes Double arrays, each of length numFeatures * numBins * statsSize.
If we use a single machine with 16 partitions, we may allocate O(numPartitions * numNodes * numFeatures * numBins * statsSize) Doubles in total. And I can see from my experiment that the parameter maxMemoryInMB is barely useful. It would be better to use multiple servers and spread out those tasks. Spark trains the random forest in a BFS mode, i.e. the 1st layer of all trees, then the 2nd layer of all trees, while H2O goes tree by tree, and inside each tree it trains layer by layer. H2O also uses smaller arrays than Spark to collect histograms. It uses Java Fork/Join to split tasks, and inside each task it generates Double arrays of size numNodes * numFeatures * numBins, then merges them inside a shared DHistogram in each process. (I am not quite sure about the process since the DRF code in H2O is more complicated than Spark's and has no comments.) Besides, H2O also has a MemoryManager to allocate arrays and avoid OOM as long as possible. However, H2O also crashed with OOM once on my laptop when I was training 500 trees with 20 maxDepth on the 10m dataset. > Random forest accuracy exploration > -- > > Key: SPARK-13868 > URL: https://issues.apache.org/jira/browse/SPARK-13868 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This is a JIRA for exploring accuracy improvements for Random Forests. > h2. Background > Initial exploration was based on reports of poor accuracy from > [http://datascience.la/benchmarking-random-forest-implementations/] > Essentially, Spark 1.2 showed poor performance relative to other libraries > for training set sizes of 1M and 10M. > h3. Initial improvements > The biggest issue was that the metric being used was AUC and Spark 1.2 was > using hard predictions, not class probabilities. This was fixed in > [SPARK-9528], and that brought Spark up to performance parity with > scikit-learn, Vowpal Wabbit, and R for the training set size of 1M. > h3.
Remaining issues > For training set size 10M, Spark does not yet match the AUC of the other 2 > libraries benchmarked (H2O and xgboost). > Note that, on 1M instances, these 2 libraries also show better results than > scikit-learn, VW, and R. I'm not too familiar with the H2O implementation > and how it differs, but xgboost is a very different algorithm, so it's not > surprising it has different behavior. > h2. My explorations > I've run Spark on the test set of 10M instances. (Note that the benchmark > linked above used somewhat different settings for the different algorithms, > but those settings are actually not that important for this problem. This > included gini vs. entropy impurity and limits on splitting nodes.) > I've tried adjusting: > * maxDepth: Past depth 20, going deeper does not seem to matter > *
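The O(numPartitions * numNodes * numFeatures * numBins * statsSize) estimate in the comment above is easy to check with a quick back-of-the-envelope calculation. The sizes below are hypothetical, chosen only to illustrate the formula, not taken from the benchmark:

```python
def histogram_doubles(num_partitions, num_nodes, num_features, num_bins, stats_size):
    """Worst-case number of Doubles if every partition materializes a
    per-node histogram, per the O(...) estimate quoted above."""
    return num_partitions * num_nodes * num_features * num_bins * stats_size

# Hypothetical sizes: 16 partitions, 128 active nodes, 8 features,
# 32 bins, binary classification (statsSize = 2).
doubles = histogram_doubles(16, 128, 8, 32, 2)
bytes_total = doubles * 8  # a JVM Double is 8 bytes
print(doubles, bytes_total)
```

Since numNodes grows exponentially with tree depth, this product blows up quickly at larger maxDepth, which is consistent with the OOMs reported above.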
[jira] [Updated] (SPARK-14973) The CrossValidator and TrainValidationSplit miss the seed when saving and loading
[ https://issues.apache.org/jira/browse/SPARK-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14973: -- Description: The CrossValidator and TrainValidationSplit miss the seed when saving and loading. Need to fix both Spark side code and test suite. (was: The CrossValidator and TrainValidationSplit miss the seed when saving and loading. Need to fix both Spark side code and test suite, plus PySpark side code and test suite.) > The CrossValidator and TrainValidationSplit miss the seed when saving and > loading > - > > Key: SPARK-14973 > URL: https://issues.apache.org/jira/browse/SPARK-14973 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Xusen Yin > > The CrossValidator and TrainValidationSplit miss the seed when saving and > loading. Need to fix both Spark side code and test suite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin resolved SPARK-14302. --- Resolution: Won't Fix > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning those code, be sure not disturb the previous > example on and off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266093#comment-15266093 ] Xusen Yin commented on SPARK-14302: --- I'll close it; if there's anything else, I'll let you know. Thanks! > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning those code, be sure not disturb the previous > example on and off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15266006#comment-15266006 ] Xusen Yin commented on SPARK-14302: --- [~kanjilal] Thanks for working on this. However, I checked the duplicated examples again and found that we should not delete all of them. As depicted below: * python/ml ** None * Unsure duplications, double check ** dataframe_example.py --> serves as an example of DataFrame usage. ** kmeans_example.py --> serves as an application ** simple_params_example.py --> serves as an example of params usage. ** simple_text_classification_pipeline.py --> serves as an application. * python/mllib ** gaussian_mixture_model.py --> serves as an application. ** kmeans.py --> ditto ** logistic_regression.py --> ditto * Unsure duplications, double check ** correlations.py --> ditto ** random_rdd_generation.py --> ditto ** sampled_rdds.py --> ditto ** word2vec.py --> ditto So I think we can close this JIRA as won't fix. What do you think? > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning those code, be sure not disturb the previous > example on and off blocks. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262828#comment-15262828 ] Xusen Yin commented on SPARK-14302: --- Thanks! And sorry for the late response, I forgot it. > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning those code, be sure not disturb the previous > example on and off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262810#comment-15262810 ] Xusen Yin commented on SPARK-14302: --- We should leave them unmerged, e.g. ml.bisecting_k_means_example and mllib.bisecting_k_means_example. Even though they are similar, each serves a different purpose, i.e. each is used for different documentation files. This JIRA aims to merge duplicated code inside examples/python/ml and examples/python/mllib, but not between the two. For example, we have python/mllib/gaussian_mixture_model.py, which duplicates python/mllib/gaussian_mixture_example.py. The latter has $example on$ and $example off$ blocks in it, which means it serves as part of the documentation. So we should delete the former and keep the latter. However, according to https://github.com/apache/spark/pull/12092#issuecomment-204276885, we should leave example code with command-line parameters untouched, so we should keep python/mllib/gaussian_mixture_model.py after all. > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning those code, be sure not disturb the previous > example on and off blocks. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262794#comment-15262794 ] Xusen Yin commented on SPARK-14302: --- Hi Saikat, any updates? > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning those code, be sure not disturb the previous > example on and off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14973) The CrossValidator and TrainValidationSplit miss the seed when saving and loading
[ https://issues.apache.org/jira/browse/SPARK-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261608#comment-15261608 ] Xusen Yin commented on SPARK-14973: --- Will fix it with SPARK-14706 > The CrossValidator and TrainValidationSplit miss the seed when saving and > loading > - > > Key: SPARK-14973 > URL: https://issues.apache.org/jira/browse/SPARK-14973 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Xusen Yin > > The CrossValidator and TrainValidationSplit miss the seed when saving and > loading. Need to fix both Spark side code and test suite, plus PySpark side > code and test suite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14973) The CrossValidator and TrainValidationSplit miss the seed when saving and loading
Xusen Yin created SPARK-14973: - Summary: The CrossValidator and TrainValidationSplit miss the seed when saving and loading Key: SPARK-14973 URL: https://issues.apache.org/jira/browse/SPARK-14973 Project: Spark Issue Type: Bug Components: ML, PySpark Reporter: Xusen Yin The CrossValidator and TrainValidationSplit miss the seed when saving and loading. Need to fix both Spark side code and test suite, plus PySpark side code and test suite. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
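The bug pattern above amounts to a round trip through a persistence layer whose param whitelist omits the seed. The sketch below is a minimal stand-in using plain JSON and hypothetical save_params/load_params helpers, not PySpark's actual MLWriter/MLReader:

```python
import json
import os
import tempfile

def save_params(params, path, keep):
    # Hypothetical writer: only whitelisted params are serialized. Leaving
    # "seed" out of `keep` reproduces the class of bug described above.
    with open(path, "w") as f:
        json.dump({k: v for k, v in params.items() if k in keep}, f)

def load_params(path):
    with open(path) as f:
        return json.load(f)

fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)

params = {"numFolds": 3, "seed": 42}
save_params(params, path, keep=["numFolds"])          # buggy: seed dropped
assert "seed" not in load_params(path)

save_params(params, path, keep=["numFolds", "seed"])  # fixed: seed survives
assert load_params(path)["seed"] == 42
```

The corresponding test-suite fix is the same round trip: save, load, and assert that the loaded estimator's seed equals the original.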
[jira] [Created] (SPARK-14931) Mismatched default values between pipelines in Spark and PySpark
Xusen Yin created SPARK-14931: - Summary: Mismatched default values between pipelines in Spark and PySpark Key: SPARK-14931 URL: https://issues.apache.org/jira/browse/SPARK-14931 Project: Spark Issue Type: Bug Reporter: Xusen Yin Mismatched default values between pipelines in Spark and PySpark lead to different pipelines in PySpark after saving and loading. Find generic ways to check JavaParams then fix them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14924) OneVsRest with classifier in estimatorParamMaps of tuning fails to persist
Xusen Yin created SPARK-14924: - Summary: OneVsRest with classifier in estimatorParamMaps of tuning fails to persist Key: SPARK-14924 URL: https://issues.apache.org/jira/browse/SPARK-14924 Project: Spark Issue Type: Bug Components: ML, PySpark Reporter: Xusen Yin {code} ovr = OneVsRest() epms = [{ovr.classifier: }, {ovr.classifier: xxx}] cv = CrossValidator(estimator=ovr, estimatorParamMaps=epms, ...) cv.load() {code} fails because the classifier cannot be serialized via JSON. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
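The failure mode is reproducible outside Spark with plain json: a param map whose value is an estimator object is not JSON-encodable. Classifier below is a hypothetical stand-in for the real estimator class, not pyspark.ml code:

```python
import json

class Classifier:
    """Hypothetical stand-in for an ML estimator object."""

param_map = {"classifier": Classifier()}  # value is an object, not a JSON type

try:
    json.dumps(param_map)
    serializable = True
except TypeError:
    # Plain JSON encoding cannot represent the estimator; persistence would
    # need a special case (e.g. save the estimator separately, store its path).
    serializable = False
```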
[jira] [Commented] (SPARK-11337) Make example code in user guide testable
[ https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256751#comment-15256751 ] Xusen Yin commented on SPARK-11337: --- [~mengxr] We can close this now. > Make example code in user guide testable > > > Key: SPARK-11337 > URL: https://issues.apache.org/jira/browse/SPARK-11337 > Project: Spark > Issue Type: Umbrella > Components: Documentation >Reporter: Xiangrui Meng >Assignee: Xusen Yin >Priority: Critical > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > One option I propose is to move actual example code to spark/examples and > test compilation in Jenkins builds. Then in the markdown, we can reference > part of the code to show in the user guide. This requires adding a Jekyll tag > that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code} > {% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %} > {code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` > and pick code blocks marked "example" and put them under `{% highlight %}` in > the markdown. We can discuss the syntax for marker comments. > Sub-tasks are created to move example code from user guide to `examples/`. > *self-check list for contributors in this JIRA* > * Be sure to match Scala/Java/Python code style guide. If unsure of a code > style, please refer to other merged example code under examples/. > * Remove useless imports > * It's better to have a side-effect operation at the end of each example > code, usually it's a {code}print(...){code} > * Make sure the code example is runnable without error. 
> * After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll > serve{code} to check the webpage at http://127.0.0.1:4000 to see whether the > generated html looks good. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
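The marker mechanism described above can be sketched as a small filter that keeps only the lines between the on/off marker comments. This is an illustration of the idea, not the actual include_example Jekyll plugin (which is written in Ruby):

```python
def extract_example(source, label="example"):
    """Keep only lines between '$<label> on$' and '$<label> off$' markers."""
    keep, out = False, []
    for line in source.splitlines():
        if f"${label} off$" in line:
            keep = False
        elif f"${label} on$" in line:
            keep = True
        elif keep:
            out.append(line)
    return "\n".join(out)

snippet = """import sys
# $example on$
data = [1, 2, 3]
print(sum(data))
# $example off$
sys.exit(0)"""
```

Here extract_example(snippet) returns just the two lines inside the markers, which is roughly what ends up highlighted in the generated user-guide page.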
[jira] [Closed] (SPARK-11399) Include_example should support labels to cut out different parts in one example code
[ https://issues.apache.org/jira/browse/SPARK-11399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin closed SPARK-11399. - Resolution: Won't Fix > Include_example should support labels to cut out different parts in one > example code > > > Key: SPARK-11399 > URL: https://issues.apache.org/jira/browse/SPARK-11399 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin > > There are many small examples that do not need to create a single example > file. Take the MLlib datatype page – mllib-data-types.md – as an example, > code examples like creating vectors and matrices are trivial works. We can > merge them into one single vector/matrix creation example. Then we use labels > to distinguish each other, such as {% include_example .scala > vector_creation %}. > The "label way" is also useful in the dialog-style code example: > http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14706) Python ML persistence integration test
[ https://issues.apache.org/jira/browse/SPARK-14706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252435#comment-15252435 ] Xusen Yin commented on SPARK-14706: --- Sure. I'll take care of it. There are more issues with CrossValidator, TrainValidationSplit and OneVsRest, such as the missing implementation of _transfer_param_map_from/to_java(), which means they cannot be wrapped in another meta-estimator. I'm fixing them together. > Python ML persistence integration test > -- > > Key: SPARK-14706 > URL: https://issues.apache.org/jira/browse/SPARK-14706 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Reporter: Joseph K. Bradley > > Goal: extend integration test in {{ml/tests.py}}. > In the {{PersistenceTest}} suite, there is a method {{_compare_pipelines}}. > This issue includes: > * Extending {{_compare_pipelines}} to handle CrossValidator, > TrainValidationSplit, and OneVsRest > * Adding an integration test in PersistenceTest which includes nested > meta-algorithms. E.g.: {{Pipeline[ CrossValidator[ TrainValidationSplit[ > OneVsRest[ LogisticRegression ] ] ] ]}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
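Extending _compare_pipelines to nested meta-algorithms essentially needs a recursive structural comparison. The sketch below uses plain dicts as hypothetical stand-ins for estimator descriptions; the real test would compare the params of live PySpark objects instead:

```python
def compare_pipelines(a, b):
    """Recursively compare two estimator descriptions, descending into
    nested meta-estimators via their 'stages' entries."""
    if a["class"] != b["class"] or a.get("params", {}) != b.get("params", {}):
        return False
    stages_a, stages_b = a.get("stages", []), b.get("stages", [])
    return len(stages_a) == len(stages_b) and all(
        compare_pipelines(x, y) for x, y in zip(stages_a, stages_b))

# Nested meta-algorithms, e.g. Pipeline[ CrossValidator[ OneVsRest[ LR ] ] ]
lr = {"class": "LogisticRegression", "params": {"maxIter": 10}}
nested = {"class": "Pipeline", "stages": [
    {"class": "CrossValidator", "stages": [
        {"class": "OneVsRest", "stages": [lr]}]}]}
```

A save/load integration test would then assert compare_pipelines(original, loaded) at any nesting depth.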
[jira] [Commented] (SPARK-14706) Python ML persistence integration test
[ https://issues.apache.org/jira/browse/SPARK-14706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246950#comment-15246950 ] Xusen Yin commented on SPARK-14706: --- I am starting to write it. > Python ML persistence integration test > -- > > Key: SPARK-14706 > URL: https://issues.apache.org/jira/browse/SPARK-14706 > Project: Spark > Issue Type: Test > Components: ML, PySpark >Reporter: Joseph K. Bradley > > Goal: extend integration test in {{ml/tests.py}}. > In the {{PersistenceTest}} suite, there is a method {{_compare_pipelines}}. > This issue includes: > * Extending {{_compare_pipelines}} to handle CrossValidator, > TrainValidationSplit, and OneVsRest > * Adding an integration test in PersistenceTest which includes nested > meta-algorithms. E.g.: {{Pipeline[ CrossValidator[ TrainValidationSplit[ > OneVsRest[ LogisticRegression ] ] ] ]}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer
[ https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14440: -- Description: Since the PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader are just extended from JavaMLWriter and JavaMLReader without other modifications of attributes and methods, there is no need to keep them, just like what we did in the save/load of ml/tuning.py. was: Since the PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader are just extends from JavaMLWriter and JavaMLReader without other modifications of attributes and methods, there is no need to keep them, just like what we did in the save/load of ml/tuning.py. > Remove PySpark ml.pipeline's specific Reader and Writer > --- > > Key: SPARK-14440 > URL: https://issues.apache.org/jira/browse/SPARK-14440 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Xusen Yin >Priority: Trivial > > Since the > PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader > are just extended from JavaMLWriter and JavaMLReader without other > modifications of attributes and methods, there is no need to keep them, just > like what we did in the save/load of ml/tuning.py. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer
[ https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14440: -- Description: Since the PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader are just extends from JavaMLWriter and JavaMLReader without other modifications of attributes and methods, there is no need to keep them, just like what we did in the save/load of ml/tuning.py. was: Since the PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader are just extends from JavaMLWriter and JavaMLReader without other modifications of attributes and methods, there is no need to keep them, just like what we did in Remove * PipelineMLWriter * PipelineMLReader * PipelineModelMLWriter * PipelineModelMLReader and modify comments. > Remove PySpark ml.pipeline's specific Reader and Writer > --- > > Key: SPARK-14440 > URL: https://issues.apache.org/jira/browse/SPARK-14440 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Xusen Yin >Priority: Trivial > > Since the > PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader > are just extends from JavaMLWriter and JavaMLReader without other > modifications of attributes and methods, there is no need to keep them, just > like what we did in the save/load of ml/tuning.py. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer
[ https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14440: -- Description: Since the PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader are just extended from JavaMLWriter and JavaMLReader without other modifications of attributes and methods, there is no need to keep them, just like what we did in Remove * PipelineMLWriter * PipelineMLReader * PipelineModelMLWriter * PipelineModelMLReader and modify comments. was: Remove * PipelineMLWriter * PipelineMLReader * PipelineModelMLWriter * PipelineModelMLReader and modify comments. > Remove PySpark ml.pipeline's specific Reader and Writer > --- > > Key: SPARK-14440 > URL: https://issues.apache.org/jira/browse/SPARK-14440 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Xusen Yin >Priority: Trivial > > Since the > PipelineMLWriter/PipelineMLReader/PipelineModelMLWriter/PipelineModelMLReader > are just extended from JavaMLWriter and JavaMLReader without other > modifications of attributes and methods, there is no need to keep them, just > like what we did in > Remove > * PipelineMLWriter > * PipelineMLReader > * PipelineModelMLWriter > * PipelineModelMLReader > and modify comments.
[jira] [Updated] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer
[ https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14440: -- Description: Remove * PipelineMLWriter * PipelineMLReader * PipelineModelMLWriter * PipelineModelMLReader and modify comments. was: Remove * PipelineMLWriter * PipelineMLReader * PipelineModelMLWriter * PipelineModelMLReader and modify comments. > Remove PySpark ml.pipeline's specific Reader and Writer > --- > > Key: SPARK-14440 > URL: https://issues.apache.org/jira/browse/SPARK-14440 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Xusen Yin >Priority: Trivial > > Remove > * PipelineMLWriter > * PipelineMLReader > * PipelineModelMLWriter > * PipelineModelMLReader > and modify comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer
[ https://issues.apache.org/jira/browse/SPARK-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242147#comment-15242147 ] Xusen Yin commented on SPARK-14440: --- Sorry for the late response; I'll update it soon. > Remove PySpark ml.pipeline's specific Reader and Writer > --- > > Key: SPARK-14440 > URL: https://issues.apache.org/jira/browse/SPARK-14440 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Xusen Yin >Priority: Trivial > > Remove > * PipelineMLWriter > * PipelineMLReader > * PipelineModelMLWriter > * PipelineModelMLReader > and modify comments.
[jira] [Commented] (SPARK-14306) PySpark ml.classification OneVsRest support export/import
[ https://issues.apache.org/jira/browse/SPARK-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239678#comment-15239678 ] Xusen Yin commented on SPARK-14306: --- Yes, but it's blocked by this PR: https://github.com/apache/spark/pull/12124 > PySpark ml.classification OneVsRest support export/import > - > > Key: SPARK-14306 > URL: https://issues.apache.org/jira/browse/SPARK-14306 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >
[jira] [Created] (SPARK-14440) Remove PySpark ml.pipeline's specific Reader and Writer
Xusen Yin created SPARK-14440: - Summary: Remove PySpark ml.pipeline's specific Reader and Writer Key: SPARK-14440 URL: https://issues.apache.org/jira/browse/SPARK-14440 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Xusen Yin Priority: Trivial Remove * PipelineMLWriter * PipelineMLReader * PipelineModelMLWriter * PipelineModelMLReader and modify comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14301) Java examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15228693#comment-15228693 ] Xusen Yin commented on SPARK-14301: --- Thanks, we'll make sure of that. :) > Java examples code merge and clean up > - > > Key: SPARK-14301 > URL: https://issues.apache.org/jira/browse/SPARK-14301 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in java/examples/mllib and java/examples/ml: > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * Unsure code duplications of java/ml, double check > ** JavaDeveloperApiExample.java > ** JavaSimpleParamsExample.java > ** JavaSimpleTextClassificationPipeline.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * Unsure code duplications of java/mllib, double check > ** JavaALS.java > ** JavaFPGrowthExample.java > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14299: -- Description: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample ** DeveloperApiExample.scala --> I delete it for now because it's only about how to create your own classifier, etc., which can be learned easily from other examples and ml codes. ** SimpleParamsExample.scala --> merge with LogisticRegressionSummaryExample.scala ** SimpleTextClassificationPipeline.scala --> ModelSelectionViaCrossValidationExample ** DataFrameExample.scala --> merge with LogisticRegressionSummaryExample.scala * Intend to keep, with command-line support: ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example. 
was: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample ** DeveloperApiExample.scala --> I delete it for now because it's only about how to create your own classifier, etc., which can be learned easily from other examples and ml codes. ** SimpleParamsExample.scala --> merge with LogisticRegressionSummaryExample.scala ** SimpleTextClassificationPipeline.scala --> ModelSelectionViaCrossValidationExample ** DataFrameExample.scala --> merge with LogisticRegressionSummaryExample.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example. > Scala ML examples code merge and clean up > - > > Key: SPARK-14299 > URL: https://issues.apache.org/jira/browse/SPARK-14299 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/ml: > * scala/ml > ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample > ** TrainValidationSplitExample.scala --> > ModelSelectionViaTrainValidationSplitExample > ** DeveloperApiExample.scala --> I delete it for now because it's only about > how to create your own classifier, etc., which can be learned easily from > other examples and ml codes. 
> ** SimpleParamsExample.scala --> merge with > LogisticRegressionSummaryExample.scala > ** SimpleTextClassificationPipeline.scala --> > ModelSelectionViaCrossValidationExample > ** DataFrameExample.scala --> merge with > LogisticRegressionSummaryExample.scala > * Intend to keep, with command-line support: > ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, > DecisionTreeClassificationExample > ** GBTExample.scala --> GradientBoostedTreeClassifierExample, > GradientBoostedTreeRegressorExample > ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample > ** LogisticRegressionExample.scala --> > LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample > ** RandomForestExample.scala --> RandomForestRegressorExample, > RandomForestClassifierExample > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. > I'll take this one as an example.
[jira] [Commented] (SPARK-14306) PySpark ml.classification OneVsRest support export/import
[ https://issues.apache.org/jira/browse/SPARK-14306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220519#comment-15220519 ] Xusen Yin commented on SPARK-14306: --- Starting work on it now. > PySpark ml.classification OneVsRest support export/import > - > > Key: SPARK-14306 > URL: https://issues.apache.org/jira/browse/SPARK-14306 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley >
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220337#comment-15220337 ] Xusen Yin commented on SPARK-14302: --- This JIRA only focuses on Python examples, i.e. spark/examples/src/main/python/ml and spark/examples/src/main/python/mllib. > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220324#comment-15220324 ] Xusen Yin commented on SPARK-14302: --- And this JIRA is to delete or merge some example code, not to compare code in python/examples/mllib and python/examples/ml. See https://github.com/apache/spark/pull/12092 as an example. > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220321#comment-15220321 ] Xusen Yin commented on SPARK-14302: --- Java code is in this JIRA: https://issues.apache.org/jira/browse/SPARK-14301 > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Commented] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220304#comment-15220304 ] Xusen Yin commented on SPARK-14302: --- Sure, thanks. > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Commented] (SPARK-14301) Java examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220242#comment-15220242 ] Xusen Yin commented on SPARK-14301: --- Go ahead. Thanks! > Java examples code merge and clean up > - > > Key: SPARK-14301 > URL: https://issues.apache.org/jira/browse/SPARK-14301 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in java/examples/mllib and java/examples/ml: > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * Unsure code duplications of java/ml, double check > ** JavaDeveloperApiExample.java > ** JavaSimpleParamsExample.java > ** JavaSimpleTextClassificationPipeline.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * Unsure code duplications of java/mllib, double check > ** JavaALS.java > ** JavaFPGrowthExample.java > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Closed] (SPARK-13462) Vector serialization error in example code of ModelSelectionViaTrainValidationSplitExample and JavaModelSelectionViaTrainValidationSplitExample
[ https://issues.apache.org/jira/browse/SPARK-13462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin closed SPARK-13462. - Resolution: Won't Fix > Vector serialization error in example code of > ModelSelectionViaTrainValidationSplitExample and > JavaModelSelectionViaTrainValidationSplitExample > --- > > Key: SPARK-13462 > URL: https://issues.apache.org/jira/browse/SPARK-13462 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Examples >Reporter: Xusen Yin >Priority: Minor > > ModelSelectionViaTrainValidationSplitExample and > JavaModelSelectionViaTrainValidationSplitExample fail to run. If it finally > turns out to be a bug of TrainValidationSplit or LinearRegression, let's move this > JIRA out of SPARK-11337.
[jira] [Commented] (SPARK-13462) Vector serialization error in example code of ModelSelectionViaTrainValidationSplitExample and JavaModelSelectionViaTrainValidationSplitExample
[ https://issues.apache.org/jira/browse/SPARK-13462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220239#comment-15220239 ] Xusen Yin commented on SPARK-13462: --- Well, this is a false alarm. They can run with the current GitHub master. I'll close it. > Vector serialization error in example code of > ModelSelectionViaTrainValidationSplitExample and > JavaModelSelectionViaTrainValidationSplitExample > --- > > Key: SPARK-13462 > URL: https://issues.apache.org/jira/browse/SPARK-13462 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Examples >Reporter: Xusen Yin >Priority: Minor > > ModelSelectionViaTrainValidationSplitExample and > JavaModelSelectionViaTrainValidationSplitExample fail to run. If it finally > turns out to be a bug of TrainValidationSplit or LinearRegression, let's move this > JIRA out of SPARK-11337.
[jira] [Commented] (SPARK-14300) Scala MLlib examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220182#comment-15220182 ] Xusen Yin commented on SPARK-14300: --- Thanks! Be sure to check every code example. > Scala MLlib examples code merge and clean up > > > Key: SPARK-14300 > URL: https://issues.apache.org/jira/browse/SPARK-14300 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/mllib: > * scala/mllib > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * Unsure code duplications (need double check) > ** AbstractParams.scala > ** BinaryClassification.scala > ** Correlations.scala > ** CosineSimilarity.scala > ** DenseGaussianMixture.scala > ** FPGrowthExample.scala > ** MovieLensALS.scala > ** MultivariateSummarizer.scala > ** RandomRDDGeneration.scala > ** SampledRDDs.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14299: -- Description: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample ** DeveloperApiExample.scala --> I delete it for now because it's only about how to create your own classifier, etc., which can be learned easily from other examples and ml codes. ** SimpleParamsExample.scala --> merge with LogisticRegressionSummaryExample.scala ** SimpleTextClassificationPipeline.scala --> ModelSelectionViaCrossValidationExample ** DataFrameExample.scala --> merge with LogisticRegressionSummaryExample.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example. 
was: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample ** DeveloperApiExample.scala --> I delete it for now because it's only about how to create your own classifier, etc., which can be learned easily from other examples and ml codes. ** SimpleParamsExample.scala --> merge with LogisticRegressionSummaryExample.scala ** SimpleTextClassificationPipeline.scala --> ModelSelectionViaCrossValidationExample * Unsure code duplications (need double check) ** DataFrameExample.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example. 
> Scala ML examples code merge and clean up > - > > Key: SPARK-14299 > URL: https://issues.apache.org/jira/browse/SPARK-14299 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/ml: > * scala/ml > ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample > ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, > DecisionTreeClassificationExample > ** GBTExample.scala --> GradientBoostedTreeClassifierExample, > GradientBoostedTreeRegressorExample > ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample > ** LogisticRegressionExample.scala --> > LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample > ** RandomForestExample.scala --> RandomForestRegressorExample, > RandomForestClassifierExample > ** TrainValidationSplitExample.scala --> > ModelSelectionViaTrainValidationSplitExample > ** DeveloperApiExample.scala --> I delete it for now because it's only about > how to create your own classifier, etc., which can be learned easily from > other examples and ml codes. > ** SimpleParamsExample.scala --> merge with > LogisticRegressionSummaryExample.scala > ** SimpleTextClassificationPipeline.scala --> > ModelSelectionViaCrossValidationExample > ** DataFrameExample.scala --> merge with > LogisticRegressionSummaryExample.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. > I'll take this one as an example.
[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14299: -- Description: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample ** DeveloperApiExample.scala --> I delete it for now because it's only about how to create your own classifier, etc., which can be learned easily from other examples and ml codes. ** SimpleParamsExample.scala --> merge with LogisticRegressionSummaryExample.scala ** SimpleTextClassificationPipeline.scala --> ModelSelectionViaCrossValidationExample * Unsure code duplications (need double check) ** DataFrameExample.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example. 
was: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample ** DeveloperApiExample.scala --> I delete it for now because it's only about how to create your own classifier, etc., which can be learned easily from other examples and ml codes. ** SimpleParamsExample.scala --> merge with LogisticRegressionSummaryExample.scala * Unsure code duplications (need double check) ** DataFrameExample.scala ** SimpleTextClassificationPipeline.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example. 
> Scala ML examples code merge and clean up > - > > Key: SPARK-14299 > URL: https://issues.apache.org/jira/browse/SPARK-14299 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/ml: > * scala/ml > ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample > ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, > DecisionTreeClassificationExample > ** GBTExample.scala --> GradientBoostedTreeClassifierExample, > GradientBoostedTreeRegressorExample > ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample > ** LogisticRegressionExample.scala --> > LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample > ** RandomForestExample.scala --> RandomForestRegressorExample, > RandomForestClassifierExample > ** TrainValidationSplitExample.scala --> > ModelSelectionViaTrainValidationSplitExample > ** DeveloperApiExample.scala --> I delete it for now because it's only about > how to create your own classifier, etc., which can be learned easily from > other examples and ml codes. > ** SimpleParamsExample.scala --> merge with > LogisticRegressionSummaryExample.scala > ** SimpleTextClassificationPipeline.scala --> > ModelSelectionViaCrossValidationExample > * Unsure code duplications (need double check) > ** DataFrameExample.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. > I'll take this one as an example.
[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14299: -- Description: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample ** DeveloperApiExample.scala --> I deleted it for now because it's only about how to create your own classifier, etc., which can be learned easily from other examples and the ML code. ** SimpleParamsExample.scala --> merge with LogisticRegressionSummaryExample.scala * Unsure code duplications (need double check) ** DataFrameExample.scala ** SimpleTextClassificationPipeline.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example.
was: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample ** DeveloperApiExample.scala --> I deleted it for now because it's only about how to create your own classifier, etc., which can be learned easily from other examples and the ML code. * Unsure code duplications (need double check) ** DataFrameExample.scala ** SimpleParamsExample.scala ** SimpleTextClassificationPipeline.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example.
> Scala ML examples code merge and clean up > - > > Key: SPARK-14299 > URL: https://issues.apache.org/jira/browse/SPARK-14299 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/ml: > * scala/ml > ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample > ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, > DecisionTreeClassificationExample > ** GBTExample.scala --> GradientBoostedTreeClassifierExample, > GradientBoostedTreeRegressorExample > ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample > ** LogisticRegressionExample.scala --> > LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample > ** RandomForestExample.scala --> RandomForestRegressorExample, > RandomForestClassifierExample > ** TrainValidationSplitExample.scala --> > ModelSelectionViaTrainValidationSplitExample > ** DeveloperApiExample.scala --> I deleted it for now because it's only about > how to create your own classifier, etc., which can be learned easily from > other examples and the ML code. > ** SimpleParamsExample.scala --> merge with > LogisticRegressionSummaryExample.scala > * Unsure code duplications (need double check) > ** DataFrameExample.scala > ** SimpleTextClassificationPipeline.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. > I'll take this one as an example.
[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14299: -- Description: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample ** DeveloperApiExample.scala --> I deleted it for now because it's only about how to create your own classifier, etc., which can be learned easily from other examples and the ML code. * Unsure code duplications (need double check) ** DataFrameExample.scala ** SimpleParamsExample.scala ** SimpleTextClassificationPipeline.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example.
was: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample * Unsure code duplications (need double check) ** DataFrameExample.scala ** DeveloperApiExample.scala ** SimpleParamsExample.scala ** SimpleTextClassificationPipeline.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example.
> Scala ML examples code merge and clean up > - > > Key: SPARK-14299 > URL: https://issues.apache.org/jira/browse/SPARK-14299 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/ml: > * scala/ml > ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample > ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, > DecisionTreeClassificationExample > ** GBTExample.scala --> GradientBoostedTreeClassifierExample, > GradientBoostedTreeRegressorExample > ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample > ** LogisticRegressionExample.scala --> > LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample > ** RandomForestExample.scala --> RandomForestRegressorExample, > RandomForestClassifierExample > ** TrainValidationSplitExample.scala --> > ModelSelectionViaTrainValidationSplitExample > ** DeveloperApiExample.scala --> I deleted it for now because it's only about > how to create your own classifier, etc., which can be learned easily from > other examples and the ML code. > * Unsure code duplications (need double check) > ** DataFrameExample.scala > ** SimpleParamsExample.scala > ** SimpleTextClassificationPipeline.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. > I'll take this one as an example.
[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14299: -- Description: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample * Unsure code duplications (need double check) ** DataFrameExample.scala ** DeveloperApiExample.scala ** SimpleParamsExample.scala ** SimpleTextClassificationPipeline.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example.
was: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample * Unsure code duplications (need double check) ** DataFrameExample.scala ** DeveloperApiExample.scala ** SimpleParamsExample.scala ** SimpleTextClassificationPipeline.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example.
> Scala ML examples code merge and clean up > - > > Key: SPARK-14299 > URL: https://issues.apache.org/jira/browse/SPARK-14299 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/ml: > * scala/ml > ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample > ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, > DecisionTreeClassificationExample > ** GBTExample.scala --> GradientBoostedTreeClassifierExample, > GradientBoostedTreeRegressorExample > ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample > ** LogisticRegressionExample.scala --> > LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample > ** RandomForestExample.scala --> RandomForestRegressorExample, > RandomForestClassifierExample > ** TrainValidationSplitExample.scala --> > ModelSelectionViaTrainValidationSplitExample > * Unsure code duplications (need double check) > ** DataFrameExample.scala > ** DeveloperApiExample.scala > ** SimpleParamsExample.scala > ** SimpleTextClassificationPipeline.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. > I'll take this one as an example.
[jira] [Commented] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15220083#comment-15220083 ] Xusen Yin commented on SPARK-14041: --- I've split them into 4 JIRAs. > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > To find out all examples of ml/mllib that don't contain "example on": > {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} > Duplicates need to be deleted: > * scala/ml > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py
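The `grep -L "example on"` one-liner quoted above can be exercised end to end. The sketch below is illustrative only: the directory `/tmp/ml-examples` and both file names and contents are made up for the demo (they are not from the Spark repo), and `/path/to/ml-or-mllib/examples` in the issue text stays whatever your local checkout is. `grep -L` prints the names of files that do *not* match the pattern, which is exactly how the duplicate candidates above were located:

```shell
# Toy stand-in for an examples directory (all names/contents invented).
mkdir -p /tmp/ml-examples

# A migrated example carries the shared-snippet markers.
cat > /tmp/ml-examples/WithMarker.scala <<'EOF'
// $example on$
println("shared example body")
// $example off$
EOF

# A candidate duplicate has no marker yet.
cat > /tmp/ml-examples/NoMarker.scala <<'EOF'
println("legacy example body")
EOF

# -L = --files-without-match: list files NOT containing the pattern,
# i.e. the examples still to be merged or deleted.
grep -L "example on" /tmp/ml-examples/*.scala
# -> /tmp/ml-examples/NoMarker.scala
```

The markers matter because the "example on and off blocks" the descriptions warn about are the spans extracted into the published docs; a file without them is invisible to the docs build, which is what makes it a deletion candidate.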
[jira] [Updated] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14302: -- Description: Duplicated code that I found in python/examples/mllib and python/examples/ml: * python/ml ** None * Unsure duplications, double check ** dataframe_example.py ** kmeans_example.py ** simple_params_example.py ** simple_text_classification_pipeline.py * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py * Unsure duplications, double check ** correlations.py ** random_rdd_generation.py ** sampled_rdds.py ** word2vec.py When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. was: Duplicated code that I found in python/examples/mllib and python/examples/ml: * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * Unsure duplications, double check > ** dataframe_example.py > ** kmeans_example.py > ** simple_params_example.py > ** simple_text_classification_pipeline.py > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > * Unsure duplications, double check > ** correlations.py > ** random_rdd_generation.py > ** sampled_rdds.py > ** word2vec.py > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Updated] (SPARK-14301) Java examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14301: -- Description: Duplicated code that I found in java/examples/mllib and java/examples/ml: * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * Unsure code duplications of java/ml, double check ** JavaDeveloperApiExample.java ** JavaSimpleParamsExample.java ** JavaSimpleTextClassificationPipeline.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * Unsure code duplications of java/mllib, double check ** JavaALS.java ** JavaFPGrowthExample.java When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. was: Duplicated code that I found in java/examples/mllib and java/examples/ml: * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java When merging and cleaning that code, be sure not to disturb the previous example on and off blocks.
> Java examples code merge and clean up > - > > Key: SPARK-14301 > URL: https://issues.apache.org/jira/browse/SPARK-14301 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in java/examples/mllib and java/examples/ml: > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * Unsure code duplications of java/ml, double check > ** JavaDeveloperApiExample.java > ** JavaSimpleParamsExample.java > ** JavaSimpleTextClassificationPipeline.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * Unsure code duplications of java/mllib, double check > ** JavaALS.java > ** JavaFPGrowthExample.java > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Updated] (SPARK-14300) Scala MLlib examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14300: -- Description: Duplicated code that I found in scala/examples/mllib: * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * Unsure code duplications (need double check) ** AbstractParams.scala ** BinaryClassification.scala ** Correlations.scala ** CosineSimilarity.scala ** DenseGaussianMixture.scala ** FPGrowthExample.scala ** MovieLensALS.scala ** MultivariateSummarizer.scala ** RandomRDDGeneration.scala ** SampledRDDs.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. was: Duplicated code that I found in scala/examples/mllib: * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks.
> Scala MLlib examples code merge and clean up > > > Key: SPARK-14300 > URL: https://issues.apache.org/jira/browse/SPARK-14300 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/mllib: > * scala/mllib > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * Unsure code duplications (need double check) > ** AbstractParams.scala > ** BinaryClassification.scala > ** Correlations.scala > ** CosineSimilarity.scala > ** DenseGaussianMixture.scala > ** FPGrowthExample.scala > ** MovieLensALS.scala > ** MultivariateSummarizer.scala > ** RandomRDDGeneration.scala > ** SampledRDDs.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Created] (SPARK-14299) Scala examples code merge and clean up
Xusen Yin created SPARK-14299: - Summary: Scala examples code merge and clean up Key: SPARK-14299 URL: https://issues.apache.org/jira/browse/SPARK-14299 Project: Spark Issue Type: Sub-task Components: Examples Reporter: Xusen Yin Priority: Minor Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example.
[jira] [Created] (SPARK-14300) Scala MLlib examples code merge and clean up
Xusen Yin created SPARK-14300: - Summary: Scala MLlib examples code merge and clean up Key: SPARK-14300 URL: https://issues.apache.org/jira/browse/SPARK-14300 Project: Spark Issue Type: Sub-task Components: Examples Reporter: Xusen Yin Priority: Minor Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example.
[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14299: -- Description: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample * Unsure code duplications ** DataFrameExample.scala ** DeveloperApiExample.scala ** SimpleParamsExample.scala ** SimpleTextClassificationPipeline.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example. was: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example.
> Scala ML examples code merge and clean up > - > > Key: SPARK-14299 > URL: https://issues.apache.org/jira/browse/SPARK-14299 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/ml: > * scala/ml > ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample > ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, > DecisionTreeClassificationExample > ** GBTExample.scala --> GradientBoostedTreeClassifierExample, > GradientBoostedTreeRegressorExample > ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample > ** LogisticRegressionExample.scala --> > LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample > ** RandomForestExample.scala --> RandomForestRegressorExample, > RandomForestClassifierExample > ** TrainValidationSplitExample.scala --> > ModelSelectionViaTrainValidationSplitExample > * Unsure code duplications > ** DataFrameExample.scala > ** DeveloperApiExample.scala > ** SimpleParamsExample.scala > ** SimpleTextClassificationPipeline.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. I'll take this one as an example.
[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14299: -- Description: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample * Unsure code duplications (need double check) ** DataFrameExample.scala ** DeveloperApiExample.scala ** SimpleParamsExample.scala ** SimpleTextClassificationPipeline.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example.
was: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample * Unsure code duplications ** DataFrameExample.scala ** DeveloperApiExample.scala ** SimpleParamsExample.scala ** SimpleTextClassificationPipeline.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example. > Scala ML examples code merge and clean up > - > > Key: SPARK-14299 > URL: https://issues.apache.org/jira/browse/SPARK-14299 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/ml: > * scala/ml > ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample > ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, > DecisionTreeClassificationExample > ** GBTExample.scala --> GradientBoostedTreeClassifierExample, > GradientBoostedTreeRegressorExample > ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample > ** LogisticRegressionExample.scala --> > LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample > ** RandomForestExample.scala --> RandomForestRegressorExample, > RandomForestClassifierExample > ** TrainValidationSplitExample.scala --> > ModelSelectionViaTrainValidationSplitExample > * Unsure code duplications (need double check) > ** DataFrameExample.scala > ** DeveloperApiExample.scala > ** SimpleParamsExample.scala > ** SimpleTextClassificationPipeline.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. I'll take this one as an example.
[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14041: -- Description: To find out all examples of ml/mllib that don't contain "example on": {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py was: To find out all examples of ml/mllib that don't contain "example on": {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py
> Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > To find out all examples of ml/mllib that don't contain "example on": > {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} > Duplicates need to be deleted: > * scala/ml > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py
[jira] [Updated] (SPARK-14302) Python examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14302: -- Description: Duplicated code that I found in python/examples/mllib and python/examples/ml: * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. was: Duplicated code that I found in java/examples/mllib and java/examples/ml: * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. > Python examples code merge and clean up > --- > > Key: SPARK-14302 > URL: https://issues.apache.org/jira/browse/SPARK-14302 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in python/examples/mllib and python/examples/ml: > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14301) Java examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14301: -- Description: Duplicated code that I found in java/examples/mllib and java/examples/ml: * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. was: Duplicated code that I found in java/examples/mllib and java/examples/ml: * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. > Java examples code merge and clean up > - > > Key: SPARK-14301 > URL: https://issues.apache.org/jira/browse/SPARK-14301 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in java/examples/mllib and java/examples/ml: > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14301) Java examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14301: -- Description: Duplicated code that I found in java/examples/mllib and java/examples/ml: * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. was: Duplicated code that I found in scala/examples/mllib: * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. > Java examples code merge and clean up > - > > Key: SPARK-14301 > URL: https://issues.apache.org/jira/browse/SPARK-14301 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in java/examples/mllib and java/examples/ml: > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14302) Python examples code merge and clean up
Xusen Yin created SPARK-14302: - Summary: Python examples code merge and clean up Key: SPARK-14302 URL: https://issues.apache.org/jira/browse/SPARK-14302 Project: Spark Issue Type: Sub-task Components: Examples Reporter: Xusen Yin Priority: Minor Duplicated code that I found in java/examples/mllib and java/examples/ml: * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
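The "example on and off blocks" these issues keep warning about are paired marker comments that the documentation build uses to pull snippets out of the example files, so moving or deleting them silently breaks the published docs. A hedged Python sketch of how such an extraction works (the {code}$example on$/$example off${code} marker syntax is an assumption here, not taken from the Spark build scripts):

```python
def extract_example(source):
    """Return only the lines between the on/off markers (markers excluded).

    The `$example on$` / `$example off$` comment syntax is assumed; it is
    the convention by which a docs build pulls a snippet from an example
    file, which is why merges must not disturb these blocks.
    """
    keep, inside = [], False
    for line in source.splitlines():
        if "$example on$" in line:
            inside = True
        elif "$example off$" in line:
            inside = False
        elif inside:
            keep.append(line)
    return "\n".join(keep)

demo = """import sys
# $example on$
model = train(data)
# $example off$
print("done")"""
print(extract_example(demo))  # model = train(data)
```

Anything outside the markers (imports, argument parsing, print statements) is free to change during a merge; anything inside is what readers of the docs actually see.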
[jira] [Updated] (SPARK-14300) Scala MLlib examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14300: -- Description: Duplicated code that I found in scala/examples/mllib: * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. was: Duplicated code that I found in scala/examples/ml: * scala/ml ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, DecisionTreeClassificationExample ** GBTExample.scala --> GradientBoostedTreeClassifierExample, GradientBoostedTreeRegressorExample ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample ** LogisticRegressionExample.scala --> LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample ** RandomForestExample.scala --> RandomForestRegressorExample, RandomForestClassifierExample ** TrainValidationSplitExample.scala --> ModelSelectionViaTrainValidationSplitExample When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. I'll take this one as an example. 
> Scala MLlib examples code merge and clean up > > > Key: SPARK-14300 > URL: https://issues.apache.org/jira/browse/SPARK-14300 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/mllib: > * scala/mllib > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14301) Java examples code merge and clean up
Xusen Yin created SPARK-14301: - Summary: Java examples code merge and clean up Key: SPARK-14301 URL: https://issues.apache.org/jira/browse/SPARK-14301 Project: Spark Issue Type: Sub-task Components: Examples Reporter: Xusen Yin Priority: Minor Duplicated code that I found in scala/examples/mllib: * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala When merging and cleaning that code, be sure not to disturb the previous example on and off blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14299) Scala ML examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14299: -- Summary: Scala ML examples code merge and clean up (was: Scala examples code merge and clean up) > Scala ML examples code merge and clean up > - > > Key: SPARK-14299 > URL: https://issues.apache.org/jira/browse/SPARK-14299 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/ml: > * scala/ml > ** CrossValidatorExample.scala --> ModelSelectionViaCrossValidationExample > ** DecisionTreeExample.scala --> DecisionTreeRegressionExample, > DecisionTreeClassificationExample > ** GBTExample.scala --> GradientBoostedTreeClassifierExample, > GradientBoostedTreeRegressorExample > ** LinearRegressionExample.scala --> LinearRegressionWithElasticNetExample > ** LogisticRegressionExample.scala --> > LogisticRegressionWithElasticNetExample, LogisticRegressionSummaryExample > ** RandomForestExample.scala --> RandomForestRegressorExample, > RandomForestClassifierExample > ** TrainValidationSplitExample.scala --> > ModelSelectionViaTrainValidationSplitExample > When merging and cleaning those code, be sure not disturb the previous > example on and off blocks. I'll take this one as an example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-14181) TrainValidationSplit should have HasSeed
Xusen Yin created SPARK-14181: - Summary: TrainValidationSplit should have HasSeed Key: SPARK-14181 URL: https://issues.apache.org/jira/browse/SPARK-14181 Project: Spark Issue Type: Improvement Components: ML Reporter: Xusen Yin Priority: Minor TrainValidationSplit should have HasSeed, just like its Python companion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
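For context on why a HasSeed param matters here: TrainValidationSplit partitions the data randomly once, so without a seed two runs produce different splits and non-comparable metrics. A minimal Python sketch of the idea (hypothetical function, not the Spark API):

```python
import random

def train_validation_split(rows, train_ratio=0.75, seed=None):
    """Split rows into (train, validation) once; reproducible when seeded."""
    rng = random.Random(seed)  # a fixed seed gives the same shuffle every run
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

data = list(range(8))
a = train_validation_split(data, seed=42)
b = train_validation_split(data, seed=42)
print(a == b)  # True: same seed, same split
```

With seed=None the split still works but differs between runs, which is exactly the non-reproducibility a HasSeed param removes.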
[jira] [Commented] (SPARK-13786) Pyspark ml.tuning support export/import
[ https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213357#comment-15213357 ] Xusen Yin commented on SPARK-13786: --- I have finished the CrossValidator, but need to wait until SPARK-11893 is merged first. > Pyspark ml.tuning support export/import > --- > > Key: SPARK-13786 > URL: https://issues.apache.org/jira/browse/SPARK-13786 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > > This should follow whatever implementation is chosen for Pipeline (since > these are all meta-algorithms). > Note this will also require persistence for Evaluators. Hopefully that can > leverage the Java implementations; there is not a real need to make Python > Evaluators be MLWritable, as far as I can tell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13786) Pyspark ml.tuning support export/import
[ https://issues.apache.org/jira/browse/SPARK-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212364#comment-15212364 ] Xusen Yin commented on SPARK-13786: --- I'll work on it. > Pyspark ml.tuning support export/import > --- > > Key: SPARK-13786 > URL: https://issues.apache.org/jira/browse/SPARK-13786 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > > This should follow whatever implementation is chosen for Pipeline (since > these are all meta-algorithms). > Note this will also require persistence for Evaluators. Hopefully that can > leverage the Java implementations; there is not a real need to make Python > Evaluators be MLWritable, as far as I can tell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14041: -- Description: To find out all examples of ml/mllib that don't contain "example on": {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py was: Please go through the current example code and list possible duplicates. 
To find out all examples of ml/mllib that don't contain "example on": {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > To find out all examples of ml/mllib that don't contain "example on": > {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} > Duplicates need to be deleted: > * scala/ml > > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** 
TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14041: -- Description: Please go through the current example code and list possible duplicates. To find out all examples of ml/mllib that don't contain "example on": {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py was: Please go through the current example code and list possible duplicates. 
Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Please go through the current example code and list possible duplicates. 
> To find out all examples of ml/mllib that don't contain "example on": > {code}grep -L "example on" /path/to/ml-or-mllib/examples{code} > Duplicates need to be deleted: > * scala/ml > > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15207417#comment-15207417 ] Xusen Yin commented on SPARK-14041: --- [~mengxr] Maybe there's no need to divide them into several JIRAs, since all we need to do is delete them. > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Please go through the current example code and list possible duplicates. > Duplicates need to be deleted: > * scala/ml > > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14041: -- Description: Please go through the current example code and list possible duplicates. Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala * java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py was: Please go through the current example code and list possible duplicates. 
Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala *java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Please go through the current example code and list possible duplicates. 
> Duplicates need to be deleted: > * scala/ml > > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14041) Locate possible duplicates and group them into subtasks
[ https://issues.apache.org/jira/browse/SPARK-14041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-14041: -- Description: Please go through the current example code and list possible duplicates. Duplicates need to be deleted: * scala/ml ** CrossValidatorExample.scala ** DecisionTreeExample.scala ** GBTExample.scala ** LinearRegressionExample.scala ** LogisticRegressionExample.scala ** RandomForestExample.scala ** TrainValidationSplitExample.scala * scala/mllib ** DecisionTreeRunner.scala ** DenseGaussianMixture.scala ** DenseKMeans.scala ** GradientBoostedTreesRunner.scala ** LDAExample.scala ** LinearRegression.scala ** SparseNaiveBayes.scala ** StreamingLinearRegression.scala ** StreamingLogisticRegression.scala ** TallSkinnyPCA.scala ** TallSkinnySVD.scala *java/ml ** JavaCrossValidatorExample.java ** JavaDocument.java ** JavaLabeledDocument.java ** JavaTrainValidationSplitExample.java * java/mllib ** JavaKMeans.java ** JavaLDAExample.java ** JavaLR.java * python/ml ** None * python/mllib ** gaussian_mixture_model.py ** kmeans.py ** logistic_regression.py was:Please go through the current example code and list possible duplicates. > Locate possible duplicates and group them into subtasks > --- > > Key: SPARK-14041 > URL: https://issues.apache.org/jira/browse/SPARK-14041 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin > > Please go through the current example code and list possible duplicates. 
> Duplicates need to be deleted: > * scala/ml > > ** CrossValidatorExample.scala > ** DecisionTreeExample.scala > ** GBTExample.scala > ** LinearRegressionExample.scala > ** LogisticRegressionExample.scala > ** RandomForestExample.scala > ** TrainValidationSplitExample.scala > * scala/mllib > > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > *java/ml > ** JavaCrossValidatorExample.java > ** JavaDocument.java > ** JavaLabeledDocument.java > ** JavaTrainValidationSplitExample.java > * java/mllib > ** JavaKMeans.java > ** JavaLDAExample.java > ** JavaLR.java > * python/ml > ** None > * python/mllib > ** gaussian_mixture_model.py > ** kmeans.py > ** logistic_regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13461) Duplicated example code merge and cleanup
[ https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203569#comment-15203569 ] Xusen Yin commented on SPARK-13461: --- I deleted it. It's from another JIRA. > Duplicated example code merge and cleanup > - > > Key: SPARK-13461 > URL: https://issues.apache.org/jira/browse/SPARK-13461 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Merge duplicated code after we finish the example code substitution. > Duplications include: > * JavaTrainValidationSplitExample > * TrainValidationSplitExample > * Others can be added here ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13461) Duplicated example code merge and cleanup
[ https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-13461: -- Description: Merge duplicated code after we finish the example code substitution. Duplications include: * JavaTrainValidationSplitExample * TrainValidationSplitExample * Others can be added here ... was: Merge duplicated code after we finish the example code substitution. Duplications include: * JavaTrainValidationSplitExample * TrainValidationSplitExample * Random data generation in mllib-statistics.md needs to remove "-" in each line. * Others can be added here ... > Duplicated example code merge and cleanup > - > > Key: SPARK-13461 > URL: https://issues.apache.org/jira/browse/SPARK-13461 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Merge duplicated code after we finish the example code substitution. > Duplications include: > * JavaTrainValidationSplitExample > * TrainValidationSplitExample > * Others can be added here ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13461) Duplicated example code merge and cleanup
[ https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15203076#comment-15203076 ] Xusen Yin commented on SPARK-13461: --- Yes, we'll delete it. > Duplicated example code merge and cleanup > - > > Key: SPARK-13461 > URL: https://issues.apache.org/jira/browse/SPARK-13461 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Merge duplicated code after we finish the example code substitution. > Duplications include: > * JavaTrainValidationSplitExample > * TrainValidationSplitExample > * Random data generation in mllib-statistics.md needs the "-" removed from each > line. > * Others can be added here ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13993) PySpark ml.feature.RFormula/RFormulaModel support export/import
Xusen Yin created SPARK-13993: - Summary: PySpark ml.feature.RFormula/RFormulaModel support export/import Key: SPARK-13993 URL: https://issues.apache.org/jira/browse/SPARK-13993 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Xusen Yin Priority: Minor Add save/load for RFormula and its model. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13951) PySpark ml.pipeline support export/import - nested Pipelines
[ https://issues.apache.org/jira/browse/SPARK-13951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15198642#comment-15198642 ] Xusen Yin commented on SPARK-13951: --- I'm starting work on it now. > PySpark ml.pipeline support export/import - nested Pipelines > --- > > Key: SPARK-13951 > URL: https://issues.apache.org/jira/browse/SPARK-13951 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Joseph K. Bradley > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196765#comment-15196765 ] Xusen Yin commented on SPARK-13641: --- [~muralidh] I'm going to close this JIRA, since I found this behavior is intended: the one-hot encoder indexes discrete features this way. > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names > --- > > Key: SPARK-13641 > URL: https://issues.apache.org/jira/browse/SPARK-13641 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Xusen Yin >Priority: Minor > > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names. Let's take the HouseVotes84 data set as an example: > {code} > case m: XXXModel => > val attrs = AttributeGroup.fromStructField( > m.summary.predictions.schema(m.summary.featuresCol)) > attrs.attributes.get.map(_.name.get) > {code} > The code above gets the features' names from the features column. Usually, the > features column is generated by RFormula, which contains a VectorAssembler, so the > output attributes are not equal to the original ones. > E.g., we want to recover the HouseVotes84 features' names "V1, V2, ..., V16", > but with RFormula we can only get "V1_n, V2_y, ..., V16_y" because [the > transform function of > VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75] > appends suffixes to the column names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196765#comment-15196765 ] Xusen Yin edited comment on SPARK-13641 at 3/16/16 5:00 AM: I'm going to close this JIRA, since I found this behavior is intended: the one-hot encoder indexes discrete features this way. was (Author: yinxusen): [~muralidh] I'm going to close this JIRA, since I found this behavior is intended: the one-hot encoder indexes discrete features this way. > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names > --- > > Key: SPARK-13641 > URL: https://issues.apache.org/jira/browse/SPARK-13641 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Xusen Yin >Priority: Minor > > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names. Let's take the HouseVotes84 data set as an example: > {code} > case m: XXXModel => > val attrs = AttributeGroup.fromStructField( > m.summary.predictions.schema(m.summary.featuresCol)) > attrs.attributes.get.map(_.name.get) > {code} > The code above gets the features' names from the features column. Usually, the > features column is generated by RFormula, which contains a VectorAssembler, so the > output attributes are not equal to the original ones. > E.g., we want to recover the HouseVotes84 features' names "V1, V2, ..., V16", > but with RFormula we can only get "V1_n, V2_y, ..., V16_y" because [the > transform function of > VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75] > appends suffixes to the column names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
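The category-suffixing described in SPARK-13641 is standard one-hot-encoding behavior rather than something unique to Spark's RFormula. A quick illustration with pandas, used here purely as a stand-in for RFormula's encoder (pandas keeps every category column, whereas RFormula drops one level per feature; only the naming pattern is the point):

```python
import pandas as pd

# Toy stand-in for HouseVotes84: categorical columns V1, V2 with "y"/"n" votes.
df = pd.DataFrame({"V1": ["n", "y", "y"], "V2": ["y", "n", "y"]})

# One-hot encoding appends the category value to each original column name,
# so "V1" becomes "V1_n"/"V1_y" -- the same suffixed names the issue reports
# coming back from getModelFeatures instead of the original "V1", "V2", ...
encoded = pd.get_dummies(df)
print(list(encoded.columns))  # ['V1_n', 'V1_y', 'V2_n', 'V2_y']
```

Recovering the original names "V1, V2, ..." thus requires stripping the suffixes or reading the pre-encoding schema, which is why the wrapper cannot always reveal them from the assembled features column alone.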
[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator
[ https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193991#comment-15193991 ] Xusen Yin commented on SPARK-11136: --- I agree. I'll add it in the next commit. Thanks! > Warm-start support for ML estimator > --- > > Key: SPARK-11136 > URL: https://issues.apache.org/jira/browse/SPARK-11136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin >Priority: Minor > > The current implementation of Estimator does not support warm-start fitting, > i.e. estimator.fit(data, params, partialModel). First we need to add > warm-start support to all ML estimators; this is an umbrella JIRA for that work. > Treat the model as a special parameter, passing it through ParamMap, e.g. val > partialModel: Param[Option[M]] = new Param(...). If a model is provided, we use it > to warm-start; otherwise we start the training process from > the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
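The pattern the SPARK-11136 description proposes — the partial model passed as an optional parameter and used as the starting point when present — can be sketched in a few lines. This is plain Python with hypothetical names, not Spark's actual Estimator/ParamMap API; the Lloyd's-algorithm iterations are elided since only the warm-start dispatch is of interest:

```python
import random


class KMeansModel:
    def __init__(self, centers):
        self.centers = centers


class KMeans:
    """Toy estimator illustrating warm-start fitting: fit() accepts an
    optional partial model and resumes from its centers instead of
    initializing from scratch."""

    def __init__(self, k=2, seed=0):
        self.k = k
        self.seed = seed

    def fit(self, points, initial_model=None):
        if initial_model is not None:
            # Warm start: resume from the previous model's cluster centers.
            centers = list(initial_model.centers)
        else:
            # Cold start: sample k initial centers from the data.
            rng = random.Random(self.seed)
            centers = rng.sample(points, self.k)
        # ... Lloyd's iterations would refine `centers` here (elided) ...
        return KMeansModel(centers)


points = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
cold = KMeans(k=2).fit(points)
warm = KMeans(k=2).fit(points, initial_model=cold)
print(warm.centers == cold.centers)  # True: training resumed from cold's centers
```

In Spark's actual design the partial model would travel through the ParamMap as `Param[Option[M]]` rather than as an extra `fit` argument, but the branch on "model provided vs. not" is the same.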
[jira] [Commented] (SPARK-13868) Random forest accuracy exploration
[ https://issues.apache.org/jira/browse/SPARK-13868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193899#comment-15193899 ] Xusen Yin commented on SPARK-13868: --- I'd love to explore this. > Random forest accuracy exploration > -- > > Key: SPARK-13868 > URL: https://issues.apache.org/jira/browse/SPARK-13868 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This is a JIRA for exploring accuracy improvements for Random Forests. > h2. Background > Initial exploration was based on reports of poor accuracy from > [http://datascience.la/benchmarking-random-forest-implementations/] > Essentially, Spark 1.2 showed poor performance relative to other libraries > for training set sizes of 1M and 10M. > h3. Initial improvements > The biggest issue was that the metric being used was AUC and Spark 1.2 was > using hard predictions, not class probabilities. This was fixed in > [SPARK-9528], and that brought Spark up to performance parity with > scikit-learn, Vowpal Wabbit, and R for the training set size of 1M. > h3. Remaining issues > For training set size 10M, Spark does not yet match the AUC of the other 2 > libraries benchmarked (H2O and xgboost). > Note that, on 1M instances, these 2 libraries also show better results than > scikit-learn, VW, and R. I'm not too familiar with the H2O implementation > and how it differs, but xgboost is a very different algorithm, so it's not > surprising it has different behavior. > h2. My explorations > I've run Spark on the test set of 10M instances. (Note that the benchmark > linked above used somewhat different settings for the different algorithms, > but those settings are actually not that important for this problem. This > included gini vs. entropy impurity and limits on splitting nodes.) > I've tried adjusting: > * maxDepth: Past depth 20, going deeper does not seem to matter > * maxBins: I've gone up to 500, but this too does not seem to matter. 
> However, this is a hard thing to verify since slight differences in > discretization could become significant in a large tree. > h2. Current questions > * H2O: It would be good to understand how this implementation differs from > standard RF implementations (in R, VW, scikit-learn, and Spark). > * xgboost: There's a JIRA for it: [SPARK-8547]. It would be great to see the > Spark package linked from that JIRA tested vs. MLlib on the benchmark data > (or other data). From what I've heard/read, xgboost is sometimes better, > sometimes worse in accuracy (but of course faster with more localized > training). > * Based on the above explorations, are there changes we should make to Spark > RFs? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11136) Warm-start support for ML estimator
[ https://issues.apache.org/jira/browse/SPARK-11136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15193886#comment-15193886 ] Xusen Yin commented on SPARK-11136: --- This is a good point. Actually, in our current setting, the new KMeans only uses the model itself (i.e. the array of cluster centers) without its parameters. E.g.
{code}
if (isSet(initialModel)) {
  require($(initialModel).parentModel.clusterCenters.length == $(k), "mismatched cluster count")
  require(rdd.first().size == $(initialModel).clusterCenters.head.size, "mismatched dimension")
  algo.setInitialModel($(initialModel).parentModel)
}
{code}
But I think you're right; we should also carry over the parameters in some scenarios. IMHO, the parameter overriding order should be (initialModel parameter < default parameter < user-set parameter). What do you think? > Warm-start support for ML estimator > --- > > Key: SPARK-11136 > URL: https://issues.apache.org/jira/browse/SPARK-11136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin >Priority: Minor > > The current implementation of Estimator does not support warm-start fitting, > i.e. estimator.fit(data, params, partialModel). First we need to add > warm-start support to all ML estimators; this is an umbrella JIRA for that work. > Treat the model as a special parameter, passing it through ParamMap, e.g. val > partialModel: Param[Option[M]] = new Param(...). If a model is provided, we use it > to warm-start; otherwise we start the training process from > the beginning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
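The precedence proposed in the comment above (initialModel parameter < default parameter < user-set parameter) amounts to a left-to-right map merge in which later sources win. A sketch in plain Python with hypothetical names — this is not Spark's ParamMap API, just the merge rule:

```python
def resolve_params(initial_model_params, defaults, user_set):
    """Merge parameter maps in increasing priority: values carried over
    from the initial model are overridden by estimator defaults, which
    are in turn overridden by anything the user set explicitly."""
    merged = {}
    for source in (initial_model_params, defaults, user_set):
        merged.update(source)  # later sources overwrite earlier ones
    return merged


resolved = resolve_params(
    initial_model_params={"k": 3, "maxIter": 50},  # from the warm-start model
    defaults={"k": 2, "maxIter": 20, "tol": 1e-4},  # estimator defaults
    user_set={"maxIter": 100},                      # explicit user override
)
print(resolved)  # {'k': 2, 'maxIter': 100, 'tol': 0.0001}
```

Note that under this ordering the estimator's defaults shadow the initial model's parameters; an alternative design would rank the initial model above the defaults, which is part of what the discussion is weighing.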
[jira] [Updated] (SPARK-13461) Duplicated example code merge and cleanup
[ https://issues.apache.org/jira/browse/SPARK-13461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-13461: -- Description: Merge duplicated code after we finish the example code substitution. Duplications include: * JavaTrainValidationSplitExample * TrainValidationSplitExample * Random data generation in mllib-statistics.md needs the "-" removed from each line. * Others can be added here ... was: Merge duplicated code after we finish the example code substitution. Duplications include: * JavaTrainValidationSplitExample * TrainValidationSplitExample * Others can be added here ... > Duplicated example code merge and cleanup > - > > Key: SPARK-13461 > URL: https://issues.apache.org/jira/browse/SPARK-13461 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Merge duplicated code after we finish the example code substitution. > Duplications include: > * JavaTrainValidationSplitExample > * TrainValidationSplitExample > * Random data generation in mllib-statistics.md needs the "-" removed from each > line. > * Others can be added here ... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181872#comment-15181872 ] Xusen Yin commented on SPARK-13641: --- You can check out the code from https://github.com/apache/spark/pull/11486. Run ./bin/sparkR with this [test example|https://github.com/yinxusen/spark/blob/SPARK-13449/R/pkg/inst/tests/testthat/test_mllib.R#L145]. With summary(model) you can see that the column names are not the original ones. > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names > --- > > Key: SPARK-13641 > URL: https://issues.apache.org/jira/browse/SPARK-13641 > Project: Spark > Issue Type: Bug > Components: ML, SparkR >Reporter: Xusen Yin >Priority: Minor > > getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the > original column names. Let's take the HouseVotes84 data set as an example: > {code} > case m: XXXModel => > val attrs = AttributeGroup.fromStructField( > m.summary.predictions.schema(m.summary.featuresCol)) > attrs.attributes.get.map(_.name.get) > {code} > The code above gets the features' names from the features column. Usually, the > features column is generated by RFormula, which contains a VectorAssembler, so the > output attributes are not equal to the original ones. > E.g., we want to recover the HouseVotes84 features' names "V1, V2, ..., V16", > but with RFormula we can only get "V1_n, V2_y, ..., V16_y" because [the > transform function of > VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75] > appends suffixes to the column names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org