[jira] [Created] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`
StanZhai created SPARK-9010:

Summary: Improve the Spark Configuration document about `spark.kryoserializer.buffer`
Key: SPARK-9010
URL: https://issues.apache.org/jira/browse/SPARK-9010
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.4.0
Reporter: StanZhai
Priority: Minor

The description of `spark.kryoserializer.buffer` should be: "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to `spark.kryoserializer.buffer.max` if needed." The property `spark.kryoserializer.buffer.max.mb` is out of date in Spark 1.4.
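As a brief illustration (not part of the JIRA itself; the values are examples, but `spark.kryoserializer.buffer` and `spark.kryoserializer.buffer.max` are the Spark 1.4 property names the report refers to), these settings can be applied through SparkConf:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Example values only: start each per-core buffer at 64 KB and let it grow
// to at most 64 MB. `spark.kryoserializer.buffer.max` supersedes the older
// `spark.kryoserializer.buffer.max.mb` in Spark 1.4.
val conf = new SparkConf()
  .setAppName("kryo-buffer-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64k")
  .set("spark.kryoserializer.buffer.max", "64m")
val sc = new SparkContext(conf)
{code}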
[jira] [Resolved] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch
[ https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-8941.
Resolution: Duplicate

Standalone cluster worker does not accept multiple masters on launch

Key: SPARK-8941
URL: https://issues.apache.org/jira/browse/SPARK-8941
Project: Spark
Issue Type: Bug
Components: Deploy, Documentation
Affects Versions: 1.4.0, 1.4.1
Reporter: Jesper Lundgren
Priority: Critical

Before 1.4 it was possible to launch a worker node with a comma-separated list of master nodes, e.g.:

{noformat}
sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078
starting org.apache.spark.deploy.worker.Worker, logging to /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out
failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf.
15/07/09 12:33:06 INFO Utils: Shutdown hook called
{noformat}

Spark 1.2 and 1.3.1 accept multiple masters in this format.

Update: in 1.4, start-slave.sh expects only the master list (no instance number argument).
[jira] [Commented] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF
[ https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624483#comment-14624483 ]

Shivam Verma commented on SPARK-9011:

Thanks Sean, I did some more experiments. It is really a bug because pyspark.ml.tuning.CrossValidator seems to accept the outputs of only certain classifiers. So it is a question of making a design choice: either ensuring consistency across classifier outputs in Spark.ML, or making the BinaryClassificationEvaluator generic enough. I have modified the description above accordingly and am reopening the issue.

Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF

Key: SPARK-9011
URL: https://issues.apache.org/jira/browse/SPARK-9011
Project: Spark
Issue Type: Bug
Components: ML, MLlib, PySpark
Affects Versions: 1.4.0
Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on a single node running CentOS
Reporter: Shivam Verma
Priority: Critical
Labels: cross-validation, ml, mllib, pyspark, randomforest, tuning

Hi, I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF (Random Forest) classifier to classify a small dataset. (This is a bug because CrossValidator works on LR (Logistic Regression) but not on RF.)

Bug: There is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") interprets the rawPrediction column: with LR, the rawPredictionCol is expected to contain vectors, whereas with RF, the prediction column contains doubles.

Suggested resolution: Either enable BinaryClassificationEvaluator to work with doubles, or let RF output a rawPrediction column containing probability vectors (with probability 1 assigned to the predicted label, and 0 to the rest).

Detailed observation: I was running grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes. I get the following error when I pass a DataFrame of features and labels to CrossValidator:

{noformat}
Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType.
{noformat}

I tried the following code, using the dataset given in Spark's CV documentation for [cross validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator]. I also pass the DataFrame through a StringIndexer transformation for the RF:

{noformat}
dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)
{noformat}

Note that the above dataset *works* on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF.
[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624484#comment-14624484 ]

kumar ranganathan commented on SPARK-9009:

Yes, all of this is on a single machine only. The file exists in the specified location for sure. I just tried prefixing with file:/ but got the exception below on the command line itself.

{code}
15/07/13 15:52:32 ERROR SecurityManager: Uncaught exception:
java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
        at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
{code}

SPARK Encryption FileNotFoundException for truststore

Key: SPARK-9009
URL: https://issues.apache.org/jira/browse/SPARK-9009
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

I got a FileNotFoundException in the application master when running the SparkPi example on a Windows machine. The problem is that the truststore file exists at C:\Spark\conf\spark.truststore, but I get the exception below:

{code}
15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception:
java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
        at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
        at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
        at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
        at scala.Option.map(Option.scala:145)
        at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
        at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified))
15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
{code}

If I change the truststore location to a different drive (d:\spark_conf\spark.truststore), I get:

{code}
java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready)
{code}

This exception is thrown from SecurityManager.scala at the openStream() call shown below:

{code:title=SecurityManager.scala|borderStyle=solid}
val trustStoreManagers =
  for (trustStore <- fileServerSSLOptions.trustStore) yield {
    val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
    try {
{code}

The same problem occurs for the keystore file when the truststore property is removed from spark-defaults.conf. When encryption is disabled by setting spark.ssl.enabled to false, the job completes successfully.
[jira] [Commented] (SPARK-9012) Accumulators in the task table should be escaped
[ https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624572#comment-14624572 ]

Apache Spark commented on SPARK-9012:

User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/7369

Accumulators in the task table should be escaped

Key: SPARK-9012
URL: https://issues.apache.org/jira/browse/SPARK-9012
Project: Spark
Issue Type: Bug
Components: Web UI
Reporter: Shixiong Zhu
Attachments: Screen Shot 2015-07-13 at 8.02.44 PM.png

If you run the following code, the task table is broken because accumulator names aren't escaped:

{code}
val a = sc.accumulator(1, "<table>")
sc.parallelize(1 to 10).foreach(i => a += i)
{code}
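As a minimal sketch of the kind of fix implied (assuming the approach is to HTML-escape the user-supplied accumulator name before rendering it into the task table; whether the linked PR does exactly this is not shown here):

{code}
import scala.xml.Utility

// An accumulator name containing markup, as in the reproduction above.
val accumulatorName = "<table>"

// Escaping turns it into inert text ("&lt;table&gt;"), so the name renders
// as literal characters instead of breaking the page's HTML structure.
val escaped = Utility.escape(accumulatorName)
println(escaped)
{code}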
[jira] [Assigned] (SPARK-9012) Accumulators in the task table should be escaped
[ https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9012:
Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-9012) Accumulators in the task table should be escaped
[ https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9012:
Assignee: Apache Spark
[jira] [Assigned] (SPARK-7751) Add @since to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7751:
Assignee: Apache Spark (was: Xiangrui Meng)

Add @since to stable and experimental methods in MLlib

Key: SPARK-7751
URL: https://issues.apache.org/jira/browse/SPARK-7751
Project: Spark
Issue Type: Umbrella
Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark
Priority: Minor
Labels: starter

This is useful for checking whether a feature exists in some version of Spark. This is an umbrella JIRA to track the progress. We want to have @since tags for both stable methods (those without any Experimental/DeveloperApi/AlphaComponent annotations) and experimental methods in MLlib:

* an example PR for Scala: https://github.com/apache/spark/pull/6101
* an example PR for Python: https://github.com/apache/spark/pull/6295

We need to dig into the git commit history to figure out the Spark version in which a method was first introduced. Take `NaiveBayes.setModelType` as an example: we can grep for `def setModelType` at different version tags.

{code}
meng@xm:~/src/spark $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
meng@xm:~/src/spark $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
  def setModelType(modelType: String): NaiveBayes = {
{code}

If there are better ways, please let us know. We cannot add all @since tags in a single PR, which would be hard to review, so we made subtasks for each package, for example `org.apache.spark.classification`. Feel free to add more sub-tasks for Python and the `spark.ml` package.
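To make the requested change concrete, here is a trimmed-down sketch of the tagging convention (the real method lives in org.apache.spark.mllib.classification.NaiveBayes; the class body here is a simplified stand-in for illustration):

{code}
// Simplified stand-in for the real MLlib class, showing only where the
// ScalaDoc @since tag goes. The grep above shows the method first appears
// in v1.4.0, hence the tag value.
class NaiveBayes private (private var modelType: String) {

  /**
   * Sets the model type using a String (case-sensitive).
   *
   * @since 1.4.0
   */
  def setModelType(modelType: String): NaiveBayes = {
    this.modelType = modelType
    this
  }
}
{code}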
[jira] [Assigned] (SPARK-7751) Add @since to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7751:
Assignee: Xiangrui Meng (was: Apache Spark)
[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624590#comment-14624590 ]

Apache Spark commented on SPARK-7751:

User 'petz2000' has created a pull request for this issue: https://github.com/apache/spark/pull/7370
[jira] [Comment Edited] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624484#comment-14624484 ]

kumar ranganathan edited comment on SPARK-9009 at 7/13/15 10:27 AM:

Yes, all of this is on a single machine only. The file exists in the specified location for sure. I just tried prefixing with file:/ but got the exception below on the command line itself.

{code}
Exception in thread "main" java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:146)
        at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
        at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
        at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
{code}

D: is meant for keeping the truststore file on a different disk (not on C:).
[jira] [Updated] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF
[ https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivam Verma updated SPARK-9011:
Description: updated to the bug-report form quoted in the comment above, replacing the original description (quoted in full under the issue's creation notice below).
[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624487#comment-14624487 ]

Sean Owen commented on SPARK-9009:

Try {{file:///C:/Spark/conf/...}}. Don't use backslashes. I'm saying that the exception for D: says the device isn't ready, but this has nothing to do with Spark.
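For concreteness, a sketch of the URI form being suggested (assuming the standalone SSL properties spark.ssl.enabled and spark.ssl.trustStore are the ones in play, and reusing the reporter's path; whether this form fixes the Windows path handling is exactly what the thread is testing):

{code}
import org.apache.spark.SparkConf

// Forward slashes and an explicit file:/// scheme, per the suggestion above.
// The path is the reporter's truststore location, shown for illustration.
val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.trustStore", "file:///C:/Spark/conf/spark.truststore")
{code}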
[jira] [Updated] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF
[ https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-9011:
Priority: Minor (was: Critical)
[jira] [Updated] (SPARK-9012) Accumulators in the task table should be escaped
[ https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-9012:
Attachment: (was: screenshot-1.png)
[jira] [Updated] (SPARK-9012) Accumulators in the task table should be escaped
[ https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-9012:
Attachment: Screen Shot 2015-07-13 at 8.02.44 PM.png
[jira] [Updated] (SPARK-9012) Accumulators in the task table should be escaped
[ https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-9012:
Attachment: screenshot-1.png
[jira] [Commented] (SPARK-8915) Add @since tags to mllib.classification
[ https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624594#comment-14624594 ]

Apache Spark commented on SPARK-8915:

User 'petz2000' has created a pull request for this issue: https://github.com/apache/spark/pull/7371

Add @since tags to mllib.classification

Key: SPARK-8915
URL: https://issues.apache.org/jira/browse/SPARK-8915
Project: Spark
Issue Type: Sub-task
Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
Labels: starter
Original Estimate: 1h
Remaining Estimate: 1h
[jira] [Assigned] (SPARK-8915) Add @since tags to mllib.classification
[ https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8915:
Assignee: Apache Spark
[jira] [Assigned] (SPARK-8915) Add @since tags to mllib.classification
[ https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-8915:
Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`
[ https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9010:
Assignee: Apache Spark
[jira] [Assigned] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`
[ https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-9010:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`
[ https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624381#comment-14624381 ]

Apache Spark commented on SPARK-9010:

User 'stanzhai' has created a pull request for this issue: https://github.com/apache/spark/pull/7368
[jira] [Updated] (SPARK-9011) Issue with running CrossValidator with RandomForestClassifier on dataset
[ https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivam Verma updated SPARK-9011:
Description: reformatted the original description (quoted in full under the creation notice below) to wrap the error message and code in {noformat} blocks, with minor wording edits.
[jira] [Created] (SPARK-9011) Issue with running CrossValidator with RandomForestClassifier on dataset
Shivam Verma created SPARK-9011:

Summary: Issue with running CrossValidator with RandomForestClassifier on dataset
Key: SPARK-9011
URL: https://issues.apache.org/jira/browse/SPARK-9011
Project: Spark
Issue Type: Bug
Components: ML, MLlib, PySpark
Affects Versions: 1.4.0
Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on a single node running CentOS
Reporter: Shivam Verma
Priority: Critical

Hi, I'm a beginner to Spark, and am trying to run grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes. I get the following error when I pass a DataFrame of features and labels to CrossValidator:

{noformat}
Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType.
{noformat}

I tried the following code, using the dataset given in Spark's CV documentation for logistic regression (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator). I also pass the DataFrame through a StringIndexer transformation for the RF:

{noformat}
dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)
{noformat}

Note that the above dataset works on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF.

My guess is that there is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") receives the prediction column: with LR, the rawPredictionCol is a list/vector, whereas with RF, the prediction column is a double (I tried it out with a single parameter). Is it an issue with the evaluator, or is there anything else that I'm missing?
[jira] [Updated] (SPARK-9008) Stop and remove driver from supervised mode in spark-master interface
[ https://issues.apache.org/jira/browse/SPARK-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-9008:
Priority: Minor (was: Major)
Component/s: Deploy

Can you not just kill -9 the driver process? You can propose a doc change if that would help. Have a look at: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Stop and remove driver from supervised mode in spark-master interface

Key: SPARK-9008
URL: https://issues.apache.org/jira/browse/SPARK-9008
Project: Spark
Issue Type: New Feature
Components: Deploy
Reporter: Jesper Lundgren
Priority: Minor

The cluster will automatically restart failing drivers when they are launched in supervised cluster mode, but there is no official way for an operations team to stop a malfunctioning driver and keep it from restarting. I know there is "bin/spark-class org.apache.spark.deploy.Client kill", but this is undocumented and does not always work so well. It would be great if there were a way to remove supervised mode so that kill -9 works on a driver program.

The documentation surrounding this could also see some improvements. It would be nice to have some best-practice examples of how to work with supervised mode: how to manage graceful shutdown and catch TERM signals. (A TERM signal ends the process with an exit code that triggers a restart in supervised mode, unless you change the exit code in the application logic.)
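To illustrate the reporter's last point, a hedged sketch (using the JVM's sun.misc signal API; the object name and cleanup hook are placeholders, not anything from Spark) of how a driver could intercept TERM and exit with code 0 so that supervise mode does not restart it:

{code}
import sun.misc.{Signal, SignalHandler}

object GracefulDriverShutdown {
  def install(cleanup: () => Unit): Unit = {
    // Intercept SIGTERM: run application cleanup, then exit with code 0.
    // As the report notes, the default TERM exit status is non-zero, which
    // a supervised driver treats as a failure and restarts; exiting 0
    // signals a deliberate stop instead.
    Signal.handle(new Signal("TERM"), new SignalHandler {
      override def handle(sig: Signal): Unit = {
        cleanup()
        System.exit(0)
      }
    })
  }
}
{code}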
[jira] [Comment Edited] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF
[ https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624485#comment-14624485 ] Shivam Verma edited comment on SPARK-9011 at 7/13/15 10:24 AM: --- Thanks Sean, I did some more experiments. It really is a bug, because pyspark.ml.tuning.CrossValidator seems to accept the outputs of only certain classifiers. So it is a question of making a design choice: either ensure consistency across classifier outputs in Spark.ML, or make BinaryClassificationEvaluator generic enough. I have modified the description above accordingly and am reopening the issue. was (Author: shivamverma): I did some more experiments. It really is a bug, because pyspark.ml.tuning.CrossValidator seems to accept the outputs of only certain classifiers. So it is a question of making a design choice: either ensure consistency across classifier outputs in Spark.ML, or make BinaryClassificationEvaluator generic enough. I have modified the description above accordingly and am reopening the issue. Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF Key: SPARK-9011 URL: https://issues.apache.org/jira/browse/SPARK-9011 Project: Spark Issue Type: Bug Components: ML, MLlib, PySpark Affects Versions: 1.4.0 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on a single node running CentOS Reporter: Shivam Verma Priority: Critical Labels: cross-validation, ml, mllib, pyspark, randomforest, tuning Hi, I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF (Random Forest) classifier to classify a small dataset using the pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR (Logistic Regression) but not on RF.) Bug: There is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") interprets the 'rawPrediction' column - with LR, the rawPredictionCol is expected to contain vectors, whereas with RF, the prediction column contains doubles. Suggested Resolution: Either enable BinaryClassificationEvaluator to work with doubles, or let RF output a rawPredictions column containing probability vectors (with probability 1 assigned to the predicted label and 0 assigned to the rest). Detailed Observation: While running grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes, I get the following error when I try passing a DataFrame of Features-Labels to CrossValidator: {noformat} Py4JJavaError: An error occurred while calling o1464.evaluate. : java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType. {noformat} I tried the following code, using the dataset given in Spark's CV documentation for [cross validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator]. I also pass the DF through a StringIndexer transformation for the RF: {noformat} dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") si_model = stringIndexer.fit(dataset) dataset2 = si_model.transform(dataset) keep = [dataset2.features, dataset2.indexed] dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label') rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7) grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset3) {noformat} Note that the above dataset *works* on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
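As a side note, the probability-vector idea in the suggested resolution above can be approximated today outside CrossValidator; a minimal, untested sketch against the 1.4 PySpark APIs, reusing dataset3 from the snippet above and evaluating on the training data purely for illustration:
{code}
from pyspark.sql.functions import udf
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Train RF with its default prediction column (one double per row).
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=5)
predicted = rf.fit(dataset3).transform(dataset3)

# Wrap each double prediction into the vector shape the evaluator expects:
# probability 1.0 on the predicted label and 0.0 on the other.
as_vector = udf(lambda p: Vectors.dense([1.0 - p, p]), VectorUDT())
scored = predicted.withColumn("rawPrediction", as_vector(predicted["prediction"]))

print(BinaryClassificationEvaluator().evaluate(scored))
{code}
This only sidesteps the type check; it does not produce real class probabilities, which is why the ticket asks for a proper fix in Spark itself.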
[jira] [Reopened] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF
[ https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivam Verma reopened SPARK-9011: - I did some more experiments. It really is a bug, because pyspark.ml.tuning.CrossValidator seems to accept the outputs of only certain classifiers. So it is a question of making a design choice: either ensure consistency across classifier outputs in Spark.ML, or make BinaryClassificationEvaluator generic enough. I have modified the description above accordingly and am reopening the issue. Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF Key: SPARK-9011 URL: https://issues.apache.org/jira/browse/SPARK-9011 Project: Spark Issue Type: Bug Components: ML, MLlib, PySpark Affects Versions: 1.4.0 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on a single node running CentOS Reporter: Shivam Verma Priority: Critical Labels: cross-validation, ml, mllib, pyspark, randomforest, tuning Hi, I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF (Random Forest) classifier to classify a small dataset using the pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR (Logistic Regression) but not on RF.) Bug: There is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") interprets the 'rawPrediction' column - with LR, the rawPredictionCol is expected to contain vectors, whereas with RF, the prediction column contains doubles. Suggested Resolution: Either enable BinaryClassificationEvaluator to work with doubles, or let RF output a rawPredictions column containing probability vectors (with probability 1 assigned to the predicted label and 0 assigned to the rest). Detailed Observation: While running grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes, I get the following error when I try passing a DataFrame of Features-Labels to CrossValidator: {noformat} Py4JJavaError: An error occurred while calling o1464.evaluate. : java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType. {noformat} I tried the following code, using the dataset given in Spark's CV documentation for [cross validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator]. I also pass the DF through a StringIndexer transformation for the RF: {noformat} dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") si_model = stringIndexer.fit(dataset) dataset2 = si_model.transform(dataset) keep = [dataset2.features, dataset2.indexed] dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label') rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7) grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset3) {noformat} Note that the above dataset *works* on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624484#comment-14624484 ] kumar ranganathan edited comment on SPARK-9009 at 7/13/15 10:24 AM: Yes, all this is on a single machine only. The file exists in the specified location for sure. I just tried prefixing with file:/ but get the exception below on the command line itself. {code} Exception in thread "main" java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) {code} was (Author: kumar): Yes, all this is on a single machine only. The file exists in the specified location for sure. I just tried prefixing with file:/ but get the exception below on the command line itself. {code} 15/07/13 15:52:32 ERROR SecurityManager: Uncaught exception: java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) {code} SPARK Encryption FileNotFoundException for truststore - Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: kumar ranganathan Priority: Minor I got a FileNotFoundException in the application master when running the SparkPi example on a Windows machine. The problem is that the truststore file exists at C:\Spark\conf\spark.truststore, but I get the exception below: {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If I change the truststore file location to a different drive (d:\spark_conf\spark.truststore) then I get the exception {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception is thrown from SecurityManager.scala at the openStream() call shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore <- fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when the truststore property is removed from spark-defaults.conf. When encryption is disabled by setting spark.ssl.enabled to false, the job completes successfully.
[jira] [Issue Comment Deleted] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF
[ https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivam Verma updated SPARK-9011: Comment: was deleted (was: Thanks Sean, I did some more experiments. It really is a bug, because pyspark.ml.tuning.CrossValidator seems to accept the outputs of only certain classifiers. So it is a question of making a design choice: either ensure consistency across classifier outputs in Spark.ML, or make BinaryClassificationEvaluator generic enough. I have modified the description above accordingly and am reopening the issue. ) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF Key: SPARK-9011 URL: https://issues.apache.org/jira/browse/SPARK-9011 Project: Spark Issue Type: Bug Components: ML, MLlib, PySpark Affects Versions: 1.4.0 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on a single node running CentOS Reporter: Shivam Verma Priority: Critical Labels: cross-validation, ml, mllib, pyspark, randomforest, tuning Hi, I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF (Random Forest) classifier to classify a small dataset using the pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR (Logistic Regression) but not on RF.) Bug: There is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") interprets the 'rawPrediction' column - with LR, the rawPredictionCol is expected to contain vectors, whereas with RF, the prediction column contains doubles. Suggested Resolution: Either enable BinaryClassificationEvaluator to work with doubles, or let RF output a rawPredictions column containing probability vectors (with probability 1 assigned to the predicted label and 0 assigned to the rest). Detailed Observation: While running grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes, I get the following error when I try passing a DataFrame of Features-Labels to CrossValidator: {noformat} Py4JJavaError: An error occurred while calling o1464.evaluate. : java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType. {noformat} I tried the following code, using the dataset given in Spark's CV documentation for [cross validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator]. I also pass the DF through a StringIndexer transformation for the RF: {noformat} dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") si_model = stringIndexer.fit(dataset) dataset2 = si_model.transform(dataset) keep = [dataset2.features, dataset2.indexed] dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label') rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7) grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset3) {noformat} Note that the above dataset *works* on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624484#comment-14624484 ] kumar ranganathan edited comment on SPARK-9009 at 7/13/15 10:26 AM: Yes, all this is on a single machine only. The file exists in the specified location for sure. I just tried prefixing with file:/ but get the exception below on the command line itself. {code} Exception in thread "main" java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) {code} D: refers to keeping the truststore file on a different disk (not on C:). was (Author: kumar): Yes, all this is on a single machine only. The file exists in the specified location for sure. I just tried prefixing with file:/ but get the exception below on the command line itself. {code} Exception in thread "main" java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) {code} SPARK Encryption FileNotFoundException for truststore - Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: kumar ranganathan Priority: Minor I got a FileNotFoundException in the application master when running the SparkPi example on a Windows machine. The problem is that the truststore file exists at C:\Spark\conf\spark.truststore, but I get the exception below: {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If I change the truststore file location to a different drive (d:\spark_conf\spark.truststore) then I get the exception {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception is thrown from SecurityManager.scala at the openStream() call shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore <- fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code}
[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624523#comment-14624523 ] kumar ranganathan commented on SPARK-9009: -- I have tried the code below and it prints true. {code} try { URI uri = new URI("file:///C:/Spark/conf/spark.truststore"); File f = new File(uri); System.out.println(f.canRead()); } catch (Exception ex) { System.out.println(ex); } {code} SPARK Encryption FileNotFoundException for truststore - Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: kumar ranganathan Priority: Minor I got a FileNotFoundException in the application master when running the SparkPi example on a Windows machine. The problem is that the truststore file exists at C:\Spark\conf\spark.truststore, but I get the exception below: {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If I change the truststore file location to a different drive (d:\spark_conf\spark.truststore) then I get the exception {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception is thrown from SecurityManager.scala at the openStream() call shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore <- fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when the truststore property is removed from spark-defaults.conf. When encryption is disabled by setting spark.ssl.enabled to false, the job completes successfully. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`
[ https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] StanZhai updated SPARK-9010: Component/s: (was: SQL) Documentation Improve the Spark Configuration document about `spark.kryoserializer.buffer` Key: SPARK-9010 URL: https://issues.apache.org/jira/browse/SPARK-9010 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.0 Reporter: StanZhai Priority: Minor Labels: documentation The meaning of spark.kryoserializer.buffer should be: "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed." The `spark.kryoserializer.buffer.max.mb` entry is out of date in Spark 1.4. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`
[ https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9010: - Target Version/s: 1.4.2, 1.5.0 Priority: Trivial (was: Minor) Improve the Spark Configuration document about `spark.kryoserializer.buffer` Key: SPARK-9010 URL: https://issues.apache.org/jira/browse/SPARK-9010 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.4.0 Reporter: StanZhai Priority: Trivial Labels: documentation The meaning of spark.kryoserializer.buffer should be: "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed." The `spark.kryoserializer.buffer.max.mb` entry is out of date in Spark 1.4. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
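For concreteness, a sketch of the corrected semantics using the post-1.4 key names (the values shown are the documented defaults, included for illustration only):
{code}
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Initial size of Kryo's serialization buffer, one buffer per core:
        .set("spark.kryoserializer.buffer", "64k")
        # Ceiling the buffer may grow to if needed; replaces the out-of-date
        # spark.kryoserializer.buffer.max.mb key in Spark 1.4:
        .set("spark.kryoserializer.buffer.max", "64m"))
{code}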
[jira] [Updated] (SPARK-9007) start-slave.sh changed API in 1.4 and the documentation got updated to mention the old API
[ https://issues.apache.org/jira/browse/SPARK-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9007: - Priority: Trivial (was: Major) Component/s: (was: Deploy) Documentation [~koudelka] please set the JIRA fields reasonably. Are you going to open a PR? start-slave.sh changed API in 1.4 and the documentation got updated to mention the old API -- Key: SPARK-9007 URL: https://issues.apache.org/jira/browse/SPARK-9007 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.4.0 Reporter: Jesper Lundgren Priority: Trivial In Spark versions before 1.4, start-slave.sh accepted two parameters: a worker number and a list of master addresses. In Spark 1.4 the start-slave.sh worker# parameter was removed, which broke our custom standalone cluster setup. In Spark 1.4 the documentation was also updated to mention start-slave.sh (not previously mentioned), but it describes the old API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624434#comment-14624434 ] Sean Owen commented on SPARK-9009: -- Is this all on one machine? Because the file would not exist on other machines running your jobs. The D: exception is unrelated to Spark. It's probably because you need to specify paths specially on Windows. Try prefixing with file: SPARK Encryption FileNotFoundException for truststore - Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: kumar ranganathan I got a FileNotFoundException in the application master when running the SparkPi example on a Windows machine. The problem is that the truststore file exists at C:\Spark\conf\spark.truststore, but I get the exception below: {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If I change the truststore file location to a different drive (d:\spark_conf\spark.truststore) then I get the exception {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception is thrown from SecurityManager.scala at the openStream() call shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore <- fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when the truststore property is removed from spark-defaults.conf. When encryption is disabled by setting spark.ssl.enabled to false, the job completes successfully. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
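For concreteness, a sketch of what "prefixing with file:" would look like, assuming the spark.ssl.* keys introduced in Spark 1.4 (the same settings usually live in spark-defaults.conf; note the follow-up comments below report that this prefix still failed on Windows):
{code}
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.ssl.enabled", "true")
        # Suggested form: a file: URI with forward slashes and the
        # Windows drive letter included.
        .set("spark.ssl.trustStore", "file:///C:/Spark/conf/spark.truststore"))
{code}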
[jira] [Updated] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9009: - Priority: Minor (was: Major) Component/s: (was: YARN) SPARK Encryption FileNotFoundException for truststore - Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: kumar ranganathan Priority: Minor I got a FileNotFoundException in the application master when running the SparkPi example on a Windows machine. The problem is that the truststore file exists at C:\Spark\conf\spark.truststore, but I get the exception below: {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If I change the truststore file location to a different drive (d:\spark_conf\spark.truststore) then I get the exception {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception is thrown from SecurityManager.scala at the openStream() call shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore <- fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when the truststore property is removed from spark-defaults.conf. When encryption is disabled by setting spark.ssl.enabled to false, the job completes successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624499#comment-14624499 ] kumar ranganathan commented on SPARK-9009: -- Yes, I tried both file:/ and file:///, but both result in the same exception. I used forward slashes, but the exception shows backslashes. SPARK Encryption FileNotFoundException for truststore - Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: kumar ranganathan Priority: Minor I got a FileNotFoundException in the application master when running the SparkPi example on a Windows machine. The problem is that the truststore file exists at C:\Spark\conf\spark.truststore, but I get the exception below: {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If I change the truststore file location to a different drive (d:\spark_conf\spark.truststore) then I get the exception {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception is thrown from SecurityManager.scala at the openStream() call shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore <- fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when the truststore property is removed from spark-defaults.conf. When encryption is disabled by setting spark.ssl.enabled to false, the job completes successfully. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624516#comment-14624516 ] Sean Owen commented on SPARK-9009: -- Can you paste exactly what worked? I'm still not sure we're talking about the same file URIs. SPARK Encryption FileNotFoundException for truststore - Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: kumar ranganathan Priority: Minor I got a FileNotFoundException in the application master when running the SparkPi example on a Windows machine. The problem is that the truststore file exists at C:\Spark\conf\spark.truststore, but I get the exception below: {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If I change the truststore file location to a different drive (d:\spark_conf\spark.truststore) then I get the exception {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception is thrown from SecurityManager.scala at the openStream() call shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore <- fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when the truststore property is removed from spark-defaults.conf. When encryption is disabled by setting spark.ssl.enabled to false, the job completes successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624521#comment-14624521 ] Lianhui Wang commented on SPARK-8646: - [~juliet] From your spark1.4-verbose.log, I see that master=local[*]. So maybe you configured spark.master=local in spark-defaults.conf? The other possibility is that your data_transform.py uses sparkConf.set("spark.master", "local"). Can you check whether either of these is the case? PySpark does not run on YARN Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs results in a "no module named pyspark" error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.| https://issues.apache.org/jira/browse/SPARK-6869 ] This does not represent a binary compatible change to Spark. Scripts that worked on previous Spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
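To make the two suspicions above concrete, a hypothetical sketch (script and app name invented) of how a hard-coded master ends up as local[*] regardless of how the job was submitted:
{code}
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("data_transform")
# Two ways the master can silently become local[*]:
#   1. spark.master=local[*] in conf/spark-defaults.conf, which applies
#      whenever no --master flag is passed to spark-submit
#   2. conf.setMaster("local[*]") here in the script, which overrides
#      even an explicit --master flag
sc = SparkContext(conf=conf)
print(sc.master)  # expect a YARN master when actually running on YARN
{code}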
[jira] [Updated] (SPARK-9011) Issue with running CrossValidator with RandomForestClassifier on dataset
[ https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivam Verma updated SPARK-9011: Description: Hi, I'm a beginner with Spark, and am trying to run grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes. I get the following error when I try passing a DataFrame of Features-Labels to CrossValidator: {noformat} Py4JJavaError: An error occurred while calling o1464.evaluate. : java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType. {noformat} I tried the following code, using the dataset given in Spark's CV documentation for logistic regression. I also pass the DF through a StringIndexer transformation for the RF: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator {noformat} dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") si_model = stringIndexer.fit(dataset) dataset2 = si_model.transform(dataset) keep = [dataset2.features, dataset2.indexed] dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label') rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7) grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset3) {noformat} Note that the above dataset works on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF. My guess is that there is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") receives the 'prediction' column - with LR, the rawPredictionCol is a list/vector, whereas with RF, the prediction column is a double (I tried it out with a single parameter). Is it an issue with the evaluator, or is there anything else that I'm missing? was: Hi, I'm a beginner with Spark, and am trying to run grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes. I get the following error when I try passing a DataFrame of Features-Labels to CrossValidator: Py4JJavaError: An error occurred while calling o1464.evaluate. : java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType. I tried the following code, using the dataset given in Spark's CV documentation for logistic regression.
I also pass the DF through a StringIndexer transformation for the RF: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") si_model = stringIndexer.fit(dataset) dataset2 = si_model.transform(dataset) keep = [dataset2.features, dataset2.indexed] dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label') rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7) grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset3) Note that the above dataset works on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF. My guess is that there is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") receives the 'prediction' column - with LR, the rawPredictionCol is a list/vector, whereas with RF, the prediction column is a double (I tried it out with a single parameter). Is it an issue with the evaluator, or is there anything else that I'm missing? Issue with running CrossValidator with RandomForestClassifier on dataset Key: SPARK-9011 URL: https://issues.apache.org/jira/browse/SPARK-9011 Project: Spark Issue Type: Bug
[jira] [Resolved] (SPARK-9011) Issue with running CrossValidator with RandomForestClassifier on dataset
[ https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9011. -- Resolution: Invalid This is really a question, which you should ask on user@ first. Until you have identified a bug and ideally a code change, I don't think a JIRA is the right next step. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Issue with running CrossValidator with RandomForestClassifier on dataset Key: SPARK-9011 URL: https://issues.apache.org/jira/browse/SPARK-9011 Project: Spark Issue Type: Bug Components: ML, MLlib, PySpark Affects Versions: 1.4.0 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on a single node running CentOS Reporter: Shivam Verma Priority: Critical Labels: cross-validation, ml, mllib, pyspark, randomforest, tuning Hi, I'm a beginner with Spark, and am trying to run grid search on an RF classifier to classify a small dataset using the pyspark.ml.tuning module, specifically the ParamGridBuilder and CrossValidator classes. I get the following error when I try passing a DataFrame of Features-Labels to CrossValidator: {noformat} Py4JJavaError: An error occurred while calling o1464.evaluate. : java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType. {noformat} I tried the following code, using the dataset given in Spark's CV documentation for [cross validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator]. I also pass the DF through a StringIndexer transformation for the RF: {noformat} dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) stringIndexer = StringIndexer(inputCol="label", outputCol="indexed") si_model = stringIndexer.fit(dataset) dataset2 = si_model.transform(dataset) keep = [dataset2.features, dataset2.indexed] dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label') rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7) grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator) cvModel = cv.fit(dataset3) {noformat} Note that the above dataset *works* on logistic regression. I have also tried a larger dataset with sparse vectors as features (which I was originally trying to fit) but received the same error on RF. My guess is that there is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") interprets the 'rawPrediction' column - with LR, the rawPredictionCol is a list/vector, whereas with RF, the prediction column is a double. Is it an issue with the evaluator? Is there a workaround? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9012) Accumulators in the task table should be escaped
Shixiong Zhu created SPARK-9012: --- Summary: Accumulators in the task table should be escaped Key: SPARK-9012 URL: https://issues.apache.org/jira/browse/SPARK-9012 Project: Spark Issue Type: Bug Components: Web UI Reporter: Shixiong Zhu If you run the following code, the task table will be broken because accumulator names aren't escaped. {code} val a = sc.accumulator(1, "<table>") sc.parallelize(1 to 10).foreach(i => a += i) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
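The eventual fix belongs in Spark's Scala Web UI code, but the underlying principle is plain HTML escaping of user-supplied accumulator names before they are embedded in the page; a trivial illustration of the idea, not Spark's actual implementation:
{code}
import html

# An accumulator named "<table>" would otherwise be injected verbatim into
# the task table's markup and break the page layout.
name = "<table>"
print(html.escape(name))  # prints: &lt;table&gt;
{code}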
[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624598#comment-14624598 ] Patrick Baier commented on SPARK-7751: -- Sorry, wrong ticket number. Add @since to stable and experimental methods in MLlib -- Key: SPARK-7751 URL: https://issues.apache.org/jira/browse/SPARK-7751 Project: Spark Issue Type: Umbrella Components: Documentation, MLlib Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Labels: starter This is useful to check whether a feature exists in some version of Spark. This is an umbrella JIRA to track the progress. We want to have @since tags for both stable (those without any Experimental/DeveloperApi/AlphaComponent annotations) and experimental methods in MLlib: * an example PR for Scala: https://github.com/apache/spark/pull/6101 * an example PR for Python: https://github.com/apache/spark/pull/6295 We need to dig through the git commit history to figure out the Spark version in which a method was first introduced. Take `NaiveBayes.setModelType` as an example. We can grep `def setModelType` at different version git tags. {code} meng@xm:~/src/spark $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType" meng@xm:~/src/spark $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType" def setModelType(modelType: String): NaiveBayes = { {code} If there are better ways, please let us know. We cannot add all @since tags in a single PR, which would be hard to review. So we made some subtasks for each package, for example `org.apache.spark.classification`. Feel free to add more sub-tasks for Python and the `spark.ml` package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
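The per-tag grep workflow above is easy to script; a small hypothetical sketch, assuming it runs from the root of a Spark git checkout with release tags fetched:
{code}
import subprocess

path = "mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala"
needle = b"def setModelType"

# Walk release tags oldest-first and report the first one whose copy of the
# file defines the method; that tag suggests the @since version.
for tag in ["v1.2.0", "v1.3.0", "v1.4.0"]:
    try:
        src = subprocess.check_output(["git", "show", "%s:%s" % (tag, path)],
                                      stderr=subprocess.DEVNULL)
    except subprocess.CalledProcessError:
        continue  # file did not exist at this tag
    if needle in src:
        print("first introduced in", tag)
        break
{code}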
[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624501#comment-14624501 ] Sean Owen commented on SPARK-9009: -- Try a small Java program using the File object to see if you can read the file using that exact URI. I doubt this has to do with Spark; maybe the file is not readable to your process? SPARK Encryption FileNotFoundException for truststore - Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: kumar ranganathan Priority: Minor I got a FileNotFoundException in the application master when running the SparkPi example on a Windows machine. The problem is that the truststore file exists at C:\Spark\conf\spark.truststore, but I get the exception below: {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If I change the truststore file location to a different drive (d:\spark_conf\spark.truststore) then I get the exception {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception is thrown from SecurityManager.scala at the openStream() call shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore <- fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when the truststore property is removed from spark-defaults.conf. When encryption is disabled by setting spark.ssl.enabled to false, the job completes successfully. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624514#comment-14624514 ] kumar ranganathan commented on SPARK-9009: -- Yes, I tried; I could read the file using a Java program. SPARK Encryption FileNotFoundException for truststore - Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: kumar ranganathan Priority: Minor I got a FileNotFoundException in the application master when running the SparkPi example on a Windows machine. The problem is that the truststore file exists at C:\Spark\conf\spark.truststore, but I get the exception below: {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If I change the truststore file location to a different drive (d:\spark_conf\spark.truststore) then I get the exception {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception is thrown from SecurityManager.scala at the openStream() call shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore <- fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when the truststore property is removed from spark-defaults.conf. When encryption is disabled by setting spark.ssl.enabled to false, the job completes successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6851) Wrong answers for self joins of converted parquet relations
[ https://issues.apache.org/jira/browse/SPARK-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625755#comment-14625755 ] Apache Spark commented on SPARK-6851: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/7387 Wrong answers for self joins of converted parquet relations --- Key: SPARK-6851 URL: https://issues.apache.org/jira/browse/SPARK-6851 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker Fix For: 1.3.1, 1.4.0 From the user list ( /cc [~chinnitv]) When the same relation exists twice in a query plan, our new caching logic replaces both instances with identical replacements. The bug can be seen in the following transformation:
{code}
=== Applying Rule org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions ===
!Project [state#59,month#60]                                           'Project [state#105,month#106]
! Join Inner, Some(((state#69 = state#59) && (month#70 = month#60)))   'Join Inner, Some(((state#105 = state#105) && (month#106 = month#106)))
!  MetastoreRelation default, orders, None                              Subquery orders
!  Subquery ao                                                           Relation[id#97,category#98,make#99,type#100,price#101,pdate#102,customer#103,city#104,state#105,month#106] org.apache.spark.sql.parquet.ParquetRelation2
!   Distinct                                                            Subquery ao
!    Project [state#69,month#70]                                         Distinct
!     Join Inner, Some((id#81 = id#71))                                   Project [state#105,month#106]
!      MetastoreRelation default, orders, None                             Join Inner, Some((id#115 = id#97))
!      MetastoreRelation default, orderupdates, None                        Subquery orders
!                                                                            Relation[id#97,category#98,make#99,type#100,price#101,pdate#102,customer#103,city#104,state#105,month#106] org.apache.spark.sql.parquet.ParquetRelation2
!                                                                           Subquery orderupdates
!                                                                            Relation[id#115,category#116,make#117,type#118,price#119,pdate#120,customer#121,city#122,state#123,month#124] org.apache.spark.sql.parquet.ParquetRelation2
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9030) Add Kinesis.createStream unit tests that actually send data
Tathagata Das created SPARK-9030: Summary: Add Kinesis.createStream unit tests that actually send data Key: SPARK-9030 URL: https://issues.apache.org/jira/browse/SPARK-9030 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.4.1 Reporter: Tathagata Das Assignee: Tathagata Das Current Kinesis unit tests do not test createStream by sending data. This JIRA is to add such unit tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9027) Generalize predicate pushdown into the metastore
[ https://issues.apache.org/jira/browse/SPARK-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625733#comment-14625733 ] Apache Spark commented on SPARK-9027: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/7386 Generalize predicate pushdown into the metastore Key: SPARK-9027 URL: https://issues.apache.org/jira/browse/SPARK-9027 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback
[ https://issues.apache.org/jira/browse/SPARK-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9026: --- Assignee: Apache Spark (was: Josh Rosen) SimpleFutureAction.onComplete should not tie up a separate thread for each callback --- Key: SPARK-9026 URL: https://issues.apache.org/jira/browse/SPARK-9026 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Apache Spark As [~zsxwing] points out at https://github.com/apache/spark/pull/7276#issuecomment-121097747, SimpleFutureAction currently blocks a separate execution context thread for each callback registered via onComplete:
{code}
override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
  executor.execute(new Runnable {
    override def run() { func(awaitResult()) }
  })
}
{code}
We should fix this so that callbacks do not steal threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
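For context, one thread-free shape such a fix can take is to complete a single shared Promise when the job finishes and hang every callback off its Future. A minimal sketch under that assumption; this is not the actual patch from the PR referenced below:
{code}
import scala.concurrent.{ExecutionContext, Promise}
import scala.util.Try

// Sketch: instead of parking one thread per callback in awaitResult(), keep a
// single Promise that the scheduler completes once; callbacks registered via
// onComplete ride on the Promise's Future and consume no threads while waiting.
class NonBlockingAction[T] {
  private val promise = Promise[T]()

  // Invoked exactly once when the underlying job finishes.
  def jobFinished(result: Try[T]): Unit = promise.tryComplete(result)

  def onComplete[U](func: Try[T] => U)(implicit executor: ExecutionContext): Unit =
    promise.future.onComplete(func)
}
{code}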
[jira] [Created] (SPARK-9027) Generalize predicate pushdown into the metastore
Michael Armbrust created SPARK-9027: --- Summary: Generalize predicate pushdown into the metastore Key: SPARK-9027 URL: https://issues.apache.org/jira/browse/SPARK-9027 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback
[ https://issues.apache.org/jira/browse/SPARK-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9026: --- Assignee: Josh Rosen (was: Apache Spark) SimpleFutureAction.onComplete should not tie up a separate thread for each callback --- Key: SPARK-9026 URL: https://issues.apache.org/jira/browse/SPARK-9026 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen As [~zsxwing] points out at https://github.com/apache/spark/pull/7276#issuecomment-121097747, SimpleFutureAction currently blocks a separate execution context thread for each callback registered via onComplete:
{code}
override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
  executor.execute(new Runnable {
    override def run() { func(awaitResult()) }
  })
}
{code}
We should fix this so that callbacks do not steal threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback
[ https://issues.apache.org/jira/browse/SPARK-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625730#comment-14625730 ] Apache Spark commented on SPARK-9026: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/7385 SimpleFutureAction.onComplete should not tie up a separate thread for each callback --- Key: SPARK-9026 URL: https://issues.apache.org/jira/browse/SPARK-9026 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen As [~zsxwing] points out at https://github.com/apache/spark/pull/7276#issuecomment-121097747, SimpleFutureAction currently blocks a separate execution context thread for each callback registered via onComplete:
{code}
override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
  executor.execute(new Runnable {
    override def run() { func(awaitResult()) }
  })
}
{code}
We should fix this so that callbacks do not steal threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625725#comment-14625725 ] Lianhui Wang commented on SPARK-8646: - [~juliet] can you provide your spark-submit command? I think the correct command in Spark 1.4 is $SPARK_HOME/bin/spark-submit --master yarn-client outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/ is it the same as your command? PySpark does not run on YARN Key: SPARK-8646 URL: https://issues.apache.org/jira/browse/SPARK-8646 Project: Spark Issue Type: Bug Components: PySpark, YARN Affects Versions: 1.4.0 Environment: SPARK_HOME=local/path/to/spark1.4install/dir also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib Spark apps are submitted with the command: $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client data_transform contains a main method, and the rest of the args are parsed in my own code. Reporter: Juliet Hougland Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log Running pyspark jobs results in a "no module named pyspark" error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869] This does not represent a binary compatible change to Spark. Scripts that worked on previous Spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning
[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6910. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7216 [https://github.com/apache/spark/pull/7216] Support for pushing predicates down to metastore for partition pruning -- Key: SPARK-6910 URL: https://issues.apache.org/jira/browse/SPARK-6910 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Cheolsoo Park Priority: Critical Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9027) Generalize predicate pushdown into the metastore
[ https://issues.apache.org/jira/browse/SPARK-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9027: --- Assignee: Apache Spark (was: Michael Armbrust) Generalize predicate pushdown into the metastore Key: SPARK-9027 URL: https://issues.apache.org/jira/browse/SPARK-9027 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9028) Add CountVectorizer as an estimator to generate CountVectorizerModel
yuhao yang created SPARK-9028: - Summary: Add CountVectorizer as an estimator to generate CountVectorizerModel Key: SPARK-9028 URL: https://issues.apache.org/jira/browse/SPARK-9028 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9021) Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition
[ https://issues.apache.org/jira/browse/SPARK-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625789#comment-14625789 ] Apache Spark commented on SPARK-9021: - User 'njhwang' has created a pull request for this issue: https://github.com/apache/spark/pull/7378 Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition -- Key: SPARK-9021 URL: https://issues.apache.org/jira/browse/SPARK-9021 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Ubuntu 14.04 LTS Reporter: Nicholas Hwang Please see pull request for more information. I initially patched this arguably unexpected behavior by serializing zeroValue, but ended up mimicking the deepcopy approach used by other RDD methods. I also contemplated having fold/aggregate accept zero value generator functions instead of an actual object, but that obviously changes the API. Looking forward to hearing back and/or being educated on how I'm inappropriately using this functionality (relatively new to Spark and functional programming). Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9021) Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition
[ https://issues.apache.org/jira/browse/SPARK-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9021: --- Assignee: (was: Apache Spark) Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition -- Key: SPARK-9021 URL: https://issues.apache.org/jira/browse/SPARK-9021 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Ubuntu 14.04 LTS Reporter: Nicholas Hwang Please see pull request for more information. I initially patched this arguably unexpected behavior by serializing zeroValue, but ended up mimicking the deepcopy approach used by other RDD methods. I also contemplated having fold/aggregate accept zero value generator functions instead of an actual object, but that obviously changes the API. Looking forward to hearing back and/or being educated on how I'm inappropriately using this functionality (relatively new to Spark and functional programming). Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9021) Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition
[ https://issues.apache.org/jira/browse/SPARK-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9021: --- Assignee: Apache Spark Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition -- Key: SPARK-9021 URL: https://issues.apache.org/jira/browse/SPARK-9021 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.4.0 Environment: Ubuntu 14.04 LTS Reporter: Nicholas Hwang Assignee: Apache Spark Please see pull request for more information. I initially patched this arguably unexpected behavior by serializing zeroValue, but ended up mimicking the deepcopy approach used by other RDD methods. I also contemplated having fold/aggregate accept zero value generator functions instead of an actual object, but that obviously changes the API. Looking forward to hearing back and/or being educated on how I'm inappropriately using this functionality (relatively new to Spark and functional programming). Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8965) Add ml-guide Python Example: Estimator, Transformer, and Param
[ https://issues.apache.org/jira/browse/SPARK-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625831#comment-14625831 ] Arijit Saha commented on SPARK-8965: Hi Joseph, I would like to take up this task. Being a starter task, it will help me understand the flow. Thanks, Arijit. Add ml-guide Python Example: Estimator, Transformer, and Param -- Key: SPARK-8965 URL: https://issues.apache.org/jira/browse/SPARK-8965 Project: Spark Issue Type: Sub-task Components: Documentation, ML, PySpark Reporter: Joseph K. Bradley Priority: Minor Labels: starter Look at: [http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param] We need an example doing exactly the same thing, but in Python. It should be tested using the PySpark shell. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3703) Ensemble learning methods
[ https://issues.apache.org/jira/browse/SPARK-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625859#comment-14625859 ] Manoj Kumar commented on SPARK-3703: Hi, I am interested in working on ensemble methods in general (as seen from my initial few pull requests). Are any of these targeted towards the 1.5 release? I'm asking because I might not be able to commit enough time after September. Ensemble learning methods - Key: SPARK-3703 URL: https://issues.apache.org/jira/browse/SPARK-3703 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley This is a general JIRA for coordinating on adding ensemble learning methods to MLlib. These methods include a variety of boosting and bagging algorithms. Below is a general design doc for ensemble methods (currently focused on boosting). Please comment here about general discussion and coordination; for comments about specific algorithms, please comment on their respective JIRAs. [Design doc for ensemble methods | https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9028) Add CountVectorizer as an estimator to generate CountVectorizerModel
[ https://issues.apache.org/jira/browse/SPARK-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang updated SPARK-9028: -- Description: Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. Add CountVectorizer as an estimator to generate CountVectorizerModel Key: SPARK-9028 URL: https://issues.apache.org/jira/browse/SPARK-9028 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
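To make the proposal concrete, here is a rough sketch of the vocabulary-extraction step such an estimator's fit() could perform; the name and shape are illustrative assumptions, not the eventual Spark API:
{code}
// Illustrative only: pick the vocabSize most frequent terms across all documents.
def buildVocabulary(docs: Seq[Seq[String]], vocabSize: Int): Array[String] =
  docs.flatten
    .groupBy(identity)   // term -> all occurrences
    .mapValues(_.size)   // term -> frequency
    .toSeq
    .sortBy(-_._2)       // most frequent first
    .take(vocabSize)
    .map(_._1)
    .toArray
{code}
A CountVectorizerModel built from this vocabulary would then map each document to a vector of term counts indexed by vocabulary position.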
[jira] [Assigned] (SPARK-9029) shortcut CaseKeyWhen if key is null
[ https://issues.apache.org/jira/browse/SPARK-9029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9029: --- Assignee: Apache Spark shortcut CaseKeyWhen if key is null --- Key: SPARK-9029 URL: https://issues.apache.org/jira/browse/SPARK-9029 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9029) shortcut CaseKeyWhen if key is null
[ https://issues.apache.org/jira/browse/SPARK-9029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625778#comment-14625778 ] Apache Spark commented on SPARK-9029: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/7389 shortcut CaseKeyWhen if key is null --- Key: SPARK-9029 URL: https://issues.apache.org/jira/browse/SPARK-9029 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9029) shortcut CaseKeyWhen if key is null
[ https://issues.apache.org/jira/browse/SPARK-9029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9029: --- Assignee: (was: Apache Spark) shortcut CaseKeyWhen if key is null --- Key: SPARK-9029 URL: https://issues.apache.org/jira/browse/SPARK-9029 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1403. Resolution: Fixed Target Version/s: (was: 1.5.0) Hey All, This issue should remain fixed. [~mandoskippy] I think you are just running into a different issue that is also in some way related to classloading. Can you open a new JIRA for your issue, paste in the stack trace and give as much information as possible without the environment? Thanks! Spark on Mesos does not set Thread's context class loader - Key: SPARK-1403 URL: https://issues.apache.org/jira/browse/SPARK-1403 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.3.0, 1.4.0 Environment: ubuntu 12.04 on vagrant Reporter: Bharath Bhushan Priority: Blocker Fix For: 1.0.0 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark executor on mesos slave throws a java.lang.ClassNotFoundException for org.apache.spark.serializer.JavaSerializer. The lengthy discussion is here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-1403) Spark on Mesos does not set Thread's context class loader
[ https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625739#comment-14625739 ] Patrick Wendell edited comment on SPARK-1403 at 7/14/15 2:59 AM: - Hey All, This issue should remain fixed. [~mandoskippy] I think you are just running into a different issue that is also in some way related to classloading. Can you open a new JIRA for your issue, paste in the stack trace and give as much information as possible about the environment? Thanks! was (Author: pwendell): Hey All, This issue should remain fixed. [~mandoskippy] I think you are just running into a different issue that is also in some way related to classloading. Can you open a new JIRA for your issue, paste in the stack trace and give as much information as possible without the environment? Thanks! Spark on Mesos does not set Thread's context class loader - Key: SPARK-1403 URL: https://issues.apache.org/jira/browse/SPARK-1403 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0, 1.3.0, 1.4.0 Environment: ubuntu 12.04 on vagrant Reporter: Bharath Bhushan Priority: Blocker Fix For: 1.0.0 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark executor on mesos slave throws a java.lang.ClassNotFoundException for org.apache.spark.serializer.JavaSerializer. The lengthy discussion is here: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9028) Add CountVectorizer as an estimator to generate CountVectorizerModel
[ https://issues.apache.org/jira/browse/SPARK-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9028: --- Assignee: Apache Spark Add CountVectorizer as an estimator to generate CountVectorizerModel Key: SPARK-9028 URL: https://issues.apache.org/jira/browse/SPARK-9028 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9029) shortcut CaseKeyWhen if key is null
Wenchen Fan created SPARK-9029: -- Summary: shortcut CaseKeyWhen if key is null Key: SPARK-9029 URL: https://issues.apache.org/jira/browse/SPARK-9029 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9013) generate MutableProjection directly instead of returning a function
Wenchen Fan created SPARK-9013: -- Summary: generate MutableProjection directly instead of returning a function Key: SPARK-9013 URL: https://issues.apache.org/jira/browse/SPARK-9013 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9013) generate MutableProjection directly instead of returning a function
[ https://issues.apache.org/jira/browse/SPARK-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9013: --- Assignee: (was: Apache Spark) generate MutableProjection directly instead of returning a function Key: SPARK-9013 URL: https://issues.apache.org/jira/browse/SPARK-9013 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622041#comment-14622041 ] Walter Petersen edited comment on SPARK-3155 at 7/13/15 12:57 PM: -- Hi all, I'm new out there. Please tell me: - Is the proposed implementation based on a well-known research paper ? If so, which one ? - Is this issue still relevant ? Is someone currently implementing the feature ? Thanks was (Author: petersen): Hi all, I'm new out there. Please tell me: - Is the proposed implementation based on a well-known research paper ? If so, which one ? - Is is issue still relevant ? Is someone currently implementing the feature ? Thanks Support DecisionTree pruning Key: SPARK-3155 URL: https://issues.apache.org/jira/browse/SPARK-3155 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Improvement: accuracy, computation Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyways. DecisionTree does not currently support pruning. Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed. A naive implementation prunes as follows: (1) Train a depth K tree using a training set. (2) Compute the optimal prediction at each node (including internal nodes) based on the training set. (3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2).) (4) For each pair of leafs with the same parent, compare the total error on the validation set made by the leafs’ predictions with the error made by the parent’s predictions. Remove the leafs if the parent has lower error. A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained. Whenever two children increase the validation error, they are pruned, and no more training is required on that branch. It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
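As a concrete illustration of step (4) applied bottom-up, here is a small sketch; the Node type and its per-node validation error are illustrative assumptions, not MLlib's internal tree representation:
{code}
// Illustrative tree node: errorAsLeaf is the validation error this node would
// make if it predicted directly (computed as in steps (2)-(3) above).
case class Node(errorAsLeaf: Double, left: Option[Node], right: Option[Node])

// Bottom-up pruning: returns the pruned subtree and its validation error.
// A node keeps its children only if they strictly reduce validation error.
def prune(node: Node): (Node, Double) = (node.left, node.right) match {
  case (Some(l), Some(r)) =>
    val (pl, el) = prune(l)
    val (pr, er) = prune(r)
    val childErr = el + er
    if (node.errorAsLeaf <= childErr)
      (node.copy(left = None, right = None), node.errorAsLeaf) // collapse to leaf
    else
      (node.copy(left = Some(pl), right = Some(pr)), childErr)
  case _ =>
    (node, node.errorAsLeaf) // already a leaf
}
{code}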
[jira] [Assigned] (SPARK-9013) generate MutableProjection directly instead of returning a function
[ https://issues.apache.org/jira/browse/SPARK-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9013: --- Assignee: Apache Spark generate MutableProjection directly instead of returning a function Key: SPARK-9013 URL: https://issues.apache.org/jira/browse/SPARK-9013 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7549) Support aggregating over nested fields
[ https://issues.apache.org/jira/browse/SPARK-7549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625704#comment-14625704 ] Chen Song commented on SPARK-7549: -- I prefer the former. I thought about using explode; it's a good way to implement the nested aggregations. But I want to take advantage of codegen by implementing these directly. Support aggregating over nested fields -- Key: SPARK-7549 URL: https://issues.apache.org/jira/browse/SPARK-7549 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Would be nice to be able to run sum, avg, min, max (and other numeric aggregate expressions) on arrays. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
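For reference, the explode-based workaround mentioned above looks roughly like this; a minimal sketch assuming a DataFrame df with an array column named values, and that functions.explode is available (as in recent Spark versions):
{code}
import org.apache.spark.sql.functions._

// Flatten the array column, then apply ordinary numeric aggregates to the
// flattened elements; this stands in until sum/avg/min/max work on arrays.
val flattened = df.select(explode(col("values")).as("v"))
flattened.agg(sum(col("v")), avg(col("v")), min(col("v")), max(col("v"))).show()
{code}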
[jira] [Updated] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed
[ https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7126: - Target Version/s: (was: 1.5.0) For spark.ml Classifiers, automatically index labels if they are not yet indexed Key: SPARK-7126 URL: https://issues.apache.org/jira/browse/SPARK-7126 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Now that we have StringIndexer, we could have spark.ml.classification.Classifier (the abstraction) automatically handle label indexing if the labels are not yet indexed. This would require a bit of design:
* Should predict() output the original labels or the indices?
* How should we notify users that the labels are being automatically indexed?
* How should we provide that index to the users?
* If multiple parts of a Pipeline automatically index labels, what do we need to do to make sure they are consistent?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed
[ https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625708#comment-14625708 ] Joseph K. Bradley commented on SPARK-7126: -- I agree we should emulate scikit-learn. I've spoken with [~mengxr], who strongly supports having transform() maintain the current semantics of using 0-based label indices. This means that, to solve this JIRA, we will need to add a new method analogous to fit() which returns a PipelineModel rather than a specific model (like LogisticRegressionModel). That PipelineModel can include indexing and de-indexing labels, and perhaps other transformations as well. This addition to the API will require some significant design, which we hope to do before long...but maybe not for 1.5. I'll remove that target version. For spark.ml Classifiers, automatically index labels if they are not yet indexed Key: SPARK-7126 URL: https://issues.apache.org/jira/browse/SPARK-7126 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Now that we have StringIndexer, we could have spark.ml.classification.Classifier (the abstraction) automatically handle label indexing if the labels are not yet indexed. This would require a bit of design:
* Should predict() output the original labels or the indices?
* How should we notify users that the labels are being automatically indexed?
* How should we provide that index to the users?
* If multiple parts of a Pipeline automatically index labels, what do we need to do to make sure they are consistent?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
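Until that lands, the manual pattern is to index labels explicitly in a Pipeline. A minimal sketch using the existing spark.ml API; trainingDf (with a string category column and a features vector column) is an assumed input:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StringIndexer

// Index string labels to 0-based doubles before the classifier, since
// Classifier does not do this automatically today.
val indexer = new StringIndexer().setInputCol("category").setOutputCol("label")
val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label")
val model = new Pipeline().setStages(Array(indexer, lr)).fit(trainingDf)
{code}
The reverse direction, mapping predicted indices back to the original labels, is part of what the proposed fit()-like method would have to automate.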
[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625710#comment-14625710 ] Joseph K. Bradley commented on SPARK-6884: -- Once [SPARK-7131] is merged, we can extend trees (and then forests) to provide class probabilities. I'd watch that JIRA to get pinged when it's merged. Thanks! Random forest: predict class probabilities -- Key: SPARK-6884 URL: https://issues.apache.org/jira/browse/SPARK-6884 Project: Spark Issue Type: Sub-task Components: ML Reporter: Max Kaznady Labels: prediction, probability, randomforest, tree Original Estimate: 72h Remaining Estimate: 72h Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from individual trees and adding up their votes for 1 and then dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
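The vote-counting idea in the description amounts to the following toy sketch (not Spark's API; each tree's predicted class index is assumed given):
{code}
// Turn per-tree class votes into class probabilities: P(c) = votes(c) / numTrees.
def classProbabilities(treePredictions: Seq[Int], numClasses: Int): Array[Double] = {
  val counts = new Array[Double](numClasses)
  treePredictions.foreach(c => counts(c) += 1.0)
  counts.map(_ / treePredictions.size)
}

// e.g. 5 trees voting (1, 1, 0, 1, 0) over 2 classes => Array(0.4, 0.6)
{code}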
[jira] [Commented] (SPARK-8998) Collect enough frequent prefixes before projection in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625716#comment-14625716 ] Apache Spark commented on SPARK-8998: - User 'zhangjiajin' has created a pull request for this issue: https://github.com/apache/spark/pull/7383 Collect enough frequent prefixes before projection in PrefixSpan Key: SPARK-8998 URL: https://issues.apache.org/jira/browse/SPARK-8998 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Zhang JiaJin Original Estimate: 48h Remaining Estimate: 48h The implementation in SPARK-6487 might have scalability issues when the number of frequent items is very small. In this case, we can generate candidate sets of higher orders using Apriori-like algorithms and count them, until we collect enough prefixes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
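To illustrate the Apriori-like expansion described above: grow candidate prefixes one frequent item at a time and keep those with enough support. A toy sketch over simple (non-itemset) sequences; PrefixSpan's actual implementation differs:
{code}
// True if `pattern` is a subsequence of `seq` (order preserved, gaps allowed).
def isSubsequence(pattern: List[Int], seq: List[Int]): Boolean = pattern match {
  case Nil => true
  case h :: t => seq.dropWhile(_ != h) match {
    case Nil => false
    case _ :: rest => isSubsequence(t, rest)
  }
}

// One Apriori-style round: extend each frequent length-k prefix by each
// frequent item, keeping candidates whose support reaches minCount.
def expandPrefixes(freqPrefixes: Set[List[Int]], freqItems: Set[Int],
                   db: Seq[List[Int]], minCount: Int): Set[List[Int]] =
  for {
    p <- freqPrefixes
    item <- freqItems
    cand = p :+ item
    if db.count(seq => isSubsequence(cand, seq)) >= minCount
  } yield cand
{code}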
[jira] [Assigned] (SPARK-8998) Collect enough frequent prefixes before projection in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8998: --- Assignee: Zhang JiaJin (was: Apache Spark) Collect enough frequent prefixes before projection in PrefixSpan Key: SPARK-8998 URL: https://issues.apache.org/jira/browse/SPARK-8998 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Zhang JiaJin Original Estimate: 48h Remaining Estimate: 48h The implementation in SPARK-6487 might have scalability issues when the number of frequent items is very small. In this case, we can generate candidate sets of higher orders using Apriori-like algorithms and count them, until we collect enough prefixes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8998) Collect enough frequent prefixes before projection in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8998: --- Assignee: Apache Spark (was: Zhang JiaJin) Collect enough frequent prefixes before projection in PrefixSpan Key: SPARK-8998 URL: https://issues.apache.org/jira/browse/SPARK-8998 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Apache Spark Original Estimate: 48h Remaining Estimate: 48h The implementation in SPARK-6487 might have scalability issues when the number of frequent items is very small. In this case, we can generate candidate sets of higher orders using Apriori-like algorithms and count them, until we collect enough prefixes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback
Josh Rosen created SPARK-9026: - Summary: SimpleFutureAction.onComplete should not tie up a separate thread for each callback Key: SPARK-9026 URL: https://issues.apache.org/jira/browse/SPARK-9026 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen As [~zsxwing] points out at https://github.com/apache/spark/pull/7276#issuecomment-121097747, SimpleFutureAction currently blocks a separate execution context thread for each callback registered via onComplete:
{code}
override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
  executor.execute(new Runnable {
    override def run() { func(awaitResult()) }
  })
}
{code}
We should fix this so that callbacks do not steal threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback
[ https://issues.apache.org/jira/browse/SPARK-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-9026: - Assignee: Josh Rosen SimpleFutureAction.onComplete should not tie up a separate thread for each callback --- Key: SPARK-9026 URL: https://issues.apache.org/jira/browse/SPARK-9026 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen As [~zsxwing] points out at https://github.com/apache/spark/pull/7276#issuecomment-121097747, SimpleFutureAction currently blocks a separate execution context thread for each callback registered via onComplete:
{code}
override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
  executor.execute(new Runnable {
    override def run() { func(awaitResult()) }
  })
}
{code}
We should fix this so that callbacks do not steal threads. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9015) Maven cleanup / Clean Project Import in scala-ide
[ https://issues.apache.org/jira/browse/SPARK-9015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jan Prach updated SPARK-9015: - Comment: was deleted (was: PR #7375) Maven cleanup / Clean Project Import in scala-ide - Key: SPARK-9015 URL: https://issues.apache.org/jira/browse/SPARK-9015 Project: Spark Issue Type: Improvement Components: Build Reporter: Jan Prach Clean up Maven for a clean import in scala-ide / Eclipse. The outstanding PR contains things like removal of the Groovy plugin, and some more Maven cleanup goes here. In order to make it a seamless experience, two more things have to be merged upstream: 1) have the IDE automatically generate Java sources from IDL - https://issues.apache.org/jira/browse/AVRO-1671 2) set the Scala version in the IDE based on the Maven config - https://github.com/sonatype/m2eclipse-scala/issues/30 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6319) DISTINCT doesn't work for binary type
[ https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6319: -- Priority: Critical (was: Major) DISTINCT doesn't work for binary type - Key: SPARK-6319 URL: https://issues.apache.org/jira/browse/SPARK-6319 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 Reporter: Cheng Lian Priority: Critical Spark shell session for reproduction:
{noformat}
scala> import sqlContext.implicits._
scala> import org.apache.spark.sql.types._
scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
...
CAST(c, BinaryType)
[B@43f13160
[B@5018b648
[B@3be22500
[B@476fc8a1
{noformat}
Spark SQL uses plain byte arrays to represent binary values. However, arrays are compared by reference rather than by value. On the other hand, the DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check for duplicated values. These two facts together cause the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
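The reference-vs-value comparison the description points at can be seen with plain byte arrays; a minimal sketch:
{code}
// JVM arrays inherit Object's equals/hashCode, so equal contents do not make
// equal HashSet keys -- exactly why DISTINCT sees four "distinct" values above.
val a = "1".getBytes("UTF-8")
val b = "1".getBytes("UTF-8")
println(a == b)             // false: arrays compare by reference
println(a.sameElements(b))  // true: element-wise comparison
println(Set(a).contains(b)) // false: HashSet relies on equals/hashCode
{code}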
[jira] [Commented] (SPARK-6319) DISTINCT doesn't work for binary type
[ https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625045#comment-14625045 ] Josh Rosen commented on SPARK-6319: --- I think that we should revisit this issue. It seems that we currently return wrong answers for groupBy queries involving binary typed columns. If we're not going to support this properly, then I think we should fail-fast with an analysis error rather than returning an incorrect answer. DISTINCT doesn't work for binary type - Key: SPARK-6319 URL: https://issues.apache.org/jira/browse/SPARK-6319 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 Reporter: Cheng Lian Spark shell session for reproduction:
{noformat}
scala> import sqlContext.implicits._
scala> import org.apache.spark.sql.types._
scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
...
CAST(c, BinaryType)
[B@43f13160
[B@5018b648
[B@3be22500
[B@476fc8a1
{noformat}
Spark SQL uses plain byte arrays to represent binary values. However, arrays are compared by reference rather than by value. On the other hand, the DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check for duplicated values. These two facts together cause the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8907) Speed up path construction in DynamicPartitionWriterContainer.outputWriterForRow
[ https://issues.apache.org/jira/browse/SPARK-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625150#comment-14625150 ] Ilya Ganelin commented on SPARK-8907: - [~rxin] The code for this in master has eliminated usage of zip and map as of [SPARK-8961|https://github.com/apache/spark/commit/33630883685eafcc3ee4521ea8363be342f6e6b4]. Do you think this can be further optimized and if so, how? There doesn't seem to be much within the existing catalyst expressions that would facilitate this, but I could be wrong. The relevant code fragment is below:
{code}
val partitionPath = {
  val partitionPathBuilder = new StringBuilder
  var i = 0
  while (i < partitionColumns.length) {
    val col = partitionColumns(i)
    val partitionValueString = {
      val string = row.getString(i)
      if (string.eq(null)) defaultPartitionName else PartitioningUtils.escapePathName(string)
    }
    if (i > 0) {
      partitionPathBuilder.append(Path.SEPARATOR_CHAR)
    }
    partitionPathBuilder.append(s"$col=$partitionValueString")
    i += 1
  }
  partitionPathBuilder.toString()
}
{code}
Speed up path construction in DynamicPartitionWriterContainer.outputWriterForRow Key: SPARK-8907 URL: https://issues.apache.org/jira/browse/SPARK-8907 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Don't use zip and Scala collection methods, to avoid garbage collection:
{code}
val partitionPath = partitionColumns.zip(row.toSeq).map { case (col, rawValue) =>
  val string = if (rawValue == null) null else String.valueOf(rawValue)
  val valueString = if (string == null || string.isEmpty) {
    defaultPartitionName
  } else {
    PartitioningUtils.escapePathName(string)
  }
  s"/$col=$valueString"
}.mkString.stripPrefix(Path.SEPARATOR)
{code}
We can probably use catalyst expressions themselves to construct the path, and then we can leverage code generation to do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4362) Make prediction probability available in NaiveBayesModel
[ https://issues.apache.org/jira/browse/SPARK-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625157#comment-14625157 ] Apache Spark commented on SPARK-4362: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/7376 Make prediction probability available in NaiveBayesModel Key: SPARK-4362 URL: https://issues.apache.org/jira/browse/SPARK-4362 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jatinpreet Singh Priority: Minor Labels: naive-bayes There is currently no way to get the posterior probability of a prediction with the Naive Bayes model during prediction. This should be made available along with the label. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
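The requested posterior can be derived from the model's log-priors and log-conditionals via Bayes' rule. A rough sketch with illustrative names (pi, theta), not the model's actual fields:
{code}
// P(c | x) is proportional to exp(pi(c) + theta(c) . x), normalized over classes;
// the max log-probability is subtracted before exponentiating for stability.
def posterior(pi: Array[Double], theta: Array[Array[Double]], x: Array[Double]): Array[Double] = {
  val logProbs = pi.indices.map { c =>
    pi(c) + theta(c).zip(x).map { case (t, xi) => t * xi }.sum
  }
  val maxLog = logProbs.max
  val unnorm = logProbs.map(lp => math.exp(lp - maxLog))
  val total = unnorm.sum
  unnorm.map(_ / total).toArray
}
{code}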
[jira] [Resolved] (SPARK-8954) Building Docker Images Fails in 1.4 branch
[ https://issues.apache.org/jira/browse/SPARK-8954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-8954. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7346 [https://github.com/apache/spark/pull/7346] Building Docker Images Fails in 1.4 branch -- Key: SPARK-8954 URL: https://issues.apache.org/jira/browse/SPARK-8954 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Environment: Docker Reporter: Pradeep Bashyal Fix For: 1.5.0 Docker build on branch 1.4 fails when installing the JDK. It expects tzdata-java as a dependency, but adding that to the apt-get install list doesn't help.
{noformat}
~/S/s/d/spark-test git:branch-1.4 ❯❯❯ docker build -t spark-test-base base/
Sending build context to Docker daemon 3.072 kB
Sending build context to Docker daemon
Step 0 : FROM ubuntu:precise
 ---> 78cef618c77e
Step 1 : RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" >> /etc/apt/sources.list
 ---> Using cache
 ---> 2017472bec85
Step 2 : RUN apt-get update
 ---> Using cache
 ---> 86b8911ead16
Step 3 : RUN apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server
 ---> Running in dc8197a0ea31
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation:
The following packages have unmet dependencies: openjdk-7-jre-headless : Depends: tzdata-java but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
INFO[0004] The command [/bin/sh -c apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server] returned a non-zero code: 100
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8991) Update SharedParamsCodeGen's Generated Documentation
[ https://issues.apache.org/jira/browse/SPARK-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-8991. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7367 [https://github.com/apache/spark/pull/7367] Update SharedParamsCodeGen's Generated Documentation Key: SPARK-8991 URL: https://issues.apache.org/jira/browse/SPARK-8991 Project: Spark Issue Type: Improvement Components: ML Reporter: Feynman Liang Priority: Trivial Labels: Starter Fix For: 1.5.0 We no longer need the {{(private[ml])}} prefix in the generated documentation. Specifically, the [generated documentation in SharedParamsCodeGen|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala#L137] should be modified from
{code}
|/**
| * (private[ml]) Trait for shared param $name$defaultValueDoc.
| */
{code}
to
{code}
|/**
| * Trait for shared param $name$defaultValueDoc.
| */
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9017) More timers for MLlib algorithms
Xiangrui Meng created SPARK-9017: Summary: More timers for MLlib algorithms Key: SPARK-9017 URL: https://issues.apache.org/jira/browse/SPARK-9017 Project: Spark Issue Type: Umbrella Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng It is useful to provide more instrumentation for MLlib algorithms, like training time for each stage in k-means. This is an umbrella JIRA for adding more timers to MLlib algorithms. The first PR would be a generic timer utility based on the one used in trees. Then we can distribute the work. It is also helpful for contributors to understand the code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9018) Implement a generic Timer utility for ML algorithms
Xiangrui Meng created SPARK-9018: Summary: Implement a generic Timer utility for ML algorithms Key: SPARK-9018 URL: https://issues.apache.org/jira/browse/SPARK-9018 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng The Timer utility should be based on the one implemented in trees. In particular, we should offer two versions:
1. a global timer that is initialized on the driver and uses an accumulator to aggregate time
2. a local timer that is initialized on the worker and only provides per-task measurement
1) needs some performance benchmarking and guidance on the granularity. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
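A minimal sketch of the "global timer" variant, assuming the pre-2.0 accumulator API (sc.accumulator with a name); the class and method names are illustrative, not the proposed utility itself:
{code}
import org.apache.spark.SparkContext

// Global timer: created on the driver; tasks fold elapsed nanoseconds into a
// named accumulator, which the driver reads after the job completes.
class DriverTimer(sc: SparkContext, name: String) extends Serializable {
  private val acc = sc.accumulator(0L, name)

  def time[T](body: => T): T = {
    val start = System.nanoTime()
    try body
    finally acc += System.nanoTime() - start
  }

  def totalNanos: Long = acc.value // driver-side only
}
{code}
Usage would look like rdd.map(x => timer.time(expensiveStep(x))); the per-record timing overhead is exactly the granularity question raised in the description.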
[jira] [Updated] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-9005: - Shepherd: Joseph K. Bradley Assignee: Feynman Liang RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Assignee: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals), where the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change this to be consistent with the cited definition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
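To spell out the discrepancy, a brief restatement of the two quantities, with residuals e_i = y_i - yhat_i:
{code}
% Residual sum of squares vs. variance of the residuals:
\mathrm{SS}_{res} = \sum_{i=1}^{n} e_i^2, \qquad
\operatorname{Var}(e) = \frac{1}{n} \sum_{i=1}^{n} (e_i - \bar{e})^2 .
% Since (1/n) SS_res = Var(e) + \bar{e}^2, the two agree (up to the 1/n factor)
% only when \bar{e} = 0, i.e. the predictor is unbiased. The cited definition uses
R^2 = 1 - \frac{\mathrm{SS}_{res}}{\mathrm{SS}_{tot}}, \qquad
\mathrm{SS}_{tot} = \sum_{i=1}^{n} (y_i - \bar{y})^2 ,
% so substituting Var(e) for SS_res silently drops the bias term \bar{e}^2.
{code}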
[jira] [Updated] (SPARK-8954) Building Docker Images Fails in 1.4 branch
[ https://issues.apache.org/jira/browse/SPARK-8954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-8954: -- Assignee: Yong Tang Building Docker Images Fails in 1.4 branch -- Key: SPARK-8954 URL: https://issues.apache.org/jira/browse/SPARK-8954 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.4.0 Environment: Docker Reporter: Pradeep Bashyal Assignee: Yong Tang Fix For: 1.5.0 Docker build on branch 1.4 fails when installing the JDK. It expects tzdata-java as a dependency, but adding that to the apt-get install list doesn't help.
{noformat}
~/S/s/d/spark-test git:branch-1.4 ❯❯❯ docker build -t spark-test-base base/
Sending build context to Docker daemon 3.072 kB
Sending build context to Docker daemon
Step 0 : FROM ubuntu:precise
 ---> 78cef618c77e
Step 1 : RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" >> /etc/apt/sources.list
 ---> Using cache
 ---> 2017472bec85
Step 2 : RUN apt-get update
 ---> Using cache
 ---> 86b8911ead16
Step 3 : RUN apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server
 ---> Running in dc8197a0ea31
Reading package lists...
Building dependency tree...
Reading state information...
Some packages could not be installed. This may mean that you have requested an impossible situation or if you are using the unstable distribution that some required packages have not yet been created or been moved out of Incoming. The following information may help to resolve the situation:
The following packages have unmet dependencies: openjdk-7-jre-headless : Depends: tzdata-java but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
INFO[0004] The command [/bin/sh -c apt-get install -y less openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server] returned a non-zero code: 100
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8991) Update SharedParamsCodeGen's Generated Documentation
[ https://issues.apache.org/jira/browse/SPARK-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-8991: - Assignee: Vinod KC Update SharedParamsCodeGen's Generated Documentation Key: SPARK-8991 URL: https://issues.apache.org/jira/browse/SPARK-8991 Project: Spark Issue Type: Improvement Components: ML Reporter: Feynman Liang Assignee: Vinod KC Priority: Trivial Labels: Starter Fix For: 1.5.0 We no longer need the {{(private[ml])}} prefix in the generated documentation. Specifically, the [generated documentation in SharedParamsCodeGen|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala#L137] should be modified from
{code}
|/**
| * (private[ml]) Trait for shared param $name$defaultValueDoc.
| */
{code}
to
{code}
|/**
| * Trait for shared param $name$defaultValueDoc.
| */
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8838) Add config to enable/disable merging part-files when merging parquet schema
[ https://issues.apache.org/jira/browse/SPARK-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-8838:
------------------------------------
    Shepherd: Cheng Lian

Add config to enable/disable merging part-files when merging parquet schema
----------------------------------------------------------------------------

                Key: SPARK-8838
                URL: https://issues.apache.org/jira/browse/SPARK-8838
            Project: Spark
         Issue Type: Improvement
         Components: SQL
           Reporter: Liang-Chi Hsieh

Currently, all part-files are opened when merging a Parquet schema. However, when there are many part-files and we can be sure that they all share the schema of their summary file, reading every part-file is unnecessary. We should provide a configuration to disable merging part-files when merging the Parquet schema.
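A usage sketch of how such a flag might be toggled from user code; the config key below is made up for illustration, since the ticket does not name one:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetSchemaMergeToggle {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical key: the ticket proposes the flag but does not fix its name.
    sqlContext.setConf("spark.sql.parquet.mergePartFiles.enabled", "false")

    // With merging disabled, schema resolution would trust the summary file
    // (_metadata) rather than reading every part-file footer.
    sqlContext.read.parquet("/path/to/table").printSchema()

    sc.stop()
  }
}
{code}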
[jira] [Commented] (SPARK-6319) DISTINCT doesn't work for binary type
[ https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625082#comment-14625082 ]

Michael Armbrust commented on SPARK-6319:
-----------------------------------------
+1 to throwing an {{AnalysisException}}

DISTINCT doesn't work for binary type
-------------------------------------

                Key: SPARK-6319
                URL: https://issues.apache.org/jira/browse/SPARK-6319
            Project: Spark
         Issue Type: Bug
         Components: SQL
   Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
           Reporter: Cheng Lian
           Priority: Critical

Spark shell session for reproduction:

{noformat}
scala> import sqlContext.implicits._
scala> import org.apache.spark.sql.types._
scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
...
CAST(c, BinaryType)
[B@43f13160
[B@5018b648
[B@3be22500
[B@476fc8a1
{noformat}

Spark SQL uses plain byte arrays to represent binary values. However, arrays are compared by reference rather than by value. On the other hand, the DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check for duplicates. These two facts together cause the problem.
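The reference-equality behavior is easy to reproduce outside Spark; a minimal, self-contained Scala sketch (not Spark code):

{code}
import scala.collection.mutable

object ByteArrayEquality {
  def main(args: Array[String]): Unit = {
    val a: Array[Byte] = "1".getBytes("UTF-8")
    val b: Array[Byte] = "1".getBytes("UTF-8")

    println(a == b)            // false: arrays compare by reference
    println(a.sameElements(b)) // true: the contents are identical

    // A HashSet keyed on arrays therefore fails to detect duplicates,
    // which is exactly what breaks DISTINCT for BinaryType.
    val seen = mutable.HashSet(a)
    println(seen.contains(b))  // false
  }
}
{code}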
[jira] [Commented] (SPARK-8646) PySpark does not run on YARN
[ https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625087#comment-14625087 ]

Marcelo Vanzin commented on SPARK-8646:
---------------------------------------
[~j_houg] could you also run the command with the SPARK_PRINT_LAUNCH_COMMAND=1 env variable set, and post the command logged to stderr?

PySpark does not run on YARN
----------------------------

                Key: SPARK-8646
                URL: https://issues.apache.org/jira/browse/SPARK-8646
            Project: Spark
         Issue Type: Bug
         Components: PySpark, YARN
   Affects Versions: 1.4.0
        Environment: SPARK_HOME=local/path/to/spark1.4install/dir
                     also with SPARK_HOME=local/path/to/spark1.4install/dir PYTHONPATH=$SPARK_HOME/python/lib
                     Spark apps are submitted with the command:
                     $SPARK_HOME/bin/spark-submit outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
                     data_transform contains a main method, and the rest of the args are parsed in my own code.
           Reporter: Juliet Hougland
        Attachments: executor.log, pi-test.log, spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log

Running PySpark jobs results in a "no module named pyspark" error when run in yarn-client mode in Spark 1.4. [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869] This is not a binary-compatible change to Spark: scripts that worked on previous Spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions.
[jira] [Resolved] (SPARK-8950) Correct the calculation of SchedulerDelayTime in StagePage
[ https://issues.apache.org/jira/browse/SPARK-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-8950.
-----------------------------------
    Resolution: Fixed
      Assignee: Carson Wang
 Fix Version/s: 1.5.0

Correct the calculation of SchedulerDelayTime in StagePage
----------------------------------------------------------

                Key: SPARK-8950
                URL: https://issues.apache.org/jira/browse/SPARK-8950
            Project: Spark
         Issue Type: Bug
         Components: Web UI
   Affects Versions: 1.4.0
           Reporter: Carson Wang
           Assignee: Carson Wang
           Priority: Minor
            Fix For: 1.5.0

In StagePage, the scheduler delay is calculated as totalExecutionTime - executorRunTime - executorOverhead - gettingResultTime. But totalExecutionTime is calculated in a way that does not include gettingResultTime, so subtracting it again understates the scheduler delay.
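To see why the reported value is off, a small arithmetic sketch in Scala; the millisecond values and variable names are made up for illustration:

{code}
object SchedulerDelaySketch {
  def main(args: Array[String]): Unit = {
    val executorRunTime    = 80L // ms
    val executorOverhead   = 10L // ms
    val gettingResultTime  =  5L // ms
    val trueSchedulerDelay =  5L // ms: the value the UI should show

    // totalExecutionTime as currently computed, i.e. *without* gettingResultTime:
    val totalExecutionTime = trueSchedulerDelay + executorRunTime + executorOverhead

    // StagePage then subtracts gettingResultTime anyway, undercounting the delay:
    val shown = totalExecutionTime - executorRunTime - executorOverhead - gettingResultTime
    println(s"shown = $shown ms, expected = $trueSchedulerDelay ms") // shown = 0 ms
  }
}
{code}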
[jira] [Created] (SPARK-9016) Make the random forest classifiers implement classification trait
holdenk created SPARK-9016:
---------------------------

            Summary: Make the random forest classifiers implement classification trait
                Key: SPARK-9016
                URL: https://issues.apache.org/jira/browse/SPARK-9016
            Project: Spark
         Issue Type: Improvement
         Components: ML
           Reporter: holdenk
           Priority: Minor

This is a blocking issue for https://issues.apache.org/jira/browse/SPARK-8069. Since we want to add thresholding/cutoff support to RandomForest, and we wish to do this in a general way, we should move RandomForest over to the Classification trait.
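A shape-only Scala sketch of why a shared trait helps here; the trait, method names, and the thresholding rule are all illustrative, not the actual spark.ml API:

{code}
// Illustrative only: spark.ml's real Classifier hierarchy differs.
trait ClassifierSketch {
  def numClasses: Int
  def predictRaw(features: Array[Double]): Array[Double]

  // Once every classifier exposes raw per-class scores, thresholding can be
  // implemented once in the trait (one possible rule: scale each class's
  // score by its threshold and take the argmax) and inherited by
  // RandomForest rather than reimplemented per algorithm.
  def predictWithThresholds(features: Array[Double], thresholds: Array[Double]): Int = {
    val raw = predictRaw(features)
    (0 until numClasses).maxBy(i => raw(i) / thresholds(i))
  }
}
{code}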