[jira] [Created] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`

2015-07-13 Thread StanZhai (JIRA)
StanZhai created SPARK-9010:
---

 Summary: Improve the Spark Configuration document about 
`spark.kryoserializer.buffer`
 Key: SPARK-9010
 URL: https://issues.apache.org/jira/browse/SPARK-9010
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: StanZhai
Priority: Minor


The meaning of spark.kryoserializer.buffer should be: "Initial size of Kryo's 
serialization buffer. Note that there will be one buffer per core on each 
worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed."

The spark.kryoserializer.buffer.max.mb setting is out of date in Spark 1.4.
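
For reference, a minimal PySpark sketch (not from the issue) of how the two related settings fit together in Spark 1.4, where sizes carry units instead of the deprecated *.mb keys; the values shown are simply the documented defaults, not a recommendation:

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-buffer-example")
        # Use Kryo and give each per-core serialization buffer an initial size...
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer", "64k")
        # ...and an upper bound it may grow to if a record is too large.
        .set("spark.kryoserializer.buffer.max", "64m"))

sc = SparkContext(conf=conf)
{code}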



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch

2015-07-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8941.
--
Resolution: Duplicate

 Standalone cluster worker does not accept multiple masters on launch
 

 Key: SPARK-8941
 URL: https://issues.apache.org/jira/browse/SPARK-8941
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Documentation
Affects Versions: 1.4.0, 1.4.1
Reporter: Jesper Lundgren
Priority: Critical

 Before 1.4 it was possible to launch a worker node using a comma-separated 
 list of master nodes, for example:
 sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078
 starting org.apache.spark.deploy.worker.Worker, logging to 
 /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out
 failed to launch org.apache.spark.deploy.worker.Worker:
  Default is conf/spark-defaults.conf.
   15/07/09 12:33:06 INFO Utils: Shutdown hook called
 Spark 1.2 and 1.3.1 accept multiple masters in this format.
 Update: in 1.4, start-slave.sh only expects a master list (no instance number).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF

2015-07-13 Thread Shivam Verma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624483#comment-14624483
 ] 

Shivam Verma commented on SPARK-9011:
-

Thanks Sean, 
I did some more experiments. It really is a bug, because 
pyspark.ml.tuning.CrossValidator seems to accept the outputs of only certain 
classifiers. So it is a question of making a design choice: either ensuring 
consistency across classifier outputs in Spark.ML or making the 
BinaryClassificationEvaluator generic enough.
I have appropriately modified the description above and I am reopening the 
issue.


 Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search 
 working on LR but not on RF
 

 Key: SPARK-9011
 URL: https://issues.apache.org/jira/browse/SPARK-9011
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, PySpark
Affects Versions: 1.4.0
 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single 
 node running CentOS
Reporter: Shivam Verma
Priority: Critical
  Labels: cross-validation, ml, mllib, pyspark, randomforest, 
 tuning

 Hi,
 I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF 
 (Random Forest) classifier to classify a small dataset using the 
 pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR 
 (Logistic Regression) but not on RF)
 Bug:
 There is an issue with how BinaryClassificationEvaluator(self, 
 rawPredictionCol="rawPrediction", labelCol="label", 
 metricName="areaUnderROC") interprets the 'rawPredict' column - with LR, the 
 rawPredictionCol is expected to contain vectors, whereas with RF, the 
 prediction column contains doubles. 
 Suggested Resolution: Either enable BinaryClassificationEvaluator to work 
 with doubles, or let RF output a column "rawPredictions" containing the 
 probability vectors (with probability 1 assigned to the predicted label, and 
 0 assigned to the rest).
 Detailed Observation:
 While running grid search on an RF classifier to classify a small dataset 
 using the pyspark.ml.tuning module (specifically the ParamGridBuilder and 
 CrossValidator classes), I get the following error when I try passing a 
 DataFrame of features and labels to CrossValidator:
 {noformat}
 Py4JJavaError: An error occurred while calling o1464.evaluate.
 : java.lang.IllegalArgumentException: requirement failed: Column 
 rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef 
 but was actually DoubleType.
 {noformat}
 I tried the following code, using the dataset given in Spark's CV 
 documentation for [cross 
 validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
  I also pass the DF through a StringIndexer transformation for the RF:
  
 {noformat}
 dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 
 0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 
 0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 
 10,["features", "label"])
 stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
 si_model = stringIndexer.fit(dataset)
 dataset2 = si_model.transform(dataset)
 keep = [dataset2.features, dataset2.indexed]
 dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
 rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features",
  numTrees=5, maxDepth=7)
 grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
 evaluator = BinaryClassificationEvaluator()
 cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, 
 evaluator=evaluator)
 cvModel = cv.fit(dataset3)
 {noformat}
 Note that the above dataset *works* on logistic regression. I have also tried 
 a larger dataset with sparse vectors as features (which I was originally 
 trying to fit) but received the same error on RF.
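
For comparison, a minimal sketch (not part of the original report) of the logistic-regression variant the reporter says does work; it assumes the same dataset3 DataFrame built above and only swaps the estimator:

{noformat}
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# LogisticRegression leaves rawPredictionCol at its default ("rawPrediction"),
# which holds vectors, so BinaryClassificationEvaluator accepts it.
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)  # succeeds, unlike the RandomForestClassifier run
{noformat}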



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread kumar ranganathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624484#comment-14624484
 ] 

kumar ranganathan commented on SPARK-9009:
--

Yes, all of this is on a single machine only. The file exists in the specified 
location for sure. I just tried prefixing with file:/ but I get the exception 
below on the command line itself. 

{code}
15/07/13 15:52:32 ERROR SecurityManager: Uncaught exception:
java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
{code}

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got a FileNotFoundException in the application master when running the 
 SparkPi example on a Windows machine.
 The problem is that the truststore file exists at the 
 C:\Spark\conf\spark.truststore location, but I get the exception below:
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.<init>(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If I change the truststore file location to a different drive 
 (d:\spark_conf\spark.truststore), then I get this exception:
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception is thrown from SecurityManager.scala at the openStream() call 
 shown below:
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore <- fileServerSSLOptions.trustStore) yield {
 val input = 
 Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
 try {
 {code}
 The same problem occurs for the keystore file when the truststore property is 
 removed from spark-defaults.conf.
 When encryption is disabled by setting spark.ssl.enabled to false, the job 
 completes successfully. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9012) Accumulators in the task table should be escaped

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624572#comment-14624572
 ] 

Apache Spark commented on SPARK-9012:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/7369

 Accumulators in the task table should be escaped
 

 Key: SPARK-9012
 URL: https://issues.apache.org/jira/browse/SPARK-9012
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu
 Attachments: Screen Shot 2015-07-13 at 8.02.44 PM.png


 If you run the following code, the task table will be broken because 
 accumulators aren't escaped.
 {code}
 val a = sc.accumulator(1, "<table>")
 sc.parallelize(1 to 10).foreach(i => a += i)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9012) Accumulators in the task table should be escaped

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9012:
---

Assignee: (was: Apache Spark)

 Accumulators in the task table should be escaped
 

 Key: SPARK-9012
 URL: https://issues.apache.org/jira/browse/SPARK-9012
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu
 Attachments: Screen Shot 2015-07-13 at 8.02.44 PM.png


 If you run the following code, the task table will be broken because 
 accumulators aren't escaped.
 {code}
 val a = sc.accumulator(1, "<table>")
 sc.parallelize(1 to 10).foreach(i => a += i)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9012) Accumulators in the task table should be escaped

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9012:
---

Assignee: Apache Spark

 Accumulators in the task table should be escaped
 

 Key: SPARK-9012
 URL: https://issues.apache.org/jira/browse/SPARK-9012
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu
Assignee: Apache Spark
 Attachments: Screen Shot 2015-07-13 at 8.02.44 PM.png


 If you run the following code, the task table will be broken because 
 accumulators aren't escaped.
 {code}
 val a = sc.accumulator(1, "<table>")
 sc.parallelize(1 to 10).foreach(i => a += i)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7751) Add @since to stable and experimental methods in MLlib

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7751:
---

Assignee: Apache Spark  (was: Xiangrui Meng)

 Add @since to stable and experimental methods in MLlib
 --

 Key: SPARK-7751
 URL: https://issues.apache.org/jira/browse/SPARK-7751
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark
Priority: Minor
  Labels: starter

 This is useful to check whether a feature exists in some version of Spark. 
 This is an umbrella JIRA to track the progress. We want to have @since tag 
 for both stable (those without any Experimental/DeveloperApi/AlphaComponent 
 annotations) and experimental methods in MLlib:
 * an example PR for Scala: https://github.com/apache/spark/pull/6101
 * an example PR for Python: https://github.com/apache/spark/pull/6295
 We need to dig through the git commit history to figure out the Spark 
 version in which a method was first introduced. Take `NaiveBayes.setModelType` 
 as an example. We can grep for `def setModelType` at different version tags.
 {code}
 meng@xm:~/src/spark
 $ git show 
 v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
  | grep "def setModelType"
 meng@xm:~/src/spark
 $ git show 
 v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
  | grep "def setModelType"
   def setModelType(modelType: String): NaiveBayes = {
 {code}
 If there are better ways, please let us know.
 We cannot add all @since tags in a single PR, which is hard to review. So we 
 made some subtasks for each package, for example 
 `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
 and the `spark.ml` package.
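
As a rough illustration (an editor's sketch, not from the JIRA), the grep check described above could also be scripted; the helper below walks release tags and reports the first one whose copy of the file contains the definition. The tag list and pattern are just examples:

{code}
import subprocess

def first_tag_with(pattern, path, tags):
    """Return the first tag whose version of `path` contains `pattern`."""
    for tag in tags:
        try:
            src = subprocess.check_output(["git", "show", "%s:%s" % (tag, path)])
        except subprocess.CalledProcessError:
            continue  # the file does not exist at this tag
        if pattern in src.decode("utf-8", "replace"):
            return tag
    return None

path = "mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala"
print(first_tag_with("def setModelType", path,
                     ["v1.2.0", "v1.3.0", "v1.4.0"]))  # -> v1.4.0 per the example above
{code}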



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7751) Add @since to stable and experimental methods in MLlib

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7751:
---

Assignee: Xiangrui Meng  (was: Apache Spark)

 Add @since to stable and experimental methods in MLlib
 --

 Key: SPARK-7751
 URL: https://issues.apache.org/jira/browse/SPARK-7751
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
  Labels: starter

 This is useful to check whether a feature exists in some version of Spark. 
 This is an umbrella JIRA to track the progress. We want to have @since tag 
 for both stable (those without any Experimental/DeveloperApi/AlphaComponent 
 annotations) and experimental methods in MLlib:
 * an example PR for Scala: https://github.com/apache/spark/pull/6101
 * an example PR for Python: https://github.com/apache/spark/pull/6295
 We need to dig through the git commit history to figure out the Spark 
 version in which a method was first introduced. Take `NaiveBayes.setModelType` 
 as an example. We can grep for `def setModelType` at different version tags.
 {code}
 meng@xm:~/src/spark
 $ git show 
 v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
  | grep "def setModelType"
 meng@xm:~/src/spark
 $ git show 
 v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
  | grep "def setModelType"
   def setModelType(modelType: String): NaiveBayes = {
 {code}
 If there are better ways, please let us know.
 We cannot add all @since tags in a single PR, which is hard to review. So we 
 made some subtasks for each package, for example 
 `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
 and the `spark.ml` package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624590#comment-14624590
 ] 

Apache Spark commented on SPARK-7751:
-

User 'petz2000' has created a pull request for this issue:
https://github.com/apache/spark/pull/7370

 Add @since to stable and experimental methods in MLlib
 --

 Key: SPARK-7751
 URL: https://issues.apache.org/jira/browse/SPARK-7751
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
  Labels: starter

 This is useful to check whether a feature exists in some version of Spark. 
 This is an umbrella JIRA to track the progress. We want to have @since tag 
 for both stable (those without any Experimental/DeveloperApi/AlphaComponent 
 annotations) and experimental methods in MLlib:
 * an example PR for Scala: https://github.com/apache/spark/pull/6101
 * an example PR for Python: https://github.com/apache/spark/pull/6295
 We need to dig through the git commit history to figure out the Spark 
 version in which a method was first introduced. Take `NaiveBayes.setModelType` 
 as an example. We can grep for `def setModelType` at different version tags.
 {code}
 meng@xm:~/src/spark
 $ git show 
 v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
  | grep "def setModelType"
 meng@xm:~/src/spark
 $ git show 
 v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
  | grep "def setModelType"
   def setModelType(modelType: String): NaiveBayes = {
 {code}
 If there are better ways, please let us know.
 We cannot add all @since tags in a single PR, which is hard to review. So we 
 made some subtasks for each package, for example 
 `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
 and the `spark.ml` package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread kumar ranganathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624484#comment-14624484
 ] 

kumar ranganathan edited comment on SPARK-9009 at 7/13/15 10:27 AM:


Yes, all of this is on a single machine only. The file exists in the specified 
location for sure. I just tried prefixing with file:/ but I get the exception 
below on the command line itself. 

{code}
Exception in thread "main" java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.sc
{code}

D: is meant for keeping the truststore file on a different disk (not on C:). 


was (Author: kumar):
Yes, all this in a single machine only. The file exist in the specified 
location for sure. I just tried prefixing with file:/ but getting below 
exception in the command line itself. 

{code}
Exception in thread "main" java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.sc
{code}

D: is meant for keeping truststore file in different disk (not in C:) 

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got a FileNotFoundException in the application master when running the 
 SparkPi example on a Windows machine.
 The problem is that the truststore file exists at the 
 C:\Spark\conf\spark.truststore location, but I get the exception below:
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.<init>(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If I change the truststore file location to a different drive 
 (d:\spark_conf\spark.truststore), then I get this exception:
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception is thrown from SecurityManager.scala at the openStream() call 
 shown below:
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore - 

[jira] [Updated] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF

2015-07-13 Thread Shivam Verma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivam Verma updated SPARK-9011:

Description: 
Hi,

I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF 
(Random Forest) classifier to classify a small dataset using the 
pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR 
(Logistic Regression) but not on RF)

Bug:
There is an issue with how BinaryClassificationEvaluator(self, 
rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") 
interprets the 'rawPredict' column - with LR, the rawPredictionCol is expected 
to contain vectors, whereas with RF, the prediction column contains doubles. 

Suggested Resolution: Either enable BinaryClassificationEvaluator to work with 
doubles, or let RF output a column "rawPredictions" containing the probability 
vectors (with probability 1 assigned to the predicted label, and 0 assigned to 
the rest).

Detailed Observation:
While running grid search on an RF classifier to classify a small dataset using 
the pyspark.ml.tuning module (specifically the ParamGridBuilder and 
CrossValidator classes), I get the following error when I try passing a 
DataFrame of features and labels to CrossValidator:
{noformat}
Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually 
DoubleType.
{noformat}
I tried the following code, using the dataset given in Spark's CV documentation 
for [cross 
validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
 I also pass the DF through a StringIndexer transformation for the RF:
 
{noformat}
dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 
0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 
0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 10,["features", 
"label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features",
 numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)
{noformat}

Note that the above dataset *works* on logistic regression. I have also tried a 
larger dataset with sparse vectors as features (which I was originally trying 
to fit) but received the same error on RF.


  was:
Hi

I'm a beginner to Spark, and am trying to run grid search on an RF classifier 
to classify a small dataset using the pyspark.ml.tuning module, specifically 
the ParamGridBuilder and CrossValidator classes. I get the following error when 
I try passing a DataFrame of Features-Labels to CrossValidator:
{noformat}
Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually 
DoubleType.
{noformat}
I tried the following code, using the dataset given in Spark's CV documentation 
for [cross 
validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
 I also pass the DF through a StringIndexer transformation for the RF:
 
{noformat}
dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 
0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 
0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 10,["features", 
"label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features",
 numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)
{noformat}

Note that the above dataset *works* on logistic regression. I have also tried a 
larger dataset with sparse vectors as features (which I was originally trying 
to fit) but received the same error on RF.

My guess is that there is an issue with how BinaryClassificationEvaluator(self, 
rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") 
interprets the 'rawPredict' column - with LR, the rawPredictionCol is a 
list/vector, whereas with RF, the prediction column is a double.

[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624487#comment-14624487
 ] 

Sean Owen commented on SPARK-9009:
--

Try {{file:///C:/Spark/conf/...}} and don't use backslashes.
I'm saying that the exception for D: says the device isn't ready, which has 
nothing to do with Spark.
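
For reference only, a hypothetical PySpark fragment (not from this thread) showing the spark.ssl.* keys set with a forward-slash Windows path; whether this resolves the reporter's FileNotFoundException is untested here, so treat it as a sketch rather than a verified fix:

{code}
from pyspark import SparkConf

# Forward slashes are accepted in Java file paths on Windows, which avoids the
# backslash problems noted above; the password value is just a placeholder.
conf = (SparkConf()
        .set("spark.ssl.enabled", "true")
        .set("spark.ssl.trustStore", "C:/Spark/conf/spark.truststore")
        .set("spark.ssl.trustStorePassword", "changeit"))
{code}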

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got a FileNotFoundException in the application master when running the 
 SparkPi example on a Windows machine.
 The problem is that the truststore file exists at the 
 C:\Spark\conf\spark.truststore location, but I get the exception below:
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.<init>(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If I change the truststore file location to a different drive 
 (d:\spark_conf\spark.truststore), then I get this exception:
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception is thrown from SecurityManager.scala at the openStream() call 
 shown below:
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore <- fileServerSSLOptions.trustStore) yield {
 val input = 
 Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
 try {
 {code}
 The same problem occurs for the keystore file when the truststore property is 
 removed from spark-defaults.conf.
 When encryption is disabled by setting spark.ssl.enabled to false, the job 
 completes successfully. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF

2015-07-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9011:
-
Priority: Minor  (was: Critical)

 Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search 
 working on LR but not on RF
 

 Key: SPARK-9011
 URL: https://issues.apache.org/jira/browse/SPARK-9011
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, PySpark
Affects Versions: 1.4.0
 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single 
 node running CentOS
Reporter: Shivam Verma
Priority: Minor
  Labels: cross-validation, ml, mllib, pyspark, randomforest, 
 tuning

 Hi,
 I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF 
 (Random Forest) classifier to classify a small dataset using the 
 pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR 
 (Logistic Regression) but not on RF)
 Bug:
 There is an issue with how BinaryClassificationEvaluator(self, 
 rawPredictionCol="rawPrediction", labelCol="label", 
 metricName="areaUnderROC") interprets the 'rawPredict' column - with LR, the 
 rawPredictionCol is expected to contain vectors, whereas with RF, the 
 prediction column contains doubles. 
 Suggested Resolution: Either enable BinaryClassificationEvaluator to work 
 with doubles, or let RF output a column "rawPredictions" containing the 
 probability vectors (with probability 1 assigned to the predicted label, and 
 0 assigned to the rest).
 Detailed Observation:
 While running grid search on an RF classifier to classify a small dataset 
 using the pyspark.ml.tuning module (specifically the ParamGridBuilder and 
 CrossValidator classes), I get the following error when I try passing a 
 DataFrame of features and labels to CrossValidator:
 {noformat}
 Py4JJavaError: An error occurred while calling o1464.evaluate.
 : java.lang.IllegalArgumentException: requirement failed: Column 
 rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef 
 but was actually DoubleType.
 {noformat}
 I tried the following code, using the dataset given in Spark's CV 
 documentation for [cross 
 validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
  I also pass the DF through a StringIndexer transformation for the RF:
  
 {noformat}
 dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 
 0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 
 0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 
 10,["features", "label"])
 stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
 si_model = stringIndexer.fit(dataset)
 dataset2 = si_model.transform(dataset)
 keep = [dataset2.features, dataset2.indexed]
 dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
 rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features",
  numTrees=5, maxDepth=7)
 grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
 evaluator = BinaryClassificationEvaluator()
 cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, 
 evaluator=evaluator)
 cvModel = cv.fit(dataset3)
 {noformat}
 Note that the above dataset *works* on logistic regression. I have also tried 
 a larger dataset with sparse vectors as features (which I was originally 
 trying to fit) but received the same error on RF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9012) Accumulators in the task table should be escaped

2015-07-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-9012:

Attachment: (was: screenshot-1.png)

 Accumulators in the task table should be escaped
 

 Key: SPARK-9012
 URL: https://issues.apache.org/jira/browse/SPARK-9012
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu
 Attachments: Screen Shot 2015-07-13 at 8.02.44 PM.png


 If you run the following code, the task table will be broken because 
 accumulators aren't escaped.
 {code}
 val a = sc.accumulator(1, "<table>")
 sc.parallelize(1 to 10).foreach(i => a += i)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9012) Accumulators in the task table should be escaped

2015-07-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-9012:

Attachment: Screen Shot 2015-07-13 at 8.02.44 PM.png

 Accumulators in the task table should be escaped
 

 Key: SPARK-9012
 URL: https://issues.apache.org/jira/browse/SPARK-9012
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu
 Attachments: Screen Shot 2015-07-13 at 8.02.44 PM.png


 If you run the following code, the task table will be broken because 
 accumulators aren't escaped.
 {code}
 val a = sc.accumulator(1, "<table>")
 sc.parallelize(1 to 10).foreach(i => a += i)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9012) Accumulators in the task table should be escaped

2015-07-13 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-9012:

Attachment: screenshot-1.png

 Accumulators in the task table should be escaped
 

 Key: SPARK-9012
 URL: https://issues.apache.org/jira/browse/SPARK-9012
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu
 Attachments: screenshot-1.png


 If you run the following code, the task table will be broken because 
 accumulators aren't escaped.
 {code}
 val a = sc.accumulator(1, "<table>")
 sc.parallelize(1 to 10).foreach(i => a += i)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8915) Add @since tags to mllib.classification

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624594#comment-14624594
 ] 

Apache Spark commented on SPARK-8915:
-

User 'petz2000' has created a pull request for this issue:
https://github.com/apache/spark/pull/7371

 Add @since tags to mllib.classification
 ---

 Key: SPARK-8915
 URL: https://issues.apache.org/jira/browse/SPARK-8915
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
   Original Estimate: 1h
  Remaining Estimate: 1h





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8915) Add @since tags to mllib.classification

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8915:
---

Assignee: Apache Spark

 Add @since tags to mllib.classification
 ---

 Key: SPARK-8915
 URL: https://issues.apache.org/jira/browse/SPARK-8915
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Apache Spark
Priority: Minor
  Labels: starter
   Original Estimate: 1h
  Remaining Estimate: 1h





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8915) Add @since tags to mllib.classification

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8915:
---

Assignee: (was: Apache Spark)

 Add @since tags to mllib.classification
 ---

 Key: SPARK-8915
 URL: https://issues.apache.org/jira/browse/SPARK-8915
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Priority: Minor
  Labels: starter
   Original Estimate: 1h
  Remaining Estimate: 1h





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9010:
---

Assignee: Apache Spark

 Improve the Spark Configuration document about `spark.kryoserializer.buffer`
 

 Key: SPARK-9010
 URL: https://issues.apache.org/jira/browse/SPARK-9010
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.0
Reporter: StanZhai
Assignee: Apache Spark
Priority: Minor
  Labels: documentation

 The meaning of spark.kryoserializer.buffer should be: "Initial size of Kryo's 
 serialization buffer. Note that there will be one buffer per core on each 
 worker. This buffer will grow up to spark.kryoserializer.buffer.max if 
 needed."
 The spark.kryoserializer.buffer.max.mb setting is out of date in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9010:
---

Assignee: (was: Apache Spark)

 Improve the Spark Configuration document about `spark.kryoserializer.buffer`
 

 Key: SPARK-9010
 URL: https://issues.apache.org/jira/browse/SPARK-9010
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.0
Reporter: StanZhai
Priority: Minor
  Labels: documentation

 The meaning of spark.kryoserializer.buffer should be: "Initial size of Kryo's 
 serialization buffer. Note that there will be one buffer per core on each 
 worker. This buffer will grow up to spark.kryoserializer.buffer.max if 
 needed."
 The spark.kryoserializer.buffer.max.mb setting is out of date in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624381#comment-14624381
 ] 

Apache Spark commented on SPARK-9010:
-

User 'stanzhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/7368

 Improve the Spark Configuration document about `spark.kryoserializer.buffer`
 

 Key: SPARK-9010
 URL: https://issues.apache.org/jira/browse/SPARK-9010
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.0
Reporter: StanZhai
Priority: Minor
  Labels: documentation

 The meaning of spark.kryoserializer.buffer should be: "Initial size of Kryo's 
 serialization buffer. Note that there will be one buffer per core on each 
 worker. This buffer will grow up to spark.kryoserializer.buffer.max if 
 needed."
 The spark.kryoserializer.buffer.max.mb setting is out of date in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9011) Issue with running CrossValidator with RandomForestClassifier on dataset

2015-07-13 Thread Shivam Verma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivam Verma updated SPARK-9011:

Description: 
Hi

I'm a beginner to Spark, and am trying to run grid search on an RF classifier 
to classify a small dataset using the pyspark.ml.tuning module, specifically 
the ParamGridBuilder and CrossValidator classes. I get the following error when 
I try passing a DataFrame of Features-Labels to CrossValidator:
{noformat}
Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually 
DoubleType.
{noformat}
I tried the following code, using the dataset given in Spark's CV documentation 
for [cross 
validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
 I also pass the DF through a StringIndexer transformation for the RF:
 
{noformat}
dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 
0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 
0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 10,["features", 
"label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features",
 numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)
{noformat}

Note that the above dataset *works* on logistic regression. I have also tried a 
larger dataset with sparse vectors as features (which I was originally trying 
to fit) but received the same error on RF.

My guess is that there is an issue with how BinaryClassificationEvaluator(self, 
rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") 
interprets the 'rawPredict' column - with LR, the rawPredictionCol is a 
list/vector, whereas with RF, the prediction column is a double. 

Is it an issue with the evaluator? Is there a workaround?


  was:
Hi

I'm a beginner to Spark, and am trying to run grid search on an RF classifier 
to classify a small dataset using the pyspark.ml.tuning module, specifically 
the ParamGridBuilder and CrossValidator classes. I get the following error when 
I try passing a DataFrame of Features-Labels to CrossValidator:
{noformat}
Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually 
DoubleType.
{noformat}
I tried the following code, using the dataset given in Spark's CV documentation 
for logistic regression. I also pass the DF through a StringIndexer 
transformation for the RF: 
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator
 
{noformat}
dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 
0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 
0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 10,["features", 
"label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features",
 numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)
{noformat}
Note that the above dataset works on logistic regression. I have also tried a 
larger dataset with sparse vectors as features (which I was originally trying 
to fit) but received the same error on RF.

My guess is that there is an issue with how BinaryClassificationEvaluator(self, 
rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") 
receives the 'predict' column - with LR, the rawPredictionCol is a list/vector, 
whereas with RF, the prediction column is a double (I tried it out with a 
single parameter). Is it an issue with the evaluator, or is there anything else 
that I'm missing?


 Issue with running CrossValidator with RandomForestClassifier on dataset
 

 Key: SPARK-9011
 URL: https://issues.apache.org/jira/browse/SPARK-9011
 Project: Spark
  Issue Type: Bug
  

[jira] [Created] (SPARK-9011) Issue with running CrossValidator with RandomForestClassifier on dataset

2015-07-13 Thread Shivam Verma (JIRA)
Shivam Verma created SPARK-9011:
---

 Summary: Issue with running CrossValidator with 
RandomForestClassifier on dataset
 Key: SPARK-9011
 URL: https://issues.apache.org/jira/browse/SPARK-9011
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, PySpark
Affects Versions: 1.4.0
 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single 
node running CentOS
Reporter: Shivam Verma
Priority: Critical


Hi

I'm a beginner to Spark, and am trying to run grid search on an RF classifier 
to classify a small dataset using the pyspark.ml.tuning module, specifically 
the ParamGridBuilder and CrossValidator classes. I get the following error when 
I try passing a DataFrame of Features-Labels to CrossValidator:

Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually 
DoubleType.

I tried the following code, using the dataset given in Spark's CV documentation 
for logistic regression. I also pass the DF through a StringIndexer 
transformation for the RF: 
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator
 

dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 
0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 
0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 10,["features", 
"label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features",
 numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)

Note that the above dataset works on logistic regression. I have also tried a 
larger dataset with sparse vectors as features (which I was originally trying 
to fit) but received the same error on RF.

My guess is that there is an issue with how BinaryClassificationEvaluator(self, 
rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") 
receives the 'predict' column - with LR, the rawPredictionCol is a list/vector, 
whereas with RF, the prediction column is a double (I tried it out with a 
single parameter). Is it an issue with the evaluator, or is there anything else 
that I'm missing?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9008) Stop and remove driver from supervised mode in spark-master interface

2015-07-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9008:
-
   Priority: Minor  (was: Major)
Component/s: Deploy

Can you not just kill -9 the driver process?
You can propose a doc change if that would help.

Have a look at:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 Stop and remove driver from supervised mode in spark-master interface
 -

 Key: SPARK-9008
 URL: https://issues.apache.org/jira/browse/SPARK-9008
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Reporter: Jesper Lundgren
Priority: Minor

 The cluster will automatically restart failing drivers when launched in 
 supervised cluster mode. However, there is no official way for an operations 
 team to stop a driver and keep it from restarting in case it is 
 malfunctioning. 
 I know there is "bin/spark-class org.apache.spark.deploy.Client kill", but 
 this is undocumented and does not always work so well.
 It would be great if there were a way to remove supervised mode so that kill 
 -9 would work on a driver program.
 The documentation surrounding this could also use some improvements. It would 
 be nice to have some best-practice examples of how to work with supervised 
 mode, how to manage graceful shutdown, and how to catch TERM signals. (A TERM 
 signal will end with an exit code that triggers a restart in supervised mode 
 unless you change the exit code in the application logic.)
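
As an illustration of that last point (an editor's sketch, not from the issue): a driver can trap TERM and exit with code 0 so that, per the reporter's description, the clean stop is not treated as a failure to be restarted under supervised mode. The app name and the idea of calling sys.exit(0) from the handler are assumptions, not an official recipe:

{code}
import signal
import sys

from pyspark import SparkContext

sc = SparkContext(appName="graceful-shutdown-example")

def handle_term(signum, frame):
    # Stop Spark cleanly and exit 0; a non-zero exit code is what a
    # supervised master treats as a failure worth restarting.
    sc.stop()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_term)

# ... normal driver work happens here ...
{code}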



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF

2015-07-13 Thread Shivam Verma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624485#comment-14624485
 ] 

Shivam Verma edited comment on SPARK-9011 at 7/13/15 10:24 AM:
---

Thanks Sean,

I did some more experiments. It really is a bug, because pyspark.ml.tuning.CrossValidator seems to accept the outputs of only certain classifiers. So it comes down to a design choice: either ensure consistency across classifier outputs in Spark.ML, or make BinaryClassificationEvaluator generic enough to handle them all.
I have modified the description above accordingly and am reopening the issue.


was (Author: shivamverma):
I did some more experiments. It is really a bug because 
pyspark.ml.tuning.CrossValidator seems to accept outputs of only certain 
classifiers. So it is the question of making a design choice: either ensuring 
consistency across classifier outputs in Spark.ML or making the 
BinaryClassificationEvaluator generic enough.
I have appropriately modified the description above and I am reopening the 
issue.

 Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search 
 working on LR but not on RF
 

 Key: SPARK-9011
 URL: https://issues.apache.org/jira/browse/SPARK-9011
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, PySpark
Affects Versions: 1.4.0
 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single 
 node running CentOS
Reporter: Shivam Verma
Priority: Critical
  Labels: cross-validation, ml, mllib, pyspark, randomforest, 
 tuning

 Hi,
 I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF 
 (Random Forest) classifier to classify a small dataset using the 
 pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR 
 (Logistic Regression) but not on RF)
 Bug:
 There is an issue with how BinaryClassificationEvaluator(self, 
 rawPredictionCol=rawPrediction, labelCol=label, 
 metricName=areaUnderROC) interprets the 'rawPredict' column - with LR, the 
 rawPredictionCol is expected to contain vectors, whereas with RF, the 
 prediction column contains doubles. 
 Suggested Resolution: Either enable BinaryClassificationEvaluator to work 
 with doubles, or let RF output a column rawPredictions containing the 
 probability vectors (with probability of 1 assigned to predicted label, and 0 
 assigned to the rest).
 Detailed Observation:
 While running grid search on an RF classifier to classify a small dataset 
 using the pyspark.ml.tuning module, specifically the ParamGridBuilder and 
 CrossValidator classes. I get the following error when I try passing a 
 DataFrame of Features-Labels to CrossValidator:
 {noformat}
 Py4JJavaError: An error occurred while calling o1464.evaluate.
 : java.lang.IllegalArgumentException: requirement failed: Column 
 rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef 
 but was actually DoubleType.
 {noformat}
 I tried the following code, using the dataset given in Spark's CV 
 documentation for [cross 
 validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
  I also pass the DF through a StringIndexer transformation for the RF:
  
 {noformat}
 dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10,
     ["features", "label"])
 stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
 si_model = stringIndexer.fit(dataset)
 dataset2 = si_model.transform(dataset)
 keep = [dataset2.features, dataset2.indexed]
 dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
 rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7)
 grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
 evaluator = BinaryClassificationEvaluator()
 cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
 cvModel = cv.fit(dataset3)
 {noformat}
 Note that the above dataset *works* on logistic regression. I have also tried 
 a larger dataset with sparse vectors as features (which I was originally 
 trying to fit) but received the same error on RF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF

2015-07-13 Thread Shivam Verma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivam Verma reopened SPARK-9011:
-

I did some more experiments. It really is a bug, because pyspark.ml.tuning.CrossValidator seems to accept the outputs of only certain classifiers. So it comes down to a design choice: either ensure consistency across classifier outputs in Spark.ML, or make BinaryClassificationEvaluator generic enough to handle them all.
I have modified the description above accordingly and am reopening the issue.

 Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search 
 working on LR but not on RF
 

 Key: SPARK-9011
 URL: https://issues.apache.org/jira/browse/SPARK-9011
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, PySpark
Affects Versions: 1.4.0
 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single 
 node running CentOS
Reporter: Shivam Verma
Priority: Critical
  Labels: cross-validation, ml, mllib, pyspark, randomforest, 
 tuning

 Hi,
 I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF 
 (Random Forest) classifier to classify a small dataset using the 
 pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR 
 (Logistic Regression) but not on RF)
 Bug:
 There is an issue with how BinaryClassificationEvaluator(self, 
 rawPredictionCol=rawPrediction, labelCol=label, 
 metricName=areaUnderROC) interprets the 'rawPredict' column - with LR, the 
 rawPredictionCol is expected to contain vectors, whereas with RF, the 
 prediction column contains doubles. 
 Suggested Resolution: Either enable BinaryClassificationEvaluator to work 
 with doubles, or let RF output a column rawPredictions containing the 
 probability vectors (with probability of 1 assigned to predicted label, and 0 
 assigned to the rest).
 Detailed Observation:
 While running grid search on an RF classifier to classify a small dataset 
 using the pyspark.ml.tuning module, specifically the ParamGridBuilder and 
 CrossValidator classes. I get the following error when I try passing a 
 DataFrame of Features-Labels to CrossValidator:
 {noformat}
 Py4JJavaError: An error occurred while calling o1464.evaluate.
 : java.lang.IllegalArgumentException: requirement failed: Column 
 rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef 
 but was actually DoubleType.
 {noformat}
 I tried the following code, using the dataset given in Spark's CV 
 documentation for [cross 
 validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
  I also pass the DF through a StringIndexer transformation for the RF:
  
 {noformat}
 dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10,
     ["features", "label"])
 stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
 si_model = stringIndexer.fit(dataset)
 dataset2 = si_model.transform(dataset)
 keep = [dataset2.features, dataset2.indexed]
 dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
 rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7)
 grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
 evaluator = BinaryClassificationEvaluator()
 cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
 cvModel = cv.fit(dataset3)
 {noformat}
 Note that the above dataset *works* on logistic regression. I have also tried 
 a larger dataset with sparse vectors as features (which I was originally 
 trying to fit) but received the same error on RF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread kumar ranganathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624484#comment-14624484
 ] 

kumar ranganathan edited comment on SPARK-9009 at 7/13/15 10:24 AM:


Yes, all this in a single machine only. The file exist in the specified 
location for sure. I just tried prefixing with file:/ but getting below 
exception in the command line itself. 

{code}
Exception in thread "main" java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
    at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
    at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
{code}


was (Author: kumar):
Yes, all this in a single machine only. The file exist in the specified 
location for sure. I just tried prefixing with file:/ but getting below 
exception in the command line itself. 

{code}
15/07/13 15:52:32 ERROR SecurityManager: Uncaught exception:
java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
    at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
{code}

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got FileNotFoundException in the application master when running the 
 SparkPi example in windows machine.
 The problem is that the truststore file found in 
 C:\Spark\conf\spark.truststore location but getting below exception as
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.init(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.init(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If i change the truststore file location to different drive 
 (d:\spark_conf\spark.truststore) then getting exception as
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception throws from SecurityManager.scala at the line of openstream() 
 shown below
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore - fileServerSSLOptions.trustStore) yield {
 val input = 
 Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
 try {
 {code}
 The same problem occurs for the keystore file when 

[jira] [Issue Comment Deleted] (SPARK-9011) Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search working on LR but not on RF

2015-07-13 Thread Shivam Verma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivam Verma updated SPARK-9011:

Comment: was deleted

(was: Thanks Sean, 
I did some more experiments. It is really a bug because 
pyspark.ml.tuning.CrossValidator seems to accept outputs of only certain 
classifiers. So it is the question of making a design choice: either ensuring 
consistency across classifier outputs in Spark.ML or making the 
BinaryClassificationEvaluator generic enough.
I have appropriately modified the description above and I am reopening the 
issue.
)

 Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent -- Grid search 
 working on LR but not on RF
 

 Key: SPARK-9011
 URL: https://issues.apache.org/jira/browse/SPARK-9011
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, PySpark
Affects Versions: 1.4.0
 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single 
 node running CentOS
Reporter: Shivam Verma
Priority: Critical
  Labels: cross-validation, ml, mllib, pyspark, randomforest, 
 tuning

 Hi,
 I ran into this bug while using pyspark.ml.tuning.CrossValidator on an RF 
 (Random Forest) classifier to classify a small dataset using the 
 pyspark.ml.tuning module. (This is a bug because CrossValidator works on LR 
 (Logistic Regression) but not on RF)
 Bug:
 There is an issue with how BinaryClassificationEvaluator(self, 
 rawPredictionCol=rawPrediction, labelCol=label, 
 metricName=areaUnderROC) interprets the 'rawPredict' column - with LR, the 
 rawPredictionCol is expected to contain vectors, whereas with RF, the 
 prediction column contains doubles. 
 Suggested Resolution: Either enable BinaryClassificationEvaluator to work 
 with doubles, or let RF output a column rawPredictions containing the 
 probability vectors (with probability of 1 assigned to predicted label, and 0 
 assigned to the rest).
 Detailed Observation:
 While running grid search on an RF classifier to classify a small dataset 
 using the pyspark.ml.tuning module, specifically the ParamGridBuilder and 
 CrossValidator classes. I get the following error when I try passing a 
 DataFrame of Features-Labels to CrossValidator:
 {noformat}
 Py4JJavaError: An error occurred while calling o1464.evaluate.
 : java.lang.IllegalArgumentException: requirement failed: Column 
 rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef 
 but was actually DoubleType.
 {noformat}
 I tried the following code, using the dataset given in Spark's CV 
 documentation for [cross 
 validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
  I also pass the DF through a StringIndexer transformation for the RF:
  
 {noformat}
 dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10,
     ["features", "label"])
 stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
 si_model = stringIndexer.fit(dataset)
 dataset2 = si_model.transform(dataset)
 keep = [dataset2.features, dataset2.indexed]
 dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
 rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7)
 grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
 evaluator = BinaryClassificationEvaluator()
 cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
 cvModel = cv.fit(dataset3)
 {noformat}
 Note that the above dataset *works* on logistic regression. I have also tried 
 a larger dataset with sparse vectors as features (which I was originally 
 trying to fit) but received the same error on RF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread kumar ranganathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624484#comment-14624484
 ] 

kumar ranganathan edited comment on SPARK-9009 at 7/13/15 10:26 AM:


Yes, all this in a single machine only. The file exist in the specified 
location for sure. I just tried prefixing with file:/ but getting below 
exception in the command line itself. 

{code}
Exception in thread "main" java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
    at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
    at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
{code}

The D: path was just for keeping the truststore file on a different disk (not on C:).


was (Author: kumar):
Yes, all this in a single machine only. The file exist in the specified 
location for sure. I just tried prefixing with file:/ but getting below 
exception in the command line itself. 

{code}
Exception in thread "main" java.io.FileNotFoundException: file:\C:\Spark\conf\spark.truststore (The filename, directory name, or volume label syntax is incorrect)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:146)
    at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
    at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
    at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
{code}

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got FileNotFoundException in the application master when running the 
 SparkPi example in windows machine.
 The problem is that the truststore file found in 
 C:\Spark\conf\spark.truststore location but getting below exception as
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.init(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.init(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If i change the truststore file location to different drive 
 (d:\spark_conf\spark.truststore) then getting exception as
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception throws from SecurityManager.scala at the line of openstream() 
 shown below
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore - fileServerSSLOptions.trustStore) yield {
 val input = 
 

[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread kumar ranganathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624523#comment-14624523
 ] 

kumar ranganathan commented on SPARK-9009:
--

I have tried the code below and it prints true.

{code}
try {
    URI uri = new URI("file:///C:/Spark/conf/spark.truststore");
    File f = new File(uri);
    System.out.println(f.canRead());
}
catch (Exception ex) {
    System.out.println(ex);
}
{code}
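
That test parses the string as a URI before constructing the File. Spark's SecurityManager, by contrast, appears (from the Files.asByteSource call in the stack trace) to open the configured value as a literal filesystem path, so a file: prefix becomes part of the file name. A small sketch of that difference; the paths are examples and assume a local C:\Spark install:

{code}
# Sketch: a "file:" URI string is not a valid literal Windows path, while the
# plain path is. This mirrors java.io.FileInputStream vs. new File(new URI(...)).
import os

plain_path = r"C:\Spark\conf\spark.truststore"        # what a plain file open expects
uri_string = "file:/C:/Spark/conf/spark.truststore"   # treated literally, not parsed as a URI

print(os.path.isfile(plain_path))   # True when the file exists at that location
print(os.path.isfile(uri_string))   # False: no file is literally named "file:/C:/..."
{code}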

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got FileNotFoundException in the application master when running the 
 SparkPi example in windows machine.
 The problem is that the truststore file found in 
 C:\Spark\conf\spark.truststore location but getting below exception as
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.init(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.init(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If i change the truststore file location to different drive 
 (d:\spark_conf\spark.truststore) then getting exception as
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception throws from SecurityManager.scala at the line of openstream() 
 shown below
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore - fileServerSSLOptions.trustStore) yield {
 val input = 
 Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
 try {
 {code}
 The same problem occurs for the keystore file when removed truststore 
 property in spark-defaults.conf.
 When disabled the encryption property to set spark.ssl.enabled as false then 
 the job completed successfully. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`

2015-07-13 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-9010:

Component/s: (was: SQL)
 Documentation

 Improve the Spark Configuration document about `spark.kryoserializer.buffer`
 

 Key: SPARK-9010
 URL: https://issues.apache.org/jira/browse/SPARK-9010
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.0
Reporter: StanZhai
Priority: Minor
  Labels: documentation

 The meaning of spark.kryoserializer.buffer should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed.".
 The spark.kryoserializer.buffer.max.mb description is out-of-date in Spark 1.4.
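
 For reference, a minimal PySpark sketch of setting these properties; the values shown are just the 1.4 defaults:

{code}
# Sketch: configure Kryo and its buffer sizes explicitly (values are the 1.4 defaults).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-buffer-example")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer", "64k")       # initial size, one buffer per core
        .set("spark.kryoserializer.buffer.max", "64m"))  # the buffer can grow up to this
sc = SparkContext(conf=conf)
{code}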



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9010) Improve the Spark Configuration document about `spark.kryoserializer.buffer`

2015-07-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9010:
-
Target Version/s: 1.4.2, 1.5.0
Priority: Trivial  (was: Minor)

 Improve the Spark Configuration document about `spark.kryoserializer.buffer`
 

 Key: SPARK-9010
 URL: https://issues.apache.org/jira/browse/SPARK-9010
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.0
Reporter: StanZhai
Priority: Trivial
  Labels: documentation

 The meaning of spark.kryoserializer.buffer should be "Initial size of Kryo's serialization buffer. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed.".
 The spark.kryoserializer.buffer.max.mb description is out-of-date in Spark 1.4.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9007) start-slave.sh changed API in 1.4 and the documentation got updated to mention the old API

2015-07-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9007:
-
   Priority: Trivial  (was: Major)
Component/s: (was: Deploy)
 Documentation

[~koudelka] please set the JIRA fields reasonably. Are you going to open a PR?

 start-slave.sh changed API in 1.4 and the documentation got updated to 
 mention the old API
 --

 Key: SPARK-9007
 URL: https://issues.apache.org/jira/browse/SPARK-9007
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Jesper Lundgren
Priority: Trivial

 In Spark versions < 1.4, start-slave.sh accepted two parameters: a worker # and a list of master addresses.
 With Spark 1.4 the start-slave.sh worker # parameter was removed, which broke our custom standalone cluster setup.
 With Spark 1.4 the documentation was also updated to mention start-slave.sh (not previously mentioned), but it describes the old API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624434#comment-14624434
 ] 

Sean Owen commented on SPARK-9009:
--

Is this all on one machine? Because the file would not exist on other machines running your jobs.
The D: exception is unrelated to Spark.
It's probably because you need to specify paths on Windows specially. Try prefixing with file:

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan

 I got FileNotFoundException in the application master when running the 
 SparkPi example in windows machine.
 The problem is that the truststore file found in 
 C:\Spark\conf\spark.truststore location but getting below exception as
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.init(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.init(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If i change the truststore file location to different drive 
 (d:\spark_conf\spark.truststore) then getting exception as
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception throws from SecurityManager.scala at the line of openstream() 
 shown below
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore - fileServerSSLOptions.trustStore) yield {
 val input = 
 Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
 try {
 {code}
 The same problem occurs for the keystore file when removed truststore 
 property in spark-defaults.conf.
 When disabled the encryption property to set spark.ssl.enabled as false then 
 the job completed successfully. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9009:
-
   Priority: Minor  (was: Major)
Component/s: (was: YARN)

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got FileNotFoundException in the application master when running the 
 SparkPi example in windows machine.
 The problem is that the truststore file found in 
 C:\Spark\conf\spark.truststore location but getting below exception as
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.init(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.init(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If i change the truststore file location to different drive 
 (d:\spark_conf\spark.truststore) then getting exception as
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception throws from SecurityManager.scala at the line of openstream() 
 shown below
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore - fileServerSSLOptions.trustStore) yield {
 val input = 
 Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
 try {
 {code}
 The same problem occurs for the keystore file when removed truststore 
 property in spark-defaults.conf.
 When disabled the encryption property to set spark.ssl.enabled as false then 
 the job completed successfully. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread kumar ranganathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624499#comment-14624499
 ] 

kumar ranganathan commented on SPARK-9009:
--

Yes, I tried both file:/ and file:/// but both result in the same exception. I used forward slashes, but the exception shows backslashes.

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got FileNotFoundException in the application master when running the 
 SparkPi example in windows machine.
 The problem is that the truststore file found in 
 C:\Spark\conf\spark.truststore location but getting below exception as
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.init(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.init(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If i change the truststore file location to different drive 
 (d:\spark_conf\spark.truststore) then getting exception as
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception throws from SecurityManager.scala at the line of openstream() 
 shown below
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore - fileServerSSLOptions.trustStore) yield {
 val input = 
 Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
 try {
 {code}
 The same problem occurs for the keystore file when removed truststore 
 property in spark-defaults.conf.
 When disabled the encryption property to set spark.ssl.enabled as false then 
 the job completed successfully. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624516#comment-14624516
 ] 

Sean Owen commented on SPARK-9009:
--

Can you paste exactly what worked? I'm still not sure we're talking about the 
same file URIs.

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got FileNotFoundException in the application master when running the 
 SparkPi example in windows machine.
 The problem is that the truststore file found in 
 C:\Spark\conf\spark.truststore location but getting below exception as
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.init(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.init(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If i change the truststore file location to different drive 
 (d:\spark_conf\spark.truststore) then getting exception as
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception throws from SecurityManager.scala at the line of openstream() 
 shown below
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore - fileServerSSLOptions.trustStore) yield {
 val input = 
 Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
 try {
 {code}
 The same problem occurs for the keystore file when removed truststore 
 property in spark-defaults.conf.
 When disabled the encryption property to set spark.ssl.enabled as false then 
 the job completed successfully. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-13 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624521#comment-14624521
 ] 

Lianhui Wang commented on SPARK-8646:
-

[~juliet] From your spark1.4-verbose.log, I see that master = local[*]. So maybe spark.master is set to local in spark-defaults.conf? Another possibility is that data_transform.py itself calls sparkConf.set("spark.master", "local"). Can you check whether either of these is the case?
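
A quick way to check, as a sketch (run inside data_transform.py; it assumes the master is meant to come from the spark-submit command line rather than from code):

{code}
# Sketch: build the context without hard-coding a master, then print what was
# actually picked up; "local[*]" here would point at spark-defaults.conf or the
# environment rather than the --master argument.
from pyspark import SparkConf, SparkContext

conf = SparkConf()            # note: no conf.setMaster(...) call here
sc = SparkContext(conf=conf)
print(sc.master)              # expected "yarn-client" when submitted that way
{code}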

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running PySpark jobs results in a "no module named pyspark" error when run in yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.|https://issues.apache.org/jira/browse/SPARK-6869]
 This is not a binary-compatible change to Spark. Scripts that worked on previous Spark versions (i.e. commands that use spark-submit) should continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9011) Issue with running CrossValidator with RandomForestClassifier on dataset

2015-07-13 Thread Shivam Verma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivam Verma updated SPARK-9011:

Description: 
Hi

I'm a beginner to Spark, and am trying to run grid search on an RF classifier 
to classify a small dataset using the pyspark.ml.tuning module, specifically 
the ParamGridBuilder and CrossValidator classes. I get the following error when 
I try passing a DataFrame of Features-Labels to CrossValidator:
{noformat}
Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually 
DoubleType.
{noformat}
I tried the following code, using the dataset given in Spark's CV documentation 
for logistic regression. I also pass the DF through a StringIndexer 
transformation for the RF: 
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator
 
{noformat}
dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
    (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)
{noformat}
Note that the above dataset works on logistic regression. I have also tried a 
larger dataset with sparse vectors as features (which I was originally trying 
to fit) but received the same error on RF.

My guess is that there is an issue with how BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") receives the prediction column: with LR, the rawPredictionCol is a list/vector, whereas with RF, the prediction column is a double (I tried it out with a single parameter). Is it an issue with the evaluator, or is there anything else that I'm missing?

  was:
Hi

I'm a beginner to Spark, and am trying to run grid search on an RF classifier 
to classify a small dataset using the pyspark.ml.tuning module, specifically 
the ParamGridBuilder and CrossValidator classes. I get the following error when 
I try passing a DataFrame of Features-Labels to CrossValidator:

Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually 
DoubleType.

I tried the following code, using the dataset given in Spark's CV documentation 
for logistic regression. I also pass the DF through a StringIndexer 
transformation for the RF: 
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator
 

dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 
0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 
0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 10,[features, 
label])
stringIndexer = StringIndexer(inputCol=label, outputCol=indexed)
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
rf = 
RandomForestClassifier(predictionCol=rawPrediction,featuresCol=features,numTrees=5,
 maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)

Note that the above dataset works on logistic regression. I have also tried a 
larger dataset with sparse vectors as features (which I was originally trying 
to fit) but received the same error on RF.

My guess is that there is an issue with how BinaryClassificationEvaluator(self, 
rawPredictionCol=rawPrediction, labelCol=label, metricName=areaUnderROC) 
receives the 'predict' column - with LR, the rawPredictionCol is a list/vector, 
whereas with RF, the prediction column is a double (I tried it out with a 
single parameter). Is it an issue with the evaluator, or is there anything else 
that I'm missing?


 Issue with running CrossValidator with RandomForestClassifier on dataset
 

 Key: SPARK-9011
 URL: https://issues.apache.org/jira/browse/SPARK-9011
 Project: Spark
  Issue Type: Bug

[jira] [Resolved] (SPARK-9011) Issue with running CrossValidator with RandomForestClassifier on dataset

2015-07-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9011.
--
Resolution: Invalid

This is really a question, which you should ask on user@ first. Until you have 
identified a bug and ideally a code change, I don't think a JIRA is the right 
next step.

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 Issue with running CrossValidator with RandomForestClassifier on dataset
 

 Key: SPARK-9011
 URL: https://issues.apache.org/jira/browse/SPARK-9011
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib, PySpark
Affects Versions: 1.4.0
 Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single 
 node running CentOS
Reporter: Shivam Verma
Priority: Critical
  Labels: cross-validation, ml, mllib, pyspark, randomforest, 
 tuning

 Hi
 I'm a beginner to Spark, and am trying to run grid search on an RF classifier 
 to classify a small dataset using the pyspark.ml.tuning module, specifically 
 the ParamGridBuilder and CrossValidator classes. I get the following error 
 when I try passing a DataFrame of Features-Labels to CrossValidator:
 {noformat}
 Py4JJavaError: An error occurred while calling o1464.evaluate.
 : java.lang.IllegalArgumentException: requirement failed: Column 
 rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef 
 but was actually DoubleType.
 {noformat}
 I tried the following code, using the dataset given in Spark's CV 
 documentation for [cross 
 validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
  I also pass the DF through a StringIndexer transformation for the RF:
  
 {noformat}
 dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10,
     ["features", "label"])
 stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
 si_model = stringIndexer.fit(dataset)
 dataset2 = si_model.transform(dataset)
 keep = [dataset2.features, dataset2.indexed]
 dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
 rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features", numTrees=5, maxDepth=7)
 grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
 evaluator = BinaryClassificationEvaluator()
 cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
 cvModel = cv.fit(dataset3)
 {noformat}
 Note that the above dataset *works* on logistic regression. I have also tried 
 a larger dataset with sparse vectors as features (which I was originally 
 trying to fit) but received the same error on RF.
 My guess is that there is an issue with how 
 BinaryClassificationEvaluator(self, rawPredictionCol=rawPrediction, 
 labelCol=label, metricName=areaUnderROC) interprets the 'rawPredict' 
 column - with LR, the rawPredictionCol is a list/vector, whereas with RF, the 
 prediction column is a double. 
 Is it an issue with the evaluator? Is there a workaround?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9012) Accumulators in the task table should be escaped

2015-07-13 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-9012:
---

 Summary: Accumulators in the task table should be escaped
 Key: SPARK-9012
 URL: https://issues.apache.org/jira/browse/SPARK-9012
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Reporter: Shixiong Zhu


If you run the following code, the task table will be broken because accumulator names aren't escaped.
{code}
val a = sc.accumulator(1, "<table>")
sc.parallelize(1 to 10).foreach(i => a += i)
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7751) Add @since to stable and experimental methods in MLlib

2015-07-13 Thread Patrick Baier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624598#comment-14624598
 ] 

Patrick Baier commented on SPARK-7751:
--

sorry, wrong ticket number

 Add @since to stable and experimental methods in MLlib
 --

 Key: SPARK-7751
 URL: https://issues.apache.org/jira/browse/SPARK-7751
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
  Labels: starter

 This is useful to check whether a feature exists in some version of Spark. 
 This is an umbrella JIRA to track the progress. We want to have @since tag 
 for both stable (those without any Experimental/DeveloperApi/AlphaComponent 
 annotations) and experimental methods in MLlib:
 * an example PR for Scala: https://github.com/apache/spark/pull/6101
 * an example PR for Python: https://github.com/apache/spark/pull/6295
 We need to dig through the git commit history to figure out in which Spark version a 
 method was first introduced. Take `NaiveBayes.setModelType` as an example: we can grep 
 for `def setModelType` at different version git tags.
 {code}
 meng@xm:~/src/spark
 $ git show v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
 meng@xm:~/src/spark
 $ git show v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala | grep "def setModelType"
   def setModelType(modelType: String): NaiveBayes = {
 {code}
 If there are better ways, please let us know.
 We cannot add all @since tags in a single PR, which is hard to review. So we 
 made some subtasks for each package, for example 
 `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
 and the `spark.ml` package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624501#comment-14624501
 ] 

Sean Owen commented on SPARK-9009:
--

Try a small Java program using the File object to see if you can read the file 
using that exact URI. I doubt this has to do with Spark; maybe the file is not 
readable to your process?
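
For instance, a quick standalone check along these lines (a Scala sketch of the small Java 
test suggested above; the path is the one from the report) would show whether the JVM 
process can actually open that exact file:
{code}
import java.io.{File, FileInputStream}

object TruststoreCheck {
  def main(args: Array[String]): Unit = {
    val f = new File("""C:\Spark\conf\spark.truststore""")
    println(s"exists=${f.exists}, canRead=${f.canRead}, length=${f.length}")
    // Guava's Files.asByteSource(...).openStream() boils down to new FileInputStream(file),
    // so this reproduces the same failure mode as the SecurityManager code path.
    val in = new FileInputStream(f)
    try println(s"first byte: ${in.read()}") finally in.close()
  }
}
{code}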

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got a FileNotFoundException in the application master when running the SparkPi 
 example on a Windows machine.
 The problem is that the truststore file is present at C:\Spark\conf\spark.truststore, 
 but I get the exception below:
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.<init>(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If I change the truststore file location to a different drive 
 (d:\spark_conf\spark.truststore), then I get the following exception:
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception is thrown from SecurityManager.scala at the openStream() call shown below:
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore <- fileServerSSLOptions.trustStore) yield {
     val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
     try {
 {code}
 The same problem occurs for the keystore file when the truststore property is removed 
 from spark-defaults.conf.
 When encryption is disabled by setting spark.ssl.enabled to false, the job completes 
 successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore

2015-07-13 Thread kumar ranganathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624514#comment-14624514
 ] 

kumar ranganathan commented on SPARK-9009:
--

Yes, I tried. I could read the file with a small Java program.

 SPARK Encryption FileNotFoundException for truststore
 -

 Key: SPARK-9009
 URL: https://issues.apache.org/jira/browse/SPARK-9009
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: kumar ranganathan
Priority: Minor

 I got a FileNotFoundException in the application master when running the SparkPi 
 example on a Windows machine.
 The problem is that the truststore file is present at C:\Spark\conf\spark.truststore, 
 but I get the exception below:
 {code}
 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: 
 java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system 
 cannot find the path specified)
   at java.io.FileInputStream.open(Native Method)
   at java.io.FileInputStream.<init>(FileInputStream.java:146)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)
   at 
 org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261)
   at 
 org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254)
   at scala.Option.map(Option.scala:145)
   at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:254)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
   at 
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, 
 exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: 
 C:\Spark\conf\spark.truststore (The system cannot find the path specified))
 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called
 {code}
 If I change the truststore file location to a different drive 
 (d:\spark_conf\spark.truststore), then I get the following exception:
 {code}
 java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is 
 not ready)
 {code}
 This exception is thrown from SecurityManager.scala at the openStream() call shown below:
 {code:title=SecurityManager.scala|borderStyle=solid}
 val trustStoreManagers =
   for (trustStore <- fileServerSSLOptions.trustStore) yield {
     val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream()
     try {
 {code}
 The same problem occurs for the keystore file when the truststore property is removed 
 from spark-defaults.conf.
 When encryption is disabled by setting spark.ssl.enabled to false, the job completes 
 successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6851) Wrong answers for self joins of converted parquet relations

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625755#comment-14625755
 ] 

Apache Spark commented on SPARK-6851:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7387

 Wrong answers for self joins of converted parquet relations
 ---

 Key: SPARK-6851
 URL: https://issues.apache.org/jira/browse/SPARK-6851
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker
 Fix For: 1.3.1, 1.4.0


 From the user list (/cc [~chinnitv]): When the same relation exists twice in a query 
 plan, our new caching logic replaces both instances with identical replacements.  The 
 bug can be seen in the following transformation:
 {code}
 === Applying Rule 
 org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions ===
 !Project [state#59,month#60]   
 'Project [state#105,month#106]
!  Join Inner, Some(((state#69 = state#59) && (month#70 = month#60)))'Join 
 Inner, Some(((state#105 = state#105) && (month#106 = month#106)))
 !  MetastoreRelation default, orders, None   
 Subquery orders
 !  Subquery ao
 Relation[id#97,category#98,make#99,type#100,price#101,pdate#102,customer#103,city#104,state#105,month#106]
  org.apache.spark.sql.parquet.ParquetRelation2
 !   Distinct 
 Subquery ao
 !Project [state#69,month#70]  
 Distinct
 ! Join Inner, Some((id#81 = id#71))
 Project [state#105,month#106]
 !  MetastoreRelation default, orders, None  
 Join Inner, Some((id#115 = id#97))
 !  MetastoreRelation default, orderupdates, None 
 Subquery orders
 ! 
 Relation[id#97,category#98,make#99,type#100,price#101,pdate#102,customer#103,city#104,state#105,month#106]
  org.apache.spark.sql.parquet.ParquetRelation2
 !
 Subquery orderupdates
 ! 
 Relation[id#115,category#116,make#117,type#118,price#119,pdate#120,customer#121,city#122,state#123,month#124]
  org.apache.spark.sql.parquet.ParquetRelation2
 {code} 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9030) Add Kinesis.createStream unit tests that actually send data

2015-07-13 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-9030:


 Summary: Add Kinesis.createStream unit tests that actually send data
 Key: SPARK-9030
 URL: https://issues.apache.org/jira/browse/SPARK-9030
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.4.1
Reporter: Tathagata Das
Assignee: Tathagata Das


Current Kinesis unit tests do not test createStream by sending data. This JIRA 
is to add such a unit test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9027) Generalize predicate pushdown into the metastore

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625733#comment-14625733
 ] 

Apache Spark commented on SPARK-9027:
-

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/7386

 Generalize predicate pushdown into the metastore
 

 Key: SPARK-9027
 URL: https://issues.apache.org/jira/browse/SPARK-9027
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9026:
---

Assignee: Apache Spark  (was: Josh Rosen)

 SimpleFutureAction.onComplete should not tie up a separate thread for each 
 callback
 ---

 Key: SPARK-9026
 URL: https://issues.apache.org/jira/browse/SPARK-9026
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Apache Spark

 As [~zsxwing] points out at 
 https://github.com/apache/spark/pull/7276#issuecomment-121097747, 
 SimpleFutureAction currently blocks a separate execution context thread for 
 each callback registered via onComplete:
 {code}
   override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
     executor.execute(new Runnable {
       override def run() {
         func(awaitResult())
       }
     })
   }
 {code}
 We should fix this so that callbacks do not steal threads.
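 
 A minimal sketch of one way to avoid parking a thread per callback (not necessarily the 
 fix that will land): complete a Promise exactly once when the job finishes, and let the 
 standard Future machinery dispatch every registered callback from there.
 {code}
 import scala.concurrent.{ExecutionContext, Promise}
 import scala.util.Try

 // Hypothetical wrapper: registerCompletion hands the underlying job a function
 // that the job invokes exactly once when it finishes.
 class NonBlockingFutureAction[T](registerCompletion: (Try[T] => Unit) => Unit) {
   private val promise = Promise[T]()
   registerCompletion(result => promise.tryComplete(result))

   def onComplete[U](func: Try[T] => U)(implicit executor: ExecutionContext): Unit =
     promise.future.onComplete(func)  // no thread sits blocked in awaitResult()
 }
 {code}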



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9027) Generalize predicate pushdown into the metastore

2015-07-13 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-9027:
---

 Summary: Generalize predicate pushdown into the metastore
 Key: SPARK-9027
 URL: https://issues.apache.org/jira/browse/SPARK-9027
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9026:
---

Assignee: Josh Rosen  (was: Apache Spark)

 SimpleFutureAction.onComplete should not tie up a separate thread for each 
 callback
 ---

 Key: SPARK-9026
 URL: https://issues.apache.org/jira/browse/SPARK-9026
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 As [~zsxwing] points out at 
 https://github.com/apache/spark/pull/7276#issuecomment-121097747, 
 SimpleFutureAction currently blocks a separate execution context thread for 
 each callback registered via onComplete:
 {code}
   override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
     executor.execute(new Runnable {
       override def run() {
         func(awaitResult())
       }
     })
   }
 {code}
 We should fix this so that callbacks do not steal threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625730#comment-14625730
 ] 

Apache Spark commented on SPARK-9026:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7385

 SimpleFutureAction.onComplete should not tie up a separate thread for each 
 callback
 ---

 Key: SPARK-9026
 URL: https://issues.apache.org/jira/browse/SPARK-9026
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 As [~zsxwing] points out at 
 https://github.com/apache/spark/pull/7276#issuecomment-121097747, 
 SimpleFutureAction currently blocks a separate execution context thread for 
 each callback registered via onComplete:
 {code}
   override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
     executor.execute(new Runnable {
       override def run() {
         func(awaitResult())
       }
     })
   }
 {code}
 We should fix this so that callbacks do not steal threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-13 Thread Lianhui Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625725#comment-14625725
 ] 

Lianhui Wang commented on SPARK-8646:
-

[~juliet] Can you provide your spark-submit command? 
I think the correct command in Spark 1.4 is:
$SPARK_HOME/bin/spark-submit --master yarn-client outofstock/data_transform.py hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex4/
Is it the same as your command?

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2015-07-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6910.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7216
[https://github.com/apache/spark/pull/7216]

 Support for pushing predicates down to metastore for partition pruning
 --

 Key: SPARK-6910
 URL: https://issues.apache.org/jira/browse/SPARK-6910
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheolsoo Park
Priority: Critical
 Fix For: 1.5.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9027) Generalize predicate pushdown into the metastore

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9027:
---

Assignee: Apache Spark  (was: Michael Armbrust)

 Generalize predicate pushdown into the metastore
 

 Key: SPARK-9027
 URL: https://issues.apache.org/jira/browse/SPARK-9027
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9028) Add CountVectorizer as an estimator to generate CountVectorizerModel

2015-07-13 Thread yuhao yang (JIRA)
yuhao yang created SPARK-9028:
-

 Summary: Add CountVectorizer as an estimator to generate 
CountVectorizerModel
 Key: SPARK-9028
 URL: https://issues.apache.org/jira/browse/SPARK-9028
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9021) Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625789#comment-14625789
 ] 

Apache Spark commented on SPARK-9021:
-

User 'njhwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7378

 Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition
 --

 Key: SPARK-9021
 URL: https://issues.apache.org/jira/browse/SPARK-9021
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Ubuntu 14.04 LTS
Reporter: Nicholas Hwang

 Please see pull request for more information.
 I initially patched this arguably unexpected behavior by serializing 
 zeroValue, but ended up mimicking the deepcopy approach used by other RDD 
 methods. I also contemplated having fold/aggregate accept zero value 
 generator functions instead of an actual object, but that obviously changes 
 the API.
 Looking forward to hearing back and/or being educated on how I'm 
 inappropriately using this functionality (relatively new to Spark and 
 functional programming). Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9021) Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9021:
---

Assignee: (was: Apache Spark)

 Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition
 --

 Key: SPARK-9021
 URL: https://issues.apache.org/jira/browse/SPARK-9021
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Ubuntu 14.04 LTS
Reporter: Nicholas Hwang

 Please see pull request for more information.
 I initially patched this arguably unexpected behavior by serializing 
 zeroValue, but ended up mimicking the deepcopy approach used by other RDD 
 methods. I also contemplated having fold/aggregate accept zero value 
 generator functions instead of an actual object, but that obviously changes 
 the API.
 Looking forward to hearing back and/or being educated on how I'm 
 inappropriately using this functionality (relatively new to Spark and 
 functional programming). Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9021) Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9021:
---

Assignee: Apache Spark

 Have pyspark's RDD.aggregate() make a deepcopy of zeroValue for each partition
 --

 Key: SPARK-9021
 URL: https://issues.apache.org/jira/browse/SPARK-9021
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Ubuntu 14.04 LTS
Reporter: Nicholas Hwang
Assignee: Apache Spark

 Please see pull request for more information.
 I initially patched this arguably unexpected behavior by serializing 
 zeroValue, but ended up mimicking the deepcopy approach used by other RDD 
 methods. I also contemplated having fold/aggregate accept zero value 
 generator functions instead of an actual object, but that obviously changes 
 the API.
 Looking forward to hearing back and/or being educated on how I'm 
 inappropriately using this functionality (relatively new to Spark and 
 functional programming). Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8965) Add ml-guide Python Example: Estimator, Transformer, and Param

2015-07-13 Thread Arijit Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625831#comment-14625831
 ] 

Arijit Saha commented on SPARK-8965:


Hi Joseph,

I would like to take up this task.
Being a starter task, it will help me understand the flow.

Thanks,
Arijit.

 Add ml-guide Python Example: Estimator, Transformer, and Param
 --

 Key: SPARK-8965
 URL: https://issues.apache.org/jira/browse/SPARK-8965
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, PySpark
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 Look at: 
 [http://spark.apache.org/docs/latest/ml-guide.html#example-estimator-transformer-and-param]
 We need a Python example doing exactly the same thing, but in Python.  It 
 should be tested using the PySpark shell.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3703) Ensemble learning methods

2015-07-13 Thread Manoj Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625859#comment-14625859
 ] 

Manoj Kumar commented on SPARK-3703:


Hi, I am interested in working on ensemble methods in general (as seen from my 
initial few pull requests). Are any of these targeted towards the 1.5 release? 
I'm asking because I might not be able to commit enough time after September.

 Ensemble learning methods
 -

 Key: SPARK-3703
 URL: https://issues.apache.org/jira/browse/SPARK-3703
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 This is a general JIRA for coordinating on adding ensemble learning methods 
 to MLlib.  These methods include a variety of boosting and bagging 
 algorithms.  Below is a general design doc for ensemble methods (currently 
 focused on boosting).  Please comment here about general discussion and 
 coordination; for comments about specific algorithms, please comment on their 
 respective JIRAs.
 [Design doc for ensemble methods | 
 https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9028) Add CountVectorizer as an estimator to generate CountVectorizerModel

2015-07-13 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang updated SPARK-9028:
--
Description: Add an estimator for CountVectorizerModel. The estimator will 
extract a vocabulary from document collections according to the term frequency.
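
A minimal sketch of the vocabulary-extraction idea in plain Scala (assumed input shape; not 
the proposed spark.ml API): keep the vocabSize most frequent terms across all documents.
{code}
// docs: each document already tokenized into terms.
def buildVocabulary(docs: Seq[Seq[String]], vocabSize: Int): Array[String] =
  docs.flatten
    .groupBy(identity)
    .map { case (term, occurrences) => (term, occurrences.size) }  // term -> corpus frequency
    .toSeq
    .sortBy(-_._2)   // most frequent terms first
    .take(vocabSize)
    .map(_._1)
    .toArray
{code}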

 Add CountVectorizer as an estimator to generate CountVectorizerModel
 

 Key: SPARK-9028
 URL: https://issues.apache.org/jira/browse/SPARK-9028
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang

 Add an estimator for CountVectorizerModel. The estimator will extract a 
 vocabulary from document collections according to the term frequency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9029) shortcut CaseKeyWhen if key is null

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9029:
---

Assignee: Apache Spark

 shortcut CaseKeyWhen if key is null
 ---

 Key: SPARK-9029
 URL: https://issues.apache.org/jira/browse/SPARK-9029
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9029) shortcut CaseKeyWhen if key is null

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625778#comment-14625778
 ] 

Apache Spark commented on SPARK-9029:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7389

 shortcut CaseKeyWhen if key is null
 ---

 Key: SPARK-9029
 URL: https://issues.apache.org/jira/browse/SPARK-9029
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9029) shortcut CaseKeyWhen if key is null

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9029:
---

Assignee: (was: Apache Spark)

 shortcut CaseKeyWhen if key is null
 ---

 Key: SPARK-9029
 URL: https://issues.apache.org/jira/browse/SPARK-9029
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-07-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1403.

  Resolution: Fixed
Target Version/s:   (was: 1.5.0)

Hey All,

This issue should remain fixed. [~mandoskippy] I think you are just running 
into a different issue that is also in some way related to classloading.

Can you open a new JIRA for your issue, paste in the stack trace and give as 
much information as possible about the environment? Thanks!

 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.3.0, 1.4.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
 executor on mesos slave throws a  java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-07-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625739#comment-14625739
 ] 

Patrick Wendell edited comment on SPARK-1403 at 7/14/15 2:59 AM:
-

Hey All,

This issue should remain fixed. [~mandoskippy] I think you are just running 
into a different issue that is also in some way related to classloading.

Can you open a new JIRA for your issue, paste in the stack trace and give as 
much information as possible about the environment? Thanks!


was (Author: pwendell):
Hey All,

This issue should remain fixed. [~mandoskippy] I think you are just running 
into a different issue that is also in some way related to classloading.

Can you open a new JIRA for your issue, paste in the stack trace and give as 
much information as possible without the environment? Thanks!

 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0, 1.3.0, 1.4.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run spark 0.9.0 on mesos but not spark 1.0.0. This is because the spark 
 executor on mesos slave throws a  java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9028) Add CountVectorizer as an estimator to generate CountVectorizerModel

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9028:
---

Assignee: Apache Spark

 Add CountVectorizer as an estimator to generate CountVectorizerModel
 

 Key: SPARK-9028
 URL: https://issues.apache.org/jira/browse/SPARK-9028
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: yuhao yang
Assignee: Apache Spark





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9029) shortcut CaseKeyWhen if key is null

2015-07-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-9029:
--

 Summary: shortcut CaseKeyWhen if key is null
 Key: SPARK-9029
 URL: https://issues.apache.org/jira/browse/SPARK-9029
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9013) generate MutableProjection directly instead of return a function

2015-07-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-9013:
--

 Summary: generate MutableProjection directly instead of return a 
function
 Key: SPARK-9013
 URL: https://issues.apache.org/jira/browse/SPARK-9013
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9013) generate MutableProjection directly instead of return a function

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9013:
---

Assignee: (was: Apache Spark)

 generate MutableProjection directly instead of return a function
 

 Key: SPARK-9013
 URL: https://issues.apache.org/jira/browse/SPARK-9013
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3155) Support DecisionTree pruning

2015-07-13 Thread Walter Petersen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622041#comment-14622041
 ] 

Walter Petersen edited comment on SPARK-3155 at 7/13/15 12:57 PM:
--

Hi all,

I'm new out there. Please tell me:
- Is the proposed implementation based on a well-known research paper? If so, which one?
- Is this issue still relevant? Is someone currently implementing the feature?

Thanks


was (Author: petersen):
Hi all,

I'm new out there. Please tell me:
- Is the proposed implementation based on a well-known research paper ? If so, 
which one ?
- Is is issue still relevant ? Is someone currently implementing the feature ? 

Thanks

 Support DecisionTree pruning
 

 Key: SPARK-3155
 URL: https://issues.apache.org/jira/browse/SPARK-3155
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 Improvement: accuracy, computation
 Summary: Pruning is a common method for preventing overfitting with decision 
 trees.  A smart implementation can prune the tree during training in order to 
 avoid training parts of the tree which would be pruned eventually anyways.  
 DecisionTree does not currently support pruning.
 Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
 with zero or more branches removed.
 A naive implementation prunes as follows:
 (1) Train a depth K tree using a training set.
 (2) Compute the optimal prediction at each node (including internal nodes) 
 based on the training set.
 (3) Take a held-out validation set, and use the tree to make predictions for 
 each validation example.  This allows one to compute the validation error 
 made at each node in the tree (based on the predictions computed in step (2).)
 (4) For each pair of leafs with the same parent, compare the total error on 
 the validation set made by the leafs’ predictions with the error made by the 
 parent’s predictions.  Remove the leafs if the parent has lower error.
 A smarter implementation prunes during training, computing the error on the 
 validation set made by each node as it is trained.  Whenever two children 
 increase the validation error, they are pruned, and no more training is 
 required on that branch.
 It is common to use about 1/3 of the data for pruning.  Note that pruning is 
 important when using a tree directly for prediction.  It is less important 
 when combining trees via ensemble methods.
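 
 A minimal sketch of the naive post-pruning in steps (1)-(4), assuming a simple mutable 
 binary-tree node and a precomputed per-node validation error (not MLlib's DecisionTree API):
 {code}
 case class Node(prediction: Double, var left: Option[Node], var right: Option[Node]) {
   def isLeaf: Boolean = left.isEmpty && right.isEmpty
 }

 // valError(n) = total error n's own prediction makes on the validation examples
 // reaching n (steps (2) and (3) above).
 def prune(node: Node, valError: Node => Double): Unit =
   for (l <- node.left; r <- node.right) {
     prune(l, valError)
     prune(r, valError)
     // Step (4): collapse a pair of leaves if the parent's prediction does no worse.
     if (l.isLeaf && r.isLeaf && valError(node) <= valError(l) + valError(r)) {
       node.left = None
       node.right = None
     }
   }
 {code}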



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9013) generate MutableProjection directly instead of return a function

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9013:
---

Assignee: Apache Spark

 generate MutableProjection directly instead of return a function
 

 Key: SPARK-9013
 URL: https://issues.apache.org/jira/browse/SPARK-9013
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7549) Support aggregating over nested fields

2015-07-13 Thread Chen Song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625704#comment-14625704
 ] 

Chen Song commented on SPARK-7549:
--

I prefer the former. I thought about using explode; it's a good way to implement the 
nested aggregations. But I want to take advantage of codegen by implementing these 
directly.

 Support aggregating over nested fields
 --

 Key: SPARK-7549
 URL: https://issues.apache.org/jira/browse/SPARK-7549
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 Would be nice to be able to run sum, avg, min, max (and other numeric 
 aggregate expressions) on arrays.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed

2015-07-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7126:
-
Target Version/s:   (was: 1.5.0)

 For spark.ml Classifiers, automatically index labels if they are not yet 
 indexed
 

 Key: SPARK-7126
 URL: https://issues.apache.org/jira/browse/SPARK-7126
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Now that we have StringIndexer, we could have 
 spark.ml.classification.Classifier (the abstraction) automatically handle 
 label indexing if the labels are not yet indexed.
 This would require a bit of design:
 * Should predict() output the original labels or the indices?
 * How should we notify users that the labels are being automatically indexed?
 * How should we provide that index to the users?
 * If multiple parts of a Pipeline automatically index labels, what do we need 
 to do to make sure they are consistent?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7126) For spark.ml Classifiers, automatically index labels if they are not yet indexed

2015-07-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625708#comment-14625708
 ] 

Joseph K. Bradley commented on SPARK-7126:
--

I agree we should emulate scikit-learn.  I've spoken with [~mengxr], who 
strongly supports having transform() maintain the current semantics of using 
0-based label indices.

This means that, to solve this JIRA, we will need to add a new method analogous 
to fit() which returns a PipelineModel rather than a specific model (like 
LogisticRegressionModel).  That PipelineModel can include indexing and 
de-indexing labels, and perhaps other transformations as well.  This addition 
to the API will require some significant design, which we hope to do before 
long...but maybe not for 1.5.  I'll remove that target version.

 For spark.ml Classifiers, automatically index labels if they are not yet 
 indexed
 

 Key: SPARK-7126
 URL: https://issues.apache.org/jira/browse/SPARK-7126
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley

 Now that we have StringIndexer, we could have 
 spark.ml.classification.Classifier (the abstraction) automatically handle 
 label indexing if the labels are not yet indexed.
 This would require a bit of design:
 * Should predict() output the original labels or the indices?
 * How should we notify users that the labels are being automatically indexed?
 * How should we provide that index to the users?
 * If multiple parts of a Pipeline automatically index labels, what do we need 
 to do to make sure they are consistent?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities

2015-07-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625710#comment-14625710
 ] 

Joseph K. Bradley commented on SPARK-6884:
--

Once [SPARK-7131] gets merged, then we can extend trees (and then forests) to 
provide class probabilities.  I'd watch that JIRA to get pinged when it's 
merged.  Thanks!

 Random forest: predict class probabilities
 --

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor by counting 
 votes from individual trees and adding up their votes for 1 and then 
 dividing by the total number of votes.
 I opened this ticked to keep track of changes. Will update once I push my 
 code to master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8998) Collect enough frequent prefixes before projection in PrefixSpan

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625716#comment-14625716
 ] 

Apache Spark commented on SPARK-8998:
-

User 'zhangjiajin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7383

 Collect enough frequent prefixes before projection in PrefixSpan
 

 Key: SPARK-8998
 URL: https://issues.apache.org/jira/browse/SPARK-8998
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Zhang JiaJin
   Original Estimate: 48h
  Remaining Estimate: 48h

 The implementation in SPARK-6487 might have scalability issues when the 
 number of frequent items is very small. In this case, we can generate 
 candidate sets of higher orders using Apriori-like algorithms and count them, 
 until we collect enough prefixes.
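 
 A minimal local sketch of that Apriori-like step (assumed input shapes; not the MLlib 
 implementation): extend each frequent length-k prefix with a frequent item, then keep only 
 the candidates whose subsequence support reaches minCount.
 {code}
 def isSubsequence(prefix: List[Int], sequence: List[Int]): Boolean = {
   var rest = sequence
   prefix.forall { item =>
     val idx = rest.indexOf(item)
     if (idx >= 0) { rest = rest.drop(idx + 1); true } else false
   }
 }

 def growPrefixes(
     freqPrefixes: Set[List[Int]],
     freqItems: Set[Int],
     sequences: Seq[List[Int]],
     minCount: Int): Set[List[Int]] = {
   // Candidate generation: every frequent prefix extended by every frequent item.
   val candidates = for (p <- freqPrefixes; item <- freqItems) yield p :+ item
   // Counting: keep candidates supported by at least minCount sequences.
   candidates.filter(cand => sequences.count(seq => isSubsequence(cand, seq)) >= minCount)
 }
 {code}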



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8998) Collect enough frequent prefixes before projection in PrefixSpan

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8998:
---

Assignee: Zhang JiaJin  (was: Apache Spark)

 Collect enough frequent prefixes before projection in PrefixSpan
 

 Key: SPARK-8998
 URL: https://issues.apache.org/jira/browse/SPARK-8998
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Zhang JiaJin
   Original Estimate: 48h
  Remaining Estimate: 48h

 The implementation in SPARK-6487 might have scalability issues when the 
 number of frequent items is very small. In this case, we can generate 
 candidate sets of higher orders using Apriori-like algorithms and count them, 
 until we collect enough prefixes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8998) Collect enough frequent prefixes before projection in PrefixSpan

2015-07-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8998:
---

Assignee: Apache Spark  (was: Zhang JiaJin)

 Collect enough frequent prefixes before projection in PrefixSpan
 

 Key: SPARK-8998
 URL: https://issues.apache.org/jira/browse/SPARK-8998
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Apache Spark
   Original Estimate: 48h
  Remaining Estimate: 48h

 The implementation in SPARK-6487 might have scalability issues when the 
 number of frequent items is very small. In this case, we can generate 
 candidate sets of higher orders using Apriori-like algorithms and count them, 
 until we collect enough prefixes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback

2015-07-13 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-9026:
-

 Summary: SimpleFutureAction.onComplete should not tie up a 
separate thread for each callback
 Key: SPARK-9026
 URL: https://issues.apache.org/jira/browse/SPARK-9026
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen


As [~zsxwing] points out at 
https://github.com/apache/spark/pull/7276#issuecomment-121097747, 
SimpleFutureAction currently blocks a separate execution context thread for 
each callback registered via onComplete:

{code}
  override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
    executor.execute(new Runnable {
      override def run() {
        func(awaitResult())
      }
    })
  }
{code}

We should fix this so that callbacks do not steal threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9026) SimpleFutureAction.onComplete should not tie up a separate thread for each callback

2015-07-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-9026:
-

Assignee: Josh Rosen

 SimpleFutureAction.onComplete should not tie up a separate thread for each 
 callback
 ---

 Key: SPARK-9026
 URL: https://issues.apache.org/jira/browse/SPARK-9026
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen
Assignee: Josh Rosen

 As [~zsxwing] points out at 
 https://github.com/apache/spark/pull/7276#issuecomment-121097747, 
 SimpleFutureAction currently blocks a separate execution context thread for 
 each callback registered via onComplete:
 {code}
   override def onComplete[U](func: (Try[T]) => U)(implicit executor: ExecutionContext) {
     executor.execute(new Runnable {
       override def run() {
         func(awaitResult())
       }
     })
   }
 {code}
 We should fix this so that callbacks do not steal threads.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9015) Maven cleanup / Clean Project Import in scala-ide

2015-07-13 Thread Jan Prach (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Prach updated SPARK-9015:
-
Comment: was deleted

(was: PR #7375)

 Maven cleanup / Clean Project Import in scala-ide
 -

 Key: SPARK-9015
 URL: https://issues.apache.org/jira/browse/SPARK-9015
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Jan Prach

 Clean up maven for a clean import into scala-ide / eclipse.
 The outstanding PR contains things like the removal of the groovy plugin, plus some 
 more maven cleanup.
 In order to make it a seamless experience, two more things have to be merged 
 upstream:
 1) have the IDE automatically generate Java sources from IDL - 
 https://issues.apache.org/jira/browse/AVRO-1671
 2) set the Scala version in the IDE based on the maven config - 
 https://github.com/sonatype/m2eclipse-scala/issues/30



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6319) DISTINCT doesn't work for binary type

2015-07-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-6319:
--
Priority: Critical  (was: Major)

 DISTINCT doesn't work for binary type
 -

 Key: SPARK-6319
 URL: https://issues.apache.org/jira/browse/SPARK-6319
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
Reporter: Cheng Lian
Priority: Critical

 Spark shell session for reproduction:
 {noformat}
 scala> import sqlContext.implicits._
 scala> import org.apache.spark.sql.types._
 scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
 ...
 CAST(c, BinaryType)
 [B@43f13160
 [B@5018b648
 [B@3be22500
 [B@476fc8a1
 {noformat}
 Spark SQL uses plain byte arrays to represent binary values. However, arrays 
 are compared by reference rather than by value. On the other hand, the 
 DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check 
 for duplicated values. These two facts together cause the problem.
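 
 A minimal illustration of the underlying issue in plain Scala (no Spark involved):
 {code}
 val a = "1".getBytes("UTF-8")
 val b = "1".getBytes("UTF-8")
 a == b                          // false: arrays are compared by reference
 java.util.Arrays.equals(a, b)   // true: compared by value
 Set(a, b).size                  // 2, even though the contents are identical
 Set(a.toSeq, b.toSeq).size      // 1: wrapping in a Seq restores value equality
 {code}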



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6319) DISTINCT doesn't work for binary type

2015-07-13 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625045#comment-14625045
 ] 

Josh Rosen commented on SPARK-6319:
---

I think that we should revisit this issue.  It seems that we currently return 
wrong answers for groupBy queries involving binary typed columns.  If we're not 
going to support this properly, then I think we should fail-fast with an 
analysis error rather than returning an incorrect answer.

 DISTINCT doesn't work for binary type
 -

 Key: SPARK-6319
 URL: https://issues.apache.org/jira/browse/SPARK-6319
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
Reporter: Cheng Lian

 Spark shell session for reproduction:
 {noformat}
 scala> import sqlContext.implicits._
 scala> import org.apache.spark.sql.types._
 scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" cast BinaryType).distinct.show()
 ...
 CAST(c, BinaryType)
 [B@43f13160
 [B@5018b648
 [B@3be22500
 [B@476fc8a1
 {noformat}
 Spark SQL uses plain byte arrays to represent binary values. However, arrays 
 are compared by reference rather than by value. On the other hand, the 
 DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check 
 for duplicated values. These two facts together cause the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8907) Speed up path construction in DynamicPartitionWriterContainer.outputWriterForRow

2015-07-13 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625150#comment-14625150
 ] 

Ilya Ganelin commented on SPARK-8907:
-

[~rxin] The code for this in master has eliminated usage of zip and map as of 
[SPARK-8961|https://github.com/apache/spark/commit/33630883685eafcc3ee4521ea8363be342f6e6b4].
 Do you think this can be further optimized and if so, how? There doesn't seem 
to be much within the existing catalyst expressions that would facilitate this, 
but I could be wrong. 

The relevant code fragment is below:
{code}
val partitionPath = {
  val partitionPathBuilder = new StringBuilder
  var i = 0

  while (i < partitionColumns.length) {
    val col = partitionColumns(i)
    val partitionValueString = {
      val string = row.getString(i)
      if (string.eq(null)) defaultPartitionName
      else PartitioningUtils.escapePathName(string)
    }

    if (i > 0) {
      partitionPathBuilder.append(Path.SEPARATOR_CHAR)
    }

    partitionPathBuilder.append(s"$col=$partitionValueString")
    i += 1
  }

  partitionPathBuilder.toString()
}
{code}
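
As a possible direction for the expression-based idea mentioned in the issue 
description below, here is a hedged sketch using the public DataFrame functions 
{{concat}}/{{concat_ws}} (added around Spark 1.5) rather than internal Catalyst 
classes; null handling and path escaping are omitted, and the column names are 
illustrative only:

{code}
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit}

// Hypothetical partition columns; in the real writer these come from the
// relation's partitioning spec.
val partitionColumns = Seq("year", "month")

// Build "year=2015/month=07"-style paths as a single Column expression,
// which codegen can evaluate without per-row Scala collection work.
val pathExpr = concat_ws("/",
  partitionColumns.map(c => concat(lit(c + "="), col(c).cast("string"))): _*)

// df.withColumn("partitionPath", pathExpr) would then materialize the path per row.
{code}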

 Speed up path construction in 
 DynamicPartitionWriterContainer.outputWriterForRow
 

 Key: SPARK-8907
 URL: https://issues.apache.org/jira/browse/SPARK-8907
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 Don't use zip and scala collection methods to avoid garbage collection
 {code}
 val partitionPath = partitionColumns.zip(row.toSeq).map { case (col, rawValue) =>
   val string = if (rawValue == null) null else String.valueOf(rawValue)
   val valueString = if (string == null || string.isEmpty) {
     defaultPartitionName
   } else {
     PartitioningUtils.escapePathName(string)
   }
   s"/$col=$valueString"
 }.mkString.stripPrefix(Path.SEPARATOR)
 {code}
 We can probably use catalyst expressions themselves to construct the path, 
 and then we can leverage code generation to do this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4362) Make prediction probability available in NaiveBayesModel

2015-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625157#comment-14625157
 ] 

Apache Spark commented on SPARK-4362:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7376

 Make prediction probability available in NaiveBayesModel
 

 Key: SPARK-4362
 URL: https://issues.apache.org/jira/browse/SPARK-4362
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Jatinpreet Singh
Priority: Minor
  Labels: naive-bayes

 There is currently no way to get the posterior probability of a prediction 
 with a Naive Bayes model during prediction. This should be made available 
 along with the label.
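
Until such an API exists, here is a hedged sketch (plain MLlib, not the API added 
by the linked PR) of recovering posterior class probabilities from a multinomial 
{{NaiveBayesModel}}'s public fields ({{labels}}, {{pi}} = log priors, {{theta}} = 
log conditional probabilities); the helper name is illustrative:

{code}
import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector

// Returns (label, posterior probability) pairs for one feature vector.
def posteriors(model: NaiveBayesModel, features: Vector): Array[(Double, Double)] = {
  val x = features.toArray
  // log P(class) + sum_j x_j * log P(feature_j | class)
  val logLikelihood = model.labels.indices.map { i =>
    model.pi(i) + model.theta(i).zip(x).map { case (logP, c) => logP * c }.sum
  }
  // Normalize with log-sum-exp for numerical stability.
  val maxLog = logLikelihood.max
  val unnormalized = logLikelihood.map(l => math.exp(l - maxLog))
  val total = unnormalized.sum
  model.labels.zip(unnormalized.map(_ / total).toArray)
}
{code}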



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8954) Building Docker Images Fails in 1.4 branch

2015-07-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-8954.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7346
[https://github.com/apache/spark/pull/7346]

 Building Docker Images Fails in 1.4 branch
 --

 Key: SPARK-8954
 URL: https://issues.apache.org/jira/browse/SPARK-8954
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
 Environment: Docker
Reporter: Pradeep Bashyal
 Fix For: 1.5.0


 Docker build on branch 1.4 fails when installing the jdk. It expects 
 tzdata-java as a dependency but adding that to the apt-get install list 
 doesn't help.
 ~/S/s/d/spark-test git:branch-1.4 ❯❯❯ docker build -t spark-test-base base/
 Sending build context to Docker daemon 3.072 kB
 Sending build context to Docker daemon
 Step 0 : FROM ubuntu:precise
  ---> 78cef618c77e
 Step 1 : RUN echo "deb http://archive.ubuntu.com/ubuntu precise main 
 universe" > /etc/apt/sources.list
  ---> Using cache
  ---> 2017472bec85
 Step 2 : RUN apt-get update
  ---> Using cache
  ---> 86b8911ead16
 Step 3 : RUN apt-get install -y less openjdk-7-jre-headless net-tools 
 vim-tiny sudo openssh-server
  ---> Running in dc8197a0ea31
 Reading package lists...
 Building dependency tree...
 Reading state information...
 Some packages could not be installed. This may mean that you have
 requested an impossible situation or if you are using the unstable
 distribution that some required packages have not yet been created
 or been moved out of Incoming.
 The following information may help to resolve the situation:
 The following packages have unmet dependencies:
  openjdk-7-jre-headless : Depends: tzdata-java but it is not going to be 
 installed
 E: Unable to correct problems, you have held broken packages.
 INFO[0004] The command [/bin/sh -c apt-get install -y less 
 openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server] returned a 
 non-zero code: 100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8991) Update SharedParamsCodeGen's Generated Documentation

2015-07-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8991.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7367
[https://github.com/apache/spark/pull/7367]

 Update SharedParamsCodeGen's Generated Documentation
 

 Key: SPARK-8991
 URL: https://issues.apache.org/jira/browse/SPARK-8991
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Feynman Liang
Priority: Trivial
  Labels: Starter
 Fix For: 1.5.0


 We no longer need the {{(private[ml])}} prefix in the generated documentation.
 Specifically, the [generated documentation in 
 SharedParamsCodeGen|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala#L137]
  should be modified from
 {code}
   |/**
   | * (private[ml]) Trait for shared param $name$defaultValueDoc.
   | */
 {code}
 to
 {code}
   |/**
   | * Trait for shared param $name$defaultValueDoc.
   | */
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9017) More timers for MLlib algorithms

2015-07-13 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-9017:


 Summary: More timers for MLlib algorithms
 Key: SPARK-9017
 URL: https://issues.apache.org/jira/browse/SPARK-9017
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


It is useful to provide more instrumentation for MLlib algorithms, such as the 
training time of each stage in k-means. This is an umbrella JIRA for adding more 
timers to MLlib algorithms. The first PR would be a generic timer utility based 
on the one used in trees. Then we can distribute the work. It is also helpful 
for contributors to understand the code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9018) Implement a generic Timer utility for ML algorithms

2015-07-13 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-9018:


 Summary: Implement a generic Timer utility for ML algorithms
 Key: SPARK-9018
 URL: https://issues.apache.org/jira/browse/SPARK-9018
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Xiangrui Meng


The Timer utility should be based on the one implemented in trees. In 
particular, we should offer two versions:

1. a global timer that is initialized on the driver and uses accumulators to 
aggregate times
2. a local timer that is initialized on the worker and only provides per-task 
measurements

Option 1) needs some performance benchmarking and guidance on the granularity. A 
minimal sketch of the local variant, 2), is shown below.
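
A minimal sketch, assuming nothing about the eventual API (class and method names 
are illustrative, in the spirit of the TimeTracker used by decision trees):

{code}
import scala.collection.mutable

// Local, per-worker timer: accumulates wall-clock time per named phase.
class SimpleTimer {
  private val starts = mutable.Map.empty[String, Long]
  private val totals = mutable.Map.empty[String, Long]

  def start(name: String): Unit = starts(name) = System.nanoTime()

  def stop(name: String): Unit = {
    val elapsed = System.nanoTime() - starts(name)
    totals(name) = totals.getOrElse(name, 0L) + elapsed
  }

  override def toString: String =
    totals.map { case (n, t) => f"$n: ${t / 1e9}%.3f s" }.mkString("\n")
}

// Usage:
//   val timer = new SimpleTimer
//   timer.start("findSplits"); /* work */ timer.stop("findSplits")
//   println(timer)
// The global variant would instead add each elapsed value to a driver-side
// accumulator (e.g. sc.accumulator(0L, "phase time")) so times aggregate
// across tasks.
{code}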



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2

2015-07-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9005:
-
Shepherd: Joseph K. Bradley
Assignee: Feynman Liang

 RegressionMetrics computing incorrect explainedVariance and r2
 --

 Key: SPARK-9005
 URL: https://issues.apache.org/jira/browse/SPARK-9005
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Feynman Liang
Assignee: Feynman Liang

 {{RegressionMetrics}} currently computes explainedVariance using 
 {{summary.variance(1)}} (variance of the residuals) where the [Wikipedia 
 definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] 
 uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two 
 coincide only when the predictor is unbiased (e.g. an intercept term is 
 included in a linear model), but this is not always the case. We should 
 change the computation to be consistent with the standard definition; the 
 sketch below illustrates the discrepancy for a biased predictor.
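
A hedged numeric illustration in plain Scala (not the RegressionMetrics code 
itself): with a constant-bias predictor, the variance of the residuals is zero 
while the mean squared residual captures the bias:

{code}
val y    = Array(1.0, 2.0, 3.0, 4.0)
val pred = y.map(_ + 1.0)                                    // biased by +1
val residuals = y.zip(pred).map { case (yi, pi) => yi - pi }

val n = residuals.length
val meanR = residuals.sum / n
val varianceOfResiduals = residuals.map(r => math.pow(r - meanR, 2)).sum / n
val meanSquaredResidual = residuals.map(r => r * r).sum / n

println(varianceOfResiduals)  // 0.0 -- the bias is invisible to the variance
println(meanSquaredResidual)  // 1.0 -- the residual sum of squares sees it
{code}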



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8954) Building Docker Images Fails in 1.4 branch

2015-07-13 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-8954:
--
Assignee: Yong Tang

 Building Docker Images Fails in 1.4 branch
 --

 Key: SPARK-8954
 URL: https://issues.apache.org/jira/browse/SPARK-8954
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.4.0
 Environment: Docker
Reporter: Pradeep Bashyal
Assignee: Yong Tang
 Fix For: 1.5.0


 Docker build on branch 1.4 fails when installing the jdk. It expects 
 tzdata-java as a dependency but adding that to the apt-get install list 
 doesn't help.
 ~/S/s/d/spark-test git:branch-1.4 ❯❯❯ docker build -t spark-test-base base/
 Sending build context to Docker daemon 3.072 kB
 Sending build context to Docker daemon
 Step 0 : FROM ubuntu:precise
  ---> 78cef618c77e
 Step 1 : RUN echo "deb http://archive.ubuntu.com/ubuntu precise main 
 universe" > /etc/apt/sources.list
  ---> Using cache
  ---> 2017472bec85
 Step 2 : RUN apt-get update
  ---> Using cache
  ---> 86b8911ead16
 Step 3 : RUN apt-get install -y less openjdk-7-jre-headless net-tools 
 vim-tiny sudo openssh-server
  ---> Running in dc8197a0ea31
 Reading package lists...
 Building dependency tree...
 Reading state information...
 Some packages could not be installed. This may mean that you have
 requested an impossible situation or if you are using the unstable
 distribution that some required packages have not yet been created
 or been moved out of Incoming.
 The following information may help to resolve the situation:
 The following packages have unmet dependencies:
  openjdk-7-jre-headless : Depends: tzdata-java but it is not going to be 
 installed
 E: Unable to correct problems, you have held broken packages.
 INFO[0004] The command [/bin/sh -c apt-get install -y less 
 openjdk-7-jre-headless net-tools vim-tiny sudo openssh-server] returned a 
 non-zero code: 100



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8991) Update SharedParamsCodeGen's Generated Documentation

2015-07-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8991:
-
Assignee: Vinod KC

 Update SharedParamsCodeGen's Generated Documentation
 

 Key: SPARK-8991
 URL: https://issues.apache.org/jira/browse/SPARK-8991
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Feynman Liang
Assignee: Vinod KC
Priority: Trivial
  Labels: Starter
 Fix For: 1.5.0


 We no longer need the {{(private[ml])}} prefix in the generated documentation.
 Specifically, the [generated documentation in 
 SharedParamsCodeGen|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala#L137]
  should be modified from
 {code}
   |/**
   | * (private[ml]) Trait for shared param $name$defaultValueDoc.
   | */
 {code}
 to
 {code}
   |/**
   | * Trait for shared param $name$defaultValueDoc.
   | */
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8838) Add config to enable/disable merging part-files when merging parquet schema

2015-07-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8838:

Shepherd: Cheng Lian

 Add config to enable/disable merging part-files when merging parquet schema
 ---

 Key: SPARK-8838
 URL: https://issues.apache.org/jira/browse/SPARK-8838
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh

 Currently all part-files are merged when merging the Parquet schema. However, 
 there can be many part-files, and we may already know that all of them have 
 the same schema as their summary file. For that case, we should provide a 
 configuration to disable merging part-files when merging the Parquet schema; 
 a sketch of how such a switch might be used follows.
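
A hedged sketch only; the configuration key and option name below are 
hypothetical placeholders for whatever this issue eventually introduces:

{code}
// Hypothetical SQL configuration flag to skip per-part-file schema merging
// when the summary files are trusted.
sqlContext.setConf("spark.sql.parquet.mergePartFiles.enabled", "false")

// Or, hypothetically, as a per-read option on the Parquet data source.
val df = sqlContext.read
  .option("mergePartFiles", "false")
  .parquet("/path/to/partitioned/table")
{code}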



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6319) DISTINCT doesn't work for binary type

2015-07-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625082#comment-14625082
 ] 

Michael Armbrust commented on SPARK-6319:
-

+1 to throwing an {{AnalysisException}}

 DISTINCT doesn't work for binary type
 -

 Key: SPARK-6319
 URL: https://issues.apache.org/jira/browse/SPARK-6319
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0
Reporter: Cheng Lian
Priority: Critical

 Spark shell session for reproduction:
 {noformat}
 scala> import sqlContext.implicits._
 scala> import org.apache.spark.sql.types._
 scala> Seq(1, 1, 2, 2).map(i => Tuple1(i.toString)).toDF("c").select($"c" 
 cast BinaryType).distinct.show()
 ...
 CAST(c, BinaryType)
 [B@43f13160
 [B@5018b648
 [B@3be22500
 [B@476fc8a1
 {noformat}
 Spark SQL uses plain byte arrays to represent binary values. However, arrays 
 are compared by reference rather than by value. On the other hand, the 
 DISTINCT operator uses a {{HashSet}} and its {{.contains}} method to check 
 for duplicated values. These two facts together cause the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8646) PySpark does not run on YARN

2015-07-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625087#comment-14625087
 ] 

Marcelo Vanzin commented on SPARK-8646:
---

[~j_houg] could you also run the command with the SPARK_PRINT_LAUNCH_COMMAND=1 
env variable set, and post the command logged to stderr?

 PySpark does not run on YARN
 

 Key: SPARK-8646
 URL: https://issues.apache.org/jira/browse/SPARK-8646
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 1.4.0
 Environment: SPARK_HOME=local/path/to/spark1.4install/dir
 also with
 SPARK_HOME=local/path/to/spark1.4install/dir
 PYTHONPATH=$SPARK_HOME/python/lib
 Spark apps are submitted with the command:
 $SPARK_HOME/bin/spark-submit outofstock/data_transform.py 
 hdfs://foe-dev/DEMO_DATA/FACT_POS hdfs:/user/juliet/ex/ yarn-client
 data_transform contains a main method, and the rest of the args are parsed in 
 my own code.
Reporter: Juliet Hougland
 Attachments: executor.log, pi-test.log, 
 spark1.4-SPARK_HOME-set-PYTHONPATH-set.log, 
 spark1.4-SPARK_HOME-set-inline-HADOOP_CONF_DIR.log, 
 spark1.4-SPARK_HOME-set.log, spark1.4-verbose.log, verbose-executor.log


 Running pyspark jobs results in a "no module named pyspark" error when run in 
 yarn-client mode in Spark 1.4.
 [I believe this JIRA represents the change that introduced this error.| 
 https://issues.apache.org/jira/browse/SPARK-6869 ]
 This does not represent a binary-compatible change to Spark. Scripts that 
 worked on previous Spark versions (i.e. commands that use spark-submit) should 
 continue to work without modification between minor versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8950) Correct the calculation of SchedulerDelayTime in StagePage

2015-07-13 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-8950.
---
   Resolution: Fixed
 Assignee: Carson Wang
Fix Version/s: 1.5.0

 Correct the calculation of SchedulerDelayTime in StagePage 
 ---

 Key: SPARK-8950
 URL: https://issues.apache.org/jira/browse/SPARK-8950
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.4.0
Reporter: Carson Wang
Assignee: Carson Wang
Priority: Minor
 Fix For: 1.5.0


 In StagePage, the SchedulerDelay is calculated as totalExecutionTime - 
 executorRunTime - executorOverhead - gettingResultTime.
 But totalExecutionTime is calculated in a way that does not include the 
 gettingResultTime, so the fetch time is subtracted from a total that never 
 contained it and the reported delay is too small (see the illustration below).
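
A hedged numeric illustration (values in ms; names are illustrative, not the 
actual StagePage fields):

{code}
val executorRunTime    = 800L
val executorOverhead   = 50L
val gettingResultTime  = 100L
val trueSchedulerDelay = 50L

// A total that excludes the result-fetching time...
val totalWithoutFetch = executorRunTime + executorOverhead + trueSchedulerDelay

// ...makes the existing formula under-report the delay:
println(totalWithoutFetch - executorRunTime - executorOverhead - gettingResultTime)
// -50 instead of 50

// With a total that spans launch to finish (and therefore includes the fetch),
// the same formula recovers the true delay:
val totalWithFetch = totalWithoutFetch + gettingResultTime
println(totalWithFetch - executorRunTime - executorOverhead - gettingResultTime)
// 50
{code}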



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9016) Make the random forest classifiers implement classification trait

2015-07-13 Thread holdenk (JIRA)
holdenk created SPARK-9016:
--

 Summary: Make the random forest classifiers implement 
classification trait
 Key: SPARK-9016
 URL: https://issues.apache.org/jira/browse/SPARK-9016
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: holdenk
Priority: Minor


This is a blocking issue for https://issues.apache.org/jira/browse/SPARK-8069 . 
Since we want to add thresholding/cutoff support to RandomForest, and we wish to 
do this in a general way, we should move RandomForest over to the Classification 
trait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


