[jira] [Comment Edited] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623704#comment-14623704 ] Yanbo Liang edited comment on SPARK-9003 at 7/12/15 8:19 AM: - Yes, I can provide an example that shows the benefit of these functions. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vectors/Matrices, which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only write the following: val prediction = Vectors.dense(originalPrediction.toArray.map(x => x*10)) assert(prediction ~== expected absTol 0.01, "prediction error") If we support map/update for Vector, we can write: assert(originalPrediction.map(x => x*10) ~== expected absTol 0.01, "prediction error") was (Author: yanboliang): Yes, I can provide an example that shows the benefit of these functions. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vectors/Matrices, which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only write the following: val prediction = Vectors.dense(originalPrediction.toArry.map(x => x*10)) assert(prediction ~== expected absTol 0.01, "prediction error") If we support map/update for Vector, we can write: assert(originalPrediction.map(x => x*10) ~== expected absTol 0.01, "prediction error") Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector lacks map/update functions, which is inconvenient for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each element of a and get a Vector as the return value, we can only write: val b = Vectors.dense(a.toArray.map(math.log)) This snippet is not elegant; we would like to be able to write: val c = a.map(math.log) Also, MLlib/Matrix already implements map/update; I think Vector should have map/update as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623704#comment-14623704 ] Yanbo Liang commented on SPARK-9003: Yes, I can provide an example that shows the benefit of these functions. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vectors/Matrices, which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only write the following: val prediction = Vectors.dense(originalPrediction.toArray.map(x => x*10)) assert(prediction ~== expected absTol 0.01, "prediction error") If we support map/update for Vector, we can write: assert(originalPrediction.map(x => x*10) ~== expected absTol 0.01, "prediction error") Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector lacks map/update functions, which is inconvenient for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each element of a and get a Vector as the return value, we can only write: val b = Vectors.dense(a.toArray.map(math.log)) This snippet is not elegant; we would like to be able to write: val c = a.map(math.log) Also, MLlib/Matrix already implements map/update; I think Vector should have map/update as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
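For context, a minimal sketch of what such a map could look like as an extension on the current API. This is not an existing MLlib method: Vectors.dense and toArray are real MLlib calls, while VectorMapOps is a hypothetical helper used only for illustration.
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical extension method; MLlib's Vector does not provide map today.
implicit class VectorMapOps(v: Vector) {
  // Apply f to every element and return a new dense Vector.
  def map(f: Double => Double): Vector = Vectors.dense(v.toArray.map(f))
}

val originalPrediction = Vectors.dense(Array(1.0, 2.0, 3.0))
val scaled = originalPrediction.map(_ * 10)  // [10.0, 20.0, 30.0]
{code}
With a helper like this (or a built-in method), the test above can drop the explicit toArray round trip.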
[jira] [Created] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
Davies Liu created SPARK-9006: - Summary: TimestampType may lose a microsecond after a round trip in Python DataFrame Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker This bug causes SQLTests.test_time_with_timezone to be flaky. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-9006: -- Description: This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. (was: This bug causes SQLTests.test_time_with_timezone flaky.) TimestampType may lose a microsecond after a round trip in Python DataFrame --- Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9006: --- Assignee: Apache Spark (was: Davies Liu) TimestampType may lose a microsecond after a round trip in Python DataFrame --- Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Apache Spark Priority: Blocker This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9006: --- Assignee: Davies Liu (was: Apache Spark) TimestampType may lose a microsecond after a round trip in Python DataFrame --- Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624144#comment-14624144 ] Apache Spark commented on SPARK-9006: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/7363 TimestampType may lose a microsecond after a round trip in Python DataFrame --- Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9007) start-slave.sh changed API in 1.4 and the documentation got updated to mention the old API
Jesper Lundgren created SPARK-9007: -- Summary: start-slave.sh changed API in 1.4 and the documentation got updated to mention the old API Key: SPARK-9007 URL: https://issues.apache.org/jira/browse/SPARK-9007 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.4.0 Reporter: Jesper Lundgren In Spark versions before 1.4, start-slave.sh accepted two parameters: worker# and a list of master addresses. With Spark 1.4 the start-slave.sh worker# parameter was removed, which broke our custom standalone cluster setup. With Spark 1.4 the documentation was also updated to mention start-slave.sh (not previously mentioned), but it describes the old API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9008) Stop and remove driver from supervised mode in spark-master interface
Jesper Lundgren created SPARK-9008: -- Summary: Stop and remove driver from supervised mode in spark-master interface Key: SPARK-9008 URL: https://issues.apache.org/jira/browse/SPARK-9008 Project: Spark Issue Type: New Feature Reporter: Jesper Lundgren The cluster will automatically restart failing drivers when launched in supervised cluster mode. However, there is no official way for an operations team to stop and remove a driver from restarting in case it is malfunctioning. I know there is bin/spark-class org.apache.spark.deploy.Client kill, but this is undocumented and does not always work well. It would be great if there were a way to remove supervised mode to allow kill -9 to work on a driver program. The documentation surrounding this could also see some improvements. It would be nice to have some best practice examples on how to work with supervised mode, how to manage graceful shutdown, and how to catch TERM signals. (A TERM signal will end with the wrong exit code and trigger a restart when using supervised mode unless you change the exit code in the application logic.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5571) LDA should handle text as well
[ https://issues.apache.org/jira/browse/SPARK-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624152#comment-14624152 ] Feynman Liang commented on SPARK-5571: -- [~a...@jivesoftware.com], are you still working on this? I wanted to point out [CountVectorizer|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala] was recently merged and seems appropriate for this task. If you aren't working on this anymore, I would be happy to take this task. LDA should handle text as well -- Key: SPARK-5571 URL: https://issues.apache.org/jira/browse/SPARK-5571 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Latent Dirichlet Allocation (LDA) currently operates only on vectors of word counts. It should also support training and prediction using text (Strings). This plan is sketched in the [original LDA design doc|https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo/edit?usp=sharing]. There should be: * a runWithText() method which takes an RDD with a collection of Strings (bags of words). This will also index terms and compute a dictionary. * a dictionary parameter for when LDA is run with word count vectors * prediction/feedback methods returning Strings (such as describeTopicsAsStrings, which is commented out in LDA currently) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
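As a rough illustration of what a runWithText() helper would need to do internally, here is a hedged sketch using only existing MLlib APIs. It assumes whitespace tokenization, a vocabulary small enough to collect on the driver, and an input file path; the variable names are illustrative and not part of any proposed API.
{code}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

// Assumed input: one bag of words per document ("docs.txt" is a placeholder path).
val docs: RDD[Seq[String]] =
  sc.textFile("docs.txt").map(_.toLowerCase.split("\\s+").toSeq)

// Build the dictionary (term -> index) that runWithText() would compute internally.
val vocab: Map[String, Int] =
  docs.flatMap(identity).distinct().collect().zipWithIndex.toMap

// Convert each document to a sparse vector of term counts keyed by document id.
val corpus = docs.zipWithIndex.map { case (tokens, id) =>
  val counts = tokens.groupBy(vocab).mapValues(_.size.toDouble).toSeq
  (id, Vectors.sparse(vocab.size, counts))
}

val ldaModel = new LDA().setK(10).run(corpus)
{code}
A CountVectorizer-based Transformer would replace the manual dictionary and counting steps in an ML pipeline setting.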
[jira] [Commented] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch
[ https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624151#comment-14624151 ] Jesper Lundgren commented on SPARK-8941: I've created two new JIRA tickets; can you review them to see if they are OK? https://issues.apache.org/jira/browse/SPARK-9007 https://issues.apache.org/jira/browse/SPARK-9008 Thanks! Standalone cluster worker does not accept multiple masters on launch Key: SPARK-8941 URL: https://issues.apache.org/jira/browse/SPARK-8941 Project: Spark Issue Type: Bug Components: Deploy, Documentation Affects Versions: 1.4.0, 1.4.1 Reporter: Jesper Lundgren Priority: Critical Before 1.4 it was possible to launch a worker node using a comma-separated list of master nodes. ex: sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078 starting org.apache.spark.deploy.worker.Worker, logging to /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf. 15/07/09 12:33:06 INFO Utils: Shutdown hook called Spark 1.2 and 1.3.1 accept multiple masters in this format. update: start-slave.sh only expects master lists in 1.4 (no instance number) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9008) Stop and remove driver from supervised mode in spark-master interface
[ https://issues.apache.org/jira/browse/SPARK-9008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jesper Lundgren updated SPARK-9008: --- Description: The cluster will automatically restart failing drivers when launched in supervised cluster mode. However, there is no official way for an operations team to stop and remove a driver from restarting in case it is malfunctioning. I know there is bin/spark-class org.apache.spark.deploy.Client kill, but this is undocumented and does not always work well. It would be great if there were a way to remove supervised mode to allow kill -9 to work on a driver program. The documentation surrounding this could also see some improvements. It would be nice to have some best practice examples on how to work with supervised mode, how to manage graceful shutdown, and how to catch TERM signals. (A TERM signal will end with an exit code that triggers a restart in supervised mode unless you change the exit code in the application logic.) was: The cluster will automatically restart failing drivers when launched in supervised cluster mode. However there is no official way for an operations team to stop and remove a driver from restarting in case it is malfunctioning. I know there is bin/spark-class org.apache.spark.deploy.Client kill but this is undocumented and does not always work so well. It would be great if there was a way to remove supervised mode to allow kill -9 to work on a driver program. The documentation surrounding this could also see some improvements. It would be nice to have some best practice examples on how to work with supervised mode, how to manage graceful shutdown and catch TERM signals. (TERM signal will end with wrong exit code and trigger restart when using supervised mode unless you change the exit code in the application logic) Stop and remove driver from supervised mode in spark-master interface - Key: SPARK-9008 URL: https://issues.apache.org/jira/browse/SPARK-9008 Project: Spark Issue Type: New Feature Reporter: Jesper Lundgren The cluster will automatically restart failing drivers when launched in supervised cluster mode. However, there is no official way for an operations team to stop and remove a driver from restarting in case it is malfunctioning. I know there is bin/spark-class org.apache.spark.deploy.Client kill, but this is undocumented and does not always work well. It would be great if there were a way to remove supervised mode to allow kill -9 to work on a driver program. The documentation surrounding this could also see some improvements. It would be nice to have some best practice examples on how to work with supervised mode, how to manage graceful shutdown, and how to catch TERM signals. (A TERM signal will end with an exit code that triggers a restart in supervised mode unless you change the exit code in the application logic.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9006) TimestampType may lose a microsecond after a round trip in Python DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-9006. --- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 7363 [https://github.com/apache/spark/pull/7363] TimestampType may lose a microsecond after a round trip in Python DataFrame --- Key: SPARK-9006 URL: https://issues.apache.org/jira/browse/SPARK-9006 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.5.0 This bug causes SQLTests.test_time_with_timezone to be flaky in Python 3. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8488) HOG Feature Transformer
[ https://issues.apache.org/jira/browse/SPARK-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-8488: - Description: Histogram of oriented gradients (HOG) is a method utilizing local orientation (gradients and edges) to transform images into dense image descriptors (Dalal & Triggs, CVPR 2005, http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf). HOG in Spark ML pipelines can be implemented as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the transformer should output an Array[Array[Numeric]] of the HOG features for the provided image. HOG and SIFT are similar in that they both represent images using local orientation histograms. In contrast to SIFT, however, HOG uses overlapping spatial blocks and is computed densely across all pixels. was: Histogram of oriented gradients (HOG) is a method utilizing local orientation (gradients and edges) to transform images into dense image descriptors (Dalal & Triggs, CVPR 2005, http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf). HOG in Spark ML pipelines can be implemented as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the SIFT transformer should output an Array[Array[Numeric]] of the HOG features for the provided image. HOG and SIFT are similar in that they both represent images using local orientation histograms. In contrast to SIFT, however, HOG uses overlapping spatial blocks and is computed densely across all pixels. HOG Feature Transformer --- Key: SPARK-8488 URL: https://issues.apache.org/jira/browse/SPARK-8488 Project: Spark Issue Type: Sub-task Components: ML Reporter: Feynman Liang Priority: Minor Histogram of oriented gradients (HOG) is a method utilizing local orientation (gradients and edges) to transform images into dense image descriptors (Dalal & Triggs, CVPR 2005, http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf). HOG in Spark ML pipelines can be implemented as an org.apache.spark.ml.Transformer. Given an image Array[Array[Numeric]], the transformer should output an Array[Array[Numeric]] of the HOG features for the provided image. HOG and SIFT are similar in that they both represent images using local orientation histograms. In contrast to SIFT, however, HOG uses overlapping spatial blocks and is computed densely across all pixels. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
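A minimal sketch of the core computation such a Transformer would perform: a magnitude-weighted orientation histogram for one cell of a grayscale image given as Array[Array[Double]]. Block overlap, normalization, and the ml.Transformer wrapper are omitted, and cellHistogram is an illustrative name rather than an existing Spark API.
{code}
// Compute an orientation histogram for one cell of a grayscale image.
def cellHistogram(img: Array[Array[Double]], bins: Int = 9): Array[Double] = {
  val hist = new Array[Double](bins)
  for (y <- 1 until img.length - 1; x <- 1 until img(y).length - 1) {
    val gx = img(y)(x + 1) - img(y)(x - 1)   // horizontal gradient
    val gy = img(y + 1)(x) - img(y - 1)(x)   // vertical gradient
    val magnitude = math.sqrt(gx * gx + gy * gy)
    // Unsigned orientation in [0, Pi), mapped to a bin index.
    val angle = (math.atan2(gy, gx) + math.Pi) % math.Pi
    val bin = math.min(bins - 1, (angle / math.Pi * bins).toInt)
    hist(bin) += magnitude                   // magnitude-weighted vote
  }
  hist
}
{code}
A full HOG descriptor would concatenate such histograms over a dense grid of cells and normalize them within overlapping blocks.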
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623956#comment-14623956 ] Josh Rosen commented on SPARK-4879: --- [~darabos], do you think that this issue might have been resolved in an earlier Spark version but inadvertently broken in the upgrade to 1.4.0? If you have an easy reproduction, it might be helpful to see whether the problem occurs on 1.3.1. Missing output partitions after job completes with speculative execution Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 Attachments: speculation.txt, speculation2.txt When speculative execution is enabled ({{spark.speculation=true}}), jobs that save output files may report that they have completed successfully even though some output partitions written by speculative tasks may be missing. h3. Reproduction This symptom was reported to me by a Spark user and I've been doing my own investigation to try to come up with an in-house reproduction. I'm still working on a reliable local reproduction for this issue, which is a little tricky because Spark won't schedule speculated tasks on the same host as the original task, so you need an actual (or containerized) multi-host cluster to test speculation. Here's a simple reproduction of some of the symptoms on EC2, which can be run in {{spark-shell}} with {{--conf spark.speculation=true}}: {code} // Rig a job such that all but one of the tasks complete instantly // and one task runs for 20 seconds on its first attempt and instantly // on its second attempt: val numTasks = 100 sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) => if (ctx.partitionId == 0) { // If this is the one task that should run really slow if (ctx.attemptId == 0) { // If this is the first attempt, run slow Thread.sleep(20 * 1000) } } iter }.map(x => (x, x)).saveAsTextFile("/test4") {code} When I run this, I end up with a job that completes quickly (due to speculation) but reports failures from the speculated task: {code} [...] 
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal (100/100) 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at <console>:22) finished in 0.856 s 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:22, took 0.885438374 s 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event for 70.1 in stage 3.0 because task 70 has already completed successfully scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201412110141_0003_m_49_413 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} One interesting thing to note about this stack trace: if we look at {{FileOutputCommitter.java:160}} ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]), this point in the execution seems to correspond to a case where a task completes, attempts to commit its output, fails for some reason, then deletes the destination file, tries again, and fails: {code} if
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624000#comment-14624000 ] Daniel Darabos commented on SPARK-4879: --- Good idea! I'll try with 1.3.1 next week. Missing output partitions after job completes with speculative execution Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 Attachments: speculation.txt, speculation2.txt When speculative execution is enabled ({{spark.speculation=true}}), jobs that save output files may report that they have completed successfully even though some output partitions written by speculative tasks may be missing. h3. Reproduction This symptom was reported to me by a Spark user and I've been doing my own investigation to try to come up with an in-house reproduction. I'm still working on a reliable local reproduction for this issue, which is a little tricky because Spark won't schedule speculated tasks on the same host as the original task, so you need an actual (or containerized) multi-host cluster to test speculation. Here's a simple reproduction of some of the symptoms on EC2, which can be run in {{spark-shell}} with {{--conf spark.speculation=true}}: {code} // Rig a job such that all but one of the tasks complete instantly // and one task runs for 20 seconds on its first attempt and instantly // on its second attempt: val numTasks = 100 sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) => if (ctx.partitionId == 0) { // If this is the one task that should run really slow if (ctx.attemptId == 0) { // If this is the first attempt, run slow Thread.sleep(20 * 1000) } } iter }.map(x => (x, x)).saveAsTextFile("/test4") {code} When I run this, I end up with a job that completes quickly (due to speculation) but reports failures from the speculated task: {code} [...] 
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal (100/100) 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at <console>:22) finished in 0.856 s 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:22, took 0.885438374 s 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event for 70.1 in stage 3.0 because task 70 has already completed successfully scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201412110141_0003_m_49_413 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} One interesting thing to note about this stack trace: if we look at {{FileOutputCommitter.java:160}} ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]), this point in the execution seems to correspond to a case where a task completes, attempts to commit its output, fails for some reason, then deletes the destination file, tries again, and fails: {code} if (fs.isFile(taskOutput)) { 152 Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput, 153 getTempTaskOutputPath(context)); 154 if (!fs.rename(taskOutput,
[jira] [Created] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
Feynman Liang created SPARK-9005: Summary: RegressionMetrics computing incorrect explainedVariance and r2 Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
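For reference, a small sketch of the Wikipedia quantities discussed above, computed directly from (prediction, label) pairs in plain Scala; this is illustrative only and is not the RegressionMetrics implementation, and the function name is made up for the example.
{code}
// Returns (fraction of variance unexplained, unadjusted R2) for (prediction, label) pairs.
def regressionSummary(pairs: Seq[(Double, Double)]): (Double, Double) = {
  val meanLabel = pairs.map(_._2).sum / pairs.size
  val ssRes = pairs.map { case (p, y) => math.pow(y - p, 2) }.sum         // residual sum of squares
  val ssTot = pairs.map { case (_, y) => math.pow(y - meanLabel, 2) }.sum // total sum of squares
  val fractionUnexplained = ssRes / ssTot   // Wikipedia's fraction of variance unexplained
  val r2 = 1.0 - ssRes / ssTot              // unadjusted R2, no degrees-of-freedom correction
  (fractionUnexplained, r2)
}
{code}
An adjusted R2 would additionally divide both sums of squares by their degrees of freedom, which requires knowing the number of regression variables.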
[jira] [Commented] (SPARK-8941) Standalone cluster worker does not accept multiple masters on launch
[ https://issues.apache.org/jira/browse/SPARK-8941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623952#comment-14623952 ] Josh Rosen commented on SPARK-8941: --- SGTM; do you want to open a new JIRA to follow up on the documentation issues, plus separate issues for the other problems you've identified? If you do this, just link the issues here and I'll close this one out. Thanks! Standalone cluster worker does not accept multiple masters on launch Key: SPARK-8941 URL: https://issues.apache.org/jira/browse/SPARK-8941 Project: Spark Issue Type: Bug Components: Deploy, Documentation Affects Versions: 1.4.0, 1.4.1 Reporter: Jesper Lundgren Priority: Critical Before 1.4 it was possible to launch a worker node using a comma-separated list of master nodes. ex: sbin/start-slave.sh 1 spark://localhost:7077,localhost:7078 starting org.apache.spark.deploy.worker.Worker, logging to /Users/jesper/Downloads/spark-1.4.0-bin-cdh4/sbin/../logs/spark-jesper-org.apache.spark.deploy.worker.Worker-1-Jespers-MacBook-Air.local.out failed to launch org.apache.spark.deploy.worker.Worker: Default is conf/spark-defaults.conf. 15/07/09 12:33:06 INFO Utils: Shutdown hook called Spark 1.2 and 1.3.1 accept multiple masters in this format. update: start-slave.sh only expects master lists in 1.4 (no instance number) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623944#comment-14623944 ] Feynman Liang commented on SPARK-9005: -- I will be working on this. RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8450) PySpark write.parquet raises Unsupported datatype DecimalType()
[ https://issues.apache.org/jira/browse/SPARK-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623999#comment-14623999 ] Peter Hoffmann commented on SPARK-8450: --- I have tried it with today's spark-1.5.0-SNAPSHOT-bin-hadoop2.6 daily build from http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/ and was able to save DecimalType(16,2) as parquet in Python. Thanks for the quick fix! PySpark write.parquet raises Unsupported datatype DecimalType() --- Key: SPARK-8450 URL: https://issues.apache.org/jira/browse/SPARK-8450 Project: Spark Issue Type: Bug Components: PySpark, SQL Environment: Spark 1.4.0 on Debian Reporter: Peter Hoffmann Assignee: Davies Liu Fix For: 1.5.0 I'm getting an Exception when I try to save a DataFrame with a DecimalType as a parquet file. Minimal Example: {code} from decimal import Decimal from pyspark.sql import SQLContext from pyspark.sql.types import * sqlContext = SQLContext(sc) schema = StructType([ StructField('id', LongType()), StructField('value', DecimalType())]) rdd = sc.parallelize([[1, Decimal(0.5)],[2, Decimal(2.9)]]) df = sqlContext.createDataFrame(rdd, schema) df.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite') {code} Stack Trace {code} --- Py4JJavaError Traceback (most recent call last) <ipython-input-19-a77dac8de5f3> in <module>() 1 sr.write.parquet("hdfs://srv:9000/user/ph/decimal.parquet", 'overwrite') /home/spark/spark-1.4.0-bin-hadoop2.6/python/pyspark/sql/readwriter.pyc in parquet(self, path, mode) 367 :param mode: one of `append`, `overwrite`, `error`, `ignore` (default: error) 368 --> 369 return self._jwrite.mode(mode).parquet(path) 370 371 @since(1.4) /home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args) 536 answer = self.gateway_client.send_command(command) 537 return_value = get_return_value(answer, self.gateway_client, --> 538 self.target_id, self.name) 539 540 for temp_arg in temp_args: /home/spark/spark-1.4.0-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. --> 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling o361.parquet. : org.apache.spark.SparkException: Job aborted. 
at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.insert(commands.scala:138) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.run(commands.scala:114) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:68) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:88) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:87) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:939) at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:939) at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:332) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:281) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at
[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored
[ https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624086#comment-14624086 ] Patrick Wendell commented on SPARK-2089: Yeah - we can open it again later if someone who maintains this code wants to work on this feature. I just want this JIRA to reflect the current status (i.e. for 5 versions there hasn't been any action in Spark), which is that it is not actively being fixed, and to make sure the documentation correctly reflects what we have now, to discourage the use of a feature that does not work. With YARN, preferredNodeLocalityData isn't honored --- Key: SPARK-2089 URL: https://issues.apache.org/jira/browse/SPARK-2089 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Priority: Critical When running in YARN cluster mode, apps can pass preferred locality data when constructing a Spark context that will dictate where to request executor containers. This is currently broken because of a race condition. The Spark-YARN code runs the user class and waits for it to start up a SparkContext. During its initialization, the SparkContext will create a YarnClusterScheduler, which notifies a monitor in the Spark-YARN code. The Spark-YARN code then immediately fetches the preferredNodeLocationData from the SparkContext and uses it to start requesting containers. But in the SparkContext constructor that takes the preferredNodeLocationData, setting preferredNodeLocationData comes after the rest of the initialization, so, if the Spark-YARN code comes around quickly enough after being notified, the data that's fetched is the empty unset version. This occurred during all of my runs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9005: - Description: {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change to be consistent. (was: {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2.) RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change to be consistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed
[ https://issues.apache.org/jira/browse/SPARK-8743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624113#comment-14624113 ] Apache Spark commented on SPARK-8743: - User 'nssalian' has created a pull request for this issue: https://github.com/apache/spark/pull/7362 Deregister Codahale metrics for streaming when StreamingContext is closed -- Key: SPARK-8743 URL: https://issues.apache.org/jira/browse/SPARK-8743 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.4.1 Reporter: Tathagata Das Assignee: Neelesh Srinivas Salian Labels: starter Currently, when the StreamingContext is closed, the registered metrics are not deregistered. If another streaming context is started, it throws a warning saying that the metrics are already registered. The solution is to deregister the metrics when the StreamingContext is stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
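An illustrative sketch of the deregistration step at the Dropwizard/Codahale level (MetricRegistry.removeMatching and MetricFilter are real Codahale APIs); how and where this is wired into StreamingContext.stop() is left to the actual fix, and the method name below is an assumption.
{code}
import com.codahale.metrics.{Metric, MetricFilter, MetricRegistry}

// Illustrative only: remove all metrics registered under a streaming source's
// prefix when the StreamingContext stops, so a new context can re-register them.
def deregisterStreamingMetrics(registry: MetricRegistry, prefix: String): Unit = {
  registry.removeMatching(new MetricFilter {
    override def matches(name: String, metric: Metric): Boolean = name.startsWith(prefix)
  })
}
{code}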
[jira] [Resolved] (SPARK-8880) Fix confusing Stage.attemptId member variable
[ https://issues.apache.org/jira/browse/SPARK-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-8880. --- Resolution: Fixed Fix Version/s: 1.5.0 Fix confusing Stage.attemptId member variable - Key: SPARK-8880 URL: https://issues.apache.org/jira/browse/SPARK-8880 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor Fix For: 1.5.0 This variable very confusingly refers to the *next* stageId that should be used, making this code especially hard to understand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8956) Rollup produces incorrect result when group by contains expressions
[ https://issues.apache.org/jira/browse/SPARK-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624121#comment-14624121 ] Cheng Hao commented on SPARK-8956: -- Sorry, I didn't notice this jira issue when I created this issue SPARK-8972. Rollup produces incorrect result when group by contains expressions --- Key: SPARK-8956 URL: https://issues.apache.org/jira/browse/SPARK-8956 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yana Kadiyska Rollup produces incorrect results when group clause contains an expression {code}case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("foo") sqlContext.sql("select count(*) as cnt, key % 100 as key, GROUPING__ID from foo group by key%100 with rollup").show(100) {code} As a workaround, this works correctly: {code} val df1 = df.withColumn("newkey", df("key") % 100) df1.registerTempTable("foo1") sqlContext.sql("select count(*) as cnt, newkey as key, GROUPING__ID as grp from foo1 group by newkey with rollup").show(100) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8972) Incorrect result for rollup
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-8972: - Description: {code:java} import sqlContext.implicits._ case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("foo") sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key%100 with rollup").show(100) // output +---+---+------------+ |cnt|_c1|GROUPING__ID| +---+---+------------+ | 1| 4| 0| | 1| 4| 1| | 1| 5| 0| | 1| 5| 1| | 1| 1| 0| | 1| 1| 1| | 1| 2| 0| | 1| 2| 1| | 1| 3| 0| | 1| 3| 1| +---+---+------------+ {code} After checking the code, it seems we don't support complex expressions (only simple column names) as GROUP BY keys for rollup, nor for cube. It does not even report an error if we have a complex expression in the rollup keys, hence we get the very confusing result in the example above. was: {code:java} import sqlContext.implicits._ case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("foo") sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key%100 with rollup").show(100) // output +---+---+------------+ |cnt|_c1|GROUPING__ID| +---+---+------------+ | 1| 4| 0| | 1| 4| 1| | 1| 5| 0| | 1| 5| 1| | 1| 1| 0| | 1| 1| 1| | 1| 2| 0| | 1| 2| 1| | 1| 3| 0| | 1| 3| 1| +---+---+------------+ {code} Incorrect result for rollup --- Key: SPARK-8972 URL: https://issues.apache.org/jira/browse/SPARK-8972 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Priority: Critical {code:java} import sqlContext.implicits._ case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 5).map(i => KeyValue(i, i.toString)).toDF df.registerTempTable("foo") sqlContext.sql("select count(*) as cnt, key % 100, GROUPING__ID from foo group by key%100 with rollup").show(100) // output +---+---+------------+ |cnt|_c1|GROUPING__ID| +---+---+------------+ | 1| 4| 0| | 1| 4| 1| | 1| 5| 0| | 1| 5| 1| | 1| 1| 0| | 1| 1| 1| | 1| 2| 0| | 1| 2| 1| | 1| 3| 0| | 1| 3| 1| +---+---+------------+ {code} After checking the code, it seems we don't support complex expressions (only simple column names) as GROUP BY keys for rollup, nor for cube. It does not even report an error if we have a complex expression in the rollup keys, hence we get the very confusing result in the example above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8415) Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock
[ https://issues.apache.org/jira/browse/SPARK-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624058#comment-14624058 ] Josh Rosen commented on SPARK-8415: --- I figured out how to configure AMPLab Jenkins to use a separate ivy cache for each pull request builder workspace. In the Jenkins environment / properties injection, I added the following lines {code} HOME=/home/sparkivy/${JOB_NAME}_${EXECUTOR_NUMBER} SBT_OPTS=-Duser.home=/home/sparkivy/${JOB_NAME}_${EXECUTOR_NUMBER} -Dsbt.ivy.home=/home/sparkivy/${JOB_NAME}_${EXECUTOR_NUMBER}/.ivy2 {code} Here, {{/home/sparkivy}} is a directory that's outside of the build workspace so it won't be deleted by the {{git clean -fdx}} in our Jenkins build. The substitutions ensure that each build gets its own independent directory. I'm going to mark this issue as resolved since I'm switching the main SparkPullRequestBuilder to use this configuration change. Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock --- Key: SPARK-8415 URL: https://issues.apache.org/jira/browse/SPARK-8415 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Josh Rosen When watching a pull request build, I noticed that the compilation + packaging + test compilation phases spent huge amounts of time waiting to acquire the Ivy cache lock. We should see whether we can tell SBT to skip the resolution steps for some of these commands, since this could speed up the compilation process when Jenkins is heavily loaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8415) Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock
[ https://issues.apache.org/jira/browse/SPARK-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-8415. --- Resolution: Fixed Assignee: Josh Rosen Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock --- Key: SPARK-8415 URL: https://issues.apache.org/jira/browse/SPARK-8415 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Josh Rosen Assignee: Josh Rosen When watching a pull request build, I noticed that the compilation + packaging + test compilation phases spent huge amounts of time waiting to acquire the Ivy cache lock. We should see whether we can tell SBT to skip the resolution steps for some of these commands, since this could speed up the compilation process when Jenkins is heavily loaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8415) Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock
[ https://issues.apache.org/jira/browse/SPARK-8415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624061#comment-14624061 ] Josh Rosen commented on SPARK-8415: --- Oh, and I also added a {{mkdir -p $HOME}} to the execute shell command. Jenkins compilation spends lots of time re-resolving dependencies and waiting to acquire Ivy cache lock --- Key: SPARK-8415 URL: https://issues.apache.org/jira/browse/SPARK-8415 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Josh Rosen Assignee: Josh Rosen When watching a pull request build, I noticed that the compilation + packaging + test compilation phases spent huge amounts of time waiting to acquire the Ivy cache lock. We should see whether we can tell SBT to skip the resolution steps for some of these commands, since this could speed up the compilation process when Jenkins is heavily loaded. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9005: --- Assignee: (was: Apache Spark) RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624077#comment-14624077 ] Apache Spark commented on SPARK-9005: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7361 RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9005: --- Assignee: Apache Spark RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang Assignee: Apache Spark {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) whereas the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegressionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9005) RegressionMetrics computing incorrect explainedVariance and r2
[ https://issues.apache.org/jira/browse/SPARK-9005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feynman Liang updated SPARK-9005: - Description: {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) where the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegresionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. was: {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) where the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegresionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. RegressionMetrics computing incorrect explainedVariance and r2 -- Key: SPARK-9005 URL: https://issues.apache.org/jira/browse/SPARK-9005 Project: Spark Issue Type: Bug Components: MLlib Reporter: Feynman Liang {{RegressionMetrics}} currently computes explainedVariance using {{summary.variance(1)}} (variance of the residuals) where the [Wikipedia definition|https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained] uses the residual sum of squares {{math.pow(summary.normL2(1), 2)}}. The two coincide only when the predictor is unbiased (e.g. an intercept term is included in a linear model), but this is not always the case. We should change to be consistent. The computation for r2 is also currently incorrect. Multiplying by {{summary.count - 1}} appears to be trying to compute an adjusted r2, but the lack of a DoF adjustment in the numerator makes the computation inconsistent with [Wikipedia's definition|https://en.wikipedia.org/wiki/Coefficient_of_determination]. Since {{RegresionMetrics}} is not given the number of regression variables, we should modify and explicitly document that this computes unadjusted R2. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
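To make the distinction concrete, here is a minimal plain-Scala sketch (illustrative only, not the {{RegressionMetrics}} implementation; the sample predictions and labels are made up). It contrasts the variance of the residuals, which is what {{summary.variance(1)}} gives, with the residual sum of squares that the fraction-of-variance-unexplained definition is built on, and computes an unadjusted R2 with no degrees-of-freedom correction:
{code}
// Toy (prediction, label) pairs, made up for illustration.
val predictionsAndLabels = Seq((2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0))

val n = predictionsAndLabels.size.toDouble
val labels = predictionsAndLabels.map(_._2)
val labelMean = labels.sum / n

// Residual sum of squares: what the Wikipedia FVU definition uses; on a summarizer
// over the residual column this is math.pow(summary.normL2(1), 2).
val ssErr = predictionsAndLabels.map { case (p, l) => (l - p) * (l - p) }.sum

// Total sum of squares of the labels around their mean.
val ssTot = labels.map(l => (l - labelMean) * (l - labelMean)).sum

// Sample variance of the residuals: what summary.variance(1) gives. Beyond the
// obvious scale factor, this only tracks ssErr when the residual mean is zero.
val residuals = predictionsAndLabels.map { case (p, l) => l - p }
val residualMean = residuals.sum / n
val residualVariance =
  residuals.map(r => (r - residualMean) * (r - residualMean)).sum / (n - 1)

// Unadjusted R2 per the usual definition, with no degrees-of-freedom adjustment.
val r2 = 1.0 - ssErr / ssTot

println(s"ssErr=$ssErr residualVariance=$residualVariance unadjustedR2=$r2")
{code}
The two quantities only coincide (up to scaling) when the residuals are centered at zero, which is the unbiased-predictor case the updated description calls out.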
[jira] [Comment Edited] (SPARK-8997) Improve LocalPrefixSpan performance
[ https://issues.apache.org/jira/browse/SPARK-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623292#comment-14623292 ] Feynman Liang edited comment on SPARK-8997 at 7/12/15 11:43 PM: Why PrimitiveKeyOpenHashMap if keys will be Array[Int] (and later Array[Array[Item]]), which are not primitive and will not benefit from @specialized annotations? I'm also not clear on what is meant by 3; aren't list and array both eager (did you mean to use a Stream (lazy) or ArrayBuffer (in-place update))? Which part of the code exactly are you referring to? was (Author: fliang): Why PrimitiveKeyOpenHashMap if keys will be Array[Int] (and later Array[Array[Item]]), which are not primitive and will not benefit from @specialized annotations? I'm also not clear on what is meant by 3; aren't list and array both eager (did you mean to use a Stream)? Which part of the code exactly are you referring to? Improve LocalPrefixSpan performance --- Key: SPARK-8997 URL: https://issues.apache.org/jira/browse/SPARK-8997 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Feynman Liang Original Estimate: 24h Remaining Estimate: 24h We can improve the performance by: 1. run should output Iterator instead of Array 2. Local count shouldn't use groupBy, which creates too many arrays. We can use PrimitiveKeyOpenHashMap 3. We can use list to avoid materialize frequent sequences -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
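For point 3 above, the difference between an eager and a lazy output can be shown with a small toy that is not LocalPrefixSpan code; {{expand}}, {{frequentEager}} and {{frequentLazy}} are made-up names standing in for "grow a prefix by one item" and for the two output strategies being discussed:
{code}
// Stand-in for extending a prefix by one candidate item.
def expand(prefix: List[Int]): Seq[List[Int]] =
  (1 to 3).map(item => item :: prefix)

// Eager: every candidate sequence at every depth is materialized up front.
def frequentEager(prefix: List[Int], depth: Int): List[List[Int]] =
  if (depth == 0) Nil
  else expand(prefix).toList.flatMap(p => p :: frequentEager(p, depth - 1))

// Lazy: an Iterator yields sequences on demand, so the caller can consume them
// (or write them out) without holding the whole result set in memory at once.
def frequentLazy(prefix: List[Int], depth: Int): Iterator[List[Int]] =
  if (depth == 0) Iterator.empty
  else expand(prefix).iterator.flatMap(p => Iterator(p) ++ frequentLazy(p, depth - 1))

// Only as many elements as requested are actually computed here.
frequentLazy(Nil, depth = 4).take(5).foreach(println)
{code}
A Stream would memoize what it produces while an Iterator does not, which is roughly the eager-versus-lazy trade-off the comment is asking about.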
[jira] [Assigned] (SPARK-8997) Improve LocalPrefixSpan performance
[ https://issues.apache.org/jira/browse/SPARK-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8997: --- Assignee: Apache Spark (was: Feynman Liang) Improve LocalPrefixSpan performance --- Key: SPARK-8997 URL: https://issues.apache.org/jira/browse/SPARK-8997 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Apache Spark Original Estimate: 24h Remaining Estimate: 24h We can improve the performance by: 1. run should output Iterator instead of Array 2. Local count shouldn't use groupBy, which creates too many arrays. We can use PrimitiveKeyOpenHashMap 3. We can use list to avoid materialize frequent sequences -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8997) Improve LocalPrefixSpan performance
[ https://issues.apache.org/jira/browse/SPARK-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624068#comment-14624068 ] Apache Spark commented on SPARK-8997: - User 'feynmanliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7360 Improve LocalPrefixSpan performance --- Key: SPARK-8997 URL: https://issues.apache.org/jira/browse/SPARK-8997 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Feynman Liang Original Estimate: 24h Remaining Estimate: 24h We can improve the performance by: 1. run should output Iterator instead of Array 2. Local count shouldn't use groupBy, which creates too many arrays. We can use PrimitiveKeyOpenHashMap 3. We can use list to avoid materialize frequent sequences -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8997) Improve LocalPrefixSpan performance
[ https://issues.apache.org/jira/browse/SPARK-8997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8997: --- Assignee: Feynman Liang (was: Apache Spark) Improve LocalPrefixSpan performance --- Key: SPARK-8997 URL: https://issues.apache.org/jira/browse/SPARK-8997 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.0 Reporter: Xiangrui Meng Assignee: Feynman Liang Original Estimate: 24h Remaining Estimate: 24h We can improve the performance by: 1. run should output Iterator instead of Array 2. Local count shouldn't use groupBy, which creates too many arrays. We can use PrimitiveKeyOpenHashMap 3. We can use list to avoid materialize frequent sequences -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
kumar ranganathan created SPARK-9009: Summary: SPARK Encryption FileNotFoundException for truststore Key: SPARK-9009 URL: https://issues.apache.org/jira/browse/SPARK-9009 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.4.0 Reporter: kumar ranganathan I got FileNotFoundException in the application master when running the SparkPi example in windows machine. The problem is that the truststore file found in C:\Spark\conf\spark.truststore location but getting below exception as {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.init(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} This exception throws from SecurityManager.scala at the line of openstream() shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore - fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when removed truststore property in spark-defaults.conf. When disabled the encryption property to spark.ssl.enabled as false then the job completed successfully. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9009) SPARK Encryption FileNotFoundException for truststore
[ https://issues.apache.org/jira/browse/SPARK-9009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kumar ranganathan updated SPARK-9009: - Description: I got FileNotFoundException in the application master when running the SparkPi example in windows machine. The problem is that the truststore file found in C:\Spark\conf\spark.truststore location but getting below exception as {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.init(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:569) at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala) 15/07/13 09:38:50 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified)) 15/07/13 09:38:50 INFO util.Utils: Shutdown hook called {code} If i change the truststore file location to different drive (d:\spark_conf\spark.truststore) then getting exception as {code} java.io.FileNotFoundException: D:\Spark_conf\spark.truststore (The device is not ready) {code} This exception throws from SecurityManager.scala at the line of openstream() shown below {code:title=SecurityManager.scala|borderStyle=solid} val trustStoreManagers = for (trustStore - fileServerSSLOptions.trustStore) yield { val input = Files.asByteSource(fileServerSSLOptions.trustStore.get).openStream() try { {code} The same problem occurs for the keystore file when removed truststore property in spark-defaults.conf. When disabled the encryption property to set spark.ssl.enabled as false then the job completed successfully. was: I got FileNotFoundException in the application master when running the SparkPi example in windows machine. 
The problem is that the truststore file found in C:\Spark\conf\spark.truststore location but getting below exception as {code} 15/07/13 09:38:50 ERROR yarn.ApplicationMaster: Uncaught exception: java.io.FileNotFoundException: C:\Spark\conf\spark.truststore (The system cannot find the path specified) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124) at org.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:261) at org.apache.spark.SecurityManager$$anonfun$4.apply(SecurityManager.scala:254) at scala.Option.map(Option.scala:145) at org.apache.spark.SecurityManager.init(SecurityManager.scala:254) at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:132) at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:571) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65) at
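Since the stack trace shows the failure coming straight from opening the configured file, one low-effort check (a hypothetical diagnostic, not part of Spark) is to confirm, on the host where the ApplicationMaster container actually runs, whether that JVM can see the path configured for spark.ssl.trustStore at all:
{code}
import java.io.File

// Value taken from spark-defaults.conf; adjust to the path actually configured.
val trustStorePath = "C:\\Spark\\conf\\spark.truststore"
val f = new File(trustStorePath)
println(s"exists=${f.exists}, isFile=${f.isFile}, canRead=${f.canRead}")
{code}
If the file only exists on the submitting Windows machine while the ApplicationMaster runs on a different host, the path would not resolve there; that is an assumption about the environment rather than a confirmed diagnosis of this report.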
[jira] [Assigned] (SPARK-8761) Master.removeApplication is not thread-safe but is called from multiple threads
[ https://issues.apache.org/jira/browse/SPARK-8761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8761: --- Assignee: (was: Apache Spark) Master.removeApplication is not thread-safe but is called from multiple threads --- Key: SPARK-8761 URL: https://issues.apache.org/jira/browse/SPARK-8761 Project: Spark Issue Type: Bug Components: Deploy Reporter: Shixiong Zhu Master.removeApplication is not thread-safe. But it's called both in the message loop of Master and MasterPage.handleAppKillRequest which runs in threads of the Web server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8761) Master.removeApplication is not thread-safe but is called from multiple threads
[ https://issues.apache.org/jira/browse/SPARK-8761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624236#comment-14624236 ] Apache Spark commented on SPARK-8761: - User 'vinodkc' has created a pull request for this issue: https://github.com/apache/spark/pull/7364 Master.removeApplication is not thread-safe but is called from multiple threads --- Key: SPARK-8761 URL: https://issues.apache.org/jira/browse/SPARK-8761 Project: Spark Issue Type: Bug Components: Deploy Reporter: Shixiong Zhu Master.removeApplication is not thread-safe. But it's called both in the message loop of Master and MasterPage.handleAppKillRequest which runs in threads of the Web server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8761) Master.removeApplication is not thread-safe but is called from multiple threads
[ https://issues.apache.org/jira/browse/SPARK-8761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8761: --- Assignee: Apache Spark Master.removeApplication is not thread-safe but is called from multiple threads --- Key: SPARK-8761 URL: https://issues.apache.org/jira/browse/SPARK-8761 Project: Spark Issue Type: Bug Components: Deploy Reporter: Shixiong Zhu Assignee: Apache Spark Master.removeApplication is not thread-safe. But it's called both in the message loop of Master and MasterPage.handleAppKillRequest which runs in threads of the Web server. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
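The hazard and one common fix can be sketched generically; {{ToyMaster}} and {{RemoveApplication}} below are made-up names, not the Spark Master code. The idea is to funnel all state mutation through the single message-loop thread, so a web handler only ever enqueues a request instead of touching shared state directly:
{code}
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable

sealed trait MasterMessage
case class RemoveApplication(appId: String) extends MasterMessage

object ToyMaster {
  // Mutable state owned exclusively by the message-loop thread.
  private val apps = mutable.HashSet("app-1", "app-2")
  private val mailbox = new LinkedBlockingQueue[MasterMessage]()

  // Safe to call from any thread (e.g. a web handler): it only enqueues.
  def requestRemoval(appId: String): Unit = mailbox.put(RemoveApplication(appId))

  // Runs only on the message-loop thread, so `apps` is never touched concurrently.
  def runLoopOnce(): Unit = mailbox.take() match {
    case RemoveApplication(appId) => apps -= appId
  }
}
{code}
Whether the actual fix routes the kill request through the Master's existing message loop or simply synchronizes removeApplication is a design choice for the linked pull request; the sketch only illustrates the first option.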
[jira] [Updated] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9003: --- Description: MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. was: MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623704#comment-14623704 ] Yanbo Liang edited comment on SPARK-9003 at 7/12/15 8:38 AM: - Yes, I agree that this is not supposed to become yet another vector/matrix libaray. But I think map/update function is important enough to become the interface of vector just like the foreachActive which is supported at present. I can also provide an example which may be benefit of these function. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vector/Matrix which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only code as following: val prediction = Vectors.dense(originalPrediction.toArray.map(x = x*10)) assert(prediction ~== expected absTol 0.01, prediction error) If we support map/update for Vector, we can code as: assert(originalPrediction.map(x = x*10) ~== expected absTol 0.01, prediction error) However, MLlib/Matrix has already supported map/update/foreachActive function, and we can compare two Matrices use ~== effortless. was (Author: yanboliang): Yes, I can provide an example which may be benefit of these function. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vector/Matrix which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only code as following: val prediction = Vectors.dense(originalPrediction.toArray.map(x = x*10)) assert(prediction ~== expected absTol 0.01, prediction error) If we support map/update for Vector, we can code as: assert(originalPrediction.map(x = x*10) ~== expected absTol 0.01, prediction error) Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623703#comment-14623703 ] Sean Owen commented on SPARK-8981: -- I think it's OK if you can do this via the slf4j API and it doesn't add overhead. I am not sure Logging is actually going to be removed; it's not to be used by apps though. Logging can't use a SparkContext; it isn't always used in places where a SparkContext is available. I don't think that's important. MDC has static methods. Are you proposing to change the default log message or just make these values available? It might be less intrusive not to change the log output. Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623704#comment-14623704 ] Yanbo Liang edited comment on SPARK-9003 at 7/12/15 9:36 AM: - Yes, I agree that this is not supposed to become yet another vector/matrix libaray. But I think map/update function is important enough to become the interface of vector just like foreachActive which is supported at present. I can also provide an example which may be benefit of these function. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vector/Matrix which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only code as following: val prediction = Vectors.dense(originalPrediction.toArray.map(x = x*10)) assert(prediction ~== expected absTol 0.01, prediction error) If we support map/update for Vector, we can code as: assert(originalPrediction.map(x = x*10) ~== expected absTol 0.01, prediction error) However, MLlib/Matrix has already supported map/update/foreachActive function, and we can compare two Matrices use ~== effortless. was (Author: yanboliang): Yes, I agree that this is not supposed to become yet another vector/matrix libaray. But I think map/update function is important enough to become the interface of vector just like the foreachActive which is supported at present. I can also provide an example which may be benefit of these function. For example: val originalPrediction = Vectors.dense(Array(1, 2, 3)) val expected = Vectors.dense(Array(10, 20, 30)) In some cases, we can use ~== to compare two Vector/Matrix which is defined in org.apache.spark.mllib.util.TestingUtils. So currently we can only code as following: val prediction = Vectors.dense(originalPrediction.toArray.map(x = x*10)) assert(prediction ~== expected absTol 0.01, prediction error) If we support map/update for Vector, we can code as: assert(originalPrediction.map(x = x*10) ~== expected absTol 0.01, prediction error) However, MLlib/Matrix has already supported map/update/foreachActive function, and we can compare two Matrices use ~== effortless. Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and fromBreeze make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
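For readers following along, one possible shape of the proposed operation is an element-wise transform over the vector's values, shown here as a throwaway enrichment rather than a change to the {{Vector}} trait ({{VectorMapOps}} and {{mapValues}} are illustrative names; the actual PR may differ, and a real implementation would presumably handle sparse vectors more carefully, since f(0.0) need not be 0.0):
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Illustrative wrapper only: essentially the toArray workaround behind a method.
implicit class VectorMapOps(v: Vector) {
  def mapValues(f: Double => Double): Vector =
    Vectors.dense(v.toArray.map(f))
}

val a = Vectors.dense(1.0, 2.0, 3.0)
val c = a.mapValues(math.log)  // the proposal would let this read as a.map(math.log)
{code}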
[jira] [Assigned] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9003: --- Assignee: Apache Spark Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Assignee: Apache Spark Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and fromBreeze make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9004) Add s3 bytes read/written metrics
Abhishek Modi created SPARK-9004: Summary: Add s3 bytes read/written metrics Key: SPARK-9004 URL: https://issues.apache.org/jira/browse/SPARK-9004 Project: Spark Issue Type: Improvement Affects Versions: 1.4.0 Reporter: Abhishek Modi Priority: Minor s3 read/write metrics can be pretty useful in finding the total aggregate data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abhishek Modi updated SPARK-9004: - Affects Version/s: (was: 1.4.0) Add s3 bytes read/written metrics - Key: SPARK-9004 URL: https://issues.apache.org/jira/browse/SPARK-9004 Project: Spark Issue Type: Improvement Reporter: Abhishek Modi Priority: Minor s3 read/write metrics can be pretty useful in finding the total aggregate data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9003: --- Description: MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. was: MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get a Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9003) Add map/update function to MLlib/Vector
Yanbo Liang created SPARK-9003: -- Summary: Add map/update function to MLlib/Vector Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get a Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9003: --- Description: MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. was: MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623719#comment-14623719 ] Walter Petersen commented on SPARK-3155: Ok, fine. Thanks a lot. Support DecisionTree pruning Key: SPARK-3155 URL: https://issues.apache.org/jira/browse/SPARK-3155 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Improvement: accuracy, computation Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyways. DecisionTree does not currently support pruning. Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed. A naive implementation prunes as follows: (1) Train a depth K tree using a training set. (2) Compute the optimal prediction at each node (including internal nodes) based on the training set. (3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2).) (4) For each pair of leafs with the same parent, compare the total error on the validation set made by the leafs’ predictions with the error made by the parent’s predictions. Remove the leafs if the parent has lower error. A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained. Whenever two children increase the validation error, they are pruned, and no more training is required on that branch. It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
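Step (4) of the naive procedure described above can be illustrated with a toy tree; {{ToyNode}} is a made-up structure carrying each node's total validation-set error, not MLlib's tree Node class:
{code}
case class ToyNode(
    validationError: Double,
    left: Option[ToyNode] = None,
    right: Option[ToyNode] = None) {
  def isLeaf: Boolean = left.isEmpty && right.isEmpty
}

// Bottom-up pass: if both children are leaves and the parent's own prediction does
// no worse on the validation set, collapse the pair into the parent.
def prune(node: ToyNode): ToyNode = (node.left, node.right) match {
  case (Some(l), Some(r)) =>
    val prunedL = prune(l)
    val prunedR = prune(r)
    if (prunedL.isLeaf && prunedR.isLeaf &&
        node.validationError <= prunedL.validationError + prunedR.validationError) {
      node.copy(left = None, right = None)
    } else {
      node.copy(left = Some(prunedL), right = Some(prunedR))
    }
  case _ => node
}
{code}
The smarter in-training variant mentioned in the description makes the same comparison as nodes are grown, so branches that would be pruned are never trained further.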
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623726#comment-14623726 ] Sean Owen commented on SPARK-8981: -- The constructors? Have a look through org.apache.spark.executor. The app ID should be in env.conf Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8671) Add isotonic regression to the pipeline API
[ https://issues.apache.org/jira/browse/SPARK-8671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623786#comment-14623786 ] Martin Zapletal commented on SPARK-8671: I am on it. Add isotonic regression to the pipeline API --- Key: SPARK-8671 URL: https://issues.apache.org/jira/browse/SPARK-8671 Project: Spark Issue Type: New Feature Components: ML Reporter: Xiangrui Meng Original Estimate: 48h Remaining Estimate: 48h It is useful to have IsotonicRegression under the pipeline API for score calibration. The parameters should be the same as the implementation in spark.mllib package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
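For context, the description says the pipeline version should keep the same parameters as the existing spark.mllib implementation; that existing API looks like the following (shown for reference, assuming a SparkContext {{sc}} is in scope; the ml.Pipeline wrapper itself is what this ticket would add):
{code}
import org.apache.spark.mllib.regression.IsotonicRegression

// (label, feature, weight) triples; values are made up.
val data = sc.parallelize(Seq((1.0, 1.0, 1.0), (2.0, 2.0, 1.0), (1.5, 3.0, 1.0)))

val model = new IsotonicRegression()
  .setIsotonic(true)   // the main parameter a score-calibration use case would set
  .run(data)

val calibrated = model.predict(2.5)
{code}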
[jira] [Updated] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9004: - Target Version/s: (was: 1.4.0) Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Don't set Target version. This sounds specific to S3 though. Where are you proposing to change this? Add s3 bytes read/written metrics - Key: SPARK-9004 URL: https://issues.apache.org/jira/browse/SPARK-9004 Project: Spark Issue Type: Improvement Reporter: Abhishek Modi Priority: Minor s3 read/write metrics can be pretty useful in finding the total aggregate data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623712#comment-14623712 ] Sean Owen commented on SPARK-8981: -- Can MDC methods be invoked during executor initialization? where the app name is available? Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623721#comment-14623721 ] Walter Petersen commented on SPARK-3155: Ok, fine. Thanks a lot [~josephkb]. Support DecisionTree pruning Key: SPARK-3155 URL: https://issues.apache.org/jira/browse/SPARK-3155 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Improvement: accuracy, computation Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyways. DecisionTree does not currently support pruning. Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed. A naive implementation prunes as follows: (1) Train a depth K tree using a training set. (2) Compute the optimal prediction at each node (including internal nodes) based on the training set. (3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2).) (4) For each pair of leafs with the same parent, compare the total error on the validation set made by the leafs’ predictions with the error made by the parent’s predictions. Remove the leafs if the parent has lower error. A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained. Whenever two children increase the validation error, they are pruned, and no more training is required on that branch. It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walter Petersen updated SPARK-3155: --- Comment: was deleted (was: Ok, fine. Thanks a lot.) Support DecisionTree pruning Key: SPARK-3155 URL: https://issues.apache.org/jira/browse/SPARK-3155 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Improvement: accuracy, computation Summary: Pruning is a common method for preventing overfitting with decision trees. A smart implementation can prune the tree during training in order to avoid training parts of the tree which would be pruned eventually anyways. DecisionTree does not currently support pruning. Pruning: A “pruning” of a tree is a subtree with the same root node, but with zero or more branches removed. A naive implementation prunes as follows: (1) Train a depth K tree using a training set. (2) Compute the optimal prediction at each node (including internal nodes) based on the training set. (3) Take a held-out validation set, and use the tree to make predictions for each validation example. This allows one to compute the validation error made at each node in the tree (based on the predictions computed in step (2).) (4) For each pair of leafs with the same parent, compare the total error on the validation set made by the leafs’ predictions with the error made by the parent’s predictions. Remove the leafs if the parent has lower error. A smarter implementation prunes during training, computing the error on the validation set made by each node as it is trained. Whenever two children increase the validation error, they are pruned, and no more training is required on that branch. It is common to use about 1/3 of the data for pruning. Note that pruning is important when using a tree directly for prediction. It is less important when combining trees via ensemble methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623720#comment-14623720 ] Paweł Kopiczko commented on SPARK-8981: --- I think so. Would you mind pointing me to executor initialization code? Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623700#comment-14623700 ] Paweł Kopiczko commented on SPARK-8981: --- slf4j supports MDC as well: http://www.slf4j.org/api/org/slf4j/MDC.html I've analysed how the {{Logging}} trait is implemented. If I'm correct, every executor process calls the {{initializeLogging}} method because of the transient {{log_}} field. It looks to me that right now it's impossible to pass a {{SparkContext}} instance (or any other value) in there without breaking the API. Do you agree? Do you have any idea how to bypass that? In terms of this comment: ??This will likely be changed or removed in future releases.??, are you considering any change right now? Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623702#comment-14623702 ] Sean Owen commented on SPARK-9003: -- I think the idea was that this is not supposed to become yet another vector/matrix library, and that you can manipulate the underlying breeze vector if needed. I don't know how strong that convention is. The use case you show doesn't really benefit except for maybe saving a method call; is there a case where this would be a bigger win? Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector is short of map/update function which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623707#comment-14623707 ] Paweł Kopiczko commented on SPARK-8981: --- ??MDC has static methods?? Yes, but I'm not sure how to invoke these in executor thread. Any idea? ??Are you proposing to change the default log message or just make these values available??? Available only. I think it may be needed especially by standalone mode users. YARN users don't need that functionality, because CM stores logs in HDFS by applicationId. I'm not familiar with Mesos, but probably it has ability to store separated logs for each container. I believe the overhead is minimal since it's only two String values in a static map. Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623707#comment-14623707 ] Paweł Kopiczko edited comment on SPARK-8981 at 7/12/15 8:26 AM: ??MDC has static methods?? Yes, but I'm not sure how to invoke these in executor thread. Any idea? ??Are you proposing to change the default log message or just make these values available??? Available only. I think they may be needed especially by standalone mode users. YARN users don't need that functionality, because CM stores logs in HDFS by applicationId. I'm not familiar with Mesos, but probably it has ability to store separated logs for each container. I believe the overhead is minimal since it's only two String values in a static map. was (Author: kopiczko): ??MDC has static methods?? Yes, but I'm not sure how to invoke these in executor thread. Any idea? ??Are you proposing to change the default log message or just make these values available??? Available only. I think it may be needed especially by standalone mode users. YARN users don't need that functionality, because CM stores logs in HDFS by applicationId. I'm not familiar with Mesos, but probably it has ability to store separated logs for each container. I believe the overhead is minimal since it's only two String values in a static map. Set applicationId and appName in log4j MDC -- Key: SPARK-8981 URL: https://issues.apache.org/jira/browse/SPARK-8981 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Paweł Kopiczko Priority: Minor It would be nice to have, because it's good to have logs in one file when using log agents (like logentires) in standalone mode. Also allows configuring rolling file appender without a mess when multiple applications are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
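The MDC mechanism itself is small; the open question in this thread is only where to call it during executor startup. A sketch of the mechanism (illustrative values, with the call site deliberately left unspecified):
{code}
import org.slf4j.MDC

// Any thread that sets these keys before logging makes them visible to the appender.
MDC.put("applicationId", "app-20150712080000-0001") // illustrative value
MDC.put("appName", "MyJob")                          // illustrative value

// A log4j PatternLayout can then surface them without changing the default message,
// e.g. ConversionPattern=%d %p %c: [%X{applicationId}] [%X{appName}] %m%n
{code}
This matches the "make the values available only" option discussed above, since users who do not reference %X{...} in their pattern see no change to the log output.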
[jira] [Updated] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-9003: --- Description: MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and fromBreeze make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. was: MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and fromBreeze make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623724#comment-14623724 ] Apache Spark commented on SPARK-9003: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/7357 Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only support foreachActive function and is short of map/update which is inconvenience for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each elements of a and get Vector as return value, we can only code as: val b = Vectors.dense(a.toArray.map(math.log)) or we can use toBreeze and fromBreeze make transformation with breeze API. The code snippet is not elegant, we want it can implement: val c = a.map(math.log) Also currently MLlib/Matrix has implemented map/update/foreachActive function. I think Vector should also has map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-9003) Add map/update function to MLlib/Vector
[ https://issues.apache.org/jira/browse/SPARK-9003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-9003: --- Assignee: (was: Apache Spark) Add map/update function to MLlib/Vector --- Key: SPARK-9003 URL: https://issues.apache.org/jira/browse/SPARK-9003 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Yanbo Liang Priority: Minor MLlib/Vector only supports the foreachActive function and lacks map/update, which is inconvenient for some Vector operations. For example: val a = Vectors.dense(...) If we want to compute math.log for each element of a and get a Vector as the return value, we can only write: val b = Vectors.dense(a.toArray.map(math.log)) or use toBreeze and fromBreeze to do the transformation with the Breeze API. The code snippet is not elegant; we would like to be able to write: val c = a.map(math.log) Also, MLlib/Matrix already implements map/update/foreachActive. I think Vector should also have map/update. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623834#comment-14623834 ] Abhishek Modi commented on SPARK-9004: -- Hadoop separates HDFS bytes, local filesystem bytes and S3 bytes in its counters. Spark combines all of them in its metrics. Separating them could give a better idea of the IO distribution. Here's how it works in MR: 1. The client creates a Job object (org.apache.hadoop.mapreduce.Job) and submits it to the RM, which then launches the AM etc. 2. After job submission, the client continuously monitors the job to see if it has finished. 3. Once the job is finished, the client gets the counters of the job via the getCounters() function. 4. It logs them on the client using the Counters= format. I don't really know how to implement it. Can it be done by modifying NewHadoopRDD, because I guess that's where the Job object is being used? Add s3 bytes read/written metrics - Key: SPARK-9004 URL: https://issues.apache.org/jira/browse/SPARK-9004 Project: Spark Issue Type: Improvement Reporter: Abhishek Modi Priority: Minor S3 read/write metrics can be pretty useful for finding the total aggregate data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
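For reference, a minimal Scala sketch of the MR client-side flow described above, i.e. reading counters from a completed org.apache.hadoop.mapreduce.Job. The counter group and counter names used here ("FileSystemCounters", "S3_BYTES_READ") vary across Hadoop versions and filesystem schemes, so treat them as illustrative assumptions rather than stable identifiers.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.mapreduce.Job

  val job = Job.getInstance(new Configuration(), "example-job")
  // ... configure input/output formats and paths here ...
  if (job.waitForCompletion(true)) {
    val counters = job.getCounters
    // Per-filesystem byte counters; group and counter names are version-dependent.
    val s3BytesRead = counters.findCounter("FileSystemCounters", "S3_BYTES_READ").getValue
    println(s"Counters= S3_BYTES_READ: $s3BytesRead")
  }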
[jira] [Updated] (SPARK-9004) Add s3 bytes read/written metrics
[ https://issues.apache.org/jira/browse/SPARK-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9004: - Component/s: Input/Output The metrics are tracked by the InputFormat / OutputFormat, right? That might already be available then, since Spark uses the same classes. I think you'd have to investigate and propose a PR if you want this done. NewHadoopRDD is not specific to S3, no. Add s3 bytes read/written metrics - Key: SPARK-9004 URL: https://issues.apache.org/jira/browse/SPARK-9004 Project: Spark Issue Type: Improvement Components: Input/Output Reporter: Abhishek Modi Priority: Minor S3 read/write metrics can be pretty useful for finding the total aggregate data processed -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
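One possible starting point for that investigation, sketched below: the Hadoop client already keeps per-scheme FileSystem statistics in each JVM, which could in principle be surfaced as separate S3/HDFS/local byte metrics. This is only an illustration of where such numbers live in the Hadoop libraries, not a description of how Spark's task metrics are actually wired.

  import scala.collection.JavaConverters._
  import org.apache.hadoop.fs.FileSystem

  // Per-scheme read/write byte counts maintained by the Hadoop FileSystem layer
  // in the current JVM (e.g. "hdfs", "file", "s3n").
  FileSystem.getAllStatistics.asScala.foreach { stats =>
    println(s"${stats.getScheme}: bytesRead=${stats.getBytesRead} bytesWritten=${stats.getBytesWritten}")
  }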
[jira] [Updated] (SPARK-8982) Worker hostnames not showing in Master web ui when launched with start-slaves.sh
[ https://issues.apache.org/jira/browse/SPARK-8982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8982: - Target Version/s: (was: 1.4.0) Worker hostnames not showing in Master web ui when launched with start-slaves.sh Key: SPARK-8982 URL: https://issues.apache.org/jira/browse/SPARK-8982 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Ben Zimmer Priority: Minor If a --host argument is not provided to Worker, WorkerArguments uses Utils.localHostName to find the host name. SPARK-6440 changed the functionality of Utils.localHostName to retrieve the local IP address instead of host name. Since start-slave.sh does not provide the --host argument, clusters started with start-slaves.sh now show IP addresses instead of hostnames in the Master web UI. This is inconvenient when starting and debugging small clusters. A simple fix would be to find the local machine's hostname in start-slave.sh and pass it as the --host argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
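For context, a tiny Scala sketch of the distinction at play here, using plain java.net resolution rather than Spark's actual Utils code: after SPARK-6440 the Worker effectively registers with the address-style value instead of the hostname-style one, which is what now shows up in the Master web UI.

  import java.net.InetAddress

  val local = InetAddress.getLocalHost
  println(local.getHostAddress) // IP address, e.g. 192.168.1.10 - what the UI shows now
  println(local.getHostName)    // hostname, e.g. worker-01 - what it used to show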
[jira] [Assigned] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5159: --- Assignee: (was: Apache Spark) Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
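For background on the requested behaviour, here is a minimal sketch of the Hadoop proxy-user mechanism that HiveServer2's hive.server2.enable.doAs=true builds on. This illustrates the general technique only, not the Spark Thrift server's actual code, and it assumes the service principal is configured as a proxy user (hadoop.proxyuser.* settings in core-site.xml); "alice" is a placeholder for the submitting user.

  import java.security.PrivilegedExceptionAction
  import org.apache.hadoop.security.UserGroupInformation

  // The service's own (e.g. Kerberos-authenticated) identity.
  val serviceUgi = UserGroupInformation.getLoginUser
  // Impersonate the submitting user on top of the service identity.
  val clientUgi = UserGroupInformation.createProxyUser("alice", serviceUgi)

  clientUgi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = {
      // HDFS / metastore access performed here runs as "alice",
      // so her HDFS permissions are enforced instead of the hive user's.
    }
  })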
[jira] [Assigned] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5159: --- Assignee: Apache Spark Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray Assignee: Apache Spark I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623898#comment-14623898 ] Apache Spark commented on SPARK-5159: - User 'ilovesoup' has created a pull request for this issue: https://github.com/apache/spark/pull/7358 Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623909#comment-14623909 ] Ma Xiaoyu commented on SPARK-5159: -- The above is my first PR to Spark. I'm new to Spark and Scala, so please advise. Thrift server does not respect hive.server2.enable.doAs=true Key: SPARK-5159 URL: https://issues.apache.org/jira/browse/SPARK-5159 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andrew Ray I'm currently testing the spark sql thrift server on a kerberos secured cluster in YARN mode. Currently any user can access any table regardless of HDFS permissions as all data is read as the hive user. In HiveServer2 the property hive.server2.enable.doAs=true causes all access to be done as the submitting user. We should do the same. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org